Hello,
I want to convert an AnnData object to a TileDB Experiment directly on an S3 bucket. This works nicely for smaller datasets (<10 GB). However, for larger ones (21 GB in this case), RAM usage becomes the bottleneck. I use tiledbsoma.io.from_anndata and tiledbsoma.io.from_h5ad, both of which expand to 250 GB in memory. My hope was that with tiledbsoma.io.from_anndata, I could read the AnnData object in backed='r' mode so as not to load the entire object into memory, but this also does not work (i.e., significant expansion still takes place).
What I am trying to do is essentially identical to this tutorial: TileDB
When I run that, the 7 GB file is also inflated to more than 30 GB. Is there a way to reduce RAM usage? Can you help me with this?
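For reference, this is roughly what I am running (the local path and bucket URI below are placeholders):

import anndata as ad
import tiledbsoma.io

# Open the h5ad in backed mode, hoping X stays on disk
adata = ad.read_h5ad("data/dataset.h5ad", backed="r")

# Ingest directly into an Experiment on S3
tiledbsoma.io.from_anndata(
    experiment_uri="s3://my-bucket/experiments/dataset",
    anndata=adata,
    measurement_name="RNA",
)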
Thank you in advance!
Benoit
Hi Benoit,
Thank you for the question!
TileDB is optimized to query data on filesystem-agnostic backends such as S3.
Starting from either an AnnData object or an h5ad file, you'd transform it to a tiledbsoma.Experiment as your first step. The result of that transformation would then physically sit in your S3 backend. This is not the step where we are trying to optimize memory usage; here we are primarily focused on ingesting the data into the TileDB format. Following this transformation, users gain the performance and efficiency they expect from TileDB when accessing that data.
To then query that data, you would use tiledbsoma.Experiment.open to push the query down to the on-disk object, rather than first having to load the entire dataset into memory.
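As a minimal sketch of that pattern (the URI, measurement name, and value filter are placeholders; adjust them to your data):

import tiledbsoma

# Open the Experiment lazily; nothing is read into memory yet
with tiledbsoma.Experiment.open("s3://my-bucket/experiments/dataset") as exp:
    # Push a filter down to the on-disk obs array
    query = exp.axis_query(
        measurement_name="RNA",
        obs_query=tiledbsoma.AxisQuery(value_filter="cell_type == 'B cell'"),
    )
    # Only the selected slice is materialized, here as an AnnData
    adata_slice = query.to_anndata(X_name="data")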
We've recently launched TileDB Academy, where we have several tutorials for tiledbsoma, such as performing multi-slot queries on an Experiment.
Thanks,
Spencer
Dear Spencer,
Thanks for your quick reply!
Ah I see, okay. But then there must be a way to ingest one observation at a time into an existing TileDB experiment on S3? That should not use too much RAM?
Thank you!
Benoit
Once you have an existing tiledbsoma.Experiment on S3, you can append data, such as adding new obs data. Check out this tutorial on how to do so: TileDB
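In rough outline, the append flow looks like this (the URI, file name, and join-column names are placeholders; adjust them to your data):

import anndata as ad
import tiledbsoma.io

EXP_URI = "s3://my-bucket/experiments/dataset"  # existing Experiment
new_adata = ad.read_h5ad("new_batch.h5ad")

# Register the incoming AnnData against the existing Experiment
rd = tiledbsoma.io.register_anndatas(
    experiment_uri=EXP_URI,
    adatas=[new_adata],
    measurement_name="RNA",
    obs_field_name="obs_id",
    var_field_name="var_id",
)

# Append the registered data
tiledbsoma.io.from_anndata(
    experiment_uri=EXP_URI,
    anndata=new_adata,
    measurement_name="RNA",
    registration_mapping=rd,
)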
Spencer
Dear Spencer,
Thank you, this helps a lot! When I try it this way (first registering the AnnData object and then uploading the differences), I get an error:
ValueError: internal coding error: id_column_name unspecified
I deleted obsm and varm, and in both cases, the error is the same.
Thank you,
Benoit
Hi @bputzeys, can you share a repro here so I can take a look?
Was this specifically during tiledbsoma.io.from_anndata? Did you specify values for obs_field_name and var_field_name in tiledbsoma.io.register_anndatas?
Thanks,
Spencer
Hi Spencer,
I'll try to recreate it and will let you know how you can reproduce it too.
Yes, I get the error after registration, during the tiledbsoma.io.from_anndata step. Yes, I specified obs_field_name="obs_id" and var_field_name="var_id" in tiledbsoma.io.register_anndatas, just as in the tutorial.
Ah, I think it was because I had .uns values. I don't think I need them, so I can just do it without.
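For reference, all I changed was dropping them before registering (assuming adata is the AnnData object being ingested):

# Drop unstructured metadata I don't need in the Experiment
adata.uns = {}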
Thank you.
Hello Spencer,
I was able to reproduce my original issue: GitHub - bputzeys/tiledb-issue
If you could tell me that you can reproduce it that would be very much appreciated!
Thank you.
Dear @spencerseale,
Did you have time to take a look and maybe recreate the issue?
Thank you!
Benoit
Hi Benoit,
We expect these processes to take advantage of most of the available RAM. The transformation from an AnnData holding many different in-memory objects to a tiledbsoma.Experiment representing many different arrays on disk is inherently heavy. When we open X and transform that data into an array, that is where you'll see memory usage spike while it is written to its compressed TileDB format. Our most compute-intensive operations typically involve ingest. Looking at your code, I cannot say why your memory utilization keeps rising without doing a detailed analysis of your setup.
The query of this data is where TileDB brings the memory benefits, via out-of-core query operations. We push the query down onto the object on disk, and only when explicitly requested do we materialize the result as an in-memory object.
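As a rough sketch of that out-of-core pattern (the URI and the downstream process function are placeholders):

import tiledbsoma

with tiledbsoma.Experiment.open("s3://my-bucket/experiments/dataset") as exp:
    X = exp.ms["RNA"].X["data"]
    # Stream X as a sequence of Arrow tables instead of
    # materializing the full matrix in memory
    for batch in X.read().tables():
        process(batch)  # placeholder for your downstream step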
On our multi-omics platform, TileDB, we have compute infrastructure that many of our customers use to parallelize the transformation across custom-resourced instances for these ingestions. This means you could ingest many h5ad files into separate tiledbsoma.Experiments, or into a single one, in the time it takes your current iterative approach to handle one. If you have a commercial use case involving TileDB-SOMA, I recommend checking out the platform, as I suspect what you're encountering has been resolved there.
Spencer
Hi Spencer,
let's say that I have a big AnnData (1 TB) composed of 100 AnnDatas (each about 10 GB). I can open it with anndata.read_h5ad in backed mode. I want to convert this into a TileDB Experiment. Can you point me to a notebook that helps me do this? I want to know if I can just open the big AnnData in backed mode, call tiledbsoma.io.from_anndata(), and just wait. I have 500 GB of RAM, so loading the X matrix into memory makes my machine crash.
Best !
Hi Mariano,
Thanks for the question!
You can certainly give that a shot, although it won’t be the fastest method!
TileDB arrays have multi-reader/writer support, so ideally you'd chunk the original AnnData into smaller partitions and ingest each partition into a tiledbsoma.Experiment in parallel.
If, say, you had 100 AnnData objects, each of 10 GB, you could use any one of those single AnnDatas (assuming the schema is the same across all of them, as it should be if they exist as one larger version) to first initialize a tiledbsoma.Experiment using tiledbsoma.io.from_anndata with ingest_mode="schema_only".
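As a minimal sketch of that first step (the file name and URI are placeholders):

import anndata as ad
import tiledbsoma.io

# Use any one of the chunks just to create the Experiment's schema;
# no bulk data is written with ingest_mode="schema_only"
template = ad.read_h5ad("chunks/part_000.h5ad")
tiledbsoma.io.from_anndata(
    experiment_uri="s3://my-bucket/experiments/big-dataset",
    anndata=template,
    measurement_name="RNA",
    ingest_mode="schema_only",
)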
Once you have your empty tiledbsoma.Experiment with the appropriate schema, you can append all of the data (including the AnnData you used to initialize the experiment's schema) into the destination tiledbsoma.Experiment. This tutorial in TileDB Academy demonstrates that process. Just know that the registration mapping needs to cover all of the new input data. You can then pass that registration mapping to individual processes running in parallel to ingest each AnnData.
Given the size of the original AnnData, it may be beneficial to split it into separate h5ad files representing those smaller chunks for a more performant ingestion. Each ingestor can then append its chunk into your destination tiledbsoma.Experiment using tiledbsoma.io.from_h5ad, in the same way you would with tiledbsoma.io.from_anndata.
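Putting those pieces together, a rough sketch of the register-then-append step might look like this (the URI, file names, and join-column names are placeholders, and the process-pool parallelism is illustrative; how you distribute the registration mapping to your workers depends on your setup):

from concurrent.futures import ProcessPoolExecutor
import tiledbsoma.io

EXP_URI = "s3://my-bucket/experiments/big-dataset"         # placeholder
H5ADS = [f"chunks/part_{i:03d}.h5ad" for i in range(100)]  # placeholder paths

def ingest(path, registration_mapping):
    # Append one chunk into the shared Experiment
    tiledbsoma.io.from_h5ad(
        experiment_uri=EXP_URI,
        input_path=path,
        measurement_name="RNA",
        registration_mapping=registration_mapping,
    )

if __name__ == "__main__":
    # One registration mapping covering every input file
    rd = tiledbsoma.io.register_h5ads(
        experiment_uri=EXP_URI,
        h5ad_file_names=H5ADS,
        measurement_name="RNA",
        obs_field_name="obs_id",
        var_field_name="var_id",
    )
    # Each worker appends its own chunk; TileDB supports concurrent writers
    with ProcessPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(ingest, path, rd) for path in H5ADS]
        for f in futures:
            f.result()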
Dealing with large, in-memory objects is a common problem with single-cell data. TileDB-SOMA solves that problem, and you'll then be able to query out smaller data chunks for iteration/analysis.
If you want to talk in detail about this and share your use case, I'm always available at spencer@tiledb.com, or you can respond here as well!