Conversion of large AnnData object to Experiment takes a lot of RAM

Hello,

I want to convert an AnnData object to a TileDB Experiment directly on an S3 bucket. This works nicely for smaller datasets (<10 GB). However, for larger ones (21 GB in this case), RAM usage becomes the bottleneck. I use tiledbsoma.io.from_anndata and tiledbsoma.io.from_h5ad, both of which expand to 250 GB in memory. My hope was that with tiledbsoma.io.from_anndata, I could read the AnnData object in backed='r' mode so as not to load the entire object into memory, but this does not work either (i.e. significant memory expansion still takes place).

What I'm trying to do is essentially identical to this tutorial: TileDB
When I run that, the 7 GB file is also inflated to more than 30 GB in memory. Is there a way to reduce RAM usage? Can you help me with this?

Thank you in advance!
Benoit

Hi Benoit,

Thank you for the question!

TileDB is optimized to query data on filesystem-agnostic backends such as S3.

Starting from either an AnnData object or an h5ad file, your first step is to transform it into a tiledbsoma.Experiment. The result of that transformation then physically sits in your S3 backend. This is not the step where we are trying to optimize memory usage; here we are primarily focused on ingesting data into the TileDB format. Following this transformation, you gain the performance and efficiency you expect from TileDB when accessing that data.
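As a minimal sketch of that first ingest step (the bucket URI, input filename, and measurement name "RNA" below are placeholders, not prescribed values):

```python
import tiledbsoma.io

# Ingest an h5ad file directly into a new Experiment on S3.
# The URI and filename are illustrative only.
uri = tiledbsoma.io.from_h5ad(
    "s3://my-bucket/my-experiment",
    "local_data.h5ad",
    measurement_name="RNA",
)
```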

To then query that data, you would use tiledbsoma.Experiment.open to push the query down to the on-disk object, rather than first having to load the entire dataset into memory.
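For example, roughly like this (the obs column cell_type and the X layer name "data" are assumptions about your data, not requirements):

```python
import tiledbsoma

# Open the Experiment lazily; no data is loaded yet.
with tiledbsoma.Experiment.open("s3://my-bucket/my-experiment") as exp:
    # The filter is evaluated against the on-disk obs array.
    query = exp.axis_query(
        measurement_name="RNA",
        obs_query=tiledbsoma.AxisQuery(value_filter="cell_type == 'B cell'"),
    )
    # Only the matching slice is materialized in memory.
    adata = query.to_anndata(X_name="data")
```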

We’ve recently launched TileDB Academy, which has several tutorials for tiledbsoma, such as performing multi-slot queries on an Experiment.

Thanks,
Spencer

Dear Spencer,

Thanks for your quick reply!

Ah, I see. But then there must be a way to ingest one observation at a time into an existing TileDB Experiment on S3? That should not use too much RAM, right?
Thank you!
Benoit

Once you have an existing tiledbsoma.Experiment on S3, you can append data to it, such as new obs data. Check out this tutorial on how to do so: TileDB
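In outline, the append workflow looks something like this (URIs and the measurement name are placeholders; see the tutorial for full details):

```python
import tiledbsoma.io

# Register the new AnnData against the existing Experiment so its
# obs/var IDs are reconciled with what is already on S3.
rd = tiledbsoma.io.register_anndatas(
    "s3://my-bucket/my-experiment",
    [adata_new],  # the AnnData object(s) to append
    measurement_name="RNA",
    obs_field_name="obs_id",
    var_field_name="var_id",
)

# Write only the new data, guided by the registration mapping.
tiledbsoma.io.from_anndata(
    "s3://my-bucket/my-experiment",
    adata_new,
    measurement_name="RNA",
    registration_mapping=rd,
)
```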

Spencer

Dear Spencer,

Thank you, this helps a lot! When I try it this way, first registering the AnnData object and then uploading the differences, I get an error:
ValueError: internal coding error: id_column_name unspecified
I tried deleting obsm and varm, and in both cases the error is the same.

Thank you,
Benoit

Hi @bputzeys, can you share a repro here so I can take a look?

Was this specifically during tiledbsoma.io.from_anndata? Did you pass values for obs_field_name and var_field_name to tiledbsoma.io.register_anndatas?

Thanks,
Spencer

Hi Spencer,

I'll try to recreate it and will let you know how you can reproduce it too.
Yes, I get the error after registration, during the tiledbsoma.io.from_anndata step. And yes, I specified obs_field_name="obs_id" and var_field_name="var_id" in tiledbsoma.io.register_anndatas, just as in the tutorial.

Ah, I think it was because I had .uns values. I don't think I need them, so I can just proceed without them.
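For reference, the workaround amounts to something like this (assuming .uns really isn't needed downstream):

```python
# Drop the unstructured metadata before registering/ingesting.
adata.uns = {}
```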
Thank you.

Hello Spencer,
I was able to reproduce my original issue: GitHub - bputzeys/tiledb-issue
If you could tell me that you can reproduce it that would be very much appreciated!

Thank you.

Dear @spencerseale,

Did you have time to have a look and maybe recreate the issue?

Thank you!
Benoit

Hi Benoit,

We expect these processes to take advantage of most available RAM. Converting an AnnData, which holds many different objects, into a tiledbsoma.Experiment, which represents many different arrays on disk, is an inherently memory-intensive transformation. When we open X and transform that data into an array, that is where you'll see memory usage spike while it is written to its compressed TileDB format. Our most compute-intensive operations typically involve ingest. Looking at your code, I can't say why your memory utilization keeps rising without a detailed analysis of your setup.

Querying this data is where TileDB brings the memory benefits, via out-of-core query operations. We push the query down onto the on-disk object, and only when explicitly requested do we materialize it as an in-memory object.
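As a sketch of what out-of-core access looks like (the measurement name "RNA" and X layer name "data" are assumptions about your Experiment):

```python
import tiledbsoma

with tiledbsoma.Experiment.open("s3://my-bucket/my-experiment") as exp:
    x = exp.ms["RNA"].X["data"]
    # Stream X as a sequence of Arrow tables rather than
    # materializing the full matrix in memory.
    for chunk in x.read().tables():
        ...  # process each pyarrow.Table chunk, then let it go
```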

On our multi-omics platform, TileDB, we have compute infrastructure that many of our customers use to parallelize these ingestions across custom-resourced instances. This means you could ingest many h5ad files into separate tiledbsoma.Experiments, or into a single one, in the time it takes your current iterative approach to ingest one. If you have a commercial use case involving TileDB-SOMA, I recommend checking out the platform, as I suspect what you're encountering has been resolved there.

Spencer