Hello,
I want to convert an AnnData object to a TileDB Experiment directly on an S3 bucket. This works nicely for smaller datasets (<10 GB). However, for larger ones (21 GB in this case), RAM usage becomes the bottleneck. I use tiledbsoma.io.from_anndata and tiledbsoma.io.from_h5ad, both of which expand to 250 GB in memory. My hope was that with tiledbsoma.io.from_anndata, I could read the AnnData object in backed='r' mode so as not to load the entire object into memory, but this also does not work (i.e., significant expansion still takes place).
What I am trying to do is essentially identical to this tutorial: TileDB
When I run that, the 7 GB file is also inflated to more than 30 GB. Is there a way to reduce RAM usage? Can you help me with this?
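For reference, this is roughly what I am running (the local path and bucket URI below are placeholders):

import anndata as ad
import tiledbsoma.io

# Open the h5ad in backed mode, hoping X stays on disk
adata = ad.read_h5ad("data/dataset.h5ad", backed="r")

# Ingest directly into an Experiment on S3
tiledbsoma.io.from_anndata(
    experiment_uri="s3://my-bucket/experiments/dataset",
    anndata=adata,
    measurement_name="RNA",
)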
Thank you in advance!
Benoit
Hi Benoit,
Thank you for the question!
TileDB is optimized to query data on filesystem-agnostic backends such as S3.
Starting from either an AnnData object or an h5ad file, you'd transform it to a tiledbsoma.Experiment as your first step. The result of that transformation would then physically sit in your S3 backend. This is not the step where we are trying to optimize memory usage; here we are primarily focused on ingesting the data into the TileDB format. Following this transformation, users gain the performance and efficiency they expect from TileDB when accessing that data.
To then query that data, you would use tiledbsoma.Experiment.open to push the query down to the on-disk object, rather than first having to load the entire dataset into memory.
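As a minimal sketch of that pattern (the URI, measurement name, and value filter are placeholders; adjust them to your data):

import tiledbsoma

# Open the Experiment lazily; nothing is read into memory yet
with tiledbsoma.Experiment.open("s3://my-bucket/experiments/dataset") as exp:
    # Push a filter down to the on-disk obs array
    query = exp.axis_query(
        measurement_name="RNA",
        obs_query=tiledbsoma.AxisQuery(value_filter="cell_type == 'B cell'"),
    )
    # Only the selected slice is materialized, here as an AnnData
    adata_slice = query.to_anndata(X_name="data")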
We've recently launched TileDB Academy, where we have several tutorials for tiledbsoma, such as performing multi-slot queries on an Experiment.
Thanks,
Spencer
Dear Spencer,
Thanks for your quick reply!
Ah I see, okay. But then there must be a way to ingest one observation at a time into an existing TileDB experiment on S3? That should not use too much RAM?
Thank you!
Benoit
Once you have an existing tiledbsoma.Experiment on S3, you can append data, such as adding new obs data. Check out this tutorial on how to do so: TileDB
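In rough outline, the append flow looks like this (the URI, file name, and join-column names are placeholders; adjust them to your data):

import anndata as ad
import tiledbsoma.io

EXP_URI = "s3://my-bucket/experiments/dataset"  # existing Experiment
new_adata = ad.read_h5ad("new_batch.h5ad")

# Register the incoming AnnData against the existing Experiment
rd = tiledbsoma.io.register_anndatas(
    experiment_uri=EXP_URI,
    adatas=[new_adata],
    measurement_name="RNA",
    obs_field_name="obs_id",
    var_field_name="var_id",
)

# Append the registered data
tiledbsoma.io.from_anndata(
    experiment_uri=EXP_URI,
    anndata=new_adata,
    measurement_name="RNA",
    registration_mapping=rd,
)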
Spencer
Dear Spencer,
Thank you, this helps a lot! When I try it this way (first registering the AnnData object and then uploading the differences), I get an error:
ValueError: internal coding error: id_column_name unspecified
I deleted obsm and varm, and in both cases, the error is the same.
Thank you,
Benoit
Hi @bputzeys, can you share a repro here so I can take a look?
Was this specifically during tiledbsoma.io.from_anndata? Did you specify values for obs_field_name and var_field_name in tiledbsoma.io.register_anndatas?
Thanks,
Spencer
Hi Spencer,
I'll try to recreate it and will let you know how you can reproduce it too.
Yes, I get the error after registration, during the tiledbsoma.io.from_anndata step. Yes, I specified obs_field_name="obs_id" and var_field_name="var_id" in tiledbsoma.io.register_anndatas, just as in the tutorial.
Ah, I think it was because I had .uns values. I don't think I need them, so I can just do it without.
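For reference, all I changed was dropping them before registering (assuming adata is the AnnData object being ingested):

# Drop unstructured metadata I don't need in the Experiment
adata.uns = {}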
Thank you.
Hello Spencer,
I was able to reproduce my original issue: GitHub - bputzeys/tiledb-issue
If you could tell me that you can reproduce it that would be very much appreciated!
Thank you.
Dear @spencerseale,
Did you have time to take a look and maybe recreate the issue?
Thank you!
Benoit
Hi Benoit,
We expect these processes to take advantage of most of the available RAM. The transformation from an AnnData holding many different in-memory objects to a tiledbsoma.Experiment representing many different arrays on disk is inherently heavy. When we open X and transform that data into an array, that is where you'll see memory usage spike while it is written to its compressed TileDB format. Our most compute-intensive operations typically involve ingest. Looking at your code, I cannot say why your memory utilization keeps rising without doing a detailed analysis of your setup.
The query of this data is where TileDB brings the memory benefits, via out-of-core query operations. We push the query down onto the object on disk, and only when explicitly requested do we materialize the result as an in-memory object.
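As a rough sketch of that out-of-core pattern (the URI and the downstream process function are placeholders):

import tiledbsoma

with tiledbsoma.Experiment.open("s3://my-bucket/experiments/dataset") as exp:
    X = exp.ms["RNA"].X["data"]
    # Stream X as a sequence of Arrow tables instead of
    # materializing the full matrix in memory
    for batch in X.read().tables():
        process(batch)  # placeholder for your downstream step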
On our multi-omics platform, TileDB, we have compute infrastructure that many of our customers use to parallelize the transformation across custom-resourced instances for these ingestions. This means you could ingest many h5ad files into separate tiledbsoma.Experiments, or into a single one, in the time it takes your current iterative approach to handle one. If you have a commercial use case involving TileDB-SOMA, I recommend checking out the platform, as I suspect what you're encountering has been resolved there.
Spencer
Hi Spencer,
let's say that I have a big AnnData (1 TB) composed of 100 AnnDatas (each about 10 GB). I can open it with anndata.read_h5ad in backed mode. I want to convert this into a TileDB Experiment. Can you point me to a notebook that helps me do this? I want to know if I can just open the big AnnData in backed mode, call tiledbsoma.io.from_anndata(), and just wait. I have 500 GB of RAM, so loading the X matrix into memory makes my machine crash.
Best !
Hi Mariano,
Thanks for the question!
You can certainly give that a shot, although it won’t be the fastest method!
TileDB arrays have multi-reader/writer support, so ideally you'd chunk the original AnnData into smaller partitions and ingest each partition into a tiledbsoma.Experiment in parallel.
If, say, you had 100 AnnData objects, each of 10 GB, you could use any one of those single AnnDatas (assuming the schema is the same across all of them, as it should be if they exist as one larger version) to first initialize a tiledbsoma.Experiment using tiledbsoma.io.from_anndata with ingest_mode="schema_only".
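As a minimal sketch of that first step (the file name and URI are placeholders):

import anndata as ad
import tiledbsoma.io

# Use any one of the chunks just to create the Experiment's schema;
# no bulk data is written with ingest_mode="schema_only"
template = ad.read_h5ad("chunks/part_000.h5ad")
tiledbsoma.io.from_anndata(
    experiment_uri="s3://my-bucket/experiments/big-dataset",
    anndata=template,
    measurement_name="RNA",
    ingest_mode="schema_only",
)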
Once you have your empty tiledbsoma.Experiment with the appropriate schema, you can append all of the data (including the AnnData you used to initialize the experiment's schema) into the destination tiledbsoma.Experiment. This tutorial in TileDB Academy demonstrates that process. Just know that the registration mapping needs to cover all of the new input data. You can then pass that registration mapping to individual processes running in parallel to ingest each AnnData.
Given the size of the original AnnData, it may be beneficial to split it into separate h5ad files representing those smaller chunks for a more performant ingestion. Each ingestor can then append its chunk into your destination tiledbsoma.Experiment using tiledbsoma.io.from_h5ad, in the same way you would with tiledbsoma.io.from_anndata.
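Putting those pieces together, a rough sketch of the register-then-append step might look like this (the URI, file names, and join-column names are placeholders, and the process-pool parallelism is illustrative; how you distribute the registration mapping to your workers depends on your setup):

from concurrent.futures import ProcessPoolExecutor
import tiledbsoma.io

EXP_URI = "s3://my-bucket/experiments/big-dataset"         # placeholder
H5ADS = [f"chunks/part_{i:03d}.h5ad" for i in range(100)]  # placeholder paths

def ingest(path, registration_mapping):
    # Append one chunk into the shared Experiment
    tiledbsoma.io.from_h5ad(
        experiment_uri=EXP_URI,
        input_path=path,
        measurement_name="RNA",
        registration_mapping=registration_mapping,
    )

if __name__ == "__main__":
    # One registration mapping covering every input file
    rd = tiledbsoma.io.register_h5ads(
        experiment_uri=EXP_URI,
        h5ad_file_names=H5ADS,
        measurement_name="RNA",
        obs_field_name="obs_id",
        var_field_name="var_id",
    )
    # Each worker appends its own chunk; TileDB supports concurrent writers
    with ProcessPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(ingest, path, rd) for path in H5ADS]
        for f in futures:
            f.result()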
Dealing with large, in-memory objects is a common problem with single-cell data. TileDB-SOMA solves that problem, and you'll then be able to query out smaller data chunks for iteration/analysis.
If you want to talk in detail about this and share your use case, I'm always available at spencer@tiledb.com, or you can respond here as well!