Slow reads with a large number of attributes

Hey there.

We are storing single-cell expression data in TileDB arrays.
The dimension is of type string and represents the barcode_id or cell_id.

The attributes are the genes + cell annotations.

The number of attributes is ~50,000

The number of cells is ~100,000

Reading all the data for one attribute is very fast, but reading all attributes for one cell_id is very slow. Can you help me improve performance, explain why this happens, and comment on the array architecture that we chose?

Hi @royassis, have you checked out GitHub - single-cell-data/TileDB-SOMA: TileDB implementation of the single-cell SOMA API (we are co-developing that repo with CZI)? We store the expression data as a 2D cell x gene sparse matrix and chose Hilbert as the cell order. That lets us balance performance when slicing only cells vs. only genes vs. any combination of the two. Happy to share more info. cc-ing @aaron and @johnkerl, who may chime in as well.
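
As a rough illustration of why a Hilbert order can balance cell slices and gene slices, here is a toy sketch in plain Python. It is not TileDB or SOMA code: the 16 x 16 grid, the tile capacity of 16, the dense (every coordinate filled) matrix, and the helper names `hilbert_index`, `row_major_index` and `tiles_touched` are all assumptions made for the example. It sorts the coordinates of a cell x gene matrix along a curve, groups consecutive coordinates into fixed-capacity tiles, and counts how many tiles a single-cell or single-gene slice has to touch.

```python
def hilbert_index(n, x, y):
    """Position of (x, y) along a Hilbert curve on an n x n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the curve stays continuous.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def row_major_index(n, x, y):
    """Position of (x, y) when whole rows (cells) are laid out one after another."""
    return x * n + y


def tiles_touched(n, capacity, order, cell=None, gene=None):
    """Count the distinct tiles that hold the requested cell (row) or gene (column)."""
    tiles = set()
    for x in range(n):          # x: cell index (hypothetical toy data)
        for y in range(n):      # y: gene index
            if cell is not None and x != cell:
                continue
            if gene is not None and y != gene:
                continue
            # Consecutive positions along the curve share a tile.
            tiles.add(order(n, x, y) // capacity)
    return len(tiles)


N, CAPACITY = 16, 16  # toy sizes, chosen only to keep the example small
for name, order in [("row-major", row_major_index), ("hilbert", hilbert_index)]:
    one_cell = tiles_touched(N, CAPACITY, order, cell=5)
    one_gene = tiles_touched(N, CAPACITY, order, gene=5)
    print(f"{name:10s} tiles for one cell: {one_cell:2d}, one gene: {one_gene:2d}")
```

In this toy setup the row-major layout needs 1 tile for a cell slice but 16 for a gene slice, while the Hilbert layout needs 4 for either. An array with one attribute per gene sits at the opposite extreme: assuming each attribute is stored separately, a single gene is one contiguous read, but a single cell has to touch every one of the ~50,000 attributes.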

Hey @stavros

I’ll check SOMA out.

I would like to hear more about that.

This is an important initiative for us and we are putting a lot of resources into it. Happy to chat on a call and learn whether SOMA can capture your data model. This is intended to be community-driven and will remain open-source. Feel free to reach out when you are ready.