Performance on blob stores with small arrays

Hi,

I am working with samples of heterogeneous data; think of each sample as a dict of arrays with various shapes.

I am considering using a group to represent my dict and then storing each of the different arrays independently.

The issue is that there may be lots of tiny arrays, so read times for my group shoot up.
The workaround I have for now is to pack the smaller data into a single sparse array and keep track of the names and shapes in the array's metadata. This feels hacky. Has anyone encountered similar concerns?
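Roughly, the workaround looks like this (a minimal sketch only; the URI, the attribute name `v`, and the packing layout are placeholders I made up):

```python
import json
import numpy as np
import tiledb

uri = "packed_small_arrays"  # placeholder URI

# One sparse dimension indexes the flattened values of all the small arrays.
dom = tiledb.Domain(
    tiledb.Dim(name="pos", domain=(0, 2**32 - 1), tile=1024, dtype=np.uint64)
)
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=True,
    attrs=[tiledb.Attr(name="v", dtype=np.float64)],
)
tiledb.Array.create(uri, schema)

small = {"a": np.arange(6).reshape(2, 3), "b": np.ones(4)}

# Flatten everything into one buffer, remembering each item's offset and shape.
layout, flat, cursor = {}, [], 0
for name, arr in small.items():
    layout[name] = {"offset": cursor, "shape": list(arr.shape)}
    flat.append(arr.ravel().astype(np.float64))
    cursor += arr.size

values = np.concatenate(flat)
with tiledb.open(uri, "w") as A:
    A[np.arange(len(values), dtype=np.uint64)] = values
    A.meta["layout"] = json.dumps(layout)  # names/shapes tracked in metadata
```

Reading an item back means looking up its entry in the `layout` metadata, slicing `pos` over `[offset, offset + size)`, and reshaping.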

Side question:
Are there force_load methods for groups, where the data would be loaded in parallel and efficiently? (using Python)

Hi @p-yv,

Apologies for the delayed response here. It would be helpful to have a more concrete description of what the data schemas look like and how many arrays you have in mind. But in general, yes – using sparse arrays sounds like a good approach. Note also that TileDB supports string dimensions for sparse arrays, so you may be able to model the separate items with string indexes rather than metadata if that is a better fit for the problem.
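For example, something along these lines (a sketch only; the URI, dimension names, and attribute are made up, and this assumes the string-dimension support in tiledb-py):

```python
import numpy as np
import tiledb

uri = "string_indexed_items"  # made-up URI

# "name" is a string dimension identifying each small array;
# "pos" is the flattened element index within that array.
dom = tiledb.Domain(
    tiledb.Dim(name="name", domain=(None, None), tile=None, dtype="ascii"),
    tiledb.Dim(name="pos", domain=(0, 2**32 - 1), tile=1024, dtype=np.uint64),
)
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=True,
    attrs=[tiledb.Attr(name="v", dtype=np.float64)],
)
tiledb.Array.create(uri, schema)

small = {"a": np.arange(6, dtype=np.float64), "b": np.ones(4)}
with tiledb.open(uri, "w") as A:
    for name, arr in small.items():
        # Repeat the item name for each element it contributes.
        A[[name] * arr.size, np.arange(arr.size, dtype=np.uint64)] = arr

# Retrieve one item by its string index instead of consulting metadata.
with tiledb.open(uri) as A:
    result = A.multi_index["a"]  # dict with coords and the "v" attribute
```

The upside over the metadata approach is that the item name becomes a queryable coordinate, so you can slice by name directly.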

Are there force_load methods for groups, where the data would be loaded in parallel and efficiently? (using Python)

You can use a ThreadPool to run the opens concurrently; each open releases the GIL. Here’s a similar example from one of our tests. I’m not sure this will help much for local arrays, but it should help for remote arrays.
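In outline, it looks something like this (a sketch only, not the test itself; `group_uri` is a placeholder, and listing members assumes the `tiledb.Group` API):

```python
from concurrent.futures import ThreadPoolExecutor

import tiledb

group_uri = "my_group"  # placeholder group URI

def load(uri):
    # tiledb.open releases the GIL, so opens and reads overlap across threads.
    with tiledb.open(uri) as A:
        return uri, A[:]

# Collect the member array URIs from the group.
with tiledb.Group(group_uri) as g:
    uris = [member.uri for member in g]

# Force-load all members concurrently.
with ThreadPoolExecutor(max_workers=16) as pool:
    data = dict(pool.map(load, uris))
```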

Best,
Isaiah