Performance on blob stores with small arrays

Hi,

I am working with samples of heterogeneous data; think of each sample as a dict of arrays with various shapes.

I am considering using a group to represent my dict and then storing each of the different arrays independently.

The issue is that there may be lots of tiny arrays, so read times for my group shoot up.
The workaround I have for now is to pack the smaller data into a single sparse array and keep track of the names and shapes in the array's metadata. This feels hacky. Has anyone encountered similar concerns?
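Roughly, the workaround looks like this (a minimal sketch only; the URI, the attribute name `v`, and the packing layout are placeholders I made up):

```python
import json
import numpy as np
import tiledb

uri = "packed_small_arrays"  # placeholder URI

# One sparse dimension indexes the flattened values of all the small arrays.
dom = tiledb.Domain(
    tiledb.Dim(name="pos", domain=(0, 2**32 - 1), tile=1024, dtype=np.uint64)
)
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=True,
    attrs=[tiledb.Attr(name="v", dtype=np.float64)],
)
tiledb.Array.create(uri, schema)

small = {"a": np.arange(6).reshape(2, 3), "b": np.ones(4)}

# Flatten everything into one buffer, remembering each item's offset and shape.
layout, flat, cursor = {}, [], 0
for name, arr in small.items():
    layout[name] = {"offset": cursor, "shape": list(arr.shape)}
    flat.append(arr.ravel().astype(np.float64))
    cursor += arr.size

values = np.concatenate(flat)
with tiledb.open(uri, "w") as A:
    A[np.arange(len(values), dtype=np.uint64)] = values
    A.meta["layout"] = json.dumps(layout)  # names/shapes tracked in metadata
```

Reading an item back means looking up its entry in the `layout` metadata, slicing `pos` over `[offset, offset + size)`, and reshaping.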

Side question:
Are there force_load methods for groups, where the data would be loaded in parallel and efficiently? (using Python)

Hi @p-yv,

Apologies for the delayed response here. It would be helpful to have a more concrete description of what the data schemas look like and how many arrays you have in mind. But in general, yes – using sparse arrays sounds like a good approach. Note also that TileDB supports string dimensions for sparse arrays, so you may be able to model the separate items with string indexes rather than metadata if that is a better fit for the problem.
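For example, something along these lines (a sketch only; the URI, dimension names, and attribute are made up, and this assumes the string-dimension support in tiledb-py):

```python
import numpy as np
import tiledb

uri = "string_indexed_items"  # made-up URI

# "name" is a string dimension identifying each small array;
# "pos" is the flattened element index within that array.
dom = tiledb.Domain(
    tiledb.Dim(name="name", domain=(None, None), tile=None, dtype="ascii"),
    tiledb.Dim(name="pos", domain=(0, 2**32 - 1), tile=1024, dtype=np.uint64),
)
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=True,
    attrs=[tiledb.Attr(name="v", dtype=np.float64)],
)
tiledb.Array.create(uri, schema)

small = {"a": np.arange(6, dtype=np.float64), "b": np.ones(4)}
with tiledb.open(uri, "w") as A:
    for name, arr in small.items():
        # Repeat the item name for each element it contributes.
        A[[name] * arr.size, np.arange(arr.size, dtype=np.uint64)] = arr

# Retrieve one item by its string index instead of consulting metadata.
with tiledb.open(uri) as A:
    result = A.multi_index["a"]  # dict with coords and the "v" attribute
```

The upside over the metadata approach is that the item name becomes a queryable coordinate, so you can slice by name directly.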

Are there force_load methods for groups, where the data would be loaded in parallel and efficiently? (using Python)

You can use a ThreadPool to run the opens concurrently; each open releases the GIL. Here’s a similar example from one of our tests. I’m not sure this will help much for local arrays, but it should help for remote arrays.
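In outline, it looks something like this (a sketch only, not the test itself; `group_uri` is a placeholder, and listing members assumes the `tiledb.Group` API):

```python
from concurrent.futures import ThreadPoolExecutor

import tiledb

group_uri = "my_group"  # placeholder group URI

def load(uri):
    # tiledb.open releases the GIL, so opens and reads overlap across threads.
    with tiledb.open(uri) as A:
        return uri, A[:]

# Collect the member array URIs from the group.
with tiledb.Group(group_uri) as g:
    uris = [member.uri for member in g]

# Force-load all members concurrently.
with ThreadPoolExecutor(max_workers=16) as pool:
    data = dict(pool.map(load, uris))
```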

Best,
Isaiah