R: Optimising Read Performance - Range within a dimension

I have a sparse array written to disk - 6 dimensions (INT32) with a single attribute. Ranges for each dimension are as follows:

1:60k, 1:100m, 1:2, 1:2, 1:1m, 1:10k

Queries work well in most cases: selecting a single value for a dimension, say, returns all results fast, but range queries are much slower. The array is about 6GB uncompressed, and in the limiting case of a query that loads the whole array, it takes a long time (over 10 minutes on a 160GB RAM, 40-core Linux machine). For comparison, R's 'data.table' package (using its fread function) can load the CSV equivalent in seconds.
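
For reference, a typical query looks roughly like this (a minimal sketch with a placeholder URI and placeholder values, not my real data):

```r
library(tiledb)

## Open the sparse array for reading back as a data.frame
arr <- tiledb_array("my_array_uri", as.data.frame = TRUE)

## selected_ranges() takes one two-column (start, end) matrix per dimension.
## A single value per dimension comes back fast; widening one of them to a
## range (here dimension 2) is the slow case.
selected_ranges(arr) <- list(
  cbind(150L, 150L),        # dim 1: single value
  cbind(1L, 1000000L),      # dim 2: a range - much slower
  cbind(1L, 1L),
  cbind(2L, 2L),
  cbind(5000L, 5000L),
  cbind(42L, 42L)
)

res <- arr[]    # execute the read
```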

I have played around with the config, to no effect, and fear I am misunderstanding something more fundamental.

All help appreciated guys, thanks in advance
M

Dear MOR,

Thanks for bringing this to our attention. Which versions of the R package and of TileDB are you using?
(You can check via packageVersion("tiledb") and tiledb::tiledb_version(), respectively.)

Could you possibly provide a mock-up of your data via, say, a generating function producing a similar schema (smaller, of course, if need be) that exhibits similar performance, so that we can take a more detailed look?
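
Even something along the following lines would already be useful (just a sketch of what I have in mind; the tile extents, attribute type and cell count are guesses on my part):

```r
library(tiledb)

## Sketch: create a 6-d sparse array with INT32 dimensions matching the
## stated domains, a single attribute, and n randomly placed cells.
make_mock_array <- function(uri, n = 1e5) {
  dom <- tiledb_domain(dims = list(
    tiledb_dim("d1", c(1L, 60000L),     1000L, "INT32"),
    tiledb_dim("d2", c(1L, 100000000L), 1000L, "INT32"),
    tiledb_dim("d3", c(1L, 2L),         1L,    "INT32"),
    tiledb_dim("d4", c(1L, 2L),         1L,    "INT32"),
    tiledb_dim("d5", c(1L, 1000000L),   1000L, "INT32"),
    tiledb_dim("d6", c(1L, 10000L),     1000L, "INT32")
  ))
  sch <- tiledb_array_schema(dom,
                             attrs = list(tiledb_attr("a", type = "FLOAT64")),
                             sparse = TRUE)
  tiledb_array_create(uri, sch)

  df <- data.frame(d1 = sample.int(60000L, n, replace = TRUE),
                   d2 = sample.int(100000000L, n, replace = TRUE),
                   d3 = sample.int(2L, n, replace = TRUE),
                   d4 = sample.int(2L, n, replace = TRUE),
                   d5 = sample.int(1000000L, n, replace = TRUE),
                   d6 = sample.int(10000L, n, replace = TRUE),
                   a  = rnorm(n))
  arr <- tiledb_array(uri)
  arr[] <- df                # write the mock cells
  invisible(uri)
}
```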

Thanks, Dirk

Thanks for the quick reply, Dirk. Let me get on that and come back. Thanks, M

Hi there,

Some more questions:

  1. Is the array stored locally or on S3?
  2. What are the values of vfs.num_threads and sm.num_reader_threads in the config you pass to the context? They should be set equal to your number of cores, as they currently default to 1 (see the sketch after this list).
  3. How many fragments are there in the array (i.e., number of __* subfolders inside the array folder)?
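
For item 2, the settings can be passed along these lines (a sketch; adjust the counts to your 40 cores, and the array path is a placeholder), and item 3 can be checked with a simple directory listing:

```r
library(tiledb)

## Raise the VFS and reader thread counts to the number of cores
cfg <- tiledb_config()
cfg["vfs.num_threads"]       <- "40"
cfg["sm.num_reader_threads"] <- "40"
ctx <- tiledb_ctx(cfg)      # use this context when opening the array

## Count the fragments, i.e. the "__*" subfolders of the array folder
frags <- grep("^__", basename(list.dirs("my_array_uri", recursive = FALSE)), value = TRUE)
length(frags)
```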

I am about to merge some very important optimizations to dev (pertaining to both read parallelization and handling of numerous fragments). Therefore, in addition to potential tuning you can do with the config parameters above, please stay tuned to try out the upcoming improvements.