OK, I have the following setup: 5,000 samples and 600K loci. My tile configuration was previously 1 x 600K, which means one sample is contained in one tile spanning 600K cells, for a total of 5,000 tiles (one tile per sample). I write by sample and read across samples.
Given this configuration, writes are very fast (which they were). However, my read time suffered, because to read the information for one locus I have to jump across 5,000 tiles, one per sample.
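To make that read cost concrete, here is a quick back-of-the-envelope calculation in plain Python, using the sample count from above (the 10-row tile variant shown for comparison is the alternative configuration discussed below, not something TileDB-specific):

```python
# Back-of-the-envelope: how many tiles a single-locus (column) read touches.
num_samples = 5_000

def tiles_touched_per_locus(row_tile_extent: int) -> int:
    """Each strip of `row_tile_extent` samples contributes one tile to a column read."""
    return -(-num_samples // row_tile_extent)  # ceiling division

print(tiles_touched_per_locus(1))    # 1 x 600K tiles  -> 5000 tiles per locus read
print(tiles_touched_per_locus(10))   # 10 x 600K tiles -> 500 tiles per locus read
```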
The array with this configuration took 25 GB of disk space, which is fine.
Now, I wanted to improve the read times and at least make sure that I don’t have to read across 5,000 tiles just to read one locus. To do this, I changed my tile dimensions from (1 x 600K) to (10 x 600K), and what I am seeing is that the size of the array on disk is substantially higher. What I am failing to understand is that it is the same amount of data, just organized differently. Why is there such a huge difference in disk space, or am I doing something wrong?
Is this a dense or sparse array (sending us the entire schema would help a lot)?
In the dense case, TileDB stores integral tiles on disk. That is, with 10 x 600K tiles, even if you write a single cell in a space tile, TileDB will fill the remaining 10 x 600K - 1 cells of that tile with special empty/dummy values and write the whole tile to disk (see the related docs). Therefore, you need to make sure that each of your space tiles contains useful information before writing it to disk. This means that the subarray you set for the write should contain an integral number of space tiles.
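Here is a rough sketch of the resulting inflation, under the assumption that each write still covers a single sample row (so 9 out of every 10 rows in each 10 x 600K tile are fill values); the numbers are from the question above, and this is an order-of-magnitude estimate, not an exact accounting of TileDB's on-disk format:

```python
# Rough estimate of dense-tile padding when writes don't cover whole tiles.
row_tile_extent = 10        # rows per space tile in the new configuration
rows_per_write = 1          # one sample per write, as in the original workload
old_size_gb = 25            # observed size with 1 x 600K tiles (no padding)

# If every single-row write materializes a full 10-row tile, only 1 of every
# 10 rows carries real data, so the written data can balloon by roughly the
# ratio of tile rows to written rows.
inflation = row_tile_extent / rows_per_write
print(f"~{inflation:.0f}x inflation -> on the order of {old_size_gb * inflation:.0f} GB")
```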
Your problem sounds familiar. We solved a similar problem for a variant (sparse) dataset where the rows were samples and the columns were loci: the updates happened on a row-by-row basis, but the reads were on a column basis. You need to do the following:
1. Define space tiles that are “columnar”, that is, elongated along the rows. You may want to play with a larger row tile extent and a smaller column extent; it depends on how much RAM you can spare for the writes.
2. Make sure that you fully populate your space tiles, i.e., set a write subarray that coincides with a particular tile (or contains an integral number of such space tiles). This will prevent space inflation during the writes.
3. Define a col-major cell order in the schema. This will allow you to do rapid “columnar” reads. The writes may suffer a bit, as TileDB will have to re-organize your write input from row-major to col-major layout internally, but that’s probably fine (you can write tiles/subarrays to TileDB in parallel anyway).
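The steps above can be sketched in plain Python. The tile extents below are hypothetical placeholders (tune them to your RAM budget), and the helpers stand in for logic you would wrap around your actual TileDB write calls; nothing here is a real TileDB API:

```python
# Sketch of the suggested write pattern (hypothetical extents, not real TileDB calls).
ROW_TILE_EXTENT = 100      # elongated along rows: tune to available write RAM
COL_TILE_EXTENT = 1_000    # smaller column extent for "columnar" tiles

def is_tile_aligned(row_start: int, row_end: int) -> bool:
    """A write subarray should contain an integral number of space tiles."""
    n_rows = row_end - row_start + 1
    return row_start % ROW_TILE_EXTENT == 0 and n_rows % ROW_TILE_EXTENT == 0

def to_col_major(rows):
    """Reorder a row-major buffer (one list per sample) into col-major order,
    matching a col-major cell order in the schema."""
    return [rows[r][c] for c in range(len(rows[0])) for r in range(len(rows))]

# Buffer ROW_TILE_EXTENT samples, then issue one aligned write:
buffered = [["s0l0", "s0l1"], ["s1l0", "s1l1"]]  # tiny 2x2 demo buffer
print(is_tile_aligned(0, 99))   # True: exactly one row-tile strip
print(is_tile_aligned(0, 49))   # False: half a tile -> padding on disk
print(to_col_major(buffered))   # ['s0l0', 's1l0', 's0l1', 's1l1']
```

The point of `is_tile_aligned` is step 2 (no partial tiles, hence no fill-value inflation); `to_col_major` just illustrates the layout reorganization that TileDB performs internally when the cell order is col-major.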
I hope the above helps.