I’m experimenting with tiledb for persisting some data. I have about 10^7 integer values to save, but these will be generated by a number of workers in significantly smaller batches. I see that in another question on the forum there’s talk of dropping sparse writes to dense arrays. Does that mean I should use a sparse array here? What does that mean for reading the data once it’s fully populated?
Hi @PaulRudin, can you please share some information around the schema of the array that you intend to store your data in (e.g., number of dimensions, types, etc). We can also help with designing the appropriate schema (including the decision on whether to store in a dense or sparse array) if you could explain a bit the patterns in which the data will be written by the workers and read back later.
Regarding specifically dropping the sparse write support for dense arrays, I don’t really think that this will impact you. That was allowed in the past in case someone which to overwrite a handful of “random” cells in a dense array. So far we have not seen a compelling use case for that, as in most use cases the users typically update entire tiles.
The suggested way for writing to an array from multiple nodes is the following:
Sparse arrays: You can write any cells with any ad hoc coordinates and there will be no issues. What can lead to optimized performance is (a) writing batches of 100MB-1GB per write operation (to create non-negligible fragments) and (b) try to collocate the coordinates of the cells in each write as much as possible. For instance, if you try to imagine a big bounding box for the coordinates of each fragment (produced by a separate node), the bounding boxes across the nodes should not overlap too much. That will enable TileDB perform the reads later more rapidly.
Dense writes: You can define a massive array domain upon the array creation without any impact on performance. Then, each node can write any dense partition of this domain. Again, similar suggestions to the sparse array above: (a) each write should be 100MB-1GB and (b) the subarrays you are writing into should not overlap much.
For reading, you don’t need to know much, other than the fact that reading a sparse array returns only the non-empty cells, whereas reading a dense array will return even the unpopulated portions of the array you are slicing (TileDB returns special fill values for those cells).
I hope that the above is helpful but please let me know if you need further help and examples.
Thanks for the reply.
At the moment this is just experimental. I was looking at putting ~10^7 integers in a 1-d array but, as I say, these data will be created in batches of maybe 5000 at a go, by multiple workers.
In the normal course of events there shouldn’t be any overlap in the parts of the array written by different workers. Once the array is fully populated, then I’ll want to read it into a numpy array in different workers for further calculations.
From your response it sounds like in this case I might as well try with a dense array.
Yeah, here is what I would do:
- Create a dense array
- Set the tile extent to whatever you batch size should be (ideally more than 5K, maybe something closer to 100K)
- Define the array domain as something huge, e.g., [0, 10bn] or sth like that
- Coordinate the workers to write to separate subarrays (you can do that with TileDB)
- Consolidate the fragment metadata after finishing with your writes.
- After writing, you can slice any subarray and the result will be a numpy array in TileDB-Py
Please let us know if you have any issues and we will provide an example.