Dense or sparse array for incrementally generated dense data

I’m experimenting with tiledb for persisting some data. I have about 10^7 integer values to save, but these will be generated by a number of workers in significantly smaller batches. I see that in another question on the forum there’s talk of dropping sparse writes to dense arrays. Does that mean I should use a sparse array here? What does that mean for reading the data once it’s fully populated?

Hi @PaulRudin, can you please share some information about the schema of the array that you intend to store your data in (e.g., number of dimensions, types, etc.)? We can also help with designing the appropriate schema (including the decision on whether to store in a dense or sparse array) if you could explain a bit about the patterns in which the data will be written by the workers and read back later.

Regarding specifically dropping sparse write support for dense arrays, I don’t think that this will impact you. It was allowed in the past in case someone wished to overwrite a handful of “random” cells in a dense array. So far we have not seen a compelling use case for that, as in most use cases users update entire tiles.

The suggested way for writing to an array from multiple nodes is the following:

  1. Sparse arrays: You can write cells with any ad hoc coordinates and there will be no issues. What leads to better performance is (a) writing batches of 100MB-1GB per write operation (to create non-negligible fragments) and (b) trying to collocate the coordinates of the cells in each write as much as possible. For instance, if you imagine a big bounding box around the coordinates of each fragment (produced by a separate node), the bounding boxes across the nodes should not overlap too much. That will enable TileDB to perform the reads more rapidly later (see the sketch after this list).

  2. Dense arrays: You can define a massive array domain upon array creation without any impact on performance. Then, each node can write any dense partition of this domain. Again, suggestions similar to the sparse case above apply: (a) each write should be 100MB-1GB and (b) the subarrays you are writing into should not overlap much.
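Here is a minimal sketch of the sparse pattern in TileDB-Py; the URI, dimension/attribute names and batch sizes below are just placeholders:

```python
import numpy as np
import tiledb

URI = "my_sparse_array"  # placeholder URI

# 1-D sparse array over a large integer domain.
dim = tiledb.Dim(name="idx", domain=(0, 10_000_000), tile=100_000, dtype=np.uint64)
schema = tiledb.ArraySchema(
    domain=tiledb.Domain(dim),
    attrs=[tiledb.Attr(name="val", dtype=np.int64)],
    sparse=True,
)
tiledb.Array.create(URI, schema)

# Each worker writes one batch whose coordinates are collocated (here a
# contiguous range), so the fragments' bounding boxes barely overlap.
def write_batch(start, values):
    coords = np.arange(start, start + len(values), dtype=np.uint64)
    with tiledb.open(URI, "w") as A:
        A[coords] = values

write_batch(0, np.arange(5_000, dtype=np.int64))      # e.g., worker 1
write_batch(5_000, np.arange(5_000, dtype=np.int64))  # e.g., worker 2
```

In practice each batch should of course be much larger than 5K cells, per the 100MB-1GB guideline above.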

For reading, you don’t need to know much, other than the fact that reading a sparse array returns only the non-empty cells, whereas reading a dense array will return even the unpopulated portions of the array you are slicing (TileDB returns special fill values for those cells).
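To make that concrete, reads in TileDB-Py look roughly like this (names follow the placeholder schema above; the exact result keys may differ slightly across versions):

```python
import tiledb

# Dense read: slicing a region that was never written still returns data;
# the unwritten cells hold TileDB's fill value for the attribute's dtype.
with tiledb.open("my_dense_array", "r") as A:   # placeholder dense array URI
    dense_vals = A[0:1000]["val"]               # numpy array of length 1000

# Sparse read: only the cells that were actually written come back,
# along with their coordinates (keyed here by the dimension name).
with tiledb.open("my_sparse_array", "r") as A:
    res = A[0:1000]
    sparse_vals = res["val"]
    sparse_coords = res["idx"]
```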

I hope that the above is helpful but please let me know if you need further help and examples.

Thanks for the reply.

At the moment this is just experimental. I was looking at putting ~10^7 integers in a 1-d array but, as I say, these data will be created in batches of maybe 5000 at a go, by multiple workers.
In the normal course of events there shouldn’t be any overlap in the parts of the array written by different workers. Once the array is fully populated, then I’ll want to read it into a numpy array in different workers for further calculations.

From your response it sounds like in this case I might as well try with a dense array.

Yeah, here is what I would do:

  1. Create a dense array
  2. Set the tile extent to whatever your batch size should be (ideally more than 5K, maybe something closer to 100K)
  3. Define the array domain as something huge, e.g., [0, 10bn] or something like that
  4. Coordinate the workers to write to separate subarrays (you can do that with TileDB)
  5. Consolidate the fragment metadata after finishing with your writes.
  6. After writing, you can slice any subarray and the result will be a numpy array in TileDB-Py (see the sketch below)
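A minimal end-to-end sketch of those steps in TileDB-Py, assuming a single int64 attribute and placeholder names/sizes:

```python
import numpy as np
import tiledb

URI = "dense_results"  # placeholder URI

# Steps 1-3: dense array, tile extent ~100K, generously sized domain.
dim = tiledb.Dim(name="idx", domain=(0, 9_999_999), tile=100_000, dtype=np.uint64)
schema = tiledb.ArraySchema(
    domain=tiledb.Domain(dim),
    attrs=[tiledb.Attr(name="val", dtype=np.int64)],
    sparse=False,
)
tiledb.Array.create(URI, schema)

# Step 4: each worker writes its own non-overlapping slice.
def worker_write(start, values):
    with tiledb.open(URI, "w") as A:
        A[start:start + len(values)] = values

worker_write(0, np.random.randint(0, 100, 100_000, dtype=np.int64))        # worker 1
worker_write(100_000, np.random.randint(0, 100, 100_000, dtype=np.int64))  # worker 2

# Step 5: consolidate only the fragment metadata (no array data is rewritten).
tiledb.consolidate(URI, config=tiledb.Config({"sm.consolidation.mode": "fragment_meta"}))

# Step 6: slicing returns a numpy array per attribute.
with tiledb.open(URI, "r") as A:
    vals = A[0:200_000]["val"]
```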

Please let us know if you have any issues and we will provide an example.

Stavros

The array creation seems to be working OK. I actually have two numbers per row, so I’m making a dense array 10 million x 2. (I’m not able to create as many workers as I’d hoped due to GCP quota limits, but hopefully that will be resolved soon.) For various reasons I’m not able to write blocks as large as you suggest, but I’ll see how that pans out; presumably post-consolidation it doesn’t actually matter? The workers are writing non-overlapping contiguous blocks, with data from a numpy matrix of size about 10000 x 2 at a time.

I don’t quite understand why you suggest making the array domain so large since I know I’m not going to need more than about 10 million?

I wonder if there’s an obvious heuristic to tune consolidation and vacuuming to avoid OOM kills. I know roughly how much memory is available, and can find the number and size of fragments.

Hi @PaulRudin,

  • Regarding the schema, I’d probably make the array 10m x 1 and use 2 attributes instead. This way you’ll be able to extract just one of the two numbers when needed, without paying the cost of reading the one you don’t need. Alternatively, you could create a single attribute with cell_val_num set to 2, so that each cell of that attribute stores two values (see the first sketch below).

  • Regarding the domain, sure thing, if you don’t exceed 10m, then you can set it to that. My point was mainly that it doesn’t matter how big the domain is, since TileDB does not allocate that space a priori.

  • Regarding consolidation, this is something we are currently working on very actively, especially on better memory management. If you can give us some more information regarding how many fragments you are attempting to consolidate (I am assuming 1000?), then we can provide you with a script that tunes the config so that consolidation happens in a step-wise fashion and makes more sane use of the memory (the second sketch below shows the relevant knobs).
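Here is a rough sketch of the two-attribute schema in TileDB-Py (the URI, attribute names and dtypes are placeholders):

```python
import numpy as np
import tiledb

URI = "results_two_attrs"  # placeholder URI

# 10m x 1 dense array with two attributes, instead of a 10m x 2 array.
dim = tiledb.Dim(name="idx", domain=(0, 9_999_999), tile=100_000, dtype=np.uint64)
schema = tiledb.ArraySchema(
    domain=tiledb.Domain(dim),
    attrs=[
        tiledb.Attr(name="a", dtype=np.int64),
        tiledb.Attr(name="b", dtype=np.int64),
    ],
    sparse=False,
)
tiledb.Array.create(URI, schema)

# Writes pass one numpy array per attribute.
with tiledb.open(URI, "w") as A:
    A[0:10_000] = {"a": np.zeros(10_000, dtype=np.int64),
                   "b": np.ones(10_000, dtype=np.int64)}

# Reads can then fetch just one of the attributes.
with tiledb.open(URI, "r") as A:
    only_a = A.query(attrs=["a"])[0:10_000]["a"]
```

And as a starting point for step-wise consolidation, these are the kinds of config knobs such a script would tune; the numbers are placeholders to adjust to your fragment count and memory budget, not a recommendation:

```python
import tiledb

URI = "results_two_attrs"  # placeholder URI

# Merge only a few similar-sized fragments per step so memory use stays bounded.
cfg = tiledb.Config({
    "sm.consolidation.steps": "10",              # at most 10 consolidation passes
    "sm.consolidation.step_min_frags": "2",      # merge between 2 and ...
    "sm.consolidation.step_max_frags": "8",      # ... 8 fragments per pass
    "sm.consolidation.buffer_size": "50000000",  # ~50MB per attribute buffer
})
tiledb.consolidate(URI, config=cfg)

# Optionally remove the now-consolidated fragments afterwards.
tiledb.vacuum(URI)
```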

Thanks!