OK, so here are the basic concepts of writing, consolidation (of fragments and metadata) and vacuuming.
Writing: In TileDB, all files/objects written are immutable. This is because (1) we want TileDB to work very efficiently on cloud object stores, where objects are immutable, (2) we want to allow parallel writes without any locking or synchronization, and (3) we want to support data versioning and time traveling. Therefore, TileDB introduces the concept of “fragments”: each write produces a standalone, timestamped subarray, with all of its new files organized in a timestamped subfolder inside the array folder.
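To make this concrete, here is a minimal Python sketch using the tiledb-py package. The array URI, schema, and data below are made up purely for illustration:

```python
import numpy as np
import tiledb

uri = "my_array"  # hypothetical local array folder

# Create a simple 1-D dense array (illustrative schema only).
dom = tiledb.Domain(tiledb.Dim(name="d", domain=(0, 99), tile=10, dtype=np.int32))
schema = tiledb.ArraySchema(domain=dom, attrs=[tiledb.Attr(name="a", dtype=np.float64)])
tiledb.Array.create(uri, schema)

# Each write creates a new immutable, timestamped fragment
# (a timestamped subfolder inside the array folder).
with tiledb.open(uri, "w") as A:
    A[0:50] = np.random.rand(50)
with tiledb.open(uri, "w") as A:
    A[50:100] = np.random.rand(50)

# Inspect the fragments produced by the two writes.
frags = tiledb.array_fragments(uri)
print(len(frags), frags.timestamp_range)

# Time travel: open the array at the timestamp of the first fragment
# to see only the data written by the first write.
ts = frags.timestamp_range[0][1]
with tiledb.open(uri, "r", timestamp=ts) as A:
    print(A[0:100]["a"])
```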
Problems with immutable fragments: The best way to achieve excellent IO performance is to perform parallel reads of large blocks stored contiguously in the backend. If you perform too many small random reads, performance degrades because the fixed latency costs per request start to add up. Therefore, if you have too many fragments (i.e., if you split your array data into too many separate files) and each query touches a lot of those files, you will end up experiencing poor performance.
Solution #1: Perform big writes: One naive solution is to perform larger writes, therefore reducing the number of fragments. Although doable in certain applications, we cannot expect every application to write in large batches. It is still worth noting here though.
Solution #2: Fragment consolidation: This essentially means merging smaller fragments into bigger ones. That should work great, but you need to tune it. To understand this, let me describe how consolidation works. It reads the entire array (in a nicely batched way, given some user-configured memory budget) and writes the entire array back into a single fragment. If you have thousands of unconsolidated fragments and attempt to consolidate them in a single go, performance will suffer in the read step, as it will perform an excessive number of small random IO operations, as discussed above.
To address this, we provide a lot of configuration parameters. One of those is sm.consolidation.step_max_frags, which caps the number of fragments to consolidate at a time, therefore reducing the number of small random IOs. For example, experiment by setting this to “2”, and also sm.consolidation.steps to “1”. Consolidation with those parameters will consolidate exactly 2 fragments into a new one, without removing the 2 original fragments. If you want to remove the consolidated fragments, run vacuuming after consolidation. That’s exactly what vacuuming does: it deletes the fragments that have been consolidated. We have made this a separate step, since you may still want to time travel to older fragments, even after consolidation. If you don’t care about that, always vacuum.
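As a sketch in Python (tiledb-py), and with the caveat that exact parameter names and defaults may vary across TileDB versions, this could look as follows. Note that sm.consolidation.step_min_frags is an additional parameter, not mentioned above, that is typically set to the same value as step_max_frags so that the step actually picks up the two fragments:

```python
import tiledb

uri = "my_array"  # placeholder array URI

# Consolidate exactly 2 fragments in a single step.
config = tiledb.Config({
    "sm.consolidation.steps": "1",
    "sm.consolidation.step_min_frags": "2",
    "sm.consolidation.step_max_frags": "2",
})
tiledb.consolidate(uri, config=config)

# Vacuuming deletes the fragments that were just consolidated.
# Skip this if you still want to time travel to them.
tiledb.vacuum(uri)
```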
With the above in mind, you can keep running consolidate+vacuum multiple times to reduce the number of fragments to a point where reads become much faster. What we are missing today is a way to consolidate “specific fragment partitions”, which we are adding soon. That will enable you to run consolidation from multiple machines in parallel, with each machine focusing on a disjoint fragment partition, effectively making parallel consolidation process-safe. Please stay tuned.
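For instance, a simple driver loop (again just a sketch, reusing the hypothetical URI and config from above, with an arbitrary target of 10 fragments) could repeat consolidate+vacuum until the fragment count drops below a threshold:

```python
import tiledb

uri = "my_array"  # placeholder array URI
config = tiledb.Config({
    "sm.consolidation.steps": "1",
    "sm.consolidation.step_min_frags": "2",
    "sm.consolidation.step_max_frags": "2",
})

# Repeat consolidate+vacuum until few enough fragments remain.
# The bound of 100 iterations is just a safety net for this sketch.
for _ in range(100):
    if len(tiledb.array_fragments(uri)) <= 10:
        break
    tiledb.consolidate(uri, config=config)
    tiledb.vacuum(uri)
```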
Unfortunately (or thankfully, depending on how you see it), there is more tuning that can be done. For example, you can tell TileDB to merge fragments only if they are of similar size, to perform the merge in multiple steps, to set the memory budget, etc. Again, we can help more if it comes to that.
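For reference, here is a hedged sketch of a few of those knobs (the parameter names are taken from the TileDB configuration reference; please verify them against your installed version):

```python
import tiledb

config = tiledb.Config({
    # Only merge fragments whose sizes are within this ratio of each other.
    "sm.consolidation.step_size_ratio": "0.5",
    # Perform the merge over multiple consolidation steps.
    "sm.consolidation.steps": "4",
    # Per-attribute buffer size (in bytes) used during the consolidation reads.
    "sm.consolidation.buffer_size": "50000000",
})
tiledb.consolidate("my_array", config=config)  # placeholder URI
```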
Solution #3: Fragment metadata consolidation: Suppose you consolidate everything into a single enormous fragment. When TileDB performs a read from that fragment, it fetches some indexing information into main memory, such as the R-tree for effective tile pruning, the offsets of the tiles in the files, etc. If the fragment is enormous, this metadata may consume significant memory, so the first time you read you may experience some extra IO overhead, and after that you may notice that TileDB’s memory consumption is high. Therefore, the best situation is when the fragment sizes are “reasonable”, e.g., something like 500MB-10GB (admittedly, these numbers are rather empirical). In the general case, your array will consist of more than one fragment, even if you periodically consolidate.
Now, in the presence of multiple such larger fragments, TileDB will still need to fetch some very small but useful metadata from the fragments upon opening the array. This metadata is called “fragment metadata” and contains the “non-empty domain” of the fragment. The latter is super useful in determining whether the entire fragment can be ignored for a given slice that does not intersect its non-empty domain. But you still need to fetch this information from every fragment upon opening the array, and if you have many fragments you may experience a performance penalty.
To mitigate that, TileDB allows you to consolidate the fragment metadata into a single file. When opening the array, instead of performing O(F) IO operations (where F is the number of fragments), TileDB performs just one. Therefore, opening the array becomes rapid. However, this optimizes only the array opening. If your read must touch all fragments (e.g., if it is an entire-array read), you will still have to perform O(F) IO operations. But if the non-empty domains of your fragments are rather disjoint (i.e., each write populated a different part of the array, not interleaved with another write), TileDB will use the loaded fragment metadata, and specifically the non-empty domain of each fragment, to eliminate a huge portion of those IO operations, as it will correctly ignore many of the “irrelevant” fragments. Therefore, ideally, the fragment non-empty domains should be disjoint. We will soon be adding functionality to consolidation that attempts to produce fragments with disjoint non-empty domains.
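In Python, fragment metadata consolidation is a separate consolidation mode. A sketch follows; the mode strings are from the TileDB configuration reference, so double-check them for your version:

```python
import tiledb

uri = "my_array"  # placeholder array URI

# Consolidate only the fragment metadata into a single file;
# the fragments themselves are left untouched.
config = tiledb.Config({"sm.consolidation.mode": "fragment_meta"})
tiledb.consolidate(uri, config=config)

# Optionally vacuum the old, now-consolidated fragment metadata files.
vac_config = tiledb.Config({"sm.vacuum.mode": "fragment_meta"})
tiledb.vacuum(uri, config=vac_config)
```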
I hope the above helps. We’ll keep optimizing this path and making it easier to use. So please stay tuned and keep the questions coming.