Write confirmation (Question)

Hi,

I’m creating a TileDB database on a centralized hard drive that is accessed by 10+ machines. I use the Python interface and created a sparse array. When I test my algorithm and TileDB on a single machine, everything is fine. However, when I give different jobs to multiple machines, I see that some of my entries are not being written. From my log files I can tell that the code does enter the function that writes to TileDB, which does the following:

dict_val['done'] = 1
c_id = dict_val.pop('id')
c_test = dict_val.pop('test')
with tiledb.open(table_name, 'w') as A:
    A[c_id, c_test] = dict_val
tiledb.consolidate(table_name)

(The reason I consolidate after each write is that my write files are very small and they were making the read time very long after many writes. I’m not sure whether this could be related to the missing values.)

My code does not raise an error in the function and continues with the rest of the algorithm, but, as I mentioned, some of my values end up missing. Is there any way to confirm that a write has happened, so that I can detect failures and retry? I tried reading back and verifying after each write, but it adds too much overhead. Alternatively, is there a way to debug why TileDB doesn’t write these values, for example via internal log files?

On another note, I noticed that reads are much faster after vacuuming than after consolidation alone. The reason I was not vacuuming is that it takes a lock. Is there anything else I can do to read faster without vacuuming? That might also help with checking whether something has been written.


Hi there,

The issue here is that you are running consolidation in parallel, and there is probably a process-safety issue (which we will fix). I would advise the following:

  • Consolidate the fragment metadata, not the fragments, after all writes. That will speed up performance.
  • Run consolidation after all writes complete. It may take a bit of tuning to make it work, especially if you have tons of fragments. We will help with that if issues arise.
  • We will soon build some more functionality to parallelize consolidation across multiple machines, so eventually that will be very fast in your setting. Please stay tuned.

Hi Stavros,

Thank you for the quick help. I will try your suggestion and follow up.

In the meantime, can I ask how much consolidating the fragment metadata helps with reading? In the documentation I read that fragment metadata consolidation merges the footers for fast reads, but I couldn’t understand why consolidating the actual fragments is still necessary if read performance is solved by fragment metadata consolidation.

For some numbers: during initial development, when I didn’t consolidate after each write, I observed that towards the end of my runs the read time was around 6 minutes and the database was taking ~7 GB. After consolidating, which took 42 minutes, the read time decreased to a couple of seconds, and after vacuuming the size was 4.6 MB.


Hi,

I tried consolidating the fragment metadata and it solved the missing-value problem during the parallel runs. Thank you very much for that suggestion.

Nevertheless, one of the jobs gave the following error during fragment metadata consolidation.

Traceback (most recent call last):
  File "main.py", line 132, in insert_tiledb
    tiledb.consolidate(table_name, ctx=ctx)
  File "tiledb/libtiledb.pyx", line 5497, in tiledb.libtiledb.consolidate
  File "tiledb/libtiledb.pyx", line 480, in tiledb.libtiledb._raise_ctx_err
  File "tiledb/libtiledb.pyx", line 465, in tiledb.libtiledb._raise_tiledb_error
tiledb.libtiledb.TileDBError: [TileDB::IO] Error: Cannot read from file; Read exceeds file size

And the full code snippet which the error belongs to is the following:

dict_val['done'] = 1
c_id = dict_val.pop('id')
c_test = dict_val.pop('test')
with tiledb.open(table_name, 'w') as A:
    A[c_id, c_test] = dict_val
config = tiledb.Config({"sm.consolidation.mode": "fragment_meta"})
ctx = tiledb.Ctx(config)
tiledb.consolidate(table_name, ctx=ctx)

Could you help me understand this error, and what I can do to prevent it?

About the read time: when I read the latest version of the database, which has 2 dimensions, 28 attributes, and 18,750 samples, it took ~40 minutes (the reason is that each sample is written individually for communication between the parallel runs, so there are many fragments). Is there a way to accelerate reading further after fragment metadata consolidation?

And lastly, you mentioned that you could help with fine-tuning the final consolidation after all of the parallel runs finish. It took 51 minutes; could I ask how I can improve that time?


I’ve also been having a lot of issues with balancing consolidation and read/write performance.

In my case (many fragments from live data), consolidating only the metadata negatively affected read/write performance. So your read time went up from a few seconds for a fully consolidated array to ~40 min when consolidating only the metadata?

It appears to be a challenging issue, but I’m sure the TileDB team will find a solution! :)

The error is probably because consolidation is not thread-/process-safe; you need to invoke it only once, from a single machine. Of course, we can investigate further if this is reproducible even from a single process.

I will include a more comprehensive description of consolidation below in a separate message shortly.

OK, so here are the basic concepts of writing, consolidation (of fragments and metadata) and vacuuming.

Writing: In TileDB, all files/objects written are immutable. This is because (1) we want TileDB to work very efficiently on cloud object stores, where objects are immutable, (2) we want to allow parallel writes without any locking and synchronization, and (3) we want to support data versioning and time traveling. Therefore, TileDB introduces the concept of “fragments”, which are essentially standalone timestamped subarrays organizing all new files in a timestamped subfolder inside the array folder.

Problems with immutable fragments: The best way to achieve excellent IO performance is to perform parallel reads of large blocks stored contiguously in the backend. If you perform too many small random reads, performance degrades because the fixed latency costs per request start to add up. Therefore, if you have too many fragments (i.e., if you split your array data into too many separate files) and each query touches a lot of those files, you will experience poor performance.

Solution #1: Perform big writes: One naive solution is to perform larger writes, thereby reducing the number of fragments. Although doable in certain applications, we cannot force the user to do that in every application. It is worth noting here, though.

Solution #2: Fragment consolidation: This essentially means merging smaller fragments into big ones. That should work great, but you need to tune it. To understand this, let me describe how consolidation works. It reads the entire array (in a nice batched way given some user-configured memory budget) and writes the entire array back into a single fragment. If you have 1000s of unconsolidated fragments that you attempt to consolidate in a single go, your performance will suffer from the read step, as it will perform an excessive number of small random IO operations as discussed above.

To address this, we provide a lot of configuration parameters. One of those is sm.consolidation.step_max_frags. This caps the number of fragments to consolidate at a time, therefore reducing the number of small random IOs. For example, experiment by setting this to “2”, and also sm.consolidation.steps to 1. Consolidation with those parameters will consolidate exactly 2 fragments, without removing the older 2. If you want to remove the consolidated fragments, run vacuuming after consolidation. That’s exactly what vacuuming does: it deletes the consolidated fragments. We have made this a separate step, since you may still want to time travel to older fragments, even after consolidation. If you don’t care about that, always vacuum.

With the above in mind, you can keep on running consolidate+vacuum multiple times to reduce the number of fragments to a point where reads become much faster. What we are missing today is a way to consolidate “specific fragment partitions”, which we are adding soon. That will enable you to run consolidation from multiple machines in parallel, with each machine focusing on a disjoint fragment partition, effectively making parallel consolidation process-safe. Please stay tuned, this is coming soon.

Unfortunately – or, thankfully, depending on how you see it :) – there is more tuning that can be done. For example, you can tell TileDB to merge fragments only if they are of similar size, to perform the merge in multiple steps, to set the memory budget, etc. Again, we can help more if it comes to that.

Solution #3: Fragment metadata consolidation: Suppose you consolidate everything into a single enormous fragment. When TileDB performs a read from that fragment, it fetches some indexing information into main memory, such as the R-tree for effective tile pruning, the offsets of the tiles in the files, etc. If the fragment is enormous, this metadata may consume significant memory, so the first time you read you may experience some extra IO overhead, and after that you may notice that TileDB’s memory consumption is high. Therefore, the best situation is when the fragment sizes are “reasonable”, e.g., something like 500MB-10GB (admittedly, these numbers are rather empirical). In the general case, your array will consist of more than one fragment, even if you periodically consolidate.

Now, in the presence of multiple such larger fragments, TileDB will still need to fetch some very small but useful metadata from the fragments upon opening the array. This metadata is called “fragment metadata” and contains the “non-empty domain” of the fragment. The latter is super useful in determining whether the entire fragment can be ignored for a given slice that does not intersect the non-empty domain. But still, you need to fetch this information upon opening the array from every fragment, and if you have many fragments you may experience the performance penalty.

To mitigate that, TileDB allows you to consolidate the fragment metadata into a single file. When opening the array, instead of performing O(F) IO operations (where F is the number of fragments), it instead performs 1. Therefore, opening the array becomes rapid. However, this optimizes only the array opening. If your read must touch all fragments (e.g., if it is an entire array read), you will still have to perform O(F) IO operations. But if the non-empty domains of your fragments are rather disjoint (i.e., each write populated a different part of the array, not interleaved with another write), TileDB will use the loaded fragment metadata and specifically the non-empty domain of each fragment, and successfully eliminate a huge portion of those IO operations, as it will correctly ignore many of the “irrelevant” fragments. Therefore, ideally, the fragment non-empty domains should be disjoint. We will be adding functionality to consolidation that attempts to produce fragments with disjoint non-empty domains soon.

I hope the above helps. We’ll keep on optimizing this path and make it easier to use. So please stay tuned and keep the questions coming.


Exactly: when fully consolidated it was a few seconds, and with only metadata consolidation it is ~40 minutes.

In my case, metadata consolidation didn’t hurt, but it didn’t help either. I recently got results without any consolidation to compare, and reading also took around ~40 minutes. The database size is comparable between the two runs, so I think it is fair to conclude that metadata consolidation yields similar performance for the final database. I’m not sure about the intermediate read times with and without metadata consolidation throughout the run, though, since I didn’t log those.

Thanks a lot for the explanation. I see what you mean; I had misinterpreted the suggestion to mean that consolidating the fragment metadata would be safe to run from multiple machines. I believe you suggested it only because it is faster than consolidating the actual fragments, so it finishes sooner and is therefore less likely to leave data missing. Would that be a correct summary?

I would need to change my setup, then, to consolidate from a single machine, since individual jobs do not know at which stage the other jobs are; they just need to be up to date with respect to the processed data.

Given the current status, then, there isn’t much I can do to accelerate the read time for this kind of application until consolidation becomes thread-/process-safe, is that right? I will follow the updates on that topic.

Thanks a lot for this detailed explanation. I will experiment with these and follow up if necessary.

Related to solution #3: theoretically, a database with consolidated fragment metadata can have exactly the same read time as a non-consolidated one if, as you mentioned, my read touches all fragments. Is that right? That could explain what I’m seeing in https://forum.tiledb.com/t/write-confirmation-question/305/9?u=gsr . Nevertheless, it is not expected to worsen the read time, right? So would consolidating the fragment metadata always be a safe operation?

Yes, it is not expected to worsen the time. The “fragment metadata” consolidation helps when opening the array if you have numerous fragments and your data is on a backend where a request has some non-negligible fixed latency (e.g., AWS S3). TileDB is multi-threaded and performs parallel IO, but if you have thousands of fragments, you’ll probably notice the time to open the array increasing with the number of fragments. The read time after opening the array should be identical whether you consolidate the fragment metadata or not.


Here is what I would do. As data gets written, I would periodically perform “fragment consolidation” always on a single fixed machine. I would set the parameters in a way that each consolidation touches 2-10 fragments. I would also experiment a bit with the frequency to run consolidation and the other config parameters I mentioned. That way you will always end up with a relatively fixed number of fragments as you perform a combination of writes and consolidation operations.


Thanks for putting the time into this detailed write-up, Stavros! 👍

What we are missing today is a way to consolidate “specific fragment partitions”, which we are adding soon. That will enable you to run consolidation from multiple machines in parallel, with each machine focusing on a disjoint fragment partition, effectively making parallel consolidation process-safe.

This sounds like a valuable addition!