Hi,
We are evaluating TileDB for use in one of our projects.
During this evaluation we have had difficulty improving the TileDB read/write performance on our testbed, so we would like to share our findings and ask for your suggestions.
Workstation
Processor: Intel(R) Xeon(R) W-2295 CPU @ 3.00GHz, 3000 Mhz, 18 Core(s), 36 Logical Processor(s)
OS Name: Microsoft Windows Server 2019 Standard (Version 10.0.17763 Build 17763)
Installed Physical Memory (RAM): 128 GB
SSD: 1 TB
TileDB Organization
Type: Dense Array (ROW Major, ROW Major)
Array Size: 500 Rows, 500000 Columns containing 1 float attribute (for each cell)
Tile: 1 Row, 500000 Columns
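For reference, the data sizes implied by this layout can be worked out as below. This is only a minimal sketch, assuming 4-byte floats and no compression filters; the constant and function names are ours for illustration and are not part of the TileDB API.

```cpp
#include <cstdint>

// Values taken from the array description above.
constexpr std::uint64_t kRows = 500;
constexpr std::uint64_t kCols = 500000;
constexpr std::uint64_t kArrays = 23;
// Assumption: the float attribute is stored as an uncompressed 32-bit value.
constexpr std::uint64_t kBytesPerCell = 4;

// Bytes held by one tile of tile_rows x tile_cols cells.
constexpr std::uint64_t tile_bytes(std::uint64_t tile_rows,
                                   std::uint64_t tile_cols,
                                   std::uint64_t cell_bytes) {
    return tile_rows * tile_cols * cell_bytes;
}

// Bytes held by one fully populated array.
constexpr std::uint64_t array_bytes(std::uint64_t rows,
                                    std::uint64_t cols,
                                    std::uint64_t cell_bytes) {
    return rows * cols * cell_bytes;
}

// One tile (1 row x 500000 columns): 2,000,000 bytes.
constexpr std::uint64_t kTileBytes = tile_bytes(1, kCols, kBytesPerCell);
// One full array (500 rows): 1,000,000,000 bytes.
constexpr std::uint64_t kArrayBytes = array_bytes(kRows, kCols, kBytesPerCell);
// All 23 arrays: 23,000,000,000 bytes.
constexpr std::uint64_t kTotalBytes = kArrays * kArrayBytes;
```

So each fragment we write is about 2 MB of raw cell data, each full array about 1 GB, and the whole test set about 23 GB before any compression.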
Test scenario
We completely filled 23 arrays using the C++ API (version 2.2.9) and measured the write/read performance.
After each fragment write we consolidated and vacuumed the array to improve read performance.
Writing Performance (at the 500th fragment of each array):
WRITING SINGLE FRAGMENT (milliseconds): 49
CONSOLIDATING (milliseconds): 13473
VACUUMING (milliseconds): 237
Each fragment contains [1 Row, 500000 Columns], matching the tile size, and is written with the GLOBAL_ORDER layout.
Reading performance
We summed the elapsed time required to read the first [1 Row, 500000 Columns] of each of the 23 arrays (GLOBAL_ORDER layout). The total elapsed time is ~14 sec.
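To put this figure in perspective, the effective read throughput can be estimated as below. This is a rough sketch, assuming uncompressed 4-byte floats; the helper name is hypothetical and the number of rows per read is a parameter so that other read sizes can be plugged in.

```cpp
#include <cstdint>

// Effective throughput in MB/s (1 MB = 1e6 bytes) when reading
// rows_per_array rows of `cols` cells from each of `arrays` arrays
// in `seconds` seconds. Assumption: cell_bytes is the uncompressed
// size of one cell (4 for a 32-bit float).
constexpr double read_throughput_mb_per_s(std::uint64_t arrays,
                                          std::uint64_t rows_per_array,
                                          std::uint64_t cols,
                                          std::uint64_t cell_bytes,
                                          double seconds) {
    return static_cast<double>(arrays * rows_per_array * cols * cell_bytes)
           / 1e6 / seconds;
}

// The read described above: 1 row from each of 23 arrays in ~14 s.
constexpr double kMeasuredMbPerS =
    read_throughput_mb_per_s(23, 1, 500000, 4, 14.0);
```

With these inputs the read moves 46 MB in about 14 seconds, i.e. roughly 3.3 MB/s.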
Questions
- During writing we noticed that the consolidation time increases linearly as the array grows.
We think that this is completely normal, because the larger the array, the worse the performance will be.
However, we noticed that after the first 10 fragments the consolidation takes 349 msec, while at the last fragment it took 13 sec. Is this large gap [349 msec - 13 sec] expected? How can we reduce it?
- The total time to read the first [100 Row, 500000 Columns] of each array was ~14 sec (so reading the whole content of the 23 arrays would take ~25 min). Is this a good value in this scenario? How can we reduce it?
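Regarding the first question, the arithmetic behind our consolidation numbers can be sketched as follows. This is a simplified model and not a description of TileDB internals: it assumes that every consolidation rewrites all rows written so far, and that its cost is proportional to that number of rows. Both function names are hypothetical.

```cpp
#include <cstdint>

// Total rows rewritten over the life of one array when a consolidation
// runs after every `batch` single-row fragments.
// ASSUMPTION: each consolidation rewrites every row written so far.
constexpr std::uint64_t rows_rewritten(std::uint64_t total_rows,
                                       std::uint64_t batch) {
    if (batch == 0) return 0;  // no consolidation at all
    std::uint64_t sum = 0;
    for (std::uint64_t written = batch; written <= total_rows; written += batch) {
        sum += written;
    }
    if (total_rows % batch != 0) {
        sum += total_rows;  // final consolidation for the partial last batch
    }
    return sum;
}

// Consolidation time expected at `target_fragments` if the cost grew
// strictly in proportion to the fragment count, given one measurement.
constexpr double proportional_estimate_ms(double measured_ms,
                                          double measured_fragments,
                                          double target_fragments) {
    return measured_ms * target_fragments / measured_fragments;
}

// Our measurement: 349 ms at 10 fragments, scaled to 500 fragments.
constexpr double kEstimateAt500Ms = proportional_estimate_ms(349.0, 10.0, 500.0);

// Consolidating after every fragment (our test) versus less often.
constexpr std::uint64_t kRowsEveryFragment = rows_rewritten(500, 1);
constexpr std::uint64_t kRowsEvery50 = rows_rewritten(500, 50);
constexpr std::uint64_t kRowsOnceAtEnd = rows_rewritten(500, 500);
```

Scaling 349 msec at 10 fragments proportionally gives 17450 msec at 500 fragments, so the measured 13473 msec is in line with linear growth. Under the same assumption, consolidating after every fragment rewrites 125250 rows per array in total, against 2750 when consolidating every 50 fragments and 500 when consolidating once at the end.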
We have tried to apply all of your performance recommendations (e.g. the dumped TileDB statistics report “Percentage of useful cells read: 100%”), but we have probably underestimated some factors, as the read and write performance falls somewhat short of our expectations.
That being said, we would appreciate any suggestions for improving our performance.
Thanks for your support and for the time you will spend analysing this topic.
Regards,
Giordano