I recently saw @stavros pointing to a tutorial for the new dataframe interactions.
I really like the new .df accessor for returning a dataframe directly from a TileDB array,
and especially how TileDB now takes advantage of Arrow!
import numpy as np
import tiledb

uri = 'BTC_USDT'
t1, t2 = np.datetime64('2020-12-20'), np.datetime64('2021-12-25')

with tiledb.open(uri, 'r') as A:
    df = A.query(
        attrs=('price', 'volume', 'is_buyer_maker',
               'first_trade_ID', 'last_trade_ID'),
        dims=('date',), index_col='date', use_arrow=True,
    ).df[:, t1:t2]
This will read 9612287 rows × 5 columns in 6.31 s.
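For reference, stats dumps like the ones below can be collected with TileDB's built-in stats API wrapped around the query, roughly like this (a minimal sketch; the query line just abbreviates the one above):

import tiledb

tiledb.stats_enable()   # start collecting core and Python-level counters
tiledb.stats_reset()    # drop anything recorded before this query

with tiledb.open(uri, 'r') as A:
    df = A.query(dims=('date',), index_col='date', use_arrow=True).df[:, t1:t2]

tiledb.stats_dump()     # prints the ==== READ ==== and ==== Python Stats ==== sections
tiledb.stats_disable()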
Array-Schema
ArraySchema(
  domain=Domain(*[
    Dim(name='agg_ID', domain=(0, 9000000000000000000), tile=1000000, dtype='int64'),
    Dim(name='date', domain=(numpy.datetime64('1980-01-01T00:00:00.000000000'), numpy.datetime64('2100-01-01T00:00:00.000000000')), tile=86400000000000 nanoseconds, dtype='datetime64[ns]'),
  ]),
  attrs=[
    Attr(name='price', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=9), ])),
    Attr(name='volume', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=9), ])),
    Attr(name='first_trade_ID', dtype='int64', var=False, filters=FilterList([ZstdFilter(level=9), ])),
    Attr(name='last_trade_ID', dtype='int64', var=False, filters=FilterList([ZstdFilter(level=9), ])),
    Attr(name='is_buyer_maker', dtype='int8', var=False, filters=FilterList([BitShuffleFilter(), ZstdFilter(level=9), ])),
    Attr(name='is_best_price_match', dtype='int8', var=False, filters=FilterList([BitShuffleFilter(), ZstdFilter(level=9), ])),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=10000000,
  sparse=True,
  allows_duplicates=False,
  coords_filters=FilterList([ZstdFilter(level=-1), ])
)
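For anyone who wants to reproduce a similar layout, a schema like the one above can be declared in TileDB-Py roughly as follows. This is a sketch reconstructed from the repr, not verbatim creation code; the np.timedelta64(1, 'D') tile extent is an assumption matching the 86400000000000 ns above:

import numpy as np
import tiledb

zstd = tiledb.FilterList([tiledb.ZstdFilter(level=9)])
bitshuffle_zstd = tiledb.FilterList([tiledb.BitShuffleFilter(), tiledb.ZstdFilter(level=9)])

dom = tiledb.Domain(
    tiledb.Dim(name='agg_ID', domain=(0, 9000000000000000000),
               tile=1000000, dtype=np.int64),
    tiledb.Dim(name='date',
               domain=(np.datetime64('1980-01-01'), np.datetime64('2100-01-01')),
               tile=np.timedelta64(1, 'D'),  # one day, in ns resolution
               dtype='datetime64[ns]'),
)
schema = tiledb.ArraySchema(
    domain=dom,
    attrs=[
        tiledb.Attr(name='price', dtype=np.float64, filters=zstd),
        tiledb.Attr(name='volume', dtype=np.float64, filters=zstd),
        tiledb.Attr(name='first_trade_ID', dtype=np.int64, filters=zstd),
        tiledb.Attr(name='last_trade_ID', dtype=np.int64, filters=zstd),
        tiledb.Attr(name='is_buyer_maker', dtype=np.int8, filters=bitshuffle_zstd),
        tiledb.Attr(name='is_best_price_match', dtype=np.int8, filters=bitshuffle_zstd),
    ],
    sparse=True,
    capacity=10000000,
    allows_duplicates=False,
)
tiledb.SparseArray.create('BTC_USDT', schema)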
Stats-dump1
TileDB Embedded Version: (2, 1, 3)
TileDB-Py Version: 0.7.4
==== READ ====

- Number of read queries: 3
- Number of attempts until results are found: 9
- Number of attributes read: 15
  - Number of fixed-sized attributes read: 15
- Number of dimensions read: 3
  - Number of fixed-sized dimensions read: 3
- Number of logical tiles overlapping the query: 167
- Number of physical tiles read: 3006
  - Number of physical fixed-sized tiles read: 3006
- Number of cells read: 19516358
- Number of result cells: 17023669
- Percentage of useful cells read: 87.2277%
- Number of bytes read: 163402447 bytes (0.15218 GB)
- Number of read operations: 10186
- Number of bytes unfiltered: 693603523 bytes (0.645969 GB)
- Unfiltering inflation factor: 4.24476x
- Time to compute estimated result size: 0.043015 secs
  - Time to compute tile overlap: 0.0366245 secs
    - Time to compute relevant fragments: 0.00434861 secs
    - Time to load relevant fragment R-trees: 0.00656483 secs
    - Time to compute relevant fragment tile overlap: 0.0247598 secs
- Time to open array: 0.038626 secs
  - Time to load array schema: 0.000264519 secs
  - Time to load consolidated fragment metadata: 1.986e-06 secs
  - Time to load fragment metadata: 0.0113568 secs
- Total metadata read: 229077 bytes (0.000213345 GB)
  - Array schema: 413 bytes (3.84636e-07 GB)
  - Fragment metadata: 175552 bytes (0.000163496 GB)
  - R-tree: 4224 bytes (3.93391e-06 GB)
  - Fixed-sized tile offsets: 48888 bytes (4.55305e-05 GB)
- Time to load array metadata: 7.5849e-05 secs
- Time to initialize the read state: 0.000125849 secs
- Read time: 5.36799 secs
  - Time to compute next partition: 0.0339634 secs
  - Time to split current partition: 0.00350552 secs
  - Time to compute result coordinates: 4.82893 secs
    - Time to compute sparse result tiles: 0.000779751 secs
    - Time to read coordinate tiles: 0.0181086 secs
    - Time to unfilter coordinate tiles: 0.243352 secs
    - Time to compute range result coordinates: 3.89938 secs
  - Time to compute sparse result cell slabs: 0.0495635 secs
  - Time to copy result attribute values: 0.400744 secs
    - Time to read attribute tiles: 0.0356251 secs
    - Time to unfilter attribute tiles: 0.272747 secs
    - Time to copy fixed-sized attribute values: 0.0487621 secs
  - Time to copy result coordinates: 0.0172377 secs
    - Time to copy fixed-sized coordinates: 0.0142504 secs
- Total read query time (array open + init state + read): 5.36812 secs

==== Python Stats ====

- TileDB-Py Indexing Time: 6.25372
  - TileDB-Py query execution time: 6.13007
    - TileDB C++ Core initial query submit time: 0.964374
    - TileDB C++ Core incomplete resubmit(s) time: 4.40392
    - TileDB-Py incomplete buffer updates: 0.730236
  - TileDB-Py buffer conversion time: 0.119553
But when I increase the buffer size, I get an almost 100% read-performance improvement!
cfg = tiledb.Ctx().config()
cfg.update({
    'py.init_buffer_bytes': 1024**2 * 400
})
tiledb.default_ctx(cfg)
The recommended default of 1024**2 * 100 brought no improvement, but setting it to
1024**2 * 400 did the trick for me, reducing the read time to 3.27 s.
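If you'd rather not change the global default context, the same setting should also work scoped to an explicit Ctx passed to tiledb.open; a small sketch, reusing uri, t1, and t2 from above:

import tiledb

# scope the buffer setting to one context instead of the process-wide default
ctx = tiledb.Ctx(tiledb.Config({'py.init_buffer_bytes': 1024**2 * 400}))

with tiledb.open(uri, 'r', ctx=ctx) as A:
    df = A.query(dims=('date',), index_col='date', use_arrow=True).df[:, t1:t2]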
Stats-dump2
TileDB Embedded Version: (2, 1, 3)
TileDB-Py Version: 0.7.4
==== READ ====

- Number of read queries: 1
- Number of attempts until results are found: 1
- Number of attributes read: 5
  - Number of fixed-sized attributes read: 5
- Number of dimensions read: 1
  - Number of fixed-sized dimensions read: 1
- Number of logical tiles overlapping the query: 90
- Number of physical tiles read: 540
  - Number of physical fixed-sized tiles read: 540
- Number of cells read: 9634194
- Number of result cells: 9624052
- Percentage of useful cells read: 99.8947%
- Number of bytes read: 117662004 bytes (0.109581 GB)
- Number of read operations: 10099
- Number of bytes unfiltered: 472129351 bytes (0.439705 GB)
- Unfiltering inflation factor: 4.01259x
- Time to compute estimated result size: 0.00702653 secs
  - Time to compute tile overlap: 0.00654431 secs
    - Time to compute relevant fragments: 0.000313903 secs
    - Time to load relevant fragment R-trees: 0.00271106 secs
    - Time to compute relevant fragment tile overlap: 0.0034636 secs
- Time to open array: 0.0323722 secs
  - Time to load array schema: 0.000369341 secs
  - Time to load consolidated fragment metadata: 2.077e-06 secs
  - Time to load fragment metadata: 0.00840535 secs
- Total metadata read: 230241 bytes (0.000214429 GB)
  - Array schema: 413 bytes (3.84636e-07 GB)
  - Fragment metadata: 176396 bytes (0.000164282 GB)
  - R-tree: 4320 bytes (4.02331e-06 GB)
  - Fixed-sized tile offsets: 49112 bytes (4.57391e-05 GB)
- Time to load array metadata: 7.5414e-05 secs
- Time to initialize the read state: 0.000172829 secs
- Read time: 3.3564 secs
  - Time to compute next partition: 0.000490527 secs
  - Time to compute result coordinates: 2.98954 secs
    - Time to compute sparse result tiles: 0.000242224 secs
    - Time to read coordinate tiles: 0.00611595 secs
    - Time to unfilter coordinate tiles: 0.0957798 secs
    - Time to compute range result coordinates: 2.38976 secs
  - Time to compute sparse result cell slabs: 0.026101 secs
  - Time to copy result attribute values: 0.319114 secs
    - Time to read attribute tiles: 0.0188366 secs
    - Time to unfilter attribute tiles: 0.203411 secs
    - Time to copy fixed-sized attribute values: 0.0476533 secs
  - Time to copy result coordinates: 0.0125715 secs
    - Time to copy fixed-sized coordinates: 0.0122299 secs
- Total read query time (array open + init state + read): 3.35657 secs

==== Python Stats ====

- TileDB-Py Indexing Time: 3.46947
  - TileDB-Py query execution time: 3.36533
    - TileDB C++ Core initial query submit time: 3.35662
  - TileDB-Py buffer conversion time: 0.101548
Anyway, great work and looking forward to future improvements!