Benchmarking TileDB read performance

The original data contains 11,499,018 rows and 25 columns, which takes roughly 900 MB of disk space when stored as a Parquet file with the pyarrow backend.
df.head(10) result:

  OBJECT_ID S_INFO_WINDCODE  TRADE_DT CRNCY_CODE  S_DQ_PRECLOSE  S_DQ_OPEN  ...  S_DQ_ADJFACTOR  S_DQ_AVGPRICE  S_DQ_TRADESTATUS  S_DQ_TRADESTATUSCODE              OPDATE  OPMODE
0   -100000       600605.SH  19950427        CNY           5.56       5.40  ...       18.365348         5.2604                交易                  -1.0 2002-06-07 18:07:31       0
1   -100001       600605.SH  19950428        CNY           5.19       5.35  ...       18.365348         5.2965                交易                  -1.0 2002-06-07 18:07:31       0
2   -100002       600605.SH  19950502        CNY           5.29       5.25  ...       18.365348         5.2647                交易                  -1.0 2002-06-07 18:07:31       0
3   -100003       600605.SH  19950503        CNY           5.30       5.30  ...       18.365348         5.3132                交易                  -1.0 2002-06-07 18:07:31       0
4   -100004       600605.SH  19950504        CNY           5.30       5.30  ...       18.365348         5.1982                交易                  -1.0 2002-06-07 18:07:31       0
5   -100005       600605.SH  19950505        CNY           5.18       5.24  ...       18.365348         5.2177                交易                  -1.0 2002-06-07 18:07:31       0
6   -100006       600605.SH  19950508        CNY           5.20       5.24  ...       18.365348         5.2270                交易                  -1.0 2002-06-07 18:07:31       0
7   -100007       600605.SH  19950509        CNY           5.22       5.23  ...       18.365348         5.2586                交易                  -1.0 2002-06-07 18:07:32       0
8   -100008       600605.SH  19950510        CNY           5.26       5.32  ...       18.365348         5.1942                交易                  -1.0 2002-06-07 18:07:32       0
9   -100009       600605.SH  19950511        CNY           5.15       5.19  ...       18.365348         5.1341                交易                  -1.0 2002-06-07 18:07:32       0

I used the Python binding of TileDB (tiledb.from_pandas()) to convert the DataFrame into a TileDB array. The array schema is as follows:

  domain=Domain(*[
    Dim(name='TRADE_DT', domain=(numpy.datetime64('1990-12-19T00:00:00.000000000'), numpy.datetime64('2020-11-30T00:00:00.000000000')), tile=1000 nanoseconds, dtype='datetime64[ns]'),
    Dim(name='S_INFO_WINDCODE', domain=(None, None), tile=None, dtype='|S0'),
  ]),
  attrs=[
    Attr(name='index', dtype='int64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='OBJECT_ID', dtype='<U0', var=True, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='CRNCY_CODE', dtype='<U0', var=True, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='S_DQ_PRECLOSE', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='S_DQ_OPEN', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='S_DQ_HIGH', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='S_DQ_LOW', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='S_DQ_CLOSE', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='S_DQ_CHANGE', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='S_DQ_PCTCHANGE', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='S_DQ_VOLUME', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='S_DQ_AMOUNT', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='S_DQ_ADJPRECLOSE', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='S_DQ_ADJOPEN', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='S_DQ_ADJHIGH', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='S_DQ_ADJLOW', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='S_DQ_ADJCLOSE', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='S_DQ_ADJFACTOR', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='S_DQ_AVGPRICE', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='S_DQ_TRADESTATUS', dtype='<U0', var=True, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='S_DQ_TRADESTATUSCODE', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='OPDATE', dtype='datetime64[ns]', var=False, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='OPMODE', dtype='<U0', var=True, filters=FilterList([ZstdFilter(level=1), ])),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=10000,
  sparse=True,
  allows_duplicates=True,
  coords_filters=FilterList([ZstdFilter(level=1), ])
)
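
Roughly, the conversion looked like this (a sketch; index_dims is the from_pandas keyword as I understand it, and TRADE_DT is converted to datetime64 first, to match the schema above):

import pandas as pd
import tiledb

df = pd.read_parquet("eodprices.parquet")
# Convert the integer dates (e.g. 19950427) to datetime64[ns] for the dimension.
df["TRADE_DT"] = pd.to_datetime(df["TRADE_DT"], format="%Y%m%d")

# Ingest as a 2D sparse array with the two index columns as dimensions.
tiledb.from_pandas(
    "eodprices_tiledb",
    df,
    sparse=True,
    index_dims=["TRADE_DT", "S_INFO_WINDCODE"],
)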

I have benchmarked the read performance of Parquet, TileDB, and DuckDB (https://duckdb.org/). The results are as follows:

[benchmark results chart]

Note: DuckDB is used without any indices.

It seems that I have not made full use of TileDB's potential, especially for the use cases of reading a range along a certain dimension and reading a single attribute.

My question is: how can I make reading from TileDB faster?
Is it possible to get performance on par with reading Parquet files directly?


Excellent question! This gives us the opportunity to touch upon some important TileDB concepts regarding dataframes. Also, we are only days away from releasing our new tutorials on docs.tiledb.com, which cover these issues in detail.

When storing a dataframe in TileDB, you can choose to model it either as a 1D dense array or as an ND sparse array. You have modeled it as a 2D sparse array. I will cover the two cases separately below.

Please do not hesitate to ask questions about how to ingest the dataframe as a dense/sparse array and how to tweak the other parameters. @ihnorton will help out. I would be very interested in seeing the new benchmark results after you try out the suggestions below.

At the beginning of your experiments (before you run any TileDB command), add:

import tiledb

# Allocate 50 MB of initial read buffer space per attribute.
cfg = tiledb.Ctx().config()
cfg.update(
  {
    'py.init_buffer_bytes': 1024**2 * 50  # 50 MB per attribute
  }
)
tiledb.default_ctx(cfg)

This is the space allocated in read queries for each of your attributes; you can tweak it appropriately. The smaller this buffer is, the more internal “incomplete” queries will happen, with successive reallocations, which will hurt performance.

1D dense array

This means no indices/dimensions; all columns are attributes.

This is equivalent to Parquet and DuckDB without indices, so it is a more apples-to-apples comparison. You can experiment with the tile parameter (the space tile extent); I would suggest trying 10K, 100K, and 1M. This is similar to Parquet’s page size (the number of rows per page).

Another thing you can experiment with is the compressors for the attributes. Parquet uses Snappy by default (not supported in TileDB), which is typically faster than TileDB’s default Zstd.
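
For example, a dense ingestion along these lines (a sketch; the tile and attr_filters keyword names follow my reading of tiledb-py’s from_pandas and may need adjusting to your version):

import pandas as pd
import tiledb

df = pd.read_parquet("eodprices.parquet").reset_index(drop=True)

# Dense 1D ingestion: no index columns, every column becomes an attribute.
# tile is the space tile extent (rows per tile); try 10K, 100K, 1M.
# attr_filters swaps the default Zstd for LZ4 on all attributes.
tiledb.from_pandas(
    "eodprices_dense",
    df,
    sparse=False,
    tile=100_000,
    attr_filters=tiledb.FilterList([tiledb.LZ4Filter()]),
)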

Regarding out-of-core operations, you can emulate those in a for loop, reading N rows at a time (e.g., a multiple of the tile size, such as 2M) and filtering in pandas. You can probably do something similar with Parquet.
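
A sketch of such a loop over the dense array (assuming the hypothetical "eodprices_dense" array from above and the .df[] indexer of tiledb-py):

import tiledb

CHUNK = 2_000_000  # rows per read, a multiple of the tile extent

with tiledb.open("eodprices_dense") as A:
    lo, hi = A.nonempty_domain()[0]        # populated row range
    for start in range(lo, hi + 1, CHUNK):
        stop = min(start + CHUNK - 1, hi)  # .df slices are inclusive
        chunk = A.df[start:stop]           # read one chunk as a DataFrame
        # filter in pandas, e.g. keep a single symbol:
        sub = chunk[chunk["S_INFO_WINDCODE"] == "600605.SH"]
        # ... process sub ...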

Sparse array

Although you can definitely choose any number of columns as dimensions (i.e., indices), note that in your benchmarks you are slicing only on time. For that particular benchmark I would suggest using a single dimension (time) and retrying. As mentioned in another thread, adding dimensions has diminishing returns.

The most important tweak here is to set order='U' in the query (e.g., A.query(order='U', ...)) to get the results back in no particular order. Currently the default layout is row-major, which forces an internal sort (we are changing this behavior in the upcoming release). This incurs a significant cost which is avoided in DuckDB (as you presumably did not apply an ORDER BY there) and, of course, in Parquet.
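
For instance (a sketch, assuming a sparse array at the hypothetical URI "eodprices_sparse" with TRADE_DT as its dimension):

import numpy as np
import tiledb

with tiledb.open("eodprices_sparse") as A:
    # order='U' returns results unordered, skipping the internal sort.
    q = A.query(order='U', attrs=["S_DQ_CLOSE"])
    df = q.df[np.datetime64("2015-01-01"):np.datetime64("2015-12-31")]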

Another parameter to experiment with here is capacity in the array schema, which is essentially the number of rows in a sparse data tile (similar to tile in dense arrays). Note that tile in sparse arrays affects only the internal sort order. Please see the docs for more information on the data tile in sparse/dense arrays.

Finally, similar to my suggestion for dense arrays, you should experiment a bit with the compressors for the dimensions and attributes.
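
Putting the sparse suggestions together (a sketch; the index_dims, capacity, dim_filters, and attr_filters keywords follow my reading of tiledb-py’s from_pandas and may need adjusting to your version):

import pandas as pd
import tiledb

df = pd.read_parquet("eodprices.parquet")
df["TRADE_DT"] = pd.to_datetime(df["TRADE_DT"], format="%Y%m%d")

lz4 = tiledb.FilterList([tiledb.LZ4Filter()])

# Single (time) dimension, tunable data-tile capacity, and LZ4 instead of
# the default Zstd on both the dimension and the attributes.
tiledb.from_pandas(
    "eodprices_sparse",
    df,
    sparse=True,
    index_dims=["TRADE_DT"],
    capacity=100_000,
    dim_filters=lz4,
    attr_filters=lz4,
)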

Looking forward to seeing your findings.


@stavros

Following your advice, I ran more benchmarks focusing on the 1D dense array modeling. The results are as follows:

[benchmark results chart]

It seems that TileDB is still well behind Parquet, especially for writing.

If you are willing to investigate this case more closely, I can share the dataset I used for the benchmark.


@qiuwei thank you! Could you please share your dataset and script? We will try to reproduce and get back to you with performance tuning and optimization patches.

@stavros

The code snippet is here: https://gist.github.com/qiuwei/e1118d19692412a94f53abdb0536c441
The original data is here: https://www.dropbox.com/s/3olu4z9iqoxf9x0/eodprices.parquet?dl=0

Thanks in advance!


While working on a blog post about benchmarking TileDB in ways similar to those discussed here, I arrived at this thread by googling “benchmarking TileDB” and variations of it.

I also found the TileDB-Inc/tiledb-benchmarks repository on GitHub (TileDB benchmark scripts and output), but that is similarly dated.

Beyond that, I only found scattered blog posts (e.g., “Representing thousands of NetCDF Files using TileDB” by Peter Killick, Met Office Informatics Lab, on Medium).

Am I missing something obvious? I’d love to properly reference all existing TileDB benchmarks!