The original data contains 11,499,018 rows and 25 columns, which takes roughly 900 MB of disk space when stored as a Parquet file using the pyarrow backend.
df.head(10)
result (the S_DQ_TRADESTATUS value 交易 means "trading"):
OBJECT_ID S_INFO_WINDCODE TRADE_DT CRNCY_CODE S_DQ_PRECLOSE S_DQ_OPEN ... S_DQ_ADJFACTOR S_DQ_AVGPRICE S_DQ_TRADESTATUS S_DQ_TRADESTATUSCODE OPDATE OPMODE
0 -100000 600605.SH 19950427 CNY 5.56 5.40 ... 18.365348 5.2604 交易 -1.0 2002-06-07 18:07:31 0
1 -100001 600605.SH 19950428 CNY 5.19 5.35 ... 18.365348 5.2965 交易 -1.0 2002-06-07 18:07:31 0
2 -100002 600605.SH 19950502 CNY 5.29 5.25 ... 18.365348 5.2647 交易 -1.0 2002-06-07 18:07:31 0
3 -100003 600605.SH 19950503 CNY 5.30 5.30 ... 18.365348 5.3132 交易 -1.0 2002-06-07 18:07:31 0
4 -100004 600605.SH 19950504 CNY 5.30 5.30 ... 18.365348 5.1982 交易 -1.0 2002-06-07 18:07:31 0
5 -100005 600605.SH 19950505 CNY 5.18 5.24 ... 18.365348 5.2177 交易 -1.0 2002-06-07 18:07:31 0
6 -100006 600605.SH 19950508 CNY 5.20 5.24 ... 18.365348 5.2270 交易 -1.0 2002-06-07 18:07:31 0
7 -100007 600605.SH 19950509 CNY 5.22 5.23 ... 18.365348 5.2586 交易 -1.0 2002-06-07 18:07:32 0
8 -100008 600605.SH 19950510 CNY 5.26 5.32 ... 18.365348 5.1942 交易 -1.0 2002-06-07 18:07:32 0
9 -100009 600605.SH 19950511 CNY 5.15 5.19 ... 18.365348 5.1341 交易 -1.0 2002-06-07 18:07:32 0
I used TileDB's Python binding (tiledb.from_pandas()) to convert the DataFrame into a TileDB array. The array schema is as follows:
domain=Domain(*[
Dim(name='TRADE_DT', domain=(numpy.datetime64('1990-12-19T00:00:00.000000000'), numpy.datetime64('2020-11-30T00:00:00.000000000')), tile=1000 nanoseconds, dtype='datetime64[ns]'),
Dim(name='S_INFO_WINDCODE', domain=(None, None), tile=None, dtype='|S0'),
]),
attrs=[
Attr(name='index', dtype='int64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
Attr(name='OBJECT_ID', dtype='<U0', var=True, filters=FilterList([ZstdFilter(level=1), ])),
Attr(name='CRNCY_CODE', dtype='<U0', var=True, filters=FilterList([ZstdFilter(level=1), ])),
Attr(name='S_DQ_PRECLOSE', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
Attr(name='S_DQ_OPEN', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
Attr(name='S_DQ_HIGH', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
Attr(name='S_DQ_LOW', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
Attr(name='S_DQ_CLOSE', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
Attr(name='S_DQ_CHANGE', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
Attr(name='S_DQ_PCTCHANGE', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
Attr(name='S_DQ_VOLUME', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
Attr(name='S_DQ_AMOUNT', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
Attr(name='S_DQ_ADJPRECLOSE', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
Attr(name='S_DQ_ADJOPEN', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
Attr(name='S_DQ_ADJHIGH', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
Attr(name='S_DQ_ADJLOW', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
Attr(name='S_DQ_ADJCLOSE', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
Attr(name='S_DQ_ADJFACTOR', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
Attr(name='S_DQ_AVGPRICE', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
Attr(name='S_DQ_TRADESTATUS', dtype='<U0', var=True, filters=FilterList([ZstdFilter(level=1), ])),
Attr(name='S_DQ_TRADESTATUSCODE', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
Attr(name='OPDATE', dtype='datetime64[ns]', var=False, filters=FilterList([ZstdFilter(level=1), ])),
Attr(name='OPMODE', dtype='<U0', var=True, filters=FilterList([ZstdFilter(level=1), ])),
],
cell_order='row-major',
tile_order='row-major',
capacity=10000,
sparse=True,
allows_duplicates=True,
coords_filters=FilterList([ZstdFilter(level=1), ])
)
I have benchmarked the read performance of Parquet, TileDB, and DuckDB (https://duckdb.org/).
The results are as follows.
Note: DuckDB is used without any indexes.
It seems that I have not made full use of TileDB's potential, especially for the use cases of reading a range along a certain dimension and reading a single attribute.
My question is: how can I make reading from TileDB faster?
Is it possible to get performance on par with reading Parquet files directly?