How to model financial data using TileDB

I am trying to use TileDB to model financial data for machine learning.

Every day we have a set of features for the stocks across the market.
The features include the open price, close price, etc.
Because we also derive new features from the available data (for example, we may use close price / open price as a new feature), the number of features may change over time.
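For illustration, a derived feature like close price / open price can be computed from the raw fields (plain Python; the field names here are just illustrative):

```python
# One day's raw features per stock (field names are illustrative).
rows = [
    {"symbol": "AAPL", "open": 120.0, "close": 123.0},
    {"symbol": "MSFT", "open": 210.0, "close": 207.9},
]

# Derive a new feature from the existing ones; the feature set can keep growing.
for row in rows:
    row["close_open_ratio"] = row["close"] / row["open"]

print(rows[0]["close_open_ratio"])  # → 1.025
```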

I am wondering how I should use TileDB to model this kind of data.
It is quite natural to have a time dimension since it is time-series data.
I would like to have the stocks as another dimension because there are new stocks emerging from time to time.
Is it ok to model the features also as a dimension? According to the documents, it is not possible to change the tiledb array schema to add new attributes at the moment, so I am not sure whether I could model the features as attributes.

Our use case is mainly:
1. Append new data to the array every day after the market closes. We derive all the features for all the stocks for that day.
2. From time to time, we come up with ideas for new features to calculate. We compute those features over the historical data, then update the TileDB array with them.
3. Of course, reading slices of the array fast is a must.

Our current solution:
We store each feature in a single folder. Data for all the stocks at a certain timestamp is stored in a single Parquet file. We are basically homebrewing a poor man’s version of chunking.


I am also looking forward to TileDB supporting changes to the array schema to add/remove attributes, as well as Dask DataFrame support for out-of-core computations.

You can check the requested/planned features here.

IMO, TileDB has the potential to become one of the best storage solutions out there, and I’m excited to see them implement all these new features 🙂


Hi @qiuwei, thanks for reaching out.

It is quite natural to have a time dimension since it is time-series data.

Yes, I suggest this being the first dimension of your sparse array.

I would like to have the stocks as another dimension because there are new stocks emerging from time to time.

Yes, that’s probably your second dimension. TileDB (since 2.0) supports string dimensions. Version 2.2 (days away) will also add Hilbert-order support, which removes certain complexities around space tiling and may speed up your 2D reads. Let’s revisit this once the new release is out.

Is it ok to model the features also as a dimension?

It really depends on whether you anticipate frequent slicing on the features. The quick answer is that it is ok, but adding numerous dimensions will have diminishing returns in the pruning effectiveness of your reads.

According to the documents, it is not possible to change the tiledb array schema to add new attributes at the moment

That is true, and @Mtrl_Scientist is right: this falls under the upcoming “schema evolution” feature, scheduled for version 2.3, so pretty soon.

Our current solution: We store each feature in a single folder.

I was going to suggest something similar. Store any subset of features in separate arrays, repeating your dimensions, i.e., replicate the coordinates of the original array so that you don’t have to constrain each array to a single timestamp. That will give you rapid multi-dimensional slicing (contrary to Parquet), at the cost of storage redundancy (which may not be that bad; your space is still asymptotically linear). You can also still take advantage of the fast updates, versioning, etc. of TileDB (also contrary to Parquet). Perhaps this is a viable workaround until version 2.3.

Please feel free to reach out again with more questions.


Hi @stavros, thanks for the detailed answer. It helps a lot!

@stavros I tried modeling the features as array attributes. However, since TileDB does not support array schema evolution at the moment, I turned to a workaround: pre-allocating lots of attributes for future use (I assume empty attributes do not take extra disk space?).
In our use case, the number of attributes grows very slowly, so pre-allocating double the current number of attributes will keep us going for quite a long time.

However, according to the documentation here (https://docs.tiledb.com/main/solutions/tiledb-embedded/api-usage/writing-arrays/writing-in-dense-subarrays), partial writes to a subset of the attributes are not supported at the moment.

Is this constraint going to be relaxed in the future? Since all attributes are written to separate files, I guess partial writes are technically possible?


Yes, this will be relaxed very soon. Hopefully by early January.


@stavros Great to hear that!