I am trying to use TileDB to model financial data for machine learning.
We have a set of features for every stock in the market on each trading day.
The features include the open price, the close price, and so on.
Because we derive new features from the available data (for example, we may calculate close price / open price and use it as a new feature), the number of features may change over time.
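To make the derived-feature case concrete, here is a minimal sketch in plain Python. The tickers, the prices, and the `close_open_ratio` name are made up for illustration; the point is only that each new idea adds one more named feature per stock per day.

```python
# One day's raw features per stock (illustrative values, not real data).
day = {
    "AAA": {"open": 10.0, "close": 11.0},
    "BBB": {"open": 20.0, "close": 19.0},
}

def add_close_open_ratio(features_by_stock):
    """Return a copy with one derived feature: close price / open price."""
    out = {}
    for stock, feats in features_by_stock.items():
        derived = dict(feats)  # copy so the raw features stay untouched
        derived["close_open_ratio"] = feats["close"] / feats["open"]
        out[stock] = derived
    return out

enriched = add_close_open_ratio(day)
print(sorted(enriched["AAA"]))  # feature names now include the derived one
```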
I am wondering how I should use TileDB to model this kind of data.
It is quite natural to have a time dimension since it is time-series data.
I would like to have the stocks as another dimension, because new stocks emerge from time to time.
Is it OK to model the features as a dimension as well? According to the documentation, it is not possible to change the TileDB array schema to add new attributes at the moment, so I am not sure whether I could model the features as attributes.
Our main use cases are:
1. Append new data to the array every day after the market closes. We derive all the features for all the stocks for that day.
2. From time to time, we may come up with new ideas about what features need to be calculated. So we calculate those features for the historical data and then update the TileDB array with the new features.
3. Of course, reading slices of the array quickly is a must.
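The three access patterns above can be sketched with a toy in-memory store keyed by (date, stock, feature), which is roughly what I mean by treating the features as a third dimension. This is plain Python, not TileDB code; `CellStore` and its methods are hypothetical names, used only to show that backfilling a new feature writes new coordinates instead of changing a schema.

```python
from datetime import date

class CellStore:
    """Toy sparse store: one value per (date, stock, feature) coordinate."""

    def __init__(self):
        self.cells = {}

    def write(self, day, stock, feature, value):
        self.cells[(day, stock, feature)] = value

    def read_slice(self, start, end, stocks, features):
        """Return cells with start <= date <= end for the given stocks and features."""
        return {
            key: value
            for key, value in self.cells.items()
            if start <= key[0] <= end and key[1] in stocks and key[2] in features
        }

store = CellStore()

# 1. Daily append: all features for all stocks for one day (illustrative values).
for d, close in [(date(2020, 1, 2), 11.0), (date(2020, 1, 3), 12.0)]:
    store.write(d, "AAA", "open", 10.0)
    store.write(d, "AAA", "close", close)

# 2. Backfill a new derived feature over the history; no schema change involved.
for (d, stock, feature), value in list(store.cells.items()):
    if feature == "close":
        ratio = value / store.cells[(d, stock, "open")]
        store.write(d, stock, "close_open_ratio", ratio)

# 3. Slice by date range, stock set, and feature set.
result = store.read_slice(
    date(2020, 1, 3), date(2020, 1, 3), {"AAA"}, {"close_open_ratio"}
)
```

A linear scan like this is obviously not fast; what I am after is the same flexibility with efficient slicing.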
Our current solution:
We store each feature in its own folder. Data for all the stocks at a given timestamp is stored in a single parquet file. We are basically homebrewing our poor man’s version of chunking.
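For reference, the layout looks roughly like the sketch below. The root directory name and the file naming pattern are simplified assumptions, not our exact paths, and the date loop uses calendar days for brevity where a real version would skip non-trading days.

```python
from datetime import date, timedelta
from pathlib import PurePosixPath

ROOT = PurePosixPath("features")  # hypothetical root directory

def parquet_path(feature, day):
    """One folder per feature, one parquet file per timestamp (all stocks inside)."""
    return ROOT / feature / f"{day.isoformat()}.parquet"

def files_for_slice(features, start, end):
    """List the files a read of [start, end] has to open for the given features."""
    n_days = (end - start).days + 1
    days = [start + timedelta(days=i) for i in range(n_days)]
    return [parquet_path(f, d) for f in features for d in days]

paths = files_for_slice(["open", "close"], date(2020, 1, 2), date(2020, 1, 3))
for p in paths:
    print(p)
```

Adding a feature is just a new folder, but every read, even for a single stock, opens one file per feature per day.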