I’ve just started trying to use the R tiledb package. I see lots of R examples, which is great, except that they all seem to use very small amounts of data.
Are there any examples that take a larger amount of data, and show how to set up an appropriate tiledb data model on disk?
I’ve used the R arrow package before. Its docs are reasonably clear on how to create an arrow Dataset on disk, and how/why you should organize its directory structure. What’s the equivalent for tiledb?
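For concreteness, here's the kind of thing I mean on the Arrow side (a minimal sketch; the path and partition column are made up):

```r
library(arrow)

# Arrow's docs make the on-disk layout explicit: write_dataset()
# creates one Hive-style subdirectory per partition value, e.g.
# my_dataset/date=2024-01-01/part-0.parquet
df <- data.frame(
  date  = rep(as.Date("2024-01-01") + 0:1, each = 2),
  id    = c("A", "B", "A", "B"),
  value = rnorm(4)
)
write_dataset(df, "my_dataset", partitioning = "date")
ds <- open_dataset("my_dataset")  # lazily scans the whole directory tree
```

I'm looking for the same kind of guidance for tiledb: what goes in one Array, and how the on-disk structure should be organized.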
For newer projects, I have R arrays with up to 4 dimensions, which Arrow cannot directly handle. Since tiledb is supposed to handle multi-dimensional arrays natively, I want to try it! For my initial application, I have a risk model that is estimated daily and that I want to store on disk for fast access later. So each day I have a covariance matrix, factor exposures, etc.
Let’s think about just the covariance matrix. In R, I represent it in two different ways (small sketch after this list):
- As a two-dimensional matrix, one for each date.
- Or, as a 3-dimensional array, where the 3rd dimension is the date.
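Concretely, with made-up factor names and placeholder values:

```r
# Representation 1: one matrix per date, kept in a named list
factors <- c("MKT", "SIZE", "VALUE")
cov_by_day <- list(
  "2024-01-02" = matrix(rnorm(9), 3, 3, dimnames = list(factors, factors)),
  "2024-01-03" = matrix(rnorm(9), 3, 3, dimnames = list(factors, factors))
)

# Representation 2: one 3-d array, with date on the third dimension
cov_cube <- array(
  rnorm(18), dim = c(3, 3, 2),
  dimnames = list(factors, factors, c("2024-01-02", "2024-01-03"))
)
```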
On disk, am I supposed to create one tiledb Array per date, or one single big Array for all dates? The second approach (one big Array) sounds more attractive, much more like an RDBMS table, but I'm not sure how to set it up, or even whether it's the recommended approach.
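To make the question concrete, here's my naive attempt at the one-big-Array approach, adapted from the package's dense-array quickstart. The URI, domain bounds, and tile extents are all guesses, I've used an integer day index because I don't know whether a date-typed dimension is the better tool, and I'm not even sure the subarray assignment at the end is the supported write path:

```r
library(tiledb)

uri <- "cov_all_days"  # made-up path

# 3-d dense array: factor_i x factor_j x day, one attribute holding
# the covariance value. I set the tile extent on "day" to 1 so each
# day's matrix lands in its own tile -- is that the right way to
# think about the layout?
dom <- tiledb_domain(dims = c(
  tiledb_dim("factor_i", c(1L, 50L),    50L, "INT32"),
  tiledb_dim("factor_j", c(1L, 50L),    50L, "INT32"),
  tiledb_dim("day",      c(1L, 10000L),  1L, "INT32")
))
schema <- tiledb_array_schema(dom, attrs = c(tiledb_attr("cov", type = "FLOAT64")))
tiledb_array_create(uri, schema)

# Write day 1's covariance matrix into its slice
# (cov_day1: a 50 x 50 numeric matrix, assumed to exist)
A <- tiledb_array(uri)
A[1:50, 1:50, 1] <- array(cov_day1, dim = c(50, 50, 1))
```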
Note that the risk factors present (row and column names on the R matrix) can be different on different dates. Does that change how I should store things in tiledb?
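For example (hypothetical factor names):

```r
# Day 1: three factors
f1 <- c("MKT", "SIZE", "VALUE")
cov1 <- matrix(0, 3, 3, dimnames = list(f1, f1))

# Day 2: "SIZE" is dropped and "MOM" appears, so row 2 now means
# something different even though the matrix is the same size
f2 <- c("MKT", "VALUE", "MOM")
cov2 <- matrix(0, 3, 3, dimnames = list(f2, f2))
```

That makes me doubt the fixed integer dimensions in my sketch above, since position k on the factor axes wouldn't refer to the same factor on every day.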
What if I have two different flavors of risk model? Would it ever make sense to have only one big tiledb Array for ALL the daily covariance matrices, across two or three risk models?
Note that, as mentioned above, I have other R arrays with up to 4 dimensions. So I really do care about the 3-plus-dimension case, and would rather not force everything into 2-d tables, as I'd have to do with Arrow.
Thanks!