Dataset with mixed shapes

Given how labeled dimensions for dense arrays are already under development, would it be possible to also add further 1D vectors to create a dataset with mixed shapes?

Here’s an example xarray dataset with mixed shapes:

The amount_base & amount_quote variables depend on n_pips & date as dimensions, but all the other variables depend only on the date dimension. Storing them all in the same 2D dense array would be wasteful, since every date-only value would be repeated along n_pips. Instead, one could create a 2D dense array for the amount_base & amount_quote variables, and a 1D array for all the other variables. Finally, all data could be merged into a single data structure (such as an xarray dataset) upon read.
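To make the idea concrete, here is a minimal sketch of the "merge upon read" step on the xarray side only. It assumes the 2D and 1D data have already been read into memory; the sizes, the values, and the `price` variable name are made up for illustration, and nothing here is a TileDB API.

```python
import numpy as np
import xarray as xr

# Hypothetical coordinates: 4 dates and 3 n_pips levels (illustrative only).
dates = np.arange("2021-01-01", "2021-01-05", dtype="datetime64[D]")
n_pips = np.arange(3)

# Variables that depend on (n_pips, date): these would come from the 2D dense array.
ds_2d = xr.Dataset(
    {
        "amount_base": (("n_pips", "date"), np.ones((3, 4))),
        "amount_quote": (("n_pips", "date"), np.zeros((3, 4))),
    },
    coords={"n_pips": n_pips, "date": dates},
)

# A variable that depends only on date: this would come from the 1D dense array.
# "price" is a placeholder name, not one taken from the original dataset.
ds_1d = xr.Dataset(
    {"price": (("date",), np.linspace(1.0, 1.3, 4))},
    coords={"date": dates},
)

# Merge on the shared "date" coordinate; each variable keeps its own shape.
merged = xr.merge([ds_2d, ds_1d])
print(merged)
```

The merged dataset keeps `price` as a 1D variable, so nothing is broadcast along n_pips and no storage is duplicated.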

I suppose it could be done manually by the user via tiledb groups, but it would be nice to have a native solution.

Hi @Mtrl_Scientist, I’ll discuss this with @julia. We are about to ship the first version of dimension labels, so we’ll check whether anything we do today conflicts with a potential extension to the case you are describing. Thanks!

Hi @Mtrl_Scientist, thank you for reaching out. If I understand correctly, you are asking for the ability to read from multiple arrays that share similar dimensions with a single query? In your use cases, are the extra arrays always one-dimensional, or do you want to be able to handle a mix of different arrays together?

The dimension label feature we are currently working on is roughly analogous to xarray coordinates, but it isn’t limited to a single label per dimension. The dataset you showed should fit in a single array with multiple dimension labels.

Hi @julia,

Yes. The amount_base & amount_quote variables would be read from a dense 2D array, whereas the remaining variables would be read from a dense 1D array that has the same labels for the date dimension.

They don’t always have to be one-dimensional, no. Yes, I’d like to be able to handle a mix of different arrays. Come to think of it, it’d actually be very similar to what Kerchunk does. Maybe support for TileDB could be added?

Their examples show a more complex use-case with mixed-shape variables:

Here, the variables COH, inc, and lsmap depend on a different set of dimensions than the rest.

This is a feature I’ve thought about before, but it’s not currently on our roadmap. You may want to make a feature request for multi-array queries here: TileDB Feedback

Thanks, I’ve just submitted one!