Love seeing the tighter integration with Xarray
I just saw that this was updated.
Very cool that you can now open TileDB arrays with Xarray!
Dense example with int dimensions and no mapping:
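Something along these lines, as a minimal sketch (the array name, sizes, and the `engine="tiledb"` registration are assumptions on my part):

```python
import numpy as np
import tiledb
import xarray as xr

uri = "dense_int_dims"  # hypothetical local array URI

# Dense 1-D TileDB array: integer dimension "x", float attribute "temperature".
dom = tiledb.Domain(tiledb.Dim(name="x", domain=(0, 99), tile=100, dtype=np.uint64))
schema = tiledb.ArraySchema(
    domain=dom,
    attrs=[tiledb.Attr(name="temperature", dtype=np.float64)],
    sparse=False,
)
tiledb.Array.create(uri, schema)

with tiledb.open(uri, mode="w") as array:
    array[:] = {"temperature": np.random.default_rng(0).normal(size=100)}

# Open it through xarray; assumes the TileDB backend is installed and
# registered under engine="tiledb".
ds = xr.open_dataset(uri, engine="tiledb")
print(ds)
```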
Dense example with int dimensions and mapping of float attribute to dimension (by labelling attribute “x.data”):
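Roughly like this, as a sketch (names are mine; the essential piece is the extra attribute called "x.data", which gets mapped onto the integer dimension "x"):

```python
import numpy as np
import tiledb
import xarray as xr

uri = "dense_mapped_coord"  # hypothetical local array URI

# Same dense layout, plus a float attribute named "x.data" that the backend
# maps to coordinate values for the integer dimension "x".
dom = tiledb.Domain(tiledb.Dim(name="x", domain=(0, 99), tile=100, dtype=np.uint64))
schema = tiledb.ArraySchema(
    domain=dom,
    attrs=[
        tiledb.Attr(name="x.data", dtype=np.float64),  # coordinate values
        tiledb.Attr(name="signal", dtype=np.float64),  # data variable
    ],
    sparse=False,
)
tiledb.Array.create(uri, schema)

with tiledb.open(uri, mode="w") as array:
    array[:] = {
        "x.data": np.linspace(0.0, 1.0, 100),
        "signal": np.random.default_rng(1).normal(size=100),
    }

ds = xr.open_dataset(uri, engine="tiledb")  # "x" now carries float coordinate values
print(ds)
```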
It even seems to support sparse arrays:
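Something like this, as a sketch (integer dimension again; exactly how the backend materialises the sparse cells may differ):

```python
import numpy as np
import tiledb
import xarray as xr

uri = "sparse_example"  # hypothetical local array URI

# Sparse 1-D TileDB array: only a few cells along the integer dimension hold data.
dom = tiledb.Domain(tiledb.Dim(name="x", domain=(0, 999), tile=100, dtype=np.uint64))
schema = tiledb.ArraySchema(
    domain=dom,
    attrs=[tiledb.Attr(name="value", dtype=np.float64)],
    sparse=True,
)
tiledb.Array.create(uri, schema)

coords = np.array([3, 17, 256, 900], dtype=np.uint64)
with tiledb.open(uri, mode="w") as array:
    array[coords] = {"value": np.array([1.0, 2.5, -0.7, 4.2])}

ds = xr.open_dataset(uri, engine="tiledb")
print(ds)
```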
I also love how you can set/read coordinate/attribute metadata:
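For example (sketch; I'm assuming array-level metadata set through .meta is what shows up as dataset attributes):

```python
import tiledb
import xarray as xr

uri = "dense_int_dims"  # the hypothetical array from the first sketch above

# TileDB key-value metadata on the array...
with tiledb.open(uri, mode="w") as array:
    array.meta["units"] = "degC"
    array.meta["description"] = "toy temperature series"

# ...which the xarray backend can expose as dataset attributes.
ds = xr.open_dataset(uri, engine="tiledb")
print(ds.attrs)
```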
However, dimensions can currently only be of integer type.
I realize that this is for now primarily geared towards the Pangeo community and converting NetCDF files to TileDB arrays, but are there any plans to allow ingesting non-integer dimension types (float, string, datetime, …)?
Thank you for your input, Mtrl_Scientist, and for being a TileDB user. We appreciate the importance of Xarray integration for the community and are currently working on further development in this area. A lot of our work has been focused on the Pangeo community and working with NetCDF files, and most of our current focus is on improving support for TileDB Groups. Still, Sparse Arrays and floating-point coordinates are on our radar, and we are also working on improving support for datetime coordinates with Dense Arrays. Look towards our upcoming releases for upgrades in these areas, and please keep in touch! We’re always happy to hear from users and will take this into account as we plan our development.
Having something similar to the zarr implementation would be ideal:
However, this is a downsampled example. Would it be possible to also lazy-load the coordinates?
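For comparison, a minimal sketch of the zarr-backed workflow I have in mind (store path, shapes, and variable names are made up):

```python
import numpy as np
import xarray as xr

store = "example.zarr"  # hypothetical local zarr store

# Write a small dataset to zarr...
ds = xr.Dataset(
    {"quantity": (("date", "price"), np.random.default_rng(2).normal(size=(120, 66)))},
    coords={
        "date": np.arange("2020-11-23", "2021-03-23", dtype="datetime64[D]"),
        "price": np.linspace(10.0, 75.0, 66),
    },
)
ds.to_zarr(store, mode="w")

# ...and open it again: data variables come back lazily (dask-backed when dask
# is installed), while the dimension coordinates are loaded into memory to
# build the indexes, which is what my lazy-loading question is about.
lazy = xr.open_zarr(store)
print(lazy)
```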
If I understand your question, you would like to store data similar to the data in the zarr example above in TileDB, and access it directly with xarray. Is that correct?
Right now you can do this by storing the data in multiple dense TileDB arrays, similar to how xarray data is stored in zarr or NetCDF. The internals of xarray are built around the assumption that you are using dense data in a NetCDF-like data model, so even if we added full sparse support, you would probably see better performance using this data model. You can try to get the best of both worlds by adding sparse “axis labels” (extra arrays that map from data to dimension) to the group as well.
Example dense array storage:
- date array:
  - dimensions:
    - date (type: np.uint64, data: 0, 1, 2, …, 120)
  - attributes:
    - date.data (type: np.datetime64[D], data: 2020-11-23, …, 2021-11-19)
- price array:
  - dimensions:
    - price (type: np.uint64, data: 0, 1, …, 66)
  - attributes:
    - price.data (type: np.float64, data: …)
- quantity array:
  - dimensions:
    - price (type: np.uint64, data: 0, 1, 2, …)
    - date (type: np.uint64, data: 0, 1, 2, …)
  - attributes:
    - quantity (type: float, data: …)
- spot array:
  - dimensions:
    - date (type: np.uint64, data: 0, 1, 2, …)
  - attributes:
    - spot (type: float, data: …)
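As a rough sketch, creating two of those arrays with TileDB-Py might look like the following (the group URI, tile extents, and fill data are placeholders, not a prescribed layout):

```python
import numpy as np
import tiledb

group_uri = "example_group"  # hypothetical group location
tiledb.group_create(group_uri)

# "date" array: integer dimension plus a datetime attribute holding the real coordinate values.
date_schema = tiledb.ArraySchema(
    domain=tiledb.Domain(tiledb.Dim(name="date", domain=(0, 120), tile=121, dtype=np.uint64)),
    attrs=[tiledb.Attr(name="date.data", dtype=np.dtype("datetime64[D]"))],
    sparse=False,
)
tiledb.Array.create(f"{group_uri}/date", date_schema)
with tiledb.open(f"{group_uri}/date", mode="w") as array:
    array[:] = {"date.data": np.arange("2020-11-23", "2021-03-24", dtype="datetime64[D]")}

# "spot" array: one float value per date index.
spot_schema = tiledb.ArraySchema(
    domain=tiledb.Domain(tiledb.Dim(name="date", domain=(0, 120), tile=121, dtype=np.uint64)),
    attrs=[tiledb.Attr(name="spot", dtype=np.float64)],
    sparse=False,
)
tiledb.Array.create(f"{group_uri}/spot", spot_schema)
with tiledb.open(f"{group_uri}/spot", mode="w") as array:
    array[:] = {"spot": np.random.default_rng(0).normal(size=121)}
```

The price and quantity arrays follow the same pattern, with quantity defined over both the price and date dimensions.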
You can load this into xarray using:
xr.merge([xr.open_dataset(f"{group_uri}/{array_name}") for array_name in ["date", "price", "quantity", "spot"]])
With respect to lazy-loading coordinates, that happens on the xarray side and you would need to make a feature request with the xarray developers directly.