Love seeing the tighter integration with Xarray
I just saw that this was updated.
Very cool that you can now open TileDB arrays with Xarray!
Dense example with int dimensions and no mapping:
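Something along these lines, as a minimal sketch (the array name, sizes, and the `engine="tiledb"` registration are assumptions on my part):

```python
import numpy as np
import tiledb
import xarray as xr

uri = "dense_int_dims"  # hypothetical local array URI

# Dense 1-D TileDB array: integer dimension "x", float attribute "temperature".
dom = tiledb.Domain(tiledb.Dim(name="x", domain=(0, 99), tile=100, dtype=np.uint64))
schema = tiledb.ArraySchema(
    domain=dom,
    attrs=[tiledb.Attr(name="temperature", dtype=np.float64)],
    sparse=False,
)
tiledb.Array.create(uri, schema)

with tiledb.open(uri, mode="w") as array:
    array[:] = {"temperature": np.random.default_rng(0).normal(size=100)}

# Open it through xarray; assumes the TileDB backend is installed and
# registered under engine="tiledb".
ds = xr.open_dataset(uri, engine="tiledb")
print(ds)
```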
Dense example with int dimensions and mapping of float attribute to dimension (by labelling attribute “x.data”):
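Roughly like this, as a sketch (names are mine; the essential piece is the extra attribute called "x.data", which gets mapped onto the integer dimension "x"):

```python
import numpy as np
import tiledb
import xarray as xr

uri = "dense_mapped_coord"  # hypothetical local array URI

# Same dense layout, plus a float attribute named "x.data" that the backend
# maps to coordinate values for the integer dimension "x".
dom = tiledb.Domain(tiledb.Dim(name="x", domain=(0, 99), tile=100, dtype=np.uint64))
schema = tiledb.ArraySchema(
    domain=dom,
    attrs=[
        tiledb.Attr(name="x.data", dtype=np.float64),  # coordinate values
        tiledb.Attr(name="signal", dtype=np.float64),  # data variable
    ],
    sparse=False,
)
tiledb.Array.create(uri, schema)

with tiledb.open(uri, mode="w") as array:
    array[:] = {
        "x.data": np.linspace(0.0, 1.0, 100),
        "signal": np.random.default_rng(1).normal(size=100),
    }

ds = xr.open_dataset(uri, engine="tiledb")  # "x" now carries float coordinate values
print(ds)
```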
It even seems to support sparse arrays:
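Something like this, as a sketch (integer dimension again; exactly how the backend materialises the sparse cells may differ):

```python
import numpy as np
import tiledb
import xarray as xr

uri = "sparse_example"  # hypothetical local array URI

# Sparse 1-D TileDB array: only a few cells along the integer dimension hold data.
dom = tiledb.Domain(tiledb.Dim(name="x", domain=(0, 999), tile=100, dtype=np.uint64))
schema = tiledb.ArraySchema(
    domain=dom,
    attrs=[tiledb.Attr(name="value", dtype=np.float64)],
    sparse=True,
)
tiledb.Array.create(uri, schema)

coords = np.array([3, 17, 256, 900], dtype=np.uint64)
with tiledb.open(uri, mode="w") as array:
    array[coords] = {"value": np.array([1.0, 2.5, -0.7, 4.2])}

ds = xr.open_dataset(uri, engine="tiledb")
print(ds)
```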
I also love how you can set/read coordinate/attribute metadata:
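For example (sketch; I'm assuming array-level metadata set through .meta is what shows up as dataset attributes):

```python
import tiledb
import xarray as xr

uri = "dense_int_dims"  # the hypothetical array from the first sketch above

# TileDB key-value metadata on the array...
with tiledb.open(uri, mode="w") as array:
    array.meta["units"] = "degC"
    array.meta["description"] = "toy temperature series"

# ...which the xarray backend can expose as dataset attributes.
ds = xr.open_dataset(uri, engine="tiledb")
print(ds.attrs)
```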
However, dimensions can currently only be of integer type.
I realize that this is for now primarily geared towards the Pangeo community and converting NetCDF files to TileDB arrays, but are there any plans to allow ingesting non-integer dimension types (float, string, datetime, …)?
Thank you for your input, Mtrl_Scientist, and for being a TileDB user. We appreciate the importance of Xarray integration for the community and are currently working on further development in this area. A lot of our work has been focused on the Pangeo community and working with NetCDF files, and most of our current focus is on improving support for TileDB Groups. Still, Sparse Arrays and floating-point coordinates are on our radar, and we are also working on improving support for datetime coordinates with Dense Arrays. Look towards our upcoming releases for upgrades in these areas, and please keep in touch! We’re always happy to hear from users and will take this into account as we plan our development.
Having something similar to the zarr implementation would be ideal:
However, this is a downsampled example. Would it be possible to also lazy-load the coordinates?
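For comparison, a minimal sketch of the zarr-backed workflow I have in mind (store path, shapes, and variable names are made up):

```python
import numpy as np
import xarray as xr

store = "example.zarr"  # hypothetical local zarr store

# Write a small dataset to zarr...
ds = xr.Dataset(
    {"quantity": (("date", "price"), np.random.default_rng(2).normal(size=(120, 66)))},
    coords={
        "date": np.arange("2020-11-23", "2021-03-23", dtype="datetime64[D]"),
        "price": np.linspace(10.0, 75.0, 66),
    },
)
ds.to_zarr(store, mode="w")

# ...and open it again: data variables come back lazily (dask-backed when dask
# is installed), while the dimension coordinates are loaded into memory to
# build the indexes, which is what my lazy-loading question is about.
lazy = xr.open_zarr(store)
print(lazy)
```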
If I understand your question, you would like to store data similar to the data in the zarr example above in TileDB, and access it directly with xarray. Is that correct?
Right now you can do this by storing the data in multiple dense TileDB arrays, similar to how xarray data is stored in zarr or NetCDF. The internals of xarray are built around the assumption that you are using dense data in a NetCDF-like data model, so even if we added full sparse support, you would probably see better performance using this data model. You can try to get the best of both worlds by adding sparse “axis labels” (extra arrays that map from data to dimension) to the group as well.
Example dense array storage:
- date array:
  - dimensions:
    - date (type: np.uint64, data: 0, 1, 2, …, 120)
  - attributes:
    - date.data (type: np.datetime64[D], data: 2020-11-23, …, 2021-11-19)
- price array:
  - dimensions:
    - price (type: np.uint64, data: 0, 1, …, 66)
  - attributes:
    - price.data (type: np.float64, data: …)
- quantity array:
  - dimensions:
    - price (type: np.uint64, data: 0, 1, 2, …)
    - date (type: np.uint64, data: 0, 1, 2, …)
  - attributes:
    - quantity (type: float, data: …)
- spot array:
  - dimensions:
    - date (type: np.uint64, data: 0, 1, 2, …)
  - attributes:
    - spot (type: float, data: …)
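As a rough sketch, creating two of those arrays with TileDB-Py might look like the following (the group URI, tile extents, and fill data are placeholders, not a prescribed layout):

```python
import numpy as np
import tiledb

group_uri = "example_group"  # hypothetical group location
tiledb.group_create(group_uri)

# "date" array: integer dimension plus a datetime attribute holding the real coordinate values.
date_schema = tiledb.ArraySchema(
    domain=tiledb.Domain(tiledb.Dim(name="date", domain=(0, 120), tile=121, dtype=np.uint64)),
    attrs=[tiledb.Attr(name="date.data", dtype=np.dtype("datetime64[D]"))],
    sparse=False,
)
tiledb.Array.create(f"{group_uri}/date", date_schema)
with tiledb.open(f"{group_uri}/date", mode="w") as array:
    array[:] = {"date.data": np.arange("2020-11-23", "2021-03-24", dtype="datetime64[D]")}

# "spot" array: one float value per date index.
spot_schema = tiledb.ArraySchema(
    domain=tiledb.Domain(tiledb.Dim(name="date", domain=(0, 120), tile=121, dtype=np.uint64)),
    attrs=[tiledb.Attr(name="spot", dtype=np.float64)],
    sparse=False,
)
tiledb.Array.create(f"{group_uri}/spot", spot_schema)
with tiledb.open(f"{group_uri}/spot", mode="w") as array:
    array[:] = {"spot": np.random.default_rng(0).normal(size=121)}
```

The price and quantity arrays follow the same pattern, with quantity defined over both the price and date dimensions.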
You can load this into xarray using:
xr.merge([xr.open_dataset(f"{group_uri}/{array_name}") for array_name in ["date", "price", "quantity", "spot"]])
With respect to lazy-loading coordinates, that happens on the xarray side and you would need to make a feature request with the xarray developers directly.