Xarray <--> TileDB

Love seeing the tighter integration with Xarray :+1:

I just saw that this was updated :smiley:

Very cool that you can now open TileDB arrays with Xarray!

Dense example with int dimensions and no mapping:
image

Dense example with int dimensions and mapping of float attribute to dimension (by labelling attribute “x.data”):
image

It even seems to support sparse arrays:

I also love how you can set/read coordinate/attribute metadata:
image

However, the dimension can currently only be of integer type.

I realize that this is, for now, primarily geared towards the Pangeo community and converting NetCDF files to TileDB arrays, but are there any plans to support ingesting non-integer dimension types (float, string, datetime…)?

Thank you for your input, Mtrl_Scientist, and for being a TileDB user. We appreciate how important Xarray integration is to the community and are actively developing it further. Much of our work so far has focused on the Pangeo community and on NetCDF files, and most of our current effort is going into improving support for TileDB Groups. Still, sparse arrays and floating-point coordinates are on our radar, and we are currently improving support for datetime coordinates with dense arrays. Watch our upcoming releases for upgrades in these areas, and please stay in touch! We’re always happy to hear from users and will take this into account as we plan our development.

Having something similar to the zarr implementation would be ideal:

However, this is a downsampled example. Would it be possible to also lazy-load the coordinates?

If I understand your question, you would like to store data similar to the data in the zarr example above in TileDB, and access it directly with xarray. Is that correct?

Right now you can do this by storing the data in multiple dense TileDB arrays, similar to how xarray data is stored in zarr or NetCDF. The internals of xarray are built around the assumption that you are using dense data in a NetCDF-like data model, and even if we added full sparse support, you would probably see better performance using this data model. You can try to get the best of both worlds by also adding sparse “axis labels” (extra arrays that map from data to dimension) to the group.
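
To make the “axis label” idea concrete: it amounts to a lookup from coordinate values to integer positions on the dense dimension. Here is a minimal pure-Python sketch of that lookup. It does not use any TileDB API; the dates are made up and the `label_slice` helper is a hypothetical name used only for illustration.

```python
from bisect import bisect_left, bisect_right
from datetime import date

# Made-up, sorted labels. Position i in this list is the integer index
# on the dense "date" dimension.
date_labels = [
    date(2020, 11, 23),
    date(2020, 11, 24),
    date(2020, 11, 27),
    date(2020, 11, 30),
]


def label_slice(labels, start, stop):
    """Return the integer slice covering all labels in the closed range [start, stop]."""
    return slice(bisect_left(labels, start), bisect_right(labels, stop))


# Translate a label range into an integer range; the resulting slice can then be
# applied to any array that shares the integer "date" dimension.
index = label_slice(date_labels, date(2020, 11, 24), date(2020, 11, 28))
selected = date_labels[index]
```

A sparse label array would play the role of `date_labels` here, so the dense data arrays can keep their integer dimensions.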

Example dense array storage:

  • date array:
    • dimensions:
      • date (type: np.uint64, data: 0, 1, 2, …, 120)
    • attributes:
      • date.data (type: np.datetime64[‘D’], data: 2020-11-23, …, 2021-11-19)
  • price array:
    • dimensions:
      • price (type: np.uint64, data: 0, 1, …, 66)
    • attributes:
      • price.data (type: np.float64, data: …)
  • quantity array:
    • dimensions:
      • price (type: np.uint64, data: 0, 1, 2, …)
      • date (type: np.uint64, data: 0, 1, 2, …)
    • attributes:
      • quantity (type: float, data: …)
  • spot array:
    • dimensions:
      • date (type: np.uint64, data: 0, 1, 2, …)
    • attributes:
      • spot (type: float, data: …)

You can load this into xarray using:

xr.merge([xr.open_dataset(f"{group_uri}/{array_name}") for array_name in ["date", "price", "quantity", "spot"]])

With respect to lazy-loading coordinates, that happens on the xarray side and you would need to make a feature request with the xarray developers directly.
