Appending to existing xarray with TileDB-CF-Py


I am trying to use the adaptor GitHub - TileDB-Inc/TileDB-CF-Py: TileDB interface with awareness of the CF metadata conventions to load netCDF files into TileDB Embedded. My question is: how do I append a new file to an existing array?

Is there another adaptor more relevant?

Thanks in advance

Hello @sna1988,

Do you have an existing array that you would like to append to or many NetCDF files that you want to convert together?

TileDB-CF-Py is the best library for automated conversion of NetCDF to TileDB. It has two methods for converting NetCDF files: you can use either the NetCDF4ConverterEngine or the xarray writer.

Currently the xarray converter is a little easier to use if you are handling multiple files. If you have all your data at once, you can convert a single multi-file xarray object to TileDB using the from_xarray method in the tiledb.cf module.

Hope this helps and feel free to reach back out with any follow-up questions.


Hello @julia, thanks for your reply.
My use case is the following:
Every week or month or so, I will fetch some weather netCDF files from a public API and load their content into TileDB. So I need to update my TileDB array somehow; is this possible with this library?

I tried with from_xarray; it seems to create a group each time, and no append parameter is provided.

Thanks in advance,

It is possible to do with this library. The xarray converter does always create a group, since an xarray Dataset maps to a TileDB group that may contain one or more arrays.

If you are okay with having a TileDB group instead of a single array, you can copy additional data to an existing array using the copy_data_from_xarray method. The region you want to write the data to is specified by the region parameter. You do need to make sure the dimension you are appending along is large enough to support all future data. This can be done either by creating a sufficiently large array in the initial xarray dataset or by using the encoding parameters.
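For example, here is a minimal sketch of sizing the append dimension up front and computing a region in integer index coordinates (the epoch, the ten-year horizon, and the sample dates are placeholders for illustration, not values from your data):

```python
import numpy as np

# Assumed epoch and horizon: size the time dimension to hold ten years of
# hourly data so all future appends fit inside the created domain.
epoch = np.datetime64("2023-01-01T00:00:00", "h")
horizon = np.datetime64("2033-01-01T00:00:00", "h")
time_size = int((horizon - epoch).astype("int"))  # number of hourly steps

# Convert the time range of an incoming file into integer index coordinates.
first = np.datetime64("2023-02-01T00:00:00", "h")
last = np.datetime64("2023-02-07T23:00:00", "h")
t1 = int((first - epoch).astype("int"))
t2 = int((last - epoch).astype("int"))

# This is the region you would pass to copy_data_from_xarray, e.g.
# tiledb.cf.copy_data_from_xarray(ds, group_uri, region=region)
region = {"time": slice(t1, t2 + 1)}
```

The slice end is exclusive, hence the `t2 + 1`.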

If you are willing to share more details about your NetCDF data, I can help provide a more specific example.


Thank you for your reply.

Let me give a concrete example with files.
I am trying to create an array called “wind” and to push the netCDF files located in the folder below. If you can provide an example of code for defining domains/tiles and creating an array according to the schema of the files, then appending to it, that would be very helpful.


I apologize for the delay. Your message got lost last week during the holiday shuffle. Below is a walk-through of copying data from NetCDF to a TileDB group using xarray.

I am going to assume the depth, latitude, and longitude are the same in all files, and that you want data starting from Jan. 1, 2023, appending onward.

I used the following Python libraries:

import tiledb
import tiledb.cf
import numpy as np
import xarray as xr

I downloaded three example files from Copernicus and set the desired location for my output TileDB group:

file1 = ""
file2 = ""
file3 = ""
group_uri = "smoc_group"

Then I created the initial dataset from the first file. Note that I explicitly set the encoding for the time in xarray. This is so the xarray metadata will be consistent in each dataset.

# Open dataset and update time encoding.
time_encoding_units = "hours since 2023-01-01T00:00:00"
dataset = xr.open_dataset(file1)
dataset["time"].encoding["units"] = time_encoding_units

# Create the encoding data for the TileDB arrays. 
# Consider tuning the filters and tile extents based on your constraints.
filters = tiledb.FilterList([tiledb.ZstdFilter(level=7)])
main_attr_encoding = {
    "tiles": (1, 1, 2041, 2160),
    "filters": filters,
}
encoding = {var: main_attr_encoding for var in dataset}
for coord in dataset.coords:
    encoding[coord] = {"filters": filters}

# I only explicitly set the tile size for the time coordinate, because the other coordinates can
# easily be loaded all at once, but this array might actually get large as you append over many
# years. 
encoding["time"]["tiles"] = (1_000_000,) 
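As a rough sanity check on that tile size (my arithmetic, not anything from the library): one million hourly steps covers more than a century, so a single tile holds the entire time coordinate for any realistic append horizon:

```python
# 1,000,000 hourly time steps expressed in years
# (approximate, using 365.25-day years).
hours_per_year = 24 * 365.25
years_covered = 1_000_000 / hours_per_year  # roughly 114 years
```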

# Create the group from the dataset schema and copy the metadata over.
tiledb.cf.create_group_from_xarray(dataset, group_uri, encoding=encoding)
tiledb.cf.copy_metadata_from_xarray(dataset, group_uri)

Next, I create a function for copying data.

def copy_files(input_files, copy_coords=False):

    # Hard-coded start time for time indexing.
    start = np.datetime64("2023-01-01T00:00:00", "h")
    time_encoding_units = "hours since 2023-01-01T00:00:00"

    # Open the dataset
    dataset = xr.open_mfdataset(input_files)

    # Make sure the dataset uses the same start time as all other datasets.
    dataset["time"].encoding["units"] = time_encoding_units

    # Get the start and end time for the region.
    def hour_offset(start, value):
        time = np.datetime64(value.values, "h")
        return int((time - start).astype("int"))
    t1 = hour_offset(start, dataset["time"][0])
    t2 = hour_offset(start, dataset["time"][-1])

    # Copy the data.
    tiledb.cf.copy_data_from_xarray(
        dataset,
        group_uri,
        region={"time": slice(t1, t2 + 1)},
        skip_vars=None if copy_coords else {"depth", "latitude", "longitude"},
    )
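The hour_offset helper can also be checked on its own with plain numpy, without any TileDB machinery (the sample timestamps here are made up):

```python
import numpy as np

def hour_offset(start, value):
    # Truncate the timestamp to hour precision and return the integer
    # number of hours elapsed since `start`.
    time = np.datetime64(value, "h")
    return int((time - start).astype("int"))

start = np.datetime64("2023-01-01T00:00:00", "h")
# A timestamp 36 hours after the start of the epoch.
offset = hour_offset(start, np.datetime64("2023-01-02T12:00:00"))
```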

Finally, I copy the data. First I copy data from that first file including all of the coordinate data.

copy_files([file1], True)

Next, I copy the other two files. I skip the coordinate data because the latitude, longitude, and depth are the same in every file.

copy_files([file2, file3])

The results can be accessed directly with TileDB or opened in an xarray dataset:

ds = xr.open_dataset(group_uri, engine="tiledb")

Hello @julia,

Thank you very much for your reply.
It answers my needs.