SWOT database with TileDB

fbriol · June 16, 2020, 9:39am

Hello,

I am looking for a storage solution for the spatial data of the future SWOT satellite. The ground segment produces the data in NetCDF format, and the data must then be stored in a database. Until now, I was thinking of using Zarr as a container, but then I came across your library, and I have several questions.

Problem

The products are a set of arrays with two dimensions: num_lines and num_pixels. num_pixels has a fixed size, and num_lines is a dimension that contains the number of lines measured by the satellite. This value increases over the life of the satellite. Most of the time, we make a time selection in our processing to extract a subset of the measurements.
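
Since time increases along num_lines, a time selection amounts to finding a contiguous range of line indices. Below is a minimal, library-agnostic sketch in plain Python; the function name `select_lines` and the sample times are hypothetical and do not come from any SWOT or TileDB API.

```python
from bisect import bisect_left, bisect_right


def select_lines(time, t_start, t_end):
    """Return the half-open index range [first, last) of the lines
    whose time lies in [t_start, t_end].

    Assumes `time` is sorted in ascending order.
    """
    first = bisect_left(time, t_start)
    last = bisect_right(time, t_end)
    return first, last


# Hypothetical times (seconds since 2000-01-01), one value per line.
time = [100.0, 100.5, 101.0, 101.5, 102.0, 102.5]

first, last = select_lines(time, 100.5, 101.5)
selected = time[first:last]
```

The same index range can then be used to slice every variable that has num_lines as its first dimension.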

We also want to make geographical selections on variables containing the positions of the satellite.

Below is an example of a product file.

SWOT_L2_LR_SSH_Expert_001_001_20111113T000000_20111113T005126_DG10_01 {
dimensions:
    num_lines = 9866 ;
    num_pixels = 71 ;
    num_sides = 2 ;
variables:
    double time(num_lines) ;
        time:long_name = "time in UTC" ;
        time:standard_name = "time" ;
        time:units = "seconds since 2000-01-01 00:00:00.0" ;
    int latitude(num_lines, num_pixels) ;
        latitude:long_name = "latitude (positive N, negative S)" ;
        latitude:standard_name = "latitude" ;
        latitude:units = "degrees_north" ;
        latitude:scale_factor = 1.e-06 ;
    int longitude(num_lines, num_pixels) ;
        longitude:long_name = "longitude (degrees East)" ;
        longitude:standard_name = "longitude" ;
        longitude:units = "degrees_east" ;
        longitude:scale_factor = 1.e-06 ;
    int ssh_karin(num_lines, num_pixels) ;
        ssh_karin:_FillValue = 2147483647 ;
        ssh_karin:long_name = "sea surface height" ;
        ssh_karin:standard_name = "sea surface height above reference ellipsoid" ;
        ssh_karin:units = "m" ;
        ssh_karin:scale_factor = 0.0001 ;
        ssh_karin:valid_min = -15000000 ;
        ssh_karin:valid_max = 150000000 ;
        ssh_karin:coordinates = "longitude latitude" ;
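
As a reading aid for the listing above: the integer variables are packed, so a stored value is multiplied by scale_factor to obtain the physical value, and _FillValue marks missing data. A minimal sketch in plain Python, using the two ssh_karin attribute values from the listing; the function name `decode_ssh` and the sample values are hypothetical.

```python
FILL_VALUE = 2147483647  # ssh_karin:_FillValue from the listing
SCALE_FACTOR = 0.0001    # ssh_karin:scale_factor from the listing


def decode_ssh(raw):
    """Convert one packed integer to metres, or None if it is the fill value."""
    if raw == FILL_VALUE:
        return None
    return raw * SCALE_FACTOR


# Hypothetical packed values for a few pixels of one line.
raw_line = [12345, FILL_VALUE, -250]
decoded = [decode_ssh(v) for v in raw_line]
```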

HDF data model

The formalization of SWOT data follows the HDF data model. This model defines Dimensions, Groups, Variables, and Attributes.

The notion of Group is identical in TileDB and HDF.

For dimensions, things are different, and I confess that I did not fully understand them. A dimension in TileDB has a domain whose type can be integer (to represent indexes), but also datetime or real. In this case, do these dimensions describe the coordinates of the array? Are the values of these coordinates fixed when writing?

Array objects represent HDF Variables.

Are metadata objects the equivalent of HDF Attributes? What is the difference between Metadata and Attributes? I can’t find the meta attribute and Metadata class in the Python documentation. Can this attribute be used, or is it reserved for internal use?

Index

The use of spatial indexes appears to be through the use of dimensions in the case of SparseArray. Is it possible to index other elements of our dataset? In our case, the satellite coordinates are not accessible using the dimensions but using additional variables.

Updating a schema

If I want to update a schema, for example to add an array, do I need to recreate it and copy the existing data to the new schema?

Implementation idea with TileDB

The idea is to build a DataFrame that pairs calendar dates with partitions of the satellite’s observations. We also want to add other indexes linking these partitions to additional information: spatial indexes, half-orbit, etc. Is this the right approach?
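
To make the idea concrete, here is a rough sketch of such a calendar index in plain Python. It is only an illustration under stated assumptions: the partition names, the time spans, and the function `partitions_between` are all hypothetical.

```python
from datetime import datetime

# Hypothetical index: one entry per partition with the time span it covers.
partition_index = [
    ("half_orbit_001", datetime(2011, 11, 13, 0, 0, 0), datetime(2011, 11, 13, 0, 51, 26)),
    ("half_orbit_002", datetime(2011, 11, 13, 0, 51, 26), datetime(2011, 11, 13, 1, 42, 52)),
    ("half_orbit_003", datetime(2011, 11, 13, 1, 42, 52), datetime(2011, 11, 13, 2, 34, 18)),
]


def partitions_between(start, end):
    """Names of the partitions whose time span intersects [start, end]."""
    return [
        name
        for name, first, last in partition_index
        if first <= end and start <= last
    ]


wanted = partitions_between(
    datetime(2011, 11, 13, 1, 0, 0),
    datetime(2011, 11, 13, 2, 0, 0),
)
```

Additional columns (a bounding box, a half-orbit number) could be added to each entry and filtered in the same way.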

Thanks for your help.

stavros · June 21, 2020, 10:17pm

Hello,

Apologies for the very late reply.

Your use case is ideal for TileDB. We are working actively to provide an easy alternative to NetCDF, which also works on the cloud and takes advantage of the other various benefits of TileDB. We will soon be posting documentation and video tutorials on the subject, clarifying the similarities and differences between TileDB and NetCDF / HDF5, as well as the use of TileDB for dataframes.

In the meantime, below I provide some information that you may hopefully find useful. You can always reach out to us again here or by emailing us at hello@tiledb.com.

TileDB vs. NetCDF / HDF5 data modeling

Some important clarifications on terminology.

Dimensions: They are conceptually the same in TileDB and NetCDF / HDF5, with some differences:

TileDB supports both dense and sparse arrays. For sparse arrays, cells are allowed to be empty and TileDB materializes only non-empty cells on the disk, storing their explicit coordinates (more on this term below).
Even for the dense case, TileDB’s dimension domain is more flexible than in NetCDF and HDF5. You can define any domain for each of your dimensions (even MAX_UINT64, which effectively makes the domain “infinite”). That has no impact on performance. Then you can write anywhere in the domain via TileDB’s API and TileDB will know internally which space is empty and which is populated. All this is abstracted for you. You can even define your own fill values to be used for empty spaces in dense arrays.
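
The two points above can be pictured with a toy model in plain Python. This is only a conceptual sketch, not TileDB’s API or its on-disk format; the class `SparseGrid` and its methods are hypothetical names.

```python
class SparseGrid:
    """Toy 2D array that stores only the cells that were written,
    keyed by their explicit coordinates."""

    def __init__(self, fill_value=None):
        self.fill_value = fill_value
        self.cells = {}  # (row, col) -> value, non-empty cells only

    def write(self, row, col, value):
        self.cells[(row, col)] = value

    def read(self, rows, cols):
        """Read a rectangular slice; unwritten cells come back as fill_value."""
        return [
            [self.cells.get((r, c), self.fill_value) for c in cols]
            for r in rows
        ]


grid = SparseGrid(fill_value=-1)
# The domain is effectively unbounded: a write far from the origin
# stores one cell, not the empty space in between.
grid.write(0, 0, 10)
grid.write(10**12, 3, 20)

block = grid.read(range(0, 2), range(0, 2))
```

Only two cells are stored, while a read of any region still returns a full block padded with the chosen fill value.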

Groups: This concept is the same in TileDB and HDF5 / NetCDF.

Variables: In TileDB, this is called an “attribute” and has the same functionality as a variable in HDF5 / NetCDF. We chose “attribute” here because this is a standard term that denotes a column in a Database table.

Attributes: As mentioned above, in TileDB we call the HDF5 / NetCDF variables attributes. The HDF5 attributes are called “array metadata” in TileDB. You can see some documentation here:
https://docs.tiledb.com/main/basic-concepts/array-metadata
https://docs.tiledb.com/main/api-usage/array-metadata

Coordinates: Here is where things become a bit complicated. In NetCDF, the coordinates are effectively axes labels. That is, instead of using positional indices, NetCDF allows you to add a vector per dimension which maps a label to a positional index, allowing you to query using these label values.

In TileDB, coordinates are the tuple that identifies an array cell, e.g., (1, 3) identifies a cell in a 2D array, with value 1 along the first dimension and 3 along the second.

Today you can manually add axes labels in TileDB and mimic what NetCDF does by creating a 1D sparse array per dimension, where the dimension values are the labels (even of string type) and the attribute values (in the TileDB terminology) are the positional indices along that dimension. We are planning on adding a more explicit API for axes labels, which is tracked here.
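
The label-to-index lookup described above can be sketched in plain Python, with a dictionary standing in for the 1D sparse array. This is a conceptual illustration only, not TileDB code; the labels, the data, and the function `select_by_label` are hypothetical.

```python
# Toy stand-in for the 1D sparse array: the "dimension value" is the label,
# the "attribute value" is the positional index along the labelled dimension.
labels = ["2011-11-13", "2011-11-14", "2011-11-15"]
label_to_index = {label: i for i, label in enumerate(labels)}

# Hypothetical data, addressed by positional index along the first dimension.
data = [[1, 2], [3, 4], [5, 6]]


def select_by_label(label):
    """Translate a label to its positional index, then read that row."""
    return data[label_to_index[label]]


row = select_by_label("2011-11-14")
```

A query by label is therefore two steps: one lookup in the label array, then a normal positional slice of the data array.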

Updating the array schema

This is a great point, which we call schema evolution. We don’t support it currently, but it has high priority on our roadmap. It is being tracked here.

Indexing

Recall that TileDB supports both dense and sparse arrays:

In dense arrays, there is no need for a specialized multi-dimensional index. TileDB maintains very lightweight metadata that makes it easy to efficiently locate the results of an array slice on disk.
In sparse arrays, the situation is different because TileDB stores only non-empty cells with their explicit coordinates (in the TileDB terminology), which can be anywhere in the multi-dimensional space. So we need a way to index those coordinates and efficiently prune non-relevant results upon slicing. This is achieved with the use of R-trees. You can learn more about this here.
Axes labels, applicable to both dense and sparse arrays, are a form of indexing as well, in case you wish to query by non-positional indices as in NetCDF. As explained above, you can benefit from TileDB’s 1D sparse arrays, which support even string dimension types.
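
The pruning idea behind the sparse case can be sketched as follows. This is a simplified illustration in plain Python: it keeps one bounding box per group of cells in a flat dictionary rather than a real R-tree, and it is not TileDB’s implementation. The tile names, the coordinates, and the `Box` helper are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box with inclusive bounds."""
    lon_min: float
    lat_min: float
    lon_max: float
    lat_max: float

    def overlaps(self, other):
        return (
            self.lon_min <= other.lon_max
            and other.lon_min <= self.lon_max
            and self.lat_min <= other.lat_max
            and other.lat_min <= self.lat_max
        )


def bounding_box(points):
    """Smallest box that contains all (lon, lat) points."""
    lons = [p[0] for p in points]
    lats = [p[1] for p in points]
    return Box(min(lons), min(lats), max(lons), max(lats))


# Hypothetical groups of non-empty cells, given by their (lon, lat) coordinates.
tiles = {
    "tile_a": [(0.0, 0.0), (1.0, 2.0)],
    "tile_b": [(10.0, 10.0), (12.0, 11.0)],
    "tile_c": [(1.5, 1.5), (3.0, 4.0)],
}
index = {name: bounding_box(points) for name, points in tiles.items()}

query = Box(0.5, 0.5, 2.0, 2.0)

# Step 1: prune whole tiles whose bounding box misses the query.
candidates = [name for name, box in index.items() if box.overlaps(query)]

# Step 2: test individual coordinates only inside the surviving tiles.
hits = [
    p
    for name in candidates
    for p in tiles[name]
    if query.overlaps(Box(p[0], p[1], p[0], p[1]))
]
```

Here tile_b is discarded without looking at any of its cells, which is the benefit the index provides on slicing.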

Dataframes

The cool thing about TileDB and its support for sparse arrays in addition to dense ones is that it can offer fully generic support for dataframes (i.e., tabular data), very similar to what you get from a traditional database (where you can even query with SQL via TileDB’s integrations with MariaDB, PrestoDB and SparkSQL). This is because a dataframe can easily be thought of as a sparse array. You can learn more here.
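
As a rough illustration of that last point, the sketch below treats some columns of a table as dimensions and the remaining columns as attributes, so that each row becomes one non-empty cell. It is plain Python, not TileDB code; the column names, the values, and the function `slice_cells` are hypothetical.

```python
# Hypothetical table rows.
rows = [
    {"half_orbit": 1, "day": 4334, "partition": "p0001"},
    {"half_orbit": 2, "day": 4334, "partition": "p0002"},
    {"half_orbit": 15, "day": 4335, "partition": "p0015"},
]

# Columns chosen as dimensions; every other column is an attribute.
DIMENSIONS = ("half_orbit", "day")

# Each row becomes one non-empty cell: coordinates -> attribute values.
cells = {
    tuple(row[d] for d in DIMENSIONS): {
        k: v for k, v in row.items() if k not in DIMENSIONS
    }
    for row in rows
}


def slice_cells(half_orbits, days):
    """Range query on both dimensions (inclusive bounds),
    comparable to a WHERE clause on the dimension columns."""
    (h0, h1), (d0, d1) = half_orbits, days
    return sorted(
        c for c in cells if h0 <= c[0] <= h1 and d0 <= c[1] <= d1
    )


result = slice_cells((1, 10), (4334, 4334))
```

The choice of dimension columns matters: range queries on them are what the sparse array can answer efficiently.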

To sum up, your post is touching upon many important topics which we are planning on clarifying with a lot of new documentation, examples, and tutorials. Please stay tuned as we will start publishing those over the next few weeks.

I hope this helps. Thanks again for checking TileDB out!
