Hello,
I am looking for a storage solution for spatial data from the future SWOT satellite. The ground segment produces the data in NetCDF format, and the data must then be stored in a database. Until now I was thinking of using Zarr as a container, but then I came across your library, and I have several questions.
Problem
The products are a set of arrays with two dimensions: num_lines and num_pixels. num_pixels has a fixed size, while num_lines grows over the life of the satellite with the number of lines measured. Most of the time, our processing makes a time selection to extract a subset of the measurements.
We also want to make geographical selections on variables containing the positions of the satellite.
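To make the access pattern concrete, here is a minimal numpy sketch of the two selections we perform (the variable names follow the product file below; the shapes and values are toy stand-ins):

```python
import numpy as np

# Toy stand-ins for the product variables (num_lines=6, num_pixels=3).
time = np.array([0.0, 10.0, 20.0, 30.0, 40.0, 50.0])      # seconds since epoch
latitude = np.linspace(-10.0, 10.0, 18).reshape(6, 3)      # degrees_north
ssh_karin = np.arange(18, dtype=np.float64).reshape(6, 3)  # metres

# 1) Time selection: keep the lines measured between 10 s and 40 s.
line_mask = (time >= 10.0) & (time <= 40.0)
subset = ssh_karin[line_mask, :]          # shape (4, 3)

# 2) Geographical selection: keep pixels whose latitude is within [-5, 5].
geo_mask = (latitude >= -5.0) & (latitude <= 5.0)
values_in_box = ssh_karin[geo_mask]       # 1-D array of the selected pixels
```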
Below is an example of a product file.
SWOT_L2_LR_SSH_Expert_001_001_20111113T000000_20111113T005126_DG10_01 {
dimensions:
    num_lines = 9866 ;
    num_pixels = 71 ;
    num_sides = 2 ;
variables:
    double time(num_lines) ;
        time:long_name = "time in UTC" ;
        time:standard_name = "time" ;
        time:units = "seconds since 2000-01-01 00:00:00.0" ;
    int latitude(num_lines, num_pixels) ;
        latitude:long_name = "latitude (positive N, negative S)" ;
        latitude:standard_name = "latitude" ;
        latitude:units = "degrees_north" ;
        latitude:scale_factor = 1.e-06 ;
    int longitude(num_lines, num_pixels) ;
        longitude:long_name = "longitude (degrees East)" ;
        longitude:standard_name = "longitude" ;
        longitude:units = "degrees_east" ;
        longitude:scale_factor = 1.e-06 ;
    int ssh_karin(num_lines, num_pixels) ;
        ssh_karin:_FillValue = 2147483647 ;
        ssh_karin:long_name = "sea surface height" ;
        ssh_karin:standard_name = "sea surface height above reference ellipsoid" ;
        ssh_karin:units = "m" ;
        ssh_karin:scale_factor = 0.0001 ;
        ssh_karin:valid_min = -15000000 ;
        ssh_karin:valid_max = 150000000 ;
        ssh_karin:coordinates = "longitude latitude" ;
}
HDF data model
The formalization of SWOT data follows the HDF data model. This model defines Dimensions, Groups, Variables, and Attributes.
The notion of Group is identical in TileDB and HDF.
For dimensions, things are different, and I confess that I did not fully understand. A dimension in TileDB has a domain, which can be made of integers to represent indexes, but also of dates or reals. In that case, do these dimensions describe the coordinates of the array? Are the values of these coordinates fixed at write time?
Array objects represent HDF Variables.
Are metadata objects the equivalent of HDF Attributes? What is the difference between Metadata and Attributes? I cannot find the meta attribute or the Metadata class in the Python documentation. Can this attribute be used, or is it reserved for internal use?
Index
Spatial indexing appears to be achieved through the dimensions, in the case of a SparseArray. Is it possible to index other elements of our dataset? In our case, the satellite coordinates are not accessible through the dimensions but through additional variables (latitude and longitude).
Updating a schema
If I want to update a schema, for example to add an array, do I need to recreate it and copy the existing data into the new schema?
Implementation idea with TileDB
The idea is to build a DataFrame describing the pairing between a calendar and a partition of the satellite's observations. We also want to add other indexes linking these partitions to additional information: spatial indexes, half-orbit number, etc. Is this the right approach?
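Concretely, this is the kind of index table I have in mind (a pandas sketch; the partition names, dates, and half-orbit numbers are invented):

```python
import pandas as pd

# One row per partition of the measurements: its time span plus extra keys
# (half-orbit number, and later spatial bounds) used for coarse selection.
index = pd.DataFrame({
    "partition": ["part_000", "part_001", "part_002"],
    "start": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
    "end": pd.to_datetime(["2023-01-02", "2023-01-03", "2023-01-04"]),
    "half_orbit": [1, 2, 3],
})

# Time selection: find every partition overlapping the query window,
# then read only those partitions from the store.
q_start = pd.Timestamp("2023-01-01T12:00")
q_end = pd.Timestamp("2023-01-02T12:00")
hits = index[(index["start"] < q_end) & (index["end"] > q_start)]
```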
Thanks for your help.