Hello,
Apologies for the very late reply.
Your use case is ideal for TileDB. We are actively working on providing an easy alternative to NetCDF that also works on the cloud and takes advantage of TileDB’s other benefits. We will soon be posting documentation and video tutorials on the subject, clarifying the similarities and differences between TileDB and NetCDF / HDF5, as well as the use of TileDB for dataframes.
In the meantime, below I provide some information that you may hopefully find useful. You can always reach out to us again here or by emailing us at hello@tiledb.com.
TileDB vs. NetCDF / HDF5 data modeling
Some important clarifications on terminology.
Dimensions: The concept is the same in TileDB and NetCDF / HDF5, with some differences in what TileDB allows:
- TileDB supports both dense and sparse arrays. For sparse arrays, cells are allowed to be empty and TileDB materializes only non-empty cells on the disk, storing their explicit coordinates (more on this term below).
- Even for the dense case, TileDB’s dimension domain is more flexible than in NetCDF and HDF5. You can define any domain for each of your dimensions (even MAX_UINT64, which effectively makes the domain “infinite”). That has no impact on performance. You can then write anywhere in the domain via TileDB’s API, and TileDB will know internally which space is empty and which is populated. All this is abstracted for you. You can even define your own fill values to be used for empty spaces in dense arrays.
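To make the "only non-empty cells are materialized" idea concrete, here is a minimal conceptual sketch in plain Python. It does not use the TileDB API; the names DOMAIN, FILL_VALUE, write_cell and read_cell are hypothetical and exist only for this illustration, and a dictionary stands in for whatever TileDB does internally.

```python
# Conceptual sketch only: plain Python, NOT the TileDB API.
# All names below are hypothetical and chosen for illustration.

MAX_UINT64 = 2**64 - 1
FILL_VALUE = -1  # assumed user-defined fill value for empty cells

# A huge 2D domain costs nothing by itself; only written cells take space.
DOMAIN = ((0, MAX_UINT64), (0, MAX_UINT64))

cells = {}  # maps explicit coordinates (row, col) -> value


def write_cell(coords, value):
    """Store a value at the given coordinates after checking the domain."""
    for c, (lo, hi) in zip(coords, DOMAIN):
        if not lo <= c <= hi:
            raise IndexError(f"coordinate {c} outside domain [{lo}, {hi}]")
    cells[coords] = value


def read_cell(coords):
    """Return the stored value, or the fill value if the cell is empty."""
    return cells.get(coords, FILL_VALUE)


write_cell((1, 3), 42)
write_cell((10**15, 7), 99)  # far away in the domain, still just one entry

print(read_cell((1, 3)))  # 42
print(read_cell((5, 5)))  # -1 (empty cell, so the fill value is returned)
print(len(cells))         # 2 (only the non-empty cells are stored)
```

The point of the sketch is that storage grows with the number of written cells, not with the size of the domain.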
Groups: This concept is the same in TileDB and HDF5 / NetCDF.
Variables: In TileDB, this is called an “attribute” and has the same functionality as a variable in HDF5 / NetCDF. We chose “attribute” because it is the standard term for a column in a database table.
Attributes: As mentioned above, TileDB uses the term “attribute” for what HDF5 / NetCDF call a variable. What HDF5 calls attributes is called “array metadata” in TileDB. You can find some documentation here:
https://docs.tiledb.com/main/basic-concepts/array-metadata
https://docs.tiledb.com/main/api-usage/array-metadata
Coordinates: Here is where things become a bit complicated. In NetCDF, the coordinates are effectively axes labels. That is, instead of using positional indices, NetCDF allows you to add a vector per dimension that maps a label to a positional index, so that you can query using these label values.
In TileDB, coordinates are the tuple that identifies an array cell; e.g., (1, 3) identifies a cell in a 2D array, with value 1 along the first dimension and 3 along the second.
Today you can manually add axes labels in TileDB and mimic what NetCDF does by creating a 1D sparse array per dimension, where the dimension values are the labels (even of string type) and the attribute values (in the TileDB terminology) are the positional indices along that dimension. We are planning on adding a more explicit API for axes labels, which is tracked here.
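The label-to-position idea can be sketched as follows. This is plain Python rather than the TileDB API: a dictionary per dimension plays the role of the 1D sparse array (label as the dimension value, positional index as the attribute value), and all names and data (temperatures, station_index, day_index, lookup, label_range) are made up for the illustration.

```python
# Conceptual sketch only: plain Python, NOT the TileDB API.
# Names and data are hypothetical.

# Dense 2D data addressed by position: rows = stations, columns = days.
temperatures = [
    [11.0, 12.5, 13.1],
    [9.4, 10.2, 8.8],
]

# One label -> positional index mapping per dimension. Each mapping plays
# the role of a 1D sparse array whose dimension values are the labels and
# whose attribute values are the positions.
station_index = {"athens": 0, "boston": 1}
day_index = {"2020-01-01": 0, "2020-01-02": 1, "2020-01-03": 2}


def lookup(station, day):
    """Translate labels to positions, then read the dense array."""
    row = station_index[station]
    col = day_index[day]
    return temperatures[row][col]


def label_range(index, first, last):
    """Positions of all labels within the inclusive range [first, last]."""
    return sorted(pos for label, pos in index.items() if first <= label <= last)


print(lookup("boston", "2020-01-02"))                      # 10.2
print(label_range(day_index, "2020-01-02", "2020-01-03"))  # [1, 2]
```

A label query is therefore two steps: resolve the labels to positional indices, then slice the main array by position.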
Updating the array schema
This is a great point, which we call schema evolution. We don’t support it currently, but it is a high priority on our roadmap. It is being tracked here.
Indexing
Recall that TileDB supports both dense and sparse arrays:
- In dense arrays, there is no need for a specialized multi-dimensional index. TileDB maintains very lightweight metadata that makes it easy to efficiently locate the results of an array slice on disk.
- In sparse arrays, the situation is different because TileDB stores only non-empty cells with their explicit coordinates (in the TileDB terminology), which can be anywhere in the multi-dimensional space. So we need a way to index those coordinates and efficiently prune non-relevant results upon slicing. This is achieved with the use of R-trees. You can learn more about this here.
- Axes labels, applicable to both dense and sparse arrays, are a form of indexing as well, in case you wish to query by non-positional indices as in NetCDF. As explained above, you can use TileDB’s 1D sparse arrays for this, which support even string dimension types.
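For intuition on how bounding boxes let a slice skip irrelevant data in the sparse case, here is a deliberately simplified sketch. It is a single flat level of minimum bounding rectangles, not a full R-tree and not TileDB’s actual implementation; the grouping of cells and every name in it (groups, bounding_box, overlaps, slice_cells) are assumptions made for the example.

```python
# Conceptual sketch only: one flat level of bounding boxes, NOT a full
# R-tree and NOT TileDB's implementation. All names are hypothetical.


def bounding_box(coords):
    """Minimum bounding rectangle of a group of 2D coordinates."""
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), max(rows)), (min(cols), max(cols))


def overlaps(box, query):
    """True if two rectangles intersect in every dimension."""
    return all(
        lo <= q_hi and q_lo <= hi
        for (lo, hi), (q_lo, q_hi) in zip(box, query)
    )


# Non-empty cells stored in groups, each as {coordinates: value}.
groups = [
    {(1, 1): "a", (2, 3): "b"},
    {(50, 60): "c", (55, 61): "d"},
    {(3, 2): "e", (4, 4): "f"},
]

# The "index": one bounding box per group.
index = [(bounding_box(list(group)), group) for group in groups]


def slice_cells(query):
    """Return the cells inside the query rectangle and the groups scanned."""
    results = {}
    scanned = 0
    for box, group in index:
        if not overlaps(box, query):
            continue  # pruned without looking at any of its cells
        scanned += 1
        for coords, value in group.items():
            if all(lo <= c <= hi for c, (lo, hi) in zip(coords, query)):
                results[coords] = value
    return results, scanned


results, scanned = slice_cells(((1, 3), (1, 3)))
print(results)  # {(1, 1): 'a', (2, 3): 'b', (3, 2): 'e'}
print(scanned)  # 2 (the group around (50, 60) was never read)
```

A real R-tree nests such boxes hierarchically, so whole subtrees can be pruned at once rather than one group at a time.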
Dataframes
The cool thing about TileDB supporting sparse arrays in addition to dense ones is that it can offer full, generic support for dataframes (i.e., tabular data), very similar to what you get from a traditional database (you can even query with SQL via TileDB’s integrations with MariaDB, PrestoDB and SparkSQL). This is because a dataframe can be naturally modeled as a sparse array. You can learn more here.
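As a rough illustration of why a dataframe maps onto a sparse array, the sketch below picks some columns as dimensions and treats the rest as attributes, so each row becomes one non-empty cell. It is plain Python, not the TileDB API, and the table, the column choice and the select helper are all invented for the example.

```python
# Conceptual sketch only: plain Python, NOT the TileDB API.
# The table and all names are hypothetical.

rows = [
    {"ticker": "AAA", "day": 1, "price": 10.0, "volume": 100},
    {"ticker": "AAA", "day": 2, "price": 10.5, "volume": 150},
    {"ticker": "BBB", "day": 1, "price": 20.0, "volume": 80},
    {"ticker": "AAA", "day": 5, "price": 11.0, "volume": 120},
]

DIMENSIONS = ("ticker", "day")    # columns you expect to slice on
ATTRIBUTES = ("price", "volume")  # remaining columns

# Each row becomes one non-empty cell: coordinates -> attribute values.
# Rows with identical coordinates would overwrite each other in this
# simple dictionary model.
sparse = {
    tuple(row[d] for d in DIMENSIONS): {a: row[a] for a in ATTRIBUTES}
    for row in rows
}


def select(ticker, day_from, day_to):
    """Slice on both dimensions, like a WHERE clause on indexed columns."""
    return {
        coords: attrs
        for coords, attrs in sparse.items()
        if coords[0] == ticker and day_from <= coords[1] <= day_to
    }


result = select("AAA", 1, 2)
print(sorted(result))            # [('AAA', 1), ('AAA', 2)]
print(result[("AAA", 2)])        # {'price': 10.5, 'volume': 150}
```

Most (ticker, day) combinations have no row at all, which is exactly the situation sparse arrays are designed for.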
To sum up, your post touches upon many important topics that we are planning to clarify with new documentation, examples, and tutorials. Please stay tuned, as we will start publishing those over the next few weeks.
I hope this helps. Thanks again for checking TileDB out!