Is TileDB a good fit for my use-case? Help wanted

Hi all!

I have only just begun playing around with TileDB and was hoping that someone more experienced may be able to tell me whether it is a good fit for my problem.

I develop software for radio interferometry. We have long used a custom format called a measurement set (although the storage backend is actually something called a casacore table). This is a columnar database which supports storing large arrays as columns. It also supports a subset of SQL. Unfortunately, it lacks the thread-safety/parallelism needed as we move to distributed/cloud computing.

Our specific use-case seems to be somewhat rare. We need a relational database (i.e. supporting SQL-like queries) which can store chunked array data. Our large data products will typically consist of (number_of_rows, number_of_channels, number_of_correlations) arrays. Typical values for row are 1M+, channel between 1k and 32k and correlation <=4. In addition to these large arrays, we also store a variety of metadata columns e.g. the time at which each measurement occurred, the specific coordinates at which the measurement was taken etc. It is often these columns of metadata (which may also be multi-dimensional) which need to be used to query the large data products i.e. select a range of times, reorder on a different condition etc. We also store a number of subtables containing ancillary information such as telescope state, which would need to be preserved if we moved to a newer format. To top it all off, our largest data products are complex valued and this is rarely supported (natively).

I have done a little playing around with TileDB and it could definitely store the individual arrays. The point at which I got stuck (likely due to ignorance, I have almost no experience with databases), is composing multiple arrays into a database structure. As an example, consider two TileDB arrays called TIME and DATA. TIME is a 1-D array of time values with dimension t and DATA is a 2-D array of data values with dimension (t,f). I need a way to compose a database such that their t axes are aligned and that it is possible to do things like “SELECT DATA WHERE x < TIME < y”. I could not figure out if this was possible with TileDB.
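To make the composition I am after concrete, here is a toy numpy sketch (shapes and values invented) of the selection I would like to express:

```python
import numpy as np

# Toy stand-ins for the TIME and DATA arrays described above
t = 5          # number of rows
f = 3          # number of channels
TIME = np.linspace(0.0, 4.0, t)            # 1-D, dimension (t,)
DATA = np.arange(t * f).reshape(t, f)      # 2-D, dimensions (t, f)

# "SELECT DATA WHERE x < TIME < y" amounts to a boolean mask on the
# shared t axis, applied to the first dimension of DATA
x, y = 0.5, 3.5
mask = (TIME > x) & (TIME < y)
selected = DATA[mask]                      # rows where x < TIME < y
```

Of course this only works because both arrays are in memory with their t axes already aligned; the question is how to get the same semantics out of TileDB.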

Thanks in advance if you had the patience to read the above!

Hi JSKenyon!

My understanding, based on the TileDB website, is that TileDB only supports SQL on tabular-style data.

In addition to fast serverless SQL for tabular data, TileDB offers more natural and efficient multi-dimensional access via Python, R, Spark, Dask and geospatial libraries, ideal for diverse non-tabular data.

Your use case sounds more like a multidimensional array indexing scenario. But I have not actually tried the TileDB SQL stuff, so I could be wrong about that.

Functionally, what you want to do is quite straightforward in Xarray. In Xarray you can write code like

ds = xr.open_dataset(uri)
ds.sel(time=slice('2010-01-01', '2020-01-01'))

and Xarray uses pandas DatetimeIndex objects to figure out which data it needs to grab. This would work with any storage backend supported by Xarray, including TileDB, NetCDF, Zarr, etc.
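For context, the index lookup that Xarray performs under the hood can be sketched with a plain pandas DatetimeIndex (toy values):

```python
import pandas as pd

# Xarray's .sel(time=slice(...)) is backed by a pandas DatetimeIndex like this one
idx = pd.date_range("2010-01-01", periods=5, freq="YS")  # yearly samples

# Label-based range selection reduces to an index lookup on the coordinate,
# which then tells the backend which slabs of data to actually read
picked = idx[(idx >= "2010-01-01") & (idx <= "2013-01-01")]
```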

Is there a particular reason it needs to be SQL?

Hello @JSKenyon, welcome to the forum! First I’d like to say that your use case sounds very interesting and a great fit for TileDB. It sounds like you’ve already been making good progress in modeling your different data components in TileDB. We’d love to set up a call to discuss things more in depth if you are up for it.

To top it all off, our largest data products are complex valued and this is rarely supported (natively).

Complex value support has been on our roadmap for a while; your use case sounds interesting and is another vote for adding support for complex numbers.

As an example, consider two TileDB arrays called TIME and DATA. TIME is a 1-D array of time values with dimension t and DATA is a 2-D array of data values with dimension (t,f). I need a way to compose a database such that their t axes are aligned and that it is possible to do things like “SELECT DATA WHERE x < TIME < y”. I could not figure out if this was possible with TileDB.

It sounds like your use case for these two arrays centers on axis labels and performing a composite query over the two arrays. Axis labels are supported in TileDB today by storing the labels in a sparse array along the labeled dimension. We are actively working on improvements that will bring richer support for axis labels to TileDB, to be released soon.
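Conceptually, the sparse label array gives you a mapping from label values to integer coordinates along t. A minimal numpy illustration of that two-step lookup (label values and shapes invented):

```python
import numpy as np

# Invented label values along the t dimension (e.g. timestamps)
labels = np.array([100.0, 110.0, 120.0, 130.0])

# A sparse label array is, conceptually, this mapping: label value -> t coordinate.
# Selecting x < TIME < y first resolves labels to coordinates...
x, y = 105.0, 125.0
t_coords = np.flatnonzero((labels > x) & (labels < y))

# ...and those coordinates then index the data array on its t axis
DATA = np.arange(4 * 2).reshape(4, 2)
subset = DATA[t_coords]
```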

In Python this works along these lines:

import tiledb

# First open the arrays
with tiledb.open("TIME") as TIME:
    with tiledb.open("DATA") as DATA:
        # Query the TIME array using incomplete queries to fetch one batch of t values at a time
        for time_df in TIME.query(return_incomplete=True).multi_index[x:y]:
            # Use the time_df's "t" values to query the DATA array
            data_df = DATA.df[time_df["t"]]
            # Work with the data_df pandas DataFrame

If you’d like to use SQL, you can perform the query you are looking for in several ways. The two easiest might be a subquery:

SELECT * FROM `DATA` WHERE t IN (SELECT t FROM `TIME` WHERE x < t AND t < y);

Or with a join condition:

SELECT `DATA`.* FROM `DATA` INNER JOIN `TIME` ON `TIME`.t = `DATA`.t WHERE x < `TIME`.t AND `TIME`.t < y;

With the SQL statements, we push down the WHERE/IN clause or the join clause so that these are performed in an optimal fashion with MariaDB.

Please let me know if you have any questions on the SQL or python examples.

@rabernat while SQL is often associated with tabular data, our integrations with MariaDB, Presto/Trino and Spark give TileDB broad capabilities for performing many types of queries on all different types of data. Specifically, both multi-dimensional sparse and dense arrays can be queried. We don’t restrict in any fashion the type of arrays or data that can be queried. We also push down things like the WHERE clause from the SQL engine directly to TileDB to perform filtering in a highly parallel and efficient fashion. The capabilities are quite broad for how it can be used.

1 Like

Thank you for the reply @rabernat!

We already use xarray extensively - I think it is brilliant. I am also using zarr as an intermediate storage backend. The issue is that xarray datasets, even when backed by dask arrays, are ill-suited to certain operations (in my experience). While most simple selection works as you describe, reordering the data may be impossible due to memory limits/an all-to-all mapping in the dask layer. We can use SQL queries to work around this limitation, i.e. return the rows we need to produce a given ordering/meet a given criterion, and then construct dask arrays and xarray datasets on top of them.

I, personally, don’t think there is a reason that it needs to be SQL - it is merely that the current (legacy) storage format supports it and that it is familiar to the community. It is also, admittedly, useful for data inspection when you have >TB-scale data from which you need to select only a handful of rows. Another important point is that selection in xarray when using dask arrays may still pass a large amount of data through memory, as the selection is (to the best of my knowledge) expressed as a layer in the resulting dask graph, i.e. all the data on which it depends ends up in memory at some point. This is not (necessarily) true in the SQL case.

1 Like

Thanks @seth!

I will need to play around with those examples and get a feel for how it works. I would be happy to chat at some point, although it may need to wait until I have a slightly better handle on the examples you provided.

Thanks again for the assist!

1 Like

Hi @JSKenyon! Just following up here on complex type support. We do already have some support for complex types, as we store SAR data as TileDB arrays. This is stored as a two-cell value in TileDB and represented as numpy complex types (you can persist these types to TileDB arrays). I would need to know a little more about your data type to know if you can use this directly, but it works well and surfaces through all of the scientific Python libraries.
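As a sketch of what the two-cell representation looks like in pure numpy (visibility values invented for illustration):

```python
import numpy as np

# Invented visibility values; a complex64 is two float32 cells per value
vis = np.array([1 + 2j, 3 - 4j], dtype=np.complex64)

# Reinterpret each complex value as a (real, imag) pair of float32 cells
as_pairs = vis.view(np.float32).reshape(-1, 2)

# The view round-trips losslessly back to the complex representation
restored = as_pairs.reshape(-1).view(np.complex64)
```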