How are "Axes Labels" currently implemented?

I’d like to access/index columns by non-positional indices (in my case integer identifiers, but could include strings in the general case).

Sounds like I’m looking for “Axes Labels”, which are mentioned briefly in the data model.

In this comment @stavros says,

Today you can manually add axes labels in TileDB and mimic what NetCDF does by creating a 1D sparse array per dimension, where the dimension values are the labels (even of string type) and the attribute values (in the TileDB terminology) are the positional indices along that dimension.

Also in May 2020 I see:

We just announced TileDB 2.0 that adds support for string and heterogeneous dimensions to sparse arrays. Axes labels can now be implemented by “attaching” any sparse array (acting practically as any dataframe) by mapping coordinates (e.g., string labels) to positional indices. Currently, the user must do it manually.

It looks like this ticket is tracking Axes Labels, but it’s been “in progress” since Apr 2020.

So my question is, have Axes Labels been implemented, and if so how can they be used?

I’m not worried about adding/accessing the Axes Labels manually. Ultimately I just want a quick way to look up identifiers and map them to column (or row) numbers (some kind of B-tree lookup perhaps?)

NB I’m working in Python if that makes a difference!

Thanks!

I’ve since found this commit from May 2020, which gives an Axes Labels example for the C API, and an example for the C++ API. (I see nothing for the Python API.)

In the C++ example, looking at the read_data_array_with_label function, it seems there are two queries performed on two separate arrays (with entirely separate URIs). First the string label is queried from the sparse label array, which returns the coordinates of the label in the data array. Those coordinates are then used with a second query on the data array.

This all makes sense and seems a natural way to do it manually. But I’m still confused about the description of Axes Labels in the data model, where the diagram shows labels sitting alongside the arrays, and the description says:

Axes labels: These are practically other (dense or sparse) arrays attached to each dimension, which facilitate slicing multi-dimensional ranges on conditions other than array positional indices.

As far as I can see, in the example, the labels are entirely separate arrays, and are in no way “attached” to the data dimensions. I can imagine a scenario whereby the data and labels are in the same group, but the data-model certainly suggests that something is built-in.

Is this just a case of the data model describing a future (aspirational) design, rather than the currently available implementation? And if so, is there a timeline for Axes Labels being built in?

Hi @cokelid, thanks for reaching out.

I understand the confusion and I am happy to share our thoughts and timeline. I will also update the Data Model section in the docs soon to avoid further confusion.

Let’s treat the dense and sparse array cases separately.

Sparse arrays

Since TileDB 2.0, sparse arrays support dimensions of different types, and of any type (e.g., floats, strings, datetime, etc.). That makes “axes labels” for sparse arrays native. That is, you don’t need to maintain another level of indirection by mapping labels to an array’s integer indices. The array stores and searches natively on the axes labels, which can be of any type.
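To illustrate what "native" means here, the following is a minimal pure-Python sketch, not TileDB code. It assumes a hypothetical 1D sparse array whose only dimension is a string, and it slices directly on the label values with no label-to-integer mapping. The names `cells`, `coords`, and `slice_by_label` are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# Non-empty cells of a hypothetical 1D sparse array, keyed directly by a
# string coordinate (the "label" is the dimension value itself).
cells = {"sample_c": 3.5, "sample_a": 1.0, "sample_d": 4.25, "sample_b": 2.0}

# Keep the coordinates sorted so that range slicing is a binary search.
coords = sorted(cells)


def slice_by_label(lo, hi):
    """Return (coordinate, value) pairs with lo <= coordinate <= hi."""
    start = bisect_left(coords, lo)
    stop = bisect_right(coords, hi)
    return [(c, cells[c]) for c in coords[start:stop]]


result = slice_by_label("sample_b", "sample_c")
```

Here `result` holds the cells for `sample_b` and `sample_c`; there is no second array and no integer index anywhere in the lookup.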

Dense arrays

Internally, for performance purposes, TileDB supports only integer dimensions for dense arrays, and they are all homogeneous (so that we can template on a single datatype which makes the code faster). Therefore, if you want to support axes labels, currently you need to manually create, say, a sparse 1D array that maps strings to integer indices for each dimension. That will give you very fast lookups for the indices, and then you can apply the indices in a second query to the dense array.
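The two-step pattern described above can be sketched in plain Python as follows. This is only an illustration of the idea, not the TileDB API: `LabelIndex` is a hypothetical stand-in for the 1D sparse label array (a sorted list searched with `bisect`), `dense` is a stand-in for the dense data array, and the identifiers are invented.

```python
from bisect import bisect_left


class LabelIndex:
    """Stand-in for a 1D sparse label array: label -> positional index."""

    def __init__(self, labels):
        # Pair each label with its position, then sort by label so that
        # lookups can use binary search.
        pairs = sorted((label, pos) for pos, label in enumerate(labels))
        self._labels = [label for label, _ in pairs]
        self._positions = [pos for _, pos in pairs]

    def lookup(self, label):
        i = bisect_left(self._labels, label)
        if i == len(self._labels) or self._labels[i] != label:
            raise KeyError(label)
        return self._positions[i]


# Hypothetical integer identifiers for the columns of the dense array.
column_ids = [9041, 17, 3300, 256]
dense = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
]
col_index = LabelIndex(column_ids)


def read_column(label):
    j = col_index.lookup(label)        # query 1: label -> positional index
    return [row[j] for row in dense]   # query 2: positional read of the data


column_3300 = read_column(3300)
```

With real arrays, the first step would be a query on the label array and the second a query on the dense array, each with its own URI.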

We understand that this is cumbersome and we’d like much better behavior for dense arrays. Here is what we are thinking about implementing. Although internally dense arrays need to have homogeneous integer dimensions, at the array schema level (upon creation), we will allow the user to set dimensions of any type (similar to sparse arrays), effectively defining axes labels for dense arrays. We will offer various APIs for the user to provide the axes label vectors upon ingestion, and TileDB will practically create this two-layered indexing internally (i.e., it will maintain the extra mapping from labels to integers), without forcing the user to do so in separate arrays with separate URIs. Then, the user will be able to query based on either the labels or the indices. So, same implementation idea, but a way better experience for the user, as they will be interfacing natively with a single array.
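As a rough picture of that user experience, here is a small pure-Python sketch of a single object that keeps the label-to-index mapping internally and can be queried by label or by index. It is purely illustrative: the class `LabeledDense` and its methods are hypothetical and say nothing about what the eventual TileDB API will look like.

```python
class LabeledDense:
    """Hypothetical single-object view: dense values plus internal label maps."""

    def __init__(self, row_labels, col_labels, values):
        # The label -> integer mappings live inside the object, so the
        # caller never manages separate label arrays.
        self._rows = {label: i for i, label in enumerate(row_labels)}
        self._cols = {label: j for j, label in enumerate(col_labels)}
        self._values = values

    def by_index(self, i, j):
        return self._values[i][j]

    def by_label(self, row_label, col_label):
        # Resolve labels to positions, then fall through to positional access.
        return self.by_index(self._rows[row_label], self._cols[col_label])


# Rows are labeled with strings and columns with integer identifiers,
# mirroring the idea of dimensions of different types.
arr = LabeledDense(["a", "b"], [9041, 17, 3300], [[1, 2, 3], [4, 5, 6]])
cell = arr.by_label("b", 3300)  # same cell as arr.by_index(1, 2)
```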

I hope the above helps. We will be starting the implementation of this feature soon and we will try to get it done in Q1 2022.
