Confused by dimensions vs. attributes: need help designing a backend for my project

I am trying to understand when I should use attributes versus dimensions for storing 1-D arrays over time.

I have a 1-D array with n elements representing an IR spectrum. At different points in time (at second-level resolution) I will take a new scan with an IR sensor, getting transmittance values at different wavelengths, and I want to add each new scan as a row in a TileDB store. I am using a sparse array since the sampling is irregular and sparse, but I know the start and end dates of my trial period.

I got the following to work (i.e., one dimension for time and 100 attributes, one per wavelength):

import numpy as np
import tiledb

# Create the single time dimension
d1 = tiledb.Dim(name="timestamp",
                domain=(strt_dt, end_dt),
                tile=1000,
                dtype="datetime64[s]")

# Create a domain from that one dimension
dom1 = tiledb.Domain(d1)

# Create one attribute per wavelength (wv is an array of the wavelengths I sampled at)
attrs = [tiledb.Attr(name=str(i), dtype=np.float64) for i in wv]

# URI
array_name = "/path/IRspectra"

# Create the array schema, setting `sparse=True` to indicate a sparse array
schema1 = tiledb.ArraySchema(domain=dom1,
                             sparse=True,
                             attrs=attrs)

# Create the array on disk (it will initially be empty)
tiledb.Array.create(array_name, schema1)

# Write one scan, keyed by its timestamp:
data = dict(zip(wv.astype(str), spectra))
with tiledb.open(array_name, 'w') as A:
    A[strt_dt] = data

But it seems to me that defining two dimensions, one for timestamp and one for wavelength, would make more intuitive sense. The code would change as follows:

# Create the two dimensions
d1 = tiledb.Dim(name="timestamp",
                domain=(strt_dt, end_dt),
                tile=1000,
                dtype="datetime64[s]")
d2 = tiledb.Dim(name="wavelength",
                domain=(0, 99),
                dtype=int)

# Create a domain using the two dimensions
dom = tiledb.Domain(d1, d2)

# Create a single transmittance attribute
attr = tiledb.Attr(name="transmittance", dtype=np.float64)

# URI
array_name = "/path/IRspectra"

# Create the array schema, setting `sparse=True` to indicate a sparse array
schema1 = tiledb.ArraySchema(domain=dom,
                             sparse=True,
                             attrs=[attr])

# Create the array on disk (it will initially be empty)
tiledb.Array.create(array_name, schema1)

# Write some data:
with tiledb.open(array_name, 'w') as A:
    A[strt_dt, wv] = {"transmittance": spectra}  # spectra is an array of transmittance values

but I get something like ValueError: value length (100) does not match coordinate length (1)
(admittedly, I rewrote the code from memory after erasing it, so I'm not sure that's the exact error).

My questions are: what did I do wrong in the second case, and how should I structure this kind of array so that it best fits my use case?

EDIT:
The actual error is IndexError: sparse index dimension length mismatch

Hi @Dave_L,

Thanks for getting in touch; this sounds like a great use case for TileDB. For sparse writes, the coordinates you pass on the left-hand side of the assignment need to have entries for every cell.

I modified the code slightly to demonstrate, please see:

Hope this helps, please let us know if you have further questions.

Isaiah

(Script version in code block below)

# Response to: https://forum.tiledb.com/t/confused-by-dimensions-vs-attributes-need-help-designing-backend-for-project/541?u=ihnorton

#%%
import tiledb, numpy as np
import tempfile

#%%
# Create the two dimensions
strt_dt = np.datetime64('2019-01-01T00:00:00')
end_dt = np.datetime64('2019-01-01T00:00:00') + np.timedelta64(100,'D')

d1 = tiledb.Dim(name="timestamp",
                domain=(strt_dt, end_dt),
                tile=1000,
                dtype="datetime64[s]")
d2 = tiledb.Dim(name="wavelength",
                domain=(0,99),
                dtype=int)
# Create a domain using the two dimensions
dom = tiledb.Domain(d1,d2)

# Create a single transmittance attribute
attr = tiledb.Attr(name="transmittance", dtype=np.float64)

# URI
array_name = tempfile.mkdtemp()

# Create the array schema, setting `sparse=True` to indicate a sparse array
schema1 = tiledb.ArraySchema(domain=dom,
                             sparse=True,
                             attrs=[attr])

# Create the array on disk (it will initially be empty)
tiledb.Array.create(array_name, schema1)

#%%
# Create some data:
data_time = np.arange(
  strt_dt,
  strt_dt + np.timedelta64(100,'s'),
  dtype='datetime64[s]',
)

data_wv = np.arange(100)  # integer coordinates, matching the wavelength dimension's dtype

spectra = np.random.rand(100)

#%%
# Write some data:
with tiledb.open(array_name,'w') as A:

    # Write the data:
    A[data_time, data_wv] = {"transmittance": spectra} # spectra is an array of transmittance values
# %%
with tiledb.open(array_name,'r') as A:
    print(A.schema)
    print(A.df[:])
# %%

Thank you for the reply, and I appreciate you taking the time to modify the code.
I think the demo you provided above assumes that each transmittance value is recorded at its own timestamp (which in reality they are when an IR sensor takes a scan). However, in my scenario I get an entire array of transmittance values from the sensor for any given scan, and I essentially want to add a row to my TileDB array for each new scan, e.g., first scan on Jan 1st 2023 at 19:00:00, second scan on Jan 1st 2023 at 19:30:35, etc. (scans taken at irregular time intervals, at second-level resolution, for flexibility). Think of adding new rows to a matrix where rows are indexed by time and columns represent wavelength (nm) or wavenumber indices. The DB should preferably be optimized for read speed, because I will probably have only a few scans per day over the course of a trial.

In my example, I was able to get this to work with one dimension (timestamp) and an attribute per wavelength, but I am not sure this is the right way to do it, and I have now become confused about what attributes and dimensions are in the TileDB model and when I should use one over the other in my schema. Thank you again; I hope this clarifies what I am attempting to do, and if you have further guidance I'm all ears.

In general, choosing an attribute vs. a dimension largely comes down to the expected access patterns. If you commonly need to slice some subset of wavelengths (say 15-25) across all timepoints, then making wavelength a dimension will be better. TileDB will arrange the data such that most slicing queries can be satisfied by loading only a subset of the data; please see this section of the docs for more background: Key Concepts & Data Format - TileDB Embedded Docs

I realized that your first example creates a single attribute per wavelength; that's likely to be much worse for the usage pattern you described, because each wavelength attribute would be written to a separate file. So for this use case, especially with a small number of writes, using wavelength as a dimension should work well.

Thank you for the reply @ihnorton. Querying and slicing the data efficiently is key, so if representing wavelength as a dimension is more correct, I will do that. For convenience and usability, though, my mind naturally wants to arrange this data as rows indexed by date and columns indexed by wavelength. What is the overhead of transforming that data back into vectors that could be used for modeling, plotting, etc.? Additionally, what would be a good tile size to use (I'm still a bit unclear on this)? This is the current view of the data as you propose it (you can understand how this is not necessarily an intuitive way to view this data):


The wavelength would be returned as a single column (or as an array if you use a[] or a.multi_index[] instead of .df[]), so there should be no overhead.
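If you do want the familiar rows-by-time, columns-by-wavelength matrix for modeling or plotting, the long-format dataframe that .df[:] returns can be pivoted with pandas. A minimal sketch, using a small hand-built dataframe in place of a real TileDB read:

```python
import numpy as np
import pandas as pd

# Placeholder for the long-format result of A.df[:] on the
# (timestamp, wavelength) schema: one row per cell.
df = pd.DataFrame({
    "timestamp": np.repeat(np.datetime64('2019-01-01T19:00:00'), 3),
    "wavelength": [0, 1, 2],
    "transmittance": [0.1, 0.2, 0.3],
})

# Pivot to rows indexed by timestamp and columns indexed by wavelength --
# the intuitive "one row per scan" view.
matrix = df.pivot(index="timestamp", columns="wavelength", values="transmittance")
print(matrix.shape)  # (1, 3): one scan, three wavelengths
```

This is a cheap in-memory reshape, so it can be done on demand after each query rather than baked into the storage layout.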

The tiling for sparse arrays (see: capacity - docs) defaults to 10k cells. This will work well for most applications, but the parameter is configurable to allow application-specific fine-tuning when necessary.