Hi all. I’ve just began using TileDB and am at a loss of how to best store tabular data that gets updated every half-hour. Im using the Python API and the module won’t let me use ‘30T’ as a unit for the tile. I want to have this as a dense array but can only seem to get it working when I change the index step to seconds and then use a sparse array which adds a value every 60*30 cell. What is the advised method for approaching this?
I’ve added a detailed walkthrough of my current steps here - https://nbviewer.jupyter.org/github/AyrtonB/Sage/blob/master/3)%20TileDB%20-%20Test.ipynb
The issue is reproducible through:
## Imports import pandas as pd import numpy as np import tiledb ## Reading in the data df_FUELHH_raw = pd.read_csv('https://raw.githubusercontent.com/AyrtonB/Sage/master/data/elec_data.csv') df_FUELHH = df_FUELHH_raw.drop(columns=['Unnamed: 0']).copy().astype(np.int32) df_FUELHH.index = pd.to_datetime(df_FUELHH_raw['Unnamed: 0']) df_FUELHH.index.name = 'datetime' ## Defining the time domain dom = tiledb.Domain(tiledb.Dim(name='datetime', domain=(np.datetime64('2015-02-22T00:00'), np.datetime64('2019-10-01T00:00')), tile=525600, # Minutes in a year dtype='datetime64[ms]')) ## Defining the schema FUELHH_col_order = ['ccgt', 'oil', 'coal', 'nuclear', 'wind', 'ps', 'npshyd', 'ocgt', 'other', 'intfr', 'intirl', 'intned', 'intew', 'biomass', 'intnem'] attrs = [tiledb.Attr(name=fuel, dtype=np.int32) for fuel in FUELHH_col_order] schema = tiledb.ArraySchema(domain=dom, sparse=False, attrs=attrs) ## Creating The array tiledb.DenseArray.create('test_arr', schema) ## Assigning data to the array attr_data_dict = df_FUELHH.T.apply(np.array, axis=1).to_dict() with tiledb.DenseArray(array_dir, 'w') as TileDB_array: TileDB_array[:] = attr_data_dict
For your original question, the cell count mismatch occurs because the underlying dense array dimensions are based on the time range multiplied by the resolution (minutes in this case). The
tile argument is distinct from the
domain (for more information see this doc section), and we don’t have a way to specify an arbitrary division like that for a dense array right now, though you could handle it manually by setting the domain to
(0, (np.datetime64('2019-10-01T00:00') - np.datetime64('2015-02-22T00:00')) / 30, which will give the desired number of cells. Note that you will need to do the domain/index conversion manually at the moment (I will add an issue to look at this, we might be able to allow a
pandas.timedelta_range as a domain specifier).
However, all that said, your data could instead be well-modeled with a sparse array, which I’ve done in a modified version of your notebook below:
Aside from changing the array type, the only other significant change is in the write step: for a sparse array, the coordinates must be specified when writing, which is done by taking the
datetime index values and using them in the assignment – this provides coordinates exactly matching the cell count of the attribute arrays, and very nicely round-trips the dataframe.
Hope this helps, and thanks for your interest in TileDB. Please let us know if you have additional questions.
Thanks for the great explanation, that worked perfectly