Hi all. I’ve just began using TileDB and am at a loss of how to best store tabular data that gets updated every half-hour. Im using the Python API and the module won’t let me use ‘30T’ as a unit for the tile. I want to have this as a dense array but can only seem to get it working when I change the index step to seconds and then use a sparse array which adds a value every 60*30 cell. What is the advised method for approaching this?
I’ve added a detailed walkthrough of my current steps here - https://nbviewer.jupyter.org/github/AyrtonB/Sage/blob/master/3)%20TileDB%20-%20Test.ipynb
The issue is reproducible through:
## Imports
import pandas as pd
import numpy as np
import tiledb
## Reading in the data
df_FUELHH_raw = pd.read_csv('https://raw.githubusercontent.com/AyrtonB/Sage/master/data/elec_data.csv')
df_FUELHH = df_FUELHH_raw.drop(columns=['Unnamed: 0']).copy().astype(np.int32)
df_FUELHH.index = pd.to_datetime(df_FUELHH_raw['Unnamed: 0'])
df_FUELHH.index.name = 'datetime'
## Defining the time domain
dom = tiledb.Domain(tiledb.Dim(name='datetime',
domain=(np.datetime64('2015-02-22T00:00'), np.datetime64('2019-10-01T00:00')),
tile=525600, # Minutes in a year
dtype='datetime64[ms]'))
## Defining the schema
FUELHH_col_order = ['ccgt', 'oil', 'coal', 'nuclear', 'wind', 'ps', 'npshyd', 'ocgt', 'other', 'intfr', 'intirl', 'intned', 'intew', 'biomass', 'intnem']
attrs = [tiledb.Attr(name=fuel, dtype=np.int32) for fuel in FUELHH_col_order]
schema = tiledb.ArraySchema(domain=dom, sparse=False, attrs=attrs)
## Creating The array
tiledb.DenseArray.create('test_arr', schema)
## Assigning data to the array
attr_data_dict = df_FUELHH.T.apply(np.array, axis=1).to_dict()
with tiledb.DenseArray(array_dir, 'w') as TileDB_array:
TileDB_array[:] = attr_data_dict
Hi Ayrton,
For your original question, the cell count mismatch occurs because the underlying dense array dimensions are based on the time range multiplied by the resolution (minutes in this case). The tile
argument is distinct from the domain
(for more information see this doc section), and we don’t have a way to specify an arbitrary division like that for a dense array right now, though you could handle it manually by setting the domain to (0, (np.datetime64('2019-10-01T00:00') - np.datetime64('2015-02-22T00:00')) / 30
, which will give the desired number of cells. Note that you will need to do the domain/index conversion manually at the moment (I will add an issue to look at this, we might be able to allow a pandas.timedelta_range
as a domain specifier).
However, all that said, your data could instead be well-modeled with a sparse array, which I’ve done in a modified version of your notebook below:
Aside from changing the array type, the only other significant change is in the write step: for a sparse array, the coordinates must be specified when writing, which is done by taking the datetime
index values and using them in the assignment – this provides coordinates exactly matching the cell count of the attribute arrays, and very nicely round-trips the dataframe.
Hope this helps, and thanks for your interest in TileDB. Please let us know if you have additional questions.
Isaiah
Hi ihnorton,
Thanks for the great explanation, that worked perfectly