Handling Non-Standard Index Timesteps

AyrtonB · May 16, 2020, 2:10am

Hi all. I’ve just begun using TileDB and am at a loss as to how best to store tabular data that gets updated every half-hour. I’m using the Python API and the module won’t let me use ‘30T’ as a unit for the tile. I want to have this as a dense array but can only seem to get it working when I change the index step to seconds and then use a sparse array, which adds a value every 60*30 cells. What is the advised method for approaching this?

AyrtonB · May 16, 2020, 1:32pm

I’ve added a detailed walkthrough of my current steps here - https://nbviewer.jupyter.org/github/AyrtonB/Sage/blob/master/3)%20TileDB%20-%20Test.ipynb

The issue is reproducible through:

## Imports
import pandas as pd
import numpy as np
import tiledb

## Reading in the data
df_FUELHH_raw = pd.read_csv('https://raw.githubusercontent.com/AyrtonB/Sage/master/data/elec_data.csv')

df_FUELHH = df_FUELHH_raw.drop(columns=['Unnamed: 0']).copy().astype(np.int32)
df_FUELHH.index = pd.to_datetime(df_FUELHH_raw['Unnamed: 0'])
df_FUELHH.index.name = 'datetime'

## Defining the time domain
dom = tiledb.Domain(tiledb.Dim(name='datetime', 
                               domain=(np.datetime64('2015-02-22T00:00'), np.datetime64('2019-10-01T00:00')), 
                               tile=525600, # Minutes in a year 
                               dtype='datetime64[ms]'))

## Defining the schema
FUELHH_col_order = ['ccgt', 'oil', 'coal', 'nuclear', 'wind', 'ps', 'npshyd', 'ocgt', 'other', 'intfr', 'intirl', 'intned', 'intew', 'biomass', 'intnem']

attrs = [tiledb.Attr(name=fuel, dtype=np.int32) for fuel in FUELHH_col_order]
schema = tiledb.ArraySchema(domain=dom, sparse=False, attrs=attrs)

## Creating The array
tiledb.DenseArray.create('test_arr', schema)

## Assigning data to the array
attr_data_dict = df_FUELHH.T.apply(np.array, axis=1).to_dict()

with tiledb.DenseArray('test_arr', 'w') as TileDB_array:
    TileDB_array[:] = attr_data_dict

ihnorton · May 16, 2020, 7:21pm

Hi Ayrton,

For your original question, the cell count mismatch occurs because the underlying dense array dimensions are based on the time range multiplied by the resolution (minutes in this case). The tile argument is distinct from the domain (for more information see this doc section), and we don’t have a way to specify an arbitrary division like that for a dense array right now. You could handle it manually by setting the domain to (0, (np.datetime64('2019-10-01T00:00') - np.datetime64('2015-02-22T00:00')) / 30), which will give the desired number of cells. Note that you will need to do the domain/index conversion manually at the moment (I will add an issue to look at this; we might be able to allow a pandas.timedelta_range as a domain specifier).
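As a rough illustration of the manual conversion mentioned above, the sketch below maps half-hourly timestamps to integer cell indices and back using only NumPy. The names START, END, STEP_MINUTES, datetime_to_index and index_to_datetime are hypothetical helpers introduced for this example and are not part of the TileDB API. It assumes every timestamp falls exactly on a half-hour boundary.

```python
import numpy as np

# Bounds taken from the thread; minute resolution is implied by the string format
START = np.datetime64('2015-02-22T00:00')
END = np.datetime64('2019-10-01T00:00')
STEP_MINUTES = 30  # half-hourly data

def datetime_to_index(timestamps):
    """Convert half-hourly timestamps to integer cell indices (0 at START)."""
    ts = np.asarray(timestamps, dtype='datetime64[m]')
    minutes = (ts - START).astype(np.int64)  # whole minutes since START
    if np.any(minutes % STEP_MINUTES != 0):
        raise ValueError('timestamp is not on a half-hour boundary')
    return minutes // STEP_MINUTES

def index_to_datetime(indices):
    """Convert integer cell indices back to datetime64[m] timestamps."""
    idx = np.asarray(indices, dtype=np.int64)
    return START + idx * np.timedelta64(STEP_MINUTES, 'm')

# Number of half-hour steps between START and END
n_steps = int(datetime_to_index(END))

# Minute-resolution cells in the same range, for comparison
n_minutes = int((END - START).astype(np.int64))
```

An integer dense dimension with domain (0, n_steps) would then hold one cell per half-hour, rather than the 30 cells per half-hour that a minute-resolution datetime domain implies. The conversion functions would need to be applied on every write and read.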

However, all that said, your data could instead be well-modeled with a sparse array, which I’ve done in a modified version of your notebook below:

Aside from changing the array type, the only other significant change is in the write step: for a sparse array, the coordinates must be specified when writing, which is done by taking the datetime index values and using them in the assignment – this provides coordinates exactly matching the cell count of the attribute arrays, and very nicely round-trips the dataframe.

Hope this helps, and thanks for your interest in TileDB. Please let us know if you have additional questions.

Isaiah

AyrtonB · May 17, 2020, 10:56am

Hi ihnorton,

Thanks for the great explanation, that worked perfectly.
