Basic from_pandas usage problems

Hello,

I’m trying to use TileDB for basic roundtripping of Pandas DataFrames with a DateTime index. I presume that makes it a dense 2D array?

Anyway, my code just to test TileDB is really simple!

import numpy as np
import pandas as pd
import tiledb

rows, columns = 365 * 40, 15_000
df = pd.DataFrame(
    np.random.randint(0, rows, size=(rows, columns)),
    columns=[f"COL_{i}" for i in range(columns)],
    index=pd.date_range(start="1995-01-01", periods=rows, freq="D"),
)

tiledb.from_pandas("<S3 URI>", df, full_domain=True, mode="ingest", cell_order="col-major")

Unfortunately I’m getting an error:

IndexError: cannot index datetime dimension with non-datetime interval

I’ve no idea what I’m doing wrong. The index is a DatetimeIndex made up of Timestamps!
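For what it’s worth, a quick check on a small version of the frame (sizes cut down here just for illustration) confirms the index type:

```python
import numpy as np
import pandas as pd

# small version of the real frame, just to inspect the index
df = pd.DataFrame(
    np.random.randint(0, 10, size=(10, 3)),
    columns=[f"COL_{i}" for i in range(3)],
    index=pd.date_range(start="1995-01-01", periods=10, freq="D"),
)

print(type(df.index))     # pandas DatetimeIndex
print(df.index.dtype)     # datetime64[ns]
print(type(df.index[0]))  # individual entries are Timestamps
```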

If I remove the index I can write, but writes are extremely slow - well over 5-6x comparable Parquet times. Is there anything I’m doing that’s just blatantly a bit silly?

Thanks for your help!

I can get this written by creating the attributes, dimensions and schema separately (i.e. not using the from_pandas API method).

I then notice another issue - as I ramp up the number of columns, performance degrades to the point where it’s unusable. With 15,000 columns I’m unable to pull back any data at all without waiting - well, I’ve never let it run to completion, so I’m not sure how long it would take! I’m assuming I’m doing something wrong here too. Any thoughts?

Hi @mhertz,

To work around your first issue, you can convert the index to a column with df.reset_index and then set it back as the index with index_col in from_pandas.

import tiledb
import numpy as np
import pandas as pd

rows, columns = 365 * 40, 15_000
df = pd.DataFrame(
    np.random.randint(0, rows, size=(rows, columns)),
    columns=[f"COL_{i}" for i in range(columns)],
    index=pd.date_range(start="1995-01-01", periods=rows, freq="D"),
)
df.reset_index(inplace=True)  # move the DatetimeIndex into a regular column
tiledb.from_pandas(
    "test_array",
    df,
    full_domain=True,
    mode="ingest",
    cell_order="col-major",
    index_col=[0],  # promote column 0 back to the index
)

As for your other issue, we will investigate the high-column-count performance and get back to you. We believe it is due to thread contention, but we will need to dig in to confirm.

Thanks.