Basic from_pandas usage problems

Hello,

I’m trying to use TileDB for basic roundtripping of Pandas DataFrames with a DateTime index. I presume that makes it a dense 2D array?

Anyway, my code just to test TileDB is really simple!

import numpy as np
import pandas as pd
import tiledb

rows, columns = 365 * 40, 15_000
df = pd.DataFrame(
    np.random.randint(0, rows, size=(rows, columns)),
    columns=[f"COL_{i}" for i in range(columns)],
    index=pd.date_range(start="1995-01-01", periods=rows, freq="D"),
)

tiledb.from_pandas("<S3 URI>", df, full_domain=True, mode="ingest", cell_order="col-major")

Unfortunately I’m getting an error:

IndexError: cannot index datetime dimension with non-datetime interval

I’ve no idea what I’m doing wrong. The index is a DatetimeIndex made up of Timestamps!
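For what it’s worth, a quick check on a small version of the frame (sizes cut down here just for illustration) confirms the index type:

```python
import numpy as np
import pandas as pd

# small version of the real frame, just to inspect the index
df = pd.DataFrame(
    np.random.randint(0, 10, size=(10, 3)),
    columns=[f"COL_{i}" for i in range(3)],
    index=pd.date_range(start="1995-01-01", periods=10, freq="D"),
)

print(type(df.index))     # pandas DatetimeIndex
print(df.index.dtype)     # datetime64[ns]
print(type(df.index[0]))  # individual entries are Timestamps
```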

If I remove the index I can write, but writes are extremely slow - well over 5-6x comparable Parquet times. Is there anything I’m doing that’s just blatantly a bit silly?

Thanks for your help!

I can get this written by creating the attributes, dimensions and schema separately (i.e. not using the from_pandas API method).

I then notice another issue - as I ramp up the number of columns, performance degrades to the point where it’s unusable. With 15,000 columns I’m unable to pull back any data at all without waiting - well, I’ve never let it run to completion, so I’m not sure how long it would take! I’m assuming I’m doing something wrong here too. Any thoughts?

Hi @mhertz,

To work around your first issue, you can convert the index to a column with df.reset_index and then set it back as the index with index_col in from_pandas.

import tiledb
import numpy as np
import pandas as pd

rows, columns = 365 * 40, 15_000
df = pd.DataFrame(
    np.random.randint(0, rows, size=(rows, columns)),
    columns=[f"COL_{i}" for i in range(columns)],
    index=pd.date_range(start="1995-01-01", periods=rows, freq="D"),
)
df.reset_index(inplace=True)  # move the DatetimeIndex into a regular column
tiledb.from_pandas(
    "test_array",
    df,
    full_domain=True,
    mode="ingest",
    cell_order="col-major",
    index_col=[0],  # promote column 0 back to the index
)

As for your other issue, we will investigate the high-column-count performance and get back to you. We believe it is due to thread contention, but we will need to dig in to confirm.

Thanks.