TileDBError: [TileDB::Writer] Error: Duplicate coordinates are not allowed

Hi again,

Sorry to be asking all these questions, but I think TileDB is great and I’d like to keep using it.

I’m experiencing a weird bug that only occurs with certain dataframes and I can’t see why.
I’ve uploaded the bug-producing dataframe here.

In this example, the first column is dense (every index is associated with a value), whereas the second column is sparse (it has missing values). The whole dataframe is stored as a sparse type, however.
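For illustration, a hypothetical stand-in with the same kind of sparsity pattern would look like this (the real data is in the uploaded file):

import numpy as np
import pandas as pd

# hypothetical stand-in: two timestamp columns, the first dense, the second with gaps
cols = pd.to_datetime([1578000000000000000, 1578000060000000000], unit="ns")
x = pd.DataFrame([[1.0, np.nan], [2.0, 5.0], [3.0, np.nan]],
                 index=[100.5, 101.0, 101.5], columns=cols)
x = x.astype(pd.SparseDtype("float", np.nan))
print(x.sparse.to_coo())  # the COO matrix holds only the non-NaN cells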

When I then try to save this data with the following code, I get an error telling me that there are duplicate coordinates:

import os
from pathlib import Path

import numpy as np
import pandas as pd
import tiledb

newdir = Path(r"/orderbook_data")

def to_tileDB(x, pair):
    # Turn the sparse pandas DataFrame into a scipy COO matrix
    coo = x.sparse.to_coo()
    # Supposedly gets rid of duplicates in the COO matrix (in place)
    coo.sum_duplicates()
    # Map the integer COO positions back to the real row/column values
    row = x.index[coo.row].values
    col = pd.to_numeric(pd.Series(x.columns[coo.col]), downcast="float").to_numpy()

    # Define schema
    dom = tiledb.Domain(
        tiledb.Dim(name="price", domain=(0, 9e12), tile=x.max().max(), dtype=np.float64),
        #tiledb.Dim(name="date", domain=(np.datetime64('1980-01-01'), np.datetime64("2100-01-01")), tile=np.timedelta64(4, 'h'), dtype="datetime64[ns]")
        tiledb.Dim(name="date", domain=(0, 9e21), tile=14400e9, dtype=np.float64))

    atr_raw = tiledb.Attr(name="raw", dtype=np.float64)

    schema = tiledb.ArraySchema(domain=dom, sparse=True, attrs=[atr_raw],
                                cell_order='col-major', tile_order='row-major')

    # Make the array if it doesn't already exist
    if not os.path.exists(os.path.join(newdir, pair)):
        tiledb.SparseArray.create(os.path.join(newdir, pair), schema)
    # Write to the TileDB array
    with tiledb.SparseArray(os.path.join(newdir, pair), mode='w') as A:
        A[row, col] = {"raw": coo.data}

Error:
---------------------------------------------------------------------------
TileDBError                               Traceback (most recent call last)
<ipython-input-...> in <module>
     35
     36 # Save to DB
---> 37 to_tileDB(j_df.iloc[:,:-1],pair)
     38
     39 # Clear lists

<ipython-input-323-2f1730455654> in to_tileDB(x, pair)
     25     # Write to tileDB array
     26     with tiledb.SparseArray(os.path.join(newdir,pair), mode='w') as A:
---> 27         A[row,col] = {"raw":coo.data}

tiledb/libtiledb.pyx in tiledb.libtiledb.SparseArrayImpl.__setitem__()

tiledb/libtiledb.pyx in tiledb.libtiledb._write_array()

tiledb/libtiledb.pyx in tiledb.libtiledb._raise_ctx_err()

tiledb/libtiledb.pyx in tiledb.libtiledb._raise_tiledb_error()

TileDBError: [TileDB::Writer] Error: Duplicate coordinates are not allowed
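
For reference, that exact message is easy to trigger deliberately; here is a minimal sketch (hypothetical path, 1-D float dimension) that writes the same coordinate twice in a single write:

import numpy as np
import tiledb

uri = "dup_demo"  # hypothetical local path
dom = tiledb.Domain(tiledb.Dim(name="d", domain=(0.0, 100.0), tile=10.0, dtype=np.float64))
schema = tiledb.ArraySchema(domain=dom, sparse=True,
                            attrs=[tiledb.Attr(name="raw", dtype=np.float64)])
tiledb.SparseArray.create(uri, schema)
with tiledb.SparseArray(uri, mode='w') as A:
    # the same coordinate appears twice -> TileDB raises the duplicate-coordinates error
    A[[1.0, 1.0]] = {"raw": np.array([10.0, 20.0])}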

I’ve already verified with x.index.duplicated() that there are no duplicate indices, and obviously the two columns have different timestamps in this example.

I’m also using a fresh tileDB array every time to make sure it’s not somehow colliding with previous data.

The COO format apparently allows duplicate entries in sparse matrices, but they should be removed by calling coo.sum_duplicates().
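
For example, with plain scipy (a minimal check of that behaviour):

import numpy as np
from scipy.sparse import coo_matrix

# two entries at the same (row, col) position
m = coo_matrix((np.array([1.0, 2.0]), (np.array([0, 0]), np.array([0, 0]))), shape=(1, 1))
print(len(m.data))  # 2 -> the duplicates are still stored
m.sum_duplicates()  # merges them in place
print(len(m.data))  # 1 -> a single entry with value 3.0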

However, even after applying that function, I get the same error, and it only happens with certain dataframes; e.g. the previous 100 dataframes with the same format worked fine.

An example dataframe that looks very similar but DOES work can be found here.

The data look very similar and I don’t understand why one works and the other doesn’t.
If you have any ideas, I’d appreciate your input.

Hi @Mtrl_Scientist,

Thanks for trying TileDB. I think I need more information in order to debug. How are you loading the data? It would be very helpful to have a few more lines which demonstrate the error, including loading of the data (along the lines of this guide).

I tried the following:

import pandas as pd
from pathlib import Path

# < your code from the comment >

x = pd.read_excel("bug.xlsx")
to_tileDB(x, "testdir")

but got an error because the return value of read_excel is not sparse (AttributeError: Can only use the '.sparse' accessor with Sparse data.).

Thanks!

Sure, thanks for looking into it!

bug = pd.read_csv(os.path.join(newdir,"bug.csv"),index_col=0)
bug.columns = pd.to_datetime(bug.columns,unit="ns")
bug = bug.astype(pd.SparseDtype("float", np.nan)) # Converts dataframe to sparse format

working = pd.read_csv(os.path.join(newdir,"working_test.csv"),index_col=0)
working.columns = pd.to_datetime(working.columns,unit="ns")
working = working.astype(pd.SparseDtype("float", np.nan)) # Converts dataframe to sparse format

# Save
to_tileDB(working,"working")
to_tileDB(bug,"bug")

---------------------------------------------------------------------------
TileDBError                               Traceback (most recent call last)
<ipython-input-357-8e3ed9e0ccba> in <module>
      9 # Save
     10 to_tileDB(working,"working")
---> 11 to_tileDB(bug,"bug")

<ipython-input-323-2f1730455654> in to_tileDB(x, pair)
     25     # Write to tileDB array
     26     with tiledb.SparseArray(os.path.join(newdir,pair), mode='w') as A:
---> 27         A[row,col] = {"raw":coo.data}

tiledb/libtiledb.pyx in tiledb.libtiledb.SparseArrayImpl.__setitem__()

tiledb/libtiledb.pyx in tiledb.libtiledb._write_array()

tiledb/libtiledb.pyx in tiledb.libtiledb._raise_ctx_err()

tiledb/libtiledb.pyx in tiledb.libtiledb._raise_tiledb_error()

TileDBError: [TileDB::Writer] Error: Duplicate coordinates are not allowed

Hi @Mtrl_Scientist,

The first issue here is that the to_coo process casts the coordinates to int32, which can end up producing duplicates and will not preserve your indexes correctly.
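
You can verify that with a quick check along these lines (a sketch, reusing your coo object):

coo = x.sparse.to_coo()
print(coo.row.dtype, coo.col.dtype)  # typically int32 int32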

The second issue is that the schema here is storing time as a double (float64), whereas in pandas/numpy dates are stored as int64 with selectable units and timespans. If you can afford microsecond precision, then you could convert to python datetime.datetime with the pandas Timestamp.to_pydatetime function and store those (double/float64) values, but it may still be tricky to index due to loss of precision.
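
For example (a sketch; note that .timestamp() interprets a naive datetime in local time):

import pandas as pd

ts = pd.Timestamp("2020-01-01 00:00:00.123456789")
dt = ts.to_pydatetime()   # pandas warns that nanosecond precision is discarded
print(dt.timestamp())     # float64 seconds since the epoch, microsecond precision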

Alternatively, I would suggest waiting a few days for the heterogeneous dimensions feature to be merged. We would be very happy for you to try it out, and we will do a snapshot build of TileDB-Py against the core TileDB library for easy installation.

Best,
Isaiah

Hey @ihnorton

The int32 output is fine; I only use it to look up the real values, like this:

# Get numeric values of rows & cols, not just index
row = x.index[coo.row].values
col = pd.to_numeric(pd.Series(x.columns[coo.col]),downcast="float").to_numpy()

Even with the int32 limitation, I should be able to have about 2.1B (2^31 − 1) unique row or column indices before duplicates appear. I’ve never had more than 300k, so I’m not sure that’s the problem.

I’m very much looking forward to the heterogeneous dimensions though!

What’s strange is that when I do the following:

bug = pd.read_csv(os.path.join(newdir,"bug.csv"),index_col=0)
bug.columns = pd.to_datetime(bug.columns,unit="ns")
bug = bug.astype(pd.SparseDtype("float", np.nan)) # Converts dataframe to sparse format

working = pd.read_csv(os.path.join(newdir,"working_test.csv"),index_col=0)
working.columns = pd.to_datetime(working.columns,unit="ns")
working = working.astype(pd.SparseDtype("float", np.nan)) # Converts dataframe to sparse format

def zip_dict(x):
    coo = x.sparse.to_coo()
    # Supposedly gets rid of duplicates in COO (inplace)
    coo.sum_duplicates()
    # Get numeric values of rows & cols, not just index
    row = x.index[coo.row].values
    col = pd.to_numeric(pd.Series(x.columns[coo.col]),downcast="float").to_numpy()

    zip_len = len(dict(zip(row,col)))
    row_len_x = x.shape[0]
    return zip_len,row_len_x

print("Working dataframe")
zip_len,row_len_x = zip_dict(working)
print(f"Unique keys after zipping to dict: {zip_len}, Unique row indices of original df: {row_len_x}")

print("Bugged dataframe")
zip_len,row_len_x = zip_dict(bug)
print(f"Unique keys after zipping to dict: {zip_len}, Unique row indices of original df: {row_len_x}")

Which gives me this output:

Working dataframe
Unique keys after zipping to dict: 576, Unique row indices of original df: 576
Bugged dataframe
Unique keys after zipping to dict: 696, Unique row indices of original df: 696

As you can see, I do not get any duplicates at all, and these are the values I’m feeding to the TileDB array. So I’m definitely not feeding the array duplicate values (since a dict cannot have duplicate keys). Could there be something wrong with my schema, like limited resolution?
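
For completeness, a pair-level check along these lines (a sketch, reusing row and col from the function above) would test the actual coordinate pairs rather than dict keys:

import numpy as np

coords = np.stack([row, col], axis=1)
uniq, counts = np.unique(coords, axis=0, return_counts=True)
print(uniq[counts > 1])  # any (price, date) coordinate pair that occurs more than once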

EDIT:

OK, if I reverse the zipped dict so that the columns are the keys, I do get duplicate values in the bugged df:

Working dataframe
Unique keys after zipping to dict: 2, Unique col indices of original df: 2
Bugged dataframe
Unique keys after zipping to dict: 1, Unique col indices of original df: 2

The problem is that the downcast to “float” (i.e. float32) loses precision on the nanosecond timestamps in this line:

pd.to_numeric(pd.Series(x.columns[coo.col]),downcast="float").to_numpy()

When using int64 instead, it works and produces unique values.
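
That makes sense in hindsight: float32 has a 24-bit significand, so near 1.5e18 (nanosecond epoch timestamps) its spacing is about 2^37 ns, i.e. roughly two minutes, and distinct timestamps collapse onto the same value. A quick sketch:

import numpy as np

a = np.int64(1577836800000000000)  # 2020-01-01 00:00:00 in ns since the epoch
b = a + np.int64(60_000_000_000)   # one minute later
print(np.float32(a) == np.float32(b))  # True  -> the two timestamps collide
print(a == b)                          # False -> int64 keeps them distinct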

Best regards,

Fred