Hi again,
Sorry to be asking all these questions, but I think TileDB is great and I’d like to continue using it.
I’m experiencing a weird bug that only occurs with certain dataframes and I can’t see why.
I’ve uploaded the bug-producing dataframe here.
In this example, the first column is dense (every index is associated with a value), whereas the 2nd column is sparse (missing values). The whole dataframe is of sparse type, however.
When I then try to save this data with the following code, I get an error telling me that there are duplicate coordinates:
    import os
    from pathlib import Path
    import numpy as np
    import pandas as pd
    import tiledb

    newdir = Path(r"/orderbook_data")

    def to_tileDB(x, pair):
        # Turn Pandas Sparse DF into COO
        coo = x.sparse.to_coo()
        # Supposedly gets rid of duplicates in COO (inplace)
        coo.sum_duplicates()
        # Get numeric values of rows & cols, not just index
        row = x.index[coo.row].values
        col = pd.to_numeric(pd.Series(x.columns[coo.col]), downcast="float").to_numpy()
        # Define schema
        dom = tiledb.Domain(
            tiledb.Dim(name="price", domain=(0, 9e12), tile=x.max().max(), dtype=np.float64),
            #tiledb.Dim(name="date", domain=(np.datetime64('1980-01-01'), np.datetime64("2100-01-01")), tile=np.timedelta64(4, 'h'), dtype="datetime64[ns]")
            tiledb.Dim(name="date", domain=(0, 9e21), tile=14400e9, dtype=np.float64))
        atr_raw = tiledb.Attr(name="raw", dtype=np.float64)
        schema = tiledb.ArraySchema(domain=dom, sparse=True, attrs=[atr_raw], cell_order='col-major', tile_order='row-major')
        # Make array if it doesn't already exist
        if not os.path.exists(os.path.join(newdir, pair)):
            tiledb.SparseArray.create(os.path.join(newdir, pair), schema)
        # Write to tileDB array
        with tiledb.SparseArray(os.path.join(newdir, pair), mode='w') as A:
            A[row, col] = {"raw": coo.data}
Error:
---------------------------------------------------------------------------
TileDBError Traceback (most recent call last)
in
35
36 # Save to DB
---> 37 to_tileDB(j_df.iloc[:,:-1],pair)
38
39 # Clear lists
<ipython-input-323-2f1730455654> in to_tileDB(x, pair)
25 # Write to tileDB array
26 with tiledb.SparseArray(os.path.join(newdir,pair), mode='w') as A:
---> 27 A[row,col] = {"raw":coo.data}
tiledb/libtiledb.pyx in tiledb.libtiledb.SparseArrayImpl.__setitem__()
tiledb/libtiledb.pyx in tiledb.libtiledb._write_array()
tiledb/libtiledb.pyx in tiledb.libtiledb._raise_ctx_err()
tiledb/libtiledb.pyx in tiledb.libtiledb._raise_tiledb_error()
TileDBError: [TileDB::Writer] Error: Duplicate coordinates are not allowed
I’ve already verified with x.index.duplicated() that there are no duplicate indices, and the columns clearly have different timestamps in this example.
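Since the writer rejects repeated (row, col) pairs rather than repeated values on one axis, a check on the exact arrays passed to `A[row, col]` may be more telling than a check on the index alone. Below is a minimal sketch of such a check; `find_duplicate_coords` is a hypothetical helper, not part of TileDB or pandas, and the coordinates are made up for illustration.

```python
from collections import Counter


def find_duplicate_coords(row, col):
    """Return {(row, col): count} for every coordinate pair seen more than once."""
    counts = Counter(zip(row, col))
    return {pair: n for pair, n in counts.items() if n > 1}


# Made-up coordinates: each axis repeats values on its own (normal for a
# matrix), but only the pair (101.5, 20.0) occurs twice.
row = [100.0, 101.5, 101.5, 101.5]
col = [10.0, 10.0, 20.0, 20.0]

duplicates = find_duplicate_coords(row, col)
print(duplicates)
```

The same function should also accept the NumPy arrays `row` and `col` built inside `to_tileDB`, so it could be called just before the write to see which pairs, if any, collide.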
I’m also using a fresh TileDB array every time to make sure it’s not somehow colliding with previous data.
COO apparently allows duplicate entries for sparse matrices, but they should be removed when using coo.sum_duplicates().
However, even after applying that function, I get the same error, and it only happens with certain dataframes: the previous 100 with the same format worked.
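One possibility that may be worth ruling out, stated here as an assumption rather than a confirmed diagnosis: `pd.to_numeric(..., downcast="float")` is documented to go as low as `float32`, and single precision cannot tell nearby nanosecond timestamps apart. The sketch below uses only the standard library (the `"f"` format in `struct` is a 32-bit float) and made-up timestamps; `to_float32` is a hypothetical helper, not part of pandas or TileDB.

```python
import struct


def to_float32(value):
    """Round a number to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]


# Hypothetical column labels: five nanosecond timestamps, one second apart.
start_ns = 1_600_000_000_000_000_000
timestamps_ns = [start_ns + i * 1_000_000_000 for i in range(5)]

# Count how many distinct values survive at each precision.
as_float64 = {float(t) for t in timestamps_ns}
as_float32 = {to_float32(t) for t in timestamps_ns}

print("distinct as float64:", len(as_float64))
print("distinct as float32:", len(as_float32))
```

If `col.dtype` turns out to be `float32` in the failing case, columns that differ as timestamps could end up with identical coordinates, which would also explain why only some dataframes are affected. Dropping the `downcast` argument would be one thing to try.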
An example dataframe that looks very similar but DOES work can be found here.
The data look very similar and I don’t understand why one works and the other doesn’t.
If you have any ideas, I’d appreciate your input.