I’m trying to create a sparse array with a Unicode attribute in Python. The data I’m trying to load is in a Pandas dataframe and contains NaN values, which cause an error saying “TypeError: Failed to convert buffer for attribute: x”.
I can’t share my data, so I’ve written a quick example that throws the same error:
import pandas as pd
import tiledb
d = {"col1": ["abc", "def"],
"col2": [1, 2]}
good_df = pd.DataFrame(data=d)
d["col1"] = ["abc", None]
bad_df = pd.DataFrame(data=d)
tiledb.from_dataframe("works.tiledb", good_df) # Works
tiledb.from_dataframe("doesnt_work.tiledb", bad_df) # Fails with TypeError
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
tiledb/libtiledb.pyx in tiledb.libtiledb._write_array()
tiledb/np2buf.pyx in tiledb.libtiledb.array_to_buffer()
tiledb/np2buf.pyx in tiledb.libtiledb._varlen_cell_dtype()
TypeError: Unsupported varlen cell datatype ('<class 'NoneType'>')
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
<ipython-input-24-429838dda7e7> in <module>
----> 1 tiledb.from_dataframe("doesnt_work.tiledb", bad_df)
~/.local/lib/python3.6/site-packages/tiledb/dataframe_.py in from_dataframe(uri, dataframe, **kwargs)
374 DeprecationWarning)
375
--> 376 from_pandas(uri, dataframe, **kwargs)
377
378 def from_pandas(uri, dataframe, **kwargs):
~/.local/lib/python3.6/site-packages/tiledb/dataframe_.py in from_pandas(uri, dataframe, **kwargs)
508
509 # TODO ensure correct col/dim ordering
--> 510 A[tuple(coords)] = write_dict
511
512 else:
tiledb/libtiledb.pyx in tiledb.libtiledb.SparseArrayImpl.__setitem__()
tiledb/libtiledb.pyx in tiledb.libtiledb._write_array()
TypeError: Failed to convert buffer for attribute: 'col1'
In my actual code I’ve also tried creating the array manually (in addition to using from_dataframe) and see the same error. NaN values in a numeric attribute work as you’d expect.
I’m using Python 3.6.12, Pandas 1.1.3 and TileDB 0.7.1.
Can anyone provide some advice on how to handle NaN values when creating arrays with Unicode attributes in Python, please?
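A minimal pandas-only sketch of what likely differs between the two cases, assuming pandas 1.x behaviour: a None in a string column is kept as a Python None inside an object column (which matches the NoneType named in the traceback), whereas a None in a numeric column is converted to a float NaN. The variable names are illustrative and not part of the original example.

```python
import pandas as pd

# Same shape as the failing example: a string column containing None.
df = pd.DataFrame({"col1": ["abc", None], "col2": [1, 2]})

# A numeric column containing None, for comparison.
numeric = pd.DataFrame({"col2": [1, None]})

# Show the column dtype and the type of the missing element in each case.
print(df["col1"].dtype, type(df["col1"].iloc[1]))
print(numeric["col2"].dtype, type(numeric["col2"].iloc[1]))

# Count missing values per column to see which attributes are affected.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Counting missing values per column this way is also a quick check for which attributes in a larger dataframe would hit the error.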
Currently, any missing values (represented by pandas as NaN) need to be filled, either when creating the dataframe or by passing the fillna argument to tiledb.from_csv or tiledb.from_pandas, like this:
import pandas as pd
import tiledb
d = {"col1": ["abc", "def"],
"col2": [1, 2]}
good_df = pd.DataFrame(data=d)
d["col1"] = ["abc", None]
bad_df = pd.DataFrame(data=d)
tiledb.from_dataframe("works.tiledb", good_df) # Works
tiledb.from_dataframe("works_with_fillna.tiledb", bad_df, fillna={'col1': ''})
# ^ call should succeed with fillna
In the next major TileDB Embedded release (2.2) we will introduce native support for nullable datasets, so the fillna will not be necessary. (The code is merged, but we are finalizing other aspects of the release now, with the aim of releasing in early December, after the US holidays this week.)
Thanks for the speedy reply. I’ll try that in the (home) office tomorrow.
Is there a similar way to apply fillna when creating an array from scratch? In my real code, only a couple of columns contain NAs.
The next release sounds interesting. What else do we have to look forward to?
A couple of comments (happy to dig in further if I misunderstood what you mean by “from scratch” here):
You only need to specify/fill the columns with NAs.
If you are using Pandas’ read_csv implementation directly, there are a variety of options for handling NAs, including overrides for interpreting other strings as NA (see here).
For Pandas dataframes, you can call pandas.DataFrame.fillna to accomplish the same thing directly on the dataframe.
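The comments above can be sketched with pandas alone. This is a minimal illustration, assuming an empty string is an acceptable fill value for the string column; the column names, the "missing" placeholder and the in-memory CSV text are made up for the example.

```python
import io

import pandas as pd

df = pd.DataFrame({"col1": ["abc", None], "col2": [1, 2], "col3": ["x", "y"]})

# Only the columns that actually contain missing values need a fill value.
na_cols = [c for c in df.columns if df[c].isna().any()]
filled = df.fillna({c: "" for c in na_cols})
print(filled)

# read_csv can treat additional strings as NA via na_values
# (these are added to the default NA markers).
csv_text = "col1,col2\nabc,1\nmissing,2\n"
from_csv = pd.read_csv(io.StringIO(csv_text), na_values=["missing"])
print(from_csv["col1"].isna())
```

The filled dataframe contains no missing values in the affected column, so it should be writable in the same way as the working example in the question, whether through from_pandas or a manually created array.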
Indeed! Another highlight is the “Hilbert” tiling feature, which will simplify array creation as well as providing significant performance boosts. There is an active list of upcoming features and improvements here: