Writing sparse arrays in Python with Unicode containing NaN

Hi,

I’m trying to create a sparse array with a Unicode attribute in Python. The data I’m trying to load is in a Pandas dataframe and contains NaN values, which cause an error saying “TypeError: Failed to convert buffer for attribute: x”.

I can’t share my data so I’ve written a quick example which throws the same error:

import pandas as pd
import tiledb
d = {"col1": ["abc", "def"],
        "col2": [1, 2]}
good_df = pd.DataFrame(data=d)
d["col1"] = ["abc", None]
bad_df = pd.DataFrame(data=d)
tiledb.from_dataframe("works.tiledb", good_df) # Works
tiledb.from_dataframe("doesnt_work.tiledb", bad_df)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
tiledb/libtiledb.pyx in tiledb.libtiledb._write_array()

tiledb/np2buf.pyx in tiledb.libtiledb.array_to_buffer()

tiledb/np2buf.pyx in tiledb.libtiledb._varlen_cell_dtype()

TypeError: Unsupported varlen cell datatype ('<class 'NoneType'>')

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-24-429838dda7e7> in <module>
----> 1 tiledb.from_dataframe("doesnt_work.tiledb", bad_df)

~/.local/lib/python3.6/site-packages/tiledb/dataframe_.py in from_dataframe(uri, dataframe, **kwargs)
    374                   DeprecationWarning)
    375 
--> 376     from_pandas(uri, dataframe, **kwargs)
    377 
    378 def from_pandas(uri, dataframe, **kwargs):

~/.local/lib/python3.6/site-packages/tiledb/dataframe_.py in from_pandas(uri, dataframe, **kwargs)
    508 
    509                 # TODO ensure correct col/dim ordering
--> 510                 A[tuple(coords)] = write_dict
    511 
    512             else:

tiledb/libtiledb.pyx in tiledb.libtiledb.SparseArrayImpl.__setitem__()

tiledb/libtiledb.pyx in tiledb.libtiledb._write_array()

TypeError: Failed to convert buffer for attribute: 'col1'

In my actual code I’ve also tried creating the array manually (in addition to using from_dataframe) and see the same error. NaN values in a numeric attribute work as you’d expect.

I’m using Python 3.6.12, Pandas 1.1.3 and TileDB 0.7.1.

Can anyone provide some advice on how to handle NaN whilst creating arrays with Unicode attributes in Python, please?

Thanks

Nick

Hi @nickholway,

Currently, any missing values (represented by pandas as NaN) need to be filled, either when creating the dataframe or by passing the fillna argument to tiledb.from_csv or tiledb.from_pandas, like this:

import pandas as pd
import tiledb
d = {"col1": ["abc", "def"],
        "col2": [1, 2]}
good_df = pd.DataFrame(data=d)
d["col1"] = ["abc", None]
bad_df = pd.DataFrame(data=d)
tiledb.from_dataframe("works.tiledb", good_df) # Works

tiledb.from_dataframe("works_with_fillna.tiledb", bad_df, fillna={'col1': ''})
#       ^ call should succeed with fillna

In the next major TileDB Embedded release (2.2) we will introduce native support for nullable datasets, so fillna will no longer be necessary. The code is merged, but we are still finalizing other aspects of the release, with the aim of releasing in early December, after the US holidays this week.

Best,
Isaiah


Thanks for the speedy reply :) I’ll try that in the (home) office tomorrow.

Is there a similar way that you can add the fillna whilst creating an array from scratch too? In my real code there are only a couple of columns with NAs in.

The next release sounds interesting, what else do we have to look forward to?

A couple comments – happy to dig in further if I misunderstood what you mean by “from scratch” here:

  • You only need to specify/fill the columns with NAs.
  • If you are using the Pandas read_csv implementation directly, there are a variety of options for handling NAs, including overrides for interpreting other strings as NA (see the pandas read_csv documentation).
  • For Pandas dataframes, you can call pandas.DataFrame.fillna on the dataframe itself to accomplish the same thing.

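To make the points above concrete, here is a minimal sketch that uses only pandas. The frame and the CSV text are made up for illustration (they just mirror the col1/col2 example earlier in the thread), and the "missing" marker string is an assumption, not something TileDB or pandas defines. It shows filling only the string column and two of the read_csv options for controlling what counts as NA.

```python
import io
import pandas as pd

# Hypothetical frame mirroring the example above: a string column with a
# missing value, plus a numeric column that also contains a NaN.
df = pd.DataFrame({"col1": ["abc", None], "col2": [1.0, float("nan")]})

# Fill only the string column; the numeric NaN in col2 is left untouched.
filled = df.fillna({"col1": ""})

# When reading CSV, extra strings can be treated as NA via na_values
# ("missing" is an illustrative marker chosen for this sketch) ...
csv_text = "col1,col2\nabc,1\nmissing,2\n"
parsed = pd.read_csv(io.StringIO(csv_text), na_values=["missing"])
parsed_filled = parsed.fillna({"col1": ""})

# ... or default NA detection can be switched off, so empty fields are
# kept as empty strings instead of becoming NaN.
raw = pd.read_csv(io.StringIO("col1,col2\nabc,1\n,2\n"), keep_default_na=False)
```

Once the string columns no longer contain None/NaN, the resulting frame (or its columns as NumPy arrays) should be writable in the same way as good_df in the original example, whether through from_pandas or a manually created array.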
Indeed! Another highlight is the “Hilbert” tiling feature, which will simplify array creation as well as providing significant performance boosts. There is an active list of upcoming features and improvements here:

Best,
Isaiah