Writing sparse arrays in Python with Unicode containing NaN

nickholway · November 24, 2020, 3:16pm

Hi,

I’m trying to create a sparse array with a Unicode attribute in Python. The data I’m trying to load is in a Pandas dataframe and contains NaN values, which causes an error saying “TypeError: Failed to convert buffer for attribute: x”.

I can’t share my data so I’ve written a quick example which throws the same error:

import pandas as pd
import tiledb
d = {"col1": ["abc", "def"],
     "col2": [1, 2]}
good_df = pd.DataFrame(data=d)
d["col1"] = ["abc", None]
bad_df = pd.DataFrame(data=d)
tiledb.from_dataframe("works.tiledb", good_df) # Works
tiledb.from_dataframe("doesnt_work.tiledb", bad_df) # Raises TypeError
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
tiledb/libtiledb.pyx in tiledb.libtiledb._write_array()

tiledb/np2buf.pyx in tiledb.libtiledb.array_to_buffer()

tiledb/np2buf.pyx in tiledb.libtiledb._varlen_cell_dtype()

TypeError: Unsupported varlen cell datatype ('<class 'NoneType'>')

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-24-429838dda7e7> in <module>
----> 1 tiledb.from_dataframe("doesnt_work.tiledb", bad_df)

~/.local/lib/python3.6/site-packages/tiledb/dataframe_.py in from_dataframe(uri, dataframe, **kwargs)
    374                   DeprecationWarning)
    375 
--> 376     from_pandas(uri, dataframe, **kwargs)
    377 
    378 def from_pandas(uri, dataframe, **kwargs):

~/.local/lib/python3.6/site-packages/tiledb/dataframe_.py in from_pandas(uri, dataframe, **kwargs)
    508 
    509                 # TODO ensure correct col/dim ordering
--> 510                 A[tuple(coords)] = write_dict
    511 
    512             else:

tiledb/libtiledb.pyx in tiledb.libtiledb.SparseArrayImpl.__setitem__()

tiledb/libtiledb.pyx in tiledb.libtiledb._write_array()

TypeError: Failed to convert buffer for attribute: 'col1'

In my actual code I’ve also tried creating the array manually (in addition to using from_dataframe) and see the same error. NaN values in a numeric attribute work as you’d expect.

I’m using Python 3.6.12, Pandas 1.1.3 and TileDB 0.7.1.

Can anyone provide some advice on how to handle NaN whilst creating arrays with Unicode attributes in Python, please?

Thanks

Nick

ihnorton · November 24, 2020, 5:57pm

Hi @nickholway,

Currently, any missing values (represented by pandas as NaN) need to be filled, either when creating the dataframe or by passing the fillna argument to tiledb.from_csv or tiledb.from_pandas, like this:

import pandas as pd
import tiledb
d = {"col1": ["abc", "def"],
     "col2": [1, 2]}
good_df = pd.DataFrame(data=d)
d["col1"] = ["abc", None]
bad_df = pd.DataFrame(data=d)
tiledb.from_dataframe("works.tiledb", good_df) # Works

tiledb.from_dataframe("works_with_fillna.tiledb", bad_df, fillna={'col1': ''})
#       ^ call should succeed with fillna

In the next major TileDB Embedded release (2.2) we will introduce native support for nullable datasets, so the fillna will not be necessary. (The code is merged, but we are finalizing other aspects of the release now, with the aim of releasing in early December, after the US holidays this week.)
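Until then, the fillna mapping can be built from the dataframe itself. The following is only a minimal sketch, assuming an empty string is an acceptable placeholder for missing text; the names df and fill_values are illustrative, and the TileDB call is left as a comment so the snippet runs with pandas alone.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["abc", None], "col2": [1, 2]})

# Collect the non-numeric columns that contain at least one missing value.
# Numeric columns are skipped because NaN is already handled there.
fill_values = {
    col: ""
    for col in df.columns
    if df[col].isna().any() and not pd.api.types.is_numeric_dtype(df[col])
}
print(fill_values)

# The mapping could then be passed as the fillna argument shown above:
# tiledb.from_pandas("works_with_fillna.tiledb", df, fillna=fill_values)
```

Only the columns that actually need a placeholder end up in the mapping, so the rest of the dataframe is left untouched.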

Best,
Isaiah

nickholway · November 24, 2020, 8:20pm

Thanks for the speedy reply. I’ll try that in the (home) office tomorrow.

Is there a similar way that you can add the fillna whilst creating an array from scratch too? In my real code there are only a couple of columns with NAs in.

The next release sounds interesting, what else do we have to look forward to?

ihnorton · November 24, 2020, 8:44pm

A couple comments – happy to dig in further if I misunderstood what you mean by “from scratch” here:

You only need to specify/fill the columns with NAs.
If you are using the Pandas read_csv implementation directly, there are a variety of options for handling NAs, including overrides for interpreting other strings as NA (see the pandas read_csv documentation).
For Pandas dataframes, you can call pandas.DataFrame.fillna to accomplish the same thing directly on the dataframe.
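As a rough illustration of the last two points, the sketch below fills a single column with pandas.DataFrame.fillna and uses the na_values option of read_csv to treat an extra string as NA. The sample data, the "missing" marker and the empty-string placeholder are assumptions made for the example, and the TileDB write is left as a comment.

```python
import io
import pandas as pd

# Fill only the column that contains missing values.
df = pd.DataFrame({"col1": ["abc", None], "col2": [1, 2]})
filled = df.fillna({"col1": ""})
# The filled dataframe could then be written as usual, e.g.:
# tiledb.from_pandas("filled.tiledb", filled)

# With read_csv, additional strings can be interpreted as NA via na_values.
csv_text = "col1,col2\nabc,1\nmissing,2\n"
from_csv = pd.read_csv(io.StringIO(csv_text), na_values=["missing"])
from_csv = from_csv.fillna({"col1": ""})

print(filled)
print(from_csv)
```

Passing a dict to fillna limits the replacement to the named columns, which matches the case where only a couple of columns contain NAs.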

Indeed! Another highlight is the “Hilbert” tiling feature, which will simplify array creation and provide significant performance boosts. There is an active list of upcoming features and improvements here:

github.com

TileDB-Inc/TileDB/blob/dev/HISTORY.md#new-features


Best,
Isaiah
