From_pandas correct way to set nullable attributes

Hi,

When I look at the Array schema produced by calling the from_pandas Data Import method on my dataframe, I see that nullable=false for all of my attributes. I am trying to create a Sparse Array, and would like to (for some of my attributes) allow nullable=true because some measurements will not be known to me when I write to the array.

I don’t currently see an argument to pass to this function which will allow me to do this. I see the fillna option, but it doesn’t look like what I need.

Can someone help me with this?

Here is how I currently save the dataframe as a tiledb Sparse Array:

tiledb.from_pandas("/path/2/Array",
                   df,
                   sparse=True,
                   index_dims=["dim1","dim2","dim3","dim4","dim5"],
                   column_types={"dim1":"datetime64[ns]","dim2":np.float64, "dim3":"ascii", "dim4":"ascii", "dim5":"ascii"},
                   mode="ingest",
                   cell_order="row-major"
                  )

Thanks!

[Edit]


Following this guide from the tiledb website did not set nullable=True for my attribute. When I read in the array, the schema returns:

attrs=[
    Attr(name='attr1', dtype='<U0', var=True, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='attr2', dtype='float64', var=False, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
  ],

But I want:

attrs=[
    Attr(name='attr1', dtype='<U0', var=True, nullable=True, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='attr2', dtype='float64', var=False, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
  ],

Hi @Dave_L,

Here’s an example:

import tempfile
import numpy as np
import pandas as pd
import tiledb
uri = tempfile.mkdtemp()
tiledb.from_pandas(uri,
    pd.DataFrame({"a": np.random.rand(4), "b": ["aaa","bb","ccccc", "d"]}),
    sparse=True,
    column_types={"a":np.float64, "b":pd.StringDtype()})
In [9]: a = tiledb.open(uri)

In [10]: a.schema
Out[10]:
ArraySchema(
  domain=Domain(*[
    Dim(name='__tiledb_rows', domain=(0, 3), tile=3, dtype='int64', filters=FilterList([ZstdFilter(level=-1), ])),
  ]),
  attrs=[
    Attr(name='a', dtype='float64', var=False, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='b', dtype='<U0', var=True, nullable=True, filters=FilterList([ZstdFilter(level=-1), ])),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=10000,
  sparse=True,
  allows_duplicates=True,
)

This is with TileDB-Py 0.21.1. If it doesn’t work as above for you, please let me know your TileDB-Py version.
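
For a frame that actually contains missing values, a minimal sketch of the pandas-side preparation might look like the following. The toy column names and values are made up for illustration, and the commented-out `from_pandas` call simply reuses the arguments shown earlier in this thread; the only behavior relied on is what the schema output above shows, namely that a column typed as `pd.StringDtype()` ends up with `nullable=True`.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real frame (hypothetical names and values).
# attr1 has one measurement that is not known at write time.
df = pd.DataFrame(
    {
        "dim1": ["a", "b", "c"],
        "attr1": ["x", None, "z"],
        "attr2": [1.0, 2.0, 3.0],
    }
)

# Cast the column to pandas' nullable string dtype so the gap is stored as a
# proper missing value instead of a Python None inside an object column.
df["attr1"] = df["attr1"].astype(pd.StringDtype())

# Same style of column_types mapping as in the example above:
# the nullable attribute gets pd.StringDtype(), the other stays float64.
column_types = {"attr1": pd.StringDtype(), "attr2": np.float64}

print(df.dtypes)
print(df["attr1"].isna().tolist())

# Assumed follow-up, not executed in this sketch:
# tiledb.from_pandas(uri, df, sparse=True, index_dims=["dim1"],
#                    column_types=column_types)
```

After ingesting, `tiledb.open(uri).schema` can be inspected as above to confirm which attributes came out nullable.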

Hope that helps,
Isaiah

Ah! Works like a charm. Thank you for your help Isaiah
