Python Documentation

Hi,

I’d like to try out the new features in TileDB in Python. It appears that the Python docs haven’t been updated to reflect nullable attributes etc. Do you know when these will become available?

Thanks

Nick

Hi @nickholway,

The 0.8.4 release supports automatically creating nullable attributes from Pandas dataframes whose columns use nullable dtypes (integer and boolean). Here is an example:

import tiledb, pandas as pd

# create a nullable integer array
a = pd.array([1, 2, None, 4, 5, None, None], dtype=pd.Int64Dtype())

# create a nullable boolean
b = pd.array([True, False, None, True, True, None, None], dtype='boolean')

# create dataframe from this data
df = pd.DataFrame({"int_column": a, "bool_column": b})

print("input dataframe: \n", df)

uri = "/tmp/df1.tiledb"

# store dataframe using TileDB
tiledb.from_pandas(uri, df)

# read dataframe back, and print the associated schema
with tiledb.open(uri) as A:
    print("result dataframe: ")
    print(A.df[:]) # read result directly as a dataframe

    print("\nAutomatic array schema: \n")
    print(A.schema)

Output from example script:

input dataframe: 
    int_column  bool_column
0           1         True
1           2        False
2        <NA>         <NA>
3           4         True
4           5         True
5        <NA>         <NA>
6        <NA>         <NA>
result dataframe: 
   int_column  bool_column
0           1         True
1           2        False
2        <NA>         <NA>
3           4         True
4           5         True
5        <NA>         <NA>
6        <NA>         <NA>

Automatic array schema: 

ArraySchema(
  domain=Domain(*[
    Dim(name='__tiledb_rows', domain=(0, 6), tile=6, dtype='uint64'),
  ]),
  attrs=[
    Attr(name='int_column', dtype='int64', var=False, nullable=True, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='bool_column', dtype='uint8', var=False, nullable=True, filters=FilterList([ZstdFilter(level=1), ])),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=10000,
  sparse=False,
  coords_filters=FilterList([ZstdFilter(level=-1), ])
)
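
The example above builds its columns with explicit nullable dtypes. If a dataframe was instead built from plain Python lists containing None, pandas falls back to float64 or object columns. A minimal pandas-only sketch of converting such columns is below; it assumes pandas 1.0 or later for `DataFrame.convert_dtypes`, and the column names are just reused from the example. Whether `tiledb.from_pandas` then handles the converted frame exactly like the one above is an assumption worth checking against the printed schema.

```python
import pandas as pd

# Plain lists with None: pandas does NOT pick nullable dtypes here.
raw = pd.DataFrame({
    "int_column": [1, 2, None],          # becomes float64 with NaN
    "bool_column": [True, False, None],  # becomes object
})
print(raw.dtypes)

# convert_dtypes() switches columns to the nullable extension dtypes
# where the values allow it (Int64 and boolean for this data).
converted = raw.convert_dtypes()
print(converted.dtypes)
```

Checking `converted.dtypes` before writing is a cheap way to confirm that each column really uses a nullable dtype.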

Right now we don’t support NumPy’s masked arrays because they don’t seem to be widely used (the Pandas nullable types do not use them). Please let me know if masked arrays are of interest, or if you have other libraries with nullable data support in mind.
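
In the meantime, data held in a NumPy masked array can be moved into a Pandas nullable column by hand. The sketch below is one possible workaround rather than anything built into TileDB; it uses only NumPy and pandas calls (`np.ma.getmaskarray`, `pd.arrays.IntegerArray`), and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# A masked integer array: True in the mask marks a missing value.
masked = np.ma.masked_array([1, 2, 3, 4], mask=[False, True, False, True])

# Pull out the raw values and a full boolean mask, then build a
# pandas nullable integer array from the two pieces.
values = np.asarray(masked.data, dtype="int64")
mask = np.ma.getmaskarray(masked)
nullable = pd.arrays.IntegerArray(values, mask)

df = pd.DataFrame({"int_column": nullable})
print(df)
```

The resulting dataframe has an Int64 column, so it should be usable with `tiledb.from_pandas` in the same way as the example above.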

Best,
Isaiah