Hi,
I’d like to try out the new features in TileDB in Python. It appears that the Python docs haven’t been updated to reflect nullable attributes etc. Do you know when these will become available?
Thanks
Nick
Hi @nickholway,
The 0.8.4 release supports automatically creating nullable attributes from Pandas dataframes whose columns use nullable types (integer and boolean). Here is an example:
import tiledb, pandas as pd
# create a nullable integer array
a = pd.array([1, 2, None, 4, 5, None, None], dtype=pd.Int64Dtype())
# create a nullable boolean
b = pd.array([True, False, None, True, True, None, None], dtype='boolean')
# create dataframe from this data
df = pd.DataFrame({"int_column": a, "bool_column": b})
print("input dataframe: \n", df)
uri = "/tmp/df1.tiledb"
# store dataframe using TileDB
tiledb.from_pandas(uri, df)
# read dataframe back, and print the associated schema
with tiledb.open(uri) as A:
    print("result dataframe: ")
    print(A.df[:])  # read result directly as a dataframe
    print("\nAutomatic array schema: \n")
    print(A.schema)
input dataframe:
   int_column  bool_column
0           1         True
1           2        False
2        <NA>         <NA>
3           4         True
4           5         True
5        <NA>         <NA>
6        <NA>         <NA>
result dataframe:
   int_column  bool_column
0           1         True
1           2        False
2        <NA>         <NA>
3           4         True
4           5         True
5        <NA>         <NA>
6        <NA>         <NA>
Automatic array schema:
ArraySchema(
  domain=Domain(*[
    Dim(name='__tiledb_rows', domain=(0, 6), tile=6, dtype='uint64'),
  ]),
  attrs=[
    Attr(name='int_column', dtype='int64', var=False, nullable=True, filters=FilterList([ZstdFilter(level=1), ])),
    Attr(name='bool_column', dtype='uint8', var=False, nullable=True, filters=FilterList([ZstdFilter(level=1), ])),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=10000,
  sparse=False,
  coords_filters=FilterList([ZstdFilter(level=-1), ])
)
Right now we don’t support NumPy’s masked arrays because they don’t seem to be widely used (the Pandas implementation does not use them), but please let me know if those are of interest, or if there are other libraries with nullable data support that you have in mind.
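If your data already lives in a NumPy masked array, one possible workaround is to convert it to a Pandas nullable integer array first. This is only a minimal sketch, and it covers the conversion step alone. The helper name masked_to_nullable_int is hypothetical, it assumes an integer-valued masked array, and it uses 0 as an arbitrary placeholder under the mask:

```python
import numpy as np
import pandas as pd


def masked_to_nullable_int(masked):
    """Convert an integer NumPy masked array to a Pandas nullable Int64 array.

    Hypothetical helper for illustration; not part of TileDB or Pandas.
    """
    # Replace masked slots with a placeholder; the mask hides them anyway.
    values = masked.filled(0).astype("int64")
    # getmaskarray always returns a full boolean array, even if nothing is masked.
    mask = np.ma.getmaskarray(masked)
    return pd.arrays.IntegerArray(values, mask)


masked = np.ma.masked_array([1, 2, 3, 4], mask=[False, False, True, False])
df = pd.DataFrame({"int_column": masked_to_nullable_int(masked)})
print(df)
```

The resulting dataframe has an Int64 column with a missing value in the masked position, so it should be usable with tiledb.from_pandas in the same way as the example above.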
Best,
Isaiah