The open_dataframe function uses all available CPU cores at the moment. Is there any way to restrict that? I have already tried pyarrow.set_cpu_count(), but no luck.
I would not use that function, as it loads the entire array in the dataframe. Please check this tutorial for some basic usage of TileDB with dataframes: https://nbviewer.jupyter.org/github/TileDB-Inc/TileDB-Cloud-Example-Notebooks/blob/release/tutorials/notebooks/python/dataframes/df_basics.ipynb. More tutorials are coming up shortly.
Please let me know if there are any issues.
Hi @stavros, we are feeding the whole dataset into a machine learning algorithm. Reading the whole dataset into a dataframe is convenient for us since we do not want to modify the API of the ML algorithm.
Anyway, I think providing a way to restrict the number of threads is orthogonal to reading part of the array.
In the long run, I think we should definitely follow your suggestion. TileDB serves great as a data source for ML frameworks.
You can set the "sm.compute_concurrency_level" configuration parameter in the context. You can do this on the global default context, before you run any other TileDB command, as follows:

cfg = tiledb.Ctx().config()
cfg.update(
    {'sm.compute_concurrency_level': 1}
)
tiledb.default_ctx(cfg)

You may also want to set "sm.io_concurrency_level" similarly.
@stavros another question regarding the tutorial. I noticed you avoid using open_dataframe in it. Is there any implementation difference between A.df[:] and tiledb.open_dataframe, besides the fact that tiledb.open_dataframe reads in the whole array?
I wrote some test code and found that in some cases A.df[:] does not work but tiledb.open_dataframe returns the correct result.
Creating test.tldb:
from typing import List
import time

import numpy as np
import pandas as pd
import tiledb
from tiledb import ZstdFilter, FilterList

def create_array(array_name, attr_names: List[str], ctx=None):
    if tiledb.object_type(array_name) == "array":
        print(f"{array_name} already exists")
        return
    if ctx is None:
        ctx = tiledb.default_ctx()
    dim1 = tiledb.Dim(name='date', ctx=ctx,
                      domain=(np.datetime64('2000-01-01'), np.datetime64('2030-01-01')),
                      tile=np.timedelta64(365, 'D'),
                      dtype=np.datetime64('', 'D').dtype)
    dim2 = tiledb.Dim(name="instruments", ctx=ctx, domain=(None, None), tile=None, dtype=np.bytes_)
    dom = tiledb.Domain(dim1, dim2, ctx=ctx)
    attrs = (tiledb.Attr(name, dtype=np.float64, filters=FilterList([ZstdFilter(level=1)]))
             for name in attr_names)
    schema = tiledb.ArraySchema(ctx=ctx, domain=dom, sparse=True, attrs=attrs)
    tiledb.Array.create(array_name, schema)
def get_dimension_from_index(df: pd.DataFrame):
    return (df.index.get_level_values(name).tolist() for name in df.index.names)

def get_attrs_from_pandas(df: pd.DataFrame):
    return {col_name: df.loc[:, col_name] for col_name in df.columns.tolist()}
#%%
def update_array(array_name, df):
    t0 = time.time()
    i, j = get_dimension_from_index(df)
    with tiledb.SparseArray(array_name, mode='w') as A:
        A[i, j] = get_attrs_from_pandas(df)
    t1 = time.time()
    print(f"writing to tiledb takes {t1 - t0}s")
df = pd.DataFrame({
    "date": [pd.to_datetime("2020-10-12"), pd.to_datetime("2020-10-12"), pd.to_datetime("2020-10-13")],
    "instruments": ["600831.SH", "300937.SH", "900211.SZ"],
    "price": [1.0, 2.0, 3.0],
})
df = df.set_index(['date', 'instruments'])
attrs = df.columns.tolist()
create_array("test.tldb", attrs)
update_array("test.tldb", df)
Schema of test.tldb:
ArraySchema(
domain=Domain(*[
Dim(name='date', domain=(numpy.datetime64('2000-01-01'), numpy.datetime64('2030-01-01')), tile=365 days, dtype='datetime64[D]'),
Dim(name='instruments', domain=(None, None), tile=None, dtype='|S0'),
]),
attrs=[
Attr(name='price', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
],
cell_order='row-major',
tile_order='row-major',
capacity=10000,
sparse=True,
allows_duplicates=False,
coords_filters=FilterList([ZstdFilter(level=-1), ])
)
Reading with tiledb.open_dataframe works without any problem:
df = tiledb.open_dataframe("test.tldb")
However, reading with the method shown in your tutorial:

A = tiledb.open("test.tldb")
A.df[:, :]
throws the following exception:
---------------------------------------------------------------------------
TileDBError Traceback (most recent call last)
~/Projects/playground/playground/tiledb_alpha_attrs.py in <module>
----> 1 A.df[:,:]
~/.cache/pypoetry/virtualenvs/playground-gu9jURa0-py3.7/lib/python3.7/site-packages/tiledb/multirange_indexing.py in __getitem__(self, idx)
250 df = result
251 else:
--> 252 df = self.pa_to_pandas(result)
253
254 if use_stats():
~/.cache/pypoetry/virtualenvs/playground-gu9jURa0-py3.7/lib/python3.7/site-packages/tiledb/multirange_indexing.py in pa_to_pandas(self, pyquery)
279 raise TileDBError("Cannot convert to pandas via this path without pyarrow; please disable Arrow results")
280 try:
--> 281 table = pyquery._buffers_to_pa_table()
282 except Exception as exc:
283 if MultiRangeIndexer.debug:
TileDBError: TileDB-Arrow: tiledb datatype not understood ('DATETIME_DAY', cell_val_num: 1)
I am using tiledb-py 0.7.4. Thanks in advance!
Currently, yes: for the .df[] pathway we use Apache Arrow by default, which is well-optimized for dataframe-heavy tasks like creating Pandas dataframes from query results, and for string conversions. However, our interface is missing some metadata to round-trip datetime64 units other than ms and ns. Thanks for the bug report/reminder – we are actively working on several improvements here, and in the next release the .df and open_dataframe paths will be the same (both via Arrow by default), as well as preserving the necessary metadata for other date types.
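In the meantime, one possible workaround (an assumption on my part, not an official recommendation) is to keep the date coordinates in nanosecond resolution, which the Arrow path already round-trips, rather than day resolution:

```python
import numpy as np
import pandas as pd

# pandas parses timestamps as datetime64[ns] by default; declaring the
# 'date' dimension with dtype=np.datetime64('', 'ns').dtype instead of
# day resolution should sidestep the 'DATETIME_DAY' conversion error
# on the .df[] path (my assumption based on the error above).
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-10-12", "2020-10-12", "2020-10-13"]),
    "instruments": ["600831.SH", "300937.SH", "900211.SZ"],
    "price": [1.0, 2.0, 3.0],
}).set_index(["date", "instruments"])

# Coordinates extracted from the index are already nanosecond-resolution
dates = df.index.get_level_values("date").values
print(dates.dtype)  # datetime64[ns]
```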
Best,
Isaiah