The open_dataframe function uses all available CPU cores at the moment. Is there any way to restrict that? I have already tried pyarrow.set_cpu_count(), but no luck.
I would not use that function, as it loads the entire array in the dataframe. Please check this tutorial for some basic usage of TileDB with dataframes: https://nbviewer.jupyter.org/github/TileDB-Inc/TileDB-Cloud-Example-Notebooks/blob/release/tutorials/notebooks/python/dataframes/df_basics.ipynb. More tutorials are coming up shortly.
Please let me know if there are any issues.
Hi @stavros, we are feeding the whole dataset into a machine learning algorithm. Reading the whole dataset into a dataframe is convenient for us since we do not want to modify the API of the ML algorithm.
Anyway, I think providing a way to restrict the number of threads is orthogonal to reading part of the array.
In the long run, I think we should definitely follow your suggestion. TileDB serves great as a data source for ML frameworks.
You can set the "sm.compute_concurrency_level" configuration parameter in the context. You can do this on the global default context, before you run any other TileDB command, as follows:

cfg = tiledb.Ctx().config()
cfg.update(
    {'sm.compute_concurrency_level': 1}
)
tiledb.default_ctx(cfg)

You may also want to set "sm.io_concurrency_level" similarly.
@stavros another question regarding the tutorial. I noticed you avoid using open_dataframe in it. Is there any implementation difference between A.df[:] and tiledb.open_dataframe, besides the fact that tiledb.open_dataframe reads in the whole array?
I wrote some test code and found that in some cases A.df[:] does not work but tiledb.open_dataframe returns the correct result.
Creating test.tldb:
from typing import List
import time

import numpy as np
import pandas as pd
import tiledb
from tiledb import ZstdFilter, FilterList

def create_array(array_name, attr_names: List[str], ctx=None):
    if tiledb.object_type(array_name) == "array":
        print(f"{array_name} already exists")
        return
    if ctx is None:
        ctx = tiledb.default_ctx()
    dim1 = tiledb.Dim(name='date', ctx=ctx,
                      domain=(np.datetime64('2000-01-01'), np.datetime64('2030-01-01')),
                      tile=np.timedelta64(365, 'D'),
                      dtype=np.datetime64('', 'D').dtype)
    dim2 = tiledb.Dim(name="instruments", ctx=ctx, domain=(None, None), tile=None, dtype=np.bytes_)
    dom = tiledb.Domain(dim1, dim2, ctx=ctx)
    attrs = (tiledb.Attr(name, dtype=np.float64, filters=FilterList([ZstdFilter(level=1)]))
             for name in attr_names)
    schema = tiledb.ArraySchema(ctx=ctx, domain=dom, sparse=True, attrs=attrs)
    tiledb.Array.create(array_name, schema)
def get_dimension_from_index(df: pd.DataFrame):
    return (df.index.get_level_values(name).tolist() for name in df.index.names)

def get_attrs_from_pandas(df: pd.DataFrame):
    return {col_name: df.loc[:, col_name] for col_name in df.columns.tolist()}
#%%
def update_array(array_name, df):
    t0 = time.time()
    i, j = get_dimension_from_index(df)
    with tiledb.SparseArray(array_name, mode='w') as A:
        A[i, j] = get_attrs_from_pandas(df)
    t1 = time.time()
    print(f"writing to tiledb takes {t1 - t0}s")
df = pd.DataFrame({
    "date": [pd.to_datetime("2020-10-12"), pd.to_datetime("2020-10-12"), pd.to_datetime("2020-10-13")],
    "instruments": ["600831.SH", "300937.SH", "900211.SZ"],
    "price": [1.0, 2.0, 3.0],
})
df = df.set_index(['date', 'instruments'])
attrs = df.columns.tolist()
create_array("test.tldb", attrs)
update_array("test.tldb", df)
Schema of test.tldb:
ArraySchema(
domain=Domain(*[
Dim(name='date', domain=(numpy.datetime64('2000-01-01'), numpy.datetime64('2030-01-01')), tile=365 days, dtype='datetime64[D]'),
Dim(name='instruments', domain=(None, None), tile=None, dtype='|S0'),
]),
attrs=[
Attr(name='price', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
],
cell_order='row-major',
tile_order='row-major',
capacity=10000,
sparse=True,
allows_duplicates=False,
coords_filters=FilterList([ZstdFilter(level=-1), ])
)
Reading with tiledb.open_dataframe works without any problem:
df = tiledb.open_dataframe("test.tldb")
However, reading with the method shown in your tutorial:

A = tiledb.open("test.tldb")
A.df[:, :]
throws the following exception:
---------------------------------------------------------------------------
TileDBError Traceback (most recent call last)
~/Projects/playground/playground/tiledb_alpha_attrs.py in <module>
----> 1 A.df[:,:]
~/.cache/pypoetry/virtualenvs/playground-gu9jURa0-py3.7/lib/python3.7/site-packages/tiledb/multirange_indexing.py in __getitem__(self, idx)
250 df = result
251 else:
--> 252 df = self.pa_to_pandas(result)
253
254 if use_stats():
~/.cache/pypoetry/virtualenvs/playground-gu9jURa0-py3.7/lib/python3.7/site-packages/tiledb/multirange_indexing.py in pa_to_pandas(self, pyquery)
279 raise TileDBError("Cannot convert to pandas via this path without pyarrow; please disable Arrow results")
280 try:
--> 281 table = pyquery._buffers_to_pa_table()
282 except Exception as exc:
283 if MultiRangeIndexer.debug:
TileDBError: TileDB-Arrow: tiledb datatype not understood ('DATETIME_DAY', cell_val_num: 1)
I am using tiledb-py 0.7.4. Thanks in advance!
Currently, yes: for the .df[] pathway we use Apache Arrow by default, which is well-optimized for dataframe-heavy tasks like creating Pandas dataframes from query results, and for string conversions. However, our interface is missing some metadata to round-trip datetime64 units other than ms and ns. Thanks for the bug report/reminder – we are actively working on several improvements here, and in the next release the .df and open_dataframe paths will be the same (both via Arrow by default), as well as preserving the necessary metadata for other date types.
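In the meantime, one possible workaround (an assumption on my part, not an official recommendation) is to keep the date coordinates in nanosecond resolution, which the Arrow path already round-trips, rather than day resolution:

```python
import numpy as np
import pandas as pd

# pandas parses timestamps as datetime64[ns] by default; declaring the
# 'date' dimension with dtype=np.datetime64('', 'ns').dtype instead of
# day resolution should sidestep the 'DATETIME_DAY' conversion error
# on the .df[] path (my assumption based on the error above).
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-10-12", "2020-10-12", "2020-10-13"]),
    "instruments": ["600831.SH", "300937.SH", "900211.SZ"],
    "price": [1.0, 2.0, 3.0],
}).set_index(["date", "instruments"])

# Coordinates extracted from the index are already nanosecond-resolution
dates = df.index.get_level_values("date").values
print(dates.dtype)  # datetime64[ns]
```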
Best,
Isaiah