How to restrict the number of threads when using tiledb-py's open_dataframe function

The open_dataframe function uses all available cpu cores at the moment.
Is there any way to restrict that?

I have already tried pyarrow.set_cpu_count(), but no luck.

I would not use that function, as it loads the entire array in the dataframe. Please check this tutorial for some basic usage of TileDB with dataframes: More tutorials are coming up shortly.

Please let me know if there are any issues.

1 Like

Hi @stavros, we are feeding the whole dataset into a machine learning algorithm. Reading the whole dataset into a dataframe is convenient for us since we do not want to modify the API of the ML algorithm.

Anyway, I think providing a way to restrict the number of threads is orthogonal to reading part of the array.

In the long run, I think we should definitely follow your suggestion. TileDB serves great as a data source for ML frameworks.

You can set the "sm.compute_concurrency_level" configuration parameter in the context. You can do it in the global context before you run another TileDB command as follows:

cfg = tiledb.Ctx().config()
    'sm.compute_concurrency_level': 1
1 Like

You may also want to similarly set "sm.io_concurrency_level".

@stavros anther question regarding the tutorial. I noticed you avoid using open_dataframe in the tutorial. Is there any implementation difference between A.df[:] and tiledb.open_dataframe besides that tiledb.open_dataframe reads in the whole array.

I wrote some test code, and found in some cases A.df[:] does not work but tiledb.open_dataframe returns the correct result.

creating test.tldb:

from typing import List
import tiledb
import numpy as np
import pandas as pd
from tiledb.dataframe_ import _tiledb_result_as_dataframe
from tiledb import ZstdFilter, FilterList
import time

def create_array(array_name, attr_names: List[str], ctx=None):
    if tiledb.object_type(array_name) == "array":
        print(f"{array_name} alreayd exists")
    if ctx is None:
        ctx = tiledb.default_ctx()

    dim1 = tiledb.Dim(name='date', ctx=ctx,
                      domain=(np.datetime64('2000-01-01'), np.datetime64('2030-01-01')),
                      tile=np.timedelta64(365, 'D'),
                      dtype=np.datetime64('', 'D').dtype)
    dim2 = tiledb.Dim(name="instruments", ctx=ctx, domain=(None, None), tile=None, dtype=np.bytes_)

    dom = tiledb.Domain(dim1, dim2, ctx=ctx)
    attrs = (tiledb.Attr(name, dtype=np.float64, filters=FilterList([ZstdFilter(level=1), ]) )for name in attr_names)
    schema = tiledb.ArraySchema(ctx=ctx, domain=dom, sparse=True,

    tiledb.Array.create(array_name, schema)

def get_dimension_from_index(df: pd.DataFrame):
    return (df.index.get_level_values(name).tolist() for name in df.index.names)

def get_attrs_from_pandas(df: pd.DataFrame):
    return {col_name: df.loc[:, col_name] for col_name in df.columns.tolist()}

def update_array(array_name, df):
    import time 
    t0 = time.time()
    data = df.to_numpy()
    i, j = get_dimension_from_index(df)
    with tiledb.SparseArray(array_name, mode='w') as A:
        A[i, j] = get_attrs_from_pandas(df)
    t1 = time.time()
    print(f"writing to tiledb takes {t1-t0}s")

df = pd.DataFrame({"date": [pd.to_datetime("2020-10-12"), pd.to_datetime("2020-10-12"), pd.to_datetime("2020-10-13")], "instruments": ["600831.SH", "300937.SH", "900211.SZ"], "price": [1.0, 2.0, 3.0]})
df = df.set_index(['date', 'instruments'])

attrs = df.columns.tolist()
create_array("test.tldb", attrs)
update_array("test.tldb", df)

Schema of test.tldb:

    Dim(name='date', domain=(numpy.datetime64('2000-01-01'), numpy.datetime64('2030-01-01')), tile=365 days, dtype='datetime64[D]'),
    Dim(name='instruments', domain=(None, None), tile=None, dtype='|S0'),
    Attr(name='price', dtype='float64', var=False, filters=FilterList([ZstdFilter(level=1), ])),
  coords_filters=FilterList([ZstdFilter(level=-1), ])

Read using tiledb.open_dataframe without any problem

df = tiledb.open_dataframe("test.tldb")

However, read using the method shown in your tutorial:

A ="test.tldb")

throws the following exception:

TileDBError                               Traceback (most recent call last)
~/Projects/playground/playground/ in <module>
----> 1 A.df[:,:]

~/.cache/pypoetry/virtualenvs/playground-gu9jURa0-py3.7/lib/python3.7/site-packages/tiledb/ in __getitem__(self, idx)
    250                 df = result
    251             else:
--> 252                 df = self.pa_to_pandas(result)
    254             if use_stats():

~/.cache/pypoetry/virtualenvs/playground-gu9jURa0-py3.7/lib/python3.7/site-packages/tiledb/ in pa_to_pandas(self, pyquery)
    279             raise TileDBError("Cannot convert to pandas via this path without pyarrow; please disable Arrow results")
    280         try:
--> 281             table = pyquery._buffers_to_pa_table()
    282         except Exception as exc:
    283             if MultiRangeIndexer.debug:

TileDBError: TileDB-Arrow: tiledb datatype not understood ('DATETIME_DAY', cell_val_num: 1)

I am using tiledb-py 0.7.4

Thanks in advance!

Currently, yes: for the .df[] pathway we use Apache Arrow by default, which is well-optimized for dataframe-heavy tasks like creating Pandas dataframes from query results, and for string conversions. However, our interface is missing some metadata to round-trip units of datetime64 other than ms and ns. Thanks for the bug report/reminder – we are actively working on several improvements here, and in the next release the .df and open_dataframe paths will be the same (both via Arrow by default), as well as preserving the necessary metadata for other date types.


1 Like