TileDBError: Error: Internal TileDB uncaught exception; std::bad_alloc

Hi again,

I’m having issues reading out a particular value range.
My array spans price values from $0 to $70'000, but I cannot read the range between $5'500 and $7'000 in one go. The only way to read this range is iteratively, in smaller chunks (e.g. $250 increments), concatenating the results afterwards (see the sketch below), but this is not ideal as it takes much longer.
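For reference, the chunked workaround looks roughly like this (using the from_tileDB2 helper defined further down; the $250 step is just what I settled on):

import pandas as pd

# Read the problematic price range in $250 chunks and concatenate.
# Note: TileDB slices dimensions inclusively on both ends, so cells
# sitting exactly on a chunk boundary can appear in two chunks.
chunks = []
for lo in range(5500, 7000, 250):
    hi = min(lo + 250, 7000)
    chunks.append(from_tileDB2(lo, hi, sdir, pair))
df = pd.concat(chunks)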

Things I’ve tried:

  • Read out iteratively and save to a new array
  • Optimize the new array in terms of tile layout and tile capacity (rough sketch after this list)

Unfortunately to no avail… The error persists, but really only in this narrow range.
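For the second point, the re-created array looked roughly like this (the tile extents and capacity here are illustrative, and "btcusdt2_opt" is a placeholder name):

import numpy as np
import tiledb

# Re-create the array with a coarser price tile and a much smaller capacity.
dom2 = tiledb.Domain(
    tiledb.Dim(name="price", domain=(0, 9e12), tile=100.0, dtype=np.float64),
    tiledb.Dim(name="date", domain=(0, 9e21), tile=86.4e12, dtype=np.float64))
schema2 = tiledb.ArraySchema(domain=dom2, sparse=True,
                             attrs=[tiledb.Attr(name="data", dtype=np.float64)],
                             cell_order="row-major", tile_order="row-major",
                             capacity=100000)
tiledb.SparseArray.create("btcusdt2_opt", schema2)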

You can find the data here (~1.5 GB, 300 M data points).
Code that produces the error:

import tiledb
import pandas as pd
from pathlib import Path
import numpy as np
import os

def from_tileDB2(p1,p2,sdir,pair):
    with tiledb.open(os.path.join(sdir,f"{pair}")) as A:
        data = A[p1:p2,:]

    df = pd.DataFrame({"price":np.array(data["coords"]["price"],dtype=np.float64),
              "date":np.array(data["coords"]["date"],dtype='datetime64[ns]'),
              "data":np.array(data["data"],dtype=np.float64)}).set_index("price")
    return df

# Source Dir
sdir = Path(r"Your_Path")
# Array Name
pair = "btcusdt2"

# Price Range
p1 = 5500
p2 = 7000

# Array Query
df = from_tileDB2(p1,p2,sdir,pair)

Traceback:

TileDBError                               Traceback (most recent call last)
<ipython-input-14-5cd2827a3f2c> in <module>
     14 p2 = 7000
     15 
---> 16 df = from_tileDB2(p1,p2,sdir,pair)

<ipython-input-14-5cd2827a3f2c> in from_tileDB2(p1, p2, sdir, pair)
      1 def from_tileDB2(p1,p2,sdir,pair):
      2     with tiledb.open(os.path.join(sdir,f"{pair}")) as A:
----> 3         data = A[p1:p2,:]
      4 
      5     df = pd.DataFrame({"price":np.array(data["coords"]["price"],dtype=np.float64),

tiledb/libtiledb.pyx in tiledb.libtiledb.SparseArrayImpl.__getitem__()

tiledb/libtiledb.pyx in tiledb.libtiledb.SparseArrayImpl.subarray()

tiledb/libtiledb.pyx in tiledb.libtiledb.SparseArrayImpl._read_sparse_subarray()

tiledb/libtiledb.pyx in tiledb.libtiledb.ReadQuery.__init__()

tiledb/libtiledb.pyx in tiledb.libtiledb.ReadQuery.__init__()

tiledb/libtiledb.pyx in tiledb.libtiledb._raise_ctx_err()

tiledb/libtiledb.pyx in tiledb.libtiledb._raise_tiledb_error()

TileDBError: Error: Internal TileDB uncaught exception; std::bad_alloc

Array configuration:

config = tiledb.Config()
config["sm.num_reader_threads"] = "8"
config["sm.num_writer_threads"] = "8"
config["sm.tile_cache_size"] = "10000000"

ctx = tiledb.Ctx(config)

dom = tiledb.Domain(
    # tiles = 1 cent increment
    tiledb.Dim(ctx=ctx,name="price", domain=(0, 9e12), tile=0.01, dtype=np.float64),
    # tiles = 1 day increment
    tiledb.Dim(ctx=ctx,name="date", domain=(0, 9e21), tile=86.4e12, dtype=np.float64))
schema = tiledb.ArraySchema(domain=dom, sparse=True,
                            attrs=[tiledb.Attr(name="data", dtype=np.float64,ctx=ctx)],
                            cell_order="row-major",tile_order="row-major",
                            capacity=int(1e9),ctx=ctx)

Leaving the ctx at default with:

ctx = tiledb.Ctx()

produces the same error, by the way.

Would appreciate any help with this! 🙂

Hi @Mtrl_Scientist,

Sorry for the delayed response here. I have not been able to reproduce this with TileDB-Py 0.5.8-0.6.3; however, that may be because I have more memory available locally (16 GB). Loading your data range with the schema as written peaks at about 14.5 GB of process memory (a very crude measurement with Activity Monitor on macOS).

A significant memory reduction comes from avoiding copies of the NumPy arrays when creating the DataFrame, either by removing the wrapping np.array calls or by passing copy=False. Please see my updated code below, which uses about 6 GB.

import tiledb
import pandas as pd
from pathlib import Path
import numpy as np
import os

config = tiledb.Config()
config["sm.num_reader_threads"] = "8"
config["sm.num_writer_threads"] = "8"
config["sm.tile_cache_size"] = "10000000"

tiledb.libtiledb.initialize_ctx(config)

#dom = tiledb.Domain(
#    # tiles = 1 cent increment
#    tiledb.Dim(ctx=ctx,name="price", domain=(0, 9e12), tile=0.01, dtype=np.float64),
#    # tiles = 1 day increment
#    tiledb.Dim(ctx=ctx,name="date", domain=(0, 9e21), tile=86.4e12, dtype=np.float64))
#schema = tiledb.ArraySchema(domain=dom, sparse=True,
#                            attrs=[tiledb.Attr(name="data", dtype=np.float64,ctx=ctx)],
#                            cell_order="row-major",tile_order="row-major",
#                            capacity=int(1e9),ctx=ctx)

def from_tileDB2(p1,p2,sdir,pair):
    import time
    start = time.time()

    with tiledb.open(os.path.join(sdir,f"{pair}")) as A:
        data = A[p1:p2,:]

    print("read elapsed: ", time.time() - start)
    
    # import pdb; pdb.set_trace()  # uncomment to pause here and inspect memory
    start = time.time()
    
    # 0.5.9
    #df = pd.DataFrame({
    #          "price": data["coords"]["price"],
    #          "date":  data["coords"]["date"].astype('M8[ns]'),
    #          "data":  data["data"]}
    #          ).set_index("price")

    # 0.6.0+
    #df = pd.DataFrame({
    #          "price": data["price"],
    #          "date":  data["date"].astype('M8[ns]'),
    #          "data":  data["data"]}
    #          ).set_index("price")

    # 0.6.0+ original
    # note: chaining .set_index(..., inplace=True) returns None, so the
    # inplace flag is dropped here to keep the return value a DataFrame
    df = pd.DataFrame({
              "price": np.array(data["price"], dtype=np.float64),
              "date":  np.array(data["date"], dtype='M8[ns]'),
              "data":  np.array(data["data"])}
              ).set_index("price")

    print("df elapsed: ", time.time() - start)

    return df

# Source Dir
sdir = Path(r"./")
# Array Name
pair = "btcusdt2"

# Price Range
p1 = 5500
p2 = 7000

# Array Query
df = from_tileDB2(p1,p2,sdir,pair)
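In other words, the suggested zero-copy construction is the commented 0.6.0+ variant above, with the chained inplace set_index removed:

df = pd.DataFrame({
          "price": data["price"],
          "date":  data["date"].astype('M8[ns]'),
          "data":  data["data"]}
          ).set_index("price")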

Thanks for looking into it. Sorry for the late reply, I didn’t see the notification.

With the latest TileDB version, I was able to consolidate the array (it takes ~40 min), which made it possible to read the 5500-7000 range.
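For reference, consolidation was along these lines (the plain tiledb.consolidate call, with the array URI as in the snippets above):

import os
import tiledb

# Merge the array's many fragments into fewer, larger ones.
tiledb.consolidate(os.path.join(sdir, f"{pair}"))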

The total array size is ~10 GB (no compression filters applied), so it should easily fit into memory.
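For completeness, adding a compression filter to the attribute would look something like this (I haven't applied one; ZstdFilter is just an example choice):

import numpy as np
import tiledb

# Example: compress the "data" attribute with Zstd.
attr = tiledb.Attr(name="data", dtype=np.float64,
                   filters=tiledb.FilterList([tiledb.ZstdFilter(level=3)]))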

But I still get the same error when trying to load the entire array, before even constructing a DataFrame from it.
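That is, this alone is enough to trigger it:

import os
import tiledb

# Full-domain read of the sparse array; the bad_alloc is raised here,
# before any DataFrame construction.
with tiledb.open(os.path.join(sdir, f"{pair}")) as A:
    data = A[:, :]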
(Screenshots attached showing memory usage before loading the data, during loading, and at the crash.)

TileDB-Py version: 0.6.3