TileDBError: Error: Internal TileDB uncaught exception; std::bad_alloc

Mtrl_Scientist · April 6, 2020, 11:13am

Hi again,

I’m having issues reading out a particular value range.
My array goes from price values of $0 to $70’000, but I cannot read out the range between $5’500 and $7’000 in one go. The only way to read this range is iteratively in smaller chunks (i.e. $250 increments) and then concatenate the results, but this is not ideal as it takes much longer.
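The chunked workaround described above can be sketched roughly as follows. This is only an illustration: `price_chunks` and `read_in_chunks` are hypothetical helper names, and `read_fn` stands in for whatever performs the actual query (for example a wrapper around `A[lo:hi, :]`). How the upper bound of a floating-point slice is treated is not addressed here, so cells that sit exactly on a chunk boundary may need de-duplication.

```python
def price_chunks(p1, p2, step):
    """Yield consecutive (lo, hi) bounds covering p1..p2 in step-sized pieces."""
    if step <= 0:
        raise ValueError("step must be positive")
    lo = p1
    while lo < p2:
        hi = min(lo + step, p2)  # the last chunk is clipped to p2
        yield lo, hi
        lo = hi


def read_in_chunks(read_fn, p1, p2, step):
    """Call read_fn(lo, hi) for every chunk and collect the results in order."""
    return [read_fn(lo, hi) for lo, hi in price_chunks(p1, p2, step)]


# Stand-in reader that only records the requested bounds (no TileDB involved).
bounds = read_in_chunks(lambda lo, hi: (lo, hi), 5500, 7000, 250)
print(bounds)
```

With real data, `read_fn` would return one DataFrame per chunk, and the resulting list could be passed to `pd.concat`.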

Things I’ve tried:

Read out iteratively and save to a new array
Optimize new array in terms of tile layout and tile capacity

Unfortunately to no avail… The error persists, but really only in this narrow range.

You can find the data here (~1.5 GB, 300 M data points).
Code that produces the error:

import os
import tiledb
import pandas as pd
from pathlib import Path
import numpy as np

def from_tileDB2(p1,p2,sdir,pair):
    with tiledb.open(os.path.join(sdir,f"{pair}")) as A:
        data = A[p1:p2,:]

    df = pd.DataFrame({"price":np.array(data["coords"]["price"],dtype=np.float64),
              "date":np.array(data["coords"]["date"],dtype='datetime64[ns]'),
              "data":np.array(data["data"],dtype=np.float64)}).set_index("price")
    return df

# Source Dir
sdir = Path(r"Your_Path")
# Array Name
pair = "btcusdt2"

# Price Range
p1 = 5500
p2 = 7000

# Array Query
df = from_tileDB2(p1,p2,sdir,pair)

Traceback:

TileDBError                               Traceback (most recent call last)
<ipython-input-14-5cd2827a3f2c> in <module>
     14 p2 = 7000
     15 
---> 16 df = from_tileDB2(p1,p2,sdir,pair)

<ipython-input-14-5cd2827a3f2c> in from_tileDB2(p1, p2, sdir, pair)
      1 def from_tileDB2(p1,p2,sdir,pair):
      2     with tiledb.open(os.path.join(sdir,f"{pair}")) as A:
----> 3         data = A[p1:p2,:]
      4 
      5     df = pd.DataFrame({"price":np.array(data["coords"]["price"],dtype=np.float64),

tiledb/libtiledb.pyx in tiledb.libtiledb.SparseArrayImpl.__getitem__()

tiledb/libtiledb.pyx in tiledb.libtiledb.SparseArrayImpl.subarray()

tiledb/libtiledb.pyx in tiledb.libtiledb.SparseArrayImpl._read_sparse_subarray()

tiledb/libtiledb.pyx in tiledb.libtiledb.ReadQuery.__init__()

tiledb/libtiledb.pyx in tiledb.libtiledb.ReadQuery.__init__()

tiledb/libtiledb.pyx in tiledb.libtiledb._raise_ctx_err()

tiledb/libtiledb.pyx in tiledb.libtiledb._raise_tiledb_error()

TileDBError: Error: Internal TileDB uncaught exception; std::bad_alloc

Array configuration:

config = tiledb.Config()
config["sm.num_reader_threads"] = "8"
config["sm.num_writer_threads"] = "8"
config["sm.tile_cache_size"] = "10000000"

ctx = tiledb.Ctx(config)

dom = tiledb.Domain(
    # tiles = 1 cent increment
    tiledb.Dim(ctx=ctx,name="price", domain=(0, 9e12), tile=0.01, dtype=np.float64),
    # tiles = 1 day increment
    tiledb.Dim(ctx=ctx,name="date", domain=(0, 9e21), tile=86.4e12, dtype=np.float64))
schema = tiledb.ArraySchema(domain=dom, sparse=True,
                            attrs=[tiledb.Attr(name="data", dtype=np.float64,ctx=ctx)],
                            cell_order="row-major",tile_order="row-major",
                            capacity=int(1e9),ctx=ctx)

Leaving the ctx at default with:

ctx = tiledb.Ctx()

Produces the same error by the way.

Would appreciate any help with this!

ihnorton · June 22, 2020, 4:11pm

Hi @Mtrl_Scientist,

Sorry for the delayed response here. I have not been able to reproduce this with TileDB-Py 0.5.8 through 0.6.3. However, that may be because I have more memory available locally (16GB). Loading your data range with the schema as written peaks at about 14.5GB of process memory (a very crude measurement with Activity Monitor on macOS).

A significant memory reduction is to avoid the copy of the numpy arrays when creating the dataframe, either by removing the wrapping np.array calls, or passing copy=False. Please see my updated code below, which uses about 6GB.

import tiledb
import pandas as pd
from pathlib import Path
import numpy as np
import os

config = tiledb.Config()
config["sm.num_reader_threads"] = "8"
config["sm.num_writer_threads"] = "8"
config["sm.tile_cache_size"] = "10000000"

tiledb.libtiledb.initialize_ctx(config)

#dom = tiledb.Domain(
#    # tiles = 1 cent increment
#    tiledb.Dim(ctx=ctx,name="price", domain=(0, 9e12), tile=0.01, dtype=np.float64),
#    # tiles = 1 day increment
#    tiledb.Dim(ctx=ctx,name="date", domain=(0, 9e21), tile=86.4e12, dtype=np.float64))
#schema = tiledb.ArraySchema(domain=dom, sparse=True,
#                            attrs=[tiledb.Attr(name="data", dtype=np.float64,ctx=ctx)],
#                            cell_order="row-major",tile_order="row-major",
#                            capacity=int(1e9),ctx=ctx)

def from_tileDB2(p1,p2,sdir,pair):
    import time
    start = time.time()

    with tiledb.open(os.path.join(sdir,f"{pair}")) as A:
        data = A[p1:p2,:]

    print("read elapsed: ", time.time() - start)
    
    start = time.time()
    
    # 0.5.9
    #df = pd.DataFrame({
    #          "price": data["coords"]["price"],
    #          "date":  data["coords"]["date"].astype('M8[ns]'),
    #          "data":  data["data"]}
    #          ).set_index("price")

    # 0.6.0+
    #df = pd.DataFrame({
    #          "price": data["price"],
    #          "date":  data["date"].astype('M8[ns]'),
    #          "data":  data["data"]}
    #          ).set_index("price")

    # 0.6.0+ original
    # Note: set_index(..., inplace=True) returns None, so it is not used here.
    df = pd.DataFrame({
              "price": np.array(data["price"], dtype=np.float64),
              "date":  np.array(data["date"], dtype='M8[ns]'),
              "data":  np.array(data["data"])}
              ).set_index("price")
 

    print("df elapsed: ", time.time() - start)

    return df

# Source Dir
sdir = Path(r"./")
# Array Name
pair = "btcusdt2"

# Price Range
p1 = 5500
p2 = 7000

# Array Query
df = from_tileDB2(p1,p2,sdir,pair)
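The copy behaviour mentioned above can be checked in isolation. The snippet below is a minimal sketch using a small stand-in array rather than a real query result; `col`, `copied`, and `viewed` are illustrative names only. It shows that `np.array` copies by default, while `np.asarray` returns the existing buffer when the dtype already matches.

```python
import numpy as np

# Stand-in for one column of a query result (already float64).
col = np.arange(5, dtype=np.float64)

copied = np.array(col, dtype=np.float64)    # np.array copies by default
viewed = np.asarray(col, dtype=np.float64)  # no copy when the dtype already matches

print(np.shares_memory(col, copied))  # False: a second buffer was allocated
print(np.shares_memory(col, viewed))  # True: same underlying buffer
```

A dtype conversion such as `.astype('M8[ns]')` on float64 data still allocates a new array, so the saving applies only to columns whose dtype is left unchanged.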

Mtrl_Scientist · July 4, 2020, 1:56pm

Thanks for looking into it. Sorry for the late reply, I didn’t see the notification.

With the latest TileDB version, I was able to consolidate the array (takes ~40 min), which made it possible to read the 5500 - 7000 range.

The total array size is ~10 GB (no compression filters applied), so it should easily fit into memory.
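As a rough back-of-the-envelope check of that expectation, the raw size of the result buffers can be estimated from the cell count. This is only a sketch: `estimate_result_bytes` is a hypothetical helper, the 300 M figure is taken from the first post and may not match the consolidated array, and real peak memory also includes internal buffers that this ignores.

```python
def estimate_result_bytes(n_cells, n_dims=2, n_attrs=1, itemsize=8):
    """Raw size of the coordinate and attribute buffers for n_cells cells."""
    return n_cells * (n_dims + n_attrs) * itemsize


n_cells = 300_000_000  # "300 M data points" from the first post (assumed here)
raw = estimate_result_bytes(n_cells)

print(raw / 1e9)      # GB for one copy of price, date, and data as float64
print(2 * raw / 1e9)  # GB if every column is copied once more
```

Under these assumptions, a single extra copy of each column is enough to roughly double the footprint of a full read.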

But I still get the same error when trying to load the entire array, before even making a dataframe from it.
(Screenshots not preserved: before loading data, during loading data, and at the crash.)

TileDB version: 0.6.3
