Nonempty-Domain for Subarrays

Instead of having 500 separate arrays, I tried combining them all into one.

However, I realized that all subarrays have overlapping entry-IDs for one of their dimensions.

Hence, when I want to add new data, I have to load each entire subarray to figure out what its last entry-ID was, which takes forever because I have to load the subarrays one after the other to avoid running out of memory.

The way I previously checked the last entry-ID was by looking at the nonempty-domain.
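
For reference, here’s a minimal sketch of that check on one of the per-pair arrays (the URI is made up, and the dimension names ‘date’ and ‘aggID’ are taken from the schema shared further down):

import tiledb

# Each per-pair array has dimensions (date, aggID). nonempty_domain() returns
# one (min, max) tuple per dimension (or None for an empty array), so the last
# entry-ID is simply the max of the 'aggID' dimension.
with tiledb.open("some_pair_array") as A:  # hypothetical URI
    (date_min, date_max), (agg_min, agg_max) = A.nonempty_domain()
    last_entry_id = agg_max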

But there’s currently no way to figure out the non-empty domain for a subarray, is there?

I guess one workaround would be to store the last entry-IDs in a separate array and use that as a lookup.

Hi @Mtrl_Scientist,

Could you please clarify the setup here?

500 separate arrays

Are these all individual TileDB arrays?

If each “subarray” is a distinct sparse TileDB array, then on the ingestion side, each one will have its own nonempty_domain.

Best,
Isaiah

Hi @ihnorton,

I used to have 500 distinct sparse TileDB arrays.

But I’ve had some performance issues. So, I decided to put all the data into a single TileDB array.

So, instead of 500 distinct sparse TileDB arrays, each with its own nonempty_domain, I now have one sparse TileDB array with one nonempty_domain.

However, I now can’t look up each subarray’s last entry-ID without reading the full subarray, since the subarrays overlap in the domain and I’d need an individual nonempty_domain for each one.

My current workaround is just to have a 2nd array where I look up the last ID for each subarray before fetching & writing new data.

I was wondering if there was a more elegant solution.

Hi @Mtrl_Scientist, some more clarifying questions.

  1. When you combine all 500 arrays into one, do you create 500 attributes all sharing a single ND array domain? Or do you create a new dimension with 500 domain values? Could you please share the array schema in both cases (the old one with the 500 arrays and the new one with a single “combined” array)?

  2. Can you please share the write script (even in pseudocode)?

As a general note, you should be able to store any “small” info in the array metadata.
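
For example, something along these lines (a minimal sketch; the key name and value are made up):

import tiledb

# Record the last ingested entry-ID for a pair as array metadata...
with tiledb.open("my_array", mode="w") as A:  # hypothetical URI
    A.meta["last_aggID_DOTUSDT"] = 123456789

# ...and read it back before the next incremental write.
with tiledb.open("my_array") as A:
    last_id = A.meta["last_aggID_DOTUSDT"]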

Hi @stavros,

Sure!

  1. I have about 500 trading pairs. For the separate arrays, I use the coordinates ‘date’ (the date when the trade occurred) and ‘aggID’ (the trade ID). When combining them into a single array, I use ‘pair’ as an additional coordinate.

Example of single pair (individual arrays):

Example of multiple pairs (one combined array):

Schema for the former (one array per pair):
ArraySchema(
  domain=Domain(*[
    Dim(name='date', domain=(numpy.datetime64('1980-01-01T00:00:00.000000000'), numpy.datetime64('2100-01-01T00:00:00.000000000')), tile=604800000000000 nanoseconds, dtype='datetime64[ns]'),
    Dim(name='aggID', domain=(0, 9000000000000000000), tile=10000, dtype='int64'),
  ]),
  attrs=[
    Attr(name='price', dtype='float64', var=False, nullable=False, filters=FilterList([LZ4Filter(level=9), ])),
    Attr(name='quantity', dtype='float64', var=False, nullable=False, filters=FilterList([LZ4Filter(level=9), ])),
    Attr(name='first_tradeID', dtype='int64', var=False, nullable=False, filters=FilterList([LZ4Filter(level=9), ])),
    Attr(name='last_tradeID', dtype='int64', var=False, nullable=False, filters=FilterList([LZ4Filter(level=9), ])),
    Attr(name='is_buyer_maker', dtype='int8', var=False, nullable=False, filters=FilterList([LZ4Filter(level=9), ])),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=10000000,
  sparse=True,
  allows_duplicates=False,
  coords_filters=FilterList([ZstdFilter(level=-1), ])
)
Schema for the latter (single combined array):
ArraySchema(
  domain=Domain(*[
    Dim(name='pair', domain=(None, None), tile=None, dtype='|S0'),
    Dim(name='date', domain=(numpy.datetime64('1980-01-01T00:00:00.000000000'), numpy.datetime64('2100-01-01T00:00:00.000000000')), tile=604800000000000 nanoseconds, dtype='datetime64[ns]'),
    Dim(name='aggID', domain=(0, 9000000000000000000), tile=10000, dtype='int64'),
  ]),
  attrs=[
    Attr(name='price', dtype='float64', var=False, nullable=False, filters=FilterList([LZ4Filter(level=9), ])),
    Attr(name='quantity', dtype='float64', var=False, nullable=False, filters=FilterList([LZ4Filter(level=9), ])),
    Attr(name='first_tradeID', dtype='int64', var=False, nullable=False, filters=FilterList([LZ4Filter(level=9), ])),
    Attr(name='last_tradeID', dtype='int64', var=False, nullable=False, filters=FilterList([LZ4Filter(level=9), ])),
    Attr(name='is_buyer_maker', dtype='int8', var=False, nullable=False, filters=FilterList([LZ4Filter(level=9), ])),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=10000000,
  sparse=True,
  allows_duplicates=False,
  coords_filters=FilterList([ZstdFilter(level=-1), ])
)
Here are the two functions for writing an array for each pair separately, and for all pairs combined:
import os

import numpy as np
import pandas as pd
import tiledb

def to_tiledb500(df,uri):
    config = tiledb.Config({
            "sm.tile_cache_size":str(30_000_000),
            "sm.consolidation.step_min_frags":"5",
            "sm.consolidation.step_max_frags":"5",
            "sm.consolidation.steps":"10",
            "sm.consolidation.buffer_size":str(30_000_000),
            "sm.consolidation.step_size_ratio": "0.1",
            "sm.dedup_coords":'true',
            "py.init_buffer_bytes": str(1024**2 * 400)
        })
    ctx = tiledb.Ctx(config)

    # Domain
    dom = tiledb.Domain(
    #         tiledb.Dim(name="pair", domain=(None,None), tile=None, dtype=np.bytes_),
        tiledb.Dim(name="date", domain=(np.datetime64('1980-01-01'), np.datetime64("2100-01-01")),
                   tile=np.timedelta64(1, 'W'), dtype="datetime64[ns]"),
        tiledb.Dim(name="aggID", domain=(0, 9e18), tile=1e4, dtype=np.int64),)

    # List of available filters
    bit_shuffle = tiledb.BitShuffleFilter()
    byte_shuffle = tiledb.ByteShuffleFilter()
    RLE = tiledb.RleFilter()
    double_delta_encoding = tiledb.DoubleDeltaFilter()
    positive_delta_encoding = tiledb.PositiveDeltaFilter()
    bit_width_reduction = tiledb.BitWidthReductionFilter(window=int(1e3))
    gzip = tiledb.GzipFilter(level=9)
    lz4 = tiledb.LZ4Filter(level=9)
    bzip2 = tiledb.Bzip2Filter(level=9)
    zstd = tiledb.ZstdFilter(level=4)

    # Attributes
    attrs = [
        tiledb.Attr(name="price",dtype=np.float64,ctx=ctx,
                   filters=tiledb.FilterList([lz4])),
        tiledb.Attr(name="quantity",dtype=np.float64,ctx=ctx,
                   filters=tiledb.FilterList([lz4])),
        tiledb.Attr(name="first_tradeID",dtype=np.int64,ctx=ctx,
                   filters=tiledb.FilterList([lz4])),
        tiledb.Attr(name="last_tradeID",dtype=np.int64,ctx=ctx,
                   filters=tiledb.FilterList([lz4])),        
        tiledb.Attr(name="is_buyer_maker",dtype=np.int8,ctx=ctx,
           filters=tiledb.FilterList([lz4])),
    ]
    # Schema
    schema = tiledb.ArraySchema(domain=dom, sparse=True,
                                attrs=attrs,
    #                                 coords_filters=[zstd],
                                cell_order="row-major",tile_order="row-major",
                                capacity=int(10e6),ctx=ctx,allows_duplicates=False)

    if not os.path.exists(uri):
        tiledb.SparseArray.create(uri,schema)
    with tiledb.SparseArray(uri, mode='w',ctx=ctx) as A:
        A[
#             df.index.get_level_values('pair').values,
            df.index.get_level_values('date').values,
            df.index.get_level_values('aggID').values,] = {
            "price":df.price.values,
            "quantity":df.quantity.values,
            "first_tradeID":df.first_tradeID.values,
            "last_tradeID":df.last_tradeID.values,
            "is_buyer_maker":df.is_buyer_maker.values
        }
        
def to_tiledb1(df,uri):
    config = tiledb.Config({
            "sm.tile_cache_size":str(30_000_000),
            "sm.consolidation.step_min_frags":"5",
            "sm.consolidation.step_max_frags":"5",
            "sm.consolidation.steps":"10",
            "sm.consolidation.buffer_size":str(30_000_000),
            "sm.consolidation.step_size_ratio": "0.1",
            "sm.dedup_coords":'true',
            "py.init_buffer_bytes": str(1024**2 * 400)
        })

    ctx = tiledb.Ctx(config)

    # Domain
    dom = tiledb.Domain(
        tiledb.Dim(name="pair", domain=(None,None), tile=None, dtype=np.bytes_),
        tiledb.Dim(name="date", domain=(np.datetime64('1980-01-01'), np.datetime64("2100-01-01")),
                   tile=np.timedelta64(1, 'W'), dtype="datetime64[ns]"),
        tiledb.Dim(name="aggID", domain=(0, 9e18), tile=1e4, dtype=np.int64),)

    # List of available filters
    bit_shuffle = tiledb.BitShuffleFilter()
    byte_shuffle = tiledb.ByteShuffleFilter()
    RLE = tiledb.RleFilter()
    double_delta_encoding = tiledb.DoubleDeltaFilter()
    positive_delta_encoding = tiledb.PositiveDeltaFilter()
    bit_width_reduction = tiledb.BitWidthReductionFilter(window=int(1e3))
    gzip = tiledb.GzipFilter(level=9)
    lz4 = tiledb.LZ4Filter(level=9)
    bzip2 = tiledb.Bzip2Filter(level=9)
    zstd = tiledb.ZstdFilter(level=4)

    # Attributes
    attrs = [
        tiledb.Attr(name="price",dtype=np.float64,ctx=ctx,
                   filters=tiledb.FilterList([lz4])),
        tiledb.Attr(name="quantity",dtype=np.float64,ctx=ctx,
                   filters=tiledb.FilterList([lz4])),
        tiledb.Attr(name="first_tradeID",dtype=np.int64,ctx=ctx,
                   filters=tiledb.FilterList([lz4])),
        tiledb.Attr(name="last_tradeID",dtype=np.int64,ctx=ctx,
                   filters=tiledb.FilterList([lz4])),        
        tiledb.Attr(name="is_buyer_maker",dtype=np.int8,ctx=ctx,
           filters=tiledb.FilterList([lz4])),
    ]
    # Schema
    schema = tiledb.ArraySchema(domain=dom, sparse=True,
                                attrs=attrs,
    #                                 coords_filters=[zstd],
                                cell_order="row-major",tile_order="row-major",
                                capacity=int(10e6),ctx=ctx,allows_duplicates=False)

    if not os.path.exists(uri):
        tiledb.SparseArray.create(uri,schema)
    with tiledb.SparseArray(uri, mode='w',ctx=ctx) as A:
        A[
            df.index.get_level_values('pair').values,
            df.index.get_level_values('date').values,
            df.index.get_level_values('aggID').values,] = {
            "price":df.price.values,
            "quantity":df.quantity.values,
            "first_tradeID":df.first_tradeID.values,
            "last_tradeID":df.last_tradeID.values,
            "is_buyer_maker":df.is_buyer_maker.values
        }

Here’s some sample data.

And here’s how to write it:

df = pd.read_csv('sample_data.csv').set_index(['pair','date','aggID'])

# Write to single-pair array
df.loc['DOTUSDT'].pipe(to_tiledb500,'DOTUSDT')
# Write to multi-pair array
df.pipe(to_tiledb1,'multi_pairs')
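
To sanity-check the write, the data can be read back as a pandas DataFrame (a quick sketch for the combined array):

import tiledb

# Read everything back from the combined array; .df[:] returns the full
# contents (coordinates + attributes) as a pandas DataFrame.
with tiledb.open('multi_pairs') as A:
    out = A.df[:]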

Thanks for the meta-data tip btw - it’s a more elegant solution! :+1:

Yeah, I think the second schema (single 3D array) is what would work best in this case. You may need to tune your tile extents and tile capacity for optimal performance based on your access patterns, but other than that it looks good. Please let us know if there are any other issues.
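
Concretely, the two knobs live in the write functions above; here is a sketch with placeholder values (not recommendations, since the right settings depend on how many cells a typical query touches):

import numpy as np
import tiledb

# Space-tile extent on the aggID dimension: affects how cells are ordered and
# grouped into physical tiles, and therefore how many tiles a range query has
# to touch.
agg_dim = tiledb.Dim(name="aggID", domain=(0, int(9e18)),
                     tile=int(1e6),  # placeholder extent
                     dtype=np.int64)

# Tile capacity: the number of cells per physical data tile, i.e. the unit of
# I/O and decompression for sparse reads; set when creating the schema, e.g.
# tiledb.ArraySchema(..., capacity=int(1e5), ...).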


After some testing with the new 3D array, I’ve noticed that it is much, much slower than the 2D array.

The following query doesn’t specify the aggID range:

aggID_range = slice(None,None)
df = A.query(attrs=('price','quoteQty',),dims=['date','pair','aggID'],
                     order='G',use_arrow=True).df[pair,date_range,aggID_range]

It gives a read time of 4.1 s, because only 5.2% of the cells read are useful.

Read Statistics (no aggID range)

==== READ ====

  • Number of read queries: 1

  • Number of attempts until results are found: 1

  • Number of attributes read: 2

    • Number of fixed-sized attributes read: 2
  • Number of dimensions read: 3

    • Number of fixed-sized dimensions read: 2
    • Number of var-sized dimensions read: 1
  • Number of logical tiles overlapping the query: 40

  • Number of physical tiles read: 240

    • Number of physical fixed-sized tiles read: 160
    • Number of physical var-sized tiles read: 80
  • Number of cells read: 40000000

  • Number of result cells: 2070424

  • Percentage of useful cells read: 5.17606%

  • Number of bytes read: 765945156 bytes (0.713342 GB)

  • Number of read operations: 1536

  • Number of bytes unfiltered: 1899251820 bytes (1.76882 GB)

  • Unfiltering inflation factor: 2.47962x

  • Time to compute estimated result size: 0.0030342 secs

    • Time to compute tile overlap: 0.0018967 secs

      Time to compute relevant fragments: 5.85e-05 secs
      Time to load relevant fragment R-trees: 0.0016049 secs
      Time to compute relevant fragment tile overlap: 0.0002261 secs

  • Time to open array: 0.0084028 secs

    • Time to open array without fragments: 0.0075885 secs
    • Time to load array schema: 0.0071948 secs
    • Time to list fragment uris: 0.0071774 secs
    • Time to load consolidated fragment metadata: 5e-07 secs
    • Time to load fragment metadata: 0.0007927 secs
  • Total metadata read: 449003 bytes (0.000418167 GB)

    • Array schema: 535 bytes (4.98258e-07 GB)
    • Fragment metadata: 4627 bytes (4.30923e-06 GB)
    • R-tree: 246329 bytes (0.000229412 GB)
    • Fixed-sized tile offsets: 141080 bytes (0.000131391 GB)
    • Var-sized tile offsets: 28216 bytes (2.62782e-05 GB)
    • Var-sized tile sizes: 28216 bytes (2.62782e-05 GB)
  • Time to load array metadata: 0.0381581 secs

    • Array metadata size: 20979 bytes (1.95382e-05 GB)
  • Time to initialize the read state: 7.05e-05 secs

    • Time to compute tile overlap: 0.0018967 secs

      Time to compute relevant fragments: 5.85e-05 secs
      Time to load relevant fragment R-trees: 0.0016049 secs
      Time to compute relevant fragment tile overlap: 0.0002261 secs

  • Read time: 4.10029 secs

    • Time to compute next partition: 0.0002067 secs
    • Time to compute result coordinates: 3.44748 secs

      Time to compute sparse result tiles: 0.0002506 secs
      Time to read coordinate tiles: 0.190496 secs
      Time to unfilter coordinate tiles: 0.599311 secs
      Time to compute range result coordinates: 2.58394 secs

    • Time to compute sparse result cell slabs: 0.0036852 secs
    • Time to copy result attribute values: 0.458187 secs

      Time to read attribute tiles: 0.315572 secs
      Time to unfilter attribute tiles: 0.130445 secs
      Time to copy fixed-sized attribute values: 0.0088685 secs

    • Time to copy result coordinates: 0.186646 secs

      Time to copy fixed-sized coordinates: 0.0089522 secs
      Time to copy var-sized coordinates: 0.0728606 secs

  • Total read query time (array open + init state + read): 4.10037 secs

==== Python Stats ====

  • TileDB-Py Indexing Time: 4.21903
    • TileDB-Py query execution time: 4.14337

      TileDB C++ Core initial query submit time: 4.1004

    • TileDB-Py buffer conversion time: 0.0700061

However, when supplying the aggID range with the following query, the hit rate and read performance improved a lot:

aggID_range = slice(414488709,420559132)
df = A.query(attrs=('price','quoteQty',),dims=['date','pair','aggID'],
                     order='G',use_arrow=True).df[pair,date_range,aggID_range]
Read Statistics (with aggID range)

==== READ ====

  • Number of read queries: 1

  • Number of attempts until results are found: 1

  • Number of attributes read: 2

    • Number of fixed-sized attributes read: 2
  • Number of dimensions read: 3

    • Number of fixed-sized dimensions read: 2
    • Number of var-sized dimensions read: 1
  • Number of logical tiles overlapping the query: 3

  • Number of physical tiles read: 18

    • Number of physical fixed-sized tiles read: 12
    • Number of physical var-sized tiles read: 6
  • Number of cells read: 3000000

  • Number of result cells: 2070424

  • Percentage of useful cells read: 69.0141%

  • Number of bytes read: 46860422 bytes (0.0436422 GB)

  • Number of read operations: 1338

  • Number of bytes unfiltered: 141221005 bytes (0.131522 GB)

  • Unfiltering inflation factor: 3.01365x

  • Time to compute estimated result size: 0.0021659 secs

    • Time to compute tile overlap: 0.001491 secs

      Time to compute relevant fragments: 5.55e-05 secs
      Time to load relevant fragment R-trees: 0.0013026 secs
      Time to compute relevant fragment tile overlap: 0.0001276 secs

  • Time to open array: 0.0021548 secs

    • Time to open array without fragments: 0.0005945 secs
    • Time to load array schema: 0.0003788 secs
    • Time to list fragment uris: 0.0009972 secs
    • Time to load consolidated fragment metadata: 5e-07 secs
    • Time to load fragment metadata: 0.0008504 secs
  • Total metadata read: 204653 bytes (0.000190598 GB)

    • Array schema: 535 bytes (4.98258e-07 GB)
    • Fragment metadata: 4627 bytes (4.30923e-06 GB)
    • R-tree: 152923 bytes (0.000142421 GB)
    • Fixed-sized tile offsets: 24160 bytes (2.25008e-05 GB)
    • Var-sized tile offsets: 4832 bytes (4.50015e-06 GB)
    • Var-sized tile sizes: 17576 bytes (1.63689e-05 GB)
  • Time to load array metadata: 0.035096 secs

    • Array metadata size: 20979 bytes (1.95382e-05 GB)
  • Time to initialize the read state: 4.9e-05 secs

    • Time to compute tile overlap: 0.001491 secs

      Time to compute relevant fragments: 5.55e-05 secs
      Time to load relevant fragment R-trees: 0.0013026 secs
      Time to compute relevant fragment tile overlap: 0.0001276 secs

  • Read time: 0.382407 secs

    • Time to compute next partition: 0.0001303 secs
    • Time to compute result coordinates: 0.238934 secs

      Time to compute sparse result tiles: 2.25e-05 secs
      Time to read coordinate tiles: 0.0075762 secs
      Time to unfilter coordinate tiles: 0.0447509 secs
      Time to compute range result coordinates: 0.124944 secs

    • Time to compute sparse result cell slabs: 0.0035241 secs
    • Time to copy result attribute values: 0.0546628 secs

      Time to read attribute tiles: 0.0212823 secs
      Time to unfilter attribute tiles: 0.0213769 secs
      Time to copy fixed-sized attribute values: 0.0086604 secs

    • Time to copy result coordinates: 0.0809236 secs

      Time to copy fixed-sized coordinates: 0.0056463 secs
      Time to copy var-sized coordinates: 0.0667185 secs

  • Total read query time (array open + init state + read): 0.382456 secs

==== Python Stats ====

  • TileDB-Py Indexing Time: 0.507036
    • TileDB-Py query execution time: 0.4315

      TileDB C++ Core initial query submit time: 0.382638

    • TileDB-Py buffer conversion time: 0.0710039

As you can see, the read time dropped to 0.38s (10x improvement!), and the percentage of useful cells read is now close to 70%.

I’ve included the last aggID, the date ranges, and the number of entries in the metadata, which would allow me to extrapolate the required aggID range. But since the number of entries per date interval is highly variable, I could still end up missing the correct aggID range by a lot.

Maybe it would be better to store a downsampled version of the coordinates in the metadata, but I’d love it if that were more flexible (maybe allow dictionaries?), e.g.:

A.meta[pair] = {
    'date':['2021-01-01','2021-01-02'],
    'aggID':[1000,4000]
}

Then I could just read the metadata first and, based on that, grab the approximate aggID range, for massively improved read performance.

Unfortunately, the metadata currently only supports same-type entries in tuples or as values. So, I could do something like this right now:

A.meta[f"{pair}_2021-01-01"] = 1000
A.meta[f"{pair}_2021-01-02"] = 4000
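
Reading those back to build the query range would then look something like this (a sketch; agg_id_slice is just a made-up helper around that key convention):

import tiledb

# Hypothetical helper: look up the stored aggID checkpoints for a pair at two
# dates (same "<pair>_<date>" key convention as above) and turn them into a
# slice for the read query.
def agg_id_slice(uri, pair, start_date, end_date):
    with tiledb.open(uri) as A:
        return slice(A.meta[f"{pair}_{start_date}"],
                     A.meta[f"{pair}_{end_date}"])

# aggID_range = agg_id_slice('multi_pairs', pair, '2021-01-01', '2021-01-02')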

Thoughts?

That’s because you constrained the range and now you intersect 3 logical tiles instead of 40.

Unfortunately, the metadata currently only supports same-type entries in tuples or as values.

@ihnorton can probably chime in here.

Question: Do you usually query data that you ingested most recently (in separate writes)? If that’s the case, we are adding new time-traveling functionality that allows you to focus on any arbitrary time interval within which you wrote to the array. That can speed up your queries greatly.

I do a bit of both. Most of the access is for recent data, but I also often access the entire date-range or do multi-range queries.

But I constantly consolidate & vacuum my arrays anyway due to frequent writes, so I can’t time-travel.
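
For reference, that maintenance step is roughly the following (a sketch using default consolidation settings rather than the tuned config from the write functions above):

import tiledb

# Merge the many small write fragments into fewer, larger ones, then vacuum
# (delete) the superseded fragments -- vacuuming is what discards the history
# that time travel would rely on.
tiledb.consolidate('multi_pairs')
tiledb.vacuum('multi_pairs')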

Yes, currently tuples must be homogeneous.

One option for an extended representation is JSON-encoding the dictionary and storing that as a string. We have used this in various applications, including to store metadata about pandas DataFrames. See json — JSON encoder and decoder — Python 3.12.0 documentation.

This should work fine for small blobs. I don’t think we have a limit on metadata blob size, but we only regularly test it up to 1 MB in the Python API. JSON encoding of complicated Python data structures may need some extra handling if accessing metadata from another language (R/C++/etc.), whereas TileDB native types should already be mapped pretty well by the respective APIs.
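
A minimal sketch of that approach, mirroring the dictionary from your earlier example:

import json
import tiledb

# Write: serialize the per-pair dictionary to a JSON string and store it under
# the pair name in the array metadata.
with tiledb.open('multi_pairs', mode='w') as A:
    A.meta['DOTUSDT'] = json.dumps({
        'date': ['2021-01-01', '2021-01-02'],
        'aggID': [1000, 4000],
    })

# Read: decode it back into a Python dict before building the query ranges.
with tiledb.open('multi_pairs') as A:
    checkpoints = json.loads(A.meta['DOTUSDT'])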


That might do the trick! :+1: