Slow AWS Data Slicing

Hi All,

I’m building an API that serves data from a TileDB array stored on AWS S3. Everything is working, but it’s doing so very slowly.

The problem is confined to the following code block.

with tiledb.open(array_dir, "r") as TileDB_array:     # Takes 1.7s
    data = TileDB_array[start_idx:end_idx]               # Takes 9.8s

I’ve already done a couple of things to get the timings down to what they are above: first I reduced the tile size from a year to a day, which gave a 4x speed-up, then I moved my API server to the same AWS region as the bucket, which gave a further 2x speed-up.
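
For reference, the tile extent is set on the datetime dimension when the schema is created. This is a minimal sketch of just that dimension, assuming the extent can be passed as a numpy timedelta (matching the current one-day, i.e. 1440-minute, setting):

import numpy as np
import tiledb

# Time dimension with a daily (1440-minute) tile extent; previously this was one year.
dim = tiledb.Dim(
    name="datetime",
    domain=(np.datetime64("2010-01-01T00:00"), np.datetime64("2025-01-01T00:00")),
    tile=np.timedelta64(1440, "m"),
    dtype="datetime64[m]",
)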

The problem is that, despite these speed-ups, it still takes ~11s to serve the data, which is far too slow for my requirements and far slower than the previous SQL set-up I had running.

Are there any general tips/tricks that can be used to speed up slicing and accessing data? I’m using a time-series index, but I can’t imagine that slows things down too much as it’s stored as integer values.

Thanks for all help and guidance!

Hello @AyrtonB,

Would you mind posting the array schema for us to look at? Could you also clarify a few things:

  1. What version of TileDB and TileDB-Py are you using? If you installed from pip, the TileDB-Py version is enough; if you installed from conda, please tell us both package versions.
  2. How many fragments are in the array? (One way to check this from Python is sketched just after this list.)
    2a. Roughly what is the size of each fragment? Are you writing a single cell per fragment?
  3. Have you ever run consolidation for the fragments or fragment metadata?
    3a. If so, did you also run a vacuum?
  4. How big is the range you are querying? end_idx - start_idx
    4a. How sparse is your dataset? If I remember correctly, you have one data point every 30 minutes?
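
If it helps with 2., here is one way to list the fragments from Python. This is a rough sketch, assuming a TileDB-Py version that exposes tiledb.array_fragments; the URI is a placeholder:

import tiledb

uri = "s3://my-bucket/my-array"  # placeholder for your real array URI

# Each separate write session normally produces one fragment.
fragments = tiledb.array_fragments(uri)
print("fragment count:", len(fragments))
for frag in fragments:
    print(frag.timestamp_range)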

I suspect the problem is a large number of fragments. When an array is opened, TileDB must fetch the fragment metadata from every non-consolidated fragment, and if your fragment count is high this can add up. To address this, there is an option to consolidate the fragment metadata. If your fragments are very small, consolidating the fragments themselves, and not just their metadata, might also provide a performance boost.
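
For reference, fragment metadata consolidation can be run from Python roughly like this (a minimal sketch with a placeholder URI, assuming the current config key names):

import tiledb

uri = "s3://my-bucket/my-array"  # placeholder for your real array URI

# Consolidate only the fragment metadata, so opening the array no longer has to
# fetch a separate metadata file from every fragment.
config = tiledb.Config({"sm.consolidation.mode": "fragment_meta"})
tiledb.consolidate(uri, config=config)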

Based on your responses we’ll put together a recommendation for how to improve your performance. 1.7s to open the array is definitely too slow, and that alone suggests a fragment-count issue.

Hi @seth, thanks for the quick reply!

Schema:

ArraySchema(
  domain=Domain(*[
    Dim(name='datetime', domain=(numpy.datetime64('2010-01-01T00:00'), numpy.datetime64('2025-01-01T00:00')), tile=1440 minutes, dtype='datetime64[m]'),
  ]),
  attrs=[
    Attr(name='ccgt', dtype='int32'),
    Attr(name='oil', dtype='int32'),
    Attr(name='coal', dtype='int32'),
    Attr(name='nuclear', dtype='int32'),
    Attr(name='wind', dtype='int32'),
    Attr(name='ps', dtype='int32'),
    Attr(name='npshyd', dtype='int32'),
    Attr(name='ocgt', dtype='int32'),
    Attr(name='other', dtype='int32'),
    Attr(name='intfr', dtype='int32'),
    Attr(name='intirl', dtype='int32'),
    Attr(name='intned', dtype='int32'),
    Attr(name='intew', dtype='int32'),
    Attr(name='biomass', dtype='int32'),
    Attr(name='intnem', dtype='int32'),
  ],
  cell_order='row-major',
  tile_order='row-major', sparse=True)
# note: filters omitted

Clarifications:

  1. tiledb 2.0.3 & tiledb-py 0.6.2 - installed via conda

  2. There are nearly 700 fragments; I’m adding a new fragment every half-hour as new data comes in (a sketch of this write pattern follows the list).

  3. No, I haven’t consolidated; I’ll test that now.

  4. The date range is 24 hours. One of the great things I’ve found with TileDB is that even when querying ~1 year it’s only marginally slower, so I’ve been operating under the assumption that the time range isn’t the root cause. Yes, it’s pretty sparse, with ~97% of the array empty since an entry is only added every 30 minutes.
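
For context on 2., the half-hourly append looks roughly like this. It’s a simplified sketch with a placeholder URI and made-up values; every write session like this produces its own fragment, which is where the ~700 come from:

import numpy as np
import tiledb

array_uri = "s3://my-bucket/fuel-mix"  # placeholder for the real array location

# One half-hourly observation; each write session creates a new fragment.
attr_names = [
    "ccgt", "oil", "coal", "nuclear", "wind", "ps", "npshyd", "ocgt",
    "other", "intfr", "intirl", "intned", "intew", "biomass", "intnem",
]
new_values = {name: np.array([0], dtype="int32") for name in attr_names}  # made-up values
timestamp = np.datetime64("2020-07-01T12:30", "m")

with tiledb.open(array_uri, "w") as A:
    A[[timestamp]] = new_values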

Fragmentation seems like it could be the issue; thanks for pointing me in the right direction. I’ll update the thread with the results of consolidation.
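
For anyone reading this later, the consolidation I’m about to test is roughly the following (a sketch assuming the standard config keys: full fragment consolidation followed by a vacuum of the superseded fragments):

import tiledb

array_uri = "s3://my-bucket/fuel-mix"  # placeholder for the real array location

# Merge the many small fragments into larger ones...
tiledb.consolidate(array_uri, config=tiledb.Config({"sm.consolidation.mode": "fragments"}))

# ...then remove the old fragments that consolidation has made redundant.
tiledb.vacuum(array_uri, config=tiledb.Config({"sm.vacuum.mode": "fragments"}))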

Consolidation brought it down to under 2s, with the array open now taking less than 0.2s.

Thanks again for the help @seth!

@AyrtonB glad this solved your issue. Let us know if you run into any other questions or issues.

Out of curiosity, if you don’t mind me asking, what was the previous SQL setup you were comparing against?

Thanks.

Not at all. It was a PostgreSQL DB with PostgREST serving the data; the big difference in speed was mainly due to the fact that the DB and the API were on the same server. After consolidation the speeds are on par now, and the overall workflow is much nicer; being able to keep the DB on S3 also means it integrates far more easily with our other services.
