Need help optimizing read performance

Hi,
Following up on a previous post, I’ve now gathered the full dataset (90GB) for a sparse 3D array that looks like this:

As discussed in the previous post, having 3 dimensions (pair, date, aggID), but only supplying 2 ranges in the query (because I don’t know the aggID range for a given pair) leads to poor read performance. The suggested solution was to downsample the date- & aggID ranges (i.e. to daily), and store the data as JSON in the array’s meta-data.

Example of mapping aggID range with date range:

image

Before I started mapping out the aggID range with the date range, I did some tests by first doing a query with only supplying the pair & date-range, and then compared that against how long it took to do the same query with a known aggID range as well.

The figure above, shows how the data generally looks like. Subarrays don’t all start at the same point in time, and some are even discontinued (due to de-listing of the asset). But all generally start at an aggID of 1 (there are exceptions due to exchange data issues). The aggID is monotonically increasing, and should always increase by 1 for each pair (there are some gaps due to exchange data issues). But as you can imagine, not all pairs have equally many entries. Thus, there are vast differences in the aggID range for each pair. The nonempty domain for the aggID range goes from 1 to 425’944’251.
So, naturally, the hit-rate for pairs that have a much smaller range will be poor.

By mapping the aggID range for some pairs against the date range as seen in the figure above (2nd chart), and then submitting a query with the pair, date, and aggID range, the performance increases and the query takes less time to complete. However, results are still not satisfactory, and in a lot of cases, I still get single-digit hit-rates for useful tiles.

So, I also tried to submit multi-range queries for each day (3rd chart in figure above) by supplying a list of ranges for the date- & aggID. However, I’ve found that the performance is either on-par with just supplying 1 query with the date- & aggID extents or slightly worse (probably due to overhead?).

First, I thought it might’ve been an issue due to there being too many fragments (7 fragments of a 90 GB array), but against my expectations, consolidating the array into a single fragment worsened the performance (especially the hit-rate for useful tiles).

Here are my results for the tests I ran according to the figure above:

I always used the same date_range of 1 month of data, and this function

def read_tiledb(pair,t1=None,t2=None,a1=None,a2=None):
    uri = Path(r'C:\Users\Fred\data\futures')
    uri = os.path.join(uri,'Binance_USD-M')
    date_range=slice(t1,t2)
    ID_range = slice(a1,a2)
    
    with tiledb.SparseArray(uri, mode='r') as A:
        df = A.query(
            attrs=(),
            dims=['date','aggID'],index_col=['date'],
            order='U',
            use_arrow=True).df[pair,date_range,ID_range]
    return df

df = read_tiledb(pair=['SXPUSDT'],
                 t1=np.datetime64('2021-03-07 00:00'),
                 t2=np.datetime64('2021-04-08 00:00'),
                 a1=42910059,
                 a2=57303046
                )

With 7 fragments:

With 1 fragment:

The first column is the query where only the pair and date range were submitted, the 2nd column is the percentage of useful tiles read for that query, the 3rd is the time for the query where pair, date, and aggID range were submitted, the 4th column is the percentage of useful cells read for that query, and the last two columns are the same query as the previous, but submitted as multi-range as described in the figure above.

As you can see, consolidating the fragments into 1, has worsened the hitrate for BTC & ETH when supplying all ranges, but improved it for SXP. In case of BTC, the read-time doubled as a consequence, ETH remained the same, and for SXP it dropped from 2min to 8s.

I don’t know what else I can do other than changing the array schema. I’d rather not allow_duplicates and get rid of the aggID dimension out of fear I’d accidentally make 2 identical writes. So, is there any other way I can improve the read performance for this 3D sparse array?

The public TileDB Cloud array uri is:
tiledb://Mtrl_Scientist/Binance_USD-M

Apologies for the late reply, I will look into this over the next couple of days.

1 Like

I don’t see an issue with the dimensions here. The only issue I can think of is the tile capacity, which if I recall is 10M in your array. That might be too large for certain pairs. What is the average cardinality of a query with a specific pair? For example, if BTC has 10M records but SPX has 1K records, then you are fetching at least 10M records whenever you query either BTC or SPX. Therefore, for BTC you are good, but for SPX you are fetching way too many redundant records.

If you have such skew, I would reduce the tile capacity to 10K.