What could affect array opening time?

ohana · July 12, 2023, 8:55pm

Hello,

I have an array in S3 holding ~96GB of data across ~660,604 objects. When I try to open the array, it hangs for minutes, although I can still read the schema. Does the number of fragments written affect the opening time? Or are there other factors that might cause long open times?

Running:

%time tiledb.array_exists(array_name)

returns:

Wall time: 4min 35s

True

ihnorton · July 13, 2023, 6:51pm

Hi @ohana,

(message was in akismet flag queue, sorry!)

> Does the number of fragments written affect the opening time?

Yes, the library lists the __fragments prefix, and eventually needs to read all of the fragment metadata files. In order to mitigate this, you can periodically run fragment metadata consolidation:

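import tiledb

# Consolidate only the fragment metadata, not the fragment data itself.
config = tiledb.Config({"sm.consolidation.mode": "fragment_meta"})
with tiledb.scope_ctx(config):
    tiledb.consolidate(array_uri)  # array_uri: URI of the array, e.g. its S3 path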

See full documentation for consolidation commands here. Running this should make a significant difference in time to open.

Best,
Isaiah

ohana · July 13, 2023, 9:59pm

Hi @ihnorton,

Thank you for your response.

Running consolidation on the fragment metadata did reduce opening times to ~4 seconds.

At the moment our array consists of ~700k objects in S3 and the fragment_metadata file is ~10.4MB. We’ve only ETL’d 2 days out of 20 years of data. If I crudely extrapolate, we could end up with on the order of 3 billion objects in S3, and the fragment_metadata file may grow to something like 30GB, after copying over all of the data.

Do you think that read performance will degrade significantly after we copy over all of the data?

Would consolidating fragments as we ETL help to reduce the size of the fragment_metadata file and maintain read performance?
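
For reference, the crude extrapolation in the post above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes both the object count and the fragment metadata size grow linearly with the number of days ingested and that no fragment consolidation happens along the way. The variable names are illustrative.

```python
# Figures observed after ingesting 2 days of data (taken from the post above).
objects_observed = 700_000
metadata_mb_observed = 10.4
days_ingested = 2

# Target: 20 years of data, ignoring leap days for a rough estimate.
days_total = 20 * 365

# Assumption: linear growth, with no fragment consolidation in between.
scale = days_total / days_ingested

objects_projected = objects_observed * scale
metadata_gb_projected = metadata_mb_observed * scale / 1000  # decimal GB

print(f"scale factor: {scale:.0f}x")
print(f"projected objects: {objects_projected / 1e9:.1f} billion")
print(f"projected fragment metadata: {metadata_gb_projected:.1f} GB")
```

Under these assumptions the projection comes out at roughly 2.6 billion objects and about 38 GB of fragment metadata, which agrees with the orders of magnitude quoted in the post.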

ihnorton · July 18, 2023, 12:47pm

Circling back here after an offline discussion about the ingestion pattern: the suggested approach is to (1) ingest the historical data with the timestamp {start,end} ranges set to match the data range during write, and (2) perform fragment consolidation (not just fragment metadata consolidation) to improve data locality, which should reduce the size of the fragment metadata.
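
A minimal sketch of what these two steps might look like in TileDB-Py follows. It is one possible reading of the suggestion, not the exact procedure from the offline discussion. The helper names (`day_to_timestamp_ms`, `write_day`, `consolidate_fragments`) are hypothetical, `write_day` assumes a sparse array with a single attribute, and the vacuum step is an addition not mentioned in the thread: consolidation leaves the old fragments in place until they are vacuumed.

```python
from datetime import date, datetime, timezone


def day_to_timestamp_ms(day: date) -> int:
    """Return midnight UTC of `day` as milliseconds since the Unix epoch."""
    midnight = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    return int(midnight.timestamp() * 1000)


def write_day(array_uri: str, day: date, coords, values) -> None:
    """Write one day's cells at a timestamp matching the data's date.

    Assumption: a sparse array with a single attribute, so that
    `arr[coords] = values` is a valid write.
    """
    import tiledb  # imported lazily so the date helper works without TileDB

    ts = day_to_timestamp_ms(day)
    with tiledb.open(array_uri, mode="w", timestamp=ts) as arr:
        arr[coords] = values


def consolidate_fragments(array_uri: str) -> None:
    """Merge fragments, vacuum the merged-away ones, then consolidate metadata."""
    import tiledb

    # Step 1: merge the fragment data itself and delete the old fragments.
    frag_cfg = tiledb.Config(
        {"sm.consolidation.mode": "fragments", "sm.vacuum.mode": "fragments"}
    )
    tiledb.consolidate(array_uri, config=frag_cfg)
    tiledb.vacuum(array_uri, config=frag_cfg)

    # Step 2: rebuild the consolidated fragment metadata for fast opens.
    meta_cfg = tiledb.Config({"sm.consolidation.mode": "fragment_meta"})
    tiledb.consolidate(array_uri, config=meta_cfg)
```

Running `consolidate_fragments` periodically during the ETL (for example after each batch of days) rather than once at the end would keep the fragment count bounded as the ingestion proceeds; how often to run it is a judgment call that depends on the workload.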
