I have an array in S3 with ~96GB of data stored. The array contains ~660,604 objects. When I try to open the array, it hangs for minutes, although I can still read the schema. Does the number of fragments written affect the opening time, or are there other factors that might cause long open times?
Does the number of fragments written affect the opening time?
Yes, the library lists the __fragments prefix and eventually needs to read all of the fragment metadata files. To mitigate this, you can periodically run fragment metadata consolidation:
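A minimal sketch with the TileDB Python API (the S3 URI is a placeholder; the exact config keys may vary slightly between versions):

```python
import tiledb

uri = "s3://my-bucket/my-array"  # placeholder array URI

# Consolidate only the fragment metadata footers into a single file.
# This is lightweight and does not rewrite the data itself.
config = tiledb.Config({"sm.consolidation.mode": "fragment_meta"})
tiledb.consolidate(uri, config=config)
```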
Running consolidation on the fragment metadata did reduce opening times to ~4 seconds.
At the moment our array comprises ~700k objects in S3 and the fragment_metadata file is ~10.4MB, yet we've only ETL'd 2 days out of 20 years of data. Crudely extrapolating, once all of the data is copied over we could end up with on the order of 3 billion objects in S3 and a fragment_metadata file of roughly 30GB.
Do you think that read performance will significantly increase after we copy over all of the data?
Would consolidating fragments as we ETL help to reduce the size of the fragment_metadata file and help maintain read performance?
Circling back here from an offline discussion about the ingestion pattern: the suggested approach is to (1) ingest the historical data with the timestamp {start,end} range set to match the data's time range during write, and (2) perform fragment consolidation (not just fragment metadata consolidation) to improve data locality, which should also reduce the size of the fragment metadata.
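A rough sketch of that pattern with the Python API, assuming a dense array (the URI, timestamp, data, and slicing are illustrative, and a single write timestamp per batch is shown rather than an explicit {start,end} range):

```python
import numpy as np
import tiledb

uri = "s3://my-bucket/my-array"  # placeholder array URI

# (1) Write each historical batch at a timestamp that matches its data range,
#     so the fragment timestamps line up with the data being ingested.
write_ts = 1262304000000  # e.g. 2010-01-01 in ms since the epoch (illustrative)
with tiledb.open(uri, mode="w", timestamp=write_ts) as A:
    A[0:1000] = np.random.rand(1000)  # placeholder data and slicing

# (2) Periodically consolidate the fragments themselves (not just their metadata),
#     then vacuum the superseded fragments to remove the old objects from S3.
tiledb.consolidate(uri, config=tiledb.Config({"sm.consolidation.mode": "fragments"}))
tiledb.vacuum(uri, config=tiledb.Config({"sm.vacuum.mode": "fragments"}))
```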