Hello TileDB Community,
I am currently working with TileDB arrays stored in an S3 bucket (`array_uri`) and am encountering an issue where corrupted or missing fragments cause errors when accessing the array. Here is a detailed explanation of my setup, the issue, and how I write data:
Problem Description
When attempting to write or access my TileDB array stored in S3, I encounter the following error:
```
TileDBError: [TileDB::S3] Error: Cannot retrieve S3 object size; Error while listing file
s3://dog.gedidb.gedi-l2-l4-v002/array_uri/__fragments/__1736465921075_1736465921075_62934a75c55ee3114948ab18b416afcb_22/__fragment_metadata.tdb
[Error Type: 16] [HTTP Response Code: 404] [Remote IP: 139.17.228.42]
[Headers: 'accept-ranges' = 'bytes' 'content-length' = '241' 'content-type' = 'application/xml'
 'date' = 'Mon, 13 Jan 2025 15:34:21 GMT' 'strict-transport-security' = 'max-age=63072000'
 'x-amz-request-id' = 'tx0000072b4fc72c5819025-006785327d-8d64b349-default']
: No response body. (/project/tiledb/fragment.cc:117)
```
From what I can determine:
- The fragment `__1736465921075_1736465921075_62934a75c55ee3114948ab18b416afcb_22` does not exist physically in the S3 bucket.
- However, the TileDB metadata still references this fragment, causing the array to be inaccessible.
I suspect the issue arises from partial writes or network interruptions while data is being written. However, based on the TileDB documentation, incomplete fragments should be ignored during reads, since TileDB uses an “ok” file to signal the completion of a fragment. In this case, though, the array is completely inaccessible, which seems inconsistent with the expected behavior.
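For completeness, here is a minimal sketch of how the missing fragment can be confirmed directly on S3 with `tiledb.VFS` (the URIs are copied from the error message above, and `ctx` is assumed to be the same context I use for writing):

```python
import tiledb

# Check directly on S3 whether the referenced fragment still has any data.
# URIs are copied from the error message above; "ctx" is assumed to be the
# same TileDB context used by the writer.
vfs = tiledb.VFS(ctx=ctx)

fragment_uri = (
    "s3://dog.gedidb.gedi-l2-l4-v002/array_uri/__fragments/"
    "__1736465921075_1736465921075_62934a75c55ee3114948ab18b416afcb_22"
)

print(vfs.is_dir(fragment_uri))                                # False: no fragment data left
print(vfs.is_file(fragment_uri + "/__fragment_metadata.tdb"))  # False: the 404 from the error

# For comparison, list what is physically present under __fragments.
for uri in vfs.ls("s3://dog.gedidb.gedi-l2-l4-v002/array_uri/__fragments"):
    print(uri)
```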
How I Currently Write Data
Here’s a simplified version of my current `write_granule` function for writing granule data to the TileDB array:
```python
@retry(tiledb.cc.TileDBError, tries=10, delay=5, backoff=3)
def write_granule(self, granule_data: pd.DataFrame) -> None:
    # Validate granule data
    self._validate_granule_data(granule_data)

    # Prepare coordinates (dimensions)
    coords = self._prepare_coordinates(granule_data)

    # Extract data for scalar and profile variables
    data = self._extract_variable_data(granule_data)

    try:
        # Write to the TileDB array
        with tiledb.open(self.array_uri, mode="w", ctx=self.ctx) as array:
            dim_names = [dim.name for dim in array.schema.domain]
            dims = tuple(coords[dim_name] for dim_name in dim_names)
            array[dims] = data
    except tiledb.TileDBError as e:
        logger.error(f"Failed to write granule data to {self.array_uri}: {e}")
        raise
```
What I Have Tried
- Consolidation and Vacuuming: I attempted to consolidate and vacuum the array to clean up stale references (roughly as sketched below), but both operations also fail because of the missing fragment.
- Fragment Inspection: Using `tiledb.FragmentInfoList`, I cannot list the valid fragments either, because the missing fragment is still referenced in the metadata and the array cannot be accessed.
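For reference, a minimal sketch of both attempts, using the same `array_uri` and `ctx` as in `write_granule` above; both fail with the same 404 on the missing fragment’s metadata file:

```python
import tiledb

# Clean-up attempts described above; both fail because the fragment's
# __fragment_metadata.tdb object no longer exists on S3.
tiledb.consolidate(array_uri, ctx=ctx)
tiledb.vacuum(array_uri, ctx=ctx)

# Fragment inspection fails for the same reason, since loading the fragment
# info has to read each fragment's metadata.
fragments = tiledb.FragmentInfoList(array_uri, ctx=ctx)
for fragment in fragments:
    print(fragment.uri, fragment.timestamp_range)
```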
Questions
- Best Practices for Handling Corrupted Fragments:
  - How can I clean up the metadata to remove references to missing or corrupted fragments?
  - Is there a way to force TileDB to ignore invalid fragments during access or consolidation?
- Improving Write Robustness:
  - What is the recommended approach for writing data to S3 in a way that minimizes the risk of partial writes or missing fragments?
  - Would writing locally (e.g., to a temporary directory) and uploading to S3 afterward improve reliability?
- Preventing Future Issues:
  - Are there specific configurations (e.g., TileDB context settings or S3 options) that could help prevent this issue? A placeholder sketch of the kind of settings I have in mind follows below.
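To make that last question concrete, here is a minimal sketch of the kind of S3-related context settings I could imagine tuning. The keys come from the TileDB configuration docs, but the values (and the endpoint/region) are illustrative placeholders, not my actual setup:

```python
import tiledb

# Illustrative only: placeholder values, not my real configuration.
cfg = tiledb.Config({
    "vfs.s3.region": "eu-central-1",                # placeholder region
    "vfs.s3.endpoint_override": "s3.example.org",   # placeholder endpoint
    "vfs.s3.use_multipart_upload": "true",
    "vfs.s3.connect_max_tries": "10",
    "vfs.s3.connect_timeout_ms": "10000",
    "vfs.s3.request_timeout_ms": "30000",
})
ctx = tiledb.Ctx(cfg)
```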
Thank you for any guidance or recommendations you can provide! Let me know if you need additional details about my setup or workflow.
Cheers,
Simon