Hi TileDB community,
I’m working on storing and querying GEDI (Global Ecosystem Dynamics Investigation) data in TileDB. GEDI data is collected along orbit tracks, and the resulting granules are hard to store and query efficiently. Each granule typically corresponds to a single orbit and covers a long, narrow strip, with large gaps between consecutive orbits. When written to TileDB, these granules create fragments whose spatial extent is large but only sparsely populated.
Fragment Characteristics
Here’s an example of a fragment’s metadata from my dataset:
{
'array_schema_name': '__1732123668781_1732123668781_5976fa50e139c128faf6bd5a2e980e70',
'cell_num': 18949923,
'has_consolidated_metadata': True,
'nonempty_domain': ((-11.561823504118008, 0.02811276523269317), # Latitude range
(-77.24830400728762, -68.9293460532509), # Longitude range
(18140, 18140), # Temporal range (day granule)
(0, 100)), # Profile dimension
'num': 1056,
'sparse': True,
'timestamp_range': (1732178482186, 1732178482186),
'uri': 's3://dog.gedidb.gedi-l2-l4-v002/profile_array_uri/__fragments/__1732178482186_1732178482186_77137cc23ceb31c0847aa8fb262e53a2_22',
'version': 22
}
The nonempty_domain spans roughly 11.6° of latitude and 8.3° of longitude because of the granule’s orbit, even though the actual data points cover only a small portion of that box. This leads to:
- Large spatial domains for each fragment, even though the data is sparse.
- Inefficient queries, since TileDB has to consider large spatial ranges that contain very little data.
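To make the overlap problem concrete, here is a small plain-Python sketch (no TileDB calls) that tests whether a fragment's non-empty domain intersects a query box. The helper name `domains_overlap` and the query box are made up for illustration, and the check reflects my understanding that a fragment can only be ruled out up front when its non-empty domain misses the query entirely.

```python
from typing import Sequence, Tuple

Range = Tuple[float, float]  # closed interval (low, high)


def domains_overlap(fragment_domain: Sequence[Range],
                    query_box: Sequence[Range]) -> bool:
    """Return True if the closed ranges intersect in every dimension."""
    return all(
        f_lo <= q_hi and q_lo <= f_hi
        for (f_lo, f_hi), (q_lo, q_hi) in zip(fragment_domain, query_box)
    )


# Latitude and longitude ranges of the fragment shown above.
fragment_domain = [
    (-11.561823504118008, 0.02811276523269317),  # latitude
    (-77.24830400728762, -68.9293460532509),     # longitude
]

# Hypothetical 0.1 degree query box somewhere inside that bounding box.
query_box = [(-5.0, -4.9), (-70.0, -69.9)]

# The fragment counts as a candidate even if no shots fall inside the box.
fragment_is_candidate = domains_overlap(fragment_domain, query_box)
```

With hundreds of orbit-shaped fragments, many of them pass this test for almost any small query box, which matches the slow queries I am seeing.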
Current Workflow
- Each granule is written independently to the array. For example, one granule corresponds to data collected in a single day for a specific orbit.
- Consolidation is performed after writing to reduce the fragment count. While this helps with metadata overhead, the spatial domain of each consolidated fragment remains large.
Challenges
- The large spatial domains created by orbit-based granules are slowing down queries, even after consolidation.
- I understand that writing the data in spatially coherent chunks can help, but given the nature of the data (long, narrow orbits), restructuring the data seems non-trivial.
Questions
- What is the best strategy for writing GEDI data to TileDB? Specifically:
  - Should I reorganize the data into fixed spatial tiles (e.g., 1° x 1°) before writing, even if this means re-tiling the data for each granule?
  - Would grouping data temporally (e.g., by year) and spatially (e.g., by tiles) before writing improve query performance?
- How should I approach consolidation?
  - Should I use incremental consolidation or custom spatial filters during the process to improve fragment compactness?
  - Is there a way to restructure the spatial domain during consolidation?
- Is there an alternative way to manage orbit-based data that maintains the native structure but avoids large spatial domains?
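For the re-tiling option, this is roughly what I have in mind: a minimal plain-Python sketch that groups a granule's shots into fixed-size cells before writing. The function `group_by_tile`, the 1° default, and the sample coordinates are illustrative only; the actual TileDB write is left out, and I have not measured whether this helps.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Shot = Tuple[float, float]   # (latitude, longitude)
TileKey = Tuple[int, int]    # (floor(lat / size), floor(lon / size))


def group_by_tile(shots: Iterable[Shot],
                  tile_size_deg: float = 1.0) -> Dict[TileKey, List[Shot]]:
    """Group shots by the fixed-size grid cell that contains them."""
    tiles: Dict[TileKey, List[Shot]] = defaultdict(list)
    for lat, lon in shots:
        key = (math.floor(lat / tile_size_deg),
               math.floor(lon / tile_size_deg))
        tiles[key].append((lat, lon))
    return dict(tiles)


# Made-up shots along one orbit track (not real GEDI coordinates).
shots = [(-11.2, -77.1), (-11.6, -77.2), (-0.5, -69.3), (0.02, -68.95)]
tiles = group_by_tile(shots)

# Each group could then be written separately (or batched with the same
# cell from other granules) so that a fragment covers one cell only.
for key, members in sorted(tiles.items()):
    print(key, len(members))
```

The idea would be that each write then covers a single cell instead of the whole orbit's bounding box, at the cost of producing more, smaller writes per granule.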
I appreciate any insights on optimizing GEDI data storage and query performance in TileDB.
Thank you!