Optimizing TileDB Query Performance for GEDI Data

Hi everyone,

I’m using a TileDB array to store and query GEDI data. The array has spatial (latitude, longitude) and temporal dimensions, with the variables stored as attributes. I’d like feedback on two aspects:

  1. TileDB Schema: Does the structure of my TileDB array make sense for efficient querying?
  2. Query Optimization: Am I reading the data efficiently, or are there improvements I could make (e.g., indexing strategies, query execution optimizations, parallel reading)?

Here is how to inspect my TileDB schema:

import tiledb

# TileDB context for anonymous (unsigned) S3 access
tiledb_config = tiledb.Config(
    {
        "vfs.s3.endpoint_override": "https://s3.gfz-potsdam.de",
        "vfs.s3.region": "eu-central-1",
        "vfs.s3.no_sign_request": "true",
    }
)
ctx = tiledb.Ctx(tiledb_config)

# Array location; plain string formatting keeps the S3 URI intact
# (os.path.join would insert backslashes on Windows)
bucket = "dog.gedidb.gedi-l2-l4-v002"
array_uri = f"s3://{bucket}/array_uri"

# Open the array read-only and print its schema
with tiledb.open(array_uri, mode="r", ctx=ctx) as array:
    print(array.schema)

Below is an example of how I would query the data.

import tiledb

# TileDB context for anonymous (unsigned) S3 access
tiledb_config = tiledb.Config(
    {
        "vfs.s3.endpoint_override": "https://s3.gfz-potsdam.de",
        "vfs.s3.region": "eu-central-1",
        "vfs.s3.no_sign_request": "true",
    }
)
ctx = tiledb.Ctx(tiledb_config)

# Path to the TileDB array; plain string formatting keeps the S3 URI intact
# (os.path.join would insert backslashes on Windows)
bucket = "dog.gedidb.gedi-l2-l4-v002"
array_uri = f"s3://{bucket}/array_uri"

# Define query parameters
attr_list = ["agbd"]
lat_min = -17.140088
lat_max = -17.094909
lon_min = 145.606605
lon_max = 145.653595
start_time = 17532
end_time = 19929

# Read the data
with tiledb.open(array_uri, mode="r", ctx=ctx) as array:
    query = array.query(attrs=attr_list)
    data = query.multi_index[
        lat_min:lat_max, lon_min:lon_max, start_time:end_time
    ]
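
The integer time bounds in the query are easier to read as calendar dates. The sketch below assumes the time dimension stores whole days since 1970-01-01; that is an assumption, not something the code above confirms, so check it against the printed schema. The helper name `days_since_epoch` is made up for illustration.

```python
from datetime import date

# Assumption: the time dimension counts whole days since the Unix epoch.
EPOCH = date(1970, 1, 1)


def days_since_epoch(d: date) -> int:
    """Return the number of whole days between 1970-01-01 and d."""
    return (d - EPOCH).days


# Under that assumption, the bounds used in the query map to these dates.
start_time = days_since_epoch(date(2018, 1, 1))   # 17532
end_time = days_since_epoch(date(2024, 7, 25))    # 19929
```

If the dimension uses a different unit or origin, only `EPOCH` and the `.days` conversion would need to change.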

For reference, this is a visualisation of the fragment structure in my TileDB array.

I would appreciate any insights on whether my approach is well-optimized or if there are ways to improve it!

Thanks in advance!

Simon.