Hi everyone,
I’m working with a TileDB array to store and query GEDI data. The array has spatial (latitude, longitude) and temporal dimensions, with the variables stored as attributes. I’d like feedback on two aspects:
- TileDB Schema: Does the structure of my TileDB array make sense for efficient querying?
- Query Optimization: Am I reading the data efficiently, or are there improvements I could make (e.g., indexing strategies, query execution optimizations, parallel reading)?
Here’s how to inspect my TileDB schema:
import tiledb
import os
# S3 TileDB context
tiledb_config = tiledb.Config(
    {
        "vfs.s3.endpoint_override": "https://s3.gfz-potsdam.de",
        "vfs.s3.region": "eu-central-1",
        "vfs.s3.no_sign_request": "true",
    }
)
ctx = tiledb.Ctx(tiledb_config)
# Read TileDB schema
bucket = "dog.gedidb.gedi-l2-l4-v002"
array_uri = os.path.join(f"s3://{bucket}", "array_uri")
with tiledb.open(array_uri, mode="r", ctx=ctx) as array:
    print(array.schema)
Below is an example of how I would query the data.
import tiledb
import os
# S3 TileDB context
tiledb_config = tiledb.Config(
    {
        "vfs.s3.endpoint_override": "https://s3.gfz-potsdam.de",
        "vfs.s3.region": "eu-central-1",
        "vfs.s3.no_sign_request": "true",
    }
)
ctx = tiledb.Ctx(tiledb_config)
# Path to the TileDB array
bucket = "dog.gedidb.gedi-l2-l4-v002"
array_uri = os.path.join(f"s3://{bucket}", "array_uri")
# Define query parameters
attr_list = ["agbd"]
lat_min = -17.140088
lat_max = -17.094909
lon_min = 145.606605
lon_max = 145.653595
start_time = 17532
end_time = 19929
# Read the data
with tiledb.open(array_uri, mode="r", ctx=ctx) as array:
    query = array.query(attrs=attr_list)
    data = query.multi_index[
        lat_min:lat_max, lon_min:lon_max, start_time:end_time
    ]
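On the parallel-reading point, one variation that could be tried is splitting the time range into sub-ranges and issuing one read per sub-range from a thread pool. The following is only a minimal sketch under assumptions: the time dimension is taken to be integer-valued (as the bounds above suggest), and `split_range`, `read_parallel`, and `make_tiledb_reader` are hypothetical helper names, not part of TileDB. Because `multi_index` slices include both endpoints, the sub-ranges are built so they do not overlap. Whether this is faster than a single query is unknown, since TileDB already parallelizes reads internally.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(start, end, n_chunks):
    """Split the inclusive integer range [start, end] into at most n_chunks
    contiguous, non-overlapping inclusive sub-ranges."""
    total = end - start + 1
    n_chunks = max(1, min(n_chunks, total))
    base, extra = divmod(total, n_chunks)
    ranges = []
    lo = start
    for i in range(n_chunks):
        # The first `extra` chunks get one additional element.
        size = base + (1 if i < extra else 0)
        hi = lo + size - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges


def read_parallel(read_chunk, ranges, max_workers=4):
    """Call read_chunk(t0, t1) for each sub-range in worker threads and
    return the results in the same order as `ranges`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda r: read_chunk(*r), ranges))


def make_tiledb_reader(array_uri, ctx, attrs, lat_range, lon_range):
    """Build a read_chunk(t0, t1) function for one spatial box.
    Hypothetical helper; assumes an integer time dimension."""
    import tiledb  # imported lazily so the helpers above work without it

    def read_chunk(t0, t1):
        # Each call opens the array itself so threads do not share a handle.
        with tiledb.open(array_uri, mode="r", ctx=ctx) as array:
            return array.query(attrs=attrs).multi_index[
                lat_range[0]:lat_range[1], lon_range[0]:lon_range[1], t0:t1
            ]

    return read_chunk
```

With the variables defined above, this would be used as `reader = make_tiledb_reader(array_uri, ctx, attr_list, (lat_min, lat_max), (lon_min, lon_max))` followed by `parts = read_parallel(reader, split_range(start_time, end_time, 4))`, which returns one result dictionary per sub-range.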
For reference, this is a visualisation of the fragment structure of my TileDB array.
I would appreciate any insights on whether my approach is well-optimized or if there are ways to improve it!
Thanks in advance!
Simon.