I am working on an optimal TileDB architecture and querying routine for spaceborne Lidar data from the GEDI mission. GEDI data, organized by orbit, includes variables at different levels of granularity:

- Shot-level Variables: Variables like `pai` (Plant Area Index) are stored at the shot level with `latitude`, `longitude`, and `time` dimensions.
- Profile-level Variables: Other variables, such as `rh` (relative height) and `pavd_z` (plant area volume density by height), include an additional vertical `profile_point` dimension to capture profiles within each shot.
I would like feedback on whether the following structure and reading/writing methods align with TileDB best practices.
Current TileDB Structure and Methodology
- Data Structure and Array Setup:
  - Two-Array Structure:
    - Scalar Array: holds shot-level data using sparse indexing across `latitude`, `longitude`, and `time`.
    - Profile Array: holds profile-level data with an additional `profile_point` dimension, allowing storage and retrieval of vertically indexed measurements for each shot.
  - Example Variables:
    - Scalar variables: `pai`, `shot_number`
    - Profile variables: `rh`, `pavd_z`
- Writing Data:
  - Data is written in a sparse format for both arrays to support efficient indexing and range queries.
  - The `profile_point` dimension in the profile array enables efficient storage and retrieval of profile information without redundancy.
- Reading Data:
  - Data retrieval uses `multi_index` with filters on latitude, longitude, and time range.
  - Queries for profile data include the `profile_point` dimension for subsetting within each shot.
Code Example for Reproducibility
The following code demonstrates the TileDB schema, write, and read processes I’m currently using.
Step 1: Initialize the Arrays
import tiledb
import numpy as np
import pandas as pd
# Configuration
lat_range = (-90.0, 90.0)
lon_range = (-180.0, 180.0)
time_range = (
    int(pd.Timestamp("2023-01-01").timestamp() * 1e6),  # microseconds since epoch, as int64
    int(pd.Timestamp("2023-12-31").timestamp() * 1e6),
)
max_profile_length = 101 # Profile levels per shot
# Scalar Array Schema
scalar_schema = tiledb.ArraySchema(
    domain=tiledb.Domain(
        tiledb.Dim(name="latitude", domain=lat_range, tile=0.1, dtype="float64"),
        tiledb.Dim(name="longitude", domain=lon_range, tile=0.1, dtype="float64"),
        tiledb.Dim(name="time", domain=time_range, tile=int(1e6), dtype="int64"),
    ),
    attrs=[tiledb.Attr(name="pai", dtype="float64")],
    sparse=True,
)
# Profile Array Schema
profile_schema = tiledb.ArraySchema(
    domain=tiledb.Domain(
        tiledb.Dim(name="latitude", domain=lat_range, tile=0.1, dtype="float64"),
        tiledb.Dim(name="longitude", domain=lon_range, tile=0.1, dtype="float64"),
        tiledb.Dim(name="time", domain=time_range, tile=int(1e6), dtype="int64"),
        tiledb.Dim(name="profile_point", domain=(0, max_profile_length - 1), tile=1, dtype="int32"),
    ),
    attrs=[
        tiledb.Attr(name="rh", dtype="float64"),
        tiledb.Attr(name="pavd_z", dtype="float64"),
        tiledb.Attr(name="shot_number", dtype="int64"),
    ],
    sparse=True,
)
# Create arrays (local URIs for example)
tiledb.Array.create("scalar_array", scalar_schema)
tiledb.Array.create("profile_array", profile_schema)
Step 2: Write Sample Data to Arrays
# Sample scalar data
scalar_data = {
    "latitude": np.random.uniform(lat_range[0], lat_range[1], 100),
    "longitude": np.random.uniform(lon_range[0], lon_range[1], 100),
    "time": np.random.randint(time_range[0], time_range[1], 100, dtype=np.int64),
    "pai": np.random.rand(100),
}
# Write scalar data
with tiledb.open("scalar_array", mode="w") as array:
array[scalar_data["latitude"], scalar_data["longitude"], scalar_data["time"]] = {"pai": scalar_data["pai"]}
# Sample profile data
profile_data = {
    "latitude": np.repeat(np.random.uniform(lat_range[0], lat_range[1], 10), max_profile_length),
    "longitude": np.repeat(np.random.uniform(lon_range[0], lon_range[1], 10), max_profile_length),
    "time": np.repeat(np.random.randint(time_range[0], time_range[1], 10, dtype=np.int64), max_profile_length),
    "profile_point": np.tile(np.arange(max_profile_length), 10),
    "rh": np.random.rand(10 * max_profile_length),
    "pavd_z": np.random.rand(10 * max_profile_length),
    "shot_number": np.repeat(np.random.randint(int(1e10), int(1e12), 10, dtype=np.int64), max_profile_length),
}
# Write profile data
with tiledb.open("profile_array", mode="w") as array:
array[profile_data["latitude"], profile_data["longitude"], profile_data["time"], profile_data["profile_point"]] = {
"rh": profile_data["rh"],
"pavd_z": profile_data["pavd_z"],
"shot_number": profile_data["shot_number"]
}
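Since the data arrives per orbit, I expect to write it in many separate batches, which will accumulate many small fragments over time. My working assumption is that consolidating and vacuuming after a bulk ingestion pass is the right way to keep read performance reasonable; a minimal sketch of what I mean, continuing from the setup above (default consolidation settings, nothing tuned):

# Sketch: consolidate and vacuum after a bulk ingestion pass.
# Assumption: default consolidation settings are acceptable here.
for uri in ("scalar_array", "profile_array"):
    tiledb.consolidate(uri)  # merge the many small write fragments
    tiledb.vacuum(uri)       # remove the now-obsolete fragment files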
Step 3: Read Data Using `multi_index` Queries
def read_scalar_data(lat_min, lat_max, lon_min, lon_max, start_time, end_time, variables=["pai"]):
    with tiledb.open("scalar_array", mode="r") as array:
        query = array.query(attrs=variables)
        data = query.multi_index[lat_min:lat_max, lon_min:lon_max, start_time:end_time]
        print("Filtered Scalar Data:", data)

def read_profile_data(lat_min, lat_max, lon_min, lon_max, start_time, end_time, variables=["rh", "pavd_z", "shot_number"]):
    with tiledb.open("profile_array", mode="r") as array:
        query = array.query(attrs=variables)
        data = query.multi_index[lat_min:lat_max, lon_min:lon_max, start_time:end_time, :]
        print("Filtered Profile Data:", data)
# Example range values for querying
lat_min, lat_max = 5.0, 25.0
lon_min, lon_max = 25.0, 50.0
start_time = int(pd.Timestamp("2023-01-01").timestamp() * 1e6)
end_time = int(pd.Timestamp("2023-12-31").timestamp() * 1e6)
read_scalar_data(lat_min, lat_max, lon_min, lon_max, start_time, end_time)
read_profile_data(lat_min, lat_max, lon_min, lon_max, start_time, end_time)
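As a variant of the `multi_index` reads above, I have also been considering returning results directly as a pandas DataFrame via the `.df` indexer; a minimal sketch over the same query window (I am assuming `.df` is an acceptable alternative here, not necessarily the recommended pattern):

# Sketch: same spatial/temporal window, returned as a pandas DataFrame.
# Assumption: the .df indexer is a reasonable alternative to multi_index for this use case.
with tiledb.open("scalar_array", mode="r") as array:
    scalar_df = array.query(attrs=["pai"]).df[lat_min:lat_max, lon_min:lon_max, start_time:end_time]
print(scalar_df.head())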
Questions for Optimization:
- Array Structure:
  - Given the two-array approach (scalar and profile arrays), would consolidating both data types into a single array with a `profile_point` dimension improve storage and access efficiency? (A sketch of the consolidated schema I have in mind follows this list.)
- Reading Efficiency:
  - The current querying uses `multi_index` on latitude, longitude, and time ranges. Are there alternative indexing or storage strategies in TileDB for enhancing query performance on such spatio-temporal data?
- Compression and Memory Management:
  - Any recommendations on compression and memory optimization when storing large profile data with an extensive `profile_point` dimension? (The sketch below also shows the attribute compression filters I have been considering.)
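For reference, this is roughly the consolidated, compressed schema I have in mind when asking the first and third questions. It is only a sketch under my own assumptions (shot-level values such as `pai` would be repeated across `profile_point`, and the Zstd level is a placeholder rather than a tuned value), not something I have benchmarked:

# Sketch: a single sparse array holding both shot-level and profile-level attributes.
# Assumptions: pai is repeated across profile_point (redundant but simple), and the
# Zstd compression level is a placeholder, not a tuned value.
combined_filters = tiledb.FilterList([tiledb.ZstdFilter(level=5)])

combined_schema = tiledb.ArraySchema(
    domain=tiledb.Domain(
        tiledb.Dim(name="latitude", domain=lat_range, tile=0.1, dtype="float64"),
        tiledb.Dim(name="longitude", domain=lon_range, tile=0.1, dtype="float64"),
        tiledb.Dim(name="time", domain=time_range, tile=int(1e6), dtype="int64"),
        tiledb.Dim(name="profile_point", domain=(0, max_profile_length - 1), tile=1, dtype="int32"),
    ),
    attrs=[
        tiledb.Attr(name="pai", dtype="float64", filters=combined_filters),  # shot-level, repeated per profile_point
        tiledb.Attr(name="rh", dtype="float64", filters=combined_filters),
        tiledb.Attr(name="pavd_z", dtype="float64", filters=combined_filters),
        tiledb.Attr(name="shot_number", dtype="int64", filters=combined_filters),
    ],
    sparse=True,
)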
Additional Question on Query Optimization for Large Spatial Areas: Given that some queries will span large spatial regions and long timeframes, are there recommended strategies in TileDB to optimize read performance and memory usage for these cases? For instance:
- Are there specific tiling configurations or indexing techniques that work best for extensive spatial and temporal ranges?
- Would breaking down large spatial queries into smaller, more manageable regions or using a TileDB-native approach (e.g., subarray tiling) yield better performance? (A rough sketch of the chunked approach I have in mind follows this list.)
- Any tips for balancing query speed and memory consumption when handling large, sparse datasets like GEDI?
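To make the second bullet above concrete, this is roughly the chunked-read pattern I have in mind for large regions. The 5-degree chunk size is an arbitrary assumption on my part, and I have not verified that it actually beats a single large `multi_index` call:

# Sketch: split a large bounding box into smaller lat/lon chunks and query each one.
# Assumptions: 5-degree chunks are an arbitrary choice; results are simply collected
# into a list of per-chunk dicts and could be concatenated afterwards.
def read_scalar_data_chunked(lat_min, lat_max, lon_min, lon_max, start_time, end_time,
                             variables=("pai",), chunk_deg=5.0):
    results = []
    with tiledb.open("scalar_array", mode="r") as array:
        query = array.query(attrs=list(variables))
        for lat0 in np.arange(lat_min, lat_max, chunk_deg):
            for lon0 in np.arange(lon_min, lon_max, chunk_deg):
                lat1 = min(lat0 + chunk_deg, lat_max)
                lon1 = min(lon0 + chunk_deg, lon_max)
                # multi_index ranges are inclusive, so adjacent chunks share their
                # boundary coordinates; deduplication is ignored in this sketch.
                results.append(query.multi_index[lat0:lat1, lon0:lon1, start_time:end_time])
    return results

chunks = read_scalar_data_chunked(lat_min, lat_max, lon_min, lon_max, start_time, end_time)
print(f"Fetched {len(chunks)} sub-region results")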
Thanks a lot for your help!