Optimal TileDB Structure and Querying Procedure for Spaceborne Lidar (GEDI) Data

I am working on an optimal TileDB architecture and querying routine for spaceborne Lidar data from the GEDI mission. GEDI data, organized by orbit, includes variables at different levels of granularity:

  • Shot-level Variables: Variables like pai (Plant Area Index) are stored at the shot level with latitude, longitude, and time dimensions.
  • Profile-level Variables: Other variables, such as rh (relative height) and pavd_z (plant area volume density by height), include an additional vertical profile_point dimension to capture the profile within each shot (a toy illustration of both levels follows below).
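
To make the two levels of granularity concrete, here is a toy illustration (all values invented) of how a single shot looks at each level:

import pandas as pd

# Shot level: one record per shot, keyed by (latitude, longitude, time)
shot_level = pd.DataFrame({
    "latitude": [7.25], "longitude": [38.61],
    "time": [1672574400000000],  # microseconds since the Unix epoch
    "pai": [2.31],
})

# Profile level: the same shot expands to one record per vertical profile_point
profile_level = pd.DataFrame({
    "latitude": [7.25] * 3, "longitude": [38.61] * 3,
    "time": [1672574400000000] * 3,
    "profile_point": [0, 1, 2],
    "pavd_z": [0.042, 0.035, 0.019],
})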

I would like feedback on whether the following structure and reading/writing methods align with TileDB best practices.

Current TileDB Structure and Methodology

  1. Data Structure and Array Setup:
  • Two Array Structure:
    • Scalar Array: Holds shot-level data using sparse indexing across latitude, longitude, and time.
    • Profile Array: Holds profile-level data with an additional profile_point dimension, allowing storage and retrieval of vertically indexed measurements for each shot.
  • Example Variables:
    • Scalar variables: pai, shot_number
    • Profile variables: rh, pavd_z
  2. Writing Data:
  • Data is written in a sparse format for both arrays to support efficient indexing and range queries.
  • The profile_point dimension in the profile array enables efficient storage and retrieval of profile information without redundancy.
  3. Reading Data:
  • Data retrieval uses multi_index to filter on latitude, longitude, and time ranges.
  • Queries for profile data include the profile_point dimension for subsetting within each shot.

Code Example for Reproducibility

The following code demonstrates the TileDB schema, write, and read processes I’m currently using.

Step 1: Initialize the Arrays


import tiledb
import numpy as np
import pandas as pd

# Configuration
lat_range = (-90.0, 90.0)
lon_range = (-180.0, 180.0)
# Time domain in microseconds since the Unix epoch; the int64 dimension needs integer bounds
time_range = (
    int(pd.Timestamp("2023-01-01").timestamp() * 1e6),
    int(pd.Timestamp("2023-12-31").timestamp() * 1e6),
)
max_profile_length = 101  # Profile levels per shot

# Scalar Array Schema
scalar_schema = tiledb.ArraySchema(
    domain=tiledb.Domain(
        tiledb.Dim(name="latitude", domain=lat_range, tile=0.1, dtype="float64"),
        tiledb.Dim(name="longitude", domain=lon_range, tile=0.1, dtype="float64"),
        tiledb.Dim(name="time", domain=time_range, tile=int(1e6), dtype="int64"),
    ),
    attrs=[tiledb.Attr(name="pai", dtype="float64")],
    sparse=True
)

# Profile Array Schema
profile_schema = tiledb.ArraySchema(
    domain=tiledb.Domain(
        tiledb.Dim(name="latitude", domain=lat_range, tile=0.1, dtype="float64"),
        tiledb.Dim(name="longitude", domain=lon_range, tile=0.1, dtype="float64"),
        tiledb.Dim(name="time", domain=time_range, tile=int(1e6), dtype="int64"),
        tiledb.Dim(name="profile_point", domain=(0, max_profile_length - 1), tile=1, dtype="int32")
    ),
    attrs=[
        tiledb.Attr(name="rh", dtype="float64"),
        tiledb.Attr(name="pavd_z", dtype="float64"),
        tiledb.Attr(name="shot_number", dtype="int64")
    ],
    sparse=True
)

# Create arrays (local URIs for example)
tiledb.Array.create("scalar_array", scalar_schema)
tiledb.Array.create("profile_array", profile_schema)

Step 2: Write Sample Data to Arrays


# Sample scalar data
scalar_data = {
    "latitude": np.random.uniform(lat_range[0], lat_range[1], 100),
    "longitude": np.random.uniform(lon_range[0], lon_range[1], 100),
    "time": np.random.randint(time_range[0], time_range[1], 100),
    "pai": np.random.rand(100)
}

# Write scalar data
with tiledb.open("scalar_array", mode="w") as array:
    array[scalar_data["latitude"], scalar_data["longitude"], scalar_data["time"]] = {"pai": scalar_data["pai"]}

# Sample profile data
profile_data = {
    "latitude": np.repeat(np.random.uniform(lat_range[0], lat_range[1], 10), max_profile_length),
    "longitude": np.repeat(np.random.uniform(lon_range[0], lon_range[1], 10), max_profile_length),
    "time": np.repeat(np.random.randint(time_range[0], time_range[1], 10), max_profile_length),
    "profile_point": np.tile(np.arange(max_profile_length), 10),
    "rh": np.random.rand(10 * max_profile_length),
    "pavd_z": np.random.rand(10 * max_profile_length),
    "shot_number": np.repeat(np.random.randint(1e10, 1e12, 10), max_profile_length)
}

# Write profile data
with tiledb.open("profile_array", mode="w") as array:
    array[profile_data["latitude"], profile_data["longitude"], profile_data["time"], profile_data["profile_point"]] = {
        "rh": profile_data["rh"],
        "pavd_z": profile_data["pavd_z"],
        "shot_number": profile_data["shot_number"]
    }
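
One thing I am unsure about: each write above creates a new fragment, and per-orbit ingestion will produce thousands of them. My current plan (a sketch, untested at scale) is to inspect fragments and consolidate periodically:

# Each write creates a fragment; many small fragments degrade read performance
frags = tiledb.array_fragments("profile_array")
print("fragment count:", len(frags))

tiledb.consolidate("profile_array")  # merge small fragments into larger ones
tiledb.vacuum("profile_array")       # remove the now-redundant originals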

Step 3: Read Data Using multi_index Queries


def read_scalar_data(lat_min, lat_max, lon_min, lon_max, start_time, end_time, variables=("pai",)):
    with tiledb.open("scalar_array", mode="r") as array:
        query = array.query(attrs=list(variables))
        # multi_index bounds are inclusive on both ends
        return query.multi_index[lat_min:lat_max, lon_min:lon_max, start_time:end_time]

def read_profile_data(lat_min, lat_max, lon_min, lon_max, start_time, end_time,
                      variables=("rh", "pavd_z", "shot_number")):
    with tiledb.open("profile_array", mode="r") as array:
        query = array.query(attrs=list(variables))
        # The trailing ":" selects the full profile_point range within each shot
        return query.multi_index[lat_min:lat_max, lon_min:lon_max, start_time:end_time, :]

# Example range values for querying
lat_min, lat_max = 5.0, 25.0
lon_min, lon_max = 25.0, 50.0
start_time = int(pd.Timestamp("2023-01-01").timestamp() * 1e6)
end_time = int(pd.Timestamp("2023-12-31").timestamp() * 1e6)

print("Filtered Scalar Data:", read_scalar_data(lat_min, lat_max, lon_min, lon_max, start_time, end_time))
print("Filtered Profile Data:", read_profile_data(lat_min, lat_max, lon_min, lon_max, start_time, end_time))
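
When I want a DataFrame instead of a dict of arrays, I use the .df indexer with the same ranges:

# Same spatio-temporal subset, returned as a pandas DataFrame (coordinates become columns)
with tiledb.open("profile_array", mode="r") as array:
    df = array.df[lat_min:lat_max, lon_min:lon_max, start_time:end_time, :]
print(df.head())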

Questions for Optimization:

  1. Array Structure:
  • Given the two-array approach (scalar and profile arrays), would consolidating both data types into a single array with a profile_point dimension improve storage and access efficiency?
  2. Reading Efficiency:
  • The current querying uses multi_index on latitude, longitude, and time ranges. Are there alternative indexing or storage strategies in TileDB for enhancing query performance on such spatial-temporal data?
  3. Compression and Memory Management:
  • Any recommendations on compression and memory optimization when storing large profile data with an extensive profile_point dimension? (A sketch of the filter setup I am considering follows this list.)
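
For context on the third question, this is the direction I have been considering: ZSTD filter lists on both dimensions and attributes, reusing the configuration from Step 1 (the compression levels are guesses, not benchmarked):

# Sketch: per-dimension and per-attribute compression (levels are unbenchmarked guesses)
coord_filters = tiledb.FilterList([tiledb.ZstdFilter(level=7)])
attr_filters = tiledb.FilterList([tiledb.ZstdFilter(level=5)])

compressed_profile_schema = tiledb.ArraySchema(
    domain=tiledb.Domain(
        tiledb.Dim(name="latitude", domain=lat_range, tile=0.1, dtype="float64", filters=coord_filters),
        tiledb.Dim(name="longitude", domain=lon_range, tile=0.1, dtype="float64", filters=coord_filters),
        tiledb.Dim(name="time", domain=time_range, tile=int(1e6), dtype="int64", filters=coord_filters),
        tiledb.Dim(name="profile_point", domain=(0, max_profile_length - 1), tile=1, dtype="int32", filters=coord_filters),
    ),
    attrs=[
        tiledb.Attr(name="rh", dtype="float64", filters=attr_filters),
        tiledb.Attr(name="pavd_z", dtype="float64", filters=attr_filters),
        tiledb.Attr(name="shot_number", dtype="int64", filters=attr_filters),
    ],
    sparse=True,
)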

Additional Question on Query Optimization for Large Spatial Areas

Given that some queries will span large spatial regions and long timeframes, are there recommended strategies in TileDB to optimize read performance and memory usage for these cases? For instance:

  • Are there specific tiling configurations or indexing techniques that work best for extensive spatial and temporal ranges?
  • Would breaking down large spatial queries into smaller, more manageable regions (see the chunked-read sketch after this list) or using a TileDB-native approach (e.g., subarray tiling) yield better performance?
  • Any tips for balancing query speed and memory consumption when handling large, sparse datasets like GEDI?
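
To make the second bullet concrete, here is roughly what I mean by splitting a large region into smaller reads; the 5-degree step is arbitrary, and the example reuses the ranges defined above:

# Rough sketch: split a large bounding box into step-degree chunks and query each one.
# Note: multi_index bounds are inclusive, so shots lying exactly on a chunk edge can
# appear in two chunks; dedupe on coordinates (or shot_number) if that matters.
def read_in_chunks(array_uri, lat_min, lat_max, lon_min, lon_max,
                   start_time, end_time, variables=("pai",), step=5.0):
    results = []
    with tiledb.open(array_uri, mode="r") as array:
        query = array.query(attrs=list(variables))
        for lat0 in np.arange(lat_min, lat_max, step):
            for lon0 in np.arange(lon_min, lon_max, step):
                lat1 = min(lat0 + step, lat_max)
                lon1 = min(lon0 + step, lon_max)
                results.append(query.multi_index[lat0:lat1, lon0:lon1, start_time:end_time])
    return results

chunks = read_in_chunks("scalar_array", lat_min, lat_max, lon_min, lon_max, start_time, end_time)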

Thanks a lot for your help!