I have a specific use case that I’d like to address with TileDB. I want to create a TileDB array that aggregates several rasters. My end goal is to be able to query the data using filters on dimensions or attributes, such as coordinates or a range of values. This could become a very large array: I’d like to add data to it progressively, and it may eventually cover the entire Earth. This raises the question of which architecture would best balance write and read performance for such an array.
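For example, the kind of read we are ultimately aiming for looks roughly like this (just a sketch, assuming a sparse array with time/lat/lon dimensions, a 'data' attribute, and a recent TileDB-Py that supports query conditions):

import tiledb

with tiledb.open("./sample.tdb") as A:
    # Slice by dimensions (year 2021, a lat/lon window) and filter on the attribute value.
    result = A.query(attrs=["data"], cond="data > 0.5").multi_index[2021, 44.0:46.0, 4.0:6.0]
    print(result["data"])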
We have already tried handling this with a sparse array. However, writing a single raster (36000x36000 pixels) in one shot saturates our 32 GB of RAM. We then tried to distribute the writes in chunks using Dask, but the write operation becomes very slow (roughly 15 minutes per image). We believe the slowness is mainly due to the conversion of each chunk into 1D coordinate and value arrays.
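For context, here is a rough back-of-the-envelope estimate of the buffers a full-raster sparse write needs (assuming float64 lat/lon coordinates, an int32 time dimension and a float32 attribute; our actual dtypes may differ slightly):

# Back-of-the-envelope memory estimate for writing one full raster to a sparse array.
# Assumed dtypes: float64 lat/lon coordinates, int32 time, float32 attribute.
height = width = 36_000
cells = height * width                 # 1,296,000,000 non-empty cells
bytes_per_cell = 8 + 8 + 4 + 4         # lat + lon + time + data
print(f"~{cells * bytes_per_cell / 2**30:.0f} GiB")  # ~29 GiB before any TileDB or Dask overhead

That is already close to 32 GB before accounting for any library overhead, which matches the RAM saturation we observe.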
We also experimented with a dense array and Dask’s to_tiledb() function, which appears to significantly improve performance. However, this introduces other challenges: we need to pre-create a dense, empty array, we can’t load only a subset with Dask, and keeping uniform types across the dense array’s dimensions is awkward.
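To illustrate what we mean by pre-creating the array, here is roughly the dense-array setup we tried (a sketch only: the URIs, tile sizes and dtypes are illustrative, and we pass an already-opened writable array to to_tiledb()):

import dask.array as da
import numpy as np
import rasterio
import tiledb

# 1. Pre-create a dense, empty array sized for one 36000x36000 raster.
dom = tiledb.Domain(
    tiledb.Dim(name="row", domain=(0, 35999), tile=1024, dtype=np.uint32),
    tiledb.Dim(name="col", domain=(0, 35999), tile=1024, dtype=np.uint32),
)
schema = tiledb.ArraySchema(domain=dom, sparse=False,
                            attrs=[tiledb.Attr(name="data", dtype=np.float32)])
tiledb.Array.create("./dense_sample.tdb", schema)

# 2. Write the raster into the pre-created array block-wise via Dask.
with rasterio.open("./image.tiff") as src:
    darr = da.from_array(src.read(1).astype(np.float32), chunks=(1024, 1024))
with tiledb.open("./dense_sample.tdb", "w") as A:
    da.to_tiledb(darr, A)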
I’d appreciate any insights or recommendations on how to best approach this.
Here is an example of our code using the sparse array:
import dask.array as da
import numpy as np
import rasterio
import tiledb

with rasterio.open("./image.tiff", 'r') as src:
    data = src.read(1)
    print(data.shape)
    data = da.from_array(data, chunks='100 MiB')
    chunk_sizes = data.chunks
    print("Chunk sizes:", chunk_sizes)

    # Rows map to latitudes (top -> bottom), columns to longitudes (left -> right),
    # chunked to match the data blocks so blocks[i, j] lines up across all three arrays.
    print("create linspace")
    latitudes = da.linspace(src.bounds.top, src.bounds.bottom, src.height, chunks=chunk_sizes[0])
    longitudes = da.linspace(src.bounds.left, src.bounds.right, src.width, chunks=chunk_sizes[1])
    lats, longs = da.meshgrid(latitudes, longitudes, indexing='ij')

    with tiledb.open("./sample.tdb", 'w') as array:
        # Write one Dask block at a time to keep memory bounded.
        for i in range(len(chunk_sizes[0])):
            for j in range(len(chunk_sizes[1])):
                chunk_data = data.blocks[i, j].compute().ravel()
                chunk_lats = lats.blocks[i, j].compute().ravel()
                chunk_longs = longs.blocks[i, j].compute().ravel()
                time_chunk = np.full(chunk_data.shape, 2021, dtype=np.int32)
                array[time_chunk, chunk_lats, chunk_longs] = {'data': chunk_data}
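For reference, the sparse array at ./sample.tdb is created up front with a schema roughly like the following (exact domains and tile extents here are illustrative):

import numpy as np
import tiledb

dom = tiledb.Domain(
    tiledb.Dim(name="time", domain=(2000, 2100), tile=1, dtype=np.int32),
    tiledb.Dim(name="lat", domain=(-90.0, 90.0), tile=1.0, dtype=np.float64),
    tiledb.Dim(name="lon", domain=(-180.0, 180.0), tile=1.0, dtype=np.float64),
)
schema = tiledb.ArraySchema(domain=dom, sparse=True,
                            attrs=[tiledb.Attr(name="data", dtype=np.float32)])
tiledb.Array.create("./sample.tdb", schema)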