Managing Large Geospatial Arrays with TileDB

I have a specific use case that I’d like to address with TileDB. I want to create a TileDB array that aggregates several rasters. My end goal is to be able to query the data using filters on dimensions or attributes, such as coordinates or a range of values. This could potentially represent a very large array. I’d like to be able to add data to this array progressively, and it could potentially cover the entire Earth in the future. This brings up the question of choosing an architecture that would optimally manage both write and read performance for this array.

We have already tried handling this with a sparse array. However, writing a single raster (36000x36000 pixels) all at once exhausts our 32 GB of RAM. We then tried to distribute the writing in chunks using Dask, but the write operation becomes considerably slow (~15 minutes per image). We believe the slowness is mainly due to converting each chunk and its coordinates to 1D arrays.

We also experimented with a dense array and Dask’s to_tiledb() function, which significantly improves performance. However, this introduces other challenges: we need to pre-create a dense, empty array, we can’t load a subset with Dask, and we struggle to maintain uniform types across the dense array’s dimensions.
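
For reference, the simplest form of the Dask path we tested looks roughly like the sketch below; the paths and chunk sizes are placeholders, and for the global array we would still need to pre-create the full schema first, which is the challenge mentioned above.

import dask.array as da
import rasterio

with rasterio.open("./image.tiff") as src:
    # Load one band and split it into regular blocks for parallel writing.
    data = da.from_array(src.read(1), chunks=(6000, 6000))

# Write the chunked Dask array to a new dense TileDB array (placeholder URI).
data.to_tiledb("./image_dense.tdb")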

I’d appreciate any insights or recommendations on how to best approach this.

Here is an example of our code with the sparse array:

import dask.array as da
import numpy as np
import rasterio
import tiledb

with rasterio.open("./image.tiff", 'r') as src:
    data = src.read(1)
    print(data.shape)
    data = da.from_array(data, chunks='100 MiB')
    chunk_sizes = data.chunks
    print("Chunk sizes:", chunk_sizes)

    print("create linspace")
    # Axis 0 of the raster runs north to south (latitude), axis 1 west to east (longitude).
    latitudes = da.linspace(src.bounds.top, src.bounds.bottom, src.height, chunks=chunk_sizes[0])
    longitudes = da.linspace(src.bounds.left, src.bounds.right, src.width, chunks=chunk_sizes[1])
    lats, longs = da.meshgrid(latitudes, longitudes, indexing='ij')

    with tiledb.open("./sample.tdb", 'w') as array:
        # Write one Dask block at a time, flattening each block and its
        # coordinates into 1D buffers for the sparse write.
        for i in range(len(chunk_sizes[0])):
            for j in range(len(chunk_sizes[1])):
                chunk_data = data.blocks[i, j].compute().ravel()
                chunk_lats = lats.blocks[i, j].compute().ravel()
                chunk_longs = longs.blocks[i, j].compute().ravel()
                time_chunk = np.full(chunk_data.shape, 2021, dtype=np.int32)

                array[time_chunk, chunk_lats, chunk_longs] = {'data': chunk_data}

Hi Baptiste,

Great use case for TileDB. There are a few points I’d like to clarify about your data and use case:

  • Regarding aggregating rasters, are you referring to a mosaic / stitching operation?
  • Is there a specific reason for selecting the sparse data model for this use case?
    • If this is to accommodate gaps in data coverage, the dense model will still work fine as long as the pixel size is always constant. Gaps will simply be represented as nodata and will not incur a disk space cost (see the short fill-value sketch after this list).
    • If it is about dimension support, our recently added dimension labels could be an avenue to resolve this.
  • Could you provide information on the array schema?
  • Is this any sort of publicly available data that you can provide us to reproduce on our end?
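
On the nodata point above, here is a minimal sketch of how an attribute fill value can act as the nodata value (the -9999 value is only illustrative): cells that are never written read back as the fill value and take up no disk space.

import numpy as np
import tiledb

# Unwritten cells in a dense array return the attribute's fill value,
# so coverage gaps have no storage cost and simply read back as nodata.
nodata = np.int32(-9999)  # illustrative nodata value
attr = tiledb.Attr(name="data", dtype=np.int32, fill=nodata)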

We’re happy to hear more about this use case!

Thank you for delving into our specific use case. Here are the responses to your questions:

  1. Stitching vs. Mosaicking: Currently, our focus is on stitching. Each of our rasters covers a distinct region without any overlap. When combined, these rasters span the entirety of the world.

  2. Sparse Array Choice: Our decision to go with a sparse array wasn’t based on strong justifications. We initially chose this approach because:

    • We had to manage dimensions of varied types, including dates, floats, and integers. A sparse array seemed to offer a straightforward solution.
    • Not all the rasters we use span the entire globe, so many areas have no data. However, I concur that with a consistent grid and fixed pixel size, a dense array might be a better fit. We’ve experimented with such an approach, in which we mapped the entire globe and managed indices rather than coordinates, using transformation functions akin to GDAL’s geotransform (sketched after this list).
    • Using a dense array might also reduce our computational requirements. From my understanding, there’s no need to flatten the array when inserting data into a dense array.
  3. Array Schema:

    • For our sparse array, the schema is as follows:

      import tiledb
      import numpy as np
      
      min_latitude, max_latitude = -90, 90
      min_longitude, max_longitude = -180, 180
      min_time, max_time = 1970, 2100
      
      domain = tiledb.Domain(
          tiledb.Dim(name="time", domain=(min_time, max_time), tile=1, dtype=np.int32),
          tiledb.Dim(name="latitude", domain=(min_latitude, max_latitude), tile=120, dtype=np.float64),
          tiledb.Dim(name="longitude", domain=(min_longitude, max_longitude), tile=120, dtype=np.float64),
      )
      
      schema = tiledb.ArraySchema(
          domain=domain, sparse=True, attrs=[tiledb.Attr(name="data", dtype=np.int32)]
      )
      
    • For our dense array tests, the schema looks like:

      import numpy as np
      import tiledb
      
      PIXELS_PER_DEGREE = 12000
      GLOBAL_WIDTH = PIXELS_PER_DEGREE * 360
      GLOBAL_HEIGHT = PIXELS_PER_DEGREE * 180
      
      uri = "./global_dense.tdb"  # placeholder path for the dense array
      
      dom = tiledb.Domain(
          tiledb.Dim(name="time", domain=(2000, 2030), tile=1, dtype=np.int32),
          tiledb.Dim(name="latitude", domain=(0, GLOBAL_HEIGHT - 1), tile=6000, dtype=np.int32),
          tiledb.Dim(name="longitude", domain=(0, GLOBAL_WIDTH - 1), tile=6000, dtype=np.int32),
      )
      
      schema = tiledb.ArraySchema(
          domain=dom, sparse=False, attrs=[tiledb.Attr(name="data", dtype=np.int32)]
      )
      
      tiledb.DenseArray.create(uri, schema)
      

    Do note that while the provided example covers a single-band raster, our rasters can contain multiple bands.

  4. Sample Data:

    • Single Band Raster: You can access it at s3://esa-worldcover/v200/2021/map/ESA_WorldCover_10m_2021_v200_S12E036_Map.tif. This data pertains to land cover, and while I’ve shared just one tile, the entire map can be accessed here. Additional tiles are also available via S3. You can list them using the command: aws s3 ls s3://esa-worldcover/v200/2021/map/.
    • Multi-Band Raster: sample.tiff - Google Drive.
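
To illustrate point 2 above, here is a rough sketch of the index-based dense approach: mapping latitude/longitude to global pixel indices in a GDAL-geotransform style and writing one raster tile into its slot in the pre-created dense array. The helper name, URI, and paths are placeholders.

import numpy as np
import rasterio
import tiledb

PIXELS_PER_DEGREE = 12000  # matches the dense schema above

def lonlat_to_index(lon, lat):
    # Geotransform-style mapping from coordinates to global pixel indices
    # (row 0 at 90N, column 0 at 180W).
    row = int(round((90.0 - lat) * PIXELS_PER_DEGREE))
    col = int(round((lon + 180.0) * PIXELS_PER_DEGREE))
    return row, col

uri = "./global_dense.tdb"  # placeholder: the pre-created dense array

with rasterio.open("./image.tiff") as src:
    tile = src.read(1).astype(np.int32)
    row0, col0 = lonlat_to_index(src.bounds.left, src.bounds.top)

with tiledb.open(uri, "w") as array:
    # Dense writes target an index range directly, with no flattening needed.
    array[2021:2022, row0:row0 + tile.shape[0], col0:col0 + tile.shape[1]] = tile[None, :, :]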

Don’t hesitate to reach out if you have any other questions, and thanks again for your help.

Thanks, Baptiste, for the information and the data links. We will have a look at this data and provide some best-practice examples for ingesting it into TileDB with optimal performance in mind.

Once we have that information together, would you be interested in jumping on a meeting to go over the suggestions? I believe we may have spoken in the past. If you’re open to it, I will send an email with some meeting openings.