I work at the UK Met Office, where we produce ~300 TB of data a day. Managing, accessing, and sharing this data is a constant challenge, and we are always looking for ways to do it better. I’m interested in how TileDB might help.
The main area I’m currently interested in is N-dimensional arrays, such as our weather or climate forecasts, with rich metadata (what’s the grid, which model run produced this, what is the validity time, etc.).
One area where we face a particular challenge is the data we share on AWS Earth (and due on AI for Earth on Azure soon): a 7-day rolling archive of approximately 7 TB a day. This is currently exposed as about 3.5 million objects in S3 (individual NetCDF files). Each object is a 2-, 3-, or 4-dimensional (x, y, z, ensemble) array for one forecast step of one model run. We would rather concatenate/join along both the forecast-step and model-run dimensions to create much higher-level objects representing the complete collection of runs for a given model and parameter. For our MOGREPS-G model this would look like a 6-D (x, y, z, ensemble, forecast step, forecast run time) array with shape approx (1280, 960, 17, 168, 28). This needs to be accompanied by a rich metadata set describing the dimension values along all of these axes (and other metadata).
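To make the desired join concrete, here is a minimal numpy sketch of the restructuring, using tiny hypothetical grid sizes in place of the real MOGREPS-G dimensions. It only illustrates the target layout, not how any particular storage engine would materialise it:

```python
import numpy as np

# Toy stand-in sizes; the real MOGREPS-G grid is far larger.
nx, ny, nz, n_ens = 4, 3, 2, 3   # spatial grid and ensemble members (hypothetical)
n_steps, n_runs = 5, 2           # forecast steps per run, number of model runs

# One 4-D (x, y, z, ensemble) field per (run, step) -- analogous to the
# current one-object-per-forecast-step layout on S3.
fields = [[np.random.rand(nx, ny, nz, n_ens) for _ in range(n_steps)]
          for _ in range(n_runs)]

# Join along forecast step, then forecast run time, giving one 6-D array.
runs = [np.stack(steps, axis=-1) for steps in fields]  # (x, y, z, ens, step)
cube = np.stack(runs, axis=-1)                         # (x, y, z, ens, step, run)
print(cube.shape)  # (4, 3, 2, 3, 5, 2)
```

The per-run and per-step dimension values (validity times, run times, levels) would live alongside this as metadata rather than in the array itself.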
We have looked at addressing this challenge with Zarr and, while promising, there are a number of issues. I’ve written more about this on our blog:
Challenges we hope TileDB might help with:
- Object storage. To keep costs manageable we are using object storage in one form or another.
- Big. 7 TB a day, rolling over a 7- or 14-day period. This represents 4 models and around 20 parameters, so it would probably not be expressed as a single dataset, but it’s still a lot of data.
- Rolling. As data is added at the end, it’s rolled off the beginning. This needs to be done efficiently, without rewriting all objects, while keeping metadata in sync at all times.
- Semi-sparse. Models run to different lengths at different times (sometimes 12 hours, sometimes 72), which creates a semi-sparse array that some tools don’t handle well.
- Rich metadata. It’s vital that the rich metadata associated with the datasets is maintained.
- Unknown access patterns. We want to expose this data for scientific and commercial activities that may entail a huge range of different access patterns. We want to design a system with a ‘thin client’ (such as the S3 API) that enables users to access the data efficiently (fetching only the bits they want) for whatever their use case is. In reality we are aiming for a middle ground between performance and flexibility, leaning towards flexibility.
- Interoperability with Python, preferably via xarray.
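To illustrate the rolling and semi-sparse points together, here is a toy numpy sketch, again with made-up small dimensions. It only shows the intended semantics; in practice the whole point is to roll the window without rewriting the array, which is where a fragment-based object-storage format would need to do the work:

```python
import numpy as np

# Toy rolling archive: keep the most recent `window` runs along the last axis.
window = 4
run_shape = (4, 3, 2, 3, 5)  # hypothetical (x, y, z, ensemble, forecast step)
cube = np.random.rand(*run_shape, window)

def roll_in(cube, new_run):
    """Drop the oldest run and append the newest; other runs are untouched."""
    return np.concatenate([cube[..., 1:], new_run[..., np.newaxis]], axis=-1)

# A short (e.g. 12-hour) run fills only 2 of the 5 forecast steps; the
# remaining steps are NaN, which is what makes the array semi-sparse.
short_run = np.full(run_shape, np.nan)
short_run[..., :2] = np.random.rand(*run_shape[:-1], 2)

cube = roll_in(cube, short_run)
print(cube.shape)                         # unchanged: (4, 3, 2, 3, 5, 4)
print(np.isnan(cube[..., 2:, -1]).all())  # True: padded steps of newest run
```

An in-memory copy like this is obviously not the storage design; the question is whether TileDB’s fragment model can give the same append/expire semantics directly against object storage.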
Hope this is an interesting topic for the community. I’d love to discuss further, and if this medium isn’t sufficiently high-bandwidth, perhaps we could organise a call with whoever is relevant.