Long-Running Archives with TileDB

I’m curious about your recommendations for using TileDB as an archive for live-streaming AIS data. Is the preference to write a separate TileDB archive every day? Or can TileDB handle databases that never close?

The former would be a challenge to query; with the latter I’d worry about file integrity.

tks!

S

Hi Scott,

There are two routes you can take here, depending primarily on how close to real time your access to that data needs to be.

The preferred method would be to write your streams to an array that represents just ~1 day of data, then at the end of the day write that day’s worth of data into your archival array. This allows you to get away with smaller, less efficient fragments if you want to write close to real time (e.g. a 5-second, minute, or hour buffer depending on data rates).
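
For concreteness, here is a minimal sketch of what that daily array might look like in Python. The ais_day URI and the schema (a Unix-timestamp dimension with mmsi/lat/lon attributes) are assumptions for illustration, not a prescribed layout:

import numpy as np
import tiledb

# Hypothetical daily AIS array: sparse, keyed on a Unix-timestamp dimension.
# allows_duplicates lets multiple messages share the same timestamp.
dom = tiledb.Domain(
    tiledb.Dim(name="ts", domain=(0, 2**62), tile=3600, dtype=np.int64)
)
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=True,
    allows_duplicates=True,
    attrs=[
        tiledb.Attr(name="mmsi", dtype=np.int64),
        tiledb.Attr(name="lat", dtype=np.float64),
        tiledb.Attr(name="lon", dtype=np.float64),
    ],
)
tiledb.Array.create("ais_day", schema)

# Each buffered flush (every few seconds, minutes, etc.) lands as one
# small fragment in the daily array.
with tiledb.open("ais_day", "w") as day:
    day[np.array([1700000000, 1700000001])] = {
        "mmsi": np.array([366999999, 211331640]),
        "lat": np.array([42.35, 42.36]),
        "lon": np.array([-71.05, -71.06]),
    }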

The best mechanism for the daily data migration would be to use timestamps. If you assign an increasing timestamp to each write and keep track of the start and end timestamps for “the day’s data”, then at the end of the day you can run a query over that timestamp range to read the day’s data and write it to the archive array (e.g. querying 4 hrs of data with the timestamp range specified yields a much larger chunk of data to write to the archive array). If data volumes are low enough, you might be able to write the entire day’s data into a single fragment.
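
A sketch of that end-of-day migration, assuming each write to the daily array was performed at a known fragment timestamp (tiledb.open(..., mode="w", timestamp=ts)), and reusing the hypothetical URIs and schema from above:

import tiledb

day_start, day_end = 1, 86400  # tracked start/end timestamps for the day

# Read only the fragments written within the day's timestamp range.
with tiledb.open("ais_day", "r", timestamp=(day_start, day_end)) as day:
    data = day[:]

# Write the whole day back out as one (much larger) fragment, assuming
# the archive array shares the daily array's schema.
with tiledb.open("ais_archive", "w") as archive:
    archive[data["ts"]] = {
        "mmsi": data["mmsi"],
        "lat": data["lat"],
        "lon": data["lon"],
    }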

With how TileDB works, you can keep writing new data during this entire process without issue (since those writes get timestamps outside of the range you’re working in), and you can then delete the old data (again using timestamps) once you’ve verified it’s in the archive array.
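
As a hedged sketch of that cleanup step: recent versions of tiledb-py expose Array.delete_fragments for removing fragments in a timestamp range (on older versions the same effect comes from consolidating the range and then vacuuming):

import tiledb

day_start, day_end = 1, 86400

# Remove the daily array's fragments for the already-archived range.
# (Availability and exact signature may vary by tiledb-py version.)
tiledb.Array.delete_fragments(
    "ais_day", timestamp_start=day_start, timestamp_end=day_end
)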

This approach also allows you to have a more effective schema tailored for the daily data and another for the archive. To access data, you either read from the archive array if you want data older than the current day, or query the “day’s data” array if you need more recent data.

The other route would be to use only one array, with buffered writes as above, and consolidate periodically (again, timestamps can be used to specify a consolidation range). Consolidation is necessary when you write small amounts of data frequently, which results in many small fragments. Consolidation itself is read-safe during the operation, but if/when you want to “delete” the old fragments through the vacuum process, it is not 100% read-safe for a very short period of time. For more information on consolidation: Consolidation and Vacuuming | TileDB Embedded Docs
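
A minimal sketch of that consolidate-then-vacuum cycle, using the standard sm.consolidation.* and sm.vacuum.* timestamp bounds (the array URI is again hypothetical):

import tiledb

# Consolidate only the fragments written in a given timestamp range.
tiledb.consolidate(
    "ais",
    config=tiledb.Config({
        "sm.consolidation.timestamp_start": "1",
        "sm.consolidation.timestamp_end": "86400",
    }),
)

# Vacuuming removes the now-consolidated fragments; as noted above,
# this step is briefly not read-safe.
tiledb.vacuum(
    "ais",
    config=tiledb.Config({
        "sm.vacuum.timestamp_start": "1",
        "sm.vacuum.timestamp_end": "86400",
    }),
)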

Either way, you do not have to create a separate array for each day (unless you have some reason to). Also, for reference, TileDB arrays are safe to read while writing. If an array is already open in another process and writes have occurred since it was opened, you just need to call “reopen” to ensure subsequent array operations see the new data.
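
For example, a long-lived read handle in Python would pick up new fragments like this (reusing the hypothetical daily array):

import tiledb

arr = tiledb.open("ais_day", "r")
before = arr[:]   # sees only fragments present when the array was opened
# ... another process writes new data ...
arr.reopen()      # refresh the handle so new fragments become visible
after = arr[:]
arr.close()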

Let me know if anything is not clear here. We’re happy to jump on a call and discuss your use case and needs.

Awesome, thank you for the reply. Just one more question.

If I decided to store the data in one-day TileDB arrays, is there a method by which I could query a handful of arrays at once without having to individually open and query each archive?

tks!

S

Hi Scott,

For querying a handful of arrays there is not currently a native API for this, but the best approach would be to take advantage of a TileDB group. If you put those arrays in a group, you can iterate over them and perform the opens and queries relatively painlessly.

An example in Python would be:

import tiledb

with tiledb.Group(group_uri, "r") as group:
    for item in group:
        with tiledb.open(item.uri, "r") as array:
            results = array[:]  # or any other query against this array

The array[:] read is just a placeholder; perform whatever query you want within that open context.

I hope this helps!