I'm interested in using TileDB for genomics (not variants, more like genome-wide signals) and have been looking at the implementation for variants (GenomicsDB). They claim that using TileDB gets around the N+1 problem (a new sample can be added easily), but I'm looking at the docs and I don't see a way to append to an already created array.
The only way I can think of is, at the time of building the sparse array, to specify a very large domain for the sample dimension, as I think cells are not materialized if they do not contain any data.
Does TileDB support appending a row or a column to an existing array, or is the above the way to do it instead? If so, what is the maximum size of a dimension?
I think you are looking at the wrong project :). Our genomic variant work is TileDB-VCF and here are the related docs.
In general, you can “append” to a TileDB array by just writing to any slice “at the end” of a huge domain, like [1,MAX_UINT64]. The only problem with this currently is that you need to keep track of the last domain value you wrote. You can do this through the non-empty domain.
We are happy to add new APIs to provide a better user experience.
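To make the pattern concrete, here is a minimal pure-Python sketch of "append by writing just past the non-empty domain". It does not use the TileDB API: `ToySparseArray`, `non_empty_domain` and `append_sample` are hypothetical names for a toy stand-in that only stores the cells you write, which is the property that makes a huge domain cheap. With a real array you would read the non-empty domain from the library instead (in TileDB-Py, I believe this is `nonempty_domain()` on an open array).

```python
from typing import Dict, List, Optional, Tuple

MAX_UINT64 = 2**64 - 1


class ToySparseArray:
    """Toy stand-in for a sparse array with one huge 'sample' dimension.

    Only written cells are stored, so the size of the declared domain
    costs nothing until data is actually written.
    """

    def __init__(self, domain: Tuple[int, int] = (1, MAX_UINT64)) -> None:
        self.domain = domain
        self.cells: Dict[int, List[float]] = {}

    def write(self, sample_idx: int, values: List[float]) -> None:
        lo, hi = self.domain
        if not lo <= sample_idx <= hi:
            raise IndexError(f"{sample_idx} is outside the domain {self.domain}")
        self.cells[sample_idx] = list(values)

    def non_empty_domain(self) -> Optional[Tuple[int, int]]:
        """Smallest and largest written index, or None if nothing is written."""
        if not self.cells:
            return None
        return (min(self.cells), max(self.cells))


def append_sample(array: ToySparseArray, values: List[float]) -> int:
    """Write `values` at the first index after the non-empty domain."""
    ned = array.non_empty_domain()
    next_idx = array.domain[0] if ned is None else ned[1] + 1
    array.write(next_idx, values)
    return next_idx


arr = ToySparseArray()
first = append_sample(arr, [0.1, 0.2, 0.3])
second = append_sample(arr, [0.4, 0.5, 0.6])
print(first, second, arr.non_empty_domain())
```

Note that this is only safe with a single writer: two writers that read the same non-empty domain would pick the same index.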
Indeed, I got confused with the GenomicsDB reference. I was reading the documentation on TileDB-VCF, and that is what I was referring to.
OK, so the smart way at the moment is to create a very large array, keep track of the last ‘sample’ you added, and put the new one at the end. I would have expected some append function like zarr has, but this can do as well. Maybe some mention of this in the docs would be good.
Thanks for the speedy replies. I’m looking to adopt TileDB for a project, and so far so good.
Yeah indeed, “append” would be useful as an API. The main reason we have avoided “append” is that it isn’t process-safe when you have multiple writers. You need some blocking “registration” step, which the TileDB-VCF project specifically takes care of with the register step. We need to implement something similar for the general array case. It is on our roadmap as well.
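As a rough illustration of what a blocking registration step buys you, here is a small Python sketch. It is not how TileDB-VCF implements its register step: `SampleRegistry` and `reserve_concurrently` are hypothetical names, and the lock only coordinates threads inside one process. Coordinating separate processes or machines would need a shared mechanism such as a file lock or an external service. The idea is that each writer first reserves a range of sample indices under a lock, and only then writes into its own range.

```python
import threading
from typing import List, Optional, Tuple


class SampleRegistry:
    """Hands out non-overlapping index ranges to concurrent writers."""

    def __init__(self, first_index: int = 1) -> None:
        self._next = first_index
        self._lock = threading.Lock()

    def register(self, n_samples: int) -> Tuple[int, int]:
        """Reserve `n_samples` consecutive indices; returns (start, end), inclusive."""
        if n_samples < 1:
            raise ValueError("n_samples must be at least 1")
        # The lock is the blocking step: only one writer at a time may
        # read and advance the next free index.
        with self._lock:
            start = self._next
            self._next += n_samples
        return (start, start + n_samples - 1)


def reserve_concurrently(
    registry: SampleRegistry, batch_sizes: List[int]
) -> List[Optional[Tuple[int, int]]]:
    """Run one writer thread per batch and collect the range each reserved."""
    results: List[Optional[Tuple[int, int]]] = [None] * len(batch_sizes)

    def worker(i: int, n: int) -> None:
        results[i] = registry.register(n)

    threads = [
        threading.Thread(target=worker, args=(i, n))
        for i, n in enumerate(batch_sizes)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


ranges = reserve_concurrently(SampleRegistry(), [3, 1, 5, 2])
print(sorted(ranges))
```

Which writer gets which range depends on thread scheduling, but the ranges never overlap, so each writer can then write its samples without clobbering anyone else's.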