Parallel gVCF import, possible?

We run several hundred sequences through our variant calling pipeline every day. When a sequence passes automated QC, a workflow is fired off to import the resulting gVCF into our S3 TileDB store. Is there any reason we can't have many independent `tiledbvcf` import processes running concurrently against the same bucket? Or will that cause issues with the underlying data?

Hi @Michaeljon_Miller,

As long as the sample IDs are distinct for each worker, then writing from multiple workers simultaneously is fine.
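For example, a scheduler can partition incoming gVCFs into disjoint batches, one per worker, so no two simultaneous writers ever touch the same sample. This is just a sketch of the partitioning logic — the helper and the URI scheme are made up for illustration, not part of tiledb-vcf:

```python
# Partition gVCF URIs into disjoint batches, one per concurrent import
# worker. Illustrative only: tiledb-vcf itself just requires that no two
# simultaneous writers ingest the same sample ID.
def partition_samples(gvcf_uris, n_workers):
    batches = [[] for _ in range(n_workers)]
    for i, uri in enumerate(gvcf_uris):
        batches[i % n_workers].append(uri)
    return batches

# Hypothetical bucket layout for QC-passed samples.
uris = [f"s3://qc-passed/sample_{i}.g.vcf.gz" for i in range(10)]
batches = partition_samples(uris, 3)

# Every URI lands in exactly one batch, so sample IDs never collide.
assert sum(len(b) for b in batches) == len(uris)
assert len({u for b in batches for u in b}) == len(uris)
```

Each batch would then be handed to its own `tiledbvcf store` invocation.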

Best,

Isaiah

Thanks Isaiah. This does seem to work, but I suspect that we'll need to be a lot smarter about batching and consolidating. Running a few batches serially (50 samples per import) seems to leave things in a good state. But I ran 40 concurrent batches of 50, and now read performance has gone south.
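A back-of-envelope count of the fragments those concurrent runs could have left behind — assuming roughly one fragment per internal sample batch, which is a guess about ingestion behavior, and a hypothetical batch size of 10:

```python
# Rough fragment-count estimate. One fragment per sample batch is an
# assumption about ingestion behavior, not a documented guarantee.
imports = 40            # concurrent tiledbvcf store runs
samples_per_import = 50
batch_size = 10         # hypothetical ingestion batch size

fragments = imports * (samples_per_import // batch_size)
print(fragments)  # 200
```

A few hundred small, overlapping fragments would be consistent with reads slowing down until consolidation runs.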

I tried to consolidate but ran into this.

```
[2025-11-16 20:04:45.612] [tiledb-vcf] [Process: 622] [Thread: 622] [critical]
Exception: SparseIndexReaderBase: Cannot set array memory budget (3276.600000)
because it is smaller than the current memory usage (19185324).
```

And that was with 64 GB assigned and 32 cores.

```
tiledbvcf utils consolidate fragments \
    --uri ${variant_store} \
    --tiledb-config sm.mem.total_budget=65536,sm.compute_concurrency_level=32 \
    --log-level trace \
    --log-file ${meta.nfr_workflow_id}.tiledb.log
```

Something tells me our use case might be straining the model a bit. Of course we want column-wise statistics (how many samples with variant x), but two other cases are to retrieve a single sample gVCF and to merge N samples into a single gVCF (where N is going to range up to, and beyond, 25,000 samples, but starting at 5,000 by the end of this year).
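One way to keep the merge-N-samples case tractable at that scale might be to split a big read into independent (region, sample-chunk) pieces and fan them out to parallel readers or exporters. A minimal sketch of the partitioning only — the region strings and chunk sizes are placeholders, and the downstream reader is left out:

```python
# Split a (regions x samples) read into independent chunks suitable for
# parallel readers/exporters. Regions and chunk sizes are illustrative.
def read_partitions(regions, samples, samples_per_chunk):
    for region in regions:
        for i in range(0, len(samples), samples_per_chunk):
            yield region, samples[i:i + samples_per_chunk]

regions = [f"chr{c}" for c in range(1, 4)]          # 3 regions
samples = [f"S{i:05d}" for i in range(5000)]        # 5,000 samples
parts = list(read_partitions(regions, samples, 1000))

assert len(parts) == 3 * 5  # 3 regions x 5 sample chunks
```

Each part could then become one bounded-memory query instead of a single 25,000-sample read.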

Still running into an OOM trying to run consolidation. How do I tell the CLI that it's free to use the 192 GB on the machine?

Didn't realize sm.mem.total_budget was in bytes. Either way, I've adjusted it but am still running into OOM issues. I have a store with 250 samples, loaded in batches of 5×10 (50 samples per `tiledbvcf store` run, with a batch size of 10). Between each of those is a consolidation / vacuum pass on fragments and commits. I've also completely given up on using gVCF as the source and have moved to using our VCFs instead.
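For anyone else tripping on the units: since the budget is in bytes, the 65536 in the earlier command was only 64 KiB. Converting a GiB target to bytes:

```python
# sm.mem.total_budget is in bytes; 65536 is only 64 KiB.
gib = 1024 ** 3
budget = 64 * gib
print(budget)  # 68719476736

# e.g. (other flags copied from the command above):
# --tiledb-config sm.mem.total_budget=68719476736,sm.compute_concurrency_level=32
```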

How much RAM should that consolidation take? And, generally, how long should it take if the store is on S3? I'm having a hard time seeing how this is going to scale into the thousands, let alone our target of 100,000+ samples.