Parallel gVCF import, possible?

We run several hundred sequences through our variant calling pipeline every day. When a sequence passes automated QC, a workflow is fired off to import the resulting gVCF into our S3 TileDB store. Is there any reason we can't have many independent `tiledbvcf` import processes running concurrently against the same bucket? Or will that cause issues with the underlying data?

Hi @Michaeljon_Miller,

As long as the sample IDs are distinct for each worker, then writing from multiple workers simultaneously is fine.
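For example, a scheduler can partition incoming gVCFs into disjoint batches, one per worker, so no two simultaneous writers ever touch the same sample. This is just a sketch of the partitioning logic — the helper and the URI scheme are made up for illustration, not part of tiledb-vcf:

```python
# Partition gVCF URIs into disjoint batches, one per concurrent import
# worker. Illustrative only: tiledb-vcf itself just requires that no two
# simultaneous writers ingest the same sample ID.
def partition_samples(gvcf_uris, n_workers):
    batches = [[] for _ in range(n_workers)]
    for i, uri in enumerate(gvcf_uris):
        batches[i % n_workers].append(uri)
    return batches

# Hypothetical bucket layout for QC-passed samples.
uris = [f"s3://qc-passed/sample_{i}.g.vcf.gz" for i in range(10)]
batches = partition_samples(uris, 3)

# Every URI lands in exactly one batch, so sample IDs never collide.
assert sum(len(b) for b in batches) == len(uris)
assert len({u for b in batches for u in b}) == len(uris)
```

Each batch would then be handed to its own `tiledbvcf store` invocation.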

Best,

Isaiah

Thanks Isaiah. This does seem to work, but I suspect that we'll need to be a lot smarter about batching and consolidating. Running a few batches serially (50 samples per import) seems to leave things in a good state. But I ran 40 concurrent batches of 50, and now read performance has gone south.
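A back-of-envelope count of the fragments those concurrent runs could have left behind — assuming roughly one fragment per internal sample batch, which is a guess about ingestion behavior, and a hypothetical batch size of 10:

```python
# Rough fragment-count estimate. One fragment per sample batch is an
# assumption about ingestion behavior, not a documented guarantee.
imports = 40            # concurrent tiledbvcf store runs
samples_per_import = 50
batch_size = 10         # hypothetical ingestion batch size

fragments = imports * (samples_per_import // batch_size)
print(fragments)  # 200
```

A few hundred small, overlapping fragments would be consistent with reads slowing down until consolidation runs.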

I tried to consolidate but ran into this.

```
[2025-11-16 20:04:45.612] [tiledb-vcf] [Process: 622] [Thread: 622] [critical]
Exception: SparseIndexReaderBase: Cannot set array memory budget (3276.600000)
because it is smaller than the current memory usage (19185324).
```

And that was with 64 GB assigned and 32 cores.

```
tiledbvcf utils consolidate fragments \
    --uri ${variant_store} \
    --tiledb-config sm.mem.total_budget=65536,sm.compute_concurrency_level=32 \
    --log-level trace \
    --log-file ${meta.nfr_workflow_id}.tiledb.log
```

Something tells me our use case might be straining the model a bit. Of course we want column-wise statistics (how many samples with variant x), but two other cases are to retrieve a single sample gVCF and to merge N samples into a single gVCF (where N is going to range up to, and beyond, 25,000 samples, but starting at 5,000 by the end of this year).
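One way to keep the merge-N-samples case tractable at that scale might be to split a big read into independent (region, sample-chunk) pieces and fan them out to parallel readers or exporters. A minimal sketch of the partitioning only — the region strings and chunk sizes are placeholders, and the downstream reader is left out:

```python
# Split a (regions x samples) read into independent chunks suitable for
# parallel readers/exporters. Regions and chunk sizes are illustrative.
def read_partitions(regions, samples, samples_per_chunk):
    for region in regions:
        for i in range(0, len(samples), samples_per_chunk):
            yield region, samples[i:i + samples_per_chunk]

regions = [f"chr{c}" for c in range(1, 4)]          # 3 regions
samples = [f"S{i:05d}" for i in range(5000)]        # 5,000 samples
parts = list(read_partitions(regions, samples, 1000))

assert len(parts) == 3 * 5  # 3 regions x 5 sample chunks
```

Each part could then become one bounded-memory query instead of a single 25,000-sample read.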

Still running into an OOM trying to run consolidation. How do I tell the CLI that it's free to use the 192 GB on the machine?

Didn't realize sm.mem.total_budget was in bytes. Either way, I've adjusted it but am still running into OOM issues. I have a store with 250 samples, loaded in batches of 5×10 (50 samples per `tiledbvcf store` run, with a batch size of 10). Between each of those is a consolidation / vacuum pass on fragments and commits. I've also completely given up on using gVCF as the source and have moved to using our VCFs instead.
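For anyone else tripping on the units: since the budget is in bytes, the 65536 in the earlier command was only 64 KiB. Converting a GiB target to bytes:

```python
# sm.mem.total_budget is in bytes; 65536 is only 64 KiB.
gib = 1024 ** 3
budget = 64 * gib
print(budget)  # 68719476736

# e.g. (other flags copied from the command above):
# --tiledb-config sm.mem.total_budget=68719476736,sm.compute_concurrency_level=32
```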

How much RAM should that consolidation take? And, generally, how long should it take if the store is on S3? I'm having a hard time seeing how this is going to scale into the thousands, let alone our target of 100,000+ samples.