Runtime error in accessing s3 vcf data file

import tiledbvcf
import boto3
import os
import tempfile
import glob
import pandas as pd

small_ds = tiledbvcf.Dataset('small_dataset2', mode="w")

with open("s3-vcf-samples.txt") as f:
    sample_uris = [l.rstrip("\n") for l in f.readlines()]

small_ds.ingest_samples(
    sample_uris,
    scratch_space_path = tempfile.gettempdir(),
    scratch_space_size=10
)
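A note on the file-reading step: rstrip("\n") removes only the trailing newline, so trailing spaces or blank lines in s3-vcf-samples.txt would end up inside the URI list and could produce a URI that does not exist. The following is a minimal sketch of a stricter reader. The helper name read_sample_uris is hypothetical and not part of tiledbvcf, and whether stray whitespace is the cause here is only an assumption.

```python
def read_sample_uris(path):
    """Return the non-empty lines of `path` with surrounding whitespace removed.

    Hypothetical helper: it only cleans the text file, it does not touch S3.
    """
    with open(path) as f:
        # strip() drops leading/trailing whitespace; blank lines are skipped.
        return [line.strip() for line in f if line.strip()]
```

Printing repr() of each entry in the resulting list is a quick way to confirm that no hidden characters remain.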

Error message:

RuntimeError Traceback (most recent call last)
in ()
4 sample_uris,
5 scratch_space_path = tempfile.gettempdir(),
----> 6 scratch_space_size=10
7 )

/home/ec2-user/SageMaker/tileDB-new/tiledbvcf/lib/python3.7/site-packages/tiledbvcf/ in ingest_samples(self, sample_uris, extra_attrs, checksum_type, allow_duplicates, scratch_space_path, scratch_space_size)
212 # Create is a no-op if the dataset already exists.
213 self.writer.create_dataset()
--> 214 self.writer.register_samples()
215 self.writer.ingest_samples()

RuntimeError: TileDB-VCF exception: Error processing sample; URI 's3://bucket/data_path/chr1-prefix.vcf.gz' does not exist.

I tried exporting the access key and secret key as described in the docs, but I still get the same error.
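One thing worth checking at this point: variables exported in a separate shell are not necessarily visible to an already-running notebook kernel. A minimal sketch for listing which of the standard AWS credential variables are unset in the current Python process follows. The helper name unset_aws_vars is hypothetical, and it is only an assumption that missing environment variables are related to this error.

```python
import os

# Standard AWS credential environment variable names.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def unset_aws_vars(environ=None):
    """Return the names in REQUIRED_VARS that are missing or empty.

    `environ` defaults to this process's environment; passing a plain dict
    makes the check easy to try out without touching real credentials.
    """
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]
```

Calling unset_aws_vars() inside the notebook that runs the ingestion shows what that kernel actually sees. An empty list means both variables are set in that process.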


Since you already have boto3 imported, could you use it to verify that you can access the specified VCF file?

This should do it:

s3 = boto3.client('s3')
s3.list_objects_v2(Bucket = "bucket", Prefix = "data_path")

Alternatively, you could use the aws-cli and try to list the bucket's contents with
aws s3 ls s3://bucket/data_path/
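To run the same check over every entry in sample_uris rather than a single prefix, something along these lines might help. It is a sketch only: split_s3_uri and find_missing_uris are hypothetical helper names, and the client is assumed to be the object returned by boto3.client('s3'), used here only through the list_objects_v2 call shown above.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def find_missing_uris(sample_uris, s3_client):
    """Return the URIs whose exact key is not listed in its bucket.

    `s3_client` is assumed to behave like boto3.client('s3'): only
    list_objects_v2(Bucket=..., Prefix=...) is called on it.
    """
    missing = []
    for uri in sample_uris:
        bucket, key = split_s3_uri(uri)
        # Using the full key as the prefix keeps the listing small.
        resp = s3_client.list_objects_v2(Bucket=bucket, Prefix=key)
        listed = {obj["Key"] for obj in resp.get("Contents", [])}
        if key not in listed:
            missing.append(uri)
    return missing
```

Usage would look like find_missing_uris(sample_uris, boto3.client('s3')). An empty result means every URI in the list resolves to an existing object for the credentials boto3 is using.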

Yes, I’m able to access it using boto3.

Thanks for checking.

One other thought: is sample_uris a string with a single sample URI? Or is it a list of strings?
If it’s the former, the string needs to be wrapped in a list before it is passed to ingest_samples, i.e.

  sample_uris = [sample_uris],
  scratch_space_path = tempfile.gettempdir(),
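The string-versus-list distinction can be handled defensively with a small wrapper. This is a sketch under the assumption that sample URIs are plain Python strings, and as_uri_list is a hypothetical helper name rather than anything provided by tiledbvcf.

```python
def as_uri_list(sample_uris):
    """Return `sample_uris` as a list of strings.

    A single URI string is wrapped in a one-element list; any other
    iterable (list, tuple, generator) is copied into a new list.
    """
    if isinstance(sample_uris, str):
        return [sample_uris]
    return list(sample_uris)
```

The isinstance check matters because a bare string is itself iterable, so code that loops over it would see one character at a time instead of one URI.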

sample_uris is a list of URIs.

Could you share what version of tiledbvcf you’re using (tiledbvcf.version) and how you installed it?