Are files getting corrupted during dataset creation using tiledbvcf?

Hello,

I am using tiledbvcf to create a dataset and have to ingest 1700 files. While the dataset creation process was running, I checked on it and everything appeared to be going smoothly. Partway through ingestion (after around 1000 files), I ran a query, say "X", which returned results. But when ingestion was about to complete, the same query "X" returned nothing (an empty data frame).

Can anyone please guide me about what is happening?

Hi @Vibhorgupta31,

The dataset should be consistent during dataset creation. When you run a query, it will open the dataset and include only the data committed prior to the dataset open time.

Do you get the expected result when running the query after ingestion completes?

Thank you @George_Powley for your reply. Please find the problem described in more detail below:

I am running a script on a server that ingests files into a dataset. While ingestion is in process, I query the dataset. I understand queries will only return results on ingested data so far.

However, I see inconsistent behavior. During ingestion, queries return results. But the same queries after more files are ingested, or after all files are ingested, return no results - even though I expect to see initial results plus additional results from newly ingested data.

I have replicated this multiple times and see the same problem.

Please let me know if I summarized the key points correctly. I’m happy to help clarify further.

Thank you @Vibhorgupta31, I understand the symptom you are describing. The next step is to figure out why you are seeing the unexpected behavior.

Please run these commands and provide the output (except for the list of samples):

# report the version
tiledbvcf --version

# report high-level dataset stats
tiledbvcf stat --uri dataset_uri

# list names of samples in the dataset
tiledbvcf list --uri dataset_uri

Can you share the ingestion script and query code? I assume you cannot share the data.

Thanks again @George_Powley, and yes, I cannot share the data due to privacy concerns.

Please find the output of the above commands below:

# report the version
tiledbvcf --version

TileDB-VCF version 0152168-modified
TileDB version 2.15.4
htslib version 1.16
# report high-level dataset stats

tiledbvcf stat --uri dataset_uri

Statistics for dataset 'full_dataset_tiledb_III':
- Version: 4
- Tile capacity: 10,000
- Anchor gap: 1,000
- Number of samples: 1,690
- Extracted attributes: none

I used the tiledbvcf Python module for ingestion; the script is as follows:

# Library imports
import tiledbvcf
import pathlib
import shutil

directory = pathlib.Path("path to input directory")  # Directory with all input files to be ingested by tiledbvcf
files = list(directory.glob("*.vcf.gz"))

uri = "full_dataset_tiledb_III"
dataset_path = pathlib.Path("output_directory") / uri

# Remove the dataset if it already exists
if dataset_path.exists():
    shutil.rmtree(dataset_path)
    print("Dataset removed")

ds = tiledbvcf.Dataset(str(dataset_path), "w")
ds.create_dataset()

print("file ingestion started")

for file in files:
    try:
        ds.ingest_samples([str(file)])
    except Exception as e:
        # Log the failing file instead of silently swallowing all errors
        print(file.name, e)
print("Code Over")

The query script I used was:

import tiledbvcf
import pandas as pd

uri = "full_dataset_tiledb_III"
ds = tiledbvcf.Dataset(uri,"r")

df = ds.read(attrs=["col_1", "col_2"])
print(df[df["col_1"].str.contains("X")])

I also ran the equivalent commands you mentioned using the Python module, and the output is as follows:

 tiledbvcf.version
'0.23.1'

ds.attributes(attr_type = "builtin")

['alleles', 'contig', 'filters', 'fmt', 'id', 'info', 'pos_end', 'pos_start', 'qual', 'query_bed_end', 'query_bed_line', 'query_bed_start', 'sample_name']

To debug further, I looked at the number of lines in a particular file and observed that the number of lines returned via Python (excluding the header) was fewer than in the actual file. However, when I exported the same file from the dataset, it matched the ingested file.
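For completeness, this is roughly how the counts can be compared (a minimal sketch; `count_records` is just an illustrative helper, not part of tiledbvcf, and BGZF-compressed `.vcf.gz` files open fine with Python's `gzip` module since BGZF is gzip-compatible):

```python
import gzip

def count_records(path):
    """Count the non-header lines (records) in a gzipped/bgzipped VCF."""
    n = 0
    with gzip.open(path, "rt") as f:
        for line in f:
            if not line.startswith("#"):
                n += 1
    return n
```

Comparing this number against `len(df)` for a single-sample query is a quick sanity check on whether the Python read returned everything.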

Please let me know if the above output helps, or if anything more is needed from my end.

Thanks for sharing your scripts @Vibhorgupta31.

One suggestion is to add the following code to the beginning of your scripts, to help identify any issues:

tiledbvcf.config_logging("debug", "tiledbvcf.log")

Also, we recommend ingesting VCF files in batches of at least 10, which will improve the query performance.
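A minimal sketch of batching, building on the loop in your ingestion script (the `batches` helper is illustrative, not a tiledbvcf API):

```python
def batches(items, size=10):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# With the names from your script, ingestion then becomes:
# for batch in batches(files, size=10):
#     ds.ingest_samples([str(f) for f in batch])
```

Each `ingest_samples` call writes new fragments, so fewer, larger calls generally produce a dataset that queries faster.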

Finally, our TileDB Cloud SaaS product includes APIs for one-line distributed VCF ingestion and one-line distributed VCF queries. These functions reduce the coding effort required to ingest and query VCF data and include best practices to optimize performance. If you are interested in TileDB Cloud, we would be happy to meet and discuss further.

Thank you, @George_Powley, for your response. I appreciate your suggestion, and I’ll definitely try it out and update you on my experience with TileDB Cloud.

I wanted to share a quick observation. When I ran the same query against the dataset using the TileDB CLI instead of Python, I unexpectedly did get the expected records. This discrepancy suggests there might be an issue with the Python module. Please feel free to correct me if my assumption is mistaken. I’d love to hear any insights you might have on this matter.

That’s a good clue. You may be seeing an incomplete query in Python. When a query is incomplete, you need to call ds.continue_read() until it is complete and concatenate the partial results. Something like this:

# Issue the read query
dfs = [ds.read(...)]

# Loop until the query is complete
while not ds.read_completed():
    dfs.append(ds.continue_read())

# Combine the partial results into a single dataframe
df = pandas.concat(dfs)


Thanks, @George_Powley! Your suggestion while not ds.read_completed(): did the trick. I got the complete dataset as expected. Appreciate your help!
