Thanks again @George_Powley, and yes, I cannot share the data due to privacy concerns.
But please find the output of the above commands below:
# report the version
tiledbvcf --version
TileDB-VCF version 0152168-modified
TileDB version 2.15.4
htslib version 1.16
# report high-level dataset stats
tiledbvcf stat --uri dataset_uri
Statistics for dataset 'full_dataset_tiledb_III':
- Version: 4
- Tile capacity: 10,000
- Anchor gap: 1,000
- Number of samples: 1,690
- Extracted attributes: none
I used the tiledbvcf Python module for ingestion. The script is below:
# Library Imports
import tiledbvcf
import pathlib
import shutil
directory = pathlib.Path("path to input directory") # Directory with all input files to be ingested by tiledbvcf
files = list(directory.glob("*.vcf.gz"))
uri = "full_dataset_tiledb_III"
dataset_path = pathlib.Path("output_directory") / uri
# remove the dataset if it already exists
if dataset_path.exists():
    shutil.rmtree(dataset_path)
    print("Dataset removed")
ds = tiledbvcf.Dataset(str(dataset_path), "w")
ds.create_dataset()
print("file ingestion started")
for file in files:
    try:
        ds.ingest_samples([str(file)])
    except Exception as exc:
        # report which file failed and why, instead of hiding the error
        print(file.name, exc)
print("Code Over")
The query script I used was:
import tiledbvcf
import pandas as pd
uri = "full_dataset_tiledb_III"
ds = tiledbvcf.Dataset(uri,"r")
df = ds.read(attrs=["col_1", "col_2"])
print(df[df["col_1"].str.contains("X")])
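One thing that may be worth ruling out for a query like the one above is an incomplete read: as far as I understand the tiledbvcf Python API, `read()` can return only a first batch when its buffers fill up, and `read_completed()` / `continue_read()` are then used to fetch the rest. The sketch below assumes that behavior; `read_all_batches` is a hypothetical helper name, not part of tiledbvcf, and whether this explains the missing rows is only a guess.

```python
def read_all_batches(ds, attrs):
    """Collect every batch of a tiledbvcf read (hypothetical helper).

    Assumes `ds` is an open tiledbvcf.Dataset in read mode and that
    read_completed() / continue_read() behave as described in the lead-in.
    """
    batches = [ds.read(attrs=attrs)]
    # Keep fetching until the dataset reports that the query is finished.
    while not ds.read_completed():
        batches.append(ds.continue_read())
    return batches
```

If more than one batch comes back, the batches could be combined with `pd.concat(batches, ignore_index=True)` before filtering.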
I also ran the equivalent commands you mentioned using the Python module, and the output is as follows:
tiledbvcf.version
'0.23.1'
ds.attributes(attr_type = "builtin")
['alleles', 'contig', 'filters', 'fmt', 'id', 'info', 'pos_end', 'pos_start', 'qual', 'query_bed_end', 'query_bed_line', 'query_bed_start', 'sample_name']
To debug further, I looked at the number of lines for one particular file and observed that the number of lines obtained through Python (excluding the header) was lower than in the actual file. However, when I exported the same file from the dataset, the export matched the ingested file.
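A line-count comparison of this kind can be reproduced with the standard library alone. This is a minimal sketch: `count_vcf_records` is a hypothetical helper name, and it assumes the input is a gzip- or bgzip-compressed VCF in which header lines start with `#`.

```python
import gzip


def count_vcf_records(path):
    """Count the non-header, non-empty lines in a (b)gzipped VCF file."""
    with gzip.open(path, "rt") as handle:
        return sum(
            1 for line in handle
            if line.strip() and not line.startswith("#")
        )
```

The result can then be compared against the number of rows the Python query returns for the same sample, for example by including the built-in `sample_name` attribute in `attrs` and filtering the dataframe on it.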
Please let me know if the above output helps, or if anything more is needed from my end.