Importing tiledb dataset created with tiledb-vcf into R

Hi everybody,
I have created a tiledb dataset using tiledb-vcf and now I am trying to import it into R:

# if needed: install.packages('tiledb')
library(tiledb)
gwasdb_uri <- "/datasets/blood_traits/data"
gwasdb <- tiledb::tiledb_array(gwasdb_uri, "READ", as.data.frame = TRUE, is.sparse = TRUE)
str(gwasdb)

but I get this error: “Non-char var.num columns are not currently supported.”

Do you know if it is possible to import a dataset created using tiledb-vcf into R?
Do you have any suggestions on how to proceed?

Thank you

Hi @gmauro,

Thanks for filing this issue, which I can confirm. As of right now, tiledb-r cannot read (non-character) columns with a variable number of elements: each column has to be a single vector with a single element per cell.

We plan to relax this constraint both for regular access (i.e. via return_as="data.frame" and the data.table, tibble, … alternatives) and for Arrow data structures.

For now, you can adjust your query by excluding the columns that have multiple (non-character) elements. Here is an example array created by the C++ example variable_length (which I modified a little so that it is written as a sparse rather than a dense array). Attributes a1 (character) and a2 (int) both have a variable number of entries per cell. When I request the full array, I get the error you saw. When I request only a1, it works (but of course I do not get a2).

$ Rscript -e 'print(tiledb::tiledb_array("variable_length_array", return_as="data.table")[])'
Non-char var.num columns are not currently supported.
Error: Expecting an external pointer: [type=NULL].
Execution halted
$ Rscript -e 'print(tiledb::tiledb_array("variable_length_array", return_as="data.table", attrs=c("a1"))[])'
     rows  cols     a1
    <int> <int> <char>
 1:     1     1      a
 2:     1     2     bb
 3:     1     3    ccc
 4:     1     4     dd
 5:     2     1    eee
 6:     2     2      f
 7:     2     3      g
 8:     2     4    hhh
 9:     3     1      i
10:     3     2    jjj
11:     3     3     kk
12:     3     4      l
13:     4     1      m
14:     4     2      n
15:     4     3     oo
16:     4     4      p
$ 
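The workaround above (request only the attributes tiledb-r can currently read) could be sketched in R roughly as follows. This is a minimal sketch, not part of tiledb-r: `pick_readable` and `readable_attrs` are hypothetical helper names, the set of character datatype names is an assumption, and the wrapper relies on `schema()` succeeding, which is not the case for every array (a TileDB-VCF array may fail at that step).

```r
# Pure-R rule: keep fixed-length attributes and variable-length character ones.
# `types` holds TileDB datatype names; `var_len` flags variable-length attributes.
# The character type names below are an assumption for illustration.
pick_readable <- function(attr_names, types, var_len) {
  is_char <- types %in% c("CHAR", "ASCII", "UTF8")
  unname(attr_names[!var_len | is_char])
}

# Hypothetical wrapper: collect the metadata from the array schema.
# Requires the tiledb package and an array whose schema can be loaded in R.
readable_attrs <- function(uri) {
  sch <- tiledb::schema(tiledb::tiledb_array(uri))
  al <- tiledb::attrs(sch)
  pick_readable(
    vapply(al, tiledb::name, character(1)),
    vapply(al, tiledb::datatype, character(1)),
    # cell_val_num() is NA for variable-length attributes
    vapply(al, function(a) is.na(tiledb::cell_val_num(a)), logical(1))
  )
}

# The two attributes of the variable_length example: both variable-length,
# a1 is character and a2 is integer (type names assumed).
keep <- pick_readable(c("a1", "a2"), c("CHAR", "INT32"), c(TRUE, TRUE))
print(keep)
```

With a loadable schema, the result could then be passed on, e.g. `tiledb::tiledb_array(uri, attrs = readable_attrs(uri), return_as = "data.frame")`.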

We are looking into adding retrieval as a data.frame with a list column and/or as an Arrow table with a list column.

Thanks, Dirk

Hi Dirk,

Thank you for the reply (I am on the same team as Gianmauro, who asked the original question :))

We are still struggling to do a basic import, even in cases where there should not be any multi-allelic entries. For example, we have downloaded the publicly available UKB VCF file from here: Trait: Non-cancer illness code self-reported: eczema/dermatitis - IEU OpenGWAS project

We import the ukb-a-99 study into the TileDB dataset using the command:
tiledbvcf store
And when we try to import the vcf into R we still get the same error:

library(tiledb)

# Trying to import it all

gwasdb = tiledb::tiledb_array(gwasdb_uri, "READ", return_as = "data.frame", is.sparse = TRUE)

gwasdb["1",]
Non-char var.num columns are not currently supported.

Error: Expecting an external pointer: [type=NULL].
schema(gwasdb)
Error: tiledb_dim tile UINT32 value not representable as an R integer

When we try to import some attributes, it works fine:

gwasdb = tiledb::tiledb_array(gwasdb_uri, "READ", attrs = c("real_start_pos", "end_pos", "alleles", "id"), return_as = "data.frame")

But when we add some other columns from the VCF, we get an error:

gwasdb = tiledb::tiledb_array(gwasdb_uri, "READ", attrs = c("real_start_pos", "end_pos", "alleles", "id", "filter_ids"), return_as = "data.frame")
gwasdb["1",]
Non-char var.num columns are not currently supported.
Error: Expecting an external pointer: [type=NULL].

When we add other columns from the VCF, for example the allele frequency (which ranges from 0 to 1 in the VCF file), we get a different error:

gwasdb = tiledb::tiledb_array(gwasdb_uri, "READ", attrs = c("real_start_pos", "end_pos", "alleles", "id", "fmt_AF"), return_as = "data.frame")
gwasdb["1",]
Error in if (is.na(varnum) && !nullable) { :
missing value where TRUE/FALSE needed

Could you please point us to what we can do, or to a tutorial, if there is a way to import VCF files (from UKB, for example) and read them into R?

Looking into the function, it looks like the value is being read from the pointer as NA instead of 1, but we hope you can help us understand why this would be happening.
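As a side note on the `is.na(varnum) && !nullable` error above, plain R semantics narrow down when such a condition can fail. The sketch below does not use tiledb-r internals; `check`, `varnum` and `nullable` are stand-in names that only mirror the expression in the error message. Since `is.na(varnum)` is never NA for a single value and `FALSE && NA` is `FALSE`, the `if` can only fail when `varnum` is NA and `nullable` is NA as well.

```r
# Evaluate the condition from the error message for given stand-in values and
# report which branch `if` takes, or the error message if `if` rejects it.
check <- function(varnum, nullable) {
  tryCatch(
    if (is.na(varnum) && !nullable) "branch taken" else "branch skipped",
    error = function(e) conditionMessage(e)
  )
}

fixed_ok    <- check(1L, FALSE)           # FALSE && TRUE: condition is FALSE
var_ok      <- check(NA_integer_, FALSE)  # TRUE && TRUE: condition is TRUE
fixed_na    <- check(1L, NA)              # FALSE && NA is FALSE, so no error
var_na      <- check(NA_integer_, NA)     # TRUE && NA is NA, so `if` fails

print(c(fixed_ok, var_ok, fixed_na, var_na))
```

If this reading applies, the failing attribute would be a variable-length one whose nullable flag also comes back as NA, which may help when reporting the problem.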

Thank you very much for any pointers and help,
All Best,
Claudia and Gianmauro

Hello @clagiamba! It looks like what you all really want to do is access the entire VCF dataset from R. An R API for TileDB-VCF would give you a more robust solution, including the ability to query with genomic ranges, interpret and extract the various fmt and info fields, and more. An R API has been on our roadmap and we’ve been looking for potential users.

Would you be open to discussing your use case in more detail? Feel free to email me at seth@tiledb.com if you would like to discuss in private. We’d also love to have a short call to discuss more. Would you all have availability this week or next?

Hi Seth! Thank you for the reply and yes about the R API and absolutely - a chat about this would be great - I will email you in private in the next few days. Thanks!