Storage of GWAS data (~100 million rows × 20,000 columns)

Hello!

We want to investigate the storage of genome-wide association study (GWAS) data in TileDB.

The data can be modeled as a table whose rows are genomic markers (roughly 100 million), whose columns are phenotypes (roughly 20,000), and whose cells each contain 3 float attributes (intercept and slope from a logistic regression, and the p-value).

The use case is fast retrieval of attributes

  • by genomic position (marker) for all phenotypes
  • or by phenotype for all markers

I saw the very interesting slides by Dirk Eddelbuettel and Aaron Wolen from useR! 2021 (https://dirk.eddelbuettel.com/papers/useR2021_tiledb_tutorial.pdf). There they showcase the use of tiledb for UK Biobank GWAS datasets.

They use a sparse array with dimensions phenotype, chromosome, and position, and insert the data phenotype by phenotype. I understand that using sparse arrays allows indexing directly by marker name, chromosome, and position, making the approach very user-friendly, but I was wondering about the efficiency of that approach.

Doesn't insertion by phenotype yield tiles that mainly contain data from a single phenotype? That would imply that accessing all phenotypes for a given marker requires reading thousands of tiles, leading to long query times for markers and short query times for phenotypes.
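
The read-amplification concern above can be put in rough numbers. The tile capacity and the alternative tile shape below are assumed for illustration, not taken from the slides:

```python
# Back-of-envelope read amplification for a single-marker query,
# assuming (hypothetically) that phenotype-by-phenotype writes produce
# tiles that each hold data for only one phenotype.
n_phenotypes = 20_000

# Phenotype-major layout: one marker's attributes land in a different
# tile for (almost) every phenotype, so a marker slice touches:
tiles_read_phenotype_major = n_phenotypes  # ~one tile per phenotype

# A layout whose tiles span phenotypes (say, 500 phenotypes per tile)
# would instead touch:
phenotypes_per_tile = 500
tiles_read_spanning = n_phenotypes // phenotypes_per_tile

print(tiles_read_phenotype_major, tiles_read_spanning)  # 20000 40
```

So under these assumptions a single-marker query would read on the order of 20,000 tiles in the phenotype-major layout versus 40 in a phenotype-spanning one.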

Doesn't inserting the data phenotype by phenotype also produce a large number of fragments, requiring consolidation?

In our case, the data is produced in blocks of 50,000 markers × 500 phenotypes (on a GPU). Wouldn't writing the data directly as dense blocks be more efficient? One could then use sparse arrays to look up the dense matrix indices from the coordinates.
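
A minimal sketch of that lookup idea, using plain dicts in place of the sparse lookup arrays (all names here are hypothetical; in practice the mappings could be small string-dimensioned sparse arrays):

```python
# Hypothetical mapping from user-facing names to dense array indices.
marker_index = {f"rs{i}": i for i in range(10)}        # marker name -> row
phenotype_index = {f"pheno_{j}": j for j in range(4)}  # phenotype name -> column

def dense_coords(marker, phenotype):
    """Translate (marker name, phenotype name) to (row, col) in the dense array."""
    return marker_index[marker], phenotype_index[phenotype]

row, col = dense_coords("rs7", "pheno_2")
print(row, col)  # 7 2
```

A dense slice at `(row, col)` would then return the intercept/slope/p-value attributes for that cell, while writes stay aligned with the 50,000 × 500 production blocks.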

Best regards,
Florian

Hi Florian,

Glad you found the GWAS demo useful!

Your characterization of the trade-offs in the schema we chose is spot on. It was built very specifically for the tutorial to emphasize user-friendly queries (for particular regions within a phenotype), with little consideration given to performance.

No doubt the solution you proposed would be more performant, especially for slicing across phenotypes. Happy to discuss more.