Storage of GWAS data (~100 million rows × 20,000 columns)

Hello!

We want to investigate the storage of genome-wide association study (GWAS) data in TileDB.

The data can be modeled as a table whose rows are genomic markers (roughly 100 million), whose columns are phenotypes (roughly 20,000), and whose cells each contain 3 float attributes (intercept and slope from a logistic regression, and the p-value).

The use case is fast retrieval of attributes

  • by genomic position (marker) for all phenotypes
  • or by phenotype for all markers

I saw the very interesting slides by Dirk Eddelbuettel and Aaron Wolen from useR! 2021 (https://dirk.eddelbuettel.com/papers/useR2021_tiledb_tutorial.pdf). There they showcase the use of tiledb for UK Biobank GWAS datasets.

They use a sparse array with dimensions phenotype, chromosome, and position, and insert the data phenotype by phenotype. I understand that using sparse arrays allows indexing directly by marker name, chromosome, and position, making the approach very user-friendly, but I was wondering about the efficiency of that approach.

Doesn't insertion by phenotype yield tiles that mainly contain data from a single phenotype? That would imply that accessing all phenotypes for a given marker requires reading thousands of tiles, leading to long query times for markers and short query times for phenotypes.
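
The read-amplification concern above can be put in rough numbers. The tile capacity and the alternative tile shape below are assumed for illustration, not taken from the slides:

```python
# Back-of-envelope read amplification for a single-marker query,
# assuming (hypothetically) that phenotype-by-phenotype writes produce
# tiles that each hold data for only one phenotype.
n_phenotypes = 20_000

# Phenotype-major layout: one marker's attributes land in a different
# tile for (almost) every phenotype, so a marker slice touches:
tiles_read_phenotype_major = n_phenotypes  # ~one tile per phenotype

# A layout whose tiles span phenotypes (say, 500 phenotypes per tile)
# would instead touch:
phenotypes_per_tile = 500
tiles_read_spanning = n_phenotypes // phenotypes_per_tile

print(tiles_read_phenotype_major, tiles_read_spanning)  # 20000 40
```

So under these assumptions a single-marker query would read on the order of 20,000 tiles in the phenotype-major layout versus 40 in a phenotype-spanning one.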

Doesn't inserting the data phenotype by phenotype also produce a large number of fragments, requiring consolidation?

In our case, the data is produced in blocks of 50,000 markers × 500 phenotypes (on a GPU). Wouldn't writing the data directly as dense blocks be more efficient? One could then use sparse arrays to look up the dense matrix indices from the coordinates.
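
A minimal sketch of that lookup idea, using plain dicts in place of the sparse lookup arrays (all names here are hypothetical; in practice the mappings could be small string-dimensioned sparse arrays):

```python
# Hypothetical mapping from user-facing names to dense array indices.
marker_index = {f"rs{i}": i for i in range(10)}        # marker name -> row
phenotype_index = {f"pheno_{j}": j for j in range(4)}  # phenotype name -> column

def dense_coords(marker, phenotype):
    """Translate (marker name, phenotype name) to (row, col) in the dense array."""
    return marker_index[marker], phenotype_index[phenotype]

row, col = dense_coords("rs7", "pheno_2")
print(row, col)  # 7 2
```

A dense slice at `(row, col)` would then return the intercept/slope/p-value attributes for that cell, while writes stay aligned with the 50,000 × 500 production blocks.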

Best regards,
Florian

Hi Florian,

Glad you found the GWAS demo useful!

Your characterization of the trade-offs in the schema we chose is spot on. It was built very specifically for the tutorial to emphasize user-friendly queries (for particular regions within a phenotype), with little consideration given to performance.

No doubt the solution you proposed would be more performant, especially for slicing across phenotypes. Happy to discuss more.