I have what seems like a very straightforward use case for TileDB – I'd like to store a 2D matrix of gene expression data. Typically I would update by sample and read by gene, but the opposite access pattern is also possible. I am playing around a little with a 20k x 50k dense array of floats, and I'm having several issues which make me think I'm doing something wrong.
Informal testing suggested that 1000 x 1000 tiles give the best performance tradeoff, so that’s what I went with.
- Disk space. If I write the entire array at once (this matrix is still small enough to keep in memory), the array is ~8 GB, exactly as expected. However, if I write single columns at a time, the size quickly balloons (e.g. 40 GB after 100 writes), and it stays that way even after calling `.consolidate()` and `.vacuum()`.
- Parallel writes. If I use multiprocessing to write from 25 concurrent threads (or processes), I get about a 4x speedup compared to writing in serial. Surprisingly, this seems to be true regardless of whether I'm writing to the same tile or spread across tiles! If I look at the running processes with `top`, most of them are in the "D" (waiting for disk access) state.
Here’s some of my testing code.
```python
import tiledb
import numpy as np
import os

n_samples = 20000
n_genes = 50000
uri = 'test_tiledb_20k_by_50k'

dom = tiledb.Domain(tiledb.Dim(name="samples", domain=(0, n_samples - 1), tile=1000, dtype=np.int32),
                    tiledb.Dim(name="genes", domain=(0, n_genes - 1), tile=1000, dtype=np.int32))
schema = tiledb.ArraySchema(domain=dom, sparse=False,
                            attrs=[tiledb.Attr(name="a", dtype=np.float64)])
tiledb.DenseArray.create(uri, schema, overwrite=True)
```
```python
def get_size(start_path='.'):
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            # skip if it is a symbolic link
            if not os.path.islink(fp):
                total_size += os.path.getsize(fp)
    return total_size

get_size(uri) * 1.0 / (2**20)   # size in MiB
0.00017261505126953125
```
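(Since every number below comes out of `get_size`, here is a quick stdlib-only sanity check I ran on a throwaway temporary directory, not the real array, just to confirm it sums file sizes recursively and returns bytes.)

```python
import os
import tempfile

def get_size(start_path='.'):
    # sum the sizes of all regular files under start_path, skipping symlinks
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            if not os.path.islink(fp):
                total_size += os.path.getsize(fp)
    return total_size

with tempfile.TemporaryDirectory() as d:
    # one 1 MiB file, plus one 512 B file in a subdirectory
    with open(os.path.join(d, 'big.bin'), 'wb') as fh:
        fh.write(b'\0' * 2**20)
    sub = os.path.join(d, 'sub')
    os.mkdir(sub)
    with open(os.path.join(sub, 'small.bin'), 'wb') as fh:
        fh.write(b'\0' * 512)
    print(get_size(d))  # 1049088, i.e. 1 MiB + 512 B
```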
```python
a = tiledb.array.DenseArray(uri, mode='w')

%timeit -n 5 -r 1 np.random.rand(n_genes)
434 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 5 loops each)

%timeit -n 5 -r 1 a[np.random.randint(n_samples), :] = np.random.rand(n_genes)
1.35 s ± 0 ns per loop (mean ± std. dev. of 1 run, 5 loops each)

get_size(uri) * 1.0 / (2**20)
1907.7264785766602
```
```python
L1 = np.arange(100)

%timeit -n 1 -r 1 for i in L1: a[i, :] = np.random.rand(n_genes)
2min 14s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

get_size(uri) * 1.0 / (2**20)
40062.25247859955
```
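For what it's worth, these sizes are consistent with my guess that each non-tile-aligned dense write gets padded out to full tiles (an assumption on my part, not something I've confirmed): a single-row write crosses 50 tiles of 1000 x 1000 float64, and the 5 warm-up writes from the `%timeit` above plus the 100 loop writes account for almost exactly the observed 40 GB:

```python
# assumption: each single-row dense write materializes every tile it touches
tile_bytes = 1000 * 1000 * 8            # one 1000x1000 float64 tile = 8 MB
tiles_per_row_write = 50000 // 1000     # a full row crosses 50 tiles
fragment_mib = tiles_per_row_write * tile_bytes / 2**20
print(fragment_mib)                     # ~381.5 MiB per single-row write

# 5 writes from the %timeit above plus 100 writes from the loop
print(105 * fragment_mib)               # ~40054 MiB, vs. 40062 MiB observed
```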
```python
a.consolidate()
tiledb.vacuum(uri)

get_size(uri) * 1.0 / (2**20)
40062.25247859955
```
```python
import multiprocessing as mp
from multiprocessing import Pool

# note: without allow_none=True, get_start_method() never returns None, so my
# original check was always False and the platform default ('fork' on Linux) was used
if mp.get_start_method(allow_none=True) is None:
    mp.set_start_method('spawn')

def write(i):
    a[:, i] = np.random.rand(n_samples)

p = Pool(25)

%timeit -n 1 -r 1 p.map(write, L1)
24.6 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
```
```python
L2 = np.random.randint(n_samples, size=100)

%timeit -n 1 -r 1 for i in L2: a[i, :] = np.random.rand(n_genes)
2min 14s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

%timeit -n 1 -r 1 p.map(write, L2)
24.3 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

get_size(uri) * 1.0 / (2**20)
108740.84340190887
```
```python
a.consolidate()
tiledb.vacuum(uri)

get_size(uri) * 1.0 / (2**20)
108740.84340190887

# expected size of the full dense array, in MiB
(8.0 * n_genes * n_samples) / (2**20)
7629.39453125
```
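Under the same full-tile-padding assumption (again, my guess at the mechanism), the final ~108 GB is also reproduced almost exactly by tallying every write above: 105 row writes initially, then 100 column writes, 100 more row writes, and 100 more column writes, where a row write crosses 50 tiles and a column write crosses 20:

```python
tile_mib = 1000 * 1000 * 8 / 2**20      # one float64 tile ≈ 7.63 MiB
row_write_mib = 50 * tile_mib           # a full row crosses 50 tiles
col_write_mib = 20 * tile_mib           # a full column crosses 20 tiles

# 105 + 100 row writes, 100 + 100 column writes
total = (105 + 100) * row_write_mib + (100 + 100) * col_write_mib
print(total)                            # ~108719 MiB, vs. 108741 MiB observed
```

The small remainder (~22 MiB) would presumably be fragment metadata, so the disk usage looks fully explained by one padded fragment per write that consolidate/vacuum never reclaims.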