I have what seems like a very straightforward use case for TileDB: I'd like to store a 2D matrix of gene expression data. Typically I would update by sample and read by gene, but the opposite access pattern is also possible. I am experimenting with a 20k x 50k dense array of floats, and I'm running into several issues that make me think I'm doing something wrong.
Informal testing suggested that 1000 x 1000 tiles give the best performance tradeoff, so that’s what I went with.
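Concretely, the two access patterns I care about look roughly like this (a sketch against the 20k x 50k array created in the testing code further down; the index variables are just placeholders):

import numpy as np
import tiledb

uri = 'test_tiledb_20k_by_50k'    # the array created in the testing code below
sample_idx, gene_idx = 0, 0       # placeholder indices

# typical update: write one sample's expression values across all genes
with tiledb.open(uri, mode='w') as A:
    A[sample_idx, :] = np.random.rand(50000)

# typical query: read one gene's values across all samples
with tiledb.open(uri, mode='r') as A:
    gene_values = A[:, gene_idx]["a"]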
- Disk space. If I write the entire array at once (this matrix is still small enough to keep in memory), the array takes ~8 GB on disk, exactly as expected (that one-shot write is sketched right after this list). However, if I write single rows or columns at a time, the size quickly balloons (e.g. ~40 GB after 100 single-row writes, i.e. roughly 400 MB on disk per write for ~400 KB of actual data), and it stays that way even after calling `.consolidate()` and `.vacuum()`.
- Parallel writes. If I use multiprocessing to write from 25 concurrent processes, I get about a 4-5x speedup compared to writing in serial. Surprisingly, this seems to be true regardless of whether the writes land in the same tiles or are spread across tiles! If I look at the running processes with `top`, most of them are in the "D" (waiting for disk access) state.
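For reference, the "write everything at once" case from the first bullet looks like this (same `uri` and dimensions as in the testing code below), and it ends up at the expected ~8 GB on disk:

X = np.random.rand(n_samples, n_genes)   # full 20k x 50k float64 matrix (~7.5 GiB in memory)
with tiledb.DenseArray(uri, mode='w') as A:
    A[:, :] = X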
Here’s some of my testing code.
import tiledb
import numpy as np
import os
n_samples=20000
n_genes=50000
uri='test_tiledb_20k_by_50k'
dom = tiledb.Domain(
    tiledb.Dim(name="samples", domain=(0, n_samples - 1), tile=1000, dtype=np.int32),
    tiledb.Dim(name="genes", domain=(0, n_genes - 1), tile=1000, dtype=np.int32),
)
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=False,
    attrs=[tiledb.Attr(name="a", dtype=np.float64)],
)
tiledb.DenseArray.create(uri, schema, overwrite=True)
def get_size(start_path='.'):
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            # skip if it is a symbolic link
            if not os.path.islink(fp):
                total_size += os.path.getsize(fp)
    return total_size
get_size(uri)*1.0/(2**20)   # on-disk size in MiB
0.00017261505126953125
a=tiledb.array.DenseArray(uri, mode='w')
%timeit -n 5 -r 1 np.random.rand(n_genes)
434 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 5 loops each)
%timeit -n 5 -r 1 a[np.random.randint(n_samples),:]=np.random.rand(n_genes)
1.35 s ± 0 ns per loop (mean ± std. dev. of 1 run, 5 loops each)
get_size(uri)*1.0/(2**20)
1907.7264785766602
L1=np.arange(100)   # 100 consecutive indices, all within the same 1000-wide tile
%timeit -n 1 -r 1 for i in L1: a[i,:]=np.random.rand(n_genes)
2min 14s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
get_size(uri)*1.0/(2**20)
40062.25247859955
a.consolidate()
tiledb.vacuum(uri)
get_size(uri)*1.0/(2**20)
40062.25247859955
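In case I'm simply misusing the consolidation API (for example because `a` is still open for writing, or because only the default consolidation mode runs), this is the fuller sequence I would expect to reclaim the space; the config keys reflect my reading of the docs, so treat it as a sketch rather than something I've verified:

a.close()
cfg = tiledb.Config({"sm.consolidation.mode": "fragments",
                     "sm.vacuum.mode": "fragments"})
tiledb.consolidate(uri, config=cfg)
tiledb.vacuum(uri, config=cfg)
a = tiledb.DenseArray(uri, mode='w')   # reopen for the writes below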
import multiprocessing as mp
from multiprocessing import Pool

# with the default start method (fork on Linux), the worker processes
# inherit the globally opened write handle `a`
def write(i):
    a[:, i] = np.random.rand(n_samples)
p = Pool(25)
%timeit -n 1 -r 1 p.map(write,L1)
24.6 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
L2=np.random.randint(n_samples,size=100)   # 100 random indices, spread across many tiles
%timeit -n 1 -r 1 for i in L2: a[i,:]=np.random.rand(n_genes)
2min 14s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%timeit -n 1 -r 1 p.map(write,L2)
24.3 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
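I'm not sure whether sharing the globally opened handle `a` across forked worker processes is itself part of the problem; a variant where each worker opens the array itself would look roughly like this (a sketch, not something I've benchmarked):

def write_local(i):
    # hypothetical variant: each worker opens its own handle instead of
    # relying on the global `a` inherited from the parent process
    with tiledb.DenseArray(uri, mode='w') as arr:
        arr[:, i] = np.random.rand(n_samples)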
get_size(uri)*1.0/(2**20)
108740.84340190887
a.consolidate()
tiledb.vacuum(uri)
get_size(uri)*1.0/(2**20)
108740.84340190887
(8.0*n_genes*n_samples)/(2**20)   # expected size of the full float64 matrix, in MiB
7629.39453125