Getting familiar with TileDB - I am at the point where most things are working, but I’m trying to figure out if I can do things more efficiently / faster.
The current pain point is write speed. I am using the Python interface, and I am writing to the following schema:
dom = tiledb.Domain(
    tiledb.Dim(name="positions", domain=(0, total_size), tile=10000, dtype=np.uint32),
    tiledb.Dim(name="samples", domain=(0, len(signal_files)), tile=10, dtype=np.uint32),
)
filters = [tiledb.LZ4Filter()]
schema = tiledb.ArraySchema(
    domain=dom, sparse=False, cell_order='C', tile_order='C',
    attrs=[tiledb.Attr(name="value", dtype=np.float32, filters=filters)],
)
The total array is 3 billion positions (human genome) for dim1, and an expected 100-1000 for dim2 (samples).
The data itself is dense - signal of various assays along the chromosomes.
I’ve read most of the documentation (which is very nice, though tiles and the global order of cells take some effort to make sense of). Does the above look optimal? Typical reads query a few regions (around 500 cells long) along dim1, retrieving all samples from dim2 at once in most cases.
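To put rough numbers on that read pattern, here is a back-of-the-envelope sketch. It assumes float32 cells (4 bytes), the 10000 x 10 tile extents from the schema above, and that whole tiles are the unit of I/O and decompression. The sample count of 100 and the query start position are hypothetical, and the helper names are mine, not part of the TileDB API.

```python
import math

CELL_BYTES = 4                     # float32
TILE_POS, TILE_SAMP = 10000, 10    # tile extents from the schema above


def tiles_touched(start, length, n_samples, tile_pos=TILE_POS, tile_samp=TILE_SAMP):
    """Number of tiles overlapped by a read of `length` positions across all samples."""
    first = start // tile_pos
    last = (start + length - 1) // tile_pos
    pos_tiles = last - first + 1
    samp_tiles = math.ceil(n_samples / tile_samp)
    return pos_tiles * samp_tiles


def read_amplification(start, length, n_samples):
    """Uncompressed bytes in the touched tiles divided by the bytes actually requested."""
    tile_bytes = TILE_POS * TILE_SAMP * CELL_BYTES
    fetched = tiles_touched(start, length, n_samples) * tile_bytes
    requested = length * n_samples * CELL_BYTES
    return fetched / requested


n_samples = 100  # hypothetical sample count
print(tiles_touched(123_456, 500, n_samples))       # tiles overlapped by one query
print(read_amplification(123_456, 500, n_samples))  # fetched / requested ratio
```

Under these assumptions, a 500-position query over 100 samples overlaps 10 tiles and pulls in roughly 20 times the data it asked for, which is the trade-off I am trying to weigh against write speed.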
I profiled the writes and most of the time is spent inside TileDB. I am sending data in row-major order (the NumPy default), which should fit the schema exactly, and I’m writing multiple exact tiles at once (again in schema order).
For the writing itself I am doing:
A[start_dim1:stop_dim1, start_dim2:stop_dim2] = data
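For reference, this is roughly how the tile-aligned ranges for that slice assignment can be generated. It is a minimal sketch: `tile_aligned_chunks` is a hypothetical helper of mine, and the sizes are toy values much smaller than the real array.

```python
def tile_aligned_chunks(total, tile, tiles_per_chunk):
    """Yield half-open (start, stop) ranges that always begin on a tile boundary."""
    step = tile * tiles_per_chunk
    for start in range(0, total, step):
        # The last chunk may be shorter if `total` is not a multiple of `step`.
        yield start, min(start + step, total)


# Hypothetical sizes for illustration only.
for start_dim1, stop_dim1 in tile_aligned_chunks(total=45_000, tile=10_000, tiles_per_chunk=2):
    # In the real script the write would go here:
    # A[start_dim1:stop_dim1, start_dim2:stop_dim2] = data
    print(start_dim1, stop_dim1)
```

Each range starts on a multiple of the tile extent, so every write covers whole tiles except possibly the final one.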
The writing is very slow (~20 hours for the full array, which ends up at around 50GB compressed), and I’m wondering if there’s a way to speed it up. It’s only using one of my cores, and hard drive activity only spikes about once every 30s for a write. I don’t really know what it’s doing in the meantime: it takes about the same time with or without compression filters, so it’s not that either. My hypothesis is that it’s doing a lot of internal reordering to fit the global order.
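To make the slowness concrete, here is the effective throughput implied by those figures. The 50GB and 20 hours come from above; the sample count of 100 used for the uncompressed estimate is an assumption (the low end of the expected range), and `throughput_mb_s` is just a helper for this calculation.

```python
def throughput_mb_s(n_bytes, seconds):
    """Average throughput in MB/s (1 MB = 1e6 bytes)."""
    return n_bytes / seconds / 1e6


elapsed_s = 20 * 3600                 # ~20 hours
compressed_bytes = 50e9               # ~50 GB on disk

n_positions = 3_000_000_000           # human genome
n_samples = 100                       # assumed sample count
raw_bytes = n_positions * n_samples * 4   # float32 cells

print(f"compressed:   {throughput_mb_s(compressed_bytes, elapsed_s):.2f} MB/s")
print(f"uncompressed: {throughput_mb_s(raw_bytes, elapsed_s):.2f} MB/s")
```

Either way the rate is far below what a single disk can sustain, which fits the observation that the drive sits idle most of the time.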
Something I observed is that if I try to make even smaller tiles (1000 for example) things get even slower.
The docs say that the fastest writing is to provide a buffer that’s directly in global order, but I don’t think the Python interface has this option anywhere.
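My understanding of global order for this schema (row-major tiles, row-major cells within each tile) is sketched below in plain Python. It assumes the array dimensions are exact multiples of the tile extents, and `global_order_index` is a hypothetical helper of mine, not a TileDB function.

```python
def global_order_index(i, j, n_cols, tile_rows, tile_cols):
    """Position of cell (i, j) in global order: tiles row-major, cells row-major per tile."""
    tiles_per_row = n_cols // tile_cols      # assumes n_cols is a multiple of tile_cols
    tile_r, in_r = divmod(i, tile_rows)      # which tile row, and offset inside it
    tile_c, in_c = divmod(j, tile_cols)      # which tile column, and offset inside it
    tile_id = tile_r * tiles_per_row + tile_c
    return tile_id * tile_rows * tile_cols + in_r * tile_cols + in_c


# Toy example: a 4 x 4 array with 2 x 2 tiles.
cells = [(i, j) for i in range(4) for j in range(4)]   # plain row-major order
order = sorted(cells, key=lambda c: global_order_index(c[0], c[1], 4, 2, 2))
print(order[:8])
```

In the toy example the first tile's four cells come out before any cell of the second tile, so a plain row-major buffer spanning several tiles along dim2 is not already in global order and would need this kind of rearrangement.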
Tips appreciated