A few questions regarding efficient writes

Getting familiar with TileDB - I am at the point where most things are working, but I’m trying to figure out if I can do things more efficiently / faster.

The current pain point is write speed. I am using the Python interface, and I am writing to the following schema:

import numpy as np
import tiledb

dom = tiledb.Domain(
    tiledb.Dim(name="positions", domain=(0, total_size), tile=10000, dtype=np.uint32),
    tiledb.Dim(name="samples", domain=(0, len(signal_files)), tile=10, dtype=np.uint32))

filters = [tiledb.LZ4Filter()]

schema = tiledb.ArraySchema(domain=dom, sparse=False, cell_order='C', tile_order='C',
                            attrs=[tiledb.Attr(name="value", dtype=np.float32, filters=filters)])
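
The array itself is then created from this schema with (the "signal_array" URI is just a placeholder):

tiledb.DenseArray.create("signal_array", schema)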

The total array is 3 billion positions (human genome) for dim1, and an expected 100-1000 for dim2 (samples).

The data itself is dense - signal of various assays along the chromosomes.

I’ve read most of the documentation (which is very nice, but the topic is relatively complex, with tiles and making sense of the global order of cells) - does the above look optimal? Typical reads query a few regions (around 500 cells long) along dim1, retrieving all samples along dim2 at once in most cases.
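
For context, a typical read looks roughly like this (the region bounds and the "signal_array" URI are just placeholders):

import tiledb

with tiledb.open("signal_array", mode="r") as A:
    # one ~500-cell region along positions, all samples at once
    region = A[1_000_000:1_000_500, :]
    values = region["value"]   # float32 array of shape (500, n_samples)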

I profiled the writes and most of the time is spent by TileDB - I am sending data in row-order (default for numpy) which should fit the schema exactly, and I’m writing multiple exact tiles at once (again in schema order).

For the writing itself I am doing:

A[start_dim1:stop_dim1, start_dim2:stop_dim2] = data
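
(In context, that assignment runs with the array opened in write mode, roughly like this - "signal_array" is a placeholder URI and the bounds come from my chunking loop:)

import tiledb

with tiledb.open("signal_array", mode="w") as A:
    # data is a C-ordered float32 numpy array matching the slice shape
    A[start_dim1:stop_dim1, start_dim2:stop_dim2] = data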

The writing is very slow (it takes ~20 hours for the full thing, which ends up being around 50GB compressed) and I’m wondering if there’s a way to speed it up - it’s only using one of my cores, and the activity on the hard drive only spikes once every ~30s for a write. I don’t really know what it’s doing in the meantime - with or without compression filters the time is mostly the same, so it’s not that either. I’m hypothesizing that it’s doing a lot of internal reordering to fit the global order.

Something I observed is that if I try to make even smaller tiles (1000 for example) things get even slower.

The docs say that the fastest writing is to provide a buffer that’s directly in global order, but I don’t think the Python interface has this option anywhere.

Tips appreciated! 🙂

Yeah, so the tile size on dim1 apparently has a big impact - making it smaller (1000 instead of 10000) negatively impacts write speed, but making it larger (100000) makes writes much faster.

I guess this will, however, negatively affect read efficiency - the target intervals on that axis are 100-500 cells long.

So large (100000 cells) tiles are writing much faster than smaller (1000 cells) - 12 seconds vs 190 seconds for the same amount of data (sending 500 MB worth of tiles at once).
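
For reference, here’s roughly how I’m timing this - the only difference between the fast and slow runs is the dim1 tile extent in the schema (the sizes are illustrative and "signal_array" is a placeholder URI):

import time
import numpy as np
import tiledb

n_positions, n_samples = 1_000_000, 130              # ~520 MB of float32
data = np.zeros((n_positions, n_samples), dtype=np.float32)

t0 = time.time()
with tiledb.open("signal_array", mode="w") as A:      # dim1 tile extent = 1000 (slow) or 100000 (fast)
    A[0:n_positions, 0:n_samples] = data
print(f"wrote {data.nbytes / 1e6:.0f} MB in {time.time() - t0:.1f} s")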

Here’s the output of the performance logs for 1 write using a tile size of 1000 (my actual target, but terrible write performance).

==== WRITE ====

  • Number of write queries: 1
  • Number of attributes written: 1
    • Number of fixed-sized attributes written: 1
  • Number of bytes written: 6511200 bytes (0.00606403 GB)
  • Number of write operations: 13027
  • Number of bytes filtered: 521248104 bytes (0.48545 GB)
  • Filtering deflation factor: 80.0541x
  • Total metadata written: 38354 bytes (3.57199e-05 GB)
    • Fragment metadata footer: 206 bytes (1.91852e-07 GB)
    • R-tree: 99 bytes (9.22009e-08 GB)
    • Fixed-sized tile offsets: 33113 bytes (3.08389e-05 GB)
    • Var-sized tile offsets: 2468 bytes (2.2985e-06 GB)
    • Var-sized tile sizes: 2468 bytes (2.2985e-06 GB)
  • Write time: 179.897 secs
    • Time to split the coordinates buffer: 6e-07 secs
    • Time to check out-of-bounds coordinates: 5e-07 secs
    • Time to initialize dense cell range iterators: 0.0028157 secs
    • Time to compute cell ranges: 1.86062 secs
    • Time to prepare and filter tiles: 1.14174 secs
    • Time to write tiles: 176.45 secs
    • Time to write fragment metadata: 0.418166 secs
  • Time to finalize write query: 6e-07 secs

All the time is spent ‘writing tiles’ - 6511200 bytes worth after compression. How is this possible?

Thanks Michael! We are looking into this, we’ll get back with an update very soon.

Thanks - same behavior (and the same timing) on a slow HDD and a fast SSD. I have the impression that TileDB is doing some heavy reordering of the numpy arrays that I send, and maybe if I could just provide cells in global order it would help, but there doesn’t seem to be a way to do that with the Python interface.

I wanted to make sure I was writing along tile boundaries - however there are only statistics for tiles you read, not for tiles you write (more write statistics would be very useful!).

So instead I used the read statistics afterwards, on the same intervals I’m using for writing, and can confirm that my writes align exactly with tile boundaries.
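
(Roughly what I did to check, with stats around a read of exactly the interval I write - "signal_array" is a placeholder URI:)

import tiledb

tiledb.stats_enable()
with tiledb.open("signal_array", mode="r") as A:
    _ = A[0:1000, 0:10]        # same interval as a single write, exactly one (1000 x 10) tile
tiledb.stats_dump()            # the read stats show how many tiles the query touched
tiledb.stats_reset()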

However, I noted something that might be odd - I tried writing literally one tile (1000 x 10; I’m hitting only one tile, confirmed by the above method) and the write statistics report 28 ‘write operations’ to write 5873 bytes (after compression). I guess a few of those are metadata related.

Yeah, this seems like a bug affecting performance; it is next on our list. We’ll definitely enhance the write stats as well.

Good, let me know if there’s anything you need!

For now it’d be helpful to know total_size and len(signal_files) so that we can create the same 2D array (its contents probably don’t matter right now for debugging performance in this use case).

3.2 billion cells for total size, 130 for signal_files at the moment, but would go bigger.

I tried long tiles (not chunking dim2), but it’s the same so far.

No worries, we’ll try various configurations. We’ll post an update soon. Thanks again for bringing up the issue.

Hi @michael-imbeault

Sorry for the delay. I’m now looking at this issue. I’ve tried my best to reproduce your write scenario but am unable to get anywhere near the slow write times that you observed. I tried both TileDB-Py v0.6.1 (released in late May) and the current dev branch. Additionally, I tried both a local array and an S3 array.

Before I ask about the write slowness, I do want to quickly answer your earlier question: “6511200 bytes worth after compression. How is this possible?”. It appears that the data you are writing is extremely compressible, as shown by the “Filtering deflation factor: 80.0541x”.

So my questions for you are:

  1. Do you recall which backend you used? local, S3, Azure, etc.
  2. Do you believe that your data was extremely compressible? An 80x deflation factor would be something like mostly zeros or one repeating sequence.

Either way, the ~170s write time you observed was/is definitely a bug. I am observing about ~1s writes with the following script on a local array. It similarly writes 500MB of highly compressible data with a nearly-identical schema. https://www.dropbox.com/s/bjlskno0j9opgux/write_test.py
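
In case the link goes stale, a rough equivalent (not the exact file) looks like the following - it creates a nearly-identical schema and writes ~500 MB of zeros with stats enabled; the URI and sizes are illustrative:

import numpy as np
import tiledb

uri = "write_test_array"
n_positions, n_samples = 1_000_000, 130            # ~520 MB of float32

dom = tiledb.Domain(
    tiledb.Dim(name="positions", domain=(0, n_positions - 1), tile=1000, dtype=np.uint32),
    tiledb.Dim(name="samples", domain=(0, n_samples - 1), tile=10, dtype=np.uint32))
schema = tiledb.ArraySchema(
    domain=dom, sparse=False, cell_order='C', tile_order='C',
    attrs=[tiledb.Attr(name="value", dtype=np.float32, filters=[tiledb.LZ4Filter()])])
tiledb.DenseArray.create(uri, schema)

tiledb.stats_enable()
with tiledb.open(uri, mode="w") as A:
    A[0:n_positions, 0:n_samples] = np.zeros((n_positions, n_samples), dtype=np.float32)
tiledb.stats_dump()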


That’s all local (WSL2 Ubuntu 20.04 on Windows 10 - more on that in a sec). Also, my question was more about how it can take so much time to write so few bytes, not about the compression ratio.

Ran your script, same problem (on TileDB-Py 0.6.2):

==== WRITE ====

  • Number of write queries: 1
  • Number of attributes written: 1
    • Number of fixed-sized attributes written: 1
  • Number of bytes written: 2718018 bytes (0.00253135 GB)
  • Number of write operations: 13029
  • Number of bytes filtered: 521248289 bytes (0.48545 GB)
  • Filtering deflation factor: 191.775x
  • Total metadata written: 40012 bytes (3.72641e-05 GB)
    • Array schema: 185 bytes (1.72295e-07 GB)
    • Fragment metadata footer: 206 bytes (1.91852e-07 GB)
    • R-tree: 99 bytes (9.22009e-08 GB)
    • Fixed-sized tile offsets: 34586 bytes (3.22107e-05 GB)
    • Var-sized tile offsets: 2468 bytes (2.2985e-06 GB)
    • Var-sized tile sizes: 2468 bytes (2.2985e-06 GB)
  • Write time: 102.972 secs
    • Time to split the coordinates buffer: 1.7e-06 secs
    • Time to check out-of-bounds coordinates: 5e-07 secs
    • Time to initialize dense cell range iterators: 0.0115608 secs
    • Time to compute cell ranges: 2.09711 secs
    • Time to prepare and filter tiles: 1.049 secs
    • Time to write tiles: 99.2879 secs
    • Time to write fragment metadata: 0.470067 secs
  • Time to finalize write query: 5e-07 secs

Updated to 0.6.6 - same problem.

Finally, I tested writing the array locally on the WSL2 Ubuntu ‘side’ rather than on my Windows filesystem - now it runs in 3s.

This is extremely puzzling. It’s the only library I’ve encountered that pays such a hefty time cost when writing to my Windows filesystem through WSL.

When the bug is present, the run time scales linearly with dim_positions_size (10x smaller, 10x shorter run time).


So the “Number of write operations” seems to be the source of the problem - I don’t know why TileDB is doing 13k writes for ~2.7 MB of data - it seems that every write pays an inefficiency cost going through the WSL2 layer, a known issue (https://github.com/microsoft/WSL/issues/4197).

Especially since the writes are designed to be contiguous, targeting one block - why so many? The count scales down linearly if I reduce dim_positions_size (i.e. write less data along the dimension with the small tile extent).

Couldn’t TileDB assemble the data to be written up front and issue only one write per block of contiguous tiles?

So obviously that’s not a huge problem as the production environment is Unix - but using WSL2 as a dev environment is getting quite popular, and many people might discover TileDB, test it using such a platform, and conclude it is too slow.

Hi @michael-imbeault, just a clarification on this. Every write goes through our VFS class and then branches out based on the backend (resolved via the array URI). For the cloud object stores (S3, GCS and Azure Blob Storage), TileDB caches the written data in sufficiently large buffers (with configurable size) and periodically flushes them with parallel multi-part uploads. For posix, win and hdfs, we don’t do any caching and instead rely on the filesystem for that. We had not seen a performance issue here until you brought up WSL, which is interesting and we will indeed look into optimizing for it.

As a final note, the reported stat counter for the write operations is at the VFS level. Based on the above, it is accurate for posix, win and hdfs, but for the object stores the number of actual IO operations will be (much) smaller. We will add more detailed counters for those cases soon.
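
(For reference, the buffer size for the S3 multi-part uploads mentioned above is an ordinary config parameter; a minimal sketch, with an illustrative value:)

import tiledb

cfg = tiledb.Config({"vfs.s3.multipart_part_size": str(50 * 1024 ** 2)})  # bytes per multi-part buffer
ctx = tiledb.Ctx(cfg)
# pass ctx when creating/opening S3-backed arrays, e.g. tiledb.open(uri, "w", ctx=ctx)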

Thanks!

@michael-imbeault I suspect the issue is not necessarily the number of writes, but the number of times that the file is opened and closed. In our backend implementation for POSIX file systems, we perform an open and close for nearly every write. Without explicitly syncing/flushing the write, these writes are typically buffered in memory by the kernel. If WSL is flushing more aggressively (perhaps on every call to close), that could explain the degraded performance on WSL.

Similar to how we perform our own write caching for our cloud backends, we could do the same for our POSIX backend. Hopefully this would improve performance on WSL. Thanks!
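
(A toy illustration of that hypothesis, outside of TileDB entirely: the same few MB written once versus as ~13k open/write/close cycles - on a WSL2 /mnt/c path the second variant should be dramatically slower:)

import os, time

payload = os.urandom(500)                       # roughly one write operation's worth of data (~500 bytes, per the stats above)

t0 = time.time()
with open("one_shot.bin", "wb") as f:           # single open, one buffered write
    f.write(payload * 13_000)
print("1 open/write/close:", time.time() - t0)

t0 = time.time()
for _ in range(13_000):                         # open/write/close per chunk
    with open("many_writes.bin", "ab") as f:
        f.write(payload)
print("13k open/write/close:", time.time() - t0)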


Makes perfect sense - I’ll keep an eye out for a fix, but in the meantime I have a workaround (just enlarging my VHD and keeping everything local there).