Hi, we’ve recently moved to TileDB for some of our internal storage, which lives on GCS and is read (and written) by a Kubernetes cluster running in GKE. Since migrating to TileDB, we’ve consistently seen a much higher amount of network traffic going from GCS into our processing cluster. Setting “vfs.min_batch_size” and “vfs.min_batch_gap” to 0 made quite a difference, but that’s as far as we’ve got; other tunables had no noticeable effect on the network traffic. In the end, the network traffic we see is consistently over 2x the number of bytes read as reported by TileDB’s stats (and what we’d expect, given the size of the requested read and the average compression ratio of our data).
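For reference, a minimal sketch of how those two settings are applied (the same configuration is used in the full example further down):

import tiledb

# With both set to 0, VFS should no longer coalesce nearby reads into larger batches.
context = tiledb.Ctx({
    "vfs.min_batch_size": 0,
    "vfs.min_batch_gap": 0,
})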
Below you’ll find a minimal reproducible example (Python) that reads part of one of our arrays (for the sake of this example, I’ve copied it to a separate bucket with public access). It reads 5000x5000 elements of a dense float32 array and reports both the number of bytes read according to TileDB’s stats and the number of bytes received on the NIC (make sure you shut down all other network-using apps before running it, or run it in a Docker container, so other network activity doesn’t inflate the measurement).
Clearly, something is not as efficient as it could be. Of course there is some overhead: headers, blob listings, etc. But I wouldn’t expect it to be this much; with our previous storage format, the overhead was in the order of 10-20%.
Any ideas on how we could lower the amount of network traffic caused by TileDB reads?
Minimal reproducible example:
from humanize import naturalsize
import tiledb
import psutil
array_uri = 'gcs://s11-dev-vincent-tiledb-test-public/02xy6CpdVNT4N1uV5bNT63wkDXYV-O6c0EJDAJSMzBc=_S1A_IW_GRDH_1SDV_20200922T215732_20200922T215757_034477_040306_79C6_1.tiledb'
uly = 1382334  # upper-left row (y) index of the read
ulx = 4932167  # upper-left column (x) index of the read
height = 5000
width = 5000
context = tiledb.Ctx(
    {
        "vfs.min_batch_size": 0,
        "vfs.min_batch_gap": 0,
    }
)
def get_net_traffic(nic='eno1'):  # change the NIC name if yours is not eno1, e.g. eth0
    return psutil.net_io_counters(pernic=True)[nic].bytes_recv
tiledb.stats_enable()
tiledb.stats_reset()
traffic_0 = get_net_traffic()
with tiledb.DenseArray(array_uri, mode='r', ctx=context) as tdb:
    data = tdb[uly:uly + height, ulx:ulx + width]
traffic_1 = get_net_traffic()
stats = tiledb.stats_dump(json=True)  # depending on your tiledb-py version, you may need print_out=False to get the stats returned instead of printed
tiledb.stats_disable()
tiledb_data_bytes_read = stats['READ_BYTE_NUM']  # attribute (tile) data bytes
tiledb_meta_bytes_read = (  # array/fragment metadata read alongside the data
    stats['READ_ARRAY_SCHEMA_SIZE'] +
    stats['READ_FRAG_META_SIZE'] +
    stats['READ_RTREE_SIZE'] +
    stats['READ_TILE_OFFSETS_SIZE']
)
tiledb_traffic = tiledb_data_bytes_read + tiledb_meta_bytes_read
traffic = traffic_1 - traffic_0
print(f'bytes received on the nic: {traffic} [{naturalsize(traffic)}]')
print(f'bytes read according to tiledb stats: {tiledb_traffic} [{naturalsize(tiledb_traffic)}]')
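To get a feel for the unavoidable per-request overhead, a baseline measurement could look like the sketch below: a single raw ranged GET against the same bucket, measured the same way. This is just an illustration under assumptions — the object name is hypothetical (substitute any real object from inside the array directory), and listing requires the bucket to allow public listing:

from google.cloud import storage
import psutil

def nic_bytes(nic='eno1'):  # same NIC counter as in the example above
    return psutil.net_io_counters(pernic=True)[nic].bytes_recv

client = storage.Client.create_anonymous_client()  # the bucket is public
bucket = client.bucket('s11-dev-vincent-tiledb-test-public')

# Count the objects that make up the array, to gauge how many requests a read might touch.
num_blobs = sum(1 for _ in client.list_blobs(bucket))
print(f'objects in bucket: {num_blobs}')

# Hypothetical object name; replace with a real attribute file from the array directory.
blob = bucket.blob('my-array/fragment/a0.tdb')

before = nic_bytes()
payload = blob.download_as_bytes(start=0, end=10 * 1024 * 1024 - 1)  # 10 MiB ranged GET
after = nic_bytes()
print(f'payload: {len(payload)}, bytes on nic: {after - before}')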