Excessively high network traffic for GCS reads

Hi, we’ve recently moved some of our internal storage to TileDB. The data lives on GCS and is created and read by a Kubernetes cluster running in GKE. Since migrating to TileDB, we’ve consistently seen much more network traffic flowing from GCS into our processing cluster. Setting “vfs.min_batch_size” and “vfs.min_batch_gap” to 0 made quite a difference, but that’s as far as we’ve got; other tunables had no noticeable effect on the network traffic. In the end, the network traffic we see is consistently more than 2x the number of bytes read as reported by TileDB’s stats (and more than we’d expect, given the size of the requested read and the average compression ratio of our data).
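For completeness, here is a minimal sketch of how we apply those settings and verify that the context actually picks them up; the extra “vfs.min_parallel_size” key at the end is just another VFS batching-related parameter printed for reference, not something from the example below:

import tiledb

# Build a config with the batching parameters mentioned above and print the
# values the context actually ends up with, to confirm the overrides took effect.
config = tiledb.Config(
	{
		"vfs.min_batch_size": 0,
		"vfs.min_batch_gap": 0,
	}
)
ctx = tiledb.Ctx(config)
for key in ("vfs.min_batch_size", "vfs.min_batch_gap", "vfs.min_parallel_size"):
	print(key, "=", ctx.config()[key])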

Below you’ll find a minimal reproducible example (Python) that reads part of one of our arrays (for the sake of this example, I’ve copied it to a separate bucket with public access). It reads 5000x5000 elements of a dense float32 array and reports both the number of bytes read according to TileDB’s stats and the number of bytes received on the NIC (make sure you shut down all other network-using apps before running it, or run it in a Docker container, so other traffic doesn’t inflate the byte count).

Clearly, something is not as efficient as it could be. Of course there is some overhead (headers, blob listings, etc.), but I wouldn’t expect it to be this much. With our previous storage format, the overhead was on the order of 10-20%.

Any idea how we could lower the amount of network traffic caused by TileDB reads?

Minimal reproducible example:

from humanize import naturalsize
import tiledb
import psutil

array_uri = 'gcs://s11-dev-vincent-tiledb-test-public/02xy6CpdVNT4N1uV5bNT63wkDXYV-O6c0EJDAJSMzBc=_S1A_IW_GRDH_1SDV_20200922T215732_20200922T215757_034477_040306_79C6_1.tiledb'

uly = 1382334
ulx = 4932167
height = 5000
width = 5000

context = tiledb.Ctx(
	{
		"vfs.min_batch_size": 0,
		"vfs.min_batch_gap": 0,
	}
)


def get_net_traffic(nic='eno1'):  # change the name of the nic if it is not eno1, but e.g. eth0.
	return psutil.net_io_counters(pernic=True)[nic].bytes_recv

tiledb.stats_enable()
tiledb.stats_reset()
traffic_0 = get_net_traffic()

with tiledb.DenseArray(array_uri, mode='r', ctx=context) as tdb:
	data = tdb[uly:uly+height, ulx:ulx+width]

traffic_1 = get_net_traffic()
stats = tiledb.stats_dump(json=True)
tiledb.stats_disable()

tiledb_data_bytes_read = stats['READ_BYTE_NUM']
tiledb_meta_bytes_read = (
	stats['READ_ARRAY_SCHEMA_SIZE'] +
	stats['READ_FRAG_META_SIZE'] +
	stats['READ_RTREE_SIZE'] +
	stats['READ_TILE_OFFSETS_SIZE']
)
tiledb_traffic = tiledb_data_bytes_read + tiledb_meta_bytes_read
traffic = traffic_1 - traffic_0
print(f'total bytes read: {traffic} [{naturalsize(traffic)}]')
print(f'bytes read according to tiledb stats: {tiledb_traffic} [{naturalsize(tiledb_traffic)}]')

We are looking into this. Could you please share the array schema as well? Thanks!

Thanks for looking into this. This is the schema:

import numpy as np
import tiledb

tiledb_filterlist = tiledb.FilterList(
	[
		tiledb.BitShuffleFilter(),
		tiledb.ZstdFilter(),
	]
)
dom = tiledb.Domain(
	tiledb.Dim(
		name="y",
		domain=(0, 2999999),
		tile=128,
		dtype=np.uint64,
	),
	tiledb.Dim(
		name="x",
		domain=(0, 5999999),
		tile=128,
		dtype=np.uint64,
	),
)
attr = tiledb.Attr(dtype=np.float32, filters=tiledb_filterlist)
schema = tiledb.ArraySchema(domain=dom, sparse=False, attrs=[attr])

Context: these arrays store satellite data in a worldwide grid with a cell size of 0.00006 degrees. The first dimension is latitude (y), the second is longitude (x). Most of these arrays have only a very small sub-area of the whole domain filled with data, which is why the offsets in the example above are so large: that’s where the actual satellite image sits.
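For reference, the dimension domains follow directly from that cell size, assuming the grid spans 180 degrees of latitude and 360 degrees of longitude:

# Domain extents implied by a 0.00006-degree cell size
# (assumption: 180 degrees of latitude, 360 degrees of longitude).
cell_size = 0.00006
n_rows = round(180 / cell_size)  # 3_000_000, matching the y domain (0, 2999999)
n_cols = round(360 / cell_size)  # 6_000_000, matching the x domain (0, 5999999)
print(n_rows, n_cols)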

At first glance, the tile extent for both dimensions is too small. I would set tile = 1000 for both dimensions; that can reduce the number of requests and the amount of extra data transferred over the network.
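For illustration, here is a sketch of the same schema with a 1000x1000 tile extent (untested against your data; everything else is copied from your schema above). With 128x128 tiles, a 5000x5000 read touches roughly 40x40 ≈ 1,600 tiles, each of which is fetched more or less independently; with 1000x1000 tiles it is closer to 5x5 to 6x6 tiles, so the per-request overhead shrinks accordingly.

import numpy as np
import tiledb

# Same schema as above, but with a 1000x1000 tile extent (untested assumption).
filters = tiledb.FilterList([tiledb.BitShuffleFilter(), tiledb.ZstdFilter()])
dom = tiledb.Domain(
	tiledb.Dim(name="y", domain=(0, 2999999), tile=1000, dtype=np.uint64),
	tiledb.Dim(name="x", domain=(0, 5999999), tile=1000, dtype=np.uint64),
)
attr = tiledb.Attr(dtype=np.float32, filters=filters)
schema = tiledb.ArraySchema(domain=dom, sparse=False, attrs=[attr])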

Some more questions:

  1. How many fragments are there in the array (i.e., number of .ok files in the array directory)?
  2. Could you please make a raw dump of the stats and paste it here? (See the sketch just below.)
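In case it helps, here is a minimal sketch for both questions, reusing the public array URI and read offsets from your example above (counting .ok files via tiledb.VFS assumes the fragment files live directly in the array directory):

import tiledb

array_uri = 'gcs://s11-dev-vincent-tiledb-test-public/02xy6CpdVNT4N1uV5bNT63wkDXYV-O6c0EJDAJSMzBc=_S1A_IW_GRDH_1SDV_20200922T215732_20200922T215757_034477_040306_79C6_1.tiledb'

# 1. Count fragments via the .ok files in the array directory.
vfs = tiledb.VFS()
ok_files = [path for path in vfs.ls(array_uri) if path.endswith('.ok')]
print('number of fragments:', len(ok_files))

# 2. Raw (non-JSON) stats dump around the same read as in the example above.
tiledb.stats_enable()
tiledb.stats_reset()
with tiledb.DenseArray(array_uri, mode='r') as tdb:
	_ = tdb[1382334:1382334 + 5000, 4932167:4932167 + 5000]
tiledb.stats_dump()  # prints the raw stats to stdout
tiledb.stats_disable()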

Thanks.

Sorry, after looking at the raw stats I realized I was testing on an older set of arrays with a less optimal schema. Let’s forget about this for now; I’ll do some more testing and report back here on whether there is still an issue.