Performance implication of sparse writes on dense arrays

qiuwei · May 11, 2021, 2:33pm

it seems there is huge performance difference between dense writes and sparse writes on dense arrays

import numpy as np
import tiledb
import time

dom = tiledb.Domain(
    tiledb.Dim(
        name="seq",
        domain=(0, 10000000),
        tile=10000,
        dtype=np.int32,
    ),
)
attr = tiledb.Attr(name="", dtype=np.float32)
schema = tiledb.ArraySchema(domain=dom, attrs=[attr])
tiledb.DenseArray.create("test1.tdb", schema)

from numpy.random import default_rng

a = np.random.random(7000000)
rng = default_rng()
numbers = rng.choice(7000000, size=6000000, replace=False)
t0 = time.time()
with tiledb.DenseArray("test1.tdb", "w") as A:
  A[numbers] = a[:6000000]
t1 = time.time()
print(f"sparse writes takes {t1-t0} seconds")

with tiledb.DenseArray("test1.tdb", "r") as A:
  b = A[:7000000]
t2 = time.time()
print(f"read for sparsely written array takes {t2 - t1} seconds")
print(b)

tiledb.DenseArray.create("test2.tdb", schema)
t0 = time.time()
with tiledb.DenseArray("test2.tdb", "w") as A:
  A[:6000000] = a[:6000000]
t1 = time.time()
print(f"dense writes take {t1-t0} seconds")

with tiledb.DenseArray("test2.tdb", "r") as A:
  b = A[:7000000]
t2 = time.time()
print(f"read for densely written array takes {t2 - t1} seconds")
print(b)

Result:

sparse writes takes 1.9559354782104492 seconds
read for sparsely written array take 1.9144158363342285 seconds
[0.72725064        nan 0.7601014  ... 0.15054736 0.8090051  0.11344466]
dense writes take 0.09776735305786133 seconds
read for densely written array takes 0.10718464851379395 seconds
[0.59625846 0.34717634 0.61473554 ...        nan        nan        nan]

Seems there is 20x performance difference for both writing and reading.
is there any documentation on this?

Thanks in advance!

stavros · May 29, 2021, 9:19am

Hi there,

This is rather expected. Writing sparse cells in dense arrays creates “sparse fragments”, which essentially involve sorting. Dense writes are much more efficient (especially after some changes we made in dev) as they don’t materialize the coordinates and “sorting” (i.e., cell shuffling to fit the tiles) is reduced to a linear copy of slabs.

In addition, reading a dense array with sparse fragments is expensive as the algorithm now needs to interleave dense cell slabs (efficient) with sparse points that can “cut” the slabs at random places.

The general recommendation here is that you should either store a dense or a sparse array, and you should not make large sparse writes to dense arrays. In fact, we are contemplating dropping sparse write support to dense arrays (still under internal discussion).

I am overhauling the docs currently, I’ll remember to make a note about this. Thanks for raising this issue!

Topic		Replies	Views
Dense or sparse array for incrementally generated dense data	6	1082	October 21, 2021
R: Optimising Read Performance - Range within a dimension	8	809	April 8, 2020
Tile Dimensions affect the disk space, should it?	1	3212	June 23, 2019
A few questions regarding efficient writes	17	2577	August 20, 2020
Usage help -- disk space, parallel writes	5	856	June 21, 2022

Performance implication of sparse writes on dense arrays

Related topics