Performance implication of sparse writes on dense arrays

It seems there is a huge performance difference between dense writes and sparse writes on dense arrays:

import numpy as np
import tiledb
import time

dom = tiledb.Domain(
    tiledb.Dim(
        name="seq",
        domain=(0, 10000000),
        tile=10000,
        dtype=np.int32,
    ),
)
attr = tiledb.Attr(name="", dtype=np.float32)
schema = tiledb.ArraySchema(domain=dom, attrs=[attr])
tiledb.DenseArray.create("test1.tdb", schema)

from numpy.random import default_rng

a = np.random.random(7000000)
rng = default_rng()
numbers = rng.choice(7000000, size=6000000, replace=False)
t0 = time.time()
with tiledb.DenseArray("test1.tdb", "w") as A:
  A[numbers] = a[:6000000]
t1 = time.time()
print(f"sparse writes takes {t1-t0} seconds")

with tiledb.DenseArray("test1.tdb", "r") as A:
  b = A[:7000000]
t2 = time.time()
print(f"read for sparsely written array takes {t2 - t1} seconds")
print(b)

tiledb.DenseArray.create("test2.tdb", schema)
t0 = time.time()
with tiledb.DenseArray("test2.tdb", "w") as A:
  A[:6000000] = a[:6000000]
t1 = time.time()
print(f"dense writes take {t1-t0} seconds")

with tiledb.DenseArray("test2.tdb", "r") as A:
  b = A[:7000000]
t2 = time.time()
print(f"read for densely written array takes {t2 - t1} seconds")
print(b)

Result:

sparse writes take 1.9559354782104492 seconds
read for sparsely written array takes 1.9144158363342285 seconds
[0.72725064        nan 0.7601014  ... 0.15054736 0.8090051  0.11344466]
dense writes take 0.09776735305786133 seconds
read for densely written array takes 0.10718464851379395 seconds
[0.59625846 0.34717634 0.61473554 ...        nan        nan        nan]

It seems there is roughly a 20x performance difference for both writing and reading.
Is there any documentation on this?

Thanks in advance!


Hi there,

This is rather expected. Writing sparse cells in dense arrays creates “sparse fragments”, which essentially involve sorting. Dense writes are much more efficient (especially after some changes we made in dev), as they don’t materialize the coordinates, and “sorting” (i.e., cell shuffling to fit the tiles) is reduced to a linear copy of slabs.
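
As a quick check, you can inspect the fragments each write produced (a minimal sketch, assuming a TileDB-Py version that exposes tiledb.array_fragments and its per-fragment sparse flags):

import tiledb

# Each of the two writes above created one fragment: the element-wise write
# on test1.tdb should show up as a sparse fragment, the slice write on
# test2.tdb as a dense one.
for uri in ["test1.tdb", "test2.tdb"]:
    frags = tiledb.array_fragments(uri)
    print(uri, "-> sparse flags per fragment:", frags.sparse)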

In addition, reading a dense array with sparse fragments is expensive as the algorithm now needs to interleave dense cell slabs (efficient) with sparse points that can “cut” the slabs at random places.

The general recommendation here is to store either a dense or a sparse array, and to avoid large sparse writes to dense arrays. In fact, we are contemplating dropping sparse write support for dense arrays (still under internal discussion).
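
If the data really is sparse, a sparse array is the better fit. Here is a minimal sketch of the sparse-array equivalent of your first write, reusing the same domain, attribute, and data; the URI "test3.tdb" is just a placeholder, and the write syntax assumes the single-attribute coordinate assignment that TileDB-Py supports for sparse arrays:

import numpy as np
import tiledb
from numpy.random import default_rng

dom = tiledb.Domain(
    tiledb.Dim(name="seq", domain=(0, 10000000), tile=10000, dtype=np.int32),
)
attr = tiledb.Attr(name="", dtype=np.float32)
# sparse=True makes this a sparse array schema
schema = tiledb.ArraySchema(domain=dom, attrs=[attr], sparse=True)
tiledb.SparseArray.create("test3.tdb", schema)

a = np.random.random(7000000)
rng = default_rng()
numbers = rng.choice(7000000, size=6000000, replace=False)

# Coordinates are given explicitly, so the scattered cells become a
# proper sparse fragment instead of fighting the dense tiling.
with tiledb.SparseArray("test3.tdb", "w") as A:
    A[numbers] = a[:6000000]

A subsequent read (e.g. A[:]) should then return the non-empty coordinates together with the attribute values, rather than a dense slab padded with fill values.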

I am currently overhauling the docs and will remember to make a note about this. Thanks for raising this issue!
