Hi M,
Thanks for cooking up a complete and reproducible example; it gives us something to chew over. We will most likely get back to you with a more substantial response in a few days, but here is a brief and more immediate one:
- The TileDB C++ ‘dev’ branch has seen a few changes of late, and you may cut your read time substantially by updating. On my (much smaller) machine I see approximately equal read and write times, between 3 1/2 and 4 minutes each, with your code (plus minor stylistic edits).
- A fairly large performance gain can be had by switching to dense arrays. What we have here is really just one attribute (in column 7) and six conditioning variables.
- You may be able to tune partial-read performance by experimenting with the tile extent when creating the array; maybe try an extent of 100K, 1M, or 10M.
- Compression will likely be beneficial for each attribute as well.
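As a rough illustration of the last two points, here is a minimal sketch rather than a tuned setup. It first counts how many tiles each suggested extent would produce along dim6 of the script below and then, only if the tiledb package is installed, creates a small two-dimensional stand-in schema with explicit tile extents and a ZSTD filter on the attribute. The helper `tilesPerDim`, the scratch URI, the extents passed to `tiledb_dim`, and the compression level are all assumptions chosen for illustration, not recommendations.

```r
## Hypothetical helper: number of tiles along one dimension for a given extent
tilesPerDim <- function(domain, extent) {
    ceiling((domain[2] - domain[1] + 1) / extent)
}

## dim6 of the script spans 1..219988984; try the three suggested extents
extents <- c(1e5, 1e6, 1e7)
nTiles <- tilesPerDim(c(1, 219988984), extents)
print(nTiles)                           # 2200, 220 and 22 tiles

## The schema part needs the tiledb package, so it is guarded
if (requireNamespace("tiledb", quietly = TRUE)) {
    library(tiledb)
    uri <- file.path(tempdir(), "extentSketch")   # assumed scratch location
    if (dir.exists(uri)) unlink(uri, recursive = TRUE)

    ## explicit tile extents (illustrative values) on a two-dimensional stand-in
    d1 <- tiledb_dim("dim1", c(1L, 57602L), tile = 1000L, type = "INT32")
    d2 <- tiledb_dim("dim2", c(1L, 94991975L), tile = 1000000L, type = "INT32")

    ## ZSTD compression on the attribute; level 5 is an arbitrary choice
    zstd <- tiledb_filter("ZSTD")
    tiledb_filter_set_option(zstd, "COMPRESSION_LEVEL", 5)
    att <- tiledb_attr("att", type = "INT32",
                       filter_list = tiledb_filter_list(c(zstd)))

    schema <- tiledb_array_schema(tiledb_domain(dims = c(d1, d2)),
                                  attrs = c(att), sparse = TRUE)
    tiledb_array_create(uri, schema)
}
```

Which extent works best depends on the shape of the partial reads, so it is worth timing a few representative slices for each candidate.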
Example timings (in seconds) from my six-core machine:
                         user   system  elapsed
fread                  14.666    1.838    2.988
tiledb write sparse  1841.328   15.930  216.568
tiledb read sparse   2287.763   20.665  241.090
tiledb write dense     18.722    3.080    4.296
tiledb read dense       5.156   10.290   10.267
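To put the sparse and dense numbers side by side, the elapsed times above can be turned into ratios. This is only arithmetic on the figures already quoted; the variable names are made up for the example.

```r
## Elapsed times in seconds, copied from the timings above
elapsed <- c(writeSparse = 216.568, readSparse = 241.090,
             writeDense  =   4.296, readDense  =  10.267)

## How many times faster the dense variant was on this machine
speedup <- c(write = elapsed[["writeSparse"]] / elapsed[["writeDense"]],
             read  = elapsed[["readSparse"]]  / elapsed[["readDense"]])
print(round(speedup, 1))   # write about 50.4, read about 23.5
```

These ratios come from a single run on one machine, so treat them as indicative only.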
I am including my script below. Please let us know if you have any questions, and we will try to take a closer look at the sparse matrix performance for this multi-dimensional case.
Regards, Dirk
suppressMessages({
    library(tiledb)
    library(data.table)
    tiledb::tiledb_version()
    ## 1 8 0
    invisible(NULL)
})
removeIfFound <- function(arrayname) {
    if (dir.exists(arrayname))
        unlink(arrayname, recursive=TRUE)
    invisible(NULL)
}
readCsvAndWriteTileDB <- function(datafile, arrayname, verbose=FALSE) {
    print(system.time(x <- fread(datafile, header = FALSE, data.table = TRUE)))
    ## "data.table reads file in about 2 seconds"
    ## about 3 secs for me

    ## check config
    if (verbose) {
        ctx <- tiledb_ctx()
        cfgptr <- tiledb:::libtiledb_ctx_config(ctx@ptr)
        print(tiledb:::libtiledb_config_vector(cfgptr))
    }

    ## six integer dimensions, one per conditioning column V1..V6
    dim1 <- tiledb_dim("dim1", c(1L, 57602L), type = "INT32")
    dim2 <- tiledb_dim("dim2", c(1L, 94991975L), type = "INT32")
    dim3 <- tiledb_dim("dim3", c(0L, 1L), type = "INT32")
    dim4 <- tiledb_dim("dim4", c(0L, 1L), type = "INT32")
    dim5 <- tiledb_dim("dim5", c(1L, 1000000L), type = "INT32")
    dim6 <- tiledb_dim("dim6", c(1L, 219988984L), type = "INT32")
    dom <- tiledb_domain(dims = c(dim1, dim2, dim3, dim4, dim5, dim6))
    schema <- tiledb_array_schema(dom, attrs = c(tiledb_attr("att", type = "INT32")), sparse = TRUE)
    tiledb_array_create(arrayname, schema)

    ## write column V7 as the single attribute, indexed by the six dimensions
    A <- tiledb_sparse(uri = arrayname)
    print(system.time(A[x$V1, x$V2, x$V3, x$V4, x$V5, x$V6] <- as.integer(x$V7)))
    ## "about 2 minutes to write the array"
    ## about 3 3/4 minutes for me
    invisible(NULL)
}
readBack <- function(arrayname) {
    B <- tiledb_sparse(uri = arrayname)
    print(system.time(y <- B[1:57602, 1:94991975, 0:1, 0:1, 1:1000000, 1:219988984]))
    ## about 11 minutes to read the whole array in
    ## and just under four minutes for me
    invisible(NULL)
}
readBackDense <- function(arrayname) {
    B <- tiledb_dense(uri = arrayname)
    ## print the timing explicitly; inside a function it is not auto-printed
    print(system.time(y <- B[]))
    ## about 10 seconds for me to read the whole dense array
    invisible(NULL)
}
readCsvAndWriteTileDB_Dense <- function(datafile, arrayname, verbose=FALSE) {
    print(system.time(x <- fread(datafile, header = FALSE, data.table = TRUE)))
    ## "data.table reads file in about 2 seconds"
    ## about 4.6 secs for me
    print(system.time(fromDataFrame(x, arrayname)))
    invisible(NULL)
}
datafile <- "array.csv"
#datafile <- "arraySample.csv"
arrayname <- "arrayCheck"
cfg <- tiledb_config()
cfg["sm.num_writer_threads"] <- 6
cfg["sm.num_reader_threads"] <- 6
cfg["vfs.num_threads"] <- 6
ctx <- tiledb_ctx(cfg, cached=FALSE)
removeIfFound(arrayname)
readCsvAndWriteTileDB(datafile, arrayname)
readBack(arrayname)
removeIfFound(arrayname)
readCsvAndWriteTileDB_Dense(datafile, arrayname)
readBackDense(arrayname)
cat("Done\n")
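The script hard-codes 6 for the three thread settings, which matches a six-core machine. On other hardware, a minimal sketch along the following lines could derive the value from the detected core count instead. The helper `pickThreads` is hypothetical, and the three config keys are simply the ones used in the script above; other TileDB versions may name their threading parameters differently, so check the configuration documentation for the release in use.

```r
## Hypothetical helper: fall back to one thread if detection fails
pickThreads <- function(cores = parallel::detectCores()) {
    if (is.na(cores) || cores < 1) 1L else as.integer(cores)
}

nthreads <- pickThreads()
## keys copied from the script above (assumed valid for the TileDB version used)
threadKeys <- c("sm.num_writer_threads", "sm.num_reader_threads", "vfs.num_threads")

## Only touch TileDB if the package is available
if (requireNamespace("tiledb", quietly = TRUE)) {
    cfg <- tiledb::tiledb_config()
    for (key in threadKeys) cfg[key] <- nthreads
    ctx <- tiledb::tiledb_ctx(cfg)
}
cat("Using", nthreads, "threads\n")
```

Using every detected core is not always fastest, so it may be worth timing a couple of smaller values as well.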