R: Optimising Read Performance - Range within a dimension

I have a sparse array written on disk - 6 dimensions (INT32) with a single attribute. Ranges for each dimension as follows;

1:60k, 1:100m, 1:2, 1:2, 1:1m, 1:10k

Queries work well in most cases: a single value for a dimension, say, returns all results fast, but ranges are much slower. The array is about 6 GB uncompressed, and loading the whole array (a query in the limit, for example) takes a long time: over 10 minutes on a 160 GB RAM, 40-core Linux machine. By comparison, R's 'data.table' package (using its fread function) can load the CSV equivalent in seconds.

I have played around with the config, to no effect, and fear I am misunderstanding something more fundamental.

All help appreciated guys, thanks in advance

Dear MOR,

Thanks for bringing this to our attention. Which versions of the R package and of TileDB do you use?
(You can check via packageVersion("tiledb") and tiledb::tiledb_version(), respectively.)

Could you possibly provide a mock-up of your data via, say, a generating function producing a similar schema (possibly smaller, of course) exhibiting similar performance, so that we can take a more detailed look?

Thanks, Dirk

Thanks for the quick return Dirk - Let me get on that and come back - Thanks - M

Hi there,

Some more questions:

  1. Is the array stored locally or on S3?
  2. What are the values of vfs.num_threads and sm.num_reader_threads in the config you pass to the context? (They should be set equal to your number of cores; they currently default to 1.)
  3. How many fragments are there in the array (i.e., number of __* subfolders inside the array folder)?

I am about to merge some very important optimizations to dev (pertaining to both read parallelization and handling of numerous fragments). Therefore, in addition to potential tuning you can do with the config parameters above, please stay tuned to try out the upcoming improvements.

Finally got some stuff together that hopefully helps guys - thanks for your patience on this

I have placed some files in a public ftp share folder that I can reference here

The folder can be found at the url https://morgannwg.exavault.com/share/view/1xpf1-8fyeyjhm

array.csv - a toy file that mirrors the size and complexity of the files I am studying. 6.2 GB in size, no header, 7 columns, circa 190m rows.

array.R - A script file that reads the file in (from local SSD disk). Requires data.table and tiledb packages. tiledb version is 1.8.0. data.table reads the csv file into R in about 2 seconds (40 core linux machine), writes the tiledb array in about 2 minutes, and reads the whole tiledb array in about 11 minutes.

array - this is the tiledb array - one fragment

array.config - a text file showing the config on the machine in use when running the results

To state the issue again: I just wonder why the array takes 11 minutes to load, and I am sure I am making a basic error. I have played about with the tiling with blends of row- and column-major, but it does not change the results.

Thanks again in advance guys


Thanks a lot! We’ll investigate over the next couple of days (apologies for the long turnaround).


Hi M,

Thanks for cooking up a complete and reproducible example giving us something to chew over. We will most likely get back to you with a more substantial response in a few days but as a brief and more immediate response:

  • The TileDB C++ ‘dev’ branch underwent a few changes of late, and you may cut your read time substantially by updating. On my (much smaller) machine I see approximately equal read and write times of between 3 1/2 and 4 minutes with your code (plus minor stylistic edits)
  • A fairly large performance gain can be had by switching to dense arrays. What we have here is really just one attribute (in column 7) and six conditioning variables.
  • You can possibly tune partial-read performance by playing around with the tile extent when creating the array; maybe try an extent of 100K, 1M, or 10M
  • Compression will likely be beneficial for each attribute as well
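To illustrate the last two points, here is a sketch of a dimension with an explicit tile extent and a compressed attribute. The dimension bounds are taken from the example schema, but the 1M extent and the ZSTD choice are illustrative, and the filter-list calls assume a reasonably recent tiledb R release:

```r
library(tiledb)

## Illustrative: dim6 with an explicit 1M tile extent (the 'tile' argument);
## try 100K / 1M / 10M and compare partial-read timings.
dim6 <- tiledb_dim("dim6", c(1L, 219988984L), tile = 1000000L, type = "INT32")

## Illustrative: compress the attribute with ZSTD via a filter list.
att <- tiledb_attr("att", type = "INT32",
                   filter_list = tiledb_filter_list(list(tiledb_filter("ZSTD"))))
```

Smaller extents reduce the amount of redundant data fetched per partial slice; larger ones compress better, so the sweet spot depends on your typical range sizes.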

Example timings from my six-core machine:

   user  system elapsed       # fread
 14.666   1.838   2.988 
    user   system  elapsed 
1841.328   15.930  216.568   # tiledb write sparse
    user   system  elapsed 
2287.763   20.665  241.090   # tiledb read sparse
   user  system elapsed 
 18.722   3.080   4.296      # tiledb write dense
   user  system elapsed 
  5.156  10.290  10.267      # tiledb read dense

I am including my script below. Please let us know if you have any questions, and we will try to take a closer look at the sparse matrix performance for this multi-dimensional case.

Regards, Dirk

library(data.table)
library(tiledb)

## tiledb::tiledb_version() reports: 1     8     0

removeIfFound <- function(arrayname) {
    if (dir.exists(arrayname))
        unlink(arrayname, recursive=TRUE)
}
readCsvAndWriteTileDB <- function(datafile, arrayname, verbose=FALSE) {
    print(system.time(x <- fread(datafile, header = FALSE, data.table = TRUE)))
    ## "data.table reads file in about 2 seconds"
    ## about 3 secs for me

    ## check config
    if (verbose) {
        ctx <- tiledb_ctx()
        cfgptr <- tiledb:::libtiledb_ctx_config(ctx@ptr)
        print(tiledb:::libtiledb_config_vector(cfgptr))
    }

    dim1 <- tiledb_dim("dim1", c(1L, 57602L), type = "INT32")
    dim2 <- tiledb_dim("dim2", c(1L, 94991975L),  type = "INT32")
    dim3 <- tiledb_dim("dim3", c(0L, 1L),  type = "INT32")
    dim4 <- tiledb_dim("dim4", c(0L, 1L),  type = "INT32")
    dim5 <- tiledb_dim("dim5", c(1L, 1000000L),  type = "INT32")
    dim6 <- tiledb_dim("dim6", c(1L, 219988984L), type = "INT32")
    dom <- tiledb_domain(dims = c(dim1, dim2, dim3, dim4, dim5, dim6))
    schema <- tiledb_array_schema(dom, attrs = c(tiledb_attr("att", type = "INT32")), sparse = TRUE)
    tiledb_array_create(arrayname, schema)
    A <- tiledb_sparse(uri = arrayname)

    print(system.time(A[x$V1, x$V2, x$V3, x$V4, x$V5, x$V6] <- as.integer(x$V7)))
    ## "about 2 minutes to write the array"
    ## about 3 3/4 minutes for me
}
readBack <- function(arrayname) {
    B <- tiledb_sparse(uri = arrayname)
    print(system.time(y <- B[1:57602, 1:94991975, 0:1, 0:1, 1:1000000, 1:219988984]))
    ## "about 11 minutes to read the whole array in"
    ## and just under four minutes for me
}

readBackDense <- function(arrayname) {
    B <- tiledb_dense(uri = arrayname)
    print(system.time(y <- B[]))
    ## about 10 seconds for me (see timings above)
}

readCsvAndWriteTileDB_Dense <- function(datafile, arrayname, verbose=FALSE) {
    print(system.time(x <- fread(datafile, header = FALSE, data.table = TRUE)))
    ## "data.table reads file in about 2 seconds"
    ## about 4.6 secs for me

    print(system.time(fromDataFrame(x, arrayname)))
}

datafile <- "array.csv"
#datafile <- "arraySample.csv"
arrayname <- "arrayCheck"

cfg <- tiledb_config()
cfg["sm.num_writer_threads"] <- 6
cfg["sm.num_reader_threads"] <- 6
cfg["vfs.num_threads"] <- 6
ctx <- tiledb_ctx(cfg, cached=FALSE)

removeIfFound(arrayname)            # clear any existing array before writing
readCsvAndWriteTileDB(datafile, arrayname)
readBack(arrayname)

removeIfFound(arrayname)
readCsvAndWriteTileDB_Dense(datafile, arrayname)
readBackDense(arrayname)


A general comment on modeling a dataframe as a dense or sparse array.

If you wish to slice just based on rows, potentially sub-selecting on attributes, then you should model your array as a 1D dense, where the dimension stands for the implicit row ids. The fromDataFrame function above does that for you, but currently sets the tile extent to the entire domain (we might change that to a smaller default). You may want to set the tile extent to something like 10K or 100K. This is small enough to prevent bringing too much redundant data when you partially slice, but not too small if you want to use effective compression.
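A minimal sketch of that 1D dense layout, assuming the row count from the example (about 190m rows) and a 100K tile extent, both illustrative:

```r
library(tiledb)

n_rows <- 190000000L   # illustrative row count
rows   <- tiledb_dim("rows", c(1L, n_rows), tile = 100000L, type = "INT32")
dom    <- tiledb_domain(dims = c(rows))
## One attribute per data-frame column; shown here for a single INT32 column.
schema <- tiledb_array_schema(dom,
                              attrs = c(tiledb_attr("att", type = "INT32")),
                              sparse = FALSE)
```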

Now, if you wish to slice based on ranges on a subset of your columns, then you will have much better partial-slicing performance if you model your array as sparse. You are free to choose any columns to become your dimensions, but I suggest you use the 2-3 most selective ones that also receive the most frequent range queries in your use case. In your example you used six, and TileDB is probably struggling in a code path we have not yet parallelized. We'll look into that shortly so that you get the performance you wish even in that case.
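As a sketch of that sparse modeling, picking (hypothetically) dim2 and dim6 as the two most selective, most frequently range-queried columns and demoting the remaining columns to attributes:

```r
library(tiledb)

## Hypothetical choice of dimensions; domains taken from the example schema.
d2  <- tiledb_dim("dim2", c(1L, 94991975L),  type = "INT32")
d6  <- tiledb_dim("dim6", c(1L, 219988984L), type = "INT32")
dom <- tiledb_domain(dims = c(d2, d6))
## The remaining columns become attributes.
attrs <- c(tiledb_attr("dim1", type = "INT32"),
           tiledb_attr("dim3", type = "INT32"),
           tiledb_attr("dim4", type = "INT32"),
           tiledb_attr("dim5", type = "INT32"),
           tiledb_attr("att",  type = "INT32"))
schema <- tiledb_array_schema(dom, attrs = attrs, sparse = TRUE)
```

Range queries on dim2 or dim6 then hit only the relevant space tiles, while the other columns come along as attribute data.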

Thanks again for your message. We’ll get back to you with more updates soon.


appreciate the responses on this guys - many thanks - will work through

great product