R: Optimising Read Performance - Range within a dimension

I have a sparse array written on disk - 6 dimensions (INT32) with a single attribute. Ranges for each dimension as follows;

1:60k, 1:100m, 1:2, 1:2, 1:1m, 1:10k

Queries work well in most cases: a single value for a dimension, say, returns all results fast, but ranges are much slower. The array is about 6 GB uncompressed, and loading the whole array (a query in the limit, for example) takes a long time: over 10 minutes on a 160 GB RAM, 40-core Linux machine. By comparison, R's 'data.table' package (using its fread function) can load the CSV equivalent in seconds.

I have played around with the config, to no effect, and fear I am misunderstanding something more fundamental.

All help appreciated guys, thanks in advance

Dear MOR,

Thanks for bringing this to our attention. Which versions of the R package and of TileDB do you use?
(You can check via packageVersion("tiledb") and tiledb::tiledb_version(), respectively.)

Could you possibly provide a mock-up of your data via, say, a generating function producing a similar schema (possibly smaller, of course) exhibiting similar performance, so that we can take a more detailed look?

Thanks, Dirk

Thanks for the quick return Dirk - Let me get on that and come back - Thanks - M

Hi there,

Some more questions:

  1. Is the array stored locally or on S3?
  2. What are the values of vfs.num_threads and sm.num_reader_threads in the config you pass to the context? (They should be set equal to your number of cores; they currently default to 1.)
  3. How many fragments are there in the array (i.e., number of __* subfolders inside the array folder)?

I am about to merge some very important optimizations to dev (pertaining to both read parallelization and handling of numerous fragments). Therefore, in addition to potential tuning you can do with the config parameters above, please stay tuned to try out the upcoming improvements.

Finally got some stuff together that hopefully helps guys - thanks for your patience on this

I have placed some files in a public ftp share folder that I can reference here

The folder can be found at the url https://morgannwg.exavault.com/share/view/1xpf1-8fyeyjhm

array.csv - a toy file that mirrors the size and complexity of the files I am studying. 6.2 GB in size, no header, 7 columns, circa 190m rows.

array.R - A script file that reads the file in (from local SSD disk). Requires data.table and tiledb packages. tiledb version is 1.8.0. data.table reads the csv file into R in about 2 seconds (40 core linux machine), writes the tiledb array in about 2 minutes, and reads the whole tiledb array in about 11 minutes.

array - this is the tiledb array - one fragment

array.config - a text file showing the config on the machine in use when running the results

To state the issue again: I just wonder why the array takes 11 minutes to load, and I am sure I am making a basic error. I have played about with the tiling with blends of row- and column-major, but it does not change the results.

Thanks again in advance guys


Thanks a lot! We’ll investigate over the next couple of days (apologies for the long turnaround).


Hi M,

Thanks for cooking up a complete and reproducible example giving us something to chew over. We will most likely get back to you with a more substantial response in a few days but as a brief and more immediate response:

  • The TileDB C++ ‘dev’ branch underwent a few changes of late, and you may cut your read time substantially by updating. On my (much smaller) machine I see approximately equal read and write times of between 3 1/2 and 4 minutes with your code (plus minor stylistic edits)
  • A fairly large performance gain can be had by switching to dense arrays. What we have here is really just one attribute (in column 7) and six conditioning variables.
  • You can possibly tune partial-read performance by playing around with the tile extent when creating the array; maybe try an extent of 100K, 1M, or 10M
  • Compression will likely be beneficial for each attribute as well
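To illustrate the last two points, here is a sketch of a dimension with an explicit tile extent and a compressed attribute. The dimension bounds are taken from the example schema, but the 1M extent and the ZSTD choice are illustrative, and the filter-list calls assume a reasonably recent tiledb R release:

```r
library(tiledb)

## Illustrative: dim6 with an explicit 1M tile extent (the 'tile' argument);
## try 100K / 1M / 10M and compare partial-read timings.
dim6 <- tiledb_dim("dim6", c(1L, 219988984L), tile = 1000000L, type = "INT32")

## Illustrative: compress the attribute with ZSTD via a filter list.
att <- tiledb_attr("att", type = "INT32",
                   filter_list = tiledb_filter_list(list(tiledb_filter("ZSTD"))))
```

Smaller extents reduce the amount of redundant data fetched per partial slice; larger ones compress better, so the sweet spot depends on your typical range sizes.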

Example timings from my six-core machine:

   user  system elapsed       # fread
 14.666   1.838   2.988 
    user   system  elapsed 
1841.328   15.930  216.568   # tiledb write sparse
    user   system  elapsed 
2287.763   20.665  241.090   # tiledb read sparse
   user  system elapsed 
 18.722   3.080   4.296      # tiledb write dense
   user  system elapsed 
  5.156  10.290  10.267      # tiledb read dense

I am including my script below. Please let us know if you have any questions, and we will try to take a closer look at the sparse matrix performance for this multi-dimensional case.

Regards, Dirk

library(data.table)
library(tiledb)

## tiledb::tiledb_version() reports: 1     8     0

removeIfFound <- function(arrayname) {
    if (dir.exists(arrayname))
        unlink(arrayname, recursive=TRUE)
}
readCsvAndWriteTileDB <- function(datafile, arrayname, verbose=FALSE) {
    print(system.time(x <- fread(datafile, header = FALSE, data.table = TRUE)))
    ## "data.table reads file in about 2 seconds"
    ## about 3 secs for me

    ## check config
    if (verbose) {
        ctx <- tiledb_ctx()
        cfgptr <- tiledb:::libtiledb_ctx_config(ctx@ptr)
        print(tiledb:::libtiledb_config_vector(cfgptr))
    }

    dim1 <- tiledb_dim("dim1", c(1L, 57602L), type = "INT32")
    dim2 <- tiledb_dim("dim2", c(1L, 94991975L),  type = "INT32")
    dim3 <- tiledb_dim("dim3", c(0L, 1L),  type = "INT32")
    dim4 <- tiledb_dim("dim4", c(0L, 1L),  type = "INT32")
    dim5 <- tiledb_dim("dim5", c(1L, 1000000L),  type = "INT32")
    dim6 <- tiledb_dim("dim6", c(1L, 219988984L), type = "INT32")
    dom <- tiledb_domain(dims = c(dim1, dim2, dim3, dim4, dim5, dim6))
    schema <- tiledb_array_schema(dom, attrs = c(tiledb_attr("att", type = "INT32")), sparse = TRUE)
    tiledb_array_create(arrayname, schema)
    A <- tiledb_sparse(uri = arrayname)

    print(system.time(A[x$V1, x$V2, x$V3, x$V4, x$V5, x$V6] <- as.integer(x$V7)))
    ## "about 2 minutes to write the array"
    ## about 3 3/4 minutes for me
}
readBack <- function(arrayname) {
    B <- tiledb_sparse(uri = arrayname)
    print(system.time(y <- B[1:57602, 1:94991975, 0:1, 0:1, 1:1000000, 1:219988984]))
    ## "about 11 minutes to read the whole array in"
    ## and just under four minutes for me
}

readBackDense <- function(arrayname) {
    B <- tiledb_dense(uri = arrayname)
    print(system.time(y <- B[]))
    ## about 10 seconds for me (see timings above)
}

readCsvAndWriteTileDB_Dense <- function(datafile, arrayname, verbose=FALSE) {
    print(system.time(x <- fread(datafile, header = FALSE, data.table = TRUE)))
    ## "data.table reads file in about 2 seconds"
    ## about 4.6 secs for me

    print(system.time(fromDataFrame(x, arrayname)))
}

datafile <- "array.csv"
#datafile <- "arraySample.csv"
arrayname <- "arrayCheck"

cfg <- tiledb_config()
cfg["sm.num_writer_threads"] <- 6
cfg["sm.num_reader_threads"] <- 6
cfg["vfs.num_threads"] <- 6
ctx <- tiledb_ctx(cfg, cached=FALSE)

removeIfFound(arrayname)            # clear any existing array before writing
readCsvAndWriteTileDB(datafile, arrayname)
readBack(arrayname)

removeIfFound(arrayname)
readCsvAndWriteTileDB_Dense(datafile, arrayname)
readBackDense(arrayname)


A general comment on modeling a dataframe as a dense or sparse array.

If you wish to slice just based on rows, potentially sub-selecting on attributes, then you should model your array as a 1D dense, where the dimension stands for the implicit row ids. The fromDataFrame function above does that for you, but currently sets the tile extent to the entire domain (we might change that to a smaller default). You may want to set the tile extent to something like 10K or 100K. This is small enough to prevent bringing too much redundant data when you partially slice, but not too small if you want to use effective compression.
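A minimal sketch of that 1D dense layout, assuming the row count from the example (about 190m rows) and a 100K tile extent, both illustrative:

```r
library(tiledb)

n_rows <- 190000000L   # illustrative row count
rows   <- tiledb_dim("rows", c(1L, n_rows), tile = 100000L, type = "INT32")
dom    <- tiledb_domain(dims = c(rows))
## One attribute per data-frame column; shown here for a single INT32 column.
schema <- tiledb_array_schema(dom,
                              attrs = c(tiledb_attr("att", type = "INT32")),
                              sparse = FALSE)
```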

Now, if you wish to slice based on ranges on a subset of your columns, then you will have much better partial-slicing performance if you model your array as sparse. You are free to choose any columns to become your dimensions, but I suggest you use the 2-3 most selective ones that also receive the most frequent range queries in your use case. In your example you used six, and TileDB is probably struggling in a code path we have not yet parallelized. We'll look into that shortly so that you get the performance you wish even in that case.
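As a sketch of that sparse modeling, picking (hypothetically) dim2 and dim6 as the two most selective, most frequently range-queried columns and demoting the remaining columns to attributes:

```r
library(tiledb)

## Hypothetical choice of dimensions; domains taken from the example schema.
d2  <- tiledb_dim("dim2", c(1L, 94991975L),  type = "INT32")
d6  <- tiledb_dim("dim6", c(1L, 219988984L), type = "INT32")
dom <- tiledb_domain(dims = c(d2, d6))
## The remaining columns become attributes.
attrs <- c(tiledb_attr("dim1", type = "INT32"),
           tiledb_attr("dim3", type = "INT32"),
           tiledb_attr("dim4", type = "INT32"),
           tiledb_attr("dim5", type = "INT32"),
           tiledb_attr("att",  type = "INT32"))
schema <- tiledb_array_schema(dom, attrs = attrs, sparse = TRUE)
```

Range queries on dim2 or dim6 then hit only the relevant space tiles, while the other columns come along as attribute data.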

Thanks again for your message. We’ll get back to you with more updates soon.


appreciate the responses on this guys - many thanks - will work through

great product