TileDB Data Frame in R

Hi, after reading the documentation at Welcome to the TileDB Docs! - TileDB Docs, I find TileDB a great alternative database. I am attracted to the multi-dimensional array structure, which is well suited to data science workflows, and I hope to use TileDB in the future. I am using the R interface tiledb by Dirk Eddelbuettel. Thank you for the great work.

Below is my reproducible code. I want to write a typical dataset containing the usual database dimensions used for slicing: an ID (i.e. customer ID) and dates (day granularity, without a time component). For dates, I want to use data.table::as.IDate, a date class with integer storage for fast sorting and grouping.

I tried using TileDB for my real database but quickly ran out of memory, hitting the error Error in libtiledb_query_buffer_alloc_ptr(arrptr, type, resrv, nullable) : std::bad_alloc. So below I reproduce the issue with a simplified, simulated dataset and do a quick comparison with data.table's CSV reader and writer. data.table::fread and fwrite had no issue at all with reading and writing and are indeed very fast, but TileDB is much slower and runs out of memory. I then tried as.Date() (class Date) instead of as.IDate(), and the same performance and out-of-memory issues persist. I am not sure which part I did wrong.

Thank you in advance for the help.

library(tiledb)
library(data.table)

# simulate real dataset
createRandomString = function() {
  paste0(sample(letters, 3, replace=T),
         sample(LETTERS, 4, replace=T),
         sample(1:10, 2, replace=T),
         collapse = "")
}
simulateRealData = function(n=10e6, ngroup=2500) {
  uniqueRandomID = paste0(1:ngroup,
    sapply(1:ngroup, function(x) createRandomString()))
  dates = as.Date("1900-01-01") + 1:(n/ngroup)
  df = data.table(date = rep(dates,ngroup),
                  id = rep(uniqueRandomID, each = n/ngroup),
                  val1 = runif(n),
                  val2 = rnorm(n),
                  val3 = rnorm(n))
}
createDB <- function(uri) {
  intmax=.Machine$integer.max
  domain = tiledb_domain(
    dims = c(tiledb_dim("date", c(-intmax, intmax), 10000, "DATETIME_DAY"),
             tiledb_dim("id", NULL,NULL, "ASCII"))
  )
  schema = tiledb_array_schema(
    domain, sparse = TRUE,
    attrs = c(tiledb_attr("val1", "FLOAT64"),
              tiledb_attr("val2", "FLOAT64"),
              tiledb_attr("val3", "FLOAT64"))
  )
  tiledb_array_create(uri, schema)
}

#############################################
# Use data.table::as.IDate() - Smaller data #
#############################################

df = simulateRealData(n=10e6,ngroup=2500)
df[,date:=as.IDate(date)]
object.size(df)

tmp = tempfile()
system.time(data.table::fwrite(df, tmp))
#  user  system elapsed
# 4.312   0.660   1.410
system.time(data.table::fread(tmp))
#  user  system elapsed
# 3.815   0.261   1.072

uri=tempfile()
createDB(uri)
arr1 = tiledb_array(uri, as.data.frame = TRUE, is.sparse = TRUE)
system.time({arr1[] = df})
#  user  system elapsed
# 32.527   2.092  18.195

arr2 = tiledb_array(uri, as.data.frame = TRUE, is.sparse = TRUE)
system.time({df2 = arr2[]})
#  user  system elapsed
# 70.041   5.944  51.262

############################################
# Use data.table::as.IDate() - Bigger data #
############################################

df = simulateRealData(n=20e6,ngroup=2500)  # Insufficient mem
df[,date:=as.IDate(date)]
object.size(df)

tmp = tempfile()
system.time(data.table::fwrite(df, tmp))
#  user  system elapsed
# 9.241   1.279   2.924
system.time(data.table::fread(tmp))
#  user  system elapsed
# 6.464   0.327   2.355

uri=tempfile()
createDB(uri)
arr1 = tiledb_array(uri, as.data.frame = TRUE, is.sparse = TRUE)
system.time({arr1[] = df})
#  user  system elapsed
# 70.429   4.560  31.655

arr2 = tiledb_array(uri, as.data.frame = TRUE, is.sparse = TRUE)
system.time({df2 = arr2[]})
# Error in libtiledb_query_buffer_alloc_ptr(arrptr, type, resrv, nullable) : std::bad_alloc
# (truncated output)
# Timing stopped at: 1.018 2.789 8.004


################################
# Use as.Date() - Smaller data #
################################

df = simulateRealData(n=10e6,ngroup=2500)
object.size(df)

uri=tempfile()
createDB(uri)
arr1 = tiledb_array(uri, as.data.frame = TRUE, is.sparse = TRUE)
system.time({arr1[] = df})
#  user  system elapsed
# 31.890   2.011  18.176

arr2 = tiledb_array(uri, as.data.frame = TRUE, is.sparse = TRUE)
system.time({df2 = arr2[]})
#  user  system elapsed
# 74.928   6.074  53.618

################################
# Use as.Date() - Bigger data  #
################################

df = simulateRealData(n=20e6,ngroup=2500)
object.size(df)

uri=tempfile()
createDB(uri)
arr1 = tiledb_array(uri, as.data.frame = TRUE, is.sparse = TRUE)
system.time({arr1[] = df})
#  user  system elapsed
# 70.07    3.88   35.15

arr2 = tiledb_array(uri, as.data.frame = TRUE, is.sparse = TRUE)
system.time({df2 = arr2[]})
# Error in libtiledb_query_buffer_alloc_ptr(arrptr, type, resrv, nullable) : std::bad_alloc
# (truncated output)
# Timing stopped at: 0.797 2.688 7.833

Hi, and thank you for taking the time to send us a full and reproducible example. One simple suggestion would be to create and write the array in one call to fromDataFrame(). It should generally create a reasonable array schema and fill it for you.

Your second example (as well as the rest of the code) works fine on my normal development machine (with 32 GB of RAM). At your end you could also try submitting your data frame objects in several chunks, as you can append to a TileDB array. Let us know if you would like a worked example of that and we will try to respond.
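
For illustration, a minimal sketch of both ideas, reusing the simulateRealData() and createDB() helpers from your post (the chunk size is arbitrary and only meant to show the append pattern), might look like this:

library(tiledb)
library(data.table)

df <- simulateRealData(n = 10e6, ngroup = 2500)

# (a) create and write the array in a single call
fromDataFrame(df, tempfile())

# (b) or create the schema once and append the data in several chunks;
#     each assignment to a sparse array writes a new fragment, so the
#     loop appends rather than overwrites
uri <- tempfile()
createDB(uri)
arr <- tiledb_array(uri, as.data.frame = TRUE, is.sparse = TRUE)

chunks <- split(seq_len(nrow(df)), ceiling(seq_len(nrow(df)) / 1e6))  # ~1e6 rows per write
for (ix in chunks) {
  arr[] <- df[ix]
}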

The timing of these datasets versus the CSV writer from data.table is not really an apples-to-apples comparison, as the CSV format will not let you slice and index. One of the advantages of storing your data in TileDB is getting to chosen subsets faster, so read speed over the full dataset may not be the metric closest to your actual deployment.


Hi Dirk, sorry for getting back to you so late; I have been busy with my family for the past few weeks.

Thank you for the suggestion of using fromDataFrame(). I spent a few weeks learning about TileDB and the R API, and I am quite excited about it. Now I want to migrate my relational database to TileDB. Sorry about the confusion in my first post: yes, data.table was not a good comparison; it was only meant to show the huge amount of memory consumed by TileDB. For this project I tried moving my data to TileDB but ran out of memory, and I am not sure which part I did wrong. I did peek at the code of fromDataFrame() and learned from it; it is a light wrapper for creating a TileDB schema.

I have a lot of questions, but let's focus on the two issues that are currently stopping me from moving to TileDB:

  1. I prefer to learn and create TileDB schemas from scratch using tiledb_domain and tiledb_array_schema, instead of relying on a wrapper like fromDataFrame, so that I learn more and have more control for more complex data models. This project is a simple one, involving the typical database dimensions: an ID (i.e. customer ID) and dates (day granularity, without a time component). For dates, I want to use data.table::as.IDate, a date class with integer storage for fast sorting and grouping, but for simplicity let's stick to the Date class for now. So: is there anything wrong with my createDB, which uses tiledb_domain and tiledb_array_schema, and how can I improve it?

  2. When coding this up and comparing SQLite with TileDB, SQLite works as expected, but with TileDB I ran out of memory on my laptop (6 GB of RAM). The difference in memory usage between SQLite and TileDB is quite large. So what have I done wrong with my tiledb_domain and tiledb_array_schema, and why does TileDB consume so much memory?

Edited: It would be really helpful, so I can learn from it, if you could provide an example of how you would create the schema using tiledb_domain and tiledb_array_schema, and what syntax you would use to slice the data on the id and date dimensions. Thank you.

Comparison without slicing

Same code as above:

library(tiledb)
library(data.table)

# simulate real dataset
createRandomString = function() {
  paste0(sample(letters, 3, replace=T),
         sample(LETTERS, 4, replace=T),
         sample(1:10, 2, replace=T),
         collapse = "")
}
simulateRealData = function(n=10e6, ngroup=2500) {
  uniqueRandomID = paste0(1:ngroup,
    sapply(1:ngroup, function(x) createRandomString()))
  dates = as.Date("1900-01-01") + 1:(n/ngroup)
  df = data.table(date = rep(dates,ngroup),
                  id = rep(uniqueRandomID, each = n/ngroup),
                  val1 = runif(n),
                  val2 = rnorm(n),
                  val3 = rnorm(n))
}
createDB = function(uri) {
  intmax=.Machine$integer.max
  domain = tiledb_domain(
    dims = c(tiledb_dim("date", c(-intmax, intmax), 10000, "DATETIME_DAY"),
             tiledb_dim("id", NULL,NULL, "ASCII"))
  )
  schema = tiledb_array_schema(
    domain, sparse = TRUE,
    attrs = c(tiledb_attr("val1", "FLOAT64"),
              tiledb_attr("val2", "FLOAT64"),
              tiledb_attr("val3", "FLOAT64"))
  )
  tiledb_array_create(uri, schema)
}

Compare TileDB and RSQLite

First, try SQLite. Start a fresh R session and run the first chunk above.

library(DBI)
con <- dbConnect(RSQLite::SQLite(), "tmp")
set.seed(139475)
df = simulateRealData(n=10e6,ngroup=2500)

memuse::Sys.procmem(gcFirst = TRUE)  # Default configuration
# Size:  761.117 MiB
# Peak:    1.116 GiB
system.time(dbWriteTable(con, "df", df))
  #  user  system elapsed
  # 9.659   1.141  10.800
memuse::Sys.procmem()
# Size:  761.121 MiB
# Peak:    1.116 GiB
system.time(dbReadTable(con, "df"))
  #  user  system elapsed
  # 8.823   0.510   9.333
memuse::Sys.procmem()
# Size:  761.266 MiB
# Peak:    1.509 GiB

Reading from SQLite consumes little memory. Next, try fromDataFrame(). Start a fresh R session and run the first chunk above.

set.seed(139475)
df = simulateRealData(n=10e6,ngroup=2500)
uri=tempfile()

# Compare the RAM used by the current R process; garbage collection is run
# before measuring the process memory usage (gcFirst = TRUE is the default).

memuse::Sys.procmem()
# Size:  759.688 MiB
# Peak:    1.116 GiB
system.time(fromDataFrame(df,uri))
#    user  system elapsed
#  20.220   3.478  16.873
memuse::Sys.procmem()
# Size:  1.024 GiB
# Peak:  2.610 GiB
arr = tiledb_array(uri, is.sparse = FALSE, as.data.frame = TRUE)
system.time(arr[])
# Error in libtiledb_query_get_buffer_ptr(buf, asint64) : std::bad_alloc
# Timing stopped at: 5.159 16.19 24.13
memuse::Sys.procmem()
# Size:  4.826 GiB
# Peak:  5.504 GiB

Now use createDB(). Start a fresh R session and run the first chunk above.

set.seed(139475)
df = simulateRealData(n=10e6,ngroup=2500)
uri=tempfile()
createDB(uri)
# arr1 for writing, arr2 for reading
arr1 = tiledb_array(uri, as.data.frame = TRUE, is.sparse = TRUE)
arr2 = tiledb_array(uri, as.data.frame = TRUE, is.sparse = TRUE)

memuse::Sys.procmem()
# Size:  762.824 MiB
# Peak:    1.117 GiB
system.time({arr1[] = df})
#    user  system elapsed
#  33.080   2.859  18.659
memuse::Sys.procmem()
# Size:  1.137 GiB
# Peak:  2.632 GiB
system.time({df2 = arr2[]})
# Error: cannot allocate vector of size 610.4 Mb
# Timing stopped at: 69.98 24.73 62.26
memuse::Sys.procmem()
# Size:  5.375 GiB
# Peak:  5.497 GiB

Comparison with slicing

Start a fresh R session and run the first chunk above.

library(DBI)
library(dplyr)
library(dbplyr)
con = dbConnect(RSQLite::SQLite(), "tmp")
set.seed(139475)
df = simulateRealData(n=10e6,ngroup=2500)
ids = sample(unique(df$id), 1e3)

memuse::Sys.procmem()
# Size:  776.223 MiB
# Peak:    1.293 GiB
system.time(dbWriteTable(con, "df", df))
#    user  system elapsed
#  10.051   1.338  11.390
memuse::Sys.procmem()
# Size:  776.641 MiB
# Peak:    1.293 GiB
db = dplyr::tbl(con, "df")
system.time({dplyr::filter(db, id %in% ids) %>% collect()})
  #  user  system elapsed
  # 6.547   0.380   6.927
memuse::Sys.procmem()
# Size:  776.262 MiB
# Peak:    1.293 GiB

SQLite consumes little memory.

Use fromDataFrame(). Start a fresh R session and run the first chunk above. This gives the error domain or tile_extent does not match dimension type.

set.seed(139475)
df = simulateRealData(n=10e6,ngroup=2500)
uri=tempfile()
ids = sample(unique(df$id), 1e3)

memuse::Sys.procmem()
system.time(fromDataFrame(df,uri,col_index = c("date", "id")))
# Error in libtiledb_dim(ctx@ptr, name, type, domain, tile) :
#   domain or tile_extent does not match dimension type
# Timing stopped at: 0.193 0.098 0.29

Use createDB(). Start a fresh R session and run the first chunk above.

set.seed(139475)
df = simulateRealData(n=10e6,ngroup=2500)
uri=tempfile()
ids = sample(unique(df$id), 1e3)
# arr1 for writing, arr2 for reading
createDB(uri)
arr1 = tiledb_array(uri, as.data.frame = TRUE, is.sparse = TRUE)
arr2 = tiledb_array(uri, as.data.frame = TRUE, is.sparse = TRUE)

memuse::Sys.procmem()
system.time({arr1[] = df})
memuse::Sys.procmem()
# Slice by id dimension
system.time({df2 = arr2[,ids]})
# Error: cannot allocate vector of size 520.9 Mb
# Timing stopped at: 35.58 10.69 50.69

Thank you, @Mike, for your very detailed comment and extended code snippets; we really appreciate you digging into this. We can replicate your results.

Concerning the peak memory use: we have identified the issue, had already been working actively on memory consumption, and should have an improved implementation in the near future. In the meantime, you can (with a smaller memory budget) retrieve smaller chunks.

We would be more than happy to continue discussing this, maybe over on the Slack community workspace.


Thank you, @Dirk_Eddelbuettel, for your overall contribution to the R ecosystem, including Rcpp, tiledb and so many other R packages. It's great to know that the memory issue is being worked on, and I am looking forward to the improvement. Is there a PR where I can follow along with the issue?

Yes, in the meantime my workaround is to retrieve smaller chunks: I fetch batches of 100 IDs each and merge the data back together in R, roughly as sketched below. Thank you for the Slack invitation; I will check it out.
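
A minimal sketch of that batching (illustrative only; it assumes the uri and df objects from the earlier chunks, the array created by createDB(), and uses rbindlist() to merge the pieces):

library(tiledb)
library(data.table)

arr <- tiledb_array(uri, as.data.frame = TRUE, is.sparse = TRUE)
ids_all <- unique(df$id)
batches <- split(ids_all, ceiling(seq_along(ids_all) / 100))  # 100 IDs per batch

# slice one batch at a time on the id dimension, then merge back in R
df2 <- rbindlist(lapply(batches, function(ids) arr[, ids]))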

Hi @Dirk_Eddelbuettel, do you have any update on this? If not, how can I follow this issue (e.g. via a PR somewhere in the repo)? Thank you.

Hi @Mike

We have made multiple updates and releases to both the underlying C++ library and the R package, so it is definitely worth trying the current CRAN release 0.10.2 of the R package. Also, as I first wrote to you in July, I would probably write the example differently.

We have numerous customers writing large datasets, many times larger than memory, incrementally (and even in parallel). There is no fundamental limit to array sizes. The example we showed at useR! 2021 has hundreds of millions of rows, and we happily sliced small portions of it. So my recommendation would be to give the updated package another try, and maybe also rethink how you set the array up. For truly large sizes (and incremental writes) you probably want a sparse array.

Hope this helps, happy to discuss further on the Slack forum as suggested.

Best, Dirk