Hi, after reading the documentation at Welcome to the TileDB Docs! - TileDB Docs, I find TileDB a great alternative database. I am attracted to its multi-dimensional structure, which seems better suited to data science workflows. I hope to use TileDB in the future. I am using the R interface tiledb by Dirk Eddelbuettel. Thank you for the great work.
Below is my reproducible code. I want to write a typical dataset whose dimensions are the ones a database is usually sliced on: an ID (i.e. customer ID) and a date (day, without a time component). For dates, I want to use data.table::as.IDate, one of data.table's date and time classes with integer storage for fast sorting and grouping.
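To make the storage difference concrete, here is a minimal sketch comparing a base Date with an IDate. The variable names d_base and d_idate are only for illustration and are not used in the reproduction below.

```r
library(data.table)

d_base  <- as.Date("2020-01-15")  # base R Date, stored as a double
d_idate <- as.IDate(d_base)       # data.table IDate, stored as an integer

typeof(d_base)    # "double"
typeof(d_idate)   # "integer"

# Both represent the same day (days since 1970-01-01)
as.integer(d_base) == as.integer(d_idate)
```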
I tried using TileDB on my real database, but I quickly ran out of memory and hit the error: Error in libtiledb_query_buffer_alloc_ptr(arrptr, type, resrv, nullable) : std::bad_alloc. So below I reproduce the problem with a simplified, simulated dataset and do a quick comparison against data.table::fwrite and data.table::fread on CSV. data.table had no issue at all with writing or reading; indeed it is very fast (under 1.5 s each for 10 million rows). TileDB is much slower (about 18 s to write and 51 s to read the same 10 million rows) and runs out of memory when reading 20 million rows. I then switched to as.Date() (class Date), and the same performance and out-of-memory issues persist. I am not sure which part I got wrong.
Thank you in advance for the help.
library(tiledb)
library(data.table)
# simulate real dataset
createRandomString = function() {
  paste0(sample(letters, 3, replace = TRUE),
         sample(LETTERS, 4, replace = TRUE),
         sample(1:10, 2, replace = TRUE),
         collapse = "")
}
simulateRealData = function(n = 10e6, ngroup = 2500) {
  # one random ID per group, each observed on n/ngroup consecutive days
  uniqueRandomID = paste0(1:ngroup,
                          sapply(1:ngroup, function(x) createRandomString()))
  dates = as.Date("1900-01-01") + 1:(n / ngroup)
  df = data.table(date = rep(dates, ngroup),
                  id = rep(uniqueRandomID, each = n / ngroup),
                  val1 = runif(n),
                  val2 = rnorm(n),
                  val3 = rnorm(n))
  df
}
createDB <- function(uri) {
  intmax = .Machine$integer.max
  # sparse array with two dimensions (date, id) and three numeric attributes
  domain = tiledb_domain(
    dims = c(tiledb_dim("date", c(-intmax, intmax), 10000, "DATETIME_DAY"),
             tiledb_dim("id", NULL, NULL, "ASCII"))
  )
  schema = tiledb_array_schema(
    domain, sparse = TRUE,
    attrs = c(tiledb_attr("val1", "FLOAT64"),
              tiledb_attr("val2", "FLOAT64"),
              tiledb_attr("val3", "FLOAT64"))
  )
  tiledb_array_create(uri, schema)
}
#############################################
# Use data.table::as.IDate() - Smaller data #
#############################################
df = simulateRealData(n=10e6,ngroup=2500)
df[,date:=as.IDate(date)]
object.size(df)
tmp = tempfile()
system.time(data.table::fwrite(df, tmp))
# user system elapsed
# 4.312 0.660 1.410
system.time(data.table::fread(tmp))
# user system elapsed
# 3.815 0.261 1.072
uri=tempfile()
createDB(uri)
arr1 = tiledb_array(uri, as.data.frame = TRUE, is.sparse = TRUE)
system.time({arr1[] = df})
# user system elapsed
# 32.527 2.092 18.195
arr2 = tiledb_array(uri, as.data.frame = TRUE, is.sparse = TRUE)
system.time({df2 = arr2[]})
# user system elapsed
# 70.041 5.944 51.262
############################################
# Use data.table::as.IDate() - Bigger data #
############################################
df = simulateRealData(n=20e6,ngroup=2500) # Insufficient mem
df[,date:=as.IDate(date)]
object.size(df)
tmp = tempfile()
system.time(data.table::fwrite(df, tmp))
# user system elapsed
# 9.241 1.279 2.924
system.time(data.table::fread(tmp))
# user system elapsed
# 6.464 0.327 2.355
uri=tempfile()
createDB(uri)
arr1 = tiledb_array(uri, as.data.frame = TRUE, is.sparse = TRUE)
system.time({arr1[] = df})
# user system elapsed
# 70.429 4.560 31.655
arr2 = tiledb_array(uri, as.data.frame = TRUE, is.sparse = TRUE)
system.time({df2 = arr2[]})
# Error in libtiledb_query_buffer_alloc_ptr(arrptr, type, resrv, nullable) : std::bad_alloc
# [truncated...]
# Timing stopped at: 1.018 2.789 8.004
################################
# Use as.Date() - Smaller data #
################################
df = simulateRealData(n=10e6,ngroup=2500)
object.size(df)
uri=tempfile()
createDB(uri)
arr1 = tiledb_array(uri, as.data.frame = TRUE, is.sparse = TRUE)
system.time({arr1[] = df})
# user system elapsed
# 31.890 2.011 18.176
arr2 = tiledb_array(uri, as.data.frame = TRUE, is.sparse = TRUE)
system.time({df2 = arr2[]})
# user system elapsed
# 74.928 6.074 53.618
################################
# Use as.Date() - Bigger data #
################################
df = simulateRealData(n=20e6,ngroup=2500)
object.size(df)
uri=tempfile()
createDB(uri)
arr1 = tiledb_array(uri, as.data.frame = TRUE, is.sparse = TRUE)
system.time({arr1[] = df})
# user system elapsed
# 70.07 3.88 35.15
arr2 = tiledb_array(uri, as.data.frame = TRUE, is.sparse = TRUE)
system.time({df2 = arr2[]})
# Error in libtiledb_query_buffer_alloc_ptr(arrptr, type, resrv, nullable) : std::bad_alloc
# (truncated output)
# Timing stopped at: 0.797 2.688 7.833