Disable time traveling feature

Sometimes I do not want disk usage to grow as I modify the array or array metadata frequently, since TileDB creates a copy for each write operation. Is there a way to disable this feature so that each write simply overwrites the original file without incurring extra copies?

I am aware of the consolidate and vacuum mechanisms, but still, it would be helpful to avoid having to do that manually.

Also, vacuum doesn't seem to work for me; it freezes as soon as it hits the vacuum call, e.g.:

tiledb::Config config;
config["sm.consolidation.mode"] = "array_meta";
config["sm.vacuum.mode"] = "array_meta";
tiledb::Array::consolidate_metadata(*ctx_, uri_, &config);
tiledb::Array::vacuum(*ctx_, uri_, &config);

Overwriting the metadata file goes against file immutability in TileDB (which extends beyond time traveling), so I'd be a bit skeptical about changing that behavior.

We will certainly investigate the vacuuming issue shortly, thanks for pointing this out! Also it may make sense to add some automation around write+consolidate+vacuum in a single helper function.

I also believe this is something that should be added. I'm aware that immutability and time traveling are core principles of TileDB, but for those who perform a lot of write operations, the resulting high degree of fragmentation and the need to constantly defragment are a real burden. I think this is one of the hurdles to becoming 'a universal storage engine'.

I’m also still having performance issues with consolidation+vacuuming. It takes approximately 2 days to perform these operations on all of our files. Sometimes a single 5-10 GB array can take up to 90 minutes to consolidate+vacuum.

@Mtrl_Scientist, noted on the writes, but we should immediately look at the performance issues with consolidation+vacuuming. This sounds like a (perf) bug on our end. cc-ing @joe_maley and we will follow up shortly. In the meantime, could you please share some more information, e.g., the array schemas, the number of fragments, and the size of each fragment. Thanks!

Thanks @stavros!

Number of arrays
2116

Total number of fragments
Too many to count with walkdir, but it's between 100 and 10,000 fragments per array; most fragments are just a few KB in size, but some can be several GB if previously consolidated.
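For a rough count on a local filesystem, a heuristic like the following could work (the `__`-prefixed directory layout is an assumption that varies across TileDB versions; newer versions keep fragments under a `__fragments` subdirectory):

```python
import os

def count_fragments(array_uri):
    """Heuristic fragment count for a local TileDB array directory.
    Assumes fragments are subdirectories named with a leading '__',
    either directly in the array dir or under '__fragments'."""
    frag_root = os.path.join(array_uri, "__fragments")
    root = frag_root if os.path.isdir(frag_root) else array_uri
    # Known non-fragment directories to skip (layout assumption).
    skip = {"__fragments", "__meta", "__schema", "__commits", "__labels"}
    return sum(
        1
        for name in os.listdir(root)
        if name.startswith("__")
        and name not in skip
        and os.path.isdir(os.path.join(root, name))
    )
```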

Schemas

Schema 1

TileDB Config

config = tiledb.Config()
config["sm.num_reader_threads"] = "8"
config["sm.num_writer_threads"] = "8"

# Context
ctx = tiledb.Ctx(config)

# Domain
dom = tiledb.Domain(
    # tiles = 1 cent increment
    tiledb.Dim(ctx=ctx,name="agg_ID", domain=(0, 9e18), tile=1e6, dtype=np.int64),
    # tiles = 1 day increment
#     tiledb.Dim(ctx=ctx,name="date", domain=(0, 9e18), tile=86.4e6, dtype=np.int64))
    tiledb.Dim(name="date", domain=(np.datetime64('1980-01-01'), np.datetime64("2100-01-01")),
               tile=np.timedelta64(1, 'D'), dtype="datetime64[ns]"))

# List of available filters
bit_shuffle = tiledb.BitShuffleFilter()
byte_shuffle = tiledb.ByteShuffleFilter()
RLE = tiledb.RleFilter()
double_delta_encoding = tiledb.DoubleDeltaFilter()
positive_delta_encoding = tiledb.PositiveDeltaFilter()
bit_width_reduction = tiledb.BitWidthReductionFilter(window=int(1e3))
gzip = tiledb.GzipFilter(level=9)
lz4 = tiledb.LZ4Filter(level=9)
bzip2 = tiledb.Bzip2Filter(level=9)
zstd = tiledb.ZstdFilter(level=9)

# Attributes
attrs= [
    tiledb.Attr(name=i,dtype=np.float64,ctx=ctx,
                filters=tiledb.FilterList([zstd])) for i in df.iloc[:,2:].columns
]
# Schema
schema = tiledb.ArraySchema(domain=dom, sparse=True,
                            attrs=attrs,
                            cell_order="row-major",tile_order="row-major",
                            capacity=int(10e6),ctx=ctx)

Example: (screenshot omitted)

Schema 2
# TileDB Config
config = tiledb.Config()
config["sm.num_reader_threads"] = "8"
config["sm.num_writer_threads"] = "8"

# Context
ctx = tiledb.Ctx(config)

# Domain
dom = tiledb.Domain(
    # tiles = 1 cent increment
    tiledb.Dim(ctx=ctx,name="agg_ID", domain=(0, 9e18), tile=1e6, dtype=np.int64),
    # tiles = 1 day increment
#     tiledb.Dim(ctx=ctx,name="date", domain=(0, 9e18), tile=86.4e6, dtype=np.int64))
    tiledb.Dim(name="date", domain=(np.datetime64('1980-01-01'), np.datetime64("2100-01-01")),
               tile=np.timedelta64(1, 'D'), dtype="datetime64[ns]"))

# List of available filters
bit_shuffle = tiledb.BitShuffleFilter()
byte_shuffle = tiledb.ByteShuffleFilter()
RLE = tiledb.RleFilter()
double_delta_encoding = tiledb.DoubleDeltaFilter()
positive_delta_encoding = tiledb.PositiveDeltaFilter()
bit_width_reduction = tiledb.BitWidthReductionFilter(window=int(1e3))
gzip = tiledb.GzipFilter(level=9)
lz4 = tiledb.LZ4Filter(level=9)
bzip2 = tiledb.Bzip2Filter(level=9)
zstd = tiledb.ZstdFilter(level=9)

# Attributes
attrs = [
    tiledb.Attr(name="price",dtype=np.float64,ctx=ctx,
                filters=tiledb.FilterList([zstd])),
    tiledb.Attr(name="volume",dtype=np.float64,ctx=ctx,
               filters=tiledb.FilterList([zstd])),
    tiledb.Attr(name="first_trade_ID",dtype=np.int64,ctx=ctx,
               filters=tiledb.FilterList([zstd])),
    tiledb.Attr(name="last_trade_ID",dtype=np.int64,ctx=ctx,
       filters=tiledb.FilterList([zstd])),
    tiledb.Attr(name="is_buyer_maker",dtype=np.int8,ctx=ctx,
               filters=tiledb.FilterList([bit_shuffle,zstd])),
    tiledb.Attr(name="is_best_price_match",dtype=np.int8,ctx=ctx,
               filters=tiledb.FilterList([bit_shuffle,zstd]))
]
# Schema
schema = tiledb.ArraySchema(domain=dom, sparse=True,
                            attrs=attrs,
                            cell_order="row-major",tile_order="row-major",
                            capacity=int(10e6),ctx=ctx)

Example: (screenshot omitted)

Schema 3
price_tile = 1e-3 if ticker.split("_")[1]=="BTC" else 1e3

# Domain
dom = tiledb.Domain(
    # tiles = 1 cent increment
    tiledb.Dim(name="price", domain=(0, 1e9), tile=price_tile, dtype=np.float64),
    # tiles = 1 day increment
    tiledb.Dim(name="date", domain=(np.datetime64('1980-01-01'), np.datetime64("2100-01-01")),
               tile=np.timedelta64(1, 'D'), dtype="datetime64[ns]"))

# List of available filters
bit_shuffle = tiledb.BitShuffleFilter()
byte_shuffle = tiledb.ByteShuffleFilter()
RLE = tiledb.RleFilter()
double_delta_encoding = tiledb.DoubleDeltaFilter()
positive_delta_encoding = tiledb.PositiveDeltaFilter()
bit_width_reduction = tiledb.BitWidthReductionFilter(window=int(1e3))
gzip = tiledb.GzipFilter(level=9)
lz4 = tiledb.LZ4Filter(level=9)
bzip2 = tiledb.Bzip2Filter(level=9)
zstd = tiledb.ZstdFilter(level=5)

# Attributes
attrs = [
    tiledb.Attr(name="quantity",dtype=np.float64,
               filters=tiledb.FilterList([byte_shuffle,zstd])
               ),
]
# Schema
# Capacity needs to be low so as not to overwhelm the TileDB buffer
schema = tiledb.ArraySchema(domain=dom, sparse=True,
                            attrs=attrs,
                            cell_order="col-major",tile_order="col-major",
                            capacity=int(5e3))

Example: long-format (as stored) and wide-format (transformed); screenshots omitted.
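The long-to-wide transform mentioned above can be sketched with a pandas pivot (column names and values are invented for illustration, since the screenshots are not reproduced here):

```python
import pandas as pd

# Long format, as stored in the array: one row per (price, date) cell.
long_df = pd.DataFrame({
    "price": [100.0, 100.0, 101.0],
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-01"]),
    "quantity": [1.5, 2.0, 0.5],
})

# Wide format: one row per price level, one column per date.
wide_df = long_df.pivot(index="price", columns="date", values="quantity")
```

Missing (price, date) combinations simply become NaN in the wide frame.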

(Old) Notebook with some data to play around with:
https://1drv.ms/u/s!ArP7_EkyioIBwqsYPCCBmMJvJVYv4A?e=iHWLah

Thanks @Mtrl_Scientist! We will look into the consolidation performance. We will also suggest configurations for performing consolidation in a sane way.

It is quite clear to me that we need to modify our write behavior to perform consolidation automatically, given some default parameters I have in mind (e.g., consolidate when the fragments are < X MB, consolidate Y fragments at a time, etc.). Of course, a power user can also tweak those; we always expose all tuning via user-defined configs.


Thanks @stavros, that would be most welcome!