TileDB performance with sparse point cloud data

We are currently testing whether we could use TileDB in our project, but we are unhappy with the performance: slicing point cloud data seems to take quite a bit of time. Our current test case is a point cloud with 161059 points (xyz data), and we use the C++ API to read the data from TileDB. The schema of our data looks like this:

ArraySchema(
  domain=Domain(*[
    Dim(name='z', domain=(-1000.0, 1000.0), tile=2000.0, dtype='float32'),
    Dim(name='y', domain=(-1000.0, 1000.0), tile=2000.0, dtype='float32'),
    Dim(name='x', domain=(-1000.0, 1000.0), tile=2000.0, dtype='float32'),
  ]),
  attrs=[
    Attr(name='floor', dtype='uint8', var=False, nullable=False),
    Attr(name='outlier', dtype='uint8', var=False, nullable=False),
  ],
  cell_order='hilbert',
  tile_order=None,
  capacity=10000,
  sparse=True,
  allows_duplicates=False,
)

The bounding box of the data is:
non_empty_domain x: -31.0248 50.0115
non_empty_domain y: -23.6003 32.5186
non_empty_domain z: -3.32503 6.89223

Reading all of that data from the database takes around 250-800 ms, which already feels a bit slow. We then increased the size of the database by copying the input data around the bounding box above, e.g. shifting all the points to the right by adding maxX - minX to the x coordinates, and repeating that in every direction so that the data is duplicated all around the original bounding box. The new database has a bounding box of:
non_empty_domain x: -598.279 617.266
non_empty_domain y: -416.433 425.351
non_empty_domain z: -33.9768 37.544
Everything else stays the same, schema and all; only the amount of data is increased (and even this database is rather small for our purposes). Now, reading the same original data from this larger database, using the first bounding box as the slicing area (basically the same slice as before), takes around 7-8 seconds, which is too slow for our purposes. Is this the expected performance? If not, how can it be improved? Our usual slicing dimensions are around 100x100x10 meters, maybe twice that.
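
For reference, the duplication is done roughly like this (a simplified sketch of our fill routine; the function name and parameters here are illustrative, not our production code):

// Simplified sketch of how we expand the test data: take the original block of
// points, shift it by the bounding-box extent in each direction, and write each
// shifted copy with its own unordered write (one fragment per copy).
void writeShiftedCopy(const tiledb::Context &ctx, const std::string &array_name,
                      std::vector<float> x, std::vector<float> y, std::vector<float> z,
                      std::vector<uint8_t> floorAttr, std::vector<uint8_t> outlierAttr,
                      float dx, float dy, float dz)
{
    for (auto &v : x) v += dx;  // e.g. dx = maxX - minX for the copy to the right
    for (auto &v : y) v += dy;
    for (auto &v : z) v += dz;

    tiledb::Array array(ctx, array_name, TILEDB_WRITE);
    tiledb::Query query(ctx, array, TILEDB_WRITE);
    query.set_layout(TILEDB_UNORDERED)
        .set_data_buffer("x", x)
        .set_data_buffer("y", y)
        .set_data_buffer("z", z)
        .set_data_buffer("floor", floorAttr)
        .set_data_buffer("outlier", outlierAttr);
    query.submit();
    array.close();
}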

We have already consolidated the data and also vacuumed it; there used to be 104 fragments, but after consolidation and vacuuming there are only 4. However, the consolidation/vacuuming seemed to have no effect on read performance: it took around 8 s to read from the larger-bounding-box data before consolidation, and the same time after.
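
For completeness, the consolidation and vacuuming were done with the standard C++ API calls, roughly:

// Consolidate the fragments of the sparse array, then vacuum so the
// consolidated-away fragments are removed from disk.
tiledb::Context ctx;
tiledb::Array::consolidate(ctx, array_name);
tiledb::Array::vacuum(ctx, array_name);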

I got a slight improvement in reading when I switched from libtiledb 2.14 to libtiledb 2.15; it shaved off around 1 s from the read time, dropping it to around 6.8 s.

Here is the code we use for reading the data (and for the timing measurements):

// Headers needed by this snippet; qc_op2str is a small helper of ours
// (defined elsewhere) that turns the query-condition op into a string for logging.
#include <array>
#include <chrono>
#include <iostream>
#include <string>
#include <vector>

#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <tiledb/tiledb>

pcl::PointCloud<pcl::PointXYZ>::Ptr readData(const tiledb::Config &config,
                                             const std::string &array_name,
                                             const std::array<float, 2> &xrange,
                                             const std::array<float, 2> &yrange,
                                             const std::array<float, 2> &zrange,
                                             const std::string &attribute = "",
                                             const uint8_t attributeValue = 0,
                                             const tiledb_query_condition_op_t queryComparison = TILEDB_EQ)
{
    tiledb::Context ctx(config);
    // Prepare the array for reading
    tiledb::Array array(ctx, array_name, TILEDB_READ);
    tiledb::Subarray subarray(ctx, array);
    subarray.add_range("z", zrange[0], zrange[1])
        .add_range("y", yrange[0], yrange[1])
        .add_range("x", xrange[0], xrange[1]);

    // Will hold the query return status. If the input buffers are not large enough
    // to hold the return values, this will be TILEDB_INCOMPLETE
    // and we must consume the data from the buffers and then resubmit the query
    // to get the rest of the data.
    tiledb::Query::Status status;

    // Prepare the query
    tiledb::Query query(ctx, array);
    query.set_subarray(subarray);
    //.set_layout(TILEDB_COL_MAJOR);
    // Get estimate for a fixed-length attribute or dimension (in bytes)
    uint64_t est_size = query.est_result_size("floor");
    std::cout << "estimated buffer size for floor: " << est_size << std::endl;
    est_size = query.est_result_size("outlier");
    std::cout << "estimated buffer size for outlier: " << est_size << std::endl;
    est_size = query.est_result_size("x");
    std::cout << "estimated buffer size for x: " << est_size << std::endl;
    est_size = query.est_result_size("y");
    std::cout << "estimated buffer size for y: " << est_size << std::endl;
    est_size = query.est_result_size("z");
    std::cout << "estimated buffer size for z: " << est_size << std::endl;

    const int buffer_max_limit = 1024 * 1024 * 256; // bytes
    if (est_size > buffer_max_limit)
    {
        est_size = buffer_max_limit;
        std::cout << "Limiting buffer size to 256MB (" << est_size << ")" << std::endl;
    }

    // if attribute is given, then we pass those conditions all the way to tiledb
    // so that it only returns those points that have the attribute condition met
    // thus making stuff easier since we don't have to check or filter based on
    // attribute values.
    // With this we can e.g. get only floor points from given xyz intervals.
    tiledb::QueryCondition querycond(ctx);
    if (attribute.length() > 0)
    {
        std::cout << "Using " << attribute << qc_op2str(queryComparison) << (attributeValue > 0) << " as query condition" << std::endl;
        querycond.init(attribute, &attributeValue, sizeof(uint8_t), queryComparison);
        query.set_condition(querycond);
    }

    // Prepare the vectors that will hold the result.
    // We take an upper bound on the result size, as we do not
    // know a priori how big it is (since the array is sparse).
    // maxPoints is the maximum number of points the buffers can hold.
    int maxPoints = (est_size / sizeof(float));
    std::cout << "points: " << maxPoints << std::endl;

    std::vector<float> datax(maxPoints, 0.1337f);
    std::vector<float> datay(maxPoints, 0.1337f);
    std::vector<float> dataz(maxPoints, 0.1337f);

    // attributes (metadata) for each point
    std::vector<uint8_t> dataf(maxPoints, 250);
    std::vector<uint8_t> datao(maxPoints, 250);
    auto start = std::chrono::high_resolution_clock::now(); // start the timer
    query.set_data_buffer("z", dataz)
        .set_data_buffer("y", datay)
        .set_data_buffer("x", datax)
        .set_data_buffer("floor", dataf)
        .set_data_buffer("outlier", datao);

    pcl::PointCloud<pcl::PointXYZ>::Ptr cloudPtr(new pcl::PointCloud<pcl::PointXYZ>);

    int floorCount = 0;
    int restCount = 0;
    // Submit the query and close the array.
    do
    {
        // tiledb::Stats::enable();
        query.submit();
        // tiledb::Stats::dump(stdout);
        // tiledb::Stats::disable();
        status = query.query_status();

        // IMPORTANT: check if there are any results, as your buffer
        // could have been too small to fit even a single result
        auto result_num = query.result_buffer_elements()["z"].second;

        if (status == tiledb::Query::Status::INCOMPLETE && (result_num == 0))
        {
            // You need to reallocate your buffers, otherwise
            // you will get an infinite loop
            std::cout << "Initial size for buffer is too small to even contain one data point! " << datax.capacity()
                      << std::endl;
            datax.resize(datax.capacity() * 2);
            datay.resize(datay.capacity() * 2);
            dataz.resize(dataz.capacity() * 2);
            dataf.resize(dataf.capacity() * 2);
            datao.resize(datao.capacity() * 2);
            query.set_data_buffer("z", dataz)
                .set_data_buffer("y", datay)
                .set_data_buffer("x", datax)
                .set_data_buffer("floor", dataf)
                .set_data_buffer("outlier", datao);
        }
        else if (result_num > 0)
        {
            // Do something with the results
            // You could set new buffers to the query here
            std::cout << "Data contained " << result_num << " points and status is "
                      << status << std::endl;
            // next store the data into pcl data structure.
            for (size_t i = 0; i < result_num; i++)
            {
                // std::cout << datax[i] << " " << datay[i] << " " << dataz[i] << std::endl;
                cloudPtr->points.push_back(pcl::PointXYZ(datax[i], datay[i], dataz[i]));
                if (dataf[i] > 0)
                    floorCount++;
                else
                    restCount++;
            }
        }
    } while (status == tiledb::Query::Status::INCOMPLETE);
    auto stop = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
    std::cout << "Actual query duration: " << duration.count() << "ms" << std::endl;
    auto non_empty_domain_x = array.non_empty_domain<float>("x");
    auto non_empty_domain_y = array.non_empty_domain<float>("y");
    auto non_empty_domain_z = array.non_empty_domain<float>("z");
    std::cout << "non_empty_domain x: " << non_empty_domain_x.first << " " << non_empty_domain_x.second << std::endl;
    std::cout << "non_empty_domain y: " << non_empty_domain_y.first << " " << non_empty_domain_y.second << std::endl;
    std::cout << "non_empty_domain z: " << non_empty_domain_z.first << " " << non_empty_domain_z.second << std::endl;

    array.close();
    std::cout << "FloorCount: " << floorCount << " restCount: " << restCount << std::endl;
    return cloudPtr;
}
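
For reference, we call readData roughly like this (the array name is illustrative; the ranges are the non-empty-domain values printed above):

tiledb::Config config;
auto cloud = readData(config, "pointcloud_array",
                      {-31.0248f, 50.0115f},   // x range
                      {-23.6003f, 32.5186f},   // y range
                      {-3.32503f, 6.89223f});  // z range
std::cout << "points read: " << cloud->points.size() << std::endl;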

What also confuses us is the size of the read buffers (datax, datay, dataz, etc. in the code). When the size of the database increases, so does the required size of these read buffers. This is a bit strange, since the slice size stays the same; we should always get 161059 points. And we do, it's just that the read buffers need to be larger when the TileDB database is larger.

So when only 161059 points are in the database, the read output looks like this:

estimated buffer size for floor: 161059
estimated buffer size for outlier: 161059
estimated buffer size for x: 644236 // in bytes, in vector<float> this is 161059 floats
estimated buffer size for y: 644236
estimated buffer size for z: 644236
points: 161059
Data contained 161055 points and status is COMPLETE
Actual query duration: 385ms
non_empty_domain x: -31.0248 50.0115
non_empty_domain y: -23.6003 32.5186
non_empty_domain z: -3.32503 6.89223
FloorCount: 83050 restCount: 78005

But with the larger dataset:

estimated buffer size for floor: 3925053
estimated buffer size for outlier: 3925053
estimated buffer size for x: 15700211
estimated buffer size for y: 15700211
estimated buffer size for z: 15700211
points: 3925052 //Size of this has increased a lot
// and still we could not read all the data in one 
// iteration, had to read it twice
Data contained 51516 points and status is INCOMPLETE
Data contained 109541 points and status is COMPLETE
Actual query duration: 8304ms
non_empty_domain x: -598.279 617.266
non_empty_domain y: -416.433 425.351
non_empty_domain z: -33.9768 37.544
FloorCount: 83051 restCount: 78006

Basically we are still getting the same number of points back; it just takes a much larger read buffer to hold the intermediate result. Why is this? I would imagine that the slice size is the same no matter how much data there is outside the slice volume, unless this comes down to the capacity setting somehow.
All these tests are run on a local HP EliteBook laptop with an 8th-gen Core i7 and 32 GB of RAM.

Br.
Jarkko

p.s. The read data seems to be missing some points. The input data had 161059 points, and when slicing with the bounding box above as the slicing area, we are missing 4 points: the output data only has 161055 points. Maybe this is due to rounding when the bounding box is returned by array.non_empty_domain(“x”) etc.; we have not investigated which points are left out.
I know that all the data points are there, because if I make the slicing volume a tad bigger than the data bounding box in TileDB, I get all 161059 points.

Hello @Jagge! Thanks for posting. Would it be possible to get a dump of TileDB statistics so we can see where the time for your query is spent on your machine?
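
The tiledb::Stats calls are already present, but commented out, in your readData function; enabling them around the submit would look roughly like this:

tiledb::Stats::enable();      // start collecting TileDB-internal statistics
query.submit();               // run the read (repeat for incomplete submissions)
tiledb::Stats::dump(stdout);  // print the statistics report to stdout
tiledb::Stats::disable();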

I’ve also got a few additional questions:

  1. Is the array stored locally on your laptop?
  2. What operating system are you running on?
  3. How large is the array in terms of bytes? It looks like it's only in the 10-100 MB range?
  4. Would you be willing to share the array with us so we can reproduce on the same data? If the data is sensitive, we can provide a secure method for sharing.
  5. If you can't share the array, can you run tree or ls -laR on the array folder structure so we can see the physical layout after you've added the records and performed the consolidation and vacuuming?

If you cannot share the array, it would also be helpful to know something about the distribution of the points. In that case, we'll try to create a reproduction by writing some generated data (from Python first), and then benchmark/profile reading with both Python and PCL. But we'd like to make sure we generate similar data, if possible.

Hi,

The array is stored locally on a hard drive; we're using an Ubuntu 22.04 virtual machine on a Windows host. du -hs on the whole database directory gives 186M. I can share the data; the actual point cloud is only 1.9M, so I can send it via e.g. email.

edit: fixed cloud size

Hi @Jagge, please do - isaiah <at> tiledb.com. If you need a place to upload it due to file size, I can send a file request. Thanks!

Hi @Jagge,

Thanks for sending the example code and data. We sent back a modified version with the following changes:

  • uses query.set_layout(TILEDB_UNORDERED) → this brings the TileDB query time consistently down to around 400ms (whereas we also saw around 8s with the original code); see the combined sketch after this list
  • uses range bounds with more precision: the original bounds had only 4 decimal places, which excludes some points on the negative side where the full coordinate value is less than the given range, e.g.:
In [32]: -31.02484703 < -31.0248
Out[32]: True
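
Put together, the two changes look roughly like this inside the readData function from earlier in the thread (a sketch; the range values shown are the full-precision bounds of the original block):

tiledb::Subarray subarray(ctx, array);
subarray.add_range("z", -3.325037704f, 6.892231941f)
    .add_range("y", -23.60031700f, 32.51863861f)
    .add_range("x", -31.02484703f, 50.01146697f);

tiledb::Query query(ctx, array);
query.set_subarray(subarray)
    .set_layout(TILEDB_UNORDERED);  // return results unordered instead of globally sorted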

Best,
Isaiah

Thank you, appreciate the help! Unfortunately we are unable to understand the fundamental difference between the program we submitted and the one you sent back to us. We are using query.set_layout(TILEDB_UNORDERED) in the writeData function just like you do, and we initialize the database to use TILEDB_HILBERT just like you do, so this does not explain the time difference. I do see that the data bounding box in your returned code is smaller than in the example code we sent, but when I added the “incr” padding to the fillArea function, I got the same data bounding box that you got, and the execution times still stay very long.

The writeData functions are the same:

    tiledb::Array array(ctx, array_name, TILEDB_WRITE);
    tiledb::Query query(ctx, array, TILEDB_WRITE);
    query.set_layout(TILEDB_UNORDERED);

What I did notice is that if I first write the “center block” of the filled array like you do, and then fill the rest using the fillArea and writeData functions, the read times drop to the 400-800 ms range. It seems that the write function causes some high-latency issue (maybe because the “center” fragment is written multiple times?) when it is used to write the very first fragment of data into the database. If the initial block is written like so:

    auto lg_array = tiledb::Array(ctx, dbName, TILEDB_WRITE);
    auto lg_query = tiledb::Query(ctx, lg_array);
    lg_query.set_layout(TILEDB_UNORDERED);
    lg_query.set_data_buffer("x", xyz_data[0]);
    lg_query.set_data_buffer("y", xyz_data[1]);
    lg_query.set_data_buffer("z", xyz_data[2]);
    lg_query.set_data_buffer("floor", fl_data);
    lg_query.set_data_buffer("outlier", o_data);
    auto lg_status = lg_query.submit();
    assert(lg_status == tiledb::Query::Status::COMPLETE);
    lg_array.close();

then the read times stay low even as the database grows through the fillArea function. In fact, we tried a database size of 9 GB by making fillArea write the data more densely, and the read times still stayed in the aforementioned 400-800 ms range. Looks promising.

Follow-up question then: is the delete/modify feature available yet? Edit: found it and used it. There is no mention of it in the documentation, at least.

It seems that the slowdown might be due to writing data in the same spot where previous data was written. If the only way to delete data currently is to mark it as deleted via metadata, will searches eventually become slow in a volume where data has been “modified” multiple times? We deleted data from the center-most fragment like so:

deleteData(config, dbName,
                {-31.02484703f, 50.01146697f},  // X range
                {-23.60031700f, 32.51863861f},  // Y range
                {-3.325037704f, 6.892231941f}); // Z range
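
For context, the deleteData helper is roughly along these lines (a sketch rather than the exact code; it uses the C++ delete-query API and assumes query conditions may reference dimension coordinates in the TileDB version in use):

// Sketch of a range-based delete: open the array for deletion and submit a
// delete query whose condition selects the coordinate box to remove.
void deleteData(const tiledb::Config &config, const std::string &array_name,
                const std::array<float, 2> &xrange,
                const std::array<float, 2> &yrange,
                const std::array<float, 2> &zrange)
{
    tiledb::Context ctx(config);
    tiledb::Array array(ctx, array_name, TILEDB_DELETE);
    tiledb::Query query(ctx, array, TILEDB_DELETE);

    using QC = tiledb::QueryCondition;
    auto cond = QC::create<float>(ctx, "x", xrange[0], TILEDB_GE)
        .combine(QC::create<float>(ctx, "x", xrange[1], TILEDB_LE), TILEDB_AND)
        .combine(QC::create<float>(ctx, "y", yrange[0], TILEDB_GE), TILEDB_AND)
        .combine(QC::create<float>(ctx, "y", yrange[1], TILEDB_LE), TILEDB_AND)
        .combine(QC::create<float>(ctx, "z", zrange[0], TILEDB_GE), TILEDB_AND)
        .combine(QC::create<float>(ctx, "z", zrange[1], TILEDB_LE), TILEDB_AND);

    query.set_condition(cond);
    query.submit();
    array.close();
}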

But searching from that same area took longer after that, and the data contained only 5 points, so the delete was a success. I guess each change to a volume creates a new fragment on top of the existing one, so to speak, and that slows down slicing from that particular volume?

edit2:
To be clearer: I can write the same data to the same location n times, and after each write, reading from that location becomes slower and slower. I can consolidate the fragments and vacuum, but even after those steps, reading the same data from the same location is almost as slow as it was without consolidation. Is there any way to get the performance back to the level it was at after just one write?

Yes, that’s correct. We missed that this was intentional in the initial analysis. From my colleague @KiterLuc:

The main difference is that their fill function to expand their dataset ends up writing a lot of duplicates, and the extra query time is spent in deduplication, where we probably end up going only one cell at a time in some areas of their point cloud. We have an optimization that allows merging a full tile quickly when there is no interleaved data, and that's why our modified version is so much faster. It would be even faster if the user could guarantee that there are no duplicates so we could enable duplicates on the array, but I'm not too familiar with what the user is trying to accomplish.

In order to deduplicate, we need to recalculate the Hilbert numbers and sort.
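
If you can rule out duplicates on the write side (or tolerate them in results), you could also create the array with duplicates allowed so that reads skip deduplication entirely; a minimal C++ schema sketch mirroring the schema at the top of this thread:

// Schema sketch: same layout as above, but with duplicates allowed so that
// reads skip deduplication. Only safe if the writer guarantees no duplicate
// coordinates, or the application can tolerate duplicates in results.
tiledb::Domain domain(ctx);
domain.add_dimension(tiledb::Dimension::create<float>(ctx, "z", {{-1000.0f, 1000.0f}}, 2000.0f))
    .add_dimension(tiledb::Dimension::create<float>(ctx, "y", {{-1000.0f, 1000.0f}}, 2000.0f))
    .add_dimension(tiledb::Dimension::create<float>(ctx, "x", {{-1000.0f, 1000.0f}}, 2000.0f));

tiledb::ArraySchema schema(ctx, TILEDB_SPARSE);
schema.set_domain(domain)
    .set_cell_order(TILEDB_HILBERT)
    .set_capacity(10000)
    .set_allows_dups(true);
schema.add_attribute(tiledb::Attribute::create<uint8_t>(ctx, "floor"))
    .add_attribute(tiledb::Attribute::create<uint8_t>(ctx, "outlier"));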

If you are ok with running consolidation, then you can issue a delete, then run consolidation with the "sm.consolidation.purge_deleted_cells" option set to "true".
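
In C++ that looks roughly like this (cfg and array_name are illustrative names):

// Delete, then consolidate with purging of deleted cells, then vacuum the
// old fragments. "sm.consolidation.purge_deleted_cells" is the option above.
tiledb::Config cfg;
cfg["sm.consolidation.purge_deleted_cells"] = "true";
tiledb::Array::consolidate(ctx, array_name, &cfg);
tiledb::Array::vacuum(ctx, array_name);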

However, based on your description of the performance you see in practice, we are going to run some additional tests and make sure the performance matches expectations.

Sorry for the oversight, added to main docs today: https://docs.tiledb.com/main/how-to/arrays/writing-arrays/deleting

Hello @Jagge! Just a few more details that might be helpful.

When you rewrite the same data in multiple fragments, you will incur a penalty when reading the data back, as we need to sort the data. We also need to read more data from disk and decompress it.

If you run consolidation, the data will already be sorted in the new fragment, but we keep every written cell in the new fragment along with the time it was written. When reading the data back, we need to remove duplicate cells (cells with the same coordinates but different timestamps), but as this is of linear complexity on already sorted data, it will be faster than an unconsolidated read. You will still read more data from disk and decompress it, so it will be slower than a single-fragment read. FYI, this is done so that TileDB keeps its time-traveling capability after consolidation.

If you run a delete query, the delete condition is only applied at read time, so it will not help instantly. If you consolidate after a delete, we do apply the condition on the new fragment, but the results are stored in a new field called “delete timestamp”. The data is still there in the new fragment, as we again want to keep time-traveling capabilities. You probably will not see much improvement with the consolidated delete here over regular consolidation.

Now, if you use the purge deleted cells option, the deleted cells will be excluded from the new fragment altogether so you should see performance close to the original read there. Note that this option is experimental for a couple more releases until we get extra validation. Please let us know if you do not see better performance with this option.

There has been talk of adding a new option where consolidation would not write duplicates; this would lose the time-traveling capability but would help your case greatly. It is not implemented yet.
