Concurrent read tile arrays from s3

mikejiang · March 25, 2021, 11:45pm

I can’t seem to get concurrent reads to work (I want them in order to speed up reading).
Here is my config:

cfg["sm.compute_concurrency_level"] = 10;
cfg["sm.io_concurrency_level"] = 10;
// Set to the default for now; I have yet to figure out how to use it
// (setting it too small, e.g. 100k, seems to slow down reads).
cfg["vfs.min_parallel_size"] = 10485760;

And here is my tiled 2D array layout:

// Rows are cells: the tile extent ncell spans the whole "cell" dimension.
domain.add_dimension(
    tiledb::Dimension::create<int>(ctx, "cell", {1, ncell}, ncell));
// Columns are channels: a tile extent of 1 gives one tile per channel.
domain.add_dimension(tiledb::Dimension::create<int>(
    ctx, "channel", {1, nch == 0 ? 1 : nch}, 1));
tiledb::ArraySchema schema(ctx, TILEDB_DENSE);
schema.set_domain(domain);
schema.add_attribute(tiledb::Attribute::create<float>(ctx, "mat"));
schema.set_tile_order(TILEDB_COL_MAJOR).set_cell_order(TILEDB_COL_MAJOR);

Basically, I store a cells x channels 2D array: each row is a cell and each column is a channel. The access pattern is reading by channel, so I am tiling by channel.

I wonder how concurrent read requests can be applied here to optimize reading multiple channels at once, say using 10 threads to fetch 10 channels concurrently.
Here is what my query looks like:

tiledb::Query query(*ctxptr_, *mat_array_ptr_);
query.set_layout(TILEDB_COL_MAJOR);
const int nrow = dims[0];
const int ncol = cidx.size();
const int dim_idx = 1;             // the "channel" dimension
query.add_range<int>(0, 1, nrow);  // select all rows

// TileDB indices start from 1
for (int i : cidx) {
  query.add_range<int>(dim_idx, i + 1, i + 1);
}

arma::Mat<float> buf(nrow, ncol);

query.set_buffer("mat", buf.memptr(), nrow * ncol);
query.submit();
query.finalize();

I hope someone can point me in the right direction here.

stavros · March 30, 2021, 5:37pm

Thanks @mikejiang! I just noticed that we don’t parallelize over attributes. We will open a ticket and address this very soon.

mikejiang · March 30, 2021, 7:22pm

parallelize over attributes
What does that mean? So currently TileDB only parallelizes over different arrays, and does not concurrently read different tiles within the same array?

stavros · March 30, 2021, 7:32pm

Apologies, I’ve misread your comment. You currently seem to have one attribute (mat).

I think you might have just a single tile and a single attribute in the array (and no filters on the attribute), so TileDB can’t find anything to parallelize over :). Typically TileDB parallelizes over attributes, over tiles per attribute, and over chunks per tile when there are filters.

Can you define a tile extent per dimension which does not cover the whole dimension domain? In other words, use a last argument in add_dimension that is smaller than ncell or nch.

Also to be sure, can you please enable and dump the stats for your reads?

Thanks!
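
For readers following the tile-extent suggestion above, here is a minimal sketch of how the number of tiles in a dense array follows from each dimension's domain and tile extent. It is plain arithmetic with no TileDB calls; the helper name tiles_along is hypothetical and introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the TileDB API): number of tiles along one
// dimension of a dense array with inclusive domain [lo, hi] and the given
// tile extent. This is a ceiling division of the domain length by the extent.
std::int64_t tiles_along(std::int64_t lo, std::int64_t hi,
                         std::int64_t extent) {
  assert(extent > 0 && hi >= lo);
  const std::int64_t length = hi - lo + 1;
  return (length + extent - 1) / extent;
}

// Total tiles per attribute in a 2D dense array: the product of the
// per-dimension tile counts.
std::int64_t tiles_2d(std::int64_t nrow, std::int64_t row_extent,
                      std::int64_t ncol, std::int64_t col_extent) {
  return tiles_along(1, nrow, row_extent) * tiles_along(1, ncol, col_extent);
}
```

With the schema from the first post (extent ncell on "cell", extent 1 on "channel"), tiles_2d(ncell, ncell, nch, 1) evaluates to nch, i.e. one tile per channel. A single tile results only when both extents cover their whole domains, as in tiles_2d(ncell, ncell, nch, nch).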

mikejiang · April 3, 2021, 1:15am

No, I do have multiple tiles for the ‘mat’ attribute. As you can see above, for the cell dimension (i.e. rows) I defined the tile extent as ncell, which puts all cells in one tile, and for the channel dimension (i.e. columns) I defined the tile extent as 1.
So basically our access pattern is fetching one or more columns (channels) each time.
With this 2D layout, I expect TileDB to perform concurrent reads for multi-column I/O requests. But based on my testing, using 1 thread or 17 threads doesn’t make any difference in time when I read 17 columns of data from S3 (each column is about 4.5MB).
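
As a rough aid to the numbers in this post, the sketch below compares the size of one read with vfs.min_parallel_size from the config in the first post. It assumes a simplified model in which a single VFS read is split into at most nbytes / min_parallel_size parts, capped by a maximum number of parallel operations. That model, the helper names, and the max_ops parameter are assumptions for illustration, not TileDB's actual implementation, and the thread does not establish whether the 17 column tiles are fetched as separate requests or coalesced.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: bytes in one uncompressed tile of float cells.
std::uint64_t tile_bytes(std::uint64_t cells_per_tile) {
  return cells_per_tile * sizeof(float);
}

// Simplified, assumed model: how many parallel parts a single read of nbytes
// is split into, given a minimum part size and a cap on parallel operations.
// A read smaller than min_parallel_size stays as one part.
std::uint64_t parallel_parts(std::uint64_t nbytes,
                             std::uint64_t min_parallel_size,
                             std::uint64_t max_ops) {
  assert(min_parallel_size > 0 && max_ops > 0);
  const std::uint64_t by_size =
      std::max<std::uint64_t>(nbytes / min_parallel_size, 1);
  return std::min(by_size, max_ops);
}
```

Under this model, one 4.5MB column read with the 10485760-byte setting gives parallel_parts(4500000, 10485760, 10) == 1, so a single column would not be split further. The stats dump requested in the thread remains the way to see what actually happens.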

stavros · April 5, 2021, 4:16pm

TileDB should be parallelizing across tiles as long as nch > 1. Could you please enable and dump the stats for your reads, as I suggested above, for both nthreads = 1 and nthreads = 17 (setting your concurrency config options accordingly, and passing cfg to the context at the very beginning of your program)? We should be able to see what’s happening from the stats.
