Non-empty domain after write in global layout

We want to store dense chunked data in the TileDB format using the C++ API, writing with the TILEDB_GLOBAL_ORDER layout. To do so, we pad the data with fill values so that it fits the specified tiling. Writing this data works. However, the non-empty domain of the resulting array seems to include the fill values, whereas we would expect them to be excluded because they represent empty cells. An example of this behaviour is given below. Is this behaviour a bug, or do we misunderstand how fill values or the non-empty domain work?

/*
Creates a 1-dimensional array of the form [nan, 1 | 2, 3] and reads its non-empty domain.
Expected Output: [1, 3]
Received Output: [0, 3]
*/

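#include <cmath>
#include <filesystem>
#include <iostream>
#include <vector>
#include <tiledb/tiledb>

// Create the array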
std::vector<float> data {std::nanf(""), 1, 2, 3};
tiledb::Context Context {};

tiledb::Domain domain(Context);
domain.add_dimension(tiledb::Dimension::create<uint64_t>(Context, "test_dimension", {0, 3}, 2));

tiledb::ArraySchema schema(Context, TILEDB_DENSE);
schema.set_domain(domain);
schema.add_attribute(tiledb::Attribute::create<float>(Context, "test"));

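std::filesystem::path path("test.tdb");
// Pass the URI as std::string; the implicit path-to-string conversion is not portable
tiledb::Array::create(path.string(), schema);
tiledb::Array array(Context, path.string(), TILEDB_WRITE);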

// Write data to the array
tiledb::Subarray subarray(Context, array);
subarray.add_range<uint64_t>(0, 0, 3);
tiledb::Query query(Context, array, TILEDB_WRITE);
query.set_layout(TILEDB_GLOBAL_ORDER);
query.set_subarray(subarray);
query.set_data_buffer("test", data);
query.submit();
query.finalize();
array.close();

// Get non-empty domain
array.open(TILEDB_READ);
auto non_empty = array.non_empty_domain<uint64_t>();  // query the domain once
std::cout << non_empty[0].second.first << "     "
          << non_empty[0].second.second << std::endl;

Hi @fabian-na, fill values are not filtered out on write, so any cell that you explicitly write into will be included in the non-empty domain. For dense arrays, fill values are returned on read in two cases: (1) unwritten cells within the query ranges, and (2) written cells that do not match the query condition, if one is provided.

Okay, but to write in global order I have to pad the data with fill (or other) values so that it fits the tile shape. If non_empty_domain cannot tell padding apart from real data in explicit writes, does that mean there is no way of knowing which cells were added only to fit the tile shape without processing the whole array (e.g. using a query condition)?

Yes, correct. Dense writes only record the bounds of the write for each dimension. One option is to use query conditions, as you mentioned; another is to use nullable attributes. Nullable attributes are useful when you don’t want to reserve any value from the datatype's range as a marker (among other reasons). However, if you are (1) only using floating point and (2) OK with using the default quiet NaN (or another NaN) to indicate empty cells, then a NaN marker might be preferable, because nullable attributes require a little more I/O.
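For the nullable-attribute route, each cell needs a validity flag next to its value. The following is a minimal sketch under some assumptions: the padding cells in the staging buffer are marked with NaN, and `make_validity` is a hypothetical helper name, not part of the TileDB API.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of the TileDB API): returns one validity
// byte per cell, 1 for a real value and 0 for a NaN padding cell.
std::vector<uint8_t> make_validity(const std::vector<float>& data) {
    std::vector<uint8_t> validity;
    validity.reserve(data.size());
    for (float v : data) {
        validity.push_back(std::isnan(v) ? 0 : 1);
    }
    return validity;
}
```

For the example data [nan, 1, 2, 3] this yields {0, 1, 1, 1}. In the TileDB C++ API the vector would be passed alongside the data buffer with Query::set_validity_buffer, on an attribute made nullable with set_nullable(true). Since dense writes only record the bounds of the write, this marks the padding cells as null but would not shrink the non-empty domain.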

Finally, if the data is sparse enough, then TILEDB_SPARSE could be an option. In that case you can write/retrieve only the nonempty cells, rather than the entire rectangle for each tile.
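To illustrate what "only the nonempty cells" means for the buffers, here is a rough sketch that turns a padded dense buffer into coordinate and value buffers. It assumes padding is marked with NaN; `SparseCells` and `to_sparse_cells` are hypothetical names, not TileDB types or functions.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for sparse write buffers (not a TileDB type).
struct SparseCells {
    std::vector<uint64_t> coords;  // coordinates along "test_dimension"
    std::vector<float> values;     // values of attribute "test"
};

// Keeps only the cells that hold real data.
// `origin` is the coordinate of data[0] along the dimension.
SparseCells to_sparse_cells(const std::vector<float>& data, uint64_t origin) {
    SparseCells out;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (!std::isnan(data[i])) {
            out.coords.push_back(origin + i);
            out.values.push_back(data[i]);
        }
    }
    return out;
}
```

For [nan, 1, 2, 3] with origin 0 this gives coordinates {1, 2, 3} and values {1, 2, 3}. With a TILEDB_SPARSE schema, both vectors would be attached to the write query with set_data_buffer, using the dimension name for the coordinates and the attribute name for the values.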

To add one point of clarification here: narrowing the write subarray range to less than the full dimension bounds will work fine:

subarray.add_range<uint64_t>(0, 1, 3);

This will give you a non-empty domain of (1, 3).
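If the padding sits only at the start and end of the buffer, the narrowed range and the matching data buffer can be derived before the write. This is a sketch under that assumption, with NaN as the padding marker; `TrimmedWrite` and `trim_padding` are hypothetical names, not part of TileDB.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical result type (not a TileDB type).
struct TrimmedWrite {
    uint64_t first;             // inclusive lower bound for add_range
    uint64_t last;              // inclusive upper bound for add_range
    std::vector<float> values;  // cells first..last, end padding removed
};

// Strips leading and trailing NaN padding. Returns std::nullopt if every
// cell is padding. Interior NaN cells are kept, since the range is contiguous.
std::optional<TrimmedWrite> trim_padding(const std::vector<float>& data) {
    std::size_t lo = 0;
    while (lo < data.size() && std::isnan(data[lo])) ++lo;
    if (lo == data.size()) return std::nullopt;

    std::size_t hi = data.size() - 1;
    while (std::isnan(data[hi])) --hi;  // stops at lo at the latest

    auto begin = data.begin() + static_cast<std::ptrdiff_t>(lo);
    auto end = data.begin() + static_cast<std::ptrdiff_t>(hi) + 1;
    return TrimmedWrite{lo, hi, std::vector<float>(begin, end)};
}
```

For [nan, 1, 2, 3] the result has first = 1, last = 3 and values {1, 2, 3}, which matches the add_range call above. The bounds assume that data[0] sits at coordinate 0; a different origin would have to be added to both.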