Optimizing the reads for sparse arrays

Hi,
I am exploring data storage in TileDB arrays using the Python API and I am quite fascinated by the Time Travel feature.
I am using TileDB core version 2.15.2 and TileDB-Py version 0.21.3.

I store a 2-D sparse dataset in a sparse TileDB array with a single attribute.
The dataset has 10^7 data points.
I write the data in chunks of 10^5 points at a time, so 100 writes in total, which gives me 100 fragments/versions. The write operation is quite fast.

I then inspected read performance using discrete query sets of different sizes.
By a discrete query set, I mean that the points to be queried are spread evenly over the whole dataset.
For example, a discrete query set of size 10^3 has 10^3 points, with consecutive points spaced 10^4 apart.
This ensures that each query has to touch all 100 fragments/versions.
I perform such reads for query sets of size 10^3, 10^4, 10^5 and 10^6.
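For reference, this is roughly how I build such a query set (a minimal sketch of the same np.arange call used in the full script below):

import numpy as np

data_size = 10**7                   # total number of points in the array
query_size = 10**3                  # size of the discrete query set
step = data_size // query_size      # 10^4: spacing between consecutive points

# 10^3 evenly spaced coordinates: 1, 10001, 20001, ..., 9990001
query_points = np.arange(1, data_size + 1, step)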

My observations are:
1.) Read queries for query sets of size 10^3 and 10^4 take around 200-250 milliseconds.
2.) Read queries for a query set of size 10^5 take around 400 milliseconds.
3.) The time sky-rockets for a query set of size 10^6: it takes around 2.2 seconds (2200 milliseconds).

I am unable to figure out the reason for this big jump.

I tried consolidating the fragments, which reduces the time to roughly 1.5 seconds (1500 milliseconds).

But another important feature for me is Time Travel, so I can't vacuum the fragments.
We also prefer to avoid consolidation because it slows down Time Travel considerably when we query the same 10^6 query set at different versions.

Changing the tile extent didn't have much impact on this time. Increasing the tile capacity made the reads faster, but only slightly. Maybe I am not setting the capacity and extent correctly.

My code is attached below along with the performance statistics for the 10^6 case.
It stores a 2-D sparse dataset where the coordinates are of the form (a, a) and the corresponding attribute value is a, with a ranging from 1 to 10^7.

Please advise on how to reduce the read time, preferably to around 700-900 milliseconds.

import tiledb, time
import numpy as np

version_list = []

for i in range(100) :
  arr = np.arange(i*100000 + 1 , (i+1)*100000 + 1)
  version_list.append(arr)

d1 = tiledb.Dim(name="d1", domain=(np.iinfo(np.int32).min // 2, np.iinfo(np.int32).max // 2), tile=1000000, dtype=np.int32)
d2 = tiledb.Dim(name="d2", domain=(np.iinfo(np.int32).min // 2, np.iinfo(np.int32).max // 2), tile=1000000, dtype=np.int32)

dom1 = tiledb.Domain(d1, d2)
a = tiledb.Attr(name="a", dtype=np.int32)
schema1 = tiledb.ArraySchema(domain=dom1, sparse=True, attrs=[a], capacity = 1000000)
tiledb.Array.create('~/TileDB/sparse/array_sparse_1', schema1)

for i in range(100) :
  d1_data = version_list[i]
  d2_data = version_list[i]
  a_data = version_list[i]
  with tiledb.open('~/TileDB/sparse/array_sparse_1', 'w') as A :
      A[d1_data, d2_data] = a_data

number_of_queries = [1000, 1e4, 1e5, 1e6]
discrete_dataset = []
data_size = 1e7

for n in number_of_queries :
  query_array = np.arange(1, 1e7 + 1, data_size/n)
  discrete_dataset.append(query_array)

time_list = []

for i in range(0,3) : 
  with tiledb.open('~/TileDB/sparse/array_sparse_1') as A :
      start = time.time()
      r = A.multi_index[discrete_dataset[i], discrete_dataset[i]]
      end = time.time()
  time_list.append(end - start)

print(time_list)

with tiledb.open('~/TileDB/sparse/array_sparse_1') as A :
  tiledb.stats_enable()
  start = time.time()
  r = A.multi_index[discrete_dataset[3], discrete_dataset[3]]
  end = time.time()
  tiledb.stats_dump()
  tiledb.stats_disable()
print(end - start)

Time outputs are :
time_list = [0.26061177253723145, 0.24928975105285645, 0.44175052642822266]
Time taken by 10^6 query = 2.1791791915893555
Times are in seconds

Here you can see the times taken by the 10^3, 10^4 and 10^5 queries stored in time_list, and
the sudden jump for the 10^6 query.

Performance statistics are :

TileDB Embedded Version: (2, 15, 2)
TileDB-Py Version: 0.21.3

[
  {
    "timers": {
      "Context.StorageManager.write_store_frag_meta.sum": 4.01336,
      "Context.StorageManager.write_store_frag_meta.avg": 0.0401336,
      "Context.StorageManager.write_meta.sum": 7.9073e-05,
      "Context.StorageManager.write_meta.avg": 7.9073e-07,
      "Context.StorageManager.subSubarray.sort_ranges.sum": 0.0617387,
      "Context.StorageManager.subSubarray.sort_ranges.avg": 0.0123477,
      "Context.StorageManager.subSubarray.read_load_relevant_rtrees.sum": 0.105452,
      "Context.StorageManager.subSubarray.read_load_relevant_rtrees.avg": 0.0210905,
      "Context.StorageManager.subSubarray.read_compute_simple_tile_overlap.sum": 2.33885,
      "Context.StorageManager.subSubarray.read_compute_simple_tile_overlap.avg": 0.467769,
      "Context.StorageManager.subSubarray.compute_relevant_frags.sum": 0.0699658,
      "Context.StorageManager.subSubarray.compute_relevant_frags.avg": 0.0139932,
      "Context.StorageManager.sm_load_fragment_metadata.sum": 0.0636737,
      "Context.StorageManager.sm_load_fragment_metadata.avg": 0.0127347,
      "Context.StorageManager.sm_load_array_schemas_and_fragment_metadata.sum": 0.0740409,
      "Context.StorageManager.sm_load_array_schemas_and_fragment_metadata.avg": 0.0148082,
      "Context.StorageManager.sm_load_array_schema_from_uri.sum": 0.14677,
      "Context.StorageManager.sm_load_array_schema_from_uri.avg": 0.00139781,
      "Context.StorageManager.sm_load_all_array_schemas.sum": 0.147889,
      "Context.StorageManager.sm_load_all_array_schemas.avg": 0.00140847,
      "Context.StorageManager.array_open_write_load_schemas.sum": 0.141187,
      "Context.StorageManager.array_open_write_load_schemas.avg": 0.00141187,
      "Context.StorageManager.array_open_write_load_directory.sum": 0.0514619,
      "Context.StorageManager.array_open_write_load_directory.avg": 0.000514619,
      "Context.StorageManager.array_open_read_load_schemas_and_fragment_meta.sum": 0.0740776,
      "Context.StorageManager.array_open_read_load_schemas_and_fragment_meta.avg": 0.0148155,
      "Context.StorageManager.array_open_read_load_directory.sum": 0.174057,
      "Context.StorageManager.array_open_read_load_directory.avg": 0.0348114,
      "Context.StorageManager.array_open_WRITE.sum": 0.193658,
      "Context.StorageManager.array_open_WRITE.avg": 0.00193658,
      "Context.StorageManager.array_open_READ.sum": 0.248261,
      "Context.StorageManager.array_open_READ.avg": 0.0496521,
      "Context.StorageManager.VFS.ArrayDirectory.list_root_uris.sum": 0.00230594,
      "Context.StorageManager.VFS.ArrayDirectory.list_root_uris.avg": 0.000461187,
      "Context.StorageManager.VFS.ArrayDirectory.list_fragment_meta_uris.sum": 0.00219286,
      "Context.StorageManager.VFS.ArrayDirectory.list_fragment_meta_uris.avg": 0.000438573,
      "Context.StorageManager.VFS.ArrayDirectory.list_commit_uris.sum": 0.170299,
      "Context.StorageManager.VFS.ArrayDirectory.list_commit_uris.avg": 0.0340598,
      "Context.StorageManager.VFS.ArrayDirectory.list_array_schema_uris.sum": 0.0376046,
      "Context.StorageManager.VFS.ArrayDirectory.list_array_schema_uris.avg": 0.000358139,
      "Context.StorageManager.VFS.ArrayDirectory.list_array_meta_uris.sum": 0.00208563,
      "Context.StorageManager.VFS.ArrayDirectory.list_array_meta_uris.avg": 0.000417126,
      "Context.StorageManager.Query.Writer.write_tiles.sum": 1.53089,
      "Context.StorageManager.Query.Writer.write_tiles.avg": 0.00510298,
      "Context.StorageManager.Query.Writer.write_num_tiles.sum": 0.568311,
      "Context.StorageManager.Query.Writer.write_num_tiles.avg": 0.00568311,
      "Context.StorageManager.Query.Writer.split_coords_buff.sum": 6.9238e-05,
      "Context.StorageManager.Query.Writer.split_coords_buff.avg": 6.9238e-07,
      "Context.StorageManager.Query.Writer.sort_coords.sum": 1.04779,
      "Context.StorageManager.Query.Writer.sort_coords.avg": 0.0104779,
      "Context.StorageManager.Query.Writer.prepare_tiles.sum": 0.104241,
      "Context.StorageManager.Query.Writer.prepare_tiles.avg": 0.00104241,
      "Context.StorageManager.Query.Writer.finalize.sum": 6.617e-05,
      "Context.StorageManager.Query.Writer.finalize.avg": 6.617e-07,
      "Context.StorageManager.Query.Writer.filter_tiles.sum": 0.0779903,
      "Context.StorageManager.Query.Writer.filter_tiles.avg": 0.000779903,
      "Context.StorageManager.Query.Writer.filter_tile.sum": 0.185368,
      "Context.StorageManager.Query.Writer.filter_tile.avg": 0.000617895,
      "Context.StorageManager.Query.Writer.dowork.sum": 6.27254,
      "Context.StorageManager.Query.Writer.dowork.avg": 0.0627254,
      "Context.StorageManager.Query.Writer.compute_coord_meta.sum": 0.0555017,
      "Context.StorageManager.Query.Writer.compute_coord_meta.avg": 0.000555017,
      "Context.StorageManager.Query.Writer.check_coord_oob.sum": 0.130682,
      "Context.StorageManager.Query.Writer.check_coord_oob.avg": 0.00130682,
      "Context.StorageManager.Query.Writer.check_coord_dups.sum": 0.0571989,
      "Context.StorageManager.Query.Writer.check_coord_dups.avg": 0.000571989,
      "Context.StorageManager.Query.Reader.unfilter_coord_tiles.sum": 0.126041,
      "Context.StorageManager.Query.Reader.unfilter_coord_tiles.avg": 0.00504165,
      "Context.StorageManager.Query.Reader.unfilter_attr_tiles.sum": 0.0137227,
      "Context.StorageManager.Query.Reader.unfilter_attr_tiles.avg": 0.00274454,
      "Context.StorageManager.Query.Reader.tile_offset_sizes.sum": 0.00284501,
      "Context.StorageManager.Query.Reader.tile_offset_sizes.avg": 0.000569002,
      "Context.StorageManager.Query.Reader.read_tiles.sum": 0.262638,
      "Context.StorageManager.Query.Reader.read_tiles.avg": 0.0131319,
      "Context.StorageManager.Query.Reader.read_coordinate_tiles.sum": 0.216984,
      "Context.StorageManager.Query.Reader.read_coordinate_tiles.avg": 0.0216984,
      "Context.StorageManager.Query.Reader.read_attribute_tiles.sum": 0.0457224,
      "Context.StorageManager.Query.Reader.read_attribute_tiles.avg": 0.00457224,
      "Context.StorageManager.Query.Reader.read_and_unfilter_coords.sum": 0.343704,
      "Context.StorageManager.Query.Reader.read_and_unfilter_coords.avg": 0.0687408,
      "Context.StorageManager.Query.Reader.read_and_unfilter_attributes.sum": 0.0592662,
      "Context.StorageManager.Query.Reader.read_and_unfilter_attributes.avg": 0.0118532,
      "Context.StorageManager.Query.Reader.process_slabs.sum": 0.156089,
      "Context.StorageManager.Query.Reader.process_slabs.avg": 0.0312179,
      "Context.StorageManager.Query.Reader.merge_result_cell_slabs.sum": 0.802831,
      "Context.StorageManager.Query.Reader.merge_result_cell_slabs.avg": 0.160566,
      "Context.StorageManager.Query.Reader.load_tile_var_sizes.sum": 0.00271931,
      "Context.StorageManager.Query.Reader.load_tile_var_sizes.avg": 0.000543862,
      "Context.StorageManager.Query.Reader.load_tile_offsets.sum": 0.233687,
      "Context.StorageManager.Query.Reader.load_tile_offsets.avg": 0.0155792,
      "Context.StorageManager.Query.Reader.load_initial_data.sum": 2.34185,
      "Context.StorageManager.Query.Reader.load_initial_data.avg": 0.46837,
      "Context.StorageManager.Query.Reader.dowork.sum": 4.08559,
      "Context.StorageManager.Query.Reader.dowork.avg": 0.817118,
      "Context.StorageManager.Query.Reader.dedup_tiles_with_timestamps.sum": 0.00259952,
      "Context.StorageManager.Query.Reader.dedup_tiles_with_timestamps.avg": 0.000519904,
      "Context.StorageManager.Query.Reader.dedup_fragments_with_timestamps.sum": 0.00270662,
      "Context.StorageManager.Query.Reader.dedup_fragments_with_timestamps.avg": 0.000541324,
      "Context.StorageManager.Query.Reader.create_result_tiles.sum": 0.00273004,
      "Context.StorageManager.Query.Reader.create_result_tiles.avg": 0.000546008,
      "Context.StorageManager.Query.Reader.copy_fixed_data_tiles.sum": 0.0160737,
      "Context.StorageManager.Query.Reader.copy_fixed_data_tiles.avg": 0.00107158,
      "Context.StorageManager.Query.Reader.compute_tile_bitmaps.sum": 0.07404,
      "Context.StorageManager.Query.Reader.compute_tile_bitmaps.avg": 0.014808,
      "Context.StorageManager.Query.Reader.compute_results_count_sparse.sum": 1.68677,
      "Context.StorageManager.Query.Reader.compute_results_count_sparse.avg": 0.00168677,
      "Context.StorageManager.Query.Reader.compute_result_cell_slab.sum": 0.802887,
      "Context.StorageManager.Query.Reader.compute_result_cell_slab.avg": 0.160577,
      "Context.StorageManager.Query.Reader.apply_query_condition.sum": 0.0031681,
      "Context.StorageManager.Query.Reader.apply_query_condition.avg": 0.000633621
    },
    "counters": {
      "Context.StorageManager.write_tile_var_sizes_size": 39600,
      "Context.StorageManager.write_tile_var_offsets_size": 39600,
      "Context.StorageManager.write_tile_validity_offsets_size": 39600,
      "Context.StorageManager.write_tile_offsets_size": 39600,
      "Context.StorageManager.write_sums_size": 41700,
      "Context.StorageManager.write_rtree_size": 11392,
      "Context.StorageManager.write_processed_conditions_size": 9900,
      "Context.StorageManager.write_null_counts_size": 52900,
      "Context.StorageManager.write_mins_size": 40020,
      "Context.StorageManager.write_maxs_size": 40010,
      "Context.StorageManager.write_frag_meta_footer_size": 48600,
      "Context.StorageManager.write_filtered_byte_num": 68995,
      "Context.StorageManager.write_array_schema_size": 195,
      "Context.StorageManager.read_unfiltered_byte_num": 60475,
      "Context.StorageManager.read_tile_offsets_size": 24000,
      "Context.StorageManager.read_rtree_size": 16000,
      "Context.StorageManager.read_frag_meta_size": 247000,
      "Context.StorageManager.read_array_schema_size": 20475,
      "Context.StorageManager.VFS.write_ops_num": 7502,
      "Context.StorageManager.VFS.write_byte_num": 120467891,
      "Context.StorageManager.VFS.read_ops_num": 8815,
      "Context.StorageManager.VFS.read_byte_num": 600790205,
      "Context.StorageManager.VFS.ls_num": 125,
      "Context.StorageManager.VFS.is_object_num": 107,
      "Context.StorageManager.VFS.file_size_num": 500,
      "Context.StorageManager.Query.Writer.write_filtered_byte_num": 120000000,
      "Context.StorageManager.Query.Writer.tile_num": 100,
      "Context.StorageManager.Query.Writer.dim_num": 200,
      "Context.StorageManager.Query.Writer.dim_fixed_num": 200,
      "Context.StorageManager.Query.Writer.cell_num": 10000000,
      "Context.StorageManager.Query.Writer.attr_num": 100,
      "Context.StorageManager.Query.Writer.attr_fixed_num": 100,
      "Context.StorageManager.Query.Reader.result_num": 2111000,
      "Context.StorageManager.Query.Reader.read_unfiltered_byte_num": 400000000,
      "Context.StorageManager.Query.Reader.num_tiles_read": 1500,
      "Context.StorageManager.Query.Reader.loop_num": 5,
      "Context.StorageManager.Query.Reader.ignored_tiles": 0,
      "Context.StorageManager.Query.Reader.dim_num": 10,
      "Context.StorageManager.Query.Reader.dim_fixed_num": 10,
      "Context.StorageManager.Query.Reader.cell_num": 50000000,
      "Context.StorageManager.Query.Reader.attr_num": 5,
      "Context.StorageManager.Query.Reader.attr_fixed_num": 5
    }
  }
]

==== Python Stats ====

  py.core_read_query_initial_submit_time : 1.56537
  py.core_read_query_total_time : 1.5656
  py.getitem_time : 2.01062
  py.getitem_time.add_ranges : 0.334357
  py.query_retries_count : 0

Another question I would like to ask: how can I query specific, non-contiguous data points in dense arrays? Using multi_index takes the cross product of the coordinates queried on each dimension, which gives us some extra data in the output.
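To illustrate the concern, here is a rough sketch on a tiny made-up dense array (the array name, shape and values are hypothetical, and I am assuming multi_index accepts lists of coordinates on dense arrays the same way it does on sparse ones):

import numpy as np
import tiledb

# Hypothetical 4x4 dense array with a single int32 attribute "a"
dom = tiledb.Domain(
    tiledb.Dim(name="r", domain=(0, 3), tile=2, dtype=np.int32),
    tiledb.Dim(name="c", domain=(0, 3), tile=2, dtype=np.int32),
)
schema = tiledb.ArraySchema(domain=dom, sparse=False,
                            attrs=[tiledb.Attr(name="a", dtype=np.int32)])
tiledb.Array.create('array_dense_example', schema)

with tiledb.open('array_dense_example', 'w') as A:
    A[:] = np.arange(16, dtype=np.int32).reshape(4, 4)

# I only want cells (0, 1) and (2, 3), but multi_index returns the cross
# product of the row list and the column list: (0,1), (0,3), (2,1), (2,3).
with tiledb.open('array_dense_example') as A:
    print(A.multi_index[[0, 2], [1, 3]])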

Thanks for giving your time.

Hello Ak_J,

Thanks for reporting this issue with a standalone script; it will make driving it to resolution much faster. I’ve run your code and I see the issue you reported as well. One thing to clarify first: a few releases ago we implemented the consolidation with timestamps feature, which means you can now consolidate your data while keeping the time traveling feature, even after you vacuum.
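Roughly, a minimal sketch of what that looks like against the array from your script (I'm assuming default consolidation settings here, and the timestamp value below is just a placeholder for a point in time captured around one of your earlier writes):

import tiledb

uri = '~/TileDB/sparse/array_sparse_1'

# Consolidate the 100 fragments and vacuum the originals; with
# consolidation with timestamps, the cell timestamps are preserved
# inside the consolidated fragment, so time travel keeps working.
tiledb.consolidate(uri)
tiledb.vacuum(uri)

# Time travel: open the array as of an earlier point in time
# (milliseconds since epoch; can also be obtained from
# tiledb.array_fragments(uri) before consolidating).
past_ms = 1_650_000_000_000  # placeholder timestamp
with tiledb.open(uri, timestamp=past_ms) as A:
    r = A.multi_index[1:1000, 1:1000]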

Running your code, I noticed that while the jump in your query times looks like it might be exponential, it is actually roughly linear and is caused by our tile selection algorithm. When you request indexes on a dimension in Python, they translate to ranges in the core, so I will use the term ranges from now on. For the 10^4-range query, the tile selection time is very small and is shadowed by other work (like reading data off disk), but when the number of ranges becomes 100x larger, for the 10^6-range query, it becomes much more noticeable relative to the rest. I believe we will be able to make the tile selection algorithm's complexity logarithmic, but after following up with the team, it might take a few releases before it makes it up the priority list.

Looking at your code, I see that you are setting the tile extent and capacity correctly, but each fragment ends up with only one tile, which will always be selected by your test queries. If you had only one fragment with one very large tile, it would reduce the time spent in the tile selection algorithm. I'm hesitant to recommend increasing your tile capacity to 10^7 and consolidating your array, because I think there is something better we can do if we understand the scenario you want to support. Is the submitted code really representative of the real-world scenario you are trying to use our product for? Could you shed more light on what that scenario is?

Side note: another thing that can help improve your query time slightly is to specify a larger initial buffer size if you already know the size of your results. It is set on the TileDB context that you pass when you open the array. For example:

ctx = tiledb.Ctx({
  "py.init_buffer_bytes": 10_073_741_824,
})

with tiledb.open('~/array_sparse_1', ctx=ctx) as A:
    r = A.multi_index[discrete_dataset[3], discrete_dataset[3]]  # e.g. the 10^6-point query from your script

Finally, we are thinking of developing an API that will enable you to specify a list of points rather than the cross product of ranges for a query, but again it hasn’t made it up the priority list to make it into a release yet. My colleague @ihnorton might have more insight into when that might land.

Hope this helps and please let us know how we can assist you further.

Best,
Luc

Hi KiterLuc,
Thanks for the response.

Here are the answers to your queries.

The real datasets mostly represent time-series data.
Most of them have two dimensions: a date and an integer (let's call it number). There could be other dimensions too.
There could be a single attribute or multiple attributes, which could be integers, floats, strings, etc.

The query pattern could be:

  • The user could query 1 date and all possible numbers' worth of data.
  • The user could query a number and all possible dates' worth of data.
  • The user could query a month of data.
  • The user could query a year, or multiple years worth of data.
  • The user could query a particular month end data for all years.
  • The user could query a particular date data over all years.
  • etc.

We want the data-read queries to be fast in these cases. Since we may query data for different versions, we want the time-travel read queries to be fast as well.
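For reference, a rough sketch of what a simplified version of such an array could look like (the array name, domains, dtypes and the encoding of dates as integer day numbers are all made up for illustration):

import numpy as np
import tiledb

date_dim = tiledb.Dim(name="date", domain=(1, 100_000), tile=31, dtype=np.int32)
num_dim = tiledb.Dim(name="number", domain=(1, 1_000_000), tile=10_000, dtype=np.int32)
schema = tiledb.ArraySchema(
    domain=tiledb.Domain(date_dim, num_dim),
    sparse=True,
    attrs=[tiledb.Attr(name="value", dtype=np.float64)],
)
tiledb.Array.create('timeseries_example', schema)

# Some of the query patterns above (multi_index bounds are inclusive):
with tiledb.open('timeseries_example') as A:
    one_date_all_numbers = A.multi_index[42, :]    # 1 date, all numbers
    one_number_all_dates = A.multi_index[:, 7]     # 1 number, all dates
    one_month = A.multi_index[101:131, :]          # a contiguous date range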

I guess the time-traveling feature survives vacuuming only for sparse arrays, and for dense arrays the time-travel information is lost.

Am I right?

Thank you for the details, Ak_J. It seems like the code you submitted and the query patterns you describe have a slight disconnect. In the code, you end up asking for 10^6 data points per dimension, which translates to 1 million ranges per dimension. This is what is taking most of the extra time in your query.

In the patterns you sent, it doesn’t look like you would ask for so many ranges. Let's see them one by one:

  • The user could query 1 date and all possible numbers' worth of data.
    This will query one range on the date dimension and no ranges on the number dimension.
    r = A.multi_index[date,]

  • The user could query a number and all possible dates' worth of data.
    This will query one range on the number dimension and no range on the date dimension.
    r = A.multi_index[:, number]

  • The user could query a month of data.
    This would be one range on the date dimension and no range for the number dimension.
    r = A.multi_index[(start_date,end_date),]

  • The user could query a year, or multiple years worth of data.
    Same as the previous one.

  • The user could query a particular month end data for all years.
    This would be one range for the month end in question times the number of years.
    r = A.multi_index[(start_date_year_1,end_date_year_1), … (start_date_year_N,end_date_year_N),]

  • The user could query a particular date data over all years.
    This would again be one range times the number of years.
    r = A.multi_index[[date_year_1, … date_year_N],]

So again, depending on the number of years, you wouldn’t end up with that many ranges. Is there something I am misunderstanding in your scenarios, or are there missing ones that might involve more ranges?

On your second question on dense arrays, you are correct. Dense arrays do not support consolidation with timestamps.

Thanks,
Luc

Hi Luc,
Thanks for your responses.

I would like to mention that we also have a scenario where we would like to select only some numbers for all dates, or only some numbers for some dates.

For example, we have 10^5 dates from 1 to 10^5 and 10^6 numbers from 1 to 10^6.

I may query only the numbers [1, 100, 200, 500, . . .] for all 10^5 dates, which is the case you answered above.

Or we may query the above numbers only for dates like [1, 1000, 1200, 1300, . . . ].

In these cases, we will have ranges across multiple dimensions.

There may also be cases such as (see the sketch below):
For number = 1, I need data for 3 dates (100, 200, 300).
For number = 2, I need data for 2 dates (100, 200) (the dates may also be (400, 500)).
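As a rough sketch of what we do today for such per-number queries (reusing the hypothetical timeseries_example array with (date, number) dimensions from my earlier message; one multi_index call per number, to avoid the cross product pulling in unwanted cells):

import tiledb

wanted = {1: [100, 200, 300], 2: [100, 200]}

with tiledb.open('timeseries_example') as A:
    results = {}
    for number, dates in wanted.items():
        # each number has its own date list, so a single cross-product
        # query over all dates and all numbers would return extra cells
        results[number] = A.multi_index[dates, number]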

My test case was meant to reflect these query patterns, since I made the queries completely discrete.
We wish to optimize the reads in such cases too.

So, are you suggesting that we split the queries into distinct contiguous ranges, as you did above?

Thanks

Also, could you tell us when you are planning to support time traveling after vacuuming is done?

Just asking because continuous updates and insertions of new entries lead to a huge file size :sweat_smile:, which is not desirable in many cases.

Thanks

Thanks for the clarifications Ak_J. Asking for ranges across multiple dimensions is not going to be a problem. The problem I saw when running your case is really when you ask for a very large number of ranges on any single dimension. To answer your question: if you know you want a full month of data, it is better to request it as one contiguous range rather than requesting every day individually, since a one-month range counts as a single range whereas 30 individual days count as 30 ranges. But if the dates you want are not next to each other, you probably don't want the extra data in your results. Queries with ranges across multiple dimensions should be fine today as long as you request about 10^5 ranges per dimension or fewer. From what you said in your example, since you have 10^5 possible dates, you won't reach the limit on that dimension. For the numbers, you said “I may also query only numbers [1, 100, 200, 500, . . .]”, which suggests you wouldn't query more than 10^5 individual ranges there either, as there are only 10^6 possible numbers and the example you gave has large gaps between values. So I think you should be fine there too. Am I wrong?
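To make the range counting concrete, a quick sketch against a hypothetical array with (date, number) dimensions like the one you described:

import tiledb

with tiledb.open('timeseries_example') as A:
    # One contiguous range on the date dimension: counts as a single range,
    # no matter how many days it covers (multi_index slice bounds are inclusive).
    month = A.multi_index[101:131, :]

    # A list of non-adjacent dates: each point becomes its own range,
    # so this counts as 4 ranges on the date dimension.
    scattered = A.multi_index[[1, 1000, 1200, 1300], :]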

Again, we will be able to fix the performance issue for queries asking for more than 10^5 independent ranges per dimension in the next couple of releases, but I have to follow up with the team for an exact release. To help us prioritize, it would be helpful to confirm whether your case will need more than 10^5 ranges on any dimension, so please let us know.

About your other question on supporting time travelling for dense arrays after consolidation: because of the way we store dense data, it was determined that storing all of the time travelling data in a single fragment wouldn’t lead to any performance or size gains, so we didn’t implement it. Could you expand a little more on the dense-array scenario that leads to a huge file size for you? If we need to keep all of the time travelling data for a dense array, I'm unfortunately not sure the size issue can be avoided.

Thanks,
Luc

Thanks for your clarifications, Luc. They have helped us understand the query patterns better.

We have another question. For sparse arrays, the dictionary returned by the multi_index operation contains both the queried dimension arrays and the corresponding attribute array.

But the dimension arrays come back unsorted.
For example, if I create a sparse array and write data in the following way:

# Create the two dimensions
d1 = tiledb.Dim(name="d1", domain=(0, 3), tile=2, dtype=np.int32)
d2 = tiledb.Dim(name="d2", domain=(0, 3), tile=2, dtype=np.int32)

# Create a domain using the two dimensions
dom1 = tiledb.Domain(d1, d2)
# Create an attribute
a = tiledb.Attr(name="a", dtype=np.int32)

# Create the array schema, setting `sparse=True` to indicate a sparse array
schema1 = tiledb.ArraySchema(domain=dom1, sparse=True, attrs=[a], tile_order = 'row-major', cell_order='row-major')

# Create the array on disk (it will initially be empty)
tiledb.Array.create('array_sparse_2', schema1)

# Prepare some data in numpy arrays
d1_data = np.array([2, 0, 3, 1, 0, 2], dtype=np.int32)
d2_data = np.array([0, 1, 1, 2, 3, 3], dtype=np.int32)
a_data = np.array([7, 1, 6, 8, 10, 20], dtype=np.int32)

# Open the array in write mode and write the data in COO format
with tiledb.open('array_sparse_2', 'w') as A:
    A[d1_data, d2_data] = a_data

Now, if I query the data from this sparse array as

# Open the array in read mode

A = tiledb.open('array_sparse_2', 'r')
print(A.multi_index[([0,1,2,3],[0,1,3])])

The output is
OrderedDict([('d1', array([0, 0, 2, 3, 2], dtype=int32)), ('d2', array([1, 3, 0, 1, 3], dtype=int32)), ('a', array([ 1, 10, 7, 6, 20], dtype=int32))])

which is unsorted with respect to the dimensions.

We would like the output to be:
OrderedDict([('d1', array([0, 0, 2, 2, 3], dtype=int32)), ('d2', array([1, 3, 0, 3, 1], dtype=int32)), ('a', array([ 1, 10, 7, 20, 6], dtype=int32))])

Is there an API supporting this that we might have missed?

Thanks

Hi @Ak_J,

You can use the following to set the query layout – by default it is unordered for sparse arrays, but row- and column-major are available in addition to global order:

A = tiledb.open('array_sparse_2', 'r')
print(A.query(order="C").multi_index[([0,1,2,3],[0,1,3])])

Results in:

py310 ❯ python repro.py
OrderedDict([('d1', array([0, 0, 2, 2, 3], dtype=int32)), ('d2', array([1, 3, 0, 3, 1], dtype=int32)), ('a', array([ 1, 10,  7, 20,  6], dtype=int32))])

Best,
Isaiah