Hi,
I am exploring data storage in TileDB arrays using the Python API, and I am quite fascinated by the time travel feature.
I am using TileDB core version 2.15.2 and TileDB-Py version 0.21.3.
I store a 2-D sparse dataset with a single attribute in a sparse TileDB array.
The dataset has 10^7 data points.
I write the data in chunks of 10^5 data points, so 100 writes are needed, which gives me 100 fragments/versions. The write operation is quite fast.
I was inspecting read performance using "discrete datasets" of different sizes.
By discrete dataset, I mean a set of query points spread evenly over the whole dataset.
For example, a discrete dataset of size 10^3 has 10^3 points, with consecutive points spaced 10^4 apart.
In this way, I ensure that the query has to touch all 100 fragments/versions.
Similarly, I perform reads for discrete datasets of size 10^4, 10^5 and 10^6.
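To double-check that every fragment really is touched, the arithmetic can be sketched in plain Python. This is only an illustration: `points_per_fragment` is a hypothetical helper, not part of TileDB, and it assumes that fragment k holds the values k*10^5 + 1 through (k+1)*10^5, as in my write loop.

```python
from collections import Counter

DATA_SIZE = 10**7   # total number of data points
CHUNK = 10**5       # points written per fragment

def points_per_fragment(n_queries):
    """Count how many evenly spaced query points fall into each fragment."""
    step = DATA_SIZE // n_queries
    # Query points are 1, 1 + step, 1 + 2*step, ...
    points = range(1, DATA_SIZE + 1, step)
    # Value v was written in fragment (v - 1) // CHUNK (assumption, see above).
    return Counter((v - 1) // CHUNK for v in points)

for n in (10**3, 10**4, 10**5, 10**6):
    hits = points_per_fragment(n)
    # Columns: query size, fragments touched, min and max points per fragment.
    print(n, len(hits), min(hits.values()), max(hits.values()))
```

Even the smallest query size puts the same number of points into each of the 100 fragments, so no fragment can be skipped.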
My observations are:
1.) Read queries for datasets of size 10^3 and 10^4 take around 200-250 milliseconds.
2.) Read queries for a dataset of size 10^5 take around 400 milliseconds.
3.) But the time skyrockets for a dataset of size 10^6: it takes around 2.2 seconds (2200 milliseconds).
I am unable to figure out the reason for this big jump.
I tried consolidating the fragments, which reduces the time to nearly 1.5 seconds (1500 milliseconds).
But time travel is another important feature for me, so I can’t vacuum the fragments.
Also, we prefer not to consolidate, because it slows down time travel considerably when we query the same 10^6 dataset at different versions.
Changing the tile extent didn’t have much impact on the read time. Increasing the tile capacity made reads faster, but only slightly. Maybe I am not setting the capacity and extent correctly.
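For reference, this is how I understand the relation between capacity and the number of data tiles in a sparse fragment: each data tile holds up to `capacity` cells, so a fragment with N cells needs ceil(N / capacity) tiles. Below is a rough sketch of that arithmetic. The helper name `tiles_per_fragment` is mine, and the capacities other than 10^6 are just example values, not settings I am claiming to be good.

```python
import math

CELLS_PER_FRAGMENT = 10**5   # each write adds 10^5 cells
NUM_FRAGMENTS = 100

def tiles_per_fragment(capacity, cells=CELLS_PER_FRAGMENT):
    """Number of data tiles needed to hold `cells` cells at the given capacity."""
    return math.ceil(cells / capacity)

for capacity in (10**3, 10**4, 10**5, 10**6):
    per_frag = tiles_per_fragment(capacity)
    print(f"capacity={capacity:>8}: {per_frag:>4} tiles/fragment, "
          f"{per_frag * NUM_FRAGMENTS:>6} tiles in total")
```

With the capacity of 10^6 used in my schema, each fragment ends up as a single tile, which seems consistent with the Writer.tile_num counter of 100 in the statistics.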
The code I used is included below, along with the performance statistics for the 10^6 case.
It stores a 2-D sparse dataset where the coordinates are of the form (a, a) and the corresponding attribute has the value a, with a ranging from 1 to 10^7.
Please advise on how to reduce the read time, preferably to around 700-900 milliseconds.
import tiledb, time
import numpy as np

# Split the values 1..10^7 into 100 chunks of 10^5 values each.
version_list = []
for i in range(100):
    arr = np.arange(i*100000 + 1, (i+1)*100000 + 1)
    version_list.append(arr)

# Schema: two int32 dimensions and one int32 attribute.
# Integer division (//) keeps the domain bounds integral.
d1 = tiledb.Dim(name="d1", domain=(np.iinfo(np.int32).min//2, np.iinfo(np.int32).max//2), tile=1000000, dtype=np.int32)
d2 = tiledb.Dim(name="d2", domain=(np.iinfo(np.int32).min//2, np.iinfo(np.int32).max//2), tile=1000000, dtype=np.int32)
dom1 = tiledb.Domain(d1, d2)
a = tiledb.Attr(name="a", dtype=np.int32)
schema1 = tiledb.ArraySchema(domain=dom1, sparse=True, attrs=[a], capacity=1000000)
tiledb.Array.create('~/TileDB/sparse/array_sparse_1', schema1)

# Write each chunk separately, so every write creates a new fragment.
for i in range(100):
    d1_data = version_list[i]
    d2_data = version_list[i]
    a_data = version_list[i]
    with tiledb.open('~/TileDB/sparse/array_sparse_1', 'w') as A:
        A[d1_data, d2_data] = a_data

# Build the evenly spaced query points for each query size.
number_of_queries = [1000, 1e4, 1e5, 1e6]
discrete_dataset = []
data_size = 1e7
for n in number_of_queries:
    query_array = np.arange(1, data_size + 1, data_size/n)
    discrete_dataset.append(query_array)

# Time the 10^3, 10^4 and 10^5 queries.
time_list = []
for i in range(3):
    with tiledb.open('~/TileDB/sparse/array_sparse_1') as A:
        start = time.time()
        r = A.multi_index[discrete_dataset[i], discrete_dataset[i]]
        end = time.time()
        time_list.append(end - start)
print(time_list)

# Time the 10^6 query with statistics enabled.
with tiledb.open('~/TileDB/sparse/array_sparse_1') as A:
    tiledb.stats_enable()
    start = time.time()
    r = A.multi_index[discrete_dataset[3], discrete_dataset[3]]
    end = time.time()
    tiledb.stats_dump()
    tiledb.stats_disable()
print(end - start)
The timing outputs (in seconds) are:
time_list = [0.26061177253723145, 0.24928975105285645, 0.44175052642822266]
Time taken by the 10^6 query = 2.1791791915893555
The times for the queries of size 10^3, 10^4 and 10^5 are stored in time_list; a sudden jump is observed for the 10^6 query.
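To put the jump in numbers, here is a small calculation on the timings above (rounded; the variable names are only for illustration). It prints the time per query point and the growth factor between consecutive query sizes.

```python
# Measured wall-clock times in seconds, copied (rounded) from the output above.
timings = {
    10**3: 0.2606,
    10**4: 0.2493,
    10**5: 0.4418,
    10**6: 2.1792,
}

sizes = sorted(timings)
for n in sizes:
    per_point_us = timings[n] / n * 1e6
    print(f"n={n:>7}: {timings[n]:.3f} s total, {per_point_us:8.2f} us per point")

# Growth factor of the total time for each 10x increase in query size.
growth = {
    (small, large): timings[large] / timings[small]
    for small, large in zip(sizes, sizes[1:])
}
for (small, large), factor in growth.items():
    print(f"{small} -> {large}: time grows {factor:.2f}x")
```

The last step is roughly a 4.9x increase in time for 10x more points, so the cost per point is still going down, but the absolute time is well above my target.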
The performance statistics are:
TileDB Embedded Version: (2, 15, 2)
TileDB-Py Version: 0.21.3
[
{
"timers": {
"Context.StorageManager.write_store_frag_meta.sum": 4.01336,
"Context.StorageManager.write_store_frag_meta.avg": 0.0401336,
"Context.StorageManager.write_meta.sum": 7.9073e-05,
"Context.StorageManager.write_meta.avg": 7.9073e-07,
"Context.StorageManager.subSubarray.sort_ranges.sum": 0.0617387,
"Context.StorageManager.subSubarray.sort_ranges.avg": 0.0123477,
"Context.StorageManager.subSubarray.read_load_relevant_rtrees.sum": 0.105452,
"Context.StorageManager.subSubarray.read_load_relevant_rtrees.avg": 0.0210905,
"Context.StorageManager.subSubarray.read_compute_simple_tile_overlap.sum": 2.33885,
"Context.StorageManager.subSubarray.read_compute_simple_tile_overlap.avg": 0.467769,
"Context.StorageManager.subSubarray.compute_relevant_frags.sum": 0.0699658,
"Context.StorageManager.subSubarray.compute_relevant_frags.avg": 0.0139932,
"Context.StorageManager.sm_load_fragment_metadata.sum": 0.0636737,
"Context.StorageManager.sm_load_fragment_metadata.avg": 0.0127347,
"Context.StorageManager.sm_load_array_schemas_and_fragment_metadata.sum": 0.0740409,
"Context.StorageManager.sm_load_array_schemas_and_fragment_metadata.avg": 0.0148082,
"Context.StorageManager.sm_load_array_schema_from_uri.sum": 0.14677,
"Context.StorageManager.sm_load_array_schema_from_uri.avg": 0.00139781,
"Context.StorageManager.sm_load_all_array_schemas.sum": 0.147889,
"Context.StorageManager.sm_load_all_array_schemas.avg": 0.00140847,
"Context.StorageManager.array_open_write_load_schemas.sum": 0.141187,
"Context.StorageManager.array_open_write_load_schemas.avg": 0.00141187,
"Context.StorageManager.array_open_write_load_directory.sum": 0.0514619,
"Context.StorageManager.array_open_write_load_directory.avg": 0.000514619,
"Context.StorageManager.array_open_read_load_schemas_and_fragment_meta.sum": 0.0740776,
"Context.StorageManager.array_open_read_load_schemas_and_fragment_meta.avg": 0.0148155,
"Context.StorageManager.array_open_read_load_directory.sum": 0.174057,
"Context.StorageManager.array_open_read_load_directory.avg": 0.0348114,
"Context.StorageManager.array_open_WRITE.sum": 0.193658,
"Context.StorageManager.array_open_WRITE.avg": 0.00193658,
"Context.StorageManager.array_open_READ.sum": 0.248261,
"Context.StorageManager.array_open_READ.avg": 0.0496521,
"Context.StorageManager.VFS.ArrayDirectory.list_root_uris.sum": 0.00230594,
"Context.StorageManager.VFS.ArrayDirectory.list_root_uris.avg": 0.000461187,
"Context.StorageManager.VFS.ArrayDirectory.list_fragment_meta_uris.sum": 0.00219286,
"Context.StorageManager.VFS.ArrayDirectory.list_fragment_meta_uris.avg": 0.000438573,
"Context.StorageManager.VFS.ArrayDirectory.list_commit_uris.sum": 0.170299,
"Context.StorageManager.VFS.ArrayDirectory.list_commit_uris.avg": 0.0340598,
"Context.StorageManager.VFS.ArrayDirectory.list_array_schema_uris.sum": 0.0376046,
"Context.StorageManager.VFS.ArrayDirectory.list_array_schema_uris.avg": 0.000358139,
"Context.StorageManager.VFS.ArrayDirectory.list_array_meta_uris.sum": 0.00208563,
"Context.StorageManager.VFS.ArrayDirectory.list_array_meta_uris.avg": 0.000417126,
"Context.StorageManager.Query.Writer.write_tiles.sum": 1.53089,
"Context.StorageManager.Query.Writer.write_tiles.avg": 0.00510298,
"Context.StorageManager.Query.Writer.write_num_tiles.sum": 0.568311,
"Context.StorageManager.Query.Writer.write_num_tiles.avg": 0.00568311,
"Context.StorageManager.Query.Writer.split_coords_buff.sum": 6.9238e-05,
"Context.StorageManager.Query.Writer.split_coords_buff.avg": 6.9238e-07,
"Context.StorageManager.Query.Writer.sort_coords.sum": 1.04779,
"Context.StorageManager.Query.Writer.sort_coords.avg": 0.0104779,
"Context.StorageManager.Query.Writer.prepare_tiles.sum": 0.104241,
"Context.StorageManager.Query.Writer.prepare_tiles.avg": 0.00104241,
"Context.StorageManager.Query.Writer.finalize.sum": 6.617e-05,
"Context.StorageManager.Query.Writer.finalize.avg": 6.617e-07,
"Context.StorageManager.Query.Writer.filter_tiles.sum": 0.0779903,
"Context.StorageManager.Query.Writer.filter_tiles.avg": 0.000779903,
"Context.StorageManager.Query.Writer.filter_tile.sum": 0.185368,
"Context.StorageManager.Query.Writer.filter_tile.avg": 0.000617895,
"Context.StorageManager.Query.Writer.dowork.sum": 6.27254,
"Context.StorageManager.Query.Writer.dowork.avg": 0.0627254,
"Context.StorageManager.Query.Writer.compute_coord_meta.sum": 0.0555017,
"Context.StorageManager.Query.Writer.compute_coord_meta.avg": 0.000555017,
"Context.StorageManager.Query.Writer.check_coord_oob.sum": 0.130682,
"Context.StorageManager.Query.Writer.check_coord_oob.avg": 0.00130682,
"Context.StorageManager.Query.Writer.check_coord_dups.sum": 0.0571989,
"Context.StorageManager.Query.Writer.check_coord_dups.avg": 0.000571989,
"Context.StorageManager.Query.Reader.unfilter_coord_tiles.sum": 0.126041,
"Context.StorageManager.Query.Reader.unfilter_coord_tiles.avg": 0.00504165,
"Context.StorageManager.Query.Reader.unfilter_attr_tiles.sum": 0.0137227,
"Context.StorageManager.Query.Reader.unfilter_attr_tiles.avg": 0.00274454,
"Context.StorageManager.Query.Reader.tile_offset_sizes.sum": 0.00284501,
"Context.StorageManager.Query.Reader.tile_offset_sizes.avg": 0.000569002,
"Context.StorageManager.Query.Reader.read_tiles.sum": 0.262638,
"Context.StorageManager.Query.Reader.read_tiles.avg": 0.0131319,
"Context.StorageManager.Query.Reader.read_coordinate_tiles.sum": 0.216984,
"Context.StorageManager.Query.Reader.read_coordinate_tiles.avg": 0.0216984,
"Context.StorageManager.Query.Reader.read_attribute_tiles.sum": 0.0457224,
"Context.StorageManager.Query.Reader.read_attribute_tiles.avg": 0.00457224,
"Context.StorageManager.Query.Reader.read_and_unfilter_coords.sum": 0.343704,
"Context.StorageManager.Query.Reader.read_and_unfilter_coords.avg": 0.0687408,
"Context.StorageManager.Query.Reader.read_and_unfilter_attributes.sum": 0.0592662,
"Context.StorageManager.Query.Reader.read_and_unfilter_attributes.avg": 0.0118532,
"Context.StorageManager.Query.Reader.process_slabs.sum": 0.156089,
"Context.StorageManager.Query.Reader.process_slabs.avg": 0.0312179,
"Context.StorageManager.Query.Reader.merge_result_cell_slabs.sum": 0.802831,
"Context.StorageManager.Query.Reader.merge_result_cell_slabs.avg": 0.160566,
"Context.StorageManager.Query.Reader.load_tile_var_sizes.sum": 0.00271931,
"Context.StorageManager.Query.Reader.load_tile_var_sizes.avg": 0.000543862,
"Context.StorageManager.Query.Reader.load_tile_offsets.sum": 0.233687,
"Context.StorageManager.Query.Reader.load_tile_offsets.avg": 0.0155792,
"Context.StorageManager.Query.Reader.load_initial_data.sum": 2.34185,
"Context.StorageManager.Query.Reader.load_initial_data.avg": 0.46837,
"Context.StorageManager.Query.Reader.dowork.sum": 4.08559,
"Context.StorageManager.Query.Reader.dowork.avg": 0.817118,
"Context.StorageManager.Query.Reader.dedup_tiles_with_timestamps.sum": 0.00259952,
"Context.StorageManager.Query.Reader.dedup_tiles_with_timestamps.avg": 0.000519904,
"Context.StorageManager.Query.Reader.dedup_fragments_with_timestamps.sum": 0.00270662,
"Context.StorageManager.Query.Reader.dedup_fragments_with_timestamps.avg": 0.000541324,
"Context.StorageManager.Query.Reader.create_result_tiles.sum": 0.00273004,
"Context.StorageManager.Query.Reader.create_result_tiles.avg": 0.000546008,
"Context.StorageManager.Query.Reader.copy_fixed_data_tiles.sum": 0.0160737,
"Context.StorageManager.Query.Reader.copy_fixed_data_tiles.avg": 0.00107158,
"Context.StorageManager.Query.Reader.compute_tile_bitmaps.sum": 0.07404,
"Context.StorageManager.Query.Reader.compute_tile_bitmaps.avg": 0.014808,
"Context.StorageManager.Query.Reader.compute_results_count_sparse.sum": 1.68677,
"Context.StorageManager.Query.Reader.compute_results_count_sparse.avg": 0.00168677,
"Context.StorageManager.Query.Reader.compute_result_cell_slab.sum": 0.802887,
"Context.StorageManager.Query.Reader.compute_result_cell_slab.avg": 0.160577,
"Context.StorageManager.Query.Reader.apply_query_condition.sum": 0.0031681,
"Context.StorageManager.Query.Reader.apply_query_condition.avg": 0.000633621
},
"counters": {
"Context.StorageManager.write_tile_var_sizes_size": 39600,
"Context.StorageManager.write_tile_var_offsets_size": 39600,
"Context.StorageManager.write_tile_validity_offsets_size": 39600,
"Context.StorageManager.write_tile_offsets_size": 39600,
"Context.StorageManager.write_sums_size": 41700,
"Context.StorageManager.write_rtree_size": 11392,
"Context.StorageManager.write_processed_conditions_size": 9900,
"Context.StorageManager.write_null_counts_size": 52900,
"Context.StorageManager.write_mins_size": 40020,
"Context.StorageManager.write_maxs_size": 40010,
"Context.StorageManager.write_frag_meta_footer_size": 48600,
"Context.StorageManager.write_filtered_byte_num": 68995,
"Context.StorageManager.write_array_schema_size": 195,
"Context.StorageManager.read_unfiltered_byte_num": 60475,
"Context.StorageManager.read_tile_offsets_size": 24000,
"Context.StorageManager.read_rtree_size": 16000,
"Context.StorageManager.read_frag_meta_size": 247000,
"Context.StorageManager.read_array_schema_size": 20475,
"Context.StorageManager.VFS.write_ops_num": 7502,
"Context.StorageManager.VFS.write_byte_num": 120467891,
"Context.StorageManager.VFS.read_ops_num": 8815,
"Context.StorageManager.VFS.read_byte_num": 600790205,
"Context.StorageManager.VFS.ls_num": 125,
"Context.StorageManager.VFS.is_object_num": 107,
"Context.StorageManager.VFS.file_size_num": 500,
"Context.StorageManager.Query.Writer.write_filtered_byte_num": 120000000,
"Context.StorageManager.Query.Writer.tile_num": 100,
"Context.StorageManager.Query.Writer.dim_num": 200,
"Context.StorageManager.Query.Writer.dim_fixed_num": 200,
"Context.StorageManager.Query.Writer.cell_num": 10000000,
"Context.StorageManager.Query.Writer.attr_num": 100,
"Context.StorageManager.Query.Writer.attr_fixed_num": 100,
"Context.StorageManager.Query.Reader.result_num": 2111000,
"Context.StorageManager.Query.Reader.read_unfiltered_byte_num": 400000000,
"Context.StorageManager.Query.Reader.num_tiles_read": 1500,
"Context.StorageManager.Query.Reader.loop_num": 5,
"Context.StorageManager.Query.Reader.ignored_tiles": 0,
"Context.StorageManager.Query.Reader.dim_num": 10,
"Context.StorageManager.Query.Reader.dim_fixed_num": 10,
"Context.StorageManager.Query.Reader.cell_num": 50000000,
"Context.StorageManager.Query.Reader.attr_num": 5,
"Context.StorageManager.Query.Reader.attr_fixed_num": 5
}
}
]
==== Python Stats ====
py.core_read_query_initial_submit_time : 1.56537
py.core_read_query_total_time : 1.5656
py.getitem_time : 2.01062
py.getitem_time.add_ranges : 0.334357
py.query_retries_count : 0
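To make the dump easier to read, the largest read-side timers can be pulled out with a few lines of Python. The values below are copied by hand from the `.sum` entries above, with the `Context.StorageManager.` prefix dropped. Note that `loop_num` is 5, so I believe these sums are accumulated over several read queries rather than the 10^6 query alone.

```python
# Selected ".sum" timers (seconds) copied from the statistics dump above.
reader_timers = {
    "Query.Reader.dowork": 4.08559,
    "Query.Reader.load_initial_data": 2.34185,
    "subSubarray.read_compute_simple_tile_overlap": 2.33885,
    "Query.Reader.compute_results_count_sparse": 1.68677,
    "Query.Reader.merge_result_cell_slabs": 0.802831,
    "Query.Reader.read_and_unfilter_coords": 0.343704,
    "Query.Reader.read_and_unfilter_attributes": 0.0592662,
}

total = reader_timers["Query.Reader.dowork"]
ranked = sorted(reader_timers.items(), key=lambda kv: kv[1], reverse=True)

# Timers may be nested inside one another, so the shares need not add up to 100%.
for name, seconds in ranked:
    share = 100 * seconds / total
    print(f"{name:<50} {seconds:8.3f} s  {share:5.1f}% of dowork")
```

Ranked this way, load_initial_data and read_compute_simple_tile_overlap stand out well ahead of the actual tile reading and unfiltering.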
One more question: how can I query specific non-contiguous data points in a dense array? multi_index takes the cross product of the ranges given for each dimension, so the result contains extra data that I did not ask for.
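To illustrate what I mean by the cross product, here is a plain-Python sketch that does not use TileDB at all. The function names `cross_product_cells` and `pointwise_cells` are made up for this example: the first mimics what I get back today, the second is what I would like to get.

```python
from itertools import product

def cross_product_cells(rows, cols):
    """All (row, col) pairs, i.e. what a per-dimension selection returns."""
    return list(product(rows, cols))

def pointwise_cells(rows, cols):
    """Only the paired coordinates (rows[i], cols[i]) that were actually wanted."""
    return list(zip(rows, cols))

rows = [1, 5, 9]
cols = [2, 4, 8]

everything = cross_product_cells(rows, cols)
wanted = pointwise_cells(rows, cols)
wanted_set = set(wanted)
extra = [cell for cell in everything if cell not in wanted_set]

print(len(everything))  # 9 cells returned
print(len(wanted))      # 3 cells wanted
print(len(extra))       # 6 cells of extra data
```

For n requested points the selection grows to n^2 cells, which is the extra output I would like to avoid.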
Thank you for your time.