Hello, @Seth!
Thank you for your answer, I will take your advice into account! Here are the answers to the questions you asked.
1. My version of tiledb is 0.6.0.
2. In the directory of the array there are 4276 subdirectories with names like __1600892167166_1600892167166_6932d4c4f53e4f0e87659124549e145c_5. I believe these are fragments, and consolidation doesn't make this number any smaller. The write pattern is the following:
```python
for space_numbers in tiles_space:
    for unit_numbers in tiles_units:
        arr[space_numbers[0]:space_numbers[-1] + 1, unit_numbers[0]:unit_numbers[-1] + 1] = {
            'est_counts': expressions['est_counts'][unit_numbers].loc[space_numbers].values,
            'TPM': expressions['TPM'][unit_numbers].loc[space_numbers].values}
```
'est_counts' and 'TPM' are the attributes that I store. In this way the write pattern is similar to the read pattern: the data is split into chunks with consecutive indices, and these chunks are uploaded separately.
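If I understand fragments correctly, each slice assignment in the loop above ends up as a separate fragment, so the number of fragment directories should roughly match the number of writes (this is my assumption, I haven't verified it in the internals):

```python
# Assumption: every `arr[...] = {...}` assignment on the dense array
# produces one new fragment directory.
expected_fragments = len(tiles_space) * len(tiles_units)
print(expected_fragments)  # should be close to the 4276 directories I see
```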
3. That's right, I hope to end up with a single fragment after consolidation and I don't want to keep the old fragments.
4. No, I don't run the vacuum stage.
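If the old fragment directories are only removed by the vacuum step, I guess the call would look roughly like this (a minimal sketch; the URI is a placeholder for one of the array paths from db_dict below, and I'm assuming tiledb.vacuum is available in my TileDB-Py version):

```python
import tiledb as td

uri = '{expression_storage_path}/runs_transcripts'  # placeholder path

td.consolidate(uri)  # merge fragments into a consolidated one
td.vacuum(uri)       # delete the now-redundant old fragment directories
```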
5. The data is stored on a local filesystem (in a directory mounted from a separate machine, to be exact).
6. The array schema is created by the following function:

```python
import itertools

import numpy as np
import tiledb as td

TILE_SIZE = db_dict['measures']['horizontal_tile_size']


def create_tiledb(name, unit_name, max_units, column_tile_size):
    if td.object_type(name) != 'array':
        max_runs_samples = np.iinfo(np.uint64).max - TILE_SIZE
        dom = td.Domain(td.Dim(name=unit_name, domain=(1, max_units),
                               tile=column_tile_size, dtype=np.uint64),
                        td.Dim(name='Object', domain=(1, max_runs_samples),
                               tile=TILE_SIZE, dtype=np.uint64))
        schema = td.ArraySchema(domain=dom, cell_order='col-major',
                                tile_order='col-major', sparse=False,
                                attrs=[td.Attr(name='est_counts', dtype=np.float64),
                                       td.Attr(name='TPM', dtype=np.float64)])
        td.DenseArray.create(name, schema)


class ExpressionBase:
    def __init__(self):
        """
        Creates the databases if they are not created yet.
        """
        db_names = db_dict['database_names']
        measures = db_dict['measures']
        self.space_name = lambda s: 'Gene' if s == 'genes' else 'ENST'
        for units, space, subset in itertools.product(('runs', 'samples'),
                                                      ('transcripts', 'genes'),
                                                      ('coding', 'all')):
            try:
                db_name = db_names[units][space][subset]
            except KeyError:
                pass
            else:
                create_tiledb(db_name, self.space_name(space),
                              measures['max_data'][space],
                              measures['vertical_tile_sizes'][space])
```
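For completeness, this is roughly how a single array would be created with that function and then sanity-checked (the path and sizes here are made-up placeholders; the real values come from db_dict below):

```python
# Illustrative values only; the real ones come from db_dict.
create_tiledb('/tmp/example_runs_transcripts', 'ENST',
              max_units=10000000, column_tile_size=7000)

with td.DenseArray('/tmp/example_runs_transcripts') as arr:
    print(arr.schema)  # quick check of dims, tiles, and attributes
```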
The corresponding db_dict is here:
```python
{
    'database_names': {
        'runs': {
            'transcripts': {
                'all': '{expression_storage_path}/runs_transcripts'
            },
            'genes': {
                'coding': '{expression_storage_path}/runs_genes_coding',
                'all': '{expression_storage_path}/runs_genes_all'
            }
        },
        'samples': {
            'transcripts': {
                'coding': '{expression_storage_path}/samples_transcripts_coding'
            },
            'genes': {
                'coding': '{expression_storage_path}/samples_genes_coding'
            }
        }
    },
    'measures': {
        'max_data': {
            'genes': 1000000,
            'transcripts': 10000000
        },
        'horizontal_tile_size': 5,
        'vertical_tile_sizes': {
            'genes': 1500,
            'transcripts': 7000
        }
    },
    'consolidation': {
        'sm.consolidation.amplification': 1000,
        'sm.consolidation.step_size_ratio': 0.000001,
        'sm.consolidation.step_min_frags': 2,
        'sm.consolidation.buffer_size': 1000000000
    }
}
```
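The 'consolidation' section is what I pass when consolidating, roughly like this (a sketch of the call, not my exact code; I convert the values to strings since I'm not sure tiledb.Config accepts numbers directly, and the URI is again a placeholder):

```python
import tiledb as td

cfg = td.Config({k: str(v) for k, v in db_dict['consolidation'].items()})
td.consolidate('{expression_storage_path}/runs_transcripts', config=cfg)
```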
7. Here are the stats for the query shown in item 8:
```
==== READ ====

- Number of read queries: 1
- Number of attempts until results are found: 1

- Number of attributes read: 2
  - Number of fixed-sized attributes read: 2
- Number of dimensions read: 2
  - Number of fixed-sized dimensions read: 2

- Number of logical tiles overlapping the query: 174
- Number of physical tiles read: 696
  - Number of physical fixed-sized tiles read: 696
- Number of cells read: 6090000
- Number of result cells: 4729056
- Percentage of useful cells read: 77.6528%

- Number of bytes read: 97841399 bytes (0.0911219 GB)
- Number of read operations: 1670
- Number of bytes unfiltered: 97442996 bytes (0.0907509 GB)
- Unfiltering inflation factor: 0.995928x

- Time to compute estimated result size: 0.0312368 secs
  - Time to compute tile overlap: 0.0304857 secs
    - Time to compute relevant fragments: 0.00147552 secs
    - Time to load relevant fragment R-trees: 0.0286785 secs
    - Time to compute relevant fragment tile overlap: 0.000135883 secs

- Time to open array: 19.5175 secs
  - Time to load array schema: 0.0428444 secs
  - Time to load consolidated fragment metadata: 1.458e-06 secs
  - Time to load fragment metadata: 18.3245 secs

- Total metadata read: 378766 bytes (0.000352753 GB)
  - Array schema: 175 bytes (1.62981e-07 GB)
  - Fragment metadata: 375770 bytes (0.000349963 GB)
  - R-tree: 21 bytes (1.95578e-08 GB)
  - Fixed-sized tile offsets: 2800 bytes (2.6077e-06 GB)

- Time to initialize the read state: 0.000223201 secs

- Read time: 3.14742 secs
  - Time to compute next partition: 0.000801199 secs
  - Time to compute tile coordinates: 9.7209e-05 secs
  - Time to compute result coordinates: 6.3288e-05 secs
    - Time to compute sparse result tiles: 6.1234e-05 secs
  - Time to compute dense result cell slabs: 1.38742 secs
  - Time to copy result attribute values: 0.801412 secs
    - Time to read attribute tiles: 0.164169 secs
    - Time to unfilter attribute tiles: 0.355877 secs
    - Time to copy fixed-sized attribute values: 0.280759 secs
  - Time to fill dense coordinates: 0.951867 secs

- Total read query time (array open + init state + read): 3.14764 secs
```
8. The read query is the following:

```python
with td.DenseArray(database_name, mode='r') as arr:
    datas = [[arr.query(attrs=attrs_to_get, coords=True)[sp[0]:sp[-1] + 1, s[0]:s[-1] + 1]
              for s in slices]
             for sp in space_slices]
```
```
slices: [[66775, 66776, 66777, 66778, 66779, 66780, 66781, 66782, 66783, 66784, 66785, 66786,
          66787, 66788, 66789, 66790, 66791, 66792, 66793, 66794, 66795, 66796, 66797, 66798]]
space_slices: [[1, 2, 3, …, 197044]]
```
This is not even a case where we read many separate ranges of consecutive indices, but the performance is rather slow even here. In most cases the time can be much longer.
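In case it helps to reproduce the numbers, stats like the ones in item 7 can be captured around the query like this (a sketch, assuming the tiledb.stats_enable / tiledb.stats_dump helpers):

```python
import tiledb as td

td.stats_enable()
with td.DenseArray(database_name, mode='r') as arr:
    datas = [[arr.query(attrs=attrs_to_get, coords=True)[sp[0]:sp[-1] + 1, s[0]:s[-1] + 1]
              for s in slices]
             for sp in space_slices]
td.stats_dump()
td.stats_disable()
```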