Reading array at previous timestamp after schema evolution

I’ve been trying something and may have found a bug (probably in the Python wrapper), but I’m very new to TileDB so I may be doing something wrong.

I’m trying to do the following:

  1. Create and write an array
  2. Add an attribute to the array schema
  3. Write new data to the array
  4. Read the ‘old’ version of the array

However, this seems to crash the library without any error message or stack trace. So my question is: is this something that should be supported, or is it simply an edge case that the Python wrapper should check for and disallow?

Here is my code:

import pandas as pd
import numpy as np
import tiledb as td

# Flag to add extra attribute or not
add_attr = True

index = list('xyz')
columns = list('abc')
df1 = pd.DataFrame(
    np.random.rand(len(index), len(columns)),
    index=index,
    columns=columns,
)

try:
    print('Creating initial array')
    td.from_pandas(
        'temp-array', df1,
        sparse=True, allows_duplicates=False,
        full_domain=True,
    )

    print('Check fragment info')
    fragments_info = td.array_fragments('temp-array')
    print(fragments_info)

    print('Reading array')
    with td.open('temp-array') as array:
        array_df = array.df[:]
        print(array_df)
        pd.testing.assert_frame_equal(array_df, df1)

    if add_attr:
        print('Add a column/attribute')
        columns += ['d']
        se = td.ArraySchemaEvolution(td.default_ctx())
        se.add_attribute(td.Attr('d', dtype=np.float64))
        se.array_evolve('temp-array')

    print('Rewrite array')
    df2 = pd.DataFrame(
        np.random.rand(len(index), len(columns)),
        index=index,
        columns=columns,
    )
    td.from_pandas(
        'temp-array', df2,
        mode='append',
    )

    print('Check fragment info')
    fragments_info = td.array_fragments('temp-array')
    print(fragments_info)

    # Get the array timestamps
    (t1, _), (t2, _) = fragments_info.timestamp_range

    print('Reading newly written array')
    with td.open('temp-array', timestamp=t2) as array:
        array_df = array.df[:]
        print(array_df)
        pd.testing.assert_frame_equal(
            array_df, df2,
            check_names=False,
        )

    print('Reading previously written array')
    with td.open('temp-array', timestamp=t1) as array:
        array_df = array.df[:]
        print(array_df)
        pd.testing.assert_frame_equal(
            array_df, df1,
            check_names=False,
        )
finally:
    print('Removing temp-array')
    vfs = td.VFS(ctx=td.default_ctx())
    vfs.remove_dir('temp-array')

If I run it with add_attr = False, everything works fine, but with add_attr = True I get the following output:

> python tiledb-test.py 
Creating initial array
[2022-07-10 18:00:01.851] [Process: 3232] [error] [Global] [TileDB::Array] Error: Cannot open array; Array does not exist
Check fragment info
{'array_schema_name': ('__1657468801855_1657468801855_2dd8fc6d327c43edbd7bce189e19f68c',),
 'array_uri': 'temp-array',
 'cell_num': (3,),
 'has_consolidated_metadata': (False,),
 'nonempty_domain': ((('x', 'z'),),),
 'sparse': (True,),
 'timestamp_range': ((1657468801894, 1657468801894),),
 'to_vacuum': (),
 'unconsolidated_metadata_num': 1,
 'uri': ('[...]/temp-array/__fragments/__1657468801894_1657468801894_cb3b1a70af694bf0bc6ed35dcd56822c_14',),
 'version': (14,)}
Reading array
          a         b         c
x  0.211619  0.604100  0.395342
y  0.031285  0.098245  0.432116
z  0.591643  0.436595  0.026439
Add a column/attribute
Rewrite array
Check fragment info
{'array_schema_name': ('__1657468801855_1657468801855_2dd8fc6d327c43edbd7bce189e19f68c',
                       '__1657468801989_1657468801989_6671225cdce9450bbe34fc78d38f4c06'),
 'array_uri': 'temp-array',
 'cell_num': (3, 3),
 'has_consolidated_metadata': (False, False),
 'nonempty_domain': ((('x', 'z'),), (('x', 'z'),)),
 'sparse': (True, True),
 'timestamp_range': ((1657468801894, 1657468801894),
                     (1657468802006, 1657468802006)),
 'to_vacuum': (),
 'unconsolidated_metadata_num': 2,
 'uri': ('[...]/temp-array/__fragments/__1657468801894_1657468801894_cb3b1a70af694bf0bc6ed35dcd56822c_14',
         '[...]/temp-array/__fragments/__1657468802006_1657468802006_0908ddb858484032b7310a4dbcad101b_14'),
 'version': (14, 14)}
Reading newly written array
          a         b         c         d
x  0.694410  0.344363  0.308965  0.002191
y  0.290289  0.367782  0.621800  0.914875
z  0.065450  0.602449  0.579679  0.882256
Reading previously written array

As you can see, it crashes without any message when trying to read at the old timestamp.
(I'm also not sure why it logs a 'Cannot open array' error when creating the array, even though everything works out in the end.)

Ah, I have an update already: if I read only the old attributes with array.query(attrs=list('abc')).df[:], it doesn't crash. So is this a bug, or should I implement a workaround manually? If so, is there a way to get the array schema at a certain timestamp, or should I keep track of the changes myself in the array metadata?
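
For reference, here is that workaround as a minimal sketch (it continues from the script above, so t1 and df1 are already defined); the pre-evolution attribute list is hard-coded, which is exactly what I'd like to avoid:

old_attrs = list('abc')  # attributes that existed before the schema evolution

with td.open('temp-array', timestamp=t1) as array:
    # Restricting the query to the pre-evolution attributes avoids the crash
    array_df = array.query(attrs=old_attrs).df[:]
    print(array_df)
    pd.testing.assert_frame_equal(
        array_df, df1,
        check_names=False,
    )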

Hi @nardi, thanks for posting. The code looks fine to me, and in no case should the library be crashing. We’ll take a look and either debug or suggest an update – reading at the prior timestamp as you’ve done should work correctly.