Reading array at previous timestamp after schema evolution

I’ve been trying something and may have found a bug (probably in the Python wrapper), but I’m very new to TileDB so I may be doing something wrong.

I’m trying to do the following:

  1. Create and write an array
  2. Add an attribute to the array schema
  3. Write new data to the array
  4. Read the ‘old’ version of the array

However, this seems to crash the library without any error message or stack trace. So my question is: is this something that should be supported, or is it simply an edge case that the Python wrapper should check for and disallow?

Here is my code:

import pandas as pd
import numpy as np
import tiledb as td

# Flag to add extra attribute or not
add_attr = True

index = list('xyz')
columns = list('abc')
df1 = pd.DataFrame(
    np.random.rand(len(index), len(columns)),
    index=index,
    columns=columns,
)

try:
    print('Creating initial array')
    td.from_pandas(
        'temp-array', df1,
        sparse=True, allows_duplicates=False,
        full_domain=True,
    )

    print('Check fragment info')
    fragments_info = td.array_fragments('temp-array')
    print(fragments_info)

    print('Reading array')
    with td.open('temp-array') as array:
        array_df = array.df[:]
        print(array_df)
        pd.testing.assert_frame_equal(array_df, df1)

    if add_attr:
        print('Add a column/attribute')
        columns += ['d']
        se = td.ArraySchemaEvolution(td.default_ctx())
        se.add_attribute(td.Attr('d', dtype=np.float64))
        se.array_evolve('temp-array')

    print('Rewrite array')
    df2 = pd.DataFrame(
        np.random.rand(len(index), len(columns)),
        index=index,
        columns=columns,
    )
    td.from_pandas(
        'temp-array', df2,
        mode='append',
    )

    print('Check fragment info')
    fragments_info = td.array_fragments('temp-array')
    print(fragments_info)

    # Get the array timestamps
    (t1, _), (t2, _) = fragments_info.timestamp_range

    print('Reading newly written array')
    with td.open('temp-array', timestamp=t2) as array:
        array_df = array.df[:]
        print(array_df)
        pd.testing.assert_frame_equal(
            array_df, df2,
            check_names=False,
        )

    print('Reading previously written array')
    with td.open('temp-array', timestamp=t1) as array:
        array_df = array.df[:]
        print(array_df)
        pd.testing.assert_frame_equal(
            array_df, df1,
            check_names=False,
        )
finally:
    print('Removing temp-array')
    vfs = td.VFS(ctx=td.default_ctx())
    vfs.remove_dir('temp-array')

If I run it with add_attr = False, everything works fine, but with add_attr = True I get the following output:

> python tiledb-test.py 
Creating initial array
[2022-07-10 18:00:01.851] [Process: 3232] [error] [Global] [TileDB::Array] Error: Cannot open array; Array does not exist
Check fragment info
{'array_schema_name': ('__1657468801855_1657468801855_2dd8fc6d327c43edbd7bce189e19f68c',),
 'array_uri': 'temp-array',
 'cell_num': (3,),
 'has_consolidated_metadata': (False,),
 'nonempty_domain': ((('x', 'z'),),),
 'sparse': (True,),
 'timestamp_range': ((1657468801894, 1657468801894),),
 'to_vacuum': (),
 'unconsolidated_metadata_num': 1,
 'uri': ('[...]/temp-array/__fragments/__1657468801894_1657468801894_cb3b1a70af694bf0bc6ed35dcd56822c_14',),
 'version': (14,)}
Reading array
          a         b         c
x  0.211619  0.604100  0.395342
y  0.031285  0.098245  0.432116
z  0.591643  0.436595  0.026439
Add a column/attribute
Rewrite array
Check fragment info
{'array_schema_name': ('__1657468801855_1657468801855_2dd8fc6d327c43edbd7bce189e19f68c',
                       '__1657468801989_1657468801989_6671225cdce9450bbe34fc78d38f4c06'),
 'array_uri': 'temp-array',
 'cell_num': (3, 3),
 'has_consolidated_metadata': (False, False),
 'nonempty_domain': ((('x', 'z'),), (('x', 'z'),)),
 'sparse': (True, True),
 'timestamp_range': ((1657468801894, 1657468801894),
                     (1657468802006, 1657468802006)),
 'to_vacuum': (),
 'unconsolidated_metadata_num': 2,
 'uri': ('[...]/temp-array/__fragments/__1657468801894_1657468801894_cb3b1a70af694bf0bc6ed35dcd56822c_14',
         '[...]/temp-array/__fragments/__1657468802006_1657468802006_0908ddb858484032b7310a4dbcad101b_14'),
 'version': (14, 14)}
Reading newly written array
          a         b         c         d
x  0.694410  0.344363  0.308965  0.002191
y  0.290289  0.367782  0.621800  0.914875
z  0.065450  0.602449  0.579679  0.882256
Reading previously written array

As you can see, it crashes without any message when trying to read at the old timestamp.
(I'm also not sure why it logs a 'Cannot open array' error when creating the array, even though everything works out in the end.)

Ah, I have an update already: if I read only the old attributes with array.query(attrs=list('abc')).df[:], it doesn't crash. So is this a bug, or should I implement a workaround manually? If so, is there a way to get the array schema at a certain timestamp, or should I keep track of the changes myself in the array metadata?
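
For reference, here is that workaround as a minimal sketch (it continues from the script above, so t1 and df1 are already defined); the pre-evolution attribute list is hard-coded, which is exactly what I'd like to avoid:

old_attrs = list('abc')  # attributes that existed before the schema evolution

with td.open('temp-array', timestamp=t1) as array:
    # Restricting the query to the pre-evolution attributes avoids the crash
    array_df = array.query(attrs=old_attrs).df[:]
    print(array_df)
    pd.testing.assert_frame_equal(
        array_df, df1,
        check_names=False,
    )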

Hi @nardi, thanks for posting. The code looks fine to me, and in no case should the library be crashing. We’ll take a look and either debug or suggest an update – reading at the prior timestamp as you’ve done should work correctly.