Duplicate reads when reading from array

I am writing to and reading from a sparse array with the following schema:

    dom = tiledb.Domain(
        tiledb.Dim(name="variantkey", domain=(0, 2 ** 64 - 2), tile=2, dtype=np.uint64),
        tiledb.Dim(name="sample_idx", domain=(0, 9000), tile=2, dtype=np.uint64))
    schemaGT = tiledb.ArraySchema(domain=dom, sparse=True,
                                  attrs=(tiledb.Attr(name="GT", dtype=np.int8),))

I am writing into said array:

    with tiledb.SparseArray(path, mode='w') as A:
        # The dict key must be the attribute name from the schema ("GT")
        A[variantkey, sample] = {"GT": input}

with

    variantkey = [5, 6, 7, 8, 5, 6, 7, 8, 5, 6, 7, 8]
    sample = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
    input = [2, 2, 1, 5, 10, 1, 2, 6, 2, 1, 3, 7]

Now I am trying to get the input for sample 1:

    variantkey = [5, 6, 7, 8]
    sample = [1]
    with tiledb.SparseArray(path, mode='r') as A:
        output = A.multi_index[list(variantkey), list(sample)]["GT"]

The output I get looks like this:

    [2. 2. 2. 2. 1. 1. 1. 5. 5. 5.]

when it should look like this:

    [2. 2. 1. 5.]

Why do I suddenly have duplicates of my initial input?

@voss This might be related to a bug we fixed in TileDB 2.0.3. What version of TileDB or TileDB-Py are you using?

If you are not using TileDB 2.0.3 or TileDB-Py 0.6.2, can you upgrade? If you are using pip you just need to update the tiledb package. If you are using conda please update both tiledb and tiledb-py.

@seth I upgraded to 0.6.2 now, but I still have the same problem.

With the quick fix

    output = pd.DataFrame(output, columns=output.keys()).drop_duplicates()

it works without any problems, but I would still like to know the reason behind the duplicates.
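For anyone else using this kind of workaround, here is a minimal sketch of deduplicating on the cell coordinates instead of on the attribute values alone. It assumes the query result is a dict of equal-length columns keyed by "variantkey", "sample_idx" and "GT". The helper name drop_duplicate_cells and the sample data are hypothetical, not actual TileDB output.

```python
def drop_duplicate_cells(result, coord_keys=("variantkey", "sample_idx")):
    """Keep only the first row seen for each distinct coordinate tuple.

    `result` is assumed to be a dict of equal-length sequences (hypothetical
    stand-in for a query result), with one entry per dimension and attribute.
    """
    seen = set()
    keep = []
    n_rows = len(result[coord_keys[0]])
    for i in range(n_rows):
        coord = tuple(result[key][i] for key in coord_keys)
        if coord not in seen:
            seen.add(coord)
            keep.append(i)
    # Rebuild every column using only the rows that were kept
    return {name: [values[i] for i in keep] for name, values in result.items()}


# Made-up data with repeated cells, for illustration only
result = {
    "variantkey": [5, 5, 6, 6, 7, 8, 8],
    "sample_idx": [1, 1, 1, 1, 1, 1, 1],
    "GT": [2, 2, 2, 2, 1, 5, 5],
}

deduped = drop_duplicate_cells(result)
print(deduped["GT"])  # [2, 2, 1, 5]
```

Keying on the coordinates matters here: dropping duplicates from the GT values alone would also collapse the two legitimate 2s for variantkey 5 and 6 into one.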

Hi @voss,

I can reproduce the problem with 0.6.1, but cannot reproduce with 0.6.2. So, please double-check the version, and also please try the script below so that we are on the same page with a complete example.

Here is my test script:

    import tiledb, numpy as np
    import tempfile

    path = tempfile.mkdtemp()
    #path = "test_dups"

    dom = tiledb.Domain(
        tiledb.Dim(name="variantkey", domain=(0, 2 ** 64 - 2), tile=2, dtype=np.uint64),
        tiledb.Dim(name="sample_idx", domain=(0, 9000), tile=2, dtype=np.uint64))
    schemaGT = tiledb.ArraySchema(domain=dom, sparse=True,
                                  attrs=(tiledb.Attr(name="GT", dtype=np.int8),))

    tiledb.SparseArray.create(path, schemaGT)

    variantkey = [5, 6, 7, 8, 5, 6, 7, 8, 5, 6, 7, 8]
    sample = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
    input = [2, 2, 1, 5, 10, 1, 2, 6, 2, 1, 3, 7]

    with tiledb.SparseArray(path, mode='w') as A:
        A[variantkey, sample] = {'GT': input}

    with tiledb.SparseArray(path, mode='w') as A:
        A[variantkey, sample] = {'GT': input}

    variantkey = [5, 6, 7, 8]
    sample = [1]
    with tiledb.SparseArray(path, mode='r') as A:
        output = A.multi_index[list(variantkey), list(sample)]["GT"]

    print(output)

which prints [2 2 1 5] under 0.6.2. If you are using conda, make sure to run conda install tiledb==2.0.3 to upgrade the underlying library.
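Since the Python package and the underlying library are versioned separately, it can help to print both. The following is a rough sketch, assuming the Python distribution is registered under the name "tiledb" and that tiledb.libtiledb.version() is available in your install. The helpers installed_version and version_tuple are hypothetical names used only for this example.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Parse the leading numeric components, e.g. '0.6.2' -> (0, 6, 2)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


# Version of the Python package (assumed distribution name: "tiledb")
py_version = installed_version("tiledb")
print("TileDB-Py:", py_version)
if py_version is not None and version_tuple(py_version) < (0, 6, 2):
    print("TileDB-Py is older than 0.6.2")

# Version of the core library that the Python package is linked against
try:
    import tiledb
    core_version = tuple(tiledb.libtiledb.version())
    print("libtiledb:", core_version)
    if core_version < (2, 0, 3):
        print("libtiledb is older than 2.0.3")
except (ImportError, AttributeError):
    print("could not query the libtiledb version in this environment")
```

If TileDB-Py reports 0.6.2 but libtiledb is still below 2.0.3, the core library was not upgraded along with the Python package.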

Hope that helps!

Yes, you were right! When I installed tiledb 2.0.3 it worked.
Thank you very much for your help!