Slicing of arrays in Python

Hello. Nick from CZI here, getting going with TileDB in Python for the first time.

I was able to pull your quickstart_dense array down but ran into something unexpected with the following:

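import tiledb

# `config` is assumed to be defined earlier (not shown in this post),
# e.g. a tiledb.Config carrying TileDB Cloud credentials.
array_name = "tiledb://TileDB-Inc/quickstart_dense"
my_array = tiledb.open(array_name, 'r', ctx=tiledb.Ctx(config))
print(type(my_array)) # tiledb.array.DenseArray
print(my_array.nattr) # 1
print(type(my_array[:])) # collections.OrderedDict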

According to the dense-array docs:

“If the dense array has a single attribute than a Numpy array of corresponding shape/dtype is returned for that attribute. If the array has multiple attributes, a collections.OrderedDict is returned with dense Numpy subarrays for each attribute.”

So I was expecting my_array[:] to be another array, not an OrderedDict. Is there maybe another link to a description of how attributes are meant to be used on arrays? It’s not quite clear to me when I’d index into a dense array and not get another dense array back.

I’m also curious, in the context of TileDB Cloud, what gets pulled to the local client on slicing if slicing returns a NumPy array. Could slicing return another TileDB array that stays in the cloud? I only want a NumPy array back when I call np.asarray on the array; otherwise things should stay in the cloud, I think.

@sofroniewn Thank you for bringing this to our attention. Returning a single NumPy array is only supported in the special case where the single attribute is an “anonymous” attribute of TileDB. This behavior was intended to be removed and does not generalize. An ordered dictionary will be returned in all cases (including yours). The documentation will be updated in the next TileDB-Py release.

Is there maybe another link to a description of how attributes are meant to be used on arrays? It’s not quite clear to me when I’d index into a dense array and not get another dense array back.

Yes, please check the attribute section of our docs. The main reason to have attributes is to hold the non-index values you want to associate with particular coordinates of the array. TileDB currently requires all attributes to be written for each cell of an array, so for a dense array every attribute in the ordered dictionary will be dense. The ordered dictionary in Python handles the case where multiple attributes are associated with an array, and for uniformity we use the same format when there is a single attribute.
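As a minimal sketch of what this means in practice, the single attribute can be pulled out of the returned dictionary either by name or by position. The helper `single_attribute`, the attribute name "a", and the stand-in values are assumptions for illustration; nested lists are used in place of NumPy arrays so the snippet runs without TileDB installed.

```python
from collections import OrderedDict


def single_attribute(result):
    """Return the only value of a one-attribute slice result.

    `result` is assumed to be the mapping returned by slicing a dense
    array, keyed by attribute name. Raises ValueError otherwise.
    """
    if len(result) != 1:
        raise ValueError(f"expected exactly one attribute, got {list(result)}")
    return next(iter(result.values()))


# Stand-in for `my_array[:]`; the attribute name "a" is hypothetical.
sliced = OrderedDict([("a", [[1, 2], [3, 4]])])

by_name = sliced["a"]                   # look the attribute up by name
by_position = single_attribute(sliced)  # or take the only attribute present
```

Looking up by name is the more explicit option; the positional helper is only convenient when the attribute name is not known in advance.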

I’m also curious, in the context of TileDB Cloud, what gets pulled to the local client on slicing if slicing returns a NumPy array. Could slicing return another TileDB array that stays in the cloud? I only want a NumPy array back when I call np.asarray on the array; otherwise things should stay in the cloud, I think.

Currently, when you slice a TileDB array, all the data from your slice is returned to your client. This works the same way as slicing a local or S3 array, where all the data from your slice is loaded into memory. TileDB, the C++ library, does support the idea of incomplete queries, but this is currently not available in Python and is not directly related to your idea of np.asarray. Happy to go into more detail if you are interested in incomplete queries in general.

I suspect you are probably interested in our Serverless Array UDFs. With the serverless array UDFs the slicing happens entirely in the cloud environment and only the results of your function are returned. This prevents having to fetch large results over the network to your machine (or even to a hosted notebook) and works great with any type of reduce operation.
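To illustrate the kind of reduce function meant here, the sketch below collapses a slice result to a single number, which is all that would need to travel back to the client. The function name `mean_of_attribute`, the attribute name "a", and the 2-D stand-in data are assumptions; how the function is submitted to TileDB Cloud is not shown, and the snippet only exercises it locally.

```python
from collections import OrderedDict


def mean_of_attribute(data, attr="a"):
    """Reduce a 2-D slice result to one float: the mean of one attribute.

    `data` is assumed to be a mapping from attribute name to a 2-D
    array-like, as returned by slicing a dense array.
    """
    values = [value for row in data[attr] for value in row]
    return sum(values) / len(values)


# Local stand-in for the sliced data the function would receive.
local_slice = OrderedDict([("a", [[1, 2], [3, 4]])])
result = mean_of_attribute(local_slice)
```

Whatever the size of the slice, only `result` (one float) is the output of the function.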

OK, thanks for that info @seth. I’m mostly used to working with single-attribute array data (images, with luminance values) and naively would have expected slicing or an np.asarray call to just return another array or a NumPy array.

I will follow up on the incomplete queries and serverless array UDFs if I start going down that path. The serverless array UDFs sound very interesting; thanks for those links.

I wonder if TileDB is looking at this Python array standardization effort: https://data-apis.org/blog/announcing_the_consortium/. Being part of this effort and conforming to these standards would mean that I wouldn’t have to special-case TileDB arrays inside the tools I am building, which would be a huge advantage for everyone.

@sofroniewn Yes, absolutely, we are following it closely and looking to contribute. TileDB’s data model is richer than that of any other software out there (it encompasses arrays, dataframes, key-values, metadata, axis labels, etc., all in a unified way as dense or sparse multi-dimensional arrays). We are interested in seeing how the community will react.