Slicing of arrays in Python

sofroniewn · August 10, 2020, 6:52pm

Hello. Nick from CZI here, getting going with TileDB in Python for the first time.

I was able to pull your quickstart_dense array down but ran into something unexpected with the following:

import tiledb

# `config` is a tiledb.Config holding TileDB Cloud credentials (setup not shown)
array_name = "tiledb://TileDB-Inc/quickstart_dense"
my_array = tiledb.open(array_name, 'r', ctx=tiledb.Ctx(config))
print(type(my_array)) # tiledb.array.DenseArray
print(my_array.nattr) # 1
print(type(my_array[:])) # collections.OrderedDict

According to the dense-array docs

“If the dense array has a single attribute than a Numpy array of corresponding shape/dtype is returned for that attribute. If the array has multiple attributes, a collections.OrderedDict is returned with dense Numpy subarrays for each attribute.”

So I wasn’t expecting my_array[:] to be an OrderedDict but another array. Is there maybe another link to a description of how attributes are meant to be used on arrays, as it’s not quite clear to me when I’d index into a dense array and not get another dense array again.

I’m also curious, in the context of TileDB Cloud, what gets pulled to the local client on slicing if slicing returns a numpy array. Could slicing return another TileDB array that stays on the cloud? I only want a numpy array back when I call np.asarray on the array; otherwise things should stay in the cloud, I think.

seth · August 10, 2020, 9:31pm

@sofroniewn Thank you for bringing this to our attention. Returning a single numpy array is only supported in a special case where the single attribute is an “anonymous” attribute of TileDB. This behavior was intended to be removed, and it does not generalize. An ordered dictionary will be returned in all other cases (including your case). The documentation will be updated in the next TileDB-Py release.

Is there maybe another link to a description of how attributes are meant to be used on arrays, as it’s not quite clear to me when I’d index into a dense array and not get another dense array again.

Yes, please check the attribute section of our docs. The main reason to have attributes is to store non-index values you want to associate with particular coordinates of the array. TileDB currently requires all attributes to be written for a cell of an array, so for a dense array all the attributes will be dense inside the ordered dictionary. The ordered dictionary in Python handles the case where multiple attributes are associated with an array, and for uniformity we use the same format when there is a single attribute.
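For anyone who works mostly with single-attribute arrays, the dictionary can be unwrapped in one step. The following is a minimal sketch, not part of TileDB-Py: `single_attribute` is a hypothetical helper, and the attribute name "a" and the nested list are placeholders standing in for what `my_array[:]` would return (with TileDB-Py the value would be a numpy array).

```python
from collections import OrderedDict


def single_attribute(result):
    """Return the only value of a one-attribute slice result.

    `result` is any mapping of attribute name -> data, which is the shape
    of what the thread says `my_array[:]` returns. Hypothetical helper,
    not part of TileDB-Py.
    """
    if len(result) != 1:
        raise ValueError(f"expected exactly one attribute, got {list(result)}")
    return next(iter(result.values()))


# Stand-in for `my_array[:]`; the attribute name "a" and the values are
# made up for illustration.
fake_slice = OrderedDict([("a", [[1, 2], [3, 4]])])

pixels = single_attribute(fake_slice)  # same object as fake_slice["a"]
```

If the attribute name is already known, indexing the result directly (for example `my_array[:]["a"]`, assuming the attribute is called "a") does the same thing without a helper.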

I’m also curious, in the context of TileDB Cloud, what gets pulled to the local client on slicing if slicing returns a numpy array. Could slicing return another TileDB array that stays on the cloud? I only want a numpy array back when I call np.asarray on the array; otherwise things should stay in the cloud, I think.

Currently, when you slice a TileDB array, all the data from your slice is returned to your client. This works the same way as slicing a local or S3 array, where all the data from your slice is loaded into memory. TileDB, the C++ library, does support the idea of incomplete queries, but this is currently not available in Python and is not directly related to your idea of np.asarray. Happy to go into more detail if you are interested in incomplete queries in general.

I suspect you are probably interested in our Serverless Array UDFs. With the serverless array UDFs the slicing happens entirely in the cloud environment and only the results of your function are returned. This prevents having to fetch large results over the network to your machine (or even to a hosted notebook) and works great with any type of reduce operation.
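To make the difference concrete, here is a toy model of the idea in plain Python. It does not use the TileDB Cloud API: `ToyRemoteArray`, its `apply` method, and the `cells_sent` counter are all invented for illustration, and the "remote" data is just a local list. It only shows why running a reduction next to the data moves far less over the network than slicing first.

```python
import statistics


class ToyRemoteArray:
    """In-memory stand-in for an array that lives on a remote service."""

    def __init__(self, rows):
        self._rows = rows      # pretend this data is stored remotely
        self.cells_sent = 0    # cells shipped to the "client" so far

    def __getitem__(self, row_slice):
        # Plain slicing: every selected cell travels to the client.
        selected = [list(r) for r in self._rows[row_slice]]
        self.cells_sent += sum(len(r) for r in selected)
        return selected

    def apply(self, func, row_slice):
        # UDF-style call: func runs next to the data, only its result travels.
        result = func(self._rows[row_slice])
        self.cells_sent += 1   # assumes the result is a single scalar
        return result


def mean_of_cells(rows):
    return statistics.mean(v for row in rows for v in row)


data = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4, values 0..15

local = ToyRemoteArray(data)
local_mean = mean_of_cells(local[0:4])   # pulls all 16 cells, then reduces

remote = ToyRemoteArray(data)
remote_mean = remote.apply(mean_of_cells, slice(0, 4))  # ships one number
```

Both paths compute the same mean, but the first moves 16 cells to the client and the second moves a single value.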

sofroniewn · August 22, 2020, 6:31pm

Ok, thanks for that info @seth. I’m mostly used to working with single-attribute array data (images, with luminance values) and naively would have expected slicing or an np.asarray call to just return another array or a numpy array.

I will follow up on the incomplete queries and serverless array UDFs if I start going down that path. The serverless array UDFs sound very interesting; thanks for those links.

I wonder if TileDB is looking at this Python array standardization effort: https://data-apis.org/blog/announcing_the_consortium/. Being part of this effort and conforming to these standards would mean that I wouldn’t have to special-case TileDB arrays inside the tools I am building, which would be a huge advantage for everyone.

stavros · August 24, 2020, 11:25am

@sofroniewn Yes, absolutely, we are following closely and looking to contribute. TileDB’s data model is richer than that of any other software out there (as it encompasses arrays, dataframes, key-values, metadata, axis labels, etc., all in a unified way as dense or sparse multi-dimensional arrays). We are interested in seeing how the community will react.
