Dask Array to TileDB with Google Cloud Storage Issue

I am trying to convert a Dask Array to a TileDB array and write it directly to Google Cloud Storage in Python, but whenever I call da.to_tiledb("gcs://mybucket/array-name.tldb") I get the error:
TileDBError: [TileDB::Array] Error: Cannot open array; Array does not exist

I have set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to my bucket key and set config['vfs.gcs.project_id'] = 'my-project-id'. I have also tried passing the credentials file through the storage_options keyword argument. All of these attempts yield the same error. The Dask documentation mentions that this method should work with “any TileDB-supported URI, including local disk, S3, or HDFS.”
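For readers following along, the configuration described above can be sketched roughly as follows. This is a minimal sketch, not the original poster's code: the helper names gcs_storage_options and write_random_array, the array shape, and the chunk sizes are made up for illustration. It assumes storage_options is forwarded to TileDB as a dict of config parameters, as the Dask documentation describes, and that credentials are picked up from GOOGLE_APPLICATION_CREDENTIALS in the same process.

```python
def gcs_storage_options(project_id):
    """Build a plain dict of TileDB config entries to pass as storage_options.

    Hypothetical helper: only the 'vfs.gcs.project_id' key comes from the
    post above; the function itself is illustrative.
    """
    return {"vfs.gcs.project_id": project_id}


def write_random_array(uri, project_id):
    """Write a small chunked Dask array to a TileDB URI (illustrative only)."""
    # Imported lazily so the helper above works without dask/tiledb installed.
    import dask.array as da

    # Placeholder data; a real workflow would load its own array lazily.
    darr = da.random.random((1000, 1000), chunks=(250, 250))

    # to_tiledb takes the Dask array first, then the target URI.
    da.to_tiledb(darr, uri, storage_options=gcs_storage_options(project_id))
```

Calling write_random_array("gcs://mybucket/array-name.tldb", "my-project-id") would then attempt the write. It only succeeds if the credentials are visible to the Python process that makes the call.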

Writing to local disk works as expected, leading me to believe that perhaps Google Cloud functionality is not supported with this method. I could create an in-memory numpy array of my data, but I would much rather do this out-of-core to eventually convert larger datasets.

Here is my code, as well as some information about the numpy.ndarray I am attempting to write to TileDB Embedded:

Note: I know that a Python adapter to convert NetCDF to TileDB exists, but I’ve been having a lot of trouble getting it to work in my environment. Because of this I switched to the Dask Array to TileDB method.

Hi @jgreen, thanks for reaching out! Someone else will follow up with you on the GCS portions of this question, but I would like to check in on the issues you were having with converting NetCDF files with the Python library. First, which library did you attempt to use? The one we are supporting internally is tiledb-cf, but it sometimes gets buried under the tiledb_netcdf library developed by the Met Office Informatics Lab. If you were using tiledb-cf, would you be willing to provide more information about the issues you were hitting?

Thanks for the reply! I was unaware that there were two libraries, but I was indeed using the tiledb_netcdf library instead of tiledb-cf. Switching to the latter and using the command line interface worked without issue. Looking forward to going through the API documentation, thanks again!

UPDATE: I figured out why I was receiving the TileDBError: [TileDB::Array] Error: Cannot open array; Array does not exist. I had exported the environment variable in a different command-line session, so the process running my code never saw it. I am working in a proprietary middleware system that is still in development, and it sometimes exhibits odd behavior when more than one terminal is open in a different browser tab.

Even when setting the variable with the os module, I observed odd behavior (once again, because of the software I am running everything on). To set the correct environment variable with my credentials, I found the most reliable method to be the %%bash cell magic, although I will continue to experiment to find the most general way to do this.
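One way to catch this kind of problem early is to check, inside the same Python process that will call to_tiledb, that the credentials variable is actually set and points to a real file. The following is a minimal sketch using only the standard library. The function name require_gcs_credentials is hypothetical, and the check only confirms that the variable is visible to this process, not that the key is valid.

```python
import os


def require_gcs_credentials(env_var="GOOGLE_APPLICATION_CREDENTIALS"):
    """Return the credentials path if this process can see it, else raise."""
    path = os.environ.get(env_var)
    if not path:
        # Variables exported in another terminal or kernel are not inherited.
        raise RuntimeError(
            f"{env_var} is not set in this process; "
            "it may have been exported in a different session."
        )
    if not os.path.isfile(path):
        raise FileNotFoundError(f"{env_var} points to a missing file: {path}")
    return path
```

Running require_gcs_credentials() at the top of a notebook or script, before any TileDB call, turns a misleading "Array does not exist" error into a direct message about the missing variable.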

Hi @jgreen,

I’ve added a story to our backlog to look at setting GCS credentials without the use of environment variables.

Best,
Isaiah
