TileDB in AWS behind a corporate proxy server

I’m looking at using the TileDB Python bindings to create and read arrays in S3. We have direct access to S3 at one site, and I can create TileDB arrays there. At the site where I’m based we still have to go through a corporate proxy server, and tiledb.SparseArray.create fails with:

>     TileDBError: [TileDB::S3] Error: Failed to create multipart request for object '/cdf_data.tiledb/__array_schema.tdb
>     Exception:  
>     Error message:  Unable to connect to endpoint

Has anyone managed to use TileDB in AWS behind a proxy server? I thought I’d check before submitting a bug report.

I’m using version 0.6.6 of the TileDB bindings with Python 3.6 on CentOS 7. In case it makes a difference, we use a non-standard port on the proxy.

Hi @nickholway,

Please try to override the default value for the vfs.s3.proxy_scheme configuration option to http. Our current internal default is https, which is not usually applicable (we will update).

I was able to reproduce part of your error locally and resolved it with the proxy_scheme override. Here’s an example that I tested locally with the Squid proxy in this Docker image:

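import tiledb

# S3 proxy settings; host and port match the local Squid instance
cfg = {
    'vfs.s3.proxy_host': 'localhost',
    'vfs.s3.proxy_port': '3128',
    'vfs.s3.aws_access_key_id': '...',
    'vfs.s3.aws_secret_access_key': '...',
    # Override the internal https default
    'vfs.s3.proxy_scheme': 'http',
    # Uncomment for verbose S3 logging while troubleshooting
    # 'vfs.s3.logging_level': 'TRACE',
}
# Initialize the default context with this configuration
tiledb.default_ctx(config=cfg)

# Open an existing array through the proxy and print its schema
a = tiledb.open("s3://bucket-name/array_path")
print(a.schema)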

If overriding vfs.s3.proxy_scheme does not fix the problem, the proxy may require authentication; try setting vfs.s3.proxy_username and vfs.s3.proxy_password.
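
For an authenticating proxy, the settings could be assembled roughly as in the sketch below. The host, port, and credentials are placeholders, and `proxy_config` is a hypothetical helper rather than part of TileDB; only the `vfs.s3.proxy_*` option names are taken from the TileDB configuration.

```python
def proxy_config(host, port, scheme="http", username=None, password=None):
    """Build TileDB S3 proxy options as a plain dict (hypothetical helper)."""
    cfg = {
        "vfs.s3.proxy_host": host,
        "vfs.s3.proxy_port": str(port),  # config values are passed as strings
        "vfs.s3.proxy_scheme": scheme,
    }
    # Only add credentials when the proxy actually requires them
    if username is not None:
        cfg["vfs.s3.proxy_username"] = username
        cfg["vfs.s3.proxy_password"] = password or ""
    return cfg


# Placeholder values, not a real proxy
cfg = proxy_config("proxy.example.com", 8080, username="user", password="secret")
```

The resulting dict can then be merged with the AWS credential options and passed to `tiledb.default_ctx(config=cfg)`.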

Otherwise, two less likely possibilities are:

  • the proxy setup actually does use https and requires a certificate override
  • the proxy requires SOCKS5 support

The last two may take some modifications on our side or in the AWS library, so please ping us at hello <at> tiledb.com or isaiah <at> if the first two suggestions don’t help. We want to make sure TileDB is usable for proxied environments, and will need to figure out which update to prioritize.

Best,
Isaiah

Hi Isaiah,
Thanks for the response.
Your suggestion of setting vfs.s3.proxy_host, vfs.s3.proxy_port & vfs.s3.proxy_scheme worked for me.
A couple of suggestions:

  • For the S3 docs, it’d be useful if you could include the context setting on the page, as otherwise a developer has to hunt for it.
  • On our systems (JupyterHub in this case), proxies are normally set with environment variables. It’d be good if TileDB could pick these up and use them; when troubleshooting the above I used boto3 and it “just worked”.

Thanks
Nick

Hi Nick,

I’m glad that it worked, and thanks for the suggestion: I’ve added the proxy settings to the AWS config docs.

Good point about the environment variables; I’ve opened a ticket to take a look at this, because it would definitely simplify such use-cases. Are you using the cURL-style HTTP_PROXY/HTTPS_PROXY variables?
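
In the meantime, one possible workaround is to translate those variables into TileDB options by hand. This is only a sketch: `proxy_config_from_env` is a hypothetical helper, not a TileDB function, and it assumes the variables hold a URL such as `http://proxy.example.com:3128`.

```python
import os
from urllib.parse import urlsplit


def proxy_config_from_env(environ=os.environ):
    """Map HTTPS_PROXY/HTTP_PROXY to TileDB S3 proxy options (hypothetical helper)."""
    # S3 requests normally use HTTPS, so prefer HTTPS_PROXY over HTTP_PROXY
    for name in ("HTTPS_PROXY", "https_proxy", "HTTP_PROXY", "http_proxy"):
        url = environ.get(name)
        if url:
            break
    else:
        return {}  # no proxy variable set

    # Some setups omit the scheme (e.g. "proxy:3128"); assume plain http
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    if not parts.hostname:
        return {}

    cfg = {
        "vfs.s3.proxy_host": parts.hostname,
        "vfs.s3.proxy_scheme": parts.scheme,
    }
    if parts.port is not None:
        cfg["vfs.s3.proxy_port"] = str(parts.port)
    if parts.username:
        cfg["vfs.s3.proxy_username"] = parts.username
        cfg["vfs.s3.proxy_password"] = parts.password or ""
    return cfg


# Example with an explicit mapping instead of the real environment
example = proxy_config_from_env({"HTTP_PROXY": "http://proxy.example.com:3128"})
```

The returned dict can be merged into the config passed to `tiledb.default_ctx`; an empty dict means no proxy variable was found.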

Best,
Isaiah

> I’m glad that it worked, and thanks for the suggestion: I’ve added the proxy settings to the AWS config docs.

Great, that’s much clearer.

> Are you using the cURL-style HTTP_PROXY / HTTPS_PROXY variables?

Correct. It’d also be good to be able to override them through a) $NO_PROXY and b) via vfs.s3.something in case you have a direct connection or are using an internal S3-like object store.
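
To illustrate the $NO_PROXY part, here is a rough sketch of the check a caller could do today before adding proxy options to the config. `bypass_proxy` is a hypothetical helper, and the matching is deliberately simplified (exact host, domain suffix, or `*`); tools differ in how they interpret NO_PROXY, and ports and CIDR ranges are not handled here.

```python
import os


def bypass_proxy(host, environ=os.environ):
    """Return True if `host` matches a NO_PROXY entry (simplified, hypothetical helper)."""
    raw = environ.get("NO_PROXY") or environ.get("no_proxy") or ""
    host = host.lower()
    for entry in raw.split(","):
        # Treat ".example.com" and "example.com" the same way
        entry = entry.strip().lower().lstrip(".")
        if not entry:
            continue
        if entry == "*" or host == entry or host.endswith("." + entry):
            return True
    return False


# Example with an explicit mapping; the hostnames are placeholders
env = {"NO_PROXY": "localhost,.internal.example.com"}
direct = bypass_proxy("objects.internal.example.com", env)
proxied = not bypass_proxy("s3.amazonaws.com", env)
```

When the check returns True for the S3 endpoint host, the `vfs.s3.proxy_*` options would simply be left out of the config.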

Nick
