TileDB-Py trying to connect to HDFS - but I don't want it to

Hello,

I’m using TileDB-Py in a Docker container. When I try to create the TileDB context with a configuration that points to S3 storage, the call fails because TileDB tries to connect to a non-existent HDFS:

hdfsBuilderConnect(forceNewInstance=1, nn=default, port=0, kerbTicketCachePath=(NULL), userName=(NULL)) error:
(unable to get root cause for java.lang.NoClassDefFoundError)
(unable to get stack trace for java.lang.NoClassDefFoundError)

The Docker image contains the Hadoop client libraries, but I don’t need to access HDFS and I don’t want to use them. Is there something I can do to “shut off” or avoid turning on HDFS-related functionality in TileDB?

The configuration I’m using is quite simple:

{
"vfs.s3.endpoint_override": "minio:9000",
"vfs.s3.scheme": "http",
"vfs.s3.region": "",
"vfs.s3.verify_ssl": "false",
"vfs.s3.use_virtual_addressing": "false",
"vfs.s3.use_multipart_upload": "false",
"vfs.s3.aws_access_key_id": "abc",
"vfs.s3.aws_secret_access_key": "def"
}
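
For completeness, this is roughly how I pass that configuration in Python; constructing the context is where the error appears (the tiledb calls are shown commented, since they are what fails):

```python
# Same settings as the JSON above; endpoint and credentials are placeholders.
s3_config = {
    "vfs.s3.endpoint_override": "minio:9000",
    "vfs.s3.scheme": "http",
    "vfs.s3.region": "",
    "vfs.s3.verify_ssl": "false",
    "vfs.s3.use_virtual_addressing": "false",
    "vfs.s3.use_multipart_upload": "false",
    "vfs.s3.aws_access_key_id": "abc",
    "vfs.s3.aws_secret_access_key": "def",
}

# Creating the context from this dict triggers the hdfsBuilderConnect error:
#   import tiledb
#   ctx = tiledb.Ctx(tiledb.Config(s3_config))
```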

Cheers,

Luca

Hi @ilveroluca,

My suspicion is that you may be running a very old version of TileDB/TileDB-Py. A few questions so we can try to understand what is happening here:

  • what version of TileDB-Py, and how did you install it?
  • is it your own Dockerfile?
  • can you please share the protocol of the URI you are connecting to? e.g. s3://, azure://, etc.

Thanks,
Isaiah

Updating: after some discussion, we see the issue – libtiledb tries to initialize the HDFS client unconditionally whenever the HDFS library is present (which is usually not the case in our TileDB-Py test setup).

We will make the HDFS setup completely lazy (on-demand) in the next release to eliminate the startup error. Thank you for pointing this out @ilveroluca.
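
Conceptually, the fix looks like this (an illustrative sketch of on-demand initialization, not the actual libtiledb code):

```python
class LazyBackend:
    """Sketch of on-demand backend initialization (illustrative only)."""

    def __init__(self, factory):
        self._factory = factory  # e.g. a function that builds the HDFS client
        self._client = None      # nothing is initialized at context creation

    def client(self):
        # The backend is initialized on first use, so merely having the
        # HDFS library present no longer triggers a connection attempt.
        if self._client is None:
            self._client = self._factory()
        return self._client
```

With this pattern, a context that never touches an hdfs:// URI never calls the factory, so no startup error can occur.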

You’re quite welcome! FWIW, I managed to work around the issue by removing all HADOOP* environment variables.
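
Concretely, something along these lines before importing tiledb did the trick (a sketch; the exact variable names will depend on your image):

```python
import os

# Drop every HADOOP* variable so libtiledb's eager HDFS detection
# finds nothing to initialize. This must run before `import tiledb`.
for name in [k for k in os.environ if k.startswith("HADOOP")]:
    del os.environ[name]

# import tiledb  # safe to import once the environment is cleaned
```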