S3 first access very slow with 3D tiled dense array

Hi,
I am currently working on a POC with TileDB.
I’m really impressed by your documentation. Really great job, guys; we definitely want the same level for our API.
I visualize 3D dense array data from 100MB to 10TB.
I’m quite happy with local usage, but when I try S3 storage I’m not so happy.

I have more or less worked around all my problems, but one last problem persists. It is simply the first access to my object:
tiledb::Object::object( m_ctx, m_dirname ).type() takes 16.5s

It is always 16.5s and does not depend on the size of my data (even with 100MB of data it is 16.5s), so I suspect an issue on my side, since I’m not so familiar with S3.
I have tried different buckets in different locations without success.
tiledb::Stats::dump isn’t available at this point, so I don’t know how to debug.

Do you have an idea? An S3 configuration in TileDB? An S3 configuration on my desktop?

Regards

Hello @Julien! Thank you for your kind words about our documentation.

For your performance issue with the object type check, this is likely caused by the object API performing a listing operation. On local filesystems listing is very performant but with S3 and other object stores listings can be more costly in terms of time and latency. This is relevant because in TileDB 2.5 and older we list the directory and check for certain files to determine the object type. There is an obvious improvement here to avoid the listing and I’ve already opened a pull request to change the behavior. This should improve the performance substantially for you. We will include this in TileDB 2.6 which we are aiming to release next week.

S3’s listing operations take longer for a few reasons. The main factor is that S3 returns at most 1000 objects per request. For example, if you have an array with 5000 fragments, there are 10002 objects in the directory (5k fragments, 5k “ok” files, the __meta folder and the __schema folder). In this example it would take 11 S3 API calls to list all the objects. This latency can add up, especially if you are listing S3 remotely over the internet.
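The arithmetic in that example can be sketched in a few lines of Python. This is only an illustration of the counting above, not TileDB code: the function name and the `extra_entries` parameter are hypothetical, and the layout (one directory plus one “ok” file per fragment, plus __meta and __schema) is taken from the example.

```python
import math

# S3 returns at most 1000 objects per list request (as stated above).
S3_MAX_KEYS_PER_LIST = 1000


def list_requests_needed(num_fragments: int, extra_entries: int = 2) -> int:
    """Estimate the number of S3 list calls needed for one array directory.

    Assumes each fragment contributes two entries (the fragment directory
    and its "ok" file) and that extra_entries covers __meta and __schema.
    """
    total_objects = 2 * num_fragments + extra_entries
    return math.ceil(total_objects / S3_MAX_KEYS_PER_LIST)


# 5000 fragments give 10002 entries, which do not fit in 10 pages of 1000.
print(list_requests_needed(5000))
```

Each of those calls pays a full network round trip, which is why the fragment count matters more on S3 than on a local filesystem.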

One question I have is how many fragments are in your array. You can either use the FragmentInfo API or simply run aws s3 ls on the array URI.

For the listing to take 16.5s it seems either you have a large number of fragments or you are having high network latency from your desktop to s3. If you have a large number of fragments (several thousand) I’d like to perform some additional checks, such as average size of the fragments, to make sure you achieve optimal performance on s3.

Lastly, I want to mention that when working with S3 we strongly recommend performing fragment metadata consolidation. This improves the performance of opening the array by requiring fewer I/O operations to fetch fragment metadata.

Thank you for the quick and interesting answer.

I forgot to mention that I’m on v2.5.2.

I haven’t described how I create my array because it does not appear to affect the issue at all. If I change my tile size or use small data, I still get the same 16.5s.

But I could also explain how I store the data if you want (a tiledb::Group of several tiledb::Array objects representing several LODs).

If I run “time aws s3 ls s3://MYBUCKET/res_0”, which lists 723 files and 242 folders, it takes 4s.

But I could also explain how I store the data if you want (a tiledb::Group of several tiledb::Array objects representing several LODs).

It would be helpful if you could; this would give us some insight and allow us to make recommendations to improve the overall performance on S3.

If I run “time aws s3 ls s3://MYBUCKET/res_0”, which lists 723 files and 242 folders, it takes 4s.

Thank you for these details. Is “res_0” the top level group or is this an array? Also is “res_0” what you see taking 16.5s with tiledb::Object::object( m_ctx, m_dirname ).type() ?

I have cherry-picked commit 32f5a691297842aa0bd84cc86a21bd7c5225bdb1, but it doesn’t help. Maybe that is because, when I call tiledb::Object::object( m_ctx, m_dirname ).type(), m_dirname is a tiledb::Group containing one directory per tiledb::Array (it is really rare to have more than 6 LODs, and therefore more than 6 arrays).

I always have only one attribute in my dense array (a basic C++ type).

Writing is done once in my use case, by several tiledb::Query objects whose subarrays correspond exactly to the tile size.

(For example, a 512 x 512 x 512 volume with a tile size of 256 x 256 x 256 means I have 8 tiles and will run 8 queries; with a 2560^3 dataset (17GB) and a 256^3 tile size, I will run 1000 tiledb::Query objects.)
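The tile counts quoted above can be checked with a short sketch. The helper name is hypothetical and nothing here calls TileDB; it only assumes one write query per tile and a cubic tile extent, as described in this post.

```python
import math


def tiles_per_volume(shape, tile_extent):
    """Count the tiles (and so the write queries, at one query per tile)
    needed to cover a dense volume with a cubic tile of the given extent."""
    return math.prod(math.ceil(dim / tile_extent) for dim in shape)


print(tiles_per_volume((512, 512, 512), 256))     # the small example volume
print(tiles_per_volume((2560, 2560, 2560), 256))  # the 2560^3 dataset
```

Since each write query produces one fragment in this layout, the same number is also the fragment count of the array.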

Do you think fragment metadata consolidation will be relevant, given that I have the same number of write queries as tiles in my array? To be honest, I have tried it without success.

Since the write operation is performed only once in our workflow, it can take a long time if that speeds up the read operation.

At the end of the write operation I will have several arrays in a tiledb::Group (representing a Multi resolution dataset)

Example 512x512x512 dataset, tile 256x256x256 (Here I will have only 2 levels of resolution)

DIR Tiledb::Group

  • __tiledb_group.tdb
  • DIR Res_0 => Array domain (dim 512x512x512, tilesize 256)
    • DIR __XXXXXX_10 (8 dirs)
      • __fragment_metadata.tdb
      • tile.tdb
    • DIR __meta
    • DIR __schema
    • __XXXXXX_10.ok (8 files)
    • __lock.tdb
  • DIR Res_1 => Array domain (dim 256x256x256, tilesize 256)
    • DIR __XXXXXX_10 (1 dir)
      • __fragment_metadata.tdb
      • tile.tdb
    • DIR __meta
    • DIR __schema
    • __XXXXXX_10.ok (1 file)
    • __lock.tdb

The critical part of my workflow is the read operation. For now, reads (like writes) are done only in blocks that correspond exactly to tiles in the tiledb::Array.

Thank you for these details. Is “res_0” the top level group or is this an array? Also is “res_0” what you see taking 16.5s with tiledb::Object::object( m_ctx, m_dirname ).type() ?

No, because my command wasn’t a good example. It is more like:
“time aws s3 ls s3://MYBUCKET/my3Ddata256/res_0”
Here I call tiledb::Object::object( m_ctx, "s3://MYBUCKET/my3Ddata256" ).type()

Hi @Julien,

I’ve tried to reproduce the object_type issue, so far without success (not using @seth patch either, yet). I created 6 small arrays under a group, and called tiledb.object_type in the Python API from an EC2 instance and from my local computer – I get 70 ms consistently on EC2 and under 200 ms locally.

In [1]: import tiledb, numpy as np

In [2]: %time tiledb.object_type("s3://<bucket>/debug/group1/")
CPU times: user 8.56 ms, sys: 2.34 ms, total: 10.9 ms
Wall time: 105 ms
Out[2]: 'group'
In [3]: vfs = tiledb.VFS()

In [4]: vfs.ls("s3://<bucket>/debug/group1/")
Out[4]:
['s3://<bucket>/debug/group1/__tiledb_group.tdb',
 's3://<bucket>/debug/group1/a1',
 's3://<bucket>/debug/group1/a2',
 's3://<bucket>/debug/group1/a3',
 's3://<bucket>/debug/group1/a4',
 's3://<bucket>/debug/group1/a5',
 's3://<bucket>/debug/group1/a6']

Could you try a few things:

  • please double-check the time after running make install-tiledb with the cherry-pick, to make sure that the installed libtiledb is updated.

    • (also, just to make sure: this is a release build, correct?)
  • if possible, try installing TileDB-Py and call:

    import tiledb
    tiledb.object_type(<uri>)
    

    (if you do not have AWS credentials permanently configured, make sure they are exported in the environment before starting python and importing tiledb)

  • does the slow tiledb::Object::object( m_ctx, m_dirname ).type() call reproduce in a minimal main.cc with only that call? Just wondering if there could be any other confounds in the “full” program (thread pools and such).

Thanks,
Isaiah

Hi @ihnorton,

C++ build double-checked.

Same results with Python. I notice the second call takes only 241ms.

import tiledb
ctx = tiledb.Ctx({'vfs.s3.region': 'eu-west-3'})
tiledb.libtiledb.version()

(2, 5, 3)

%time tiledb.object_type("s3://XXXXX/tiledb/XXXX256", ctx)
Wall time: 16.4 s
'group'
%time tiledb.object_type("s3://XXXXX/tiledb/XXXX256", ctx)
Wall time: 241 ms
'group'

Same results: even a simple Python script reproduces the issue. The issue definitely seems to be on my side.
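For anyone wanting to reproduce the cold-versus-warm comparison outside IPython, here is a minimal timing sketch. The helper `time_calls` is hypothetical, and the slow first call is simulated with a stand-in function so that the snippet runs without TileDB or S3 access; the real call would be passed in as a lambda, as noted in the comment.

```python
import time


def time_calls(fn, repeats=2):
    """Call fn several times and return the wall time of each call in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in for a call that is slow only the first time (simulated here).
_state = {"first": True}


def simulated_first_access():
    if _state["first"]:
        time.sleep(0.05)  # pretend one-time setup cost
        _state["first"] = False


for i, seconds in enumerate(time_calls(simulated_first_access), start=1):
    print(f"call {i}: {seconds * 1000:.1f} ms")

# With TileDB-Py installed, the real measurement would look like:
#   time_calls(lambda: tiledb.object_type("s3://<bucket>/<group>", ctx))
```

A large gap between the first and second call points at one-time setup in the process rather than at the size of the data.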

I have tried to use boto3 with python and listing my s3 buckets takes 100ms.

import tiledb
vfs = tiledb.VFS()
vfs.ls("s3://")

Takes 16.5s
But

import boto3
s3 = boto3.resource('s3')

for bucket in s3.buckets.all():
   print(bucket.name)

Takes 100ms

Thanks,
Julien

Hi @Julien,

Thanks for checking those. I’m not sure what to make of this yet, but we are looking into it.

  • I’m testing the following simple script for array creation – if you run this, does the resulting tiledb.object_type call still take 16s?
import tiledb, numpy as np
base_uri = "s3://tiledb-isaiah2/debug"
tiledb.group_create(base_uri + "/group2")
for i in range(6):
    tiledb.from_numpy(base_uri+f"/group2/array{i}", np.random.rand(100))
tiledb.object_type(base_uri + "/group2")
  • how many files and directories are in the group prefix in your case?

(next, I’ll try the above on eu-west-3 myself just in case)

Best,
Isaiah

I get about 400ms on first query with the above script, coming from us-east-1 to eu-west-3.

Yes, I also get the same “bad” result with us-east-1 or eu-west-3.

import tiledb
base_uri = "s3://XXXXX-us/tiledb"
%time tiledb.group_create(base_uri + "/group")

Wall time: 17.3 s
Stop losing time on me for a little bit :wink:
I will try to involve colleagues to test it in their environment.

I ran a test on Ubuntu 20.04, and it takes 4.5s. That is a lot better but still strange.
Thanks for the help, but the issue is on my side; I will continue investigating with the IT department. I’m still wondering why it is so quick with boto3, though.

Thanks for the update (did not see it at first, b/c it was a comment edit). We were a bit stumped, so this is very helpful.

What platform was the initial 16s time on? If it’s on Windows, in particular, then there are a lot of factors which can contribute to slowness (eg anti-virus scanners). 4s on Ubuntu is still slower than expected.

At this point, my suggestion for further debugging would be with tracing, either at the client level or with a tool like Wireshark. You can enable client trace-level logging in python with:

tiledb.default_ctx({"vfs.s3.logging_level": "TRACE"})

(Or set that key/value pair on a C++ config object)

When that option is set, TileDB will create a file in the current working directory called tiledb_s3_<date>.log with a lot of details about curl connections and such. Please do not post that logfile publicly because it can contain access key ids; however, it may be useful for you or your IT to look at. If you want, we can also take a look at a log privately, in order to try to debug further – email is isaiah <at> tiledb .com, but note: please do a search/replace to blank out your access key id before sharing!

Best,
Isaiah

I should have started from here, thanks! It gives really good information.
I tried on Ubuntu, where I have the 4 s issue.

I see that there are 4 timeouts of 1000 ms (hey, 4 s :slight_smile:), then it seems to try a new way to connect and boom!
See below:

[ERROR] 2022-01-13 17:06:04.138 CurlHttpClient [X] Curl returned error code 28 - Timeout was reached
[DEBUG] 2022-01-13 17:06:04.138 CurlHandleContainer [X] Destroy curl handle: 0x2bea220
[DEBUG] 2022-01-13 17:06:04.138 CurlHandleContainer [X] Created replacement handle and released to pool: 0x2bea220
[ERROR] 2022-01-13 17:06:04.138 EC2MetadataClient [X] Http request to retrieve credentials failed
[ERROR] 2022-01-13 17:06:04.138 EC2MetadataClient [X] Can not retrive resource from http://X.X.X.X/latest/meta-data/placement/availability-zone
[INFO] 2022-01-13 17:06:04.138 EC2MetadataClient [X] Unable to pull region from instance metadata service 
[INFO] 2022-01-13 17:06:04.138 EC2MetadataClient [X] Creating AWSHttpResourceClient with max connections 2 and scheme http
[INFO] 2022-01-13 17:06:04.138 CurlHandleContainer [X] Initializing CurlHandleContainer with size 2
[INFO] 2022-01-13 17:06:04.138 InstanceProfileCredentialsProvider [X] Creating Instance with default EC2MetadataClient and refresh rate 300000
[INFO] 2022-01-13 17:06:04.138 DefaultAWSCredentialsProviderChain [X] Added EC2 metadata service credentials provider to the provider chain.
=> OK

Is this log enough to tell what isn’t correctly set on my side? Or is there maybe a way to start directly with the 5th connection method?
Creating AWSHttpResourceClient with max connections 2 and scheme http
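To spot this pattern in a long trace file, a small script can count the timeouts and group errors by component. This is a rough sketch: the regular expression is inferred only from the lines pasted above, the function name is hypothetical, and the sample lines are copied from this post.

```python
import re
from collections import Counter

# Line layout inferred from the excerpt: [LEVEL] date time Component [thread] message
LOG_LINE = re.compile(
    r"^\[(?P<level>[A-Z]+)\]\s+(?P<timestamp>\S+ \S+)\s+"
    r"(?P<component>\S+)\s+\[[^\]]*\]\s+(?P<message>.*)$"
)


def summarize(lines):
    """Return (errors per component, number of curl timeouts) for the log lines."""
    errors = Counter()
    timeouts = 0
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the expected layout
        if match["level"] == "ERROR":
            errors[match["component"]] += 1
        if "error code 28" in match["message"]:  # curl code 28 is a timeout
            timeouts += 1
    return errors, timeouts


SAMPLE = """\
[ERROR] 2022-01-13 17:06:04.138 CurlHttpClient [X] Curl returned error code 28 - Timeout was reached
[ERROR] 2022-01-13 17:06:04.138 EC2MetadataClient [X] Http request to retrieve credentials failed
[ERROR] 2022-01-13 17:06:04.138 EC2MetadataClient [X] Can not retrive resource from http://X.X.X.X/latest/meta-data/placement/availability-zone
[INFO] 2022-01-13 17:06:04.138 EC2MetadataClient [X] Unable to pull region from instance metadata service
""".splitlines()

errors, timeouts = summarize(SAMPLE)
print(dict(errors), timeouts)
```

To run it on a real trace, pass the open log file instead of SAMPLE, for example `summarize(open(path))`, where `path` is the tiledb_s3_<date>.log file.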

I think this will be very helpful, let me double-check with someone on the team who knows AWS better – I think what is happening is that we are trying to get some credentials first, and then falling back to the normal AWS key (which works).

Do you happen to have something like “STS token” configured in your AWS configuration?

(I remember that boto might not support the token/role without additional config, so maybe it does not try by default – which might explain the difference)

We are close to success now.
Locally I have just installed aws, run aws configure,
and set the id and private key. Could it be a global configuration from my IT department? Tomorrow I will speak with our “cloud Master” about the ‘STS token’ configuration.

Thanks again.

Based on the log, it looks like you are hitting this issue: Document the AWS_EC2_METADATA_DISABLED environment variable · Issue #5623 · aws/aws-cli · GitHub

So, please try either of the following:

  • set the default AWS region. In the TileDB config you can do:
    tiledb.default_ctx({"vfs.s3.region": "eu-west-3"})
    

-or-

  • export AWS_EC2_METADATA_DISABLED=true (or otherwise set the environment variable)
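Both options can also be applied from inside a Python script. The sketch below is an assumption-laden illustration: the helper name is hypothetical, the TileDB calls are left as comments so the snippet runs without TileDB-Py installed, and setting the variable before importing tiledb is a cautious ordering rather than a documented requirement.

```python
import os


def s3_settings(region: str) -> dict:
    """Disable the EC2 metadata lookup and return a TileDB config dict.

    The environment variable must be in place before the AWS SDK starts
    looking for a region or credentials, so call this early in the process.
    """
    os.environ["AWS_EC2_METADATA_DISABLED"] = "true"
    return {"vfs.s3.region": region}


config = s3_settings("eu-west-3")
print(config)

# With TileDB-Py installed, the dict would then be used like this:
#   import tiledb
#   ctx = tiledb.Ctx(config)
#   tiledb.object_type("s3://<bucket>/<group>", ctx)
```

Exporting the variable in the shell has the same effect and also covers C++ programs.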

No change, but I was already doing tiledb.VFS({'vfs.s3.region': 'us-west-3'}), so…

But setting AWS_EC2_METADATA_DISABLED=true does the trick.

And I have found another way to fix it (a simple one, I would say). I re-ran “aws configure” and changed the default region from [None] to [foobar]… and boom, 400ms.
When I say “foobar”, I really did enter foobar, and it was enough to fix my issue :slight_smile:
But I suppose setting eu-west-3 is a better idea in my case.

Thanks for your great help.

Great, thank you for the update!