Hello!
I’m trying to read a 2D sparse array and am facing two issues:
- Local reads are much slower than I would expect for an array of this kind.
- After uploading the array to S3, reads are much slower than they are locally.
I know that a remote read is expected to be slower, but the difference seems too large, so maybe I’m doing something wrong when reading the data.
Here is part of the array creation code:
TileDB.Config config = new TileDB.Config();
config.set("vfs.s3.scheme", "https");
config.set("vfs.s3.region", "us-east-1");
config.set("vfs.s3.endpoint_override", "");
config.set("vfs.s3.aws_access_key_id", "access_key");
config.set("vfs.s3.aws_secret_access_key", "secret_key");
config.set("vfs.s3.use_virtual_addressing", "True");
using TileDB.Context ctx = new TileDB.Context(config);
using TileDB.Domain dom = new TileDB.Domain(ctx);
dom.add_int32_dimension("x", 10, 5500, 500);
dom.add_int32_dimension("y", -3650, -10, 500);
using TileDB.ArraySchema schema = new TileDB.ArraySchema(ctx, TileDB.ArrayType.TILEDB_SPARSE);
schema.set_domain(dom);
schema.set_order(TileDB.LayoutType.TILEDB_ROW_MAJOR, TileDB.LayoutType.TILEDB_ROW_MAJOR);
schema.set_allows_dups(true);
using TileDB.Attribute attr1 = TileDB.Attribute.create_attribute(ctx, "data1", TileDB.DataType.TILEDB_STRING_ASCII);
using TileDB.Attribute attr2 = TileDB.Attribute.create_attribute(ctx, "data2", TileDB.DataType.TILEDB_INT32);
using TileDB.Attribute attr3 = TileDB.Attribute.create_attribute(ctx, "data3", TileDB.DataType.TILEDB_INT32);
using TileDB.Filter compression = new TileDB.Filter(ctx, TileDB.FilterType.TILEDB_FILTER_GZIP);
using TileDB.FilterList filterList = new TileDB.FilterList(ctx);
filterList.add_filter(compression);
attr1.set_filter_list(filterList);
schema.add_attribute(attr1);
schema.add_attribute(attr2);
schema.add_attribute(attr3);
TileDB.Array.create("s3://bucket_name/group_path/array_name", schema);
And here is the method for reading:
TileDB.Config config = new TileDB.Config();
config.set("vfs.s3.scheme", "https");
config.set("vfs.s3.region", "us-east-1");
config.set("vfs.s3.endpoint_override", "");
config.set("vfs.s3.aws_access_key_id", "access_key");
config.set("vfs.s3.aws_secret_access_key", "secret_key");
config.set("vfs.s3.use_virtual_addressing", "True");
using TileDB.Context ctx = new TileDB.Context(config);
using TileDB.VectorInt32 xVector = TileDB.VectorInt32.Repeat(0, 50000);
using TileDB.VectorInt32 yVector = TileDB.VectorInt32.Repeat(0, 50000);
using TileDB.VectorInt32 data2Vector = TileDB.VectorInt32.Repeat(0, 50000);
using TileDB.VectorInt32 data3Vector = TileDB.VectorInt32.Repeat(0, 50000);
using TileDB.VectorUInt64 data1Offset = TileDB.VectorUInt64.Repeat(0, 50000);
using TileDB.VectorChar data1Vector = TileDB.VectorChar.Repeat(' ', 50000 * 12);
TileDB.VectorInt32 subarray = new TileDB.VectorInt32() { 10, 5500, -3650, -10 };
using TileDB.Array array = new TileDB.Array(ctx, "s3://bucket_name/group_path/array_name", TileDB.QueryType.TILEDB_READ);
TileDB.ArraySchema schema = new TileDB.ArraySchema(ctx, "s3://bucket_name/group_path/array_name");
using TileDB.Query query = new TileDB.Query(ctx, array, TileDB.QueryType.TILEDB_READ);
query.set_layout(TileDB.LayoutType.TILEDB_GLOBAL_ORDER);
query.set_int32_vector_buffer("x", xVector);
query.set_int32_vector_buffer("y", yVector);
query.set_int32_vector_buffer("data2", data2Vector);
query.set_int32_vector_buffer("data3", data3Vector);
query.set_char_vector_buffer_with_offsets("data1", data1Vector, data1Offset);
query.set_int32_subarray(subarray);
query.submit();
using TileDB.MapStringVectorUInt64 bufferElements = query.result_buffer_elements();
array.close();
ulong resultElementOffset = bufferElements["data1"][0];
ulong resultElementSize = bufferElements["data1"][1];
using TileDB.VectorUInt64 dataSizes = new TileDB.VectorUInt64();
for (int i = 0; i < ((int)resultElementOffset - 1); ++i)
{
    dataSizes.Add(data1Offset[i + 1] - data1Offset[i]);
}
dataSizes.Add(resultElementSize * TileDB.EnumUtil.datatype_size(TileDB.DataType.TILEDB_CHAR) - data1Offset[(int)resultElementOffset - 1]);
string[] dataArray = new string[(int)resultElementOffset];
for (int i = 0; i < (int)resultElementOffset; ++i)
{
    dataArray[i] = new string(data1Vector.GetRange((int)data1Offset[i], (int)dataSizes[i]).ToArray());
}
xVector.RemoveRange(dataArray.Length, 50000 - dataArray.Length);
yVector.RemoveRange(dataArray.Length, 50000 - dataArray.Length);
data2Vector.RemoveRange(dataArray.Length, 50000 - dataArray.Length);
data3Vector.RemoveRange(dataArray.Length, 50000 - dataArray.Length);
return (xVector.ToArray(), yVector.ToArray(), dataArray, data2Vector.ToArray(), data3Vector.ToArray());
The array contains approximately 50,000 points with X and Y coordinates of int type, one attribute of string type (values vary in length, with a maximum of 12 characters), and two attributes of int type. I have tried different tile sizes, but the results do not change much.
Local read time: 00:00:04.7878878
S3 read time: 00:00:10.3559373
These timings were obtained with version 2.5.0 of the C# library.
Thank you in advance for your help!