Slow read performance for local and S3 2D sparse array

Hello!

I'm trying to read a 2D sparse array and am facing two issues:

  • The local read is much slower than I would expect for an array of this kind
  • After uploading the array to S3, the read is much slower than it is locally

I know a remote read is expected to be slower, but the difference seems too large, so maybe I'm doing something wrong when reading the data.

Here is part of the array creation code:

TileDB.Config config = new TileDB.Config();
config.set("vfs.s3.scheme", "https");
config.set("vfs.s3.region", "us-east-1");
config.set("vfs.s3.endpoint_override", "");
config.set("vfs.s3.aws_access_key_id", "access_key");
config.set("vfs.s3.aws_secret_access_key", "secret_key");
config.set("vfs.s3.use_virtual_addressing", "True");

using TileDB.Context ctx = new TileDB.Context(config);

using TileDB.Domain dom = new TileDB.Domain(ctx);
dom.add_int32_dimension("x", 10, 5500, 500);
dom.add_int32_dimension("y", -3650, -10, 500);

using TileDB.ArraySchema schema = new TileDB.ArraySchema(ctx, TileDB.ArrayType.TILEDB_SPARSE);
schema.set_domain(dom);
schema.set_order(TileDB.LayoutType.TILEDB_ROW_MAJOR, TileDB.LayoutType.TILEDB_ROW_MAJOR);
schema.set_allows_dups(true);

using TileDB.Attribute attr1 = TileDB.Attribute.create_attribute(ctx, "data1", TileDB.DataType.TILEDB_STRING_ASCII);
using TileDB.Attribute attr2 = TileDB.Attribute.create_attribute(ctx, "data2", TileDB.DataType.TILEDB_INT32);
using TileDB.Attribute attr3 = TileDB.Attribute.create_attribute(ctx, "data3", TileDB.DataType.TILEDB_INT32);

using TileDB.Filter compression = new TileDB.Filter(ctx, TileDB.FilterType.TILEDB_FILTER_GZIP);
using TileDB.FilterList filterList = new TileDB.FilterList(ctx);
filterList.add_filter(compression);
attr1.set_filter_list(filterList);
schema.add_attribute(attr1);
schema.add_attribute(attr2);
schema.add_attribute(attr3);

TileDB.Array.create("s3://bucket_name/group_path/array_name", schema);

And here is the method for reading:

TileDB.Config config = new TileDB.Config();
config.set("vfs.s3.scheme", "https");
config.set("vfs.s3.region", "us-east-1");
config.set("vfs.s3.endpoint_override", "");
config.set("vfs.s3.aws_access_key_id", "access_key");
config.set("vfs.s3.aws_secret_access_key", "secret_key");
config.set("vfs.s3.use_virtual_addressing", "True");

using TileDB.Context ctx = new TileDB.Context(config);

using TileDB.VectorInt32 xVector = TileDB.VectorInt32.Repeat(0, 50000);
using TileDB.VectorInt32 yVector = TileDB.VectorInt32.Repeat(0, 50000);
using TileDB.VectorInt32 data2 = TileDB.VectorInt32.Repeat(0, 50000);
using TileDB.VectorInt32 data3 = TileDB.VectorInt32.Repeat(0, 50000);
using TileDB.VectorUInt64 data1Offset = TileDB.VectorUInt64.Repeat(0, 50000);
using TileDB.VectorChar data1Vector = TileDB.VectorChar.Repeat(' ', 50000 * 12);

TileDB.VectorInt32 subarray = new TileDB.VectorInt32() { 10, 5500, -3650, -10 };

using TileDB.Array array = new TileDB.Array(ctx, "s3://bucket_name/group_path/array_name", TileDB.QueryType.TILEDB_READ);
TileDB.ArraySchema schema = new TileDB.ArraySchema(ctx, "s3://bucket_name/group_path/array_name");

using TileDB.Query query = new TileDB.Query(ctx, array, TileDB.QueryType.TILEDB_READ);
query.set_layout(TileDB.LayoutType.TILEDB_GLOBAL_ORDER);

query.set_int32_vector_buffer("x", xVector);
query.set_int32_vector_buffer("y", yVector);
query.set_int32_vector_buffer("data2", data2);
query.set_int32_vector_buffer("data3", data3);
query.set_char_vector_buffer_with_offsets("data1", data1Vector, data1Offset);

query.set_int32_subarray(subarray);

query.submit();
using TileDB.MapStringVectorUInt64 bufferElements = query.result_buffer_elements();
array.close();

ulong resultElementOffset = bufferElements["data1"][0];
ulong resultElementSize = bufferElements["data1"][1];
using TileDB.VectorUInt64 dataSizes = new TileDB.VectorUInt64();

for (int i = 0; i < ((int)resultElementOffset - 1); ++i)
{
    dataSizes.Add(data1Offset[i + 1] - data1Offset[i]);
}
dataSizes.Add(resultElementSize * TileDB.EnumUtil.datatype_size(TileDB.DataType.TILEDB_CHAR) - data1Offset[(int)resultElementOffset - 1]);

string[] dataArray = new string[(int)resultElementOffset];
for (int i = 0; i < (int)resultElementOffset; ++i)
{
    dataArray[i] = new string(data1Vector.GetRange((int)data1Offset[i], (int)dataSizes[i]).ToArray());
}

xVector.RemoveRange(dataArray.Length, 50000 - dataArray.Length);
yVector.RemoveRange(dataArray.Length, 50000 - dataArray.Length);
data2.RemoveRange(dataArray.Length, 50000 - dataArray.Length);
data3.RemoveRange(dataArray.Length, 50000 - dataArray.Length);

return (xVector.ToArray(), yVector.ToArray(), dataArray, data2.ToArray(), data3.ToArray());

The array contains approximately 50000 points with X and Y coordinates of int type, one attribute of string type (values vary in length, with a maximum of 12 characters) and two attributes of int type. I have tried different tile sizes, but the result does not change much.
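
The offset arithmetic that the read method applies to the string attribute can be sketched on its own, without TileDB. This is a minimal illustration, assuming the query fills a flat byte buffer plus an array of per-cell start offsets and reports how many cells and bytes are valid. The class name VarLengthCells, the method Unpack and the sample data are hypothetical and are not part of the TileDB C# API.

```csharp
using System;
using System.Text;

public static class VarLengthCells
{
    // Splits a flat byte buffer into strings using per-cell start offsets.
    // offsets[i] is the byte position where cell i starts; the last cell
    // ends at dataSize, the number of valid bytes in the buffer.
    public static string[] Unpack(byte[] data, ulong[] offsets, int cellCount, int dataSize)
    {
        var result = new string[cellCount];
        for (int i = 0; i < cellCount; i++)
        {
            int start = (int)offsets[i];
            int end = (i + 1 < cellCount) ? (int)offsets[i + 1] : dataSize;
            result[i] = Encoding.ASCII.GetString(data, start, end - start);
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical buffer contents after a read: three cells "ab", "cde", "f",
        // followed by unused padding from the over-allocated buffers.
        byte[] data = Encoding.ASCII.GetBytes("abcdef      ");
        ulong[] offsets = { 0, 2, 5, 0, 0 };
        string[] cells = Unpack(data, offsets, cellCount: 3, dataSize: 6);
        Console.WriteLine(string.Join(",", cells)); // prints "ab,cde,f"
    }
}
```

Because the end of each cell is taken from the next offset, and only the last cell uses the reported data size, an empty result (zero cells) returns an empty array and never indexes offsets at -1.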

Local read time: 00:00:04.7878878
S3 read time: 00:00:10.3559373

These timings were obtained with version 2.5.0 of the C# library.

Thank you in advance for your help!

Hello,
We tried to create and read a sparse array with 50000 points similar to your case. xVector and yVector are IList. It seems that you prefer T[]. In that case, it is better to use the TileDBBuffer class, which has a T[] Data member. Here is the example: Sparse2DArrayBenchmark.cs on the bd/sparse-read-bench branch of TileDB-Inc/TileDB-CSharp on GitHub.
The example generates 50000 random strings as data. Reading took around 10 milliseconds for the local array and around 1 second for the S3 array.

Please retry your case with TileDBBuffer and let us know the read times for local and S3. Thanks.

Thanks for your reply!

It really helped: it now reads in 00:00:00.2554022 locally and in 3.5 to 4.5 seconds from S3 (over several runs). I'm not sure why my results differ from yours; it is probably related to my PC configuration and the connection speed to S3, but in any case it is much faster than it was.

Now I'm facing another issue: when I execute two separate reads from the Main method (first a local read, then a read from S3), the program completes the first read successfully and then throws an AccessViolationException.
Full error message (I was not sure what kind of token appeared in the error, so I replaced it):

Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
Repeat 2 times:
--------------------------------
   at TileDB.tiledbcsPINVOKE.VectorUInt64_Clear(System.Runtime.InteropServices.HandleRef)
--------------------------------
   at TileDB.VectorUInt64.Clear()
   at TileDB.TileDBBuffer`1[[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=replaced_token]].Release()
   at TileDB.TileDBBuffer`1[[System.__Canon, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=replaced_token]].Finalize()

I was just using two identical methods with different array URIs, with TileDB.Config removed for the local read.

I used the debugger and found that it fails when the code tries to open an Array. I call array.close() and have also tried the using keyword for this object, so I'm not sure what the issue is with opening an array a second time. I also tried reading a different array, but the same exception was thrown.
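
A note on the stack trace: the crash happens inside the TileDBBuffer finalizer, which calls Release() and then VectorUInt64.Clear(). One possible explanation, which is an assumption and is not confirmed in this thread, is that the wrapped vector had already been finalized by the time the buffer's finalizer ran, since .NET does not guarantee the order in which finalizers run. The standard dispose pattern avoids this by touching other managed objects only on the explicit Dispose() path. The sketch below uses hypothetical stand-in types (NativeVector and SafeBuffer), not TileDB classes.

```csharp
using System;

// Hypothetical stand-in for a wrapper around a native vector; not a TileDB type.
public sealed class NativeVector : IDisposable
{
    public bool Released { get; private set; }
    public void Dispose() => Released = true;
}

// Hypothetical buffer type that owns a NativeVector.
public class SafeBuffer : IDisposable
{
    private readonly NativeVector _offsets = new NativeVector();
    private bool _disposed;

    public bool OffsetsReleased => _offsets.Released;

    public void Dispose()
    {
        Dispose(disposing: true);
        // The finalizer has nothing left to do once Dispose has run.
        GC.SuppressFinalize(this);
    }

    ~SafeBuffer() => Dispose(disposing: false);

    protected virtual void Dispose(bool disposing)
    {
        if (_disposed) return;
        if (disposing)
        {
            // Explicit path: called from user code, so _offsets is still valid.
            _offsets.Dispose();
        }
        // Finalizer path (disposing == false): managed members are left alone
        // because the GC may already have finalized them.
        _disposed = true;
    }
}

public static class Program
{
    public static void Main()
    {
        var buffer = new SafeBuffer();
        buffer.Dispose();
        buffer.Dispose(); // a second call is a no-op
        Console.WriteLine(buffer.OffsetsReleased); // prints "True"
    }
}
```

Whether this matches what TileDBBuffer does internally is not known from the thread; it only illustrates why a finalizer that clears another wrapper object can fail.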

Thanks for letting us know the read times for local and S3. I am trying to reproduce the problem you mentioned above and will let you know if I find the reason. Are you using TileDB.CSharp version 2.5.0 or version 2.4.11? If you are using 2.5.0, could you try version 2.4.11?

I have tried versions 2.4.7, 2.4.9, 2.4.11 and 2.5.0, but the issue appears in all of them.
The code I'm using to read:

public static (int[], int[], string[], int[], int[]) ReadBufferArray(string arrayUri, int elementCount)
{
	Stopwatch sw = new Stopwatch();
	sw.Start();

	TileDB.Config config = new TileDB.Config();

	config.set("vfs.s3.scheme", "https");
	config.set("vfs.s3.region", "us-east-1");
	config.set("vfs.s3.endpoint_override", "");
	config.set("vfs.s3.aws_access_key_id", "access_key");
	config.set("vfs.s3.aws_secret_access_key", "secret_key");
	config.set("vfs.s3.use_virtual_addressing", "True");

	using TileDB.Context ctx = new TileDB.Context(config);

	// Declaring buffers
	TileDB.TileDBBuffer<int> xBuffer = new TileDB.TileDBBuffer<int>();
	TileDB.TileDBBuffer<int> yBuffer = new TileDB.TileDBBuffer<int>();
	TileDB.TileDBBuffer<string> data1Buffer = new TileDB.TileDBBuffer<string>();
	TileDB.TileDBBuffer<int> data2Buffer = new TileDB.TileDBBuffer<int>();
	TileDB.TileDBBuffer<int> data3Buffer = new TileDB.TileDBBuffer<int>();

	// Initializing buffers
	xBuffer.Init(elementCount, false, false);
	yBuffer.Init(elementCount, false, false);
	data1Buffer.Init(elementCount, true, false, elementCount * 16);
	data2Buffer.Init(elementCount, false, false);
	data3Buffer.Init(elementCount, false, false);

	using TileDB.Array array = new TileDB.Array(ctx, arrayUri, TileDB.QueryType.TILEDB_READ);

	using TileDB.Query query = new TileDB.Query(ctx, array, TileDB.QueryType.TILEDB_READ);

	query.set_layout(TileDB.LayoutType.TILEDB_UNORDERED);
	query.set_buffer("x", xBuffer.DataIntPtr, xBuffer.BufferSize, xBuffer.ElementDataSize);
	query.set_buffer("y", yBuffer.DataIntPtr, yBuffer.BufferSize, yBuffer.ElementDataSize);
	query.set_buffer_with_offsets("data1", data1Buffer.DataIntPtr, data1Buffer.BufferSize, data1Buffer.ElementDataSize, data1Buffer.Offsets);
	query.set_buffer("data2", data2Buffer.DataIntPtr, data2Buffer.BufferSize, data2Buffer.ElementDataSize);
	query.set_buffer("data3", data3Buffer.DataIntPtr, data3Buffer.BufferSize, data3Buffer.ElementDataSize);

	query.submit();
	query.finalize();

	using TileDB.MapStringVectorUInt64 resultBufferElements = query.result_buffer_elements();

	array.close();

	ulong dataLength = 0;
	ulong bufferSize = 0;

	foreach (var kv in resultBufferElements)
	{
		if (kv.Key == "data1")
		{
			dataLength = kv.Value[0];
			bufferSize = kv.Value[1];
		}
	}

	string[] dataString = data1Buffer.UnPackStringArray((int)bufferSize, (int)dataLength);

	sw.Stop();
	Console.WriteLine($"Read time: {sw.Elapsed}");

	return (xBuffer.Data, yBuffer.Data, dataString, data2Buffer.Data, data3Buffer.Data);
}

Maybe something is wrong in the code I'm using, so please let me know if I need to fix something in this method to make it work.

Hi! Are there any updates on this?

Hi @vodohleb, we will release a C# package update soon with a lot of improvements, along with an update to the TileDB Core version. @Bin_Deng is going to update and test the example above with the latest versions.