TileDB for ML: how to dump multiple features into PyTorchTileDBDenseDataset (or Sparse)

Hi,
Our raw data unit (a single training instance in a batch) is a page in a document file. A file with id <file_uuid> can have many pages. We identify each page with the key <file_uuid, page_number>. We then extract various features from that page. Some of them are:

  • input_id: 1x512. A HuggingFace library output vector for the text on that page. Contains 0's or 1's.

  • attention_mask: 1x512. Similar to the above.

  • unnormalized_bounding_box: A list of bounding boxes for that page. 2D array.

  • bounding_box: Same as above

  • image: 3D array

I would like to use PyTorchTileDBDenseDataset from tiledb-ml (TileDB-ML/test_pytorch_dataloader_api.py at master · TileDB-Inc/TileDB-ML · GitHub), or possibly the sparse version.

I see there that we have to create an attribute called features that includes all of the features for a particular ML instance in order to use PyTorchTileDBDenseDataset.

I'm having difficulty visualizing how to do that. How do I put all of the above into a single 'row' in a TileDB array? I'd appreciate any guidance.

Hi Rajiv,

Thanks again for your interest in TileDB! A few quick questions. The number of pages is different per document, right? Moreover, is file_uuid important for your case, i.e., do you need it as a feature while training? Finally, are unnormalized_bounding_box, bounding_box and image stable in terms of shape in all cases?

Hi George,
Thank you very much for your interest :).

  • Yes, the number of pages is different per document. But that is not an issue, as we just consider the raw data to be a concatenated list of pages. For example, if p11 and p12 are in doc1 and p21 is in doc2, then we just make a list of pages like [p11, p12, p21] and then process them page by page (e.g. p11, p12, p21 will each be a page's text).
  • file_uuid is not important in this case and is not used as a feature. I just wanted to give context on our project. Ultimately, the index in the list identifies the page, e.g. p11 is at index 0.
  • unnormalized_bounding_box and bounding_box are currently stable in terms of shape, as we truncate or pad to a fixed size, but I can see us later experimenting with variable-sized boxes.
  • image is stable in terms of shape: (3, 3, 224).

Ok! Because all shapes are stable, the problem lives in the dense world, i.e., you should try the PyTorchTileDBDenseDataset class. At the moment we only support TileDB array schemas with any number of dimensions but only one attribute (we are already planning multiple-attribute support, which will hopefully be out soon), so I will try to map your problem onto this.

The array schema that I would try is the following.

  1. I would define the 2 dimensions.

    • 1st dim is page_number, which is an int of domain (1, Num_Of_Pages_In_Dataset). Here, set the tile extent (tile attribute when defining a dimension) equal to your batch size. This is where we slice while batching.

    • 2nd dim spans all your features reshaped into a single 1D array of the same data type, i.e.,

         * input_id -> 1x512
         * attention_mask -> 1x512
         * unnormalized_bounding_box -> In a 10x10 scenario this will be reshaped as 1x100
         * bounding_box -> Similarly, in a 10x10 scenario this will be reshaped as 1x100 
         * image -> In your (3, 3, 224) scenario, this will be reshaped as 1x2016
      

      I would then concatenate these into a 1x[512 + 512 + 100 + 100 + 2016] → 1x3240 vector, and create the 2nd dimension, which will be an int of domain (1, 3240) with tile extent equal to the dimension's domain range (3240).

  2. I would save all the initial feature shapes in the array's metadata, e.g., {"shape_1": (1, 512), "shape_2": (1, 512), "shape_3": (10, 10), "shape_4": (10, 10), "shape_5": (3, 3, 224)},
    in order to be able to recover the initial shapes of my data (via numpy reshape) at batch read time (see the sketch below).
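
To make this concrete, here is a rough ingestion sketch along these lines. The URI, page count, batch size, helper function and dummy data below are all placeholders (not part of your pipeline), and I'm assuming everything is cast to a common float32 dtype:

```python
import json

import numpy as np
import tiledb

URI = "document_pages"                        # placeholder array location
NUM_PAGES = 100                               # placeholder: total pages in the dataset
FEATURE_LEN = 512 + 512 + 100 + 100 + 2016    # = 3240
BATCH_SIZE = 32                               # placeholder batch size

# Two dense dimensions: pages (tile extent = batch size) and the flattened features.
dom = tiledb.Domain(
    tiledb.Dim(name="page", domain=(1, NUM_PAGES), tile=BATCH_SIZE, dtype=np.int32),
    tiledb.Dim(name="feature", domain=(1, FEATURE_LEN), tile=FEATURE_LEN, dtype=np.int32),
)
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=False,
    attrs=[tiledb.Attr(name="features", dtype=np.float32)],
)
tiledb.Array.create(URI, schema)

def flatten_page(input_id, attention_mask, unnorm_boxes, boxes, image):
    """Concatenate one page's features into a single 1D float32 vector of length 3240."""
    return np.concatenate(
        [np.asarray(a, dtype=np.float32).ravel()
         for a in (input_id, attention_mask, unnorm_boxes, boxes, image)]
    )

# Dummy per-page feature tuples standing in for your real (p11, p12, p21, ...) pipeline.
pages = [
    (np.zeros((1, 512)), np.zeros((1, 512)), np.zeros((10, 10)),
     np.zeros((10, 10)), np.zeros((3, 3, 224)))
    for _ in range(NUM_PAGES)
]
data = np.stack([flatten_page(*page) for page in pages])    # shape (NUM_PAGES, 3240)

with tiledb.open(URI, mode="w") as A:
    A[:] = data
    # Keep the original shapes in the array metadata (as a JSON string) for read time.
    A.meta["feature_shapes"] = json.dumps({
        "input_id": (1, 512),
        "attention_mask": (1, 512),
        "unnormalized_bounding_box": (10, 10),
        "bounding_box": (10, 10),
        "image": (3, 3, 224),
    })
```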

You can check our Jupyter notebooks for ingestion examples using NumPy arrays.
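
At batch read time, recovering the original shapes from that metadata could look roughly like this (again with the same placeholder names, and assuming the JSON metadata layout from the sketch above):

```python
import json

import numpy as np
import tiledb

URI = "document_pages"   # same placeholder array as in the ingestion sketch above

with tiledb.open(URI, mode="r") as A:
    shapes = json.loads(A.meta["feature_shapes"])
    batch = A[:]["features"]    # read all pages here; slice a page range for a real batch

# Split each flattened 3240-wide row back into the original per-feature shapes.
# The dict order matches the concatenation order used at ingestion time.
sizes = [int(np.prod(shape)) for shape in shapes.values()]
offsets = np.cumsum([0] + sizes)
features = {
    name: batch[:, start:stop].reshape((-1, *shape))
    for (name, shape), start, stop in zip(shapes.items(), offsets[:-1], offsets[1:])
}

print({name: arr.shape for name, arr in features.items()})
# e.g. {'input_id': (100, 1, 512), ..., 'image': (100, 3, 3, 224)}
```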

I hope this helps!

Please let us know in case you need extra guidance.


@George_Skoumas Thanks!
I’ll try this out and let you know.

A quick question re: the 1st dimension domain: I think PyTorch Datasets start at index 0? Should I change the domain to be (0, Num_Pages - 1), or is 1-indexing ok?

Hi @RAbraham

Batching in PyTorchTileDBDenseDataset is based on a 0-indexed domain as well, since it follows the PyTorch Datasets API.


Thanks @ktsitsi

So, for the first dimension, should I change the domain from (1, Num_of_Pages_In_Dataset), as suggested above by George, to (0, Num_of_Pages_In_Dataset - 1)? Or does it not matter?

Yes, exactly. The slicing on the 1st dimension starts from offset 0, so please change the domain to (0, Num_of_Pages_In_Dataset - 1).
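
To make that concrete, the page dimension from the earlier sketch would then become something like the following (placeholder values as before):

```python
import numpy as np
import tiledb

NUM_PAGES = 100    # placeholder, as in the ingestion sketch
BATCH_SIZE = 32    # placeholder batch size

# 0-based page dimension, matching PyTorch's 0-indexed batching.
page_dim = tiledb.Dim(
    name="page",
    domain=(0, NUM_PAGES - 1),
    tile=BATCH_SIZE,
    dtype=np.int32,
)
```

The x array built on that schema (together with a label array defined the same way and your batch size) is then what you would hand to PyTorchTileDBDenseDataset, as in the test file linked above; the exact constructor arguments may differ across TileDB-ML versions, so please treat this as a sketch and follow the linked test.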

Please do not hesitate to contact us for any further questions, information or guidance.


Thanks! @ktsitsi
I’ll let you guys know if I have further questions.