TileDB for ML: how to dump multiple features into PyTorchTileDBDenseDataset (or Sparse)

Hi,
Our raw data unit (a single training instance in a batch) is a page in a document file. A file with id <file_uuid> can have many pages. We identify each page with the key <file_uuid, page_number>. We then extract various features from that page. Some of them are:

  • input_id: 1x512. A HuggingFace library output vector for the text on that page. Contains 0's or 1's.

  • attention_mask: 1x512. Similar to the above.

  • unnormalized_bounding_box: A list of bounding boxes for that page. 2D array.

  • bounding_box: Same as above

  • image: 3D array

I would like to use PyTorchTileDBDenseDataset from tiledb-ml (TileDB-ML/test_pytorch_dataloader_api.py at master · TileDB-Inc/TileDB-ML · GitHub), or possibly the sparse version.

I see there that we have to create an attribute called features that includes all of the features for a particular ML instance in order to use PyTorchTileDBDenseDataset.

I'm having difficulty visualizing how to do that. How do I put all of the above into a single 'row' in a TileDB array? I'd appreciate any guidance.

Hi Rajiv,

Thanks again for your interest in TileDB! A few quick questions. The number of pages is different per document, right? Moreover, is file_uuid important for your case, i.e., do you need it as a feature while training? Finally, are unnormalized_bounding_box, bounding_box and image stable in terms of shape in all cases?

Hi George,
Thank you very much for your interest :).

  • Yes, the number of pages is different per document. But that is not an issue, as we just consider the raw data to be a concatenated list of pages. For example, if p11 and p12 are in doc1 and p21 is in doc2, then we just make a list of pages like [p11, p12, p21] and then process them page by page (e.g. p11, p12, p21 will each be a page's text).
  • file_uuid is not important in this case and is not used as a feature. I just wanted to give context on our project. Ultimately, the index in the list identifies the page, e.g. p11 is at index 0.
  • unnormalized_bounding_box and bounding_box are currently stable in terms of shape, as we truncate or pad to a fixed size, but I can see us later experimenting with variable-sized boxes.
  • image is stable in terms of shape: (3, 3, 224).

Ok! Because all shapes are stable, the problem lives in the dense world, i.e., you should try the PyTorchTileDBDenseDataset class. At the moment we only support TileDB array schemas with any number of dimensions but only one attribute (we are already planning multiple-attribute support, which will hopefully be out soon), so I will try to map your problem onto this.

The array schema that I would try is the following.

  1. I would define the 2 dimensions.

    • 1st dim is page_number, which is an int of domain (1, Num_Of_Pages_In_Dataset). Here, set the tile extent (tile attribute when defining a dimension) equal to your batch size. This is where we slice while batching.

    • 2nd dim spans all your features reshaped into a single 1D array of the same data type, i.e.,

         * input_id -> 1x512
         * attention_mask -> 1x512
         * unnormalized_bounding_box -> In a 10x10 scenario this will be reshaped as 1x100
         * bounding_box -> Similarly, in a 10x10 scenario this will be reshaped as 1x100 
         * image -> In your (3, 3, 224) scenario, this will be reshaped as 1x2016
      

      I would then concatenate these into a 1x[512 + 512 + 100 + 100 + 2016] → 1x3240 vector, and create the 2nd dimension, which will be an int of domain (1, 3240) with tile extent equal to the dimension's domain range (3240).

  2. I would save all the initial feature shapes in the array's metadata, e.g., {"shape_1": (1, 512), "shape_2": (1, 512), "shape_3": (10, 10), "shape_4": (10, 10), "shape_5": (3, 3, 224)},
    in order to be able to recover the initial shapes of my data (via numpy reshape) at batch read time (see the sketch below).
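
To make this concrete, here is a rough ingestion sketch along these lines. The URI, page count, batch size, helper function and dummy data below are all placeholders (not part of your pipeline), and I'm assuming everything is cast to a common float32 dtype:

```python
import json

import numpy as np
import tiledb

URI = "document_pages"                        # placeholder array location
NUM_PAGES = 100                               # placeholder: total pages in the dataset
FEATURE_LEN = 512 + 512 + 100 + 100 + 2016    # = 3240
BATCH_SIZE = 32                               # placeholder batch size

# Two dense dimensions: pages (tile extent = batch size) and the flattened features.
dom = tiledb.Domain(
    tiledb.Dim(name="page", domain=(1, NUM_PAGES), tile=BATCH_SIZE, dtype=np.int32),
    tiledb.Dim(name="feature", domain=(1, FEATURE_LEN), tile=FEATURE_LEN, dtype=np.int32),
)
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=False,
    attrs=[tiledb.Attr(name="features", dtype=np.float32)],
)
tiledb.Array.create(URI, schema)

def flatten_page(input_id, attention_mask, unnorm_boxes, boxes, image):
    """Concatenate one page's features into a single 1D float32 vector of length 3240."""
    return np.concatenate(
        [np.asarray(a, dtype=np.float32).ravel()
         for a in (input_id, attention_mask, unnorm_boxes, boxes, image)]
    )

# Dummy per-page feature tuples standing in for your real (p11, p12, p21, ...) pipeline.
pages = [
    (np.zeros((1, 512)), np.zeros((1, 512)), np.zeros((10, 10)),
     np.zeros((10, 10)), np.zeros((3, 3, 224)))
    for _ in range(NUM_PAGES)
]
data = np.stack([flatten_page(*page) for page in pages])    # shape (NUM_PAGES, 3240)

with tiledb.open(URI, mode="w") as A:
    A[:] = data
    # Keep the original shapes in the array metadata (as a JSON string) for read time.
    A.meta["feature_shapes"] = json.dumps({
        "input_id": (1, 512),
        "attention_mask": (1, 512),
        "unnormalized_bounding_box": (10, 10),
        "bounding_box": (10, 10),
        "image": (3, 3, 224),
    })
```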

You can check our Jupyter notebooks for ingestion examples using NumPy arrays.
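
At batch read time, recovering the original shapes from that metadata could look roughly like this (again with the same placeholder names, and assuming the JSON metadata layout from the sketch above):

```python
import json

import numpy as np
import tiledb

URI = "document_pages"   # same placeholder array as in the ingestion sketch above

with tiledb.open(URI, mode="r") as A:
    shapes = json.loads(A.meta["feature_shapes"])
    batch = A[:]["features"]    # read all pages here; slice a page range for a real batch

# Split each flattened 3240-wide row back into the original per-feature shapes.
# The dict order matches the concatenation order used at ingestion time.
sizes = [int(np.prod(shape)) for shape in shapes.values()]
offsets = np.cumsum([0] + sizes)
features = {
    name: batch[:, start:stop].reshape((-1, *shape))
    for (name, shape), start, stop in zip(shapes.items(), offsets[:-1], offsets[1:])
}

print({name: arr.shape for name, arr in features.items()})
# e.g. {'input_id': (100, 1, 512), ..., 'image': (100, 3, 3, 224)}
```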

I hope this helps!

Please let us know in case you need extra guidance.


@George_Skoumas Thanks!
I’ll try this out and let you know.

A quick question re: the 1st dimension domain: I think PyTorch Datasets start at index 0? Should I change the domain to be (0, Num_Pages - 1), or is 1-indexing ok?

Hi @RAbraham

Batching in PyTorchTileDBDenseDataset is based on a 0-indexed domain as well, since it follows the PyTorch Datasets API.


Thanks @ktsitsi

So, for the first dimension, should I change the domain from (1, Num_of_Pages_In_Dataset), as suggested above by George, to (0, Num_of_Pages_In_Dataset - 1)? Or does it not matter?

Yes, exactly. The slicing on the 1st dimension starts from offset 0, so please change the domain to (0, Num_of_Pages_In_Dataset - 1).
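
To make that concrete, the page dimension from the earlier sketch would then become something like the following (placeholder values as before):

```python
import numpy as np
import tiledb

NUM_PAGES = 100    # placeholder, as in the ingestion sketch
BATCH_SIZE = 32    # placeholder batch size

# 0-based page dimension, matching PyTorch's 0-indexed batching.
page_dim = tiledb.Dim(
    name="page",
    domain=(0, NUM_PAGES - 1),
    tile=BATCH_SIZE,
    dtype=np.int32,
)
```

The x array built on that schema (together with a label array defined the same way and your batch size) is then what you would hand to PyTorchTileDBDenseDataset, as in the test file linked above; the exact constructor arguments may differ across TileDB-ML versions, so please treat this as a sketch and follow the linked test.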

Please do not hesitate to contact us for any further questions, information or guidance.


Thanks! @ktsitsi
I’ll let you guys know if I have further questions.