Our raw data unit (a single training instance in a batch) is a page in a document file. A file with id <file_uuid> can have many pages, so we identify each page with the key <file_uuid, page_number>. We then extract various features of that page. Some of them are:
input_id: 1x512. A HuggingFace library output vector for the text on that page. Contains 0’s or 1’s.
attention_mask: 1x512. Similar to the above.
unnormalized_bounding_box: A list of bounding boxes for that page. 2D array.
bounding_box: Same as above
image: 3D array
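
To make the shapes concrete, here is a rough sketch of what one page's record looks like in memory. The names `PageRecord` and `make_dummy_page`, the dtypes, the 4-values-per-box layout, and the image size are assumptions for illustration only, not our real values.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class PageRecord:
    """One training instance: the features extracted from a single page."""

    file_uuid: str
    page_number: int
    input_id: np.ndarray                   # shape (1, 512)
    attention_mask: np.ndarray             # shape (1, 512)
    unnormalized_bounding_box: np.ndarray  # 2D; (n_boxes, 4) is an assumption
    bounding_box: np.ndarray               # 2D, same layout as above
    image: np.ndarray                      # 3D; (channels, height, width) assumed

    @property
    def key(self):
        # Each page is identified by <file_uuid, page_number>.
        return (self.file_uuid, self.page_number)


def make_dummy_page(file_uuid, page_number, n_boxes=3):
    """Build a placeholder record filled with zeros (hypothetical sizes)."""
    return PageRecord(
        file_uuid=file_uuid,
        page_number=page_number,
        input_id=np.zeros((1, 512), dtype=np.int64),
        attention_mask=np.zeros((1, 512), dtype=np.int64),
        unnormalized_bounding_box=np.zeros((n_boxes, 4), dtype=np.float32),
        bounding_box=np.zeros((n_boxes, 4), dtype=np.float32),
        image=np.zeros((3, 224, 224), dtype=np.uint8),
    )
```

The point is that the features have different ranks and dtypes, which is what makes a single "row" hard for me to picture.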
I would like to use tiledb-ml (TileDB-ML/test_pytorch_dataloader_api.py at master · TileDB-Inc/TileDB-ML · GitHub), or perhaps the sparse version? I see there that we have to create an attribute called features, which includes all of the features for a particular ML instance, for me to be able to use it.
I’m having difficulty visualizing how to do that. How do I put all of the above as a single ‘row’ in a TileDB array? Appreciate any guidance.
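
To show what I imagine a single 'row' would require, here is an untested-against-TileDB sketch in plain NumPy: flatten every feature and concatenate them into one vector, keeping the original shapes so the row can be split apart again. The helpers `pack_features` and `unpack_features` are hypothetical names of mine, not part of TileDB or TileDB-ML, and casting everything to float32 is an assumption (it is only exact for integers below 2^24).

```python
import numpy as np


def pack_features(arrays):
    """Flatten each array and concatenate them into one 1D float32 row.

    Returns the row together with the original shapes, which are needed
    to undo the packing later.
    """
    shapes = [np.asarray(a).shape for a in arrays]
    row = np.concatenate(
        [np.asarray(a, dtype=np.float32).ravel() for a in arrays]
    )
    return row, shapes


def unpack_features(row, shapes):
    """Split a packed row back into arrays with the given shapes."""
    out = []
    offset = 0
    for shape in shapes:
        size = int(np.prod(shape))
        out.append(row[offset:offset + size].reshape(shape))
        offset += size
    return out
```

This only gives rows of equal length if every page has the same number of bounding boxes and the same image size; otherwise some padding scheme would be needed. It also throws away the original dtypes, so I am not sure this is the intended way to build the features attribute.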