Hello,
I have a Parquet file that contains a Samples x Features matrix (all features are the same data type), and I would like to create a TileDB object where the row number and the feature number are the dimensions, so that I can access sample 1, features 1:2 as object[1, 1:2]. I found a tutorial on importing such files as 1D arrays, where the columns become the attributes, but what I want is a single attribute per cell. from_pandas has a parameter to set a column as a dimension, but that would mean converting the data from wide format to long format, which sounds very inefficient.
A possible solution would be to load the whole matrix into memory, create the TileDB object, and do a single write, but unfortunately the file is too big to fit in memory.
So far my best idea is to read the Parquet file, write it to the array in chunks, and then consolidate, but I'm wondering if there is an easier approach.
thnx!
Just bumping this. @nguyenv, maybe you would have some insight on this? thnx!
Hi @yoshi,
Thanks for the question and sorry for the delay!
For writing large files that won't fit into memory, I recommend that you first create the array. During array creation, you define the schema exactly as you'd like, for example with a single attribute per cell.
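Here's a minimal sketch of that step with the Python API. The URI, dimension sizes, tile extents, and dtype are placeholders you'd adjust to your data:

```python
import numpy as np
import tiledb

n_samples, n_features = 100_000, 5_000  # assumed shape of your matrix

dom = tiledb.Domain(
    tiledb.Dim(name="sample", domain=(0, n_samples - 1), tile=1_000, dtype=np.int64),
    tiledb.Dim(name="feature", domain=(0, n_features - 1), tile=n_features, dtype=np.int64),
)
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=False,  # dense 2D array: rows x features
    attrs=[tiledb.Attr(name="value", dtype=np.float64)],  # single attribute per cell
)
tiledb.Array.create("my_matrix", schema)  # "my_matrix" is a placeholder URI
```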
Once the (empty) array exists on disk, you can batch-read from the original file and write each batch to your array, as sketched below. Because TileDB arrays support parallel I/O, you can also run these writes in parallel for a faster ingestion process. Consolidating after all of these writes is a great idea as well, since it will improve downstream read performance!
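A rough sketch of the chunked ingestion, streaming the Parquet file with pyarrow (the file path, batch size, and the assumption that every column can be cast to the attribute's dtype are all placeholders):

```python
import numpy as np
import pyarrow.parquet as pq
import tiledb

pf = pq.ParquetFile("samples_by_features.parquet")  # placeholder path

row_start = 0
with tiledb.open("my_matrix", mode="w") as arr:
    for batch in pf.iter_batches(batch_size=10_000):
        # Stack the batch's columns into a (rows, features) block
        # matching the attribute dtype defined in the schema.
        block = np.column_stack(
            [batch.column(i).to_numpy(zero_copy_only=False) for i in range(batch.num_columns)]
        ).astype(np.float64)
        arr[row_start:row_start + block.shape[0], :] = block
        row_start += block.shape[0]

# Merge the fragments produced by the chunked writes,
# then remove the now-redundant original fragments.
tiledb.consolidate("my_matrix")
tiledb.vacuum("my_matrix")
```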
Once your data is written to a TileDB array, you'll be able to perform out-of-core queries on your dataset without running into the memory constraints typical of a large tabular file.
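For instance, reading just sample 1, features 1 and 2 (same placeholder URI as above) only fetches the tiles that overlap that slice, not the whole matrix:

```python
import tiledb

with tiledb.open("my_matrix", mode="r") as arr:
    subset = arr[1, 1:3]["value"]  # sample index 1, feature indices 1 and 2 (half-open slice)
print(subset)
```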
Check out our documentation site, TileDB Academy, if you need help with any of the above!
Best,
Spencer