Hello,
I have a Parquet file that contains a Samples x Features matrix (all features are the same data type), and I would like to create a TileDB object where the row number and the feature number are the dimensions, so that I can access sample 1, features 1:2 as object[1, 1:2]. I found a tutorial on importing such files as 1D arrays, where the columns become the attributes, but what I want is a single attribute per cell. from_pandas has a parameter to set a column as a dimension, but that would mean converting the data from wide format to long format, which sounds very inefficient.
A possible solution would be to load the whole matrix into memory, create the TileDB object, and do a single write, but unfortunately the file is too big to fit in memory.
So far my best idea is to read the Parquet file, write it to the array in chunks, and then consolidate, but I'm wondering if there is an easier approach.
thnx!
Just bumping this. @nguyenv, maybe you would have some insight on this? thnx!
Hi @yoshi,
Thanks for the question and sorry for the delay!
For writing large files that won't fit into memory, I recommend that you first create the array. During array creation, you define the schema exactly as you'd like, for example with a single attribute per cell.
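Here's a minimal sketch of that step with the Python API. The URI, dimension sizes, tile extents, and dtype are placeholders you'd adjust to your data:

```python
import numpy as np
import tiledb

n_samples, n_features = 100_000, 5_000  # assumed shape of your matrix

dom = tiledb.Domain(
    tiledb.Dim(name="sample", domain=(0, n_samples - 1), tile=1_000, dtype=np.int64),
    tiledb.Dim(name="feature", domain=(0, n_features - 1), tile=n_features, dtype=np.int64),
)
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=False,  # dense 2D array: rows x features
    attrs=[tiledb.Attr(name="value", dtype=np.float64)],  # single attribute per cell
)
tiledb.Array.create("my_matrix", schema)  # "my_matrix" is a placeholder URI
```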
Once the (empty) array exists on disk, you can batch-read from the original file and write each batch to your array, as sketched below. Because TileDB arrays support parallel I/O, you can also run these writes in parallel for a faster ingestion process. Consolidating after all of these writes is a great idea as well, since it will improve downstream read performance!
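A rough sketch of the chunked ingestion, streaming the Parquet file with pyarrow (the file path, batch size, and the assumption that every column can be cast to the attribute's dtype are all placeholders):

```python
import numpy as np
import pyarrow.parquet as pq
import tiledb

pf = pq.ParquetFile("samples_by_features.parquet")  # placeholder path

row_start = 0
with tiledb.open("my_matrix", mode="w") as arr:
    for batch in pf.iter_batches(batch_size=10_000):
        # Stack the batch's columns into a (rows, features) block
        # matching the attribute dtype defined in the schema.
        block = np.column_stack(
            [batch.column(i).to_numpy(zero_copy_only=False) for i in range(batch.num_columns)]
        ).astype(np.float64)
        arr[row_start:row_start + block.shape[0], :] = block
        row_start += block.shape[0]

# Merge the fragments produced by the chunked writes,
# then remove the now-redundant original fragments.
tiledb.consolidate("my_matrix")
tiledb.vacuum("my_matrix")
```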
Once your data is written to a TileDB array, you'll be able to perform out-of-core queries on your dataset without running into the memory constraints typical of a large tabular file.
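For instance, reading just sample 1, features 1 and 2 (same placeholder URI as above) only fetches the tiles that overlap that slice, not the whole matrix:

```python
import tiledb

with tiledb.open("my_matrix", mode="r") as arr:
    subset = arr[1, 1:3]["value"]  # sample index 1, feature indices 1 and 2 (half-open slice)
print(subset)
```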
Check out our documentation site, TileDB Academy, if you need help with any of the above!
Best,
Spencer