Note: my research is funded by InfluxData, which made this blog post possible.
Parquet has become the industry standard for storing columnar data, and reading Parquet efficiently is crucial for query performance.
To optimize this, DataFusion implements advanced Parquet support for effective data pruning and decoding.
However, achieving high performance adds complexity, and Parquet reading is no exception. This post provides an overview of the techniques DataFusion uses to read Parquet files selectively.
The pipeline
0. Parquet file structure
Parquet organizes data into row groups; each row group contains a set of column chunks, and each column chunk contains a set of pages. Read Querying Parquet with Millisecond Latency for more details.
1. Read metadata
DataFusion uses Parquet metadata to understand what is in the file. Reading metadata in the worst case requires two network requests: one to read the footer size, and another to read the footer itself. Decoding metadata is generally fast, but can be slow for very wide tables.
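To make this concrete, here is a minimal sketch using the Rust parquet crate that DataFusion builds on: opening a reader parses the footer, and the decoded metadata exposes the row group / column hierarchy described above. The file name is hypothetical.

```rust
use parquet::file::reader::{FileReader, SerializedFileReader};
use std::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Opening the reader parses the footer and decodes the metadata
    let file = File::open("data.parquet")?;
    let reader = SerializedFileReader::new(file)?;
    let metadata = reader.metadata();

    println!("row groups: {}", metadata.num_row_groups());
    for (i, rg) in metadata.row_groups().iter().enumerate() {
        println!(
            "row group {i}: {} rows, {} column chunks",
            rg.num_rows(),
            rg.num_columns()
        );
    }
    Ok(())
}
```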
DataFusion allows users to cache metadata through the ParquetFileReaderFactory trait.
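As a sketch of how that hook can be used: a custom factory can hand back an AsyncFileReader whose get_metadata returns a previously decoded copy instead of re-fetching and re-parsing the footer. The wrapper below is hypothetical, and the exact AsyncFileReader signatures vary between parquet crate versions:

```rust
use std::{ops::Range, sync::Arc};
use bytes::Bytes;
use futures::future::BoxFuture;
use parquet::arrow::async_reader::AsyncFileReader;
use parquet::errors::Result;
use parquet::file::metadata::ParquetMetaData;

/// Hypothetical wrapper: data reads are delegated to the inner reader,
/// but metadata is served from a cached, already-decoded copy.
struct CachedMetadataReader<R: AsyncFileReader> {
    inner: R,
    cached: Arc<ParquetMetaData>,
}

impl<R: AsyncFileReader> AsyncFileReader for CachedMetadataReader<R> {
    fn get_bytes(&mut self, range: Range<usize>) -> BoxFuture<'_, Result<Bytes>> {
        self.inner.get_bytes(range) // byte ranges still come from storage
    }

    fn get_metadata(&mut self) -> BoxFuture<'_, Result<Arc<ParquetMetaData>>> {
        let meta = Arc::clone(&self.cached);
        Box::pin(async move { Ok(meta) }) // no footer request at all
    }
}
```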
2. Prune by projection
Queries usually won’t select all columns, e.g., SELECT a FROM table only reads column a.
As a columnar format, Parquet allows DataFusion to only read the columns that are needed.
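At the parquet crate level, column pruning boils down to a ProjectionMask. A minimal sketch, assuming a hypothetical table.parquet whose first root column is a:

```rust
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ProjectionMask;
use std::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("table.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Select only root column 0 ("a" in this example); the other
    // columns are never fetched or decoded
    let mask = ProjectionMask::roots(builder.parquet_schema(), [0]);
    let reader = builder.with_projection(mask).build()?;

    for batch in reader {
        let batch = batch?;
        assert_eq!(batch.num_columns(), 1); // only "a" was read
    }
    Ok(())
}
```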
3. Prune by row group stats and Bloom filters
Each row group has basic stats like min/max values for each column. DataFusion applies the query predicates to these stats to prune row groups, e.g., SELECT * FROM table WHERE a > 10 will only read row groups where a has a max value greater than 10.
Sometimes min/max stats are too simple to prune effectively, so Parquet also supports Bloom filters. DataFusion uses Bloom filters when they are available.
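To show the mechanism, here is a sketch that prunes row groups by hand, roughly what DataFusion automates for the query above. It assumes a is an Int64 column at index 0 and a recent parquet crate where typed statistics expose max_opt:

```rust
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::file::statistics::Statistics;
use std::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("table.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Keep a row group only if its max(a) could satisfy `a > 10`
    let keep: Vec<usize> = builder
        .metadata()
        .row_groups()
        .iter()
        .enumerate()
        .filter_map(|(i, rg)| match rg.column(0).statistics() {
            Some(Statistics::Int64(s)) => match s.max_opt() {
                Some(max) if *max > 10 => Some(i), // may contain matches
                Some(_) => None,                   // provably no matches
                None => Some(i),                   // no max recorded
            },
            _ => Some(i), // no stats: cannot prune
        })
        .collect();

    let reader = builder.with_row_groups(keep).build()?;
    for batch in reader {
        let _batch = batch?;
    }
    Ok(())
}
```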
4. Prune by page stats
Parquet optionally supports page-level stats, similar to row group stats but more fine-grained. DataFusion performs page pruning when these stats are present.
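Page-level pruning requires the optional page index to be decoded, which is opt-in at the parquet crate level, and the pruned result is expressed as a RowSelection. A hedged sketch (the selection here is made up, standing in for what evaluating page min/max stats would produce):

```rust
use parquet::arrow::arrow_reader::{
    ArrowReaderOptions, ParquetRecordBatchReaderBuilder, RowSelection, RowSelector,
};
use std::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("table.parquet")?;

    // Ask the reader to also decode the optional page index
    let options = ArrowReaderOptions::new().with_page_index(true);
    let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?;

    // Pretend page min/max stats proved only rows 1000..2000 can match
    let selection = RowSelection::from(vec![
        RowSelector::skip(1000),
        RowSelector::select(1000),
    ]);

    let reader = builder.with_row_selection(selection).build()?;
    for batch in reader {
        let _batch = batch?;
    }
    Ok(())
}
```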
5. Read from storage
Now we (hopefully) have pruned the Parquet file down to a small set of byte ranges. The last step is to make requests to fetch those bytes and decode them into Arrow RecordBatches.
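With object stores this fetching happens asynchronously. A minimal sketch using the async reader over a local tokio file (a real deployment would plug in an object_store-backed reader instead):

```rust
use futures::TryStreamExt;
use parquet::arrow::ParquetRecordBatchStreamBuilder;
use tokio::fs::File;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("table.parquet").await?;

    // The builder fetches and decodes the metadata first...
    let builder = ParquetRecordBatchStreamBuilder::new(file).await?;

    // ...then the stream fetches only the byte ranges that survived
    // pruning and decodes them into Arrow RecordBatches
    let mut stream = builder.build()?;
    while let Some(batch) = stream.try_next().await? {
        println!("decoded {} rows", batch.num_rows());
    }
    Ok(())
}
```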
Bonus: filter pushdown
Filter pushdown, also known as late materialization or predicate pushdown, is a technique that prunes rows dynamically based on data that has already been decoded. Unlike the static pruning methods applied when a file is opened, filter pushdown operates during the scan itself: the columns referenced by a filter are decoded first, the filter is evaluated on them, and only the rows that pass are decoded for the remaining columns.
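At the parquet crate level this takes the form of a RowFilter, where the predicate's boolean mask controls which rows of the remaining columns get decoded at all. A sketch, again assuming a hypothetical table.parquet whose first column a is Int64:

```rust
use arrow::array::{BooleanArray, Int64Array};
use parquet::arrow::arrow_reader::{
    ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter,
};
use parquet::arrow::ProjectionMask;
use std::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("table.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // The predicate only needs column 0 ("a"), so only that column is
    // decoded while evaluating `a > 10`
    let predicate_mask = ProjectionMask::roots(builder.parquet_schema(), [0]);
    let predicate = ArrowPredicateFn::new(predicate_mask, |batch| {
        let a = batch
            .column(0)
            .as_any()
            .downcast_ref::<Int64Array>()
            .expect("column a should be Int64");
        // Rows where the mask is false are skipped before the other
        // columns are materialized
        Ok(a.iter().map(|v| v.map(|v| v > 10)).collect::<BooleanArray>())
    });

    let reader = builder
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .build()?;
    for batch in reader {
        let _batch = batch?;
    }
    Ok(())
}
```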
DataFusion implements filter pushdown, but it is not enabled by default due to some performance regressions.
The community (including me) is working to remove the remaining performance issues and enable it by default.
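Adventurous users can opt in today through DataFusion's pushdown_filters config option. A sketch (the option name matches recent DataFusion releases, but check your version):

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Opt in to evaluating predicates inside the Parquet scan
    let config = SessionConfig::new()
        .set_bool("datafusion.execution.parquet.pushdown_filters", true);
    let ctx = SessionContext::new_with_config(config);

    ctx.register_parquet("t", "table.parquet", ParquetReadOptions::default())
        .await?;
    let df = ctx.sql("SELECT * FROM t WHERE a > 10").await?;
    df.show().await?;
    Ok(())
}
```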