perf(reader): Improve the performance of parquet reader #230

Open
discivigour wants to merge 17 commits into apache:main from discivigour:perfOptimize
Conversation

@discivigour
Contributor

Purpose

  • Introduce coalesced reads
  • Introduce metadata prefetching
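
For readers unfamiliar with the term, coalesced reads usually mean merging nearby byte ranges into fewer, larger fetches. The following is a minimal sketch of that idea, not the code in this PR: the helper name `coalesce_ranges` and the `max_gap` parameter are assumptions made for illustration, and the input is assumed to be half-open byte ranges.

```rust
use std::ops::Range;

/// Hypothetical helper (not from the PR): merge byte ranges whose gap is
/// at most `max_gap` bytes, so they can be served by a single fetch.
fn coalesce_ranges(ranges: &[Range<u64>], max_gap: u64) -> Vec<Range<u64>> {
    // Work on a copy sorted by start offset.
    let mut sorted: Vec<Range<u64>> = ranges.to_vec();
    sorted.sort_by_key(|r| r.start);

    let mut merged: Vec<Range<u64>> = Vec::with_capacity(sorted.len());
    for r in sorted {
        if let Some(last) = merged.last_mut() {
            // Extend the previous range when this one overlaps it or
            // starts within `max_gap` bytes of its end.
            if r.start <= last.end.saturating_add(max_gap) {
                last.end = last.end.max(r.end);
                continue;
            }
        }
        merged.push(r);
    }
    merged
}
```

With a 4-byte gap threshold, the ranges `0..10` and `12..20` would be merged into `0..20`, while `100..110` stays separate. The trade-off is reading a few unneeded bytes in exchange for fewer round trips.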

Brief change log

Tests

API and Format

Documentation

let fetch_bytes = &fetched[idx];
let start = (range.start - fetch_range.start) as usize;
let end = (range.end - fetch_range.start) as usize;
fetch_bytes.slice(start..end.min(fetch_bytes.len()))
Contributor

If the data returned by fetch is shorter than expected, this silently returns truncated data, and downstream code may receive incomplete column chunks, leading to parsing errors or data corruption. I suggest asserting or returning an error here rather than silently swallowing the problem.
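
One possible shape for this, as a minimal sketch rather than the actual fix: the helper name `slice_fetched` and the `String` error type are assumptions, and a plain `&[u8]` stands in for the buffer type used in the PR so the example needs only the standard library.

```rust
use std::ops::Range;

/// Hypothetical helper (not from the PR): cut `range` out of the buffer that
/// was fetched for `fetch_range`, returning an error instead of truncating
/// when the buffer is shorter than the requested range.
fn slice_fetched<'a>(
    fetched: &'a [u8],
    fetch_range: &Range<u64>,
    range: &Range<u64>,
) -> Result<&'a [u8], String> {
    // Offsets of `range` relative to the start of the fetched buffer.
    let start = range
        .start
        .checked_sub(fetch_range.start)
        .ok_or_else(|| format!("range {:?} starts before fetch range {:?}", range, fetch_range))?
        as usize;
    let end = range
        .end
        .checked_sub(fetch_range.start)
        .ok_or_else(|| format!("range {:?} ends before fetch range {:?}", range, fetch_range))?
        as usize;

    // `get` returns None when `end` exceeds the buffer length or
    // `start > end`, so a short read becomes an explicit error.
    fetched.get(start..end).ok_or_else(|| {
        format!(
            "fetched {} bytes for {:?}, which does not cover {:?}",
            fetched.len(),
            fetch_range,
            range
        )
    })
}
```

The design choice here is to surface a short read at the point where it is detected, so a caller sees a descriptive error rather than a parse failure further downstream.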

Contributor Author

Good point, I will change it.

Ok(ranges
.iter()
.map(|range| {
let idx = fetch_ranges.partition_point(|v| v.start <= range.start) - 1;
Contributor

If partition_point returns 0, the `- 1` here underflows usize: that panics in debug builds, and in release builds it wraps, so the later index goes out of bounds. Although logically fetch_ranges must cover all original ranges, this is an implicit assumption. I recommend adding a debug_assert!, or using checked_sub with a more explicit error message.
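
A minimal sketch of the `checked_sub` variant, under the assumption that `fetch_ranges` is sorted by start offset and non-overlapping. The helper name `find_fetch_index` and the `String` error type are hypothetical and not taken from the PR.

```rust
use std::ops::Range;

/// Hypothetical helper (not from the PR): find the index of the fetch range
/// that contains `range`. Assumes `fetch_ranges` is sorted by start offset
/// and non-overlapping.
fn find_fetch_index(fetch_ranges: &[Range<u64>], range: &Range<u64>) -> Result<usize, String> {
    // `partition_point` counts the fetch ranges that start at or before
    // `range.start`; the last of those is the candidate. `checked_sub`
    // turns the "no such range" case into an error instead of an underflow.
    let idx = fetch_ranges
        .partition_point(|v| v.start <= range.start)
        .checked_sub(1)
        .ok_or_else(|| format!("no fetch range starts at or before {:?}", range))?;

    // Make the coverage assumption explicit as well.
    if range.end > fetch_ranges[idx].end {
        return Err(format!(
            "fetch range {:?} does not cover {:?}",
            fetch_ranges[idx], range
        ));
    }
    Ok(idx)
}
```

A `debug_assert!` would be the lighter alternative if the invariant is considered internal; the error-returning form costs one extra branch but keeps release builds from panicking on unexpected input.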

Contributor Author

Good point.

/// Default coalesce threshold: 1 MiB.
const DEFAULT_RANGE_COALESCE_BYTES: u64 = 1024 * 1024;
/// Default concurrent range fetches.
const DEFAULT_RANGE_FETCH_CONCURRENCY: usize = 8;
Contributor

Same as iceberg, maybe 10?

Contributor Author

Changed.

2 participants