Core: Stream DV Puffin rewrite in RewriteTablePathUtil #15927

Open
aviralgarg05 wants to merge 1 commit into apache:main from aviralgarg05:aviralgarg/issue-15924-stream-dv-rewrite

@aviralgarg05

Fixes #15924

Summary

This change fixes RewriteTablePathUtil.rewriteDVFile so DV Puffin files are rewritten in a streaming fashion instead of buffering every rewritten blob in memory first.

The previous implementation collected all rewritten Blob instances into a list and wrote them only after the read loop finished. That created unnecessary peak memory usage for large deletion vector files. The new implementation rewrites each blob and writes it directly to the destination PuffinWriter as it is read.
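
To make the difference between the two loop shapes concrete, here is a minimal, self-contained sketch. It does not use the Iceberg Puffin API: `Blob`, `rewriteBuffered`, and `rewriteStreaming` are hypothetical stand-ins invented for illustration, and the return value is simply the number of rewritten blobs held at once.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.UnaryOperator;

/** Illustration only: hypothetical stand-ins, not the Iceberg Puffin reader/writer. */
final class StreamingRewriteSketch {

  /** Minimal blob model: the referenced data file path plus an opaque payload. */
  static final class Blob {
    final String referencedPath;
    final byte[] payload;

    Blob(String referencedPath, byte[] payload) {
      this.referencedPath = referencedPath;
      this.payload = payload;
    }
  }

  /** Buffered shape: every rewritten blob is retained until the read loop finishes. */
  static int rewriteBuffered(
      Iterator<Blob> source, UnaryOperator<Blob> rewrite, Consumer<Blob> sink) {
    List<Blob> rewritten = new ArrayList<>();
    while (source.hasNext()) {
      rewritten.add(rewrite.apply(source.next()));
    }
    int peakHeld = rewritten.size(); // all blobs are alive at this point
    rewritten.forEach(sink);
    return peakHeld;
  }

  /** Streaming shape: each blob goes to the sink as soon as it has been rewritten. */
  static int rewriteStreaming(
      Iterator<Blob> source, UnaryOperator<Blob> rewrite, Consumer<Blob> sink) {
    int peakHeld = 0;
    while (source.hasNext()) {
      Blob blob = rewrite.apply(source.next());
      peakHeld = Math.max(peakHeld, 1); // only the current blob is referenced here
      sink.accept(blob);
    }
    return peakHeld;
  }
}
```

In the real change the sink would be the destination PuffinWriter and the source would be the blobs read by the PuffinReader. The sketch only models the control flow, not the actual memory accounting of either class.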

What changed

  • Reworked rewriteDVFile to open the PuffinWriter alongside the PuffinReader.
  • Removed the intermediate List<Blob> accumulation.
  • Preserved the existing referenced-data-file path rewrite behavior for DV blobs.
  • Added a regression test that:
    • creates a real Puffin DV file with multiple blobs,
    • rewrites it through RewriteTablePathUtil,
    • verifies the rewritten blob metadata,
    • verifies the blob payloads are preserved.
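
The metadata rewrite that the list above refers to can be sketched as a prefix swap on one blob property. This is an illustration under stated assumptions, not the code in this PR: `DvPathRewriteSketch`, `swapPrefix`, and `rewriteProperties` are hypothetical names, and the `referenced-data-file` key is assumed to be the property that carries the data file path on a DV blob.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only: hypothetical helper, not RewriteTablePathUtil itself. */
final class DvPathRewriteSketch {

  // Assumption: the blob property that names the data file a DV applies to.
  static final String REFERENCED_DATA_FILE = "referenced-data-file";

  /** Replaces sourcePrefix with targetPrefix, rejecting paths outside the source location. */
  static String swapPrefix(String path, String sourcePrefix, String targetPrefix) {
    if (!path.startsWith(sourcePrefix)) {
      throw new IllegalArgumentException(
          "Path " + path + " does not start with " + sourcePrefix);
    }
    return targetPrefix + path.substring(sourcePrefix.length());
  }

  /** Returns a copy of the properties with only the referenced path rewritten. */
  static Map<String, String> rewriteProperties(
      Map<String, String> properties, String sourcePrefix, String targetPrefix) {
    Map<String, String> copy = new HashMap<>(properties);
    String referenced = copy.get(REFERENCED_DATA_FILE);
    if (referenced != null) {
      copy.put(REFERENCED_DATA_FILE, swapPrefix(referenced, sourcePrefix, targetPrefix));
    }
    return copy; // every other property, and the blob payload, stays untouched
  }
}
```

A check in the spirit of the regression test would compare the rewritten property against the expected target path and then compare the payload bytes of each blob before and after the rewrite.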

Why this fixes the issue

The DV rewrite path is only supposed to update blob metadata, not materialize the entire file in memory. Writing each blob as soon as it is read keeps peak memory bounded by the largest single blob rather than by the full contents of the DV file.

Verification

Ran the following checks successfully:

  • ./gradlew :iceberg-core:test --tests org.apache.iceberg.TestRewriteTablePathUtil
  • ./gradlew :iceberg-core:spotlessCheck :iceberg-core:test --tests org.apache.iceberg.TestRewriteTablePathUtil
  • git diff --check

The targeted core test suite was executed three times during validation.

Copilot AI left a comment

Pull request overview

This PR addresses #15924 by changing RewriteTablePathUtil.rewriteDVFile to rewrite DV Puffin files blob by blob (streaming) rather than buffering all rewritten blobs in memory, which reduces peak memory usage when DV files contain many or large blobs.

Changes:

  • Stream DV Puffin blob rewriting directly into the destination PuffinWriter during iteration.
  • Remove intermediate List<Blob> accumulation during DV rewrite.
  • Add a regression test that builds a real multi-blob Puffin DV, rewrites it, and validates rewritten blob properties and preserved payloads.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File | Description
core/src/main/java/org/apache/iceberg/RewriteTablePathUtil.java | Streams DV Puffin blob rewrite directly to the output writer to avoid buffering rewritten blobs in memory.
core/src/test/java/org/apache/iceberg/TestRewriteTablePathUtil.java | Adds a regression test covering multi-blob DV Puffin rewrite, validating metadata rewrite and payload preservation.

Development

Successfully merging this pull request may close these issues.

Core: Stream DV Puffin rewrite in RewriteTablePathUtil to reduce memory pressure
