Core: Stream DV Puffin rewrite in RewriteTablePathUtil #15927
Open
aviralgarg05 wants to merge 1 commit into apache:main
Pull request overview
This PR addresses #15924 by changing RewriteTablePathUtil.rewriteDVFile to rewrite DV Puffin files blob-by-blob (streaming) rather than buffering all rewritten blobs in memory, reducing peak memory usage when DV files contain many/large blobs.
Changes:
- Stream DV Puffin blob rewriting directly into the destination `PuffinWriter` during iteration.
- Remove intermediate `List<Blob>` accumulation during DV rewrite.
- Add a regression test that builds a real multi-blob Puffin DV, rewrites it, and validates rewritten blob properties and preserved payloads.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| core/src/main/java/org/apache/iceberg/RewriteTablePathUtil.java | Streams DV Puffin blob rewrite directly to the output writer to avoid buffering rewritten blobs in memory. |
| core/src/test/java/org/apache/iceberg/TestRewriteTablePathUtil.java | Adds a regression test covering multi-blob DV Puffin rewrite, validating metadata rewrite and payload preservation. |
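The payload-preservation check mentioned in the table could look roughly like the sketch below. It is a standalone illustration using only `java.nio.ByteBuffer`; the class `PayloadCheckSketch` and the helpers `toBytes` and `samePayload` are hypothetical names and are not taken from the PR's test. The point it shows is that a payload can be compared byte-for-byte without advancing the position of the buffer being checked.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical helper, not part of the PR: byte-wise payload comparison. */
final class PayloadCheckSketch {

  /** Copies the remaining bytes through a duplicate so the caller's position is unchanged. */
  static byte[] toBytes(ByteBuffer buffer) {
    ByteBuffer view = buffer.duplicate();
    byte[] bytes = new byte[view.remaining()];
    view.get(bytes);
    return bytes;
  }

  /** Returns true when both buffers hold the same remaining bytes. */
  static boolean samePayload(ByteBuffer original, ByteBuffer rewritten) {
    return Arrays.equals(toBytes(original), toBytes(rewritten));
  }
}
```

Reading through `duplicate()` matters because a relative `get` on the original buffer would consume it, and a later comparison or write would then see an empty payload.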
Fixes #15924
Summary
This change fixes `RewriteTablePathUtil.rewriteDVFile` so DV Puffin files are rewritten in a streaming fashion instead of buffering every rewritten blob in memory first.
The previous implementation collected all rewritten `Blob` instances into a list and wrote them only after the read loop finished. That created unnecessary peak memory usage for large deletion vector files. The new implementation rewrites each blob and writes it directly to the destination `PuffinWriter` as it is read.
What changed
- Updated `rewriteDVFile` to open the `PuffinWriter` alongside the `PuffinReader`.
- Removed the intermediate `List<Blob>` accumulation.
- Kept the `referenced-data-file` path rewrite behavior for DV blobs.
- `RewriteTablePathUtil`,
Why this fixes the issue
The DV rewrite path is only supposed to update blob metadata, not materialize the entire file in memory. Writing each blob as soon as it is read keeps memory usage bounded by a single blob instead of the full DV file contents.
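The difference between the two shapes can be sketched without any Iceberg dependency. The code below is a minimal illustration under stated assumptions: `SketchBlob`, `rewriteBuffered`, and `rewriteStreaming` are hypothetical stand-ins rather than the Puffin classes or the actual `rewriteDVFile` code, a `Consumer` plays the role of the destination writer, and the simple prefix swap is an assumed simplification of the path rewrite.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Illustrative stand-ins only; none of these types are the Iceberg Puffin classes. */
final class StreamingRewriteSketch {
  static final String REFERENCED_DATA_FILE = "referenced-data-file";

  /** Hypothetical blob: metadata properties plus an opaque payload. */
  static final class SketchBlob {
    final Map<String, String> properties;
    final ByteBuffer payload;

    SketchBlob(Map<String, String> properties, ByteBuffer payload) {
      this.properties = properties;
      this.payload = payload;
    }
  }

  /** Rewrites only the path property; the payload is passed through untouched. */
  static SketchBlob rewrite(SketchBlob blob, String sourcePrefix, String targetPrefix) {
    Map<String, String> properties = new HashMap<>(blob.properties);
    String path = properties.get(REFERENCED_DATA_FILE);
    if (path != null && path.startsWith(sourcePrefix)) {
      // Assumed prefix swap; the real utility's path handling may differ.
      properties.put(REFERENCED_DATA_FILE, targetPrefix + path.substring(sourcePrefix.length()));
    }
    return new SketchBlob(properties, blob.payload);
  }

  /** Buffered shape: every rewritten blob stays reachable until the read loop ends. */
  static void rewriteBuffered(
      Iterator<SketchBlob> blobs, Consumer<SketchBlob> writer, String source, String target) {
    List<SketchBlob> rewritten = new ArrayList<>();
    while (blobs.hasNext()) {
      rewritten.add(rewrite(blobs.next(), source, target));
    }
    rewritten.forEach(writer);
  }

  /** Streaming shape: each blob is handed to the writer as soon as it is rewritten. */
  static void rewriteStreaming(
      Iterator<SketchBlob> blobs, Consumer<SketchBlob> writer, String source, String target) {
    while (blobs.hasNext()) {
      writer.accept(rewrite(blobs.next(), source, target));
    }
  }
}
```

Both methods produce the same output in the same order. In the streaming shape the loop holds no reference to a blob after passing it to the writer, so how much stays in memory depends on the reader and writer rather than on an intermediate list.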
Verification
Ran the following checks successfully:
- `./gradlew :iceberg-core:test --tests org.apache.iceberg.TestRewriteTablePathUtil`
- `./gradlew :iceberg-core:spotlessCheck :iceberg-core:test --tests org.apache.iceberg.TestRewriteTablePathUtil`
- `git diff --check`
The targeted core test suite was executed three times during validation.