Flink: FlinkSink data loss with unaligned checkpoints #15913
Open
UrsSchoenenbergerNu wants to merge 1 commit into apache:main from
Conversation
(issue apache#15846) When running with unaligned checkpoints, data files may reach IcebergFilesCommitter after the associated checkpoint's first barrier. We then need to account for these files in the state for the 'next' checkpoint to avoid discarding them during recovery or failed checkpoints.
mxm reviewed on Apr 10, 2026
Closes #15846
Quick Summary
When running with unaligned checkpoints, data files may reach IcebergFilesCommitter after the associated checkpoint's first barrier.
We then need to account for these files in the state for the 'next' checkpoint to avoid discarding them during recovery or failed checkpoints.
Details
We have recently encountered a case of data loss in an application using FlinkSink (v1), running on Flink 2.x with unaligned checkpoints.
What we saw:
One commit to the Iceberg table was missing approximately half of the data files that were written by the writer during this checkpoint.
Around this time, one checkpoint was started and Flink snapshots were triggered, but this checkpoint timed out before completing.
Based on our research, we believe that the culprit might be the internal state tracking in `IcebergFilesCommitter`. Here's our current theory: with unaligned checkpoints, the order of operations on `IcebergFilesCommitter` can change subtly, since `processElement` can be called for a `FlinkWriteResult` that is part of checkpoint N even after `snapshotState` was called for checkpoint N:

1. `processElement(FlinkWriteResult1 [part of checkpoint N])` is recorded in `writeResultsSinceLastSnapshot`.
2. `snapshotState(checkpoint N)`: `writeToManifestUptoLatestCheckpoint(N)` is called, putting the contents of `writeResultsSinceLastSnapshot` into `dataFilesPerCheckpoint(N)` and clearing `writeResultsSinceLastSnapshot`.
3. `processElement(FlinkWriteResult2 [part of checkpoint N])` is recorded in `writeResultsSinceLastSnapshot` (now the single element there).
4. `notifyCheckpointComplete(N)` is never called.
5. `processElement(FlinkWriteResult3 [part of checkpoint N+1])`.
6. `snapshotState(checkpoint N+1)`: `writeToManifestUptoLatestCheckpoint(N+1)` is called, putting the contents of `writeResultsSinceLastSnapshot` into `dataFilesPerCheckpoint(N)` and `dataFilesPerCheckpoint(N+1)` and clearing `writeResultsSinceLastSnapshot`. `dataFilesPerCheckpoint(N)` now loses `FlinkWriteResult1`, even though it was never committed, and contains only `FlinkWriteResult2`.
7. `notifyCheckpointComplete(checkpoint N+1)` commits `FlinkWriteResult2` and `FlinkWriteResult3`, but not `FlinkWriteResult1`.

This sequence, or equivalently one where checkpoint N does not time out but completes after checkpoint N+1, therefore leads to data loss.
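The bookkeeping failure in the sequence above can be replayed with plain Java collections. This is a toy model, not the real `IcebergFilesCommitter`: the map and list mirror the `dataFilesPerCheckpoint` / `writeResultsSinceLastSnapshot` fields discussed above, and the grouping logic is a simplified assumption about how pending results are attributed to checkpoint ids.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class UnalignedCommitModel {
  // A FlinkWriteResult carries the id of the checkpoint it belongs to.
  record WriteResult(long checkpointId, String file) {}

  // checkpointId -> files the committer believes belong to that checkpoint
  static final NavigableMap<Long, List<String>> dataFilesPerCheckpoint = new TreeMap<>();
  // results received since the last snapshotState() call
  static final List<WriteResult> writeResultsSinceLastSnapshot = new ArrayList<>();
  static final List<String> committed = new ArrayList<>();

  static void processElement(WriteResult r) {
    writeResultsSinceLastSnapshot.add(r);
  }

  // Simplified stand-in for writeToManifestUptoLatestCheckpoint: pending
  // results are grouped by their own checkpoint id, and putAll() OVERWRITES
  // whatever was already stored under that id by an earlier snapshot.
  static void snapshotState(long checkpointId) {
    TreeMap<Long, List<String>> grouped = new TreeMap<>();
    for (WriteResult r : writeResultsSinceLastSnapshot) {
      grouped.computeIfAbsent(r.checkpointId(), k -> new ArrayList<>()).add(r.file());
    }
    dataFilesPerCheckpoint.putAll(grouped); // overwrite, not merge
    writeResultsSinceLastSnapshot.clear();
  }

  static void notifyCheckpointComplete(long checkpointId) {
    // Commit everything up to and including this checkpoint id.
    dataFilesPerCheckpoint.headMap(checkpointId, true).values().forEach(committed::addAll);
    dataFilesPerCheckpoint.headMap(checkpointId, true).clear();
  }

  public static void main(String[] args) {
    processElement(new WriteResult(1, "file1")); // pre-barrier data for ckpt 1
    snapshotState(1);                            // state: {1=[file1]}
    processElement(new WriteResult(1, "file2")); // post-barrier data for ckpt 1 (unaligned)
    // notifyCheckpointComplete(1) never arrives: checkpoint 1 timed out
    processElement(new WriteResult(2, "file3"));
    snapshotState(2);                            // put(1, [file2]) silently drops file1
    notifyCheckpointComplete(2);
    System.out.println(committed);               // [file2, file3] -- file1 is lost
  }
}
```

Replaying the seven steps this way commits only `file2` and `file3`; `file1` is never committed and no longer exists anywhere in the committer's state.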
This explanation aligns with the effects that we're seeing. The root cause seems to be that `snapshotState` -> `writeToManifestUptoLatestCheckpoint` -> `writeResultsSinceLastSnapshot` implicitly assumes that all records for the checkpoint have already been processed when `snapshotState()` is called, i.e. it assumes aligned checkpoints. If this assumption breaks, AND in addition a later checkpoint is snapshotted before an earlier one was notified complete, the issue described above is observed.

Additionally, we suspect there's a second failure mode with unaligned checkpoints that loses data on job recovery. If our theory above is correct, then `IcebergFilesCommitter` has an issue with elements for checkpoint N being processed after `snapshotState(N)`. But the Iceberg commit triggered during `notifyCheckpointComplete(N)` only commits the records from the snapshot. On recovery, this means one of two things: either `initializeState()` -> `.tailMap(N)` loses information about not-yet-committed records, or, even if it remembered them, it would think that these records should be committed as part of Flink checkpoint N. It feels like there's a problem when a checkpoint ID can appear in `dataFilesPerCheckpoint` after it has already been used as `maxCommittedCheckpointId`, and the strict tailMap exclusion has no way to distinguish "already committed" from "deferred post-barrier" data.

Contents of this PR
This PR contains three test cases that drive the test harness in ways that are only possible with unaligned checkpoints. Without the accompanying change to production code, these test cases fail.
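The strict tailMap exclusion behind the suspected recovery-time loss can be illustrated with plain `TreeMap` semantics, independent of Flink. This is a sketch under the assumptions above: the map literal stands in for the restored committer state, where the entry for checkpoint 1 was overwritten with deferred post-barrier files even though checkpoint 1 is already recorded as committed.

```java
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class TailMapRecoveryModel {
  public static void main(String[] args) {
    // Hypothetical restored state after the failure: checkpoint 1 was
    // already committed (maxCommittedCheckpointId == 1), but its entry was
    // later overwritten with post-barrier files that never got committed.
    NavigableMap<Long, List<String>> restored = new TreeMap<>();
    restored.put(1L, List.of("file2")); // deferred post-barrier data, NOT committed
    restored.put(2L, List.of("file3"));
    long maxCommittedCheckpointId = 1L;

    // An exclusive tailMap treats everything at or below the committed id
    // as already durable -- file2 is silently dropped on recovery.
    NavigableMap<Long, List<String>> uncommitted =
        restored.tailMap(maxCommittedCheckpointId, false);
    System.out.println(uncommitted); // {2=[file3]}
  }
}
```

The checkpoint id alone cannot tell the two cases apart, which is why filtering by `maxCommittedCheckpointId` is insufficient once post-barrier data can land under an already-committed id.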
Disclaimer / disclosure as per contributing guidelines
AI tools were used to initially pinpoint the possible issue and draft test cases. The code was subsequently rewritten, condensed, and clarified by hand. To the best of my knowledge, the ways in which the reproduction test cases trigger the harness are all cases that DO happen with unaligned checkpoints, and the data loss that we encountered at runtime matches the one that occurs in these reproduction test cases.
Caveats
The logic that I'm changing here was last changed by #10526 while fixing a data duplication bug. The test cases that were added for that bug are still green with the changes that I'm proposing here. I would still like to tag @pvary and @zhongqishang and kindly ask them for advice on this issue and the associated PR, as I'm sure they have a much deeper understanding of the inner workings of `IcebergFilesCommitter` than I do.