KAFKA-14648: Do not fail clients if bootstrap servers is not immediately resolvable#21080

Open
frankvicky wants to merge 56 commits into apache:trunk from frankvicky:KAFKA-14648

Conversation

@frankvicky
Contributor

@frankvicky frankvicky commented Dec 4, 2025

This PR aims to deliver KIP-909.

Reviewers: Lianet Magrans <98415067+lianetm@users.noreply.github.com>

@github-actions

A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.

github-actions bot commented Jan 4, 2026

A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.


@Override
public void bootstrap(List<InetSocketAddress> addresses) {
// AdminClient handles bootstrap during construction, so this method is not used
Member

KIP-909 states "Consumer, Producer, and Admin Clients: The bootstrap code will be changed.", so this comment is not correct

config.originalsWithPrefix(CommonClientConfigs.METRICS_CONTEXT_PREFIX));
metrics = new Metrics(metricConfig, reporters, time, metricsContext);

// Use the appropriate bootstrap configuration determined by AdminBootstrapAddresses
Member

It seems AdminBootstrapAddresses.fromConfig(config); will already resolve DNS, which conflicts with KIP-909
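
To illustrate the distinction raised here, the following is a minimal sketch of parsing bootstrap entries without touching DNS. The class and method names (`UnresolvedBootstrapSketch`, `parseUnresolved`) are hypothetical and are not taken from the PR; only `InetSocketAddress.createUnresolved` is a real JDK call. IPv6 literals and other edge cases are deliberately ignored.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; it is not part of the PR.
final class UnresolvedBootstrapSketch {

    // Parses "host:port" entries into unresolved socket addresses.
    // Assumption: plain hostnames or IPv4 literals; bracketed IPv6 is not handled.
    static List<InetSocketAddress> parseUnresolved(List<String> urls) {
        List<InetSocketAddress> addresses = new ArrayList<>();
        for (String url : urls) {
            int idx = url.lastIndexOf(':');
            if (idx <= 0 || idx == url.length() - 1) {
                throw new IllegalArgumentException("Invalid host:port entry: " + url);
            }
            String host = url.substring(0, idx);
            int port = Integer.parseInt(url.substring(idx + 1));
            // createUnresolved stores the hostname as-is and performs no DNS lookup,
            // so construction cannot fail because a name is not yet resolvable.
            addresses.add(InetSocketAddress.createUnresolved(host, port));
        }
        return addresses;
    }

    public static void main(String[] args) {
        List<InetSocketAddress> addrs =
            parseUnresolved(List.of("broker-1.example.invalid:9092"));
        System.out.println(addrs.get(0).getHostString() + " unresolved=" + addrs.get(0).isUnresolved());
    }
}
```

With this shape, the actual lookup can be deferred to the first network poll, which appears to be what KIP-909 is asking for.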

@Nikita-Shupletsov
Contributor

looks like the test failures may be related to the change

unresolvedAddresses.add(InetSocketAddress.createUnresolved(host, port));
}
}
metadataManager.update(Cluster.bootstrap(unresolvedAddresses), time.milliseconds());
Member

The cluster created by Cluster.bootstrap has isBootstrapConfigured=true, so bootstrapCluster will be updated immediately. This makes isBootstrapped() incorrectly return true

Member

Maybe we should leverage MetadataUpdater#bootstrap to set the resolved bootstrap servers once the DNS resolution succeeds in NetworkClient?

long remainingBootstrapTimeMs = bootstrapTimer.remainingMs();
long sleepTimeMs = Math.min(Math.min(remainingPollTimeMs, remainingBootstrapTimeMs), bootstrapConfiguration.retryBackoffMs);

if (sleepTimeMs > 0) {
Member

What happens if users try to interrupt the poll?

Member

@chia7712 chia7712 left a comment

@frankvicky Just wondering, did we discuss the blocking DNS resolution in poll before? I'm thinking we could use a separate BootstrapResolver class with an independent thread to handle this. We can pass it to Metadata or AdminMetadataManager so we don't have to change NetworkClient too much. NetworkClient#ensureBootstrapped can just poll the metadataUpdater's state with blocking.


while (true) {
// Check if thread has been interrupted
if (Thread.interrupted()) {
Member

Should we propagate the exception directly?

        if (Thread.interrupted()) {
            throw new InterruptException(new InterruptedException());
        }

* @param isBootstrapConfigured whether the cluster is bootstrapped
* @return a new Cluster instance with the specified bootstrap state
*/
public static Cluster withBootstrapFlag(String clusterId,
Member

This change seems to be unnecessary. Also, it is a public class so we should not make this change here

Contributor Author

I understand your concern about adding a new public API. However, the Cluster.withBootstrapFlag() method is necessary for the KIP-909 implementation. Here's why:

  1. The isBootstrapConfigured flag must be preserved across metadata updates:
  • After DNS resolution succeeds, we create a cluster with isBootstrapConfigured=true
  • When a partial metadata update occurs (via MetadataSnapshot.mergeWith()), this flag needs to be preserved
  • This allows AdminMetadataManager.update() to correctly distinguish between bootstrap cluster and real cluster metadata
  2. Package structure constraints:
  • Cluster is in the org.apache.kafka.common package
  • MetadataSnapshot is in the org.apache.kafka.clients package
  • We cannot use package-private methods to access each other
  • The only options are: public static method or restructuring the package layout
  3. Issues with alternative approaches:
  • Not preserving the flag: Would break AdminMetadataManager's ability to correctly identify bootstrap clusters, affecting re-bootstrap logic
  • Modifying package structure: Too large a change with high risk
  • Other designs: All require similar public API additions or more complex refactoring

Alternative consideration:
If adding this public method is a blocker, the alternative would be to:

  1. Move both Cluster and MetadataSnapshot to the same package, or
  2. Add a ClusterBuilder class in the clients package that has access to create clusters with specific bootstrap flags

Member

We had an offline discussion about this one. Another option to consider is to completely remove the isBootstrapConfigured flag and rely on checking whether the metadata has nodes instead (having nodes, either bootstrap nodes or from real metadata, means that we can skip DNS resolution, which is what isBootstrapConfigured was used for).

The changes are in place now, so there is no need for changes to this Cluster class. Thoughts, @chia7712?

config.getLong(AdminClientConfig.RETRY_BACKOFF_MS_CONFIG),
config.getLong(AdminClientConfig.METADATA_MAX_AGE_CONFIG),
adminAddresses.usingBootstrapControllers());
metadataManager.update(Cluster.bootstrap(adminAddresses.addresses()), time.milliseconds());
Member

it seems adminAddresses.addresses() is not used any more. Maybe we could just remove AdminBootstrapAddresses and use a helper method instead?

@chia7712
Member

did we discuss the blocking DNS resolution in poll before? I'm thinking we could use a separate BootstrapResolver class with an independent thread to handle this. We can pass it to Metadata or AdminMetadataManager so we don't have to change NetworkClient too much. NetworkClient#ensureBootstrapped can just poll the metadataUpdater's state with blocking.

The while loop in ensureBootstrapped is problematic because it introduces unexpected blocking logic into the event loop. The original design likely didn't intend to have a loop here. To keep it non-blocking, it should just return immediately if the DNS resolution is not ready yet.
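
As a rough sketch of the loop-free shape described above: each call makes at most one resolution attempt and returns, leaving the retry to the next poll. Everything here is an assumption for illustration (the `NonBlockingBootstrapSketch` class, the `Resolver` interface, and the use of `IllegalStateException` in place of the PR's `BootstrapResolutionException`); retry backoff is omitted. The single attempt can still block inside the resolver itself.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical sketch; names do not correspond to classes in the PR.
final class NonBlockingBootstrapSketch {

    /** Resolver abstraction so the example can run without real DNS. */
    interface Resolver {
        InetAddress[] resolve(String host) throws UnknownHostException;
    }

    private final Resolver resolver;
    private final String host;
    private final long deadlineMs;
    private InetAddress[] resolved; // null until resolution succeeds

    NonBlockingBootstrapSketch(Resolver resolver, String host, long nowMs, long timeoutMs) {
        this.resolver = resolver;
        this.host = host;
        this.deadlineMs = nowMs + timeoutMs;
    }

    /**
     * Makes at most one resolution attempt per call; it never loops or sleeps.
     * Returns true once bootstrapped, false to signal "try again on the next poll".
     * Throws once the bootstrap deadline has passed without a successful lookup.
     */
    boolean ensureBootstrapped(long nowMs) {
        if (resolved != null) {
            return true;
        }
        try {
            resolved = resolver.resolve(host);
            return true;
        } catch (UnknownHostException e) {
            if (nowMs >= deadlineMs) {
                throw new IllegalStateException("Failed to resolve bootstrap servers within timeout", e);
            }
            return false;
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Fake resolver: fails twice, then succeeds with the loopback address.
        Resolver flaky = h -> {
            if (calls[0]++ < 2) {
                throw new UnknownHostException(h);
            }
            return new InetAddress[] {InetAddress.getLoopbackAddress()};
        };
        NonBlockingBootstrapSketch sketch =
            new NonBlockingBootstrapSketch(flaky, "broker.example.invalid", 0L, 1000L);
        for (long now = 0; now <= 300; now += 100) {
            System.out.println("poll at " + now + "ms, bootstrapped=" + sketch.ensureBootstrapped(now));
        }
    }
}
```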

@lianetm
Member

lianetm commented Mar 27, 2026

About the blocking behaviour, I agree with removing the while loop within the network client poll and letting the retry happen on the next poll.
But even with that, we still have a blocking call to parseAddresses (a single one, not in a while loop, but still blocking because it triggers InetAddress.getAllByName). I wonder if we could consider running this async, triggered only once when needed from the network poll (so the overhead of the extra thread is only for bootstrap)? With that we wouldn't block on the network poll and would just continue to check whether that async op finished in time, something like that. Thoughts?

-- update
Adding follow-up thoughts on this (tradeoff between the impact of the blocking we have vs. the overhead of any async op). With the current shape, the blocking really is a single blocking call to parseAddresses within poll (it blocks on a single DNS resolution attempt, not for the full bootstrap timer), and the important bit is that the blocking would only happen initially, when bootstrapping for the first time (at a point where the network client is really not functional yet). Is my understanding correct? If so, then we're probably better off not introducing any overhead for running this async (and just need to handle interrupts?)
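
For reference, the async alternative being weighed here could look roughly like the sketch below. It is an assumption-laden illustration, not code from the PR: `AsyncResolutionSketch` and `pollResolved` are hypothetical names, and the lookup runs on the common `ForkJoinPool` via `CompletableFuture.supplyAsync` rather than on a dedicated bootstrap thread.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

// Hypothetical sketch of an asynchronous bootstrap lookup; not the PR's implementation.
final class AsyncResolutionSketch {

    private final String host;
    private CompletableFuture<InetAddress[]> pending; // started lazily on the first poll

    AsyncResolutionSketch(String host) {
        this.host = host;
    }

    /**
     * Returns the resolved addresses if the lookup has finished successfully,
     * or null if it is still in flight or has failed. Never blocks the caller.
     */
    InetAddress[] pollResolved() {
        if (pending == null) {
            // The blocking InetAddress.getAllByName call runs off the polling thread.
            pending = CompletableFuture.supplyAsync(() -> {
                try {
                    return InetAddress.getAllByName(host);
                } catch (UnknownHostException e) {
                    throw new CompletionException(e);
                }
            });
        }
        if (!pending.isDone()) {
            return null;
        }
        try {
            return pending.join(); // already complete, so this does not wait
        } catch (CompletionException e) {
            pending = null; // drop the failed attempt so a later poll can retry
            return null;
        }
    }

    public static void main(String[] args) {
        // An IP literal is used so the example does not depend on a DNS server.
        AsyncResolutionSketch sketch = new AsyncResolutionSketch("127.0.0.1");
        InetAddress[] result = null;
        while (result == null) {
            result = sketch.pollResolved();
            Thread.onSpinWait();
        }
        System.out.println("resolved " + result.length + " address(es)");
    }
}
```

The extra moving parts (a future, a retry reset, a thread pool) are the overhead the follow-up above argues may not be worth it if blocking only happens once during the initial bootstrap.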


// DNS resolution failed
if (bootstrapTimer.isExpired()) {
throw new BootstrapResolutionException("Failed to resolve bootstrap servers after " +
Member

from the KIP I get that we want this to propagate to the consumer.poll right?

If so, then we need to catch this exception and send an error event to the app thread, otherwise this error will be swallowed in the background (at this line: log.error("Unexpected error caught in consumer network thread", e);)

One option I see is to catch this at the NetworkClientDelegate level, where we have the backgroundEventHandler, and propagate the error there via event (just like we do to propagate metadata errors)
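
A minimal, self-contained sketch of that propagation pattern follows. It uses a plain `BlockingQueue` as a stand-in for the background event handler; `ErrorPropagationSketch`, `runNetworkOnce`, and `poll` are hypothetical names, and the real consumer would use its own event types rather than raw exceptions.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch; the queue stands in for the consumer's background event channel.
final class ErrorPropagationSketch {

    private final BlockingQueue<RuntimeException> errorEvents = new LinkedBlockingQueue<>();

    /**
     * Runs on the background (network) thread. Instead of only logging the failure,
     * it enqueues the error so the application thread can observe it.
     */
    void runNetworkOnce(Runnable networkPoll) {
        try {
            networkPoll.run();
        } catch (RuntimeException e) {
            errorEvents.add(e);
        }
    }

    /** Runs on the application thread: rethrows the oldest pending background error, if any. */
    void poll() {
        RuntimeException error = errorEvents.poll();
        if (error != null) {
            throw error;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ErrorPropagationSketch sketch = new ErrorPropagationSketch();
        Thread background = new Thread(() -> sketch.runNetworkOnce(() -> {
            throw new IllegalStateException("Failed to resolve bootstrap servers");
        }));
        background.start();
        background.join();
        try {
            sketch.poll();
        } catch (IllegalStateException e) {
            System.out.println("poll() surfaced: " + e.getMessage());
        }
    }
}
```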

Btw, we really need to have a test at the KafkaConsumerTest level covering this behaviour. E.g., create a consumer with an invalid host and poll continuously; poll should eventually throw the bootstrap exception (for both consumers). I expect that wouldn't work now for the async consumer.

}

@Test
public void testAsyncConsumerBootstrapResolutionExceptionPropagatedToPoll() throws InterruptedException {
Member

This behaviour applies equally to both consumers. Should we parameterize the test to run it for group.protocol=classic too?


7 participants