Cluster sync adj in p&a flavour by M4KIF · Pull Request #1814 · splunk/splunk-operator

M4KIF · 2026-04-02T14:35:42Z

Description

Rewritten the sync logic

moved each component checks to separate objects all implementing the Condition interface by which we iterate and check the health.

Key Changes

interfaced the health check
implemented the above for provisioner, pooler, configmap, secret
short unit test added
added constants and state check via bitmask

Testing and Verification

local integ suite passes (make test) and units (pkg/postgres/cluster/core go test) passes

Related Issues

Jira tickets, GitHub issues, Support tickets...

PR Checklist

[nie wiem] Code changes adhere to the project's coding standards.
Relevant unit and integration tests are included.
Documentation has been updated accordingly.
All tests pass locally.
The PR description follows the project's guidelines.

github-actions · 2026-04-02T14:35:58Z

CLA Assistant Lite bot:
Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contribution License Agreement before we can accept your contribution. You can sign the CLA by just posting a Pull Request Comment with the exact sentence copied from below.

I have read the CLA Document and I hereby sign the CLA

_{You can retrigger this bot by commenting recheck in this Pull Request}

limak9182 · 2026-04-03T12:39:25Z

pkg/postgresql/cluster/business/core/cluster.go

+		CNPG cluster
+		Poolers
+		Access resources (configmap and secret)
+		And all of them needs to be set to Ready for our PostgresCluster phase to become Ready?


yes, that's correct

limak9182 · 2026-04-03T12:44:08Z

pkg/postgresql/cluster/business/core/cluster.go

+				rewrite to consider taking state of other objects into account
+				before declaring readyness.
+
+		CNPG cluster


Correct me if I'm wrong, but what I'm missing in the code is checking if CNPG cluster is ready and if yes then updating our ClusterReady condition to true, so at the end (here) we can check if all conditions are true and set whole custom resource status ready to true

Currently? Yes, currently It's mostly a scaffolding to which I will place any business logic.

limak9182 · 2026-04-03T12:47:56Z

pkg/postgresql/cluster/business/core/cluster.go

 			}
 			return ctrl.Result{}, patchErr
 		default:
 			if statusErr := updateStatus(clusterReady, metav1.ConditionFalse, reasonClusterBuildSucceeded,


Here we are updating the status with clusterReady condition False if we patched CNPG Cluster and requeue.
Shouldnt we in the next reconciliation go again to this check again if CNPG cluster is in desired state:

!equality.Semantic.DeepEqual(currentNormalized, desiredNormalized)

and if it is (we are not going inside this if), set the clusterReady condition to true?
Something like:

statusErr := updateStatus(clusterReady, metav1.ConditionTrue...

just after this big if block?

Maybe one more CNPG cluster check is needed, just to be sure it's in healthy and ready state, if not then requeue or leave with error?

actually after thinking about it, it should be probably after reconcileManagedRoles as it's the last thing we are doing with CNPG cluster.

In general I think It's a bit misleading that we do cluster ready and condtion == false. At least for me, It should be sth like ClusterErrorRetry, like http codes.

Sorry I think I don't not fully understand that. By cluster do you mean our PostgresClusterCR or Cluster Ready condition (actual CNPG cluster)?

In general I think It's a bit misleading that we do cluster ready and condtion == false. At least for me, It should be sth like ClusterErrorRetry, like http codes.

This is standard pattern in k8s, we should stick to it. Take a look at our design docs where we map phases and conditions to current situation

mploski · 2026-04-07T09:57:43Z

pkg/postgresql/cluster/core/cluster.go

+	// return ctrl.Result{}, nil
+}
+
+type clusterReadynessCheck interface {


sound like a great thing to add to the specific port capabilities :-)

Then the interface here would be implemented by the ports. Clean and nice idea

mploski · 2026-04-07T09:59:59Z

pkg/postgresql/cluster/business/ports/secondary/provisioner.go

+	pgcConstants "github.com/splunk/splunk-operator/pkg/postgresql/cluster/business/core/types/constants"
+)
+
+type Provisioner interface {


I think our secondary ports should reflect that we create cluster and database and we should map our interfaces around it.

mploski · 2026-04-07T10:36:33Z

pkg/postgresql/cluster/core/cluster.go

+	*/
+
+	// basically a sync logic
+	state := pgcConstants.EmptyState


I think the idea here was to decouple status check from cnpg status, At the same time we also check health after every stage and we move forward only if we are ok, if it is still in progress we requeue or raise error. Here we have a code we can use to check where we are with status iteratively, but I dont see yet how it solve our core problem

we build our state here, after checking all ports for readyness/not dying there is the moment for us to decide what happened and how It happened.
We don't set our state as state == cnpgVariable mapped to ours, we decide what do we want to do with the fact that cm's, secrets, provisioner etc. are ready.

mploski · 2026-04-07T10:39:37Z

internal/controller/postgrescluster_controller.go

 	cnpgv1 "github.com/cloudnative-pg/cloudnative-pg/api/v1"
 	enterprisev4 "github.com/splunk/splunk-operator/api/v4"
-	clustercore "github.com/splunk/splunk-operator/pkg/postgresql/cluster/core"
+	clustercore "github.com/splunk/splunk-operator/pkg/postgresql/cluster/business/core"


tbh business string is redundat here. Core itself is already a domain

I agree It's redundant, here It's a tradeoff for verbosity and segregation of components. And a service pattern at once, ie. the service/ is the primary port (reconciler that we provide) implementation. core/ is core, and ports/ are the contracts that we need for the core to work. They can grow large, hence the whole separate dir for ports.

mploski · 2026-04-07T10:42:54Z

pkg/postgresql/cluster/business/core/cluster.go

+	cnpgCluster *cnpgv1.Cluster
+}
+
+func (c *provisionerHealthCheck) Condition() (pgcConstants.State, error) {


I think all of this conditions check should be part of our Ports, also how you want to map condition to phase?

I agree, they were placed here as what I need. Solving It like you say is the thing I'm hoping for. For the provisioner/cluster etc. ports to include an interface for checking It's state.

Then the adapter would be essentially mapping the dependency state to our abstraction of It's state. Ie. we have cluster ready, provisioning, failover, cupcake, coffee etc.
Mapping condition to phase would be the job of the facade, ie. the cluster.go. That would be the whole operational brain behind. Ie. lot's of individual pieces funneled into our business decisions.

Sth. like a stateMapper, or another objects that specialises in deciding on what phase/condition we're in could also be born. The bit mask could be used for covering the phase 1, ie. state = FinaliserNotAdded && !ClusterProvisioning etc.
Phase 2 -> state == Finaliser && ClusterReady etc.

mploski · 2026-04-07T10:43:36Z

pkg/postgresql/cluster/business/core/cluster.go

+}
+
+func (p *poolerHealthCheck) Condition() (pgcConstants.State, error) {
+	return pgcConstants.PoolerReady, nil


how do we want to check the actual condition component has in status?

wdym? It's kind of the job of the adapter to test and provide that the state is actual. Ie. If we place this as a method of a port, and implement It via adapters. We actually won't work on the real state of the component in our core. Only on our understanding of It.

Currently It would be just to copy paste the thing that we do inside cluster.go, ie. the resource obj. of the Pooler, k8s.Get(obj, ...) and all similar. As there is no abstraction currently.

mploski · 2026-04-07T10:50:09Z

pkg/postgresql/cluster/core/types/constants/state.go

+)
+
+const (
+	ComponentsReady = PoolerReady | ProvisionerReady | SecretReady | ConfigMapReady


what we try to achieve here? Is it a bitmask? Since you use IOTA we endup in having just random integer?
That would be probably simpler to just use struct with keeping the state like that:

type ClusterState struct { Provisioner ComponentPhase Pooler ComponentPhase ConfigMap ComponentPhase Secret ComponentPhase }

It's a bitmask. And It does the same job of keeping a struct with aditional field.
And It kinda solves the case of having to create new types just for each component.
Just adding additional states to the state machine, ie. the values in the "enum". The iota usage is an enum in go: https://yourbasic.org/golang/iota/
It's the first usage in this file, hence It's basically an enum from 0

And It kinda solves the case of having to create new types just for each component.

We already create types for a new component for many reasons, so what problem it really solve? I agree it is smart way of doing this, but neverthless if new component arise you need to add it to the const and extend types. I feel like we trade go readability for a really small c-like optimisation especially with this model of bitwise comparision later:

state |= componentHealth.State if state&pgcConstants.ComponentsReady == pgcConstants.ComponentsReady

Also, if we build state incrementally it means that the LAST successful state in state machine is an final success. With this in mind we dont need to check every other component state afterwards.

Can we do this simpler so when Im wake up at 5am in the morning I easily understand the code?

I agree that this state check, after the iteration health check passes is redundant here.
And taking into consideration the potential future work. Which could include a file division.
I could expand the *healthCheck types with them returning the *(component)StateDto instead of relying on generic state bits.

Because later, in the very near future It seems that the project could follow this footstep of having some separation in phases and It's crucial elements. Like we've discussed on the p&a ideas brainstorm.

mploski · 2026-04-07T10:56:19Z

pkg/postgresql/cluster/business/core/cluster.go

-			rc.emitPoolerReadyTransition(postgresCluster, poolerOldConditions)
-		}
+
+	if state&pgcConstants.ComponentsReady != 0 {


Not sure if this logic is not broken. What happens if we set ProvisionerReady but later in stage we set failed for something. Sue to way we set state and components this would pass. I think it is because iota incrementing by 1 not by power of 2? I think we should rely not on bitmasking here, but simple struct with state for every stage and if all is good we are good

with unsetting the bit's at any space, this condition starts failing. As well as with not setting the bits, the values don't AND, hence if we mark ProvisionerReady and then set masks for ConfigMapFailed, it won't fire.
And to prove how this logic would work, there would be tests that would make sure any misfires aren't possible.

…ster-sync-adj

mploski · 2026-04-09T05:38:25Z

pkg/postgresql/cluster/core/cluster.go

-		oldPhase = *postgresCluster.Status.Phase
+	// Aggregate component readiness from iterative health checks.
+	state := pgcConstants.EmptyState
+	conditions := []clusterReadynessCheck{


so after every phase that is not immediate like cluster creation we should also incorporate state check rediness. I think we discussed that we dont really need to check at the end assuming we check intermediary statuses per phase?

I agree with that, but Isn't then the scope == refactor the reconciler?
I've tried to stick with changing the sync logic and doing the ground prep for more changes in coming tickets and potential p&a rework.

mploski · 2026-04-09T05:55:38Z

pkg/postgresql/cluster/core/cluster.go

+	for _, check := range conditions {
+		componentHealth, err := check.Condition(ctx)
+		if err != nil {
+			if statusErr := updateStatus(componentHealth.Condition, metav1.ConditionFalse, componentHealth.Reason, componentHealth.Message, componentHealth.Phase); statusErr != nil {


If we run this at the end of reconcillation it seems that some of the code is dead i.e if we are here, we cannot have a configmap or secret orphaned. If we do it should be discovered during this phase and requeue/err

mploski · 2026-04-09T06:52:43Z

pkg/postgresql/cluster/core/cluster.go

+				}
+				return ctrl.Result{}, statusErr
+			}
+			logger.Error(err, "Component health check reported issues",


Please follow our logging strategy: https://splunk.atlassian.net/wiki/spaces/CCP/pages/1079831167399/PostgreSQL+Controllers+Logging+Strategy

mploski · 2026-04-09T06:57:22Z

pkg/postgresql/cluster/core/cluster.go

+			return componentHealth.Result, err
+		}
+
+		if isPendingState(componentHealth.State) {


If we run this code on every phase separately, here we should requeue

mploski · 2026-04-09T07:07:12Z

pkg/postgresql/cluster/core/cluster.go

-	if postgresCluster.Status.Phase != nil {
-		newPhase = *postgresCluster.Status.Phase
+
+	if state&pgcConstants.ComponentsReady == pgcConstants.ComponentsReady {


thois could be potentially method on a state so it reads natually at 4 am i.e
func (s State) HasAll(required State) bool {
return s&required == required
}
if state.HasAll(pgcConstants.ComponentsReady) {}

mploski · 2026-04-09T07:09:31Z

pkg/postgresql/cluster/core/cluster.go

+	return &provisionerHealthCheck{cluster: cluster, cnpgCluster: cnpgCluster}
+}
+
+func (c *provisionerHealthCheck) Condition(_ context.Context) (StateInformationDto, error) {


I like providing interface like that! One doubt I have is about pureness and responsiblity of those condition methods. They check conditions but also fetch k8s objects. IMO k8s objects should be evaluated and the reconciller kickoff and propagated

mploski · 2026-04-09T07:13:34Z

I like the direction we are heading two!
Few concerns two discuss:

Should we aggregate the status check at the end or we should check per phase and when we pass everything in reconcile then we know we are good. Considering we build state incrementally and sequentally this make sense IMO
Function pureness
bitmask usage in that form - if 1 is true then we might not need it. If we need it mabe we can wrap it in helper functions/refactor the approach so we understand the code better in the morning when we have P0
I believe we should start with drowing current state machine and how we set a phase and conditions. Did you do that?

M4KIF changed the base branch from main to feature/database-controllers April 2, 2026 14:35

limak9182 reviewed Apr 3, 2026

View reviewed changes

mploski reviewed Apr 7, 2026

View reviewed changes

sync logic rewrite + tests and constants

ad7aaec

M4KIF force-pushed the jkoterba/feature/cluster-sync-adj branch from b0b2f12 to ad7aaec Compare April 8, 2026 08:02

Merge branch 'feature/database-controllers' into jkoterba/feature/clu…

b79430e

…ster-sync-adj

M4KIF marked this pull request as ready for review April 8, 2026 08:03

mploski reviewed Apr 9, 2026

View reviewed changes

M4KIF requested review from limak9182 and mploski April 9, 2026 06:13

mploski reviewed Apr 9, 2026

View reviewed changes

Conversation

M4KIF commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Key Changes

Testing and Verification

Related Issues

PR Checklist

Uh oh!

github-actions bot commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

limak9182 Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

limak9182 Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mploski Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

I agree, they were placed here as what I need. Solving It like you say is the thing I'm hoping for. For the provisioner/cluster etc. ports to include an interface for checking It's state.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mploski Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mploski Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mploski Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

M4KIF commented Apr 2, 2026 •

edited

Loading

github-actions bot commented Apr 2, 2026 •

edited

Loading

limak9182 Apr 3, 2026 •

edited

Loading

limak9182 Apr 3, 2026 •

edited

Loading

mploski Apr 7, 2026 •

edited

Loading

mploski Apr 7, 2026 •

edited

Loading

mploski Apr 9, 2026 •

edited

Loading

mploski Apr 7, 2026 •

edited

Loading