
postgres controllers metrics#1811

Merged
limak9182 merged 7 commits into feature/database-controllers from feature/postgres-controllers-metrics
Apr 10, 2026

Conversation


@limak9182 limak9182 commented Apr 2, 2026

Description

Adds Prometheus metrics for the PostgreSQL controllers using a hexagonal
(ports and adapters) pattern: the domain code depends only on a Recorder interface, never
on Prometheus directly.

New package: pkg/postgresql/metrics/

  • ports.go: Recorder interface (the port). Core service packages import only this.
  • prometheus.go: PrometheusRecorder adapter with 6 metric families under the splunk_operator_postgres_ prefix, registered against the controller-runtime metrics registry.
  • noop.go: NoopRecorder for unit tests.
  • collector.go: FleetCollector, which recomputes fleet-state gauges from the informer cache after each reconcile.
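
A minimal sketch of how the port and its adapters could fit together. Only the names Recorder, NoopRecorder, and IncStatusTransition come from this PR; the parameter list (taken from the metric's labels), the countingRecorder test adapter, the markReady helper, and the label values are assumptions made for illustration.

```go
package main

import "fmt"

// Recorder is the port. The parameter list is an assumption derived from the
// labels of status_transitions_total (controller, condition, status, reason).
type Recorder interface {
	IncStatusTransition(controller, condition, status, reason string)
}

// NoopRecorder discards every call, which keeps unit tests free of Prometheus.
type NoopRecorder struct{}

func (NoopRecorder) IncStatusTransition(controller, condition, status, reason string) {}

// countingRecorder is a hypothetical in-memory adapter that counts calls per
// label combination, standing in for a real Prometheus counter.
type countingRecorder struct {
	transitions map[string]int
}

func newCountingRecorder() *countingRecorder {
	return &countingRecorder{transitions: make(map[string]int)}
}

func (r *countingRecorder) IncStatusTransition(controller, condition, status, reason string) {
	r.transitions[controller+"/"+condition+"/"+status+"/"+reason]++
}

// markReady stands in for domain code: it sees only the Recorder port.
// The label values are illustrative, not taken from the operator.
func markReady(rec Recorder) {
	rec.IncStatusTransition("postgrescluster", "Ready", "True", "ClusterReady")
}

func main() {
	markReady(NoopRecorder{}) // nothing is recorded

	rec := newCountingRecorder()
	markReady(rec)
	fmt.Println(rec.transitions)
}
```

Because markReady accepts the interface, swapping the adapter needs no change in the calling code.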

Three-layer metrics collection:

| Layer | What it covers | How |
| --- | --- | --- |
| Controller-runtime (free) | Reconcile count, duration, errors | Automatic: controller_runtime_reconcile_total, _time_seconds, _errors_total |
| Fleet collector | Clusters/databases by phase, managed users, poolers | FleetCollector lists CRs from cache and sets gauges |
| Status-driven | Business-logic transitions | IncStatusTransition() called automatically inside persistStatus/setStatus, with zero manual metric calls in service code |
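
The fleet-collector layer can be pictured as "list everything, then overwrite the gauges". The sketch below uses only the standard library; the database type, the gaugeSet map, the collectDatabases function, and the phase names are hypothetical stand-ins, not the FleetCollector from this PR.

```go
package main

import "fmt"

// database is a hypothetical stand-in for a custom resource read from the cache.
type database struct {
	Name  string
	Phase string
}

// gaugeSet is a hypothetical stand-in for a gauge family labelled by phase.
type gaugeSet map[string]float64

// collectDatabases recomputes the gauge from scratch. Clearing first matters:
// a phase that no longer has any members must disappear instead of keeping
// its last value.
func collectDatabases(g gaugeSet, cached []database) {
	for phase := range g {
		delete(g, phase)
	}
	for _, db := range cached {
		g[db.Phase]++
	}
}

func main() {
	g := gaugeSet{}

	collectDatabases(g, []database{
		{Name: "orders", Phase: "Ready"},
		{Name: "billing", Phase: "Ready"},
		{Name: "audit", Phase: "Pending"},
	})
	fmt.Println(g)

	// After the next reconcile only one database remains in the cache.
	collectDatabases(g, []database{{Name: "orders", Phase: "Ready"}})
	fmt.Println(g)
}
```

Recomputing from the full list keeps the gauges correct even if an individual event was missed, at the cost of one pass over the cached objects per reconcile.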

Custom metrics (6 families):

| Metric | Type | Description |
| --- | --- | --- |
| status_transitions_total | Counter | Status condition transitions by controller, condition, status, reason |
| clusters | Gauge | Clusters by phase and pooler status |
| databases | Gauge | Databases by phase |
| managed_users | Gauge | User counts by state (desired/reconciled/pending/failed) |
| poolers | Gauge | PgBouncer poolers by type and state |
| pooler_instances | Gauge | Pooler instance count |

Design decisions:

  • No custom reconcile-level metrics — controller-runtime provides these out of the box
  • Status-driven metric emission — IncStatusTransition is called inside persistStatus/setStatus, so every condition write is automatically captured with no explicit calls scattered through service code
  • Low-cardinality labels only (controller, condition, status, reason, phase) — no per-resource name/namespace labels
  • Hexagonal port enables testability (NoopRecorder) and adapter swappability
  • Existing pkg/splunk/client/metrics/ is untouched
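
To illustrate the status-driven design decision, here is a rough sketch of a status helper that records the metric as part of every condition write. The Condition type, the transitionLog adapter, the setStatus signature, and the reason strings are assumptions for illustration and do not reproduce the operator's actual setStatus/persistStatus.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical status condition.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// Recorder mirrors the port described above; the parameter list is assumed.
type Recorder interface {
	IncStatusTransition(controller, condition, status, reason string)
}

// transitionLog is a hypothetical adapter that keeps a readable event list.
type transitionLog struct {
	events []string
}

func (l *transitionLog) IncStatusTransition(controller, condition, status, reason string) {
	l.events = append(l.events, controller+" "+condition+"="+status+" ("+reason+")")
}

// setStatus is the single place where conditions are written, so it is also
// the single place where the metric is emitted. Callers never touch metrics.
func setStatus(rec Recorder, controller string, conds []Condition, next Condition) []Condition {
	rec.IncStatusTransition(controller, next.Type, next.Status, next.Reason)
	for i := range conds {
		if conds[i].Type == next.Type {
			conds[i] = next // overwrite the existing condition of this type
			return conds
		}
	}
	return append(conds, next)
}

func main() {
	rec := &transitionLog{}
	var conds []Condition

	conds = setStatus(rec, "postgresdatabase", conds,
		Condition{Type: "Ready", Status: "False", Reason: "RolePatchFailed"})
	conds = setStatus(rec, "postgresdatabase", conds,
		Condition{Type: "Ready", Status: "True", Reason: "Reconciled"})

	fmt.Println(len(conds), rec.events)
}
```

All label values stay within a small fixed vocabulary, which is what keeps cardinality low; resource names and namespaces never reach the recorder.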

Key Changes

  • pkg/postgresql/metrics/ — new package: port interface, Prometheus adapter, noop adapter, fleet collector
  • pkg/postgresql/cluster/core/cluster.go: setStatus, syncPoolerStatus, syncStatus now accept Recorder and emit IncStatusTransition automatically
  • pkg/postgresql/database/core/database.go: persistStatus now accepts Recorder and emits IncStatusTransition automatically. Also adds 2 missing updateStatus calls on error paths (role patch failure, database reconcile failure)
  • pkg/postgresql/{cluster,database}/core/types.go: Metrics pgmetrics.Recorder field added to ReconcileContext
  • internal/controller/postgres{cluster,database}_controller.go — inject Metrics into ReconcileContext, call fleet collector after each reconcile
  • cmd/main.go — create PrometheusRecorder, register with controller-runtime metrics registry, pass to controllers
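
The wiring in the last three items might look roughly like the sketch below. ReconcileContext and its Metrics field are named in this PR; the controller struct, the collectFleet hook, the reconcile method, and the choice to run the collector after failed reconciles as well are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Recorder mirrors the port; the parameter list is assumed.
type Recorder interface {
	IncStatusTransition(controller, condition, status, reason string)
}

// NoopRecorder discards every call.
type NoopRecorder struct{}

func (NoopRecorder) IncStatusTransition(controller, condition, status, reason string) {}

// ReconcileContext carries the injected recorder down to the core package.
type ReconcileContext struct {
	Metrics Recorder
}

// controller is a hypothetical stand-in for a controller-runtime reconciler.
type controller struct {
	metrics      Recorder
	collectFleet func() // hypothetical hook standing in for the fleet collector
}

// errRolePatch is an illustrative failure returned by the core logic.
var errRolePatch = errors.New("role patch failed")

// reconcile builds the context, runs the core logic, then refreshes the fleet
// gauges. Running the collector on failures too is an assumption.
func (c *controller) reconcile(run func(rc *ReconcileContext) error) error {
	rc := &ReconcileContext{Metrics: c.metrics}
	err := run(rc)
	c.collectFleet()
	return err
}

func main() {
	collections := 0
	c := &controller{
		metrics:      NoopRecorder{},
		collectFleet: func() { collections++ },
	}

	_ = c.reconcile(func(rc *ReconcileContext) error { return nil })
	err := c.reconcile(func(rc *ReconcileContext) error { return errRolePatch })

	fmt.Println(collections, err)
}
```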

Testing and Verification

Setting up Grafana + Prometheus on KIND

1. Install the monitoring stack

kubectl create namespace monitoring

# Add helm repos
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set grafana.adminPassword=admin \
  --set alertmanager.enabled=false \
  --set kubeStateMetrics.enabled=false \
  --set nodeExporter.enabled=false

2. Grant Prometheus access to scrape the operator metrics

kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-splunk-operator-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: splunk-operator-metrics-reader
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
EOF

3. Create a ServiceMonitor

kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: splunk-operator-postgres
  namespace: monitoring
  labels:
    release: kube-prometheus
spec:
  namespaceSelector:
    matchNames:
    - splunk-operator
  selector:
    matchLabels:
      control-plane: controller-manager
  endpoints:
  - port: metric
    path: /metrics
    interval: 5s
    bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
EOF

4. Access Grafana

kubectl port-forward svc/kube-prometheus-grafana -n monitoring 3000:80

Open http://localhost:3000 — login: admin / admin

The Prometheus datasource is auto-configured. Query any metric with the splunk_operator_postgres_ prefix.

5. Example PromQL queries

# Status transitions by controller and reason (error signals)
sum by (controller, reason) (rate(splunk_operator_postgres_status_transitions_total{status="False"}[5m]))

# Databases by phase
splunk_operator_postgres_databases

# Clusters by phase
splunk_operator_postgres_clusters

# Managed users overview
splunk_operator_postgres_managed_users

# Reconcile rate by controller (controller-runtime automatic)
rate(controller_runtime_reconcile_total{controller=~"postgres.*"}[5m])

# p99 latency per controller (controller-runtime automatic)
histogram_quantile(0.99, sum by (controller, le) (rate(controller_runtime_reconcile_time_seconds_bucket{controller=~"postgres.*"}[5m])))

Related Issues

Jira tickets, GitHub issues, Support tickets...

PR Checklist

  • Code changes adhere to the project's coding standards.
  • Relevant unit and integration tests are included.
  • Documentation has been updated accordingly.
  • All tests pass locally.
  • The PR description follows the project's guidelines.


github-actions bot commented Apr 2, 2026

CLA Assistant Lite bot:
Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contribution License Agreement before we can accept your contribution. You can sign the CLA by just posting a Pull Request Comment with the exact sentence copied from below.


I have read the CLA Document and I hereby sign the CLA


You can retrigger this bot by commenting recheck in this Pull Request

@limak9182 limak9182 changed the title from "metrics" to "postgres controllers metrics" Apr 2, 2026
@limak9182 limak9182 marked this pull request as ready for review April 10, 2026 11:48
@limak9182 limak9182 merged commit 67a0ed2 into feature/database-controllers Apr 10, 2026
13 of 29 checks passed
@limak9182 limak9182 deleted the feature/postgres-controllers-metrics branch April 10, 2026 11:49
@github-actions github-actions bot locked and limited conversation to collaborators Apr 10, 2026

4 participants