
managing osmo configs with k8s configmap #822

Open
vvnpn-nv wants to merge 24 commits into main from vpan/configs-to-k8s-configmap

Conversation

@vvnpn-nv
Contributor

@vvnpn-nv vvnpn-nv commented Apr 7, 2026

Description

Add ConfigMap-sourced dynamic configuration to OSMO, enabling GitOps workflows where service configs are defined declaratively in Helm values and applied automatically — replacing the current imperative osmo config update CLI workflow for initial setup and config enforcement.

What this does

  • Declarative config via Helm values: Service settings, workflow limits, pools, pod templates, resource validations, backends, dataset buckets, and roles can all be defined in Helm values and applied automatically on service startup via a Kubernetes ConfigMap
  • Two management modes: seed (apply once, CLI changes persist) and configmap (ConfigMap is source of truth, CLI writes rejected with 409)
  • Write protection: All write paths — single-item endpoints, bulk endpoints, rollback — reject modifications to configmap-managed configs with HTTP 409 Conflict. The loader's own writes bypass the guard via a dedicated configmap-sync username
  • Config file watcher: Background polling thread (30s) detects ConfigMap file changes and re-applies configs without pod restart. Drift reconciliation for singleton configs detects and corrects DB-side changes
  • Secret handling: Dataset bucket credentials injected via K8s Secrets (never in ConfigMap or Git), with credentialSecretName for simplified Helm wiring
  • Visibility: GET endpoints include _managed_by field so users and tooling can see which configs are managed by ConfigMap
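
To make the two management modes concrete, here is a minimal sketch of what a parsed `managed_configs` section might look like and how a `managed_by` value could be validated. All names here (`PARSED_CONFIG`, `parse_managed_by`, the section keys) are illustrative assumptions, not the actual OSMO schema:

```python
# Hypothetical shape of the managed_configs section after yaml.safe_load()
# parses the mounted ConfigMap file. Keys are illustrative only.
PARSED_CONFIG = {
    'managed_configs': {
        'service': {'managed_by': 'configmap', 'values': {'log_level': 'INFO'}},
        'pools':   {'managed_by': 'seed', 'items': [{'name': 'default-pool'}]},
    },
}

VALID_MODES = ('seed', 'configmap')

def parse_managed_by(section: dict) -> str:
    """Return the management mode for one config section.

    'seed'      -> apply once; later CLI changes persist.
    'configmap' -> ConfigMap is the source of truth; CLI writes get 409.
    Defaults to 'seed' when the key is absent.
    """
    mode = section.get('managed_by', 'seed')
    if mode not in VALID_MODES:
        raise ValueError(f'unknown managed_by mode: {mode!r}')
    return mode
```

The default-to-seed behavior matches the commit notes below, where seed is described as the default mode.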

Issue - None

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Summary by CodeRabbit

  • New Features

    • ConfigMap-driven dynamic configuration: ConfigMap + Secrets rendering, pod mounts for credential files, checksum-driven restarts, CLI flag and poll interval to enable a dynamic config file, multi-replica-safe watcher with drift reconciliation, and _managed_by metadata surfaced in API (edits rejected with HTTP 409 for managed items).
  • Documentation

    • Added operational guide describing design, rollout, secret handling, and reconciliation semantics.
  • Tests

    • New unit, integration, and E2E tests exercising loader, watcher, reconciliation, error resilience, and multi-replica behavior.

vvnpn-nv and others added 11 commits March 25, 2026 17:12
Add support for loading dynamic configs from a Kubernetes ConfigMap on
service startup, enabling GitOps workflows where configs are version-
controlled in Helm values and applied automatically via ArgoCD.

Two managed_by modes per config type:
- seed (default): only apply if config doesn't exist in DB
- configmap: always overwrite DB from ConfigMap on startup

Includes:
- configmap_loader.py: core loader with dependency ordering, advisory
  lock for multi-replica safety, secret file resolution for dataset
  credentials, and per-type error isolation
- dynamic_config_file field on WorkflowServiceConfig (CLI + env var)
- Helm chart: ConfigMap template, Secret template for credentials,
  volume mounts, checksum annotation for pod restart on config change

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix crash when managed_configs is None (empty YAML section)
- Fix partial state corruption in _resolve_dataset_secret_files by
  validating secret file content before mutating credential dict
- Add configmap_loader.py and pyyaml dep to BUILD file
- Add type annotations for apply_function, model_class params
- Chain exception in _parse_managed_by (raise from error)
- Use RETURNING clause in _insert_backend to skip history entry on
  conflict instead of always creating one
- Add schema reference comment to _insert_backend
- Warn on unknown keys in managed_configs
- Update copyright years to 2025-2026

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add unit tests (20 tests) and integration tests (18 tests) for the
configmap_loader module. Unit tests cover file handling, YAML parsing,
managed_by mode parsing, secret file resolution, safe_apply error
isolation, advisory lock behavior, and unknown key warnings.
Integration tests verify actual DB interactions including seed vs
configmap mode for all config types, dependency ordering, per-item
error isolation, end-to-end loading, and config history entries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Advisory lock: session-scoped -> transaction-scoped (pg_try_advisory_xact_lock)
  to prevent lock leaks on process kill
- Seed mode check: use config_history table instead of exclude_unset
  to correctly detect if config was ever explicitly set
- Config watcher: background polling thread (30s default) with SHA-256
  hash comparison detects ConfigMap file changes without pod restart
- Roles: pre-construct RolePolicy objects for pydantic v1 compatibility
- Dataset: auto-populate credential endpoint from bucket dataset_path
- Credential logging: removed secret file path from log messages
- Tests: fixed fetch_from_db return types, semantic action format,
  advisory lock assertions, unused imports, mypy annotations
- E2E: added test values overlay and dynamic config test file

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Consolidate _apply_service_config, _apply_workflow_config, _apply_dataset_config
  into single _apply_singleton_config with config_type parameter and optional
  pre_apply hook (dataset uses it for secret resolution)
- Fix advisory lock: revert to session-scoped pg_try_advisory_lock with explicit
  unlock in finally block. Transaction-scoped xact lock was released immediately
  by execute_fetch_command's auto-commit, providing zero mutual exclusion.
- Replace _backend_exists with _named_config_exists (eliminates duplicate)
- Use SELECT 1 LIMIT 1 instead of COUNT(*) in _singleton_config_exists
- Remove duplicate _DEFAULT_POLL_INTERVAL_SECONDS constant
- Add WHY comment for endpoint defaulting from dataset_path
- Update unit tests for session lock assertions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
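The session-scoped advisory lock pattern described above can be sketched like this. Deriving the 64-bit key from a name via SHA-256 is an assumption for illustration (OSMO's actual key derivation may differ); `pg_try_advisory_lock`/`pg_advisory_unlock` are real PostgreSQL functions:

```python
import hashlib

def advisory_lock_key(name: str) -> int:
    """Derive a stable signed 64-bit key for pg_try_advisory_lock() from a
    lock name like 'configmap-loader'. Hypothetical helper."""
    digest = hashlib.sha256(name.encode()).digest()
    return int.from_bytes(digest[:8], 'big', signed=True)

# Usage pattern from the commit above (session-scoped lock, explicit
# unlock in finally), sketched against a generic DB-API connection `conn`:
#
#     key = advisory_lock_key('configmap-loader')
#     cur = conn.cursor()
#     cur.execute('SELECT pg_try_advisory_lock(%s)', (key,))
#     got_lock, = cur.fetchone()
#     if got_lock:
#         try:
#             apply_configs()   # hypothetical
#         finally:
#             cur.execute('SELECT pg_advisory_unlock(%s)', (key,))
```

The explicit unlock matters because, as the commit notes, a transaction-scoped `pg_try_advisory_xact_lock` is released as soon as the connector auto-commits, which provides no mutual exclusion across the multi-step apply.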
The ConfigMapWatcher now has two-tier polling:

1. File change detection (existing): SHA-256 hash comparison on the
   mounted ConfigMap file. When the file changes, re-apply everything.

2. Drift reconciliation (new): for managed_by=configmap singleton
   configs (service, workflow, dataset), compare the last-applied
   ConfigMap values against current DB state on each poll cycle. If
   someone modified a config via CLI, the watcher detects the drift
   and re-applies the ConfigMap values within one poll interval.

Key design decisions:
- Only reconciles singleton configs (not named configs like pools)
  to limit DB query load per cycle
- Compares desired values against DB BEFORE calling patch_configs,
  so no config_history entries are created when there's no drift
- Uses a separate advisory lock (configmap-reconcile) to prevent
  multiple replicas from correcting the same drift simultaneously
- Resolves secret_file references in the cached config so dataset
  drift detection compares resolved credentials, not file paths
- Seed-mode configs are never reconciled (by design)

Also refactors module-level start_config_watcher + global state
into a ConfigMapWatcher class with clean instance state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
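The "compare before patch" step above can be sketched as a pure comparison. These helpers are hypothetical (not the actual watcher code); they show why no `config_history` entry is written when there is no drift, since the patch call is skipped entirely:

```python
def detect_drift(desired: dict, current: dict) -> dict:
    """Return the subset of desired keys whose current (DB) value differs.
    An empty result means no drift."""
    return {k: v for k, v in desired.items() if current.get(k) != v}

def reconcile(desired: dict, current: dict, patch) -> bool:
    """Call patch (e.g. a patch_configs-style function) only when drift
    is found; return whether a correction was applied."""
    drift = detect_drift(desired, current)
    if drift:
        patch(drift)
        return True
    return False
```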
…, credentialSecretName

Improvement 1: Management mode visibility
- Config GET endpoints (service, workflow, dataset) include _managed_by field
  in responses ('configmap', 'seed', or absent)
- Managed modes persisted to configmap_state table on startup
- get_managed_mode() and get_cached_section() exposed for cross-module access

Improvement 2: Immediate re-apply after CLI write
- When patch_configs writes to a configmap-managed singleton config, the
  ConfigMap values are immediately re-applied (inside helpers.py)
- Only triggers for non-configmap-sync writers (prevents infinite loops)
- Eliminates the need to wait for 30s drift reconciliation poll

Improvement 3: Simplified secret wiring (credentialSecretName)
- Dataset buckets can now use credentialSecretName instead of secret_file
- Helm template auto-generates volume + volumeMount for referenced secrets
- Loader resolves credentialSecretName to /etc/osmo/secrets/<name>/cred.yaml
- No manual extraVolumes/extraVolumeMounts needed

Also adds configmap_state table to postgres.py and test schema.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
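The `credentialSecretName` resolution described above (mapping a secret name to the mount path `/etc/osmo/secrets/<name>/cred.yaml`) might look like this. Field names are illustrative; only the path convention comes from the commit message:

```python
def resolve_credential_secret(bucket: dict) -> dict:
    """If a bucket declares credentialSecretName, rewrite it to the
    secret_file path where the Helm chart mounts that Secret.
    Hypothetical sketch; not the actual loader function."""
    bucket = dict(bucket)  # copy so the caller's dict is not mutated
    secret_name = bucket.pop('credentialSecretName', None)
    if secret_name:
        bucket['secret_file'] = f'/etc/osmo/secrets/{secret_name}/cred.yaml'
    return bucket
```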
Replace accept-then-revert pattern with 409 rejection for all config
types. When a user tries to modify a config managed by ConfigMap in
configmap mode, the API returns 409 Conflict with a clear error message
instead of silently reverting after the write.

Singleton configs (service, workflow, dataset):
- PUT/PATCH endpoints reject with 409 before any DB write
- Removed the accept-then-revert hook from helpers.py patch_configs()

Named configs (pools, pod_templates, backends, roles, etc.):
- 19 single-item PUT/PATCH/DELETE endpoints reject with 409
- Bulk PUT endpoints (put_pools, put_pod_templates, etc.) are NOT
  guarded because the configmap_loader uses them internally

Benefits over accept-then-revert:
- Zero race conditions (no concurrent re-apply with watcher)
- Zero duplicate history entries (write never happens)
- Zero wasted side effects (no backend queue syncs, K8s updates)
- Clear API feedback ("update Helm values instead")

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
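The reject-before-write behavior, including the `configmap-sync` bypass, can be sketched as a guard function. This is a simplified stand-in for the real `configmap_guard` module (`ManagedConfigError` substitutes for the actual 409 error type):

```python
CONFIGMAP_SYNC_USERNAME = 'configmap-sync'  # bypass username from the PR description

class ManagedConfigError(Exception):
    """Stand-in for the HTTP 409 Conflict error the API raises."""
    status_code = 409

def reject_if_managed(managed: dict, config_key: str, username: str) -> None:
    """Reject writes to configmap-managed configs before any DB write.
    The loader's own writes (username 'configmap-sync') bypass the guard."""
    if username == CONFIGMAP_SYNC_USERNAME:
        return
    if managed.get(config_key) == 'configmap':
        raise ManagedConfigError(
            f"'{config_key}' is managed by ConfigMap; update Helm values instead")
```

Raising before the write is what delivers the benefits listed above: no revert race, no spurious history entries, and no wasted side effects.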
…all write paths

Extract 409 guard logic into configmap_guard.py — a standalone module with
no imports from helpers.py or configmap_loader.py. This eliminates the
circular import (helpers -> configmap_loader -> helpers) that required
import-outside-toplevel violations.

Architecture:
- configmap_guard.py: holds managed config state, guard functions, constants
- configmap_loader.py: delegates guard functions to configmap_guard, pushes
  state via set_managed_configs() on load_and_apply()
- helpers.py: imports configmap_guard at top level (no circular import)
- config_service.py: imports configmap_guard for _add_managed_by and named
  item guards

Coverage:
- Singletons: guarded in helpers.put_configs() and helpers.patch_configs()
  (covers all singletons including rollback)
- Named items: guarded in all 19 single-item endpoints + 6 bulk endpoints
  + rollback endpoint in config_service.py
- configmap-sync bypass: reject_if_managed() checks username and skips
  for the loader's own writes

Also:
- Eliminated triple file read in load_and_apply() (reads once, computes
  hash + parses YAML from same bytes)
- Removed dead get_configmap_state() from postgres.py
- Updated drift test to expect 409 rejection instead of accept-then-revert

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Removed debug print(test_configs) from helpers.py
- Removed unused re-export delegations from configmap_loader.py
  (get_managed_mode, get_cached_section, etc. — these are accessed
  directly from configmap_guard, not via configmap_loader)
- Removed stale comment about circular imports

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…aged

The rollback_config endpoint still called configmap_loader.reject_if_managed
which was removed when extracting to configmap_guard. Fixed to use
configmap_guard.reject_if_managed. Also removed unused configmap_loader
import from config_service.py.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@vvnpn-nv vvnpn-nv requested a review from a team as a code owner April 7, 2026 00:16
Covers architecture, management modes, 409 rejection, drift
reconciliation, credentialSecretName, and full E2E test results
from live dev instance validation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@vvnpn-nv vvnpn-nv force-pushed the vpan/configs-to-k8s-configmap branch from 10e7db7 to 975635e on April 7, 2026 16:49
@coderabbitai

coderabbitai bot commented Apr 7, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


Adds a ConfigMap-driven dynamic configuration system: Helm values/templates for a dynamic ConfigMap and Secrets, pod mounts and checksum annotation, a ConfigMap loader/watcher with advisory-locked DB application and drift reconciliation, write-protection guards (HTTP 409), service startup integration, tests, and documentation.

Changes

Cohort / File(s) Summary
Helm templates & values
deployments/charts/service/templates/api-service.yaml, deployments/charts/service/templates/dynamic-config.yaml, deployments/charts/service/templates/dynamic-config-secrets.yaml, deployments/charts/service/templates/_helpers.tpl, deployments/charts/service/values.yaml
Add dynamicConfig values and helpers; render <service>-dynamic-config ConfigMap and per-secret Secrets; helper templates to resolve/collect secretNames; mount ConfigMap and secret files into pods; add checksum pod annotation and CLI arg for dynamic config file; optional secret mounts for discovered secret names.
Config loader & watcher
src/service/core/config/configmap_loader.py
New loader and ConfigMapWatcher: parse managed_configs, resolve mounted secret files, acquire PostgreSQL advisory locks, apply configs in dependency order, persist managed state, poll for changes, and reconcile drift for configmap singletons.
Guard & enforcement
src/service/core/config/configmap_guard.py, src/service/core/config/helpers.py, src/service/core/config/config_service.py
New guard caching managed state with reject_if_managed() to enforce 409 on protected writes; helpers/handlers attach _managed_by metadata to GETs and call guard prior to mutable operations; rollback endpoint rejects managed singletons.
Service integration & config objects
src/service/core/service.py, src/service/core/workflow/objects.py
Wire ConfigMapWatcher startup when dynamic_config_file set; add config fields dynamic_config_file and dynamic_config_poll_interval with CLI/env bindings.
Postgres state & connector
src/utils/connectors/postgres.py, src/tests/common/database/testdata/schema.sql
Create configmap_state table in DB init; add PostgresConnector.set_configmap_state() upsert; add table to test DB schema.
Build / deps / tests targets
src/service/core/config/BUILD, src/service/core/config/tests/BUILD
Add new Python sources (configmap_guard.py, configmap_loader.py) to Bazel target; add pyyaml dependency; add unit and integration test targets for configmap loader.
Tests (unit, integration, E2E)
src/service/core/config/tests/test_configmap_loader_unit.py, src/service/core/config/tests/test_configmap_loader_integration.py, run/e2e/test_dynamic_config.py
Add comprehensive unit/integration/E2E tests covering loader behavior, secret resolution, advisory locks, seed vs configmap semantics, partial-failure isolation, drift reconciliation, and audit-tagging.
Test values / overlays
run/minimal/osmo_dynamic_config_values.yaml, run/minimal/osmo_values.yaml
Add Helm overlay enabling dynamicConfig for E2E with example backends/pools/podTemplates/resourceValidations and set sidecars.podMonitor.enabled: false.
Documentation
projects/configmap-dynamic-config.md
Add design doc describing managed_by modes, advisory-lock coordination, drift reconciliation, write-protection, secret handling, audit tagging (configmap-sync), and test plan.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Pod as Pod Startup
    participant Watcher as ConfigMapWatcher
    participant File as Dynamic Config File
    participant Lock as PostgreSQL Advisory Lock
    participant DB as PostgreSQL
    participant History as Config History

    Pod->>Watcher: start()
    Watcher->>File: read & parse config.yaml
    File-->>Watcher: parsed managed_configs
    Watcher->>Lock: pg_try_advisory_lock()
    Lock-->>Watcher: lock acquired
    Watcher->>DB: apply configs (resource validations → backends → pools → singletons)
    DB->>History: insert config_history (username: configmap-sync, tags: configmap)
    History-->>DB: audit recorded
    DB-->>Watcher: apply result
    Watcher->>Lock: pg_advisory_unlock()
    Watcher->>Watcher: cache hash & managed state
    Watcher->>Watcher: start polling loop

    loop every poll_interval
        Watcher->>File: compute hash
        alt hash changed
            Watcher->>Lock: pg_try_advisory_lock()
            Watcher->>DB: reload & apply all configs
            Watcher->>Lock: pg_advisory_unlock()
        else hash unchanged
            Watcher->>DB: reconcile drift for configmap singletons
        end
    end
```
```mermaid
sequenceDiagram
    participant Client as API Client
    participant Handler as Config Endpoint Handler
    participant Guard as configmap_guard
    participant DB as PostgreSQL
    participant Error as Error Response

    Client->>Handler: modify config (PATCH/PUT/DELETE)
    Handler->>Guard: reject_if_managed(config_key, username, item_name)
    Guard->>Guard: check username == "configmap-sync"?
    alt username != "configmap-sync" and config is managed
        Guard-->>Handler: raise OSMOUserError(409)
        Handler-->>Client: HTTP 409 Conflict
    else allowed
        Handler->>DB: persist change
        DB-->>Handler: success
        Handler-->>Client: HTTP 200 OK
    end
```

Estimated Code Review Effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

🐰 I nibble on YAML at dawn and cheer,

Secrets tucked in files, ConfigMaps near.
Locks keep one hopper writing true,
I seed and sync and guard for you.
Hop—configs applied, then off I steer.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: introducing ConfigMap-sourced dynamic configuration management for OSMO. It accurately reflects the primary objective of the pull request. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 89.86% which is sufficient. The required threshold is 80.00%. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (5)
projects/configmap-dynamic-config.md (1)

62-100: Consider adding a language identifier to the fenced code block.

The ASCII diagram block would benefit from a text or plaintext language identifier to satisfy markdown linters and improve rendering consistency.

Suggested fix
-```
+```text
 ┌──────────────────────────────────────────────────────────────────────────┐
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/configmap-dynamic-config.md` around lines 62 - 100, The fenced ASCII
diagram block uses triple backticks without a language; update the markdown by
changing the opening fence from ``` to a labeled fence such as ```text or
```plaintext so linters/renderers treat it as plain text; locate the diagram's
opening ``` (the ASCII box beginning with "┌─────────────────") and add the
language identifier there, leaving the closing ``` unchanged.
run/minimal/osmo_dynamic_config_values.yaml (1)

1-1: Remove Python-specific pylint comment from YAML file.

The # pylint: disable=line-too-long comment is only meaningful for Python files. In YAML files, it's just a regular comment with no effect.

Suggested fix
-# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.  # pylint: disable=line-too-long
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@run/minimal/osmo_dynamic_config_values.yaml` at line 1, Remove the
Python-specific pylint directive by deleting the trailing " # pylint:
disable=line-too-long" from the YAML header line (i.e., remove the comment
string "# pylint: disable=line-too-long") so the SPDX header remains but the
file no longer contains a Python-only lint directive inappropriate for YAML.
run/e2e/test_dynamic_config.py (1)

38-39: Consider documenting the relationship with source constants.

These constants duplicate values defined in src/service/core/config/configmap_guard.py. While E2E tests may need to be standalone, consider adding a comment noting where the canonical definitions live to help maintainers keep them in sync.

Note: The guard defines CONFIGMAP_SYNC_TAGS (plural, list) while this test uses CONFIGMAP_SYNC_TAG (singular, string). Both work for the test's purpose but the naming difference could cause confusion.

📝 Optional: Add reference comment
+# Canonical definitions in src/service/core/config/configmap_guard.py
+# Keep in sync if those values change.
 CONFIGMAP_SYNC_USERNAME = 'configmap-sync'
-CONFIGMAP_SYNC_TAG = 'configmap'
+CONFIGMAP_SYNC_TAG = 'configmap'  # Note: guard uses CONFIGMAP_SYNC_TAGS = ['configmap']
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@run/e2e/test_dynamic_config.py` around lines 38 - 39, Add a short comment
above CONFIGMAP_SYNC_USERNAME and CONFIGMAP_SYNC_TAG noting they duplicate the
canonical definitions in src/service/core/config/configmap_guard.py (which
defines CONFIGMAP_SYNC_TAGS as a list) so maintainers know to keep values in
sync and to explain the singular vs plural naming difference; update the comment
to reference CONFIGMAP_SYNC_TAGS by name and clarify that the test intentionally
uses a single string CONFIGMAP_SYNC_TAG.
src/service/core/config/configmap_guard.py (1)

27-42: Thread safety consideration for module-level state.

The _managed_configs dict is written by the background ConfigMapWatcher thread and read by request-handling threads. While the reference assignment is atomic under CPython's GIL, consider adding a brief comment noting this design relies on:

  1. Reference assignment atomicity
  2. The watcher creating a fresh dict from yaml.safe_load each reload (no in-place mutation of the shared dict)

This is acceptable for the current implementation but worth documenting for future maintainers.

📝 Optional: Add thread-safety comment
-# Module-level reference to managed config state, set by ConfigMapWatcher.start()
+# Module-level reference to managed config state, set by ConfigMapWatcher.start().
+# Thread safety: The watcher thread replaces this reference atomically (CPython GIL).
+# Each reload creates a fresh dict from yaml.safe_load, avoiding in-place mutation.
 _managed_configs: Dict[str, Any] | None = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/service/core/config/configmap_guard.py` around lines 27 - 42, Add a short
thread-safety comment near the module-level _managed_configs and
set_managed_configs explaining that readers in request threads rely on CPython's
GIL to make the reference assignment in set_managed_configs atomic, and that
ConfigMapWatcher must replace the entire managed_configs object (e.g., by
calling yaml.safe_load to create a fresh dict) rather than mutating it in-place;
mention that this design choice ensures safe concurrent reads and should be
preserved if the implementation changes.
src/service/core/config/tests/test_configmap_loader_integration.py (1)

40-50: Consider using context manager for cleaner temp file handling.

The current approach works correctly but could be slightly cleaner using a context manager pattern that still allows file persistence for the test duration.

♻️ Optional: context manager pattern
     def _write_config_file(self, config: Dict[str, Any]) -> str:
         """Write a config dict to a temp YAML file and return its path."""
-        temp_file = tempfile.NamedTemporaryFile(
-            mode='w', suffix='.yaml', delete=False)
-        yaml.dump(config, temp_file)
-        temp_file.close()
-        return temp_file.name
+        with tempfile.NamedTemporaryFile(
+                mode='w', suffix='.yaml', delete=False) as temp_file:
+            yaml.dump(config, temp_file)
+            return temp_file.name
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/service/core/config/tests/test_configmap_loader_integration.py` around
lines 40 - 50, The helper _write_config_file currently creates and manually
closes a tempfile; refactor it to use a context manager (with
tempfile.NamedTemporaryFile(..., delete=False) as temp_file:) so the file is
automatically closed while still persisting for the test, remove the explicit
temp_file.close() call, and keep _cleanup_file (which uses os.unlink) unchanged;
update references to _write_config_file and _cleanup_file accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/service/core/config/BUILD`:
- Around line 28-29: Update the AGENTS.md entry for the config/ module to
explicitly mention the new ConfigMap components by name: reference
configmap_loader.py as the dynamic ConfigMap-based configuration loader and
configmap_guard.py as the runtime guard/validator for those configs, and state
that the module now supports dynamic configuration sourced from Kubernetes
ConfigMaps; ensure the description also briefly notes how these interact with
existing config modules (e.g., loading, validation, and reloading behavior).

In `@src/service/core/service.py`:
- Around line 469-474: The call to watcher.start() can raise (e.g., DB
exceptions during initial sync) and should not abort service startup; wrap the
ConfigMapWatcher creation/startup (the configmap_loader.ConfigMapWatcher
instantiation and watcher.start() call) in a try/except that catches exceptions
(at least broad Exception), logs the failure including the exception and context
(use the existing logger), and allows the service to continue without the
watcher (or mark watcher as None) while preserving postgres and normal startup
flow; ensure the code references config.dynamic_config_file and
config.dynamic_config_poll_interval and that postgres is not closed on watcher
errors.

---

Nitpick comments:
In `@projects/configmap-dynamic-config.md`:
- Around line 62-100: The fenced ASCII diagram block uses triple backticks
without a language; update the markdown by changing the opening fence from ```
to a labeled fence such as ```text or ```plaintext so linters/renderers treat it
as plain text; locate the diagram's opening ``` (the ASCII box beginning with
"┌─────────────────") and add the language identifier there, leaving the closing
``` unchanged.

In `@run/e2e/test_dynamic_config.py`:
- Around line 38-39: Add a short comment above CONFIGMAP_SYNC_USERNAME and
CONFIGMAP_SYNC_TAG noting they duplicate the canonical definitions in
src/service/core/config/configmap_guard.py (which defines CONFIGMAP_SYNC_TAGS as
a list) so maintainers know to keep values in sync and to explain the singular
vs plural naming difference; update the comment to reference CONFIGMAP_SYNC_TAGS
by name and clarify that the test intentionally uses a single string
CONFIGMAP_SYNC_TAG.

In `@run/minimal/osmo_dynamic_config_values.yaml`:
- Line 1: Remove the Python-specific pylint directive by deleting the trailing "
# pylint: disable=line-too-long" from the YAML header line (i.e., remove the
comment string "# pylint: disable=line-too-long") so the SPDX header remains but
the file no longer contains a Python-only lint directive inappropriate for YAML.

In `@src/service/core/config/configmap_guard.py`:
- Around line 27-42: Add a short thread-safety comment near the module-level
_managed_configs and set_managed_configs explaining that readers in request
threads rely on CPython's GIL to make the reference assignment in
set_managed_configs atomic, and that ConfigMapWatcher must replace the entire
managed_configs object (e.g., by calling yaml.safe_load to create a fresh dict)
rather than mutating it in-place; mention that this design choice ensures safe
concurrent reads and should be preserved if the implementation changes.

In `@src/service/core/config/tests/test_configmap_loader_integration.py`:
- Around line 40-50: The helper _write_config_file currently creates and
manually closes a tempfile; refactor it to use a context manager (with
tempfile.NamedTemporaryFile(..., delete=False) as temp_file:) so the file is
automatically closed while still persisting for the test, remove the explicit
temp_file.close() call, and keep _cleanup_file (which uses os.unlink) unchanged;
update references to _write_config_file and _cleanup_file accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f700d275-afb8-4c85-a775-4afa1b0036bd

📥 Commits

Reviewing files that changed from the base of the PR and between 15312c3 and 975635e.

📒 Files selected for processing (20)
  • deployments/charts/service/templates/api-service.yaml
  • deployments/charts/service/templates/dynamic-config-secrets.yaml
  • deployments/charts/service/templates/dynamic-config.yaml
  • deployments/charts/service/values.yaml
  • projects/configmap-dynamic-config.md
  • run/e2e/test_dynamic_config.py
  • run/minimal/osmo_dynamic_config_values.yaml
  • run/minimal/osmo_values.yaml
  • src/service/core/config/BUILD
  • src/service/core/config/config_service.py
  • src/service/core/config/configmap_guard.py
  • src/service/core/config/configmap_loader.py
  • src/service/core/config/helpers.py
  • src/service/core/config/tests/BUILD
  • src/service/core/config/tests/test_configmap_loader_integration.py
  • src/service/core/config/tests/test_configmap_loader_unit.py
  • src/service/core/service.py
  • src/service/core/workflow/objects.py
  • src/tests/common/database/testdata/schema.sql
  • src/utils/connectors/postgres.py

Comment on lines +28 to +29
"configmap_guard.py",
"configmap_loader.py",

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

BASE_REF="$(git merge-base HEAD main 2>/dev/null || true)"
if [[ -z "${BASE_REF}" ]]; then
  BASE_REF="$(git rev-parse HEAD~1)"
fi

echo "Changed files touching AGENTS.md:"
git diff --name-only "${BASE_REF}"...HEAD | rg '^AGENTS\.md$' || true

echo
echo "AGENTS.md diff (if any):"
git diff "${BASE_REF}"...HEAD -- AGENTS.md || true

Repository: NVIDIA/OSMO

Length of output: 243


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Current git status ==="
git status || true

echo
echo "=== Available branches ==="
git branch -a || true

echo
echo "=== Current HEAD ==="
git rev-parse HEAD || true

echo
echo "=== Find AGENTS.md in repository ==="
find . -name "AGENTS.md" -type f 2>/dev/null || echo "AGENTS.md not found"

echo
echo "=== Recent commits ==="
git log --oneline -10 || true

echo
echo "=== Changed files (if in merge/rebase) ==="
git diff --name-only || true

echo
echo "=== Staged files ==="
git diff --cached --name-only || true

Repository: NVIDIA/OSMO

Length of output: 718


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Git log for src/service/core/config/BUILD ==="
git log --oneline -20 -- src/service/core/config/BUILD || echo "No commits found for this file"

echo
echo "=== Git log for AGENTS.md ==="
git log --oneline -20 -- AGENTS.md || echo "No commits found for AGENTS.md"

echo
echo "=== Show current commit details ==="
git show --stat HEAD || true

echo
echo "=== Check if BUILD file is in HEAD commit ==="
git show --name-only HEAD | grep -i build || echo "No BUILD file changes in HEAD"

echo
echo "=== Read AGENTS.md ==="
cat AGENTS.md

Repository: NVIDIA/OSMO

Length of output: 50368


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Checking AGENTS.md for ConfigMap module references ==="
grep -i "configmap" AGENTS.md || echo "No 'configmap' found in AGENTS.md"

echo
echo "=== Checking for configmap_guard and configmap_loader ==="
grep -E "configmap_(guard|loader)" AGENTS.md || echo "No configmap_guard or configmap_loader found"

echo
echo "=== Search for references to src/service/core/config ==="
grep -i "src/service/core/config" AGENTS.md || echo "No direct path reference found"

echo
echo "=== Show relevant section of AGENTS.md (first 100 lines) ==="
head -100 AGENTS.md

Repository: NVIDIA/OSMO

Length of output: 6638


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Find and display config/ section in AGENTS.md ==="
grep -A 50 "^| \`config/\`" AGENTS.md || echo "Pattern not found, searching differently..."

echo
echo "=== Search for config module documentation (case-insensitive) ==="
grep -A 30 -i "configuration" AGENTS.md | head -50

echo
echo "=== Show lines 200+ to see if ConfigMap is documented later ==="
tail -50 AGENTS.md

Repository: NVIDIA/OSMO

Length of output: 12704


Expand AGENTS.md config/ module description to document the new ConfigMap components.

AGENTS.md was updated in this commit, but the config/ module description doesn't mention the new ConfigMap loader and guard components. Update the entry to explicitly reference the dynamic configuration sourced from ConfigMaps (configmap_loader.py, configmap_guard.py).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/service/core/config/BUILD` around lines 28 - 29, Update the AGENTS.md
entry for the config/ module to explicitly mention the new ConfigMap components
by name: reference configmap_loader.py as the dynamic ConfigMap-based
configuration loader and configmap_guard.py as the runtime guard/validator for
those configs, and state that the module now supports dynamic configuration
sourced from Kubernetes ConfigMaps; ensure the description also briefly notes
how these interact with existing config modules (e.g., loading, validation, and
reloading behavior).

Wrap ConfigMapWatcher.start() in try/except so a malformed ConfigMap
file or DB error during initial config load doesn't prevent the service
from starting. The service logs the error and continues without
ConfigMap config management.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
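The guard described in the commit message above can be sketched as follows; `ConfigMapWatcher` and `start_service` are hypothetical stand-ins for the actual service startup path, not the repository's real classes:

```python
import logging


class ConfigMapWatcher:
    """Hypothetical watcher; start() parses the mounted file and applies configs."""

    def __init__(self, path: str):
        self.path = path

    def start(self) -> None:
        # May raise on a malformed ConfigMap file or a DB error during initial load.
        raise ValueError(f'malformed ConfigMap file: {self.path}')


def start_service(watcher: ConfigMapWatcher) -> bool:
    """Start the service; ConfigMap config management is best-effort, not fatal."""
    try:
        watcher.start()
        return True
    except Exception:  # pylint: disable=broad-exception-caught
        logging.exception('ConfigMap config load failed; continuing without it')
        return False
```

The key point is that the broad except sits only around the watcher startup, so the service process itself still comes up when the mounted config is unusable.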
@vvnpn-nv vvnpn-nv force-pushed the vpan/configs-to-k8s-configmap branch from e71d644 to b2f0cee Compare April 7, 2026 18:35
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (5)
projects/configmap-dynamic-config.md (5)

323-328: Consider adding minimum version requirements.

The Dependencies section lists the required components but doesn't specify minimum versions. Consider adding:

  • Minimum Kubernetes version (for ConfigMap propagation behavior, Secret mounting)
  • Minimum PostgreSQL version (though pg_try_advisory_lock is available since 8.2, so this is likely fine)
  • Minimum Helm version (if any chart features require specific versions)

This helps operators ensure compatibility before deploying the feature.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/configmap-dynamic-config.md` around lines 323 - 328, Update the
"Dependencies" section to include explicit minimum version requirements: state a
minimum Kubernetes version (to guarantee ConfigMap propagation and Secret mount
behavior), a minimum PostgreSQL version (note pg_try_advisory_lock availability
if you want to reference it), and a minimum Helm version (if any helm chart
features in dynamic-config.yaml or dynamic-config-secrets.yaml need newer Helm
capabilities); optionally call out compatible versions for K8s Secret
controllers (Vault/ExternalSecrets) and GitOps tools (ArgoCD/Flux) so operators
know what to target before using the config_history table/helm upgrade checksum
restart workflow.

220-228: Document the worst-case config propagation latency.

The documentation mentions K8s propagates changes with ~60s kubelet sync (line 222) and the watcher polls every 30s (line 223), but the combined worst-case latency of ~90s isn't explicitly called out. This could be important for operators planning change windows or troubleshooting "why hasn't my config applied yet?" scenarios.

📝 Suggested addition

After line 227, consider adding:

- **Worst-case propagation latency**: ~90s (60s kubelet sync + 30s poll interval)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/configmap-dynamic-config.md` around lines 220 - 228, Add an explicit
worst-case propagation latency note to the "ConfigMap file changed" bullet list:
after the bullets that state "K8s propagates the change to the mounted file
(~60s kubelet sync)" and "Watcher detects the file hash change on its next 30s
poll", insert a new bullet stating the combined worst-case of ~90s (60s kubelet
sync + 30s poll interval) so operators can see the aggregated latency at a
glance.

256-262: Consider the scalability implications of the N+1 pattern.

Line 262 acknowledges the N+1 pattern for named config filtering in seed mode and notes it's "acceptable for typical config counts of 5-20". However, this could become a bottleneck in larger deployments:

  • Environments with 50+ pod templates or backends would trigger 50+ DB queries per apply cycle
  • Each poll cycle with file changes re-applies all configs, triggering these queries
  • Multi-tenant deployments might have 100+ named configs across all types

Consider documenting:

  1. A recommended maximum number of managed configs (e.g., 50 per type)
  2. Performance degradation symptoms operators should watch for
  3. Possible future optimization (single query with WHERE name IN (...) instead of N queries)

Alternatively, implement the batch query optimization now since the data is already available.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/configmap-dynamic-config.md` around lines 256 - 262, The N+1 DB
query pattern during "named config filtering in seed mode" will not scale;
update the seed-mode filtering logic to perform a single batch query that
selects all required names (e.g., use a single DB call with WHERE name IN (...))
instead of issuing one query per config, and ensure this change happens in the
function that runs per apply cycle/poll; additionally, update the documentation
to recommend a soft maximum (e.g., ~50 managed configs per type), list
observable performance symptoms (increased apply latency, elevated DB query
count, CPU spikes), and note the batch-query optimization as the preferred
approach for future scaling.

235-239: Consider the implications of drift reconciliation only covering global configs.

The design correctly notes that drift reconciliation only corrects global configs (service, workflow, dataset) and not named configs. While the 409 guard prevents normal API writes to named configs, direct DB manipulation (e.g., via psql) or bugs that bypass the guard could cause undetected drift for named configs like pod templates, backends, and pools.

This is acknowledged in the documentation, so it's a deliberate trade-off rather than an oversight. However, you may want to:

  1. Add a note in the Operations section about monitoring for drift in named configs
  2. Consider adding drift reconciliation for critical named config types in a future iteration
  3. Document the recommended recovery procedure if named config drift is detected (Helm upgrade with checksum change to force pod restart and re-apply)

Do you want me to suggest specific documentation additions or open an issue to track future drift reconciliation for named configs?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/configmap-dynamic-config.md` around lines 235 - 239, Add an
Operations note to the ConfigMap mode docs warning that the watcher only
reconciles global configs (service, workflow, dataset) and that named configs
(pod templates, backends, pools) can drift if the 409 guard is bypassed or DB is
edited directly; recommend monitoring for named-config drift (e.g., periodic DB
vs ConfigMap checks or alerts), propose tracking an enhancement to add drift
reconciliation for critical named config types in the future, and document a
recovery procedure (example: perform a Helm upgrade with checksum change to
force pod restart and re-apply desired configs) so operators know how to
remediate detected named-config drift.

263-269: Add metrics/observability beyond logs and audit trail.

The Operations section covers logs and audit trail comprehensively, but doesn't mention metrics for monitoring in production. Consider adding metrics for:

  • Config apply success/failure count (by type)
  • Config apply duration
  • Drift reconciliation corrections count
  • Advisory lock contention/failures
  • ConfigMap file hash check duration

These metrics would help operators:

  1. Detect when config application is failing or slow
  2. Monitor drift frequency to detect unwanted DB manipulation
  3. Detect lock contention in multi-replica deployments
  4. Set up alerts for config management health

If metrics are already implemented but not documented, consider adding a subsection under Operations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/configmap-dynamic-config.md` around lines 263 - 269, Update the
Operations section to document production metrics beyond logs/audit: add a
subsection describing metrics to expose (config_apply_success_total and
config_apply_failure_total labeled by config type, config_apply_duration_seconds
histogram, drift_reconciliation_corrections_total,
advisory_lock_contention_total and advisory_lock_failure_total, and
configmap_hash_check_duration_seconds), note their labels (e.g., config_type,
source="configmap_loader"), recommended alerting thresholds and scrape/export
guidance, and mention that the Helm checksum annotation still triggers pod
restarts; if metrics are already implemented, reference the exact metric names
and where they’re emitted.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@projects/configmap-dynamic-config.md`:
- Around line 62-102: Update the fenced ASCII diagram in
projects/configmap-dynamic-config.md by adding a language tag to the opening
triple backticks (use "text") so the block starts with ```text; this applies to
the ConfigMap Config Flow code block (the ASCII diagram) to satisfy markdownlint
MD040 and improve rendering/tooling support.

---

Nitpick comments:
In `@projects/configmap-dynamic-config.md`:
- Around line 323-328: Update the "Dependencies" section to include explicit
minimum version requirements: state a minimum Kubernetes version (to guarantee
ConfigMap propagation and Secret mount behavior), a minimum PostgreSQL version
(note pg_try_advisory_lock availability if you want to reference it), and a
minimum Helm version (if any helm chart features in dynamic-config.yaml or
dynamic-config-secrets.yaml need newer Helm capabilities); optionally call out
compatible versions for K8s Secret controllers (Vault/ExternalSecrets) and
GitOps tools (ArgoCD/Flux) so operators know what to target before using the
config_history table/helm upgrade checksum restart workflow.
- Around line 220-228: Add an explicit worst-case propagation latency note to
the "ConfigMap file changed" bullet list: after the bullets that state "K8s
propagates the change to the mounted file (~60s kubelet sync)" and "Watcher
detects the file hash change on its next 30s poll", insert a new bullet stating
the combined worst-case of ~90s (60s kubelet sync + 30s poll interval) so
operators can see the aggregated latency at a glance.
- Around line 256-262: The N+1 DB query pattern during "named config filtering
in seed mode" will not scale; update the seed-mode filtering logic to perform a
single batch query that selects all required names (e.g., use a single DB call
with WHERE name IN (...)) instead of issuing one query per config, and ensure
this change happens in the function that runs per apply cycle/poll;
additionally, update the documentation to recommend a soft maximum (e.g., ~50
managed configs per type), list observable performance symptoms (increased apply
latency, elevated DB query count, CPU spikes), and note the batch-query
optimization as the preferred approach for future scaling.
- Around line 235-239: Add an Operations note to the ConfigMap mode docs warning
that the watcher only reconciles global configs (service, workflow, dataset) and
that named configs (pod templates, backends, pools) can drift if the 409 guard
is bypassed or DB is edited directly; recommend monitoring for named-config
drift (e.g., periodic DB vs ConfigMap checks or alerts), propose tracking an
enhancement to add drift reconciliation for critical named config types in the
future, and document a recovery procedure (example: perform a Helm upgrade with
checksum change to force pod restart and re-apply desired configs) so operators
know how to remediate detected named-config drift.
- Around line 263-269: Update the Operations section to document production
metrics beyond logs/audit: add a subsection describing metrics to expose
(config_apply_success_total and config_apply_failure_total labeled by config
type, config_apply_duration_seconds histogram,
drift_reconciliation_corrections_total, advisory_lock_contention_total and
advisory_lock_failure_total, and configmap_hash_check_duration_seconds), note
their labels (e.g., config_type, source="configmap_loader"), recommended
alerting thresholds and scrape/export guidance, and mention that the Helm
checksum annotation still triggers pod restarts; if metrics are already
implemented, reference the exact metric names and where they’re emitted.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a625a17a-4da0-46a2-a380-8ed79792e8d0

📥 Commits

Reviewing files that changed from the base of the PR and between e71d644 and b2f0cee.

📒 Files selected for processing (2)
  • projects/configmap-dynamic-config.md
  • src/service/core/service.py

Generalize the dataset-only _resolve_dataset_secret_files() into a
recursive _resolve_secret_file_references() that works for all config
types. Any dict with a 'secret_file' or 'secretName' key at any
nesting level gets resolved by reading the mounted K8s Secret file.

Supports two patterns:
- Dict secrets (credentials): file contents merged into the dict
- Simple string secrets (tokens, passwords): file with {value: "..."}
  replaces the dict entirely with the string value

Helm templates updated:
- _helpers.tpl: recursive secretName→secret_file transformer +
  secret name collector (4 levels deep)
- dynamic-config.yaml: resolves secretName in workflow/service config
- api-service.yaml: auto-generates volume + volumeMount for all
  secretName references across all config types
- values.yaml: documents secretName pattern for workflow credentials

Example Helm values:
  workflow:
    config:
      workflow_data:
        credential:
          secretName: osmo-workflow-data-cred
      workflow_alerts:
        slack_token:
          secretName: osmo-slack-token

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
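The two resolution patterns described in this commit message can be sketched as follows, assuming the mounted secret file has already been read and parsed; `resolve_secret` is an illustrative name, not the loader's actual function:

```python
from typing import Any, Dict


def resolve_secret(parent: Dict[str, Any], key: str,
                   file_contents: Dict[str, Any]) -> None:
    """Replace parent[key] (a secret_file/secretName reference) with resolved contents.

    - A file of exactly {'value': '...'} -> parent[key] becomes that scalar
      (simple string secrets such as slack_token or an SMTP password).
    - Any other dict -> merged over the reference dict, with the
      secret_file/secretName keys dropped (credential-style secrets).
    """
    if set(file_contents) == {'value'}:
        parent[key] = file_contents['value']
        return
    merged = dict(parent[key])
    merged.pop('secret_file', None)
    merged.pop('secretName', None)
    merged.update(file_contents)
    parent[key] = merged
```

This keeps non-secret fields that sat alongside the reference (for example an endpoint URL next to a credential) while ensuring the reference keys themselves never reach the applied config.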
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@deployments/charts/service/templates/_helpers.tpl`:
- Around line 160-249: The helpers osmo.resolve-secret-names-in-config and
osmo.collect-secret-names only traverse maps up to four levels and ignore
slices, so secretName entries in lists or deeper levels are missed; replace the
fixed-depth loops with a genuinely recursive helper (e.g., define
"osmo.walk-config") that accepts a node and a secrets accumulator/map and: if
kindIs "map" check hasKey "secretName" then set/unset as before and recurse over
map values; if kindIs "slice"/"list" range over items and recursively call the
walker for each element; use this same walker from
osmo.resolve-secret-names-in-config to set secret_file and unset secretName and
from osmo.collect-secret-names to populate the $secrets dict, then emit keys
$secrets | sortAlpha | join "," as before (retain use of
set/unset/keys/sortAlpha/join).

In `@src/service/core/config/configmap_loader.py`:
- Around line 226-274: Drift is currently computed into drifted_configs before
acquiring the advisory lock, so multiple replicas can race and cause redundant
re-applies; after acquiring the lock via the pg_try_advisory_lock call, re-check
each candidate in drifted_configs against the live DB (use
self._postgres.get_configs(...).plaintext_dict(...) as done earlier) and skip
any config whose current_db values now match the desired section config before
calling _apply_singleton_config; ensure you still release the advisory lock in
the finally block.
- Around line 146-168: The method load_and_apply currently sets
self._last_file_hash before validating/parsing the YAML and checking for
'managed_configs', which prevents retry of a broken revision; move the
assignment of self._last_file_hash (the hashlib.sha256(...) result) to occur
only after successful yaml.safe_load and after verifying raw_config contains
'managed_configs' and after applying configs (i.e., only once the new revision
is known usable). Update load_and_apply to compute the hash earlier if needed
but only assign to self._last_file_hash after successful
parsing/validation/apply so a bad file does not become marked as loaded.
- Around line 1-17: The pylint disable directive is currently embedded inside
the module docstring; move the comment "# pylint: disable=line-too-long" out of
the triple-quoted string and place it as a standalone comment immediately above
the module docstring so pylint recognizes it; update the top of the module in
configmap_loader.py by adding the standalone comment before the opening """ that
starts the SPDX header.
- Around line 411-447: The function _resolve_secret_file_references currently
only recurses into dict values and therefore ignores secret_file/secretName
entries inside lists; update it to also iterate over list values: for any list
found as config_data[key], iterate by index and for each element that is a dict
check for secret_file/secretName and call
_resolve_single_secret(config_data[key], index, element, path) or otherwise
recurse into the element via _resolve_secret_file_references(element, path) so
nested objects in lists are resolved at any depth; ensure you preserve
parent_key/path construction consistent with existing f'{parent_key}.{key}'
usage and handle replacing list items when a file contains a scalar 'value'.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cd1eab3f-187e-4321-86aa-692f443ae5a9

📥 Commits

Reviewing files that changed from the base of the PR and between b2f0cee and 452c4bb.

📒 Files selected for processing (6)
  • deployments/charts/service/templates/_helpers.tpl
  • deployments/charts/service/templates/api-service.yaml
  • deployments/charts/service/templates/dynamic-config.yaml
  • deployments/charts/service/values.yaml
  • src/service/core/config/configmap_loader.py
  • src/service/core/config/tests/test_configmap_loader_unit.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • deployments/charts/service/templates/api-service.yaml
  • deployments/charts/service/values.yaml

Comment on lines +160 to +249
{{/*
Recursively walk a config dict and resolve secretName references at any level.
Since Go templates lack true recursion via 'template' with dynamic context mutation,
we handle this with explicit depth traversal up to 4 levels deep.
Expects a dict as context.
*/}}
{{- define "osmo.resolve-secret-names-in-config" -}}
{{- range $key, $value := . }}
{{- if kindIs "map" $value }}
{{- if hasKey $value "secretName" }}
{{- $secretName := $value.secretName }}
{{- $_ := set $value "secret_file" (printf "/etc/osmo/secrets/%s/cred.yaml" $secretName) }}
{{- $_ := unset $value "secretName" }}
{{- else }}
{{/* Recurse one level deeper */}}
{{- range $k2, $v2 := $value }}
{{- if kindIs "map" $v2 }}
{{- if hasKey $v2 "secretName" }}
{{- $sn := $v2.secretName }}
{{- $_ := set $v2 "secret_file" (printf "/etc/osmo/secrets/%s/cred.yaml" $sn) }}
{{- $_ := unset $v2 "secretName" }}
{{- else }}
{{/* Recurse another level deeper */}}
{{- range $k3, $v3 := $v2 }}
{{- if kindIs "map" $v3 }}
{{- if hasKey $v3 "secretName" }}
{{- $sn := $v3.secretName }}
{{- $_ := set $v3 "secret_file" (printf "/etc/osmo/secrets/%s/cred.yaml" $sn) }}
{{- $_ := unset $v3 "secretName" }}
{{- else }}
{{/* One more level */}}
{{- range $k4, $v4 := $v3 }}
{{- if kindIs "map" $v4 }}
{{- if hasKey $v4 "secretName" }}
{{- $sn := $v4.secretName }}
{{- $_ := set $v4 "secret_file" (printf "/etc/osmo/secrets/%s/cred.yaml" $sn) }}
{{- $_ := unset $v4 "secretName" }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
{{- end -}}

{{/*
Collect all secretName values from a config dict (up to 4 levels deep).
Returns a comma-separated string of unique secret names.
Expects a dict as context.
*/}}
{{- define "osmo.collect-secret-names" -}}
{{- $secrets := dict -}}
{{- range $key, $value := . }}
{{- if kindIs "map" $value }}
{{- if hasKey $value "secretName" }}
{{- $_ := set $secrets $value.secretName "1" }}
{{- else }}
{{- range $k2, $v2 := $value }}
{{- if kindIs "map" $v2 }}
{{- if hasKey $v2 "secretName" }}
{{- $_ := set $secrets $v2.secretName "1" }}
{{- else }}
{{- range $k3, $v3 := $v2 }}
{{- if kindIs "map" $v3 }}
{{- if hasKey $v3 "secretName" }}
{{- $_ := set $secrets $v3.secretName "1" }}
{{- else }}
{{- range $k4, $v4 := $v3 }}
{{- if kindIs "map" $v4 }}
{{- if hasKey $v4 "secretName" }}
{{- $_ := set $secrets $v4.secretName "1" }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
{{- keys $secrets | sortAlpha | join "," -}}

⚠️ Potential issue | 🟠 Major

Map-only traversal misses valid secretName locations.

These helpers stop after four map levels and never descend into lists. A secretName under something like foo[0].credentials or any fifth-level object will be skipped, so the rendered config keeps the unresolved reference and the later secret-mount generation never sees that Secret name. That breaks the PR’s “any nesting level” contract.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deployments/charts/service/templates/_helpers.tpl` around lines 160 - 249,
The helpers osmo.resolve-secret-names-in-config and osmo.collect-secret-names
only traverse maps up to four levels and ignore slices, so secretName entries in
lists or deeper levels are missed; replace the fixed-depth loops with a
genuinely recursive helper (e.g., define "osmo.walk-config") that accepts a node
and a secrets accumulator/map and: if kindIs "map" check hasKey "secretName"
then set/unset as before and recurse over map values; if kindIs "slice"/"list"
range over items and recursively call the walker for each element; use this same
walker from osmo.resolve-secret-names-in-config to set secret_file and unset
secretName and from osmo.collect-secret-names to populate the $secrets dict,
then emit keys $secrets | sortAlpha | join "," as before (retain use of
set/unset/keys/sortAlpha/join).
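For reference, the genuinely recursive walk this comment asks for — maps and lists, any depth — looks like this in Python; it is an illustrative analogue of the proposed `osmo.walk-config` Helm helper, not chart code:

```python
from typing import Any, Set


def collect_secret_names(node: Any, found: Set[str]) -> None:
    """Walk maps and lists at any depth, collecting every secretName value."""
    if isinstance(node, dict):
        if 'secretName' in node:
            found.add(node['secretName'])
        else:
            for value in node.values():
                collect_secret_names(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_secret_names(item, found)
```

The fixed-depth Helm helper above misses both the list branch and anything past four map levels; expressing the traversal as a self-recursive `include` with a dict accumulator (the `osmo.walk-config` suggestion) removes both limits at once.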

Comment on lines +1 to +17
"""
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # pylint: disable=line-too-long

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

SPDX-License-Identifier: Apache-2.0
"""

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify that the pylint directive is embedded in the module string, not emitted
# as a standalone comment before the docstring.
python - <<'PY'
from pathlib import Path

path = Path("src/service/core/config/configmap_loader.py")
lines = path.read_text().splitlines()

print("line 1:", lines[0])
print("line 2:", lines[1])
print("standalone pylint comment before docstring:",
      any(line.lstrip().startswith("# pylint: disable=line-too-long") for line in lines[:2]))
PY

Repository: NVIDIA/OSMO

Length of output: 259


Move the # pylint: disable=line-too-long directive outside the docstring.

The directive is currently embedded in the SPDX header string literal (line 2), so pylint will not recognize or honor it. Per the coding guidelines for **/*.py, when copyright lines exceed 100 characters, add # pylint: disable=line-too-long as a real comment before the docstring begins, not within it.

Example:

# pylint: disable=line-too-long
"""
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/service/core/config/configmap_loader.py` around lines 1 - 17, The pylint
disable directive is currently embedded inside the module docstring; move the
comment "# pylint: disable=line-too-long" out of the triple-quoted string and
place it as a standalone comment immediately above the module docstring so
pylint recognizes it; update the top of the module in configmap_loader.py by
adding the standalone comment before the opening """ that starts the SPDX
header.

Comment on lines +226 to +274
# Collect drifted configs before acquiring lock (read-only check)
drifted_configs = []
for config_key, config_type in _SINGLETON_CONFIG_TYPES.items():
section = self._cached_managed_configs.get(config_key)
if not section:
continue
managed_by = _parse_managed_by(section)
if managed_by != ManagedByMode.CONFIGMAP:
continue
desired_config = section.get('config', {})
if not desired_config:
continue

try:
current_db = self._postgres.get_configs(config_type).plaintext_dict(
by_alias=True, exclude_unset=True)
for key, desired_value in desired_config.items():
if current_db.get(key) != desired_value:
drifted_configs.append((config_key, config_type, section))
break
except Exception: # pylint: disable=broad-exception-caught
logging.exception('Error checking drift for %s', config_key)

if not drifted_configs:
return

# Acquire lock only when we have drift to correct
lock_result = self._postgres.execute_fetch_command(
"SELECT pg_try_advisory_lock(hashtext('configmap-reconcile'))",
(), return_raw=True)
if not lock_result or not lock_result[0]['pg_try_advisory_lock']:
logging.info('Another replica is reconciling drift, skipping')
return

try:
for config_key, config_type, section in drifted_configs:
try:
logging.info('Drift detected for %s config, re-applying from ConfigMap',
config_key)
_apply_singleton_config(
section, self._postgres, config_type,
pre_apply=(_default_dataset_credential_endpoints
if config_key == 'dataset' else None))
except Exception: # pylint: disable=broad-exception-caught
logging.exception('Error reconciling drift for %s', config_key)
finally:
self._postgres.execute_fetch_command(
"SELECT pg_advisory_unlock(hashtext('configmap-reconcile'))",
(), return_raw=True)

⚠️ Potential issue | 🟠 Major

Re-check drift after taking the advisory lock.

Drift is detected before pg_try_advisory_lock. Two replicas can both queue the same singleton as drifted; one fixes it, then the second acquires the lock later and re-applies an already-correct config anyway, which still creates an extra write/history entry.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/service/core/config/configmap_loader.py` around lines 226 - 274, Drift is
currently computed into drifted_configs before acquiring the advisory lock, so
multiple replicas can race and cause redundant re-applies; after acquiring the
lock via the pg_try_advisory_lock call, re-check each candidate in
drifted_configs against the live DB (use
self._postgres.get_configs(...).plaintext_dict(...) as done earlier) and skip
any config whose current_db values now match the desired section config before
calling _apply_singleton_config; ensure you still release the advisory lock in
the finally block.
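The double-check this comment asks for can be sketched like this; `reconcile_after_lock` and its callables are illustrative stand-ins, intended to run only after `pg_try_advisory_lock` has succeeded:

```python
from typing import Any, Callable, Dict, List, Tuple


def reconcile_after_lock(
        drifted: List[Tuple[str, Dict[str, Any]]],
        read_db: Callable[[str], Dict[str, Any]],
        apply_fn: Callable[[str, Dict[str, Any]], None]) -> int:
    """Re-check each drift candidate under the lock; apply only real drift.

    Another replica may have already corrected the config between the
    lock-free detection pass and this replica acquiring the advisory lock.
    """
    applied = 0
    for config_key, desired in drifted:
        current = read_db(config_key)
        if all(current.get(k) == v for k, v in desired.items()):
            continue  # fixed by another replica; skip the redundant write
        apply_fn(config_key, desired)
        applied += 1
    return applied
```

The caller still releases the advisory lock in its `finally` block; the re-read simply ensures a losing replica does not generate duplicate writes and config_history entries.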

Comment on lines +411 to +447
def _resolve_secret_file_references(config_data: Dict[str, Any],
                                    parent_key: str = '') -> None:
    """Recursively resolve secret_file / secretName references in a config dict.

    Walks the dict tree. When it finds a dict with 'secret_file' or 'secretName':
    - Reads the YAML file from the mounted K8s Secret path
    - If the file contains a dict: merges the file contents into the parent dict
      (replacing the secret_file/secretName reference)
    - If the file contains a 'value' key: replaces the entire dict with that value
      (for simple string secrets like slack_token, smtp password)

    This is the generic version of the old _default_dataset_credential_endpoints.
    Works for all config types: dataset credentials, workflow credentials,
    backend_images credentials, notification tokens, etc.
    """
    if not isinstance(config_data, dict):
        return

    keys_to_process = list(config_data.keys())
    for key in keys_to_process:
        value = config_data[key]
        if not isinstance(value, dict):
            continue

        # Check for secret_file or secretName at this level
        secret_file_path = value.get('secret_file')
        if not secret_file_path:
            secret_name = value.get('secretName')
            if secret_name:
                secret_file_path = f'/etc/osmo/secrets/{secret_name}/cred.yaml'

        if secret_file_path:
            _resolve_single_secret(config_data, key, value, secret_file_path,
                                   f'{parent_key}.{key}' if parent_key else key)
        else:
            # Recurse into nested dicts
            _resolve_secret_file_references(value, f'{parent_key}.{key}' if parent_key else key)
⚠️ Potential issue | 🟠 Major

_resolve_secret_file_references() skips list items.

The resolver only descends into dict values. Any secret_file or secretName nested inside a list of objects is ignored, so runtime secret expansion still does not support “any nesting level.”

Possible fix
-def _resolve_secret_file_references(config_data: Dict[str, Any],
-                                     parent_key: str = '') -> None:
+def _resolve_secret_file_references(config_data: Any,
+                                    parent_key: str = '') -> None:
     """Recursively resolve secret_file / secretName references in a config dict.
@@
-    if not isinstance(config_data, dict):
+    if isinstance(config_data, list):
+        for index, item in enumerate(config_data):
+            _resolve_secret_file_references(
+                item,
+                f'{parent_key}[{index}]' if parent_key else f'[{index}]',
+            )
+        return
+
+    if not isinstance(config_data, dict):
         return
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/service/core/config/configmap_loader.py` around lines 411 - 447, The
function _resolve_secret_file_references currently only recurses into dict
values and therefore ignores secret_file/secretName entries inside lists; update
it to also iterate over list values: for any list found as config_data[key],
iterate by index and for each element that is a dict check for
secret_file/secretName and call _resolve_single_secret(config_data[key], index,
element, path) or otherwise recurse into the element via
_resolve_secret_file_references(element, path) so nested objects in lists are
resolved at any depth; ensure you preserve parent_key/path construction
consistent with existing f'{parent_key}.{key}' usage and handle replacing list
items when a file contains a scalar 'value'.
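A self-contained sketch of the list-aware traversal the finding calls for. This is a hypothetical rework, not the repo's function: it takes a `load_secret` callable instead of reading mounted files, and returns a new structure rather than mutating in place.

```python
from typing import Any, Callable

def resolve_secret_refs(node: Any,
                        load_secret: Callable[[str], Any],
                        path: str = '') -> Any:
    """Resolve secret_file/secretName markers at any depth, including lists."""
    if isinstance(node, list):
        # Descend into list elements, tracking an index-based path for logs.
        return [resolve_secret_refs(item, load_secret,
                                    f'{path}[{i}]' if path else f'[{i}]')
                for i, item in enumerate(node)]
    if not isinstance(node, dict):
        return node
    secret_file = node.get('secret_file')
    if not secret_file and node.get('secretName'):
        secret_file = f"/etc/osmo/secrets/{node['secretName']}/cred.yaml"
    if secret_file:
        loaded = load_secret(secret_file)
        # Simple string secrets come back as {'value': ...}
        if isinstance(loaded, dict) and 'value' in loaded:
            return loaded['value']
        return loaded
    # Recurse into nested dict values.
    return {key: resolve_secret_refs(value, load_secret,
                                     f'{path}.{key}' if path else key)
            for key, value in node.items()}
```

Returning a new tree sidesteps the in-place replacement problem for list elements (you cannot assign back through a dict key when the parent is a list index), at the cost of one extra copy per reload.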

The secret file resolver now auto-detects three formats:
1. Simple string: {value: "..."} → replaces dict with string
2. Docker registry: {auths: {registry: {username, auth}}} → extracts
   registry, username, auth fields automatically
3. YAML dict: merges all keys into current dict (existing behavior)

Added secretKey field alongside secretName to specify which key in
the K8s Secret to mount (defaults to "cred.yaml"). For Docker
registry secrets, use secretKey: .dockerconfigjson.

Example:
  backend_images:
    credential:
      secretName: imagepullsecret
      secretKey: .dockerconfigjson

This lets users reuse existing imagepull secrets without creating
a duplicate secret in a different format.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
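The three-way detection this commit describes can be sketched as follows; the exact field names extracted in the docker-registry case (`registry`, `username`, `auth`) are assumptions taken from the commit message, not verified against the code.

```python
from typing import Any, Dict

def interpret_secret_payload(payload: Dict[str, Any]) -> Any:
    """Classify a loaded secret file into the three supported shapes."""
    # 1. Simple string secret: {value: "..."} -> bare string
    if set(payload) == {'value'}:
        return payload['value']
    # 2. Docker registry secret (.dockerconfigjson): take the first
    #    registry entry and pull out its credential fields
    if 'auths' in payload:
        registry, entry = next(iter(payload['auths'].items()))
        return {'registry': registry,
                'username': entry.get('username'),
                'auth': entry.get('auth')}
    # 3. Plain YAML dict: caller merges all keys into the referencing dict
    return payload
```

The docker branch is what lets an existing `kubernetes.io/dockerconfigjson` imagepull secret be reused directly via `secretKey: .dockerconfigjson`.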
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (4)
src/service/core/config/configmap_loader.py (4)

1-17: ⚠️ Potential issue | 🟡 Minor

Move the pylint directive above the docstring.

Pylint will ignore # pylint: disable=line-too-long while it lives inside the module string, so this still violates the repo rule for long copyright lines.

Suggested fix
-"""
-SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.  # pylint: disable=line-too-long
+# pylint: disable=line-too-long
+"""
+SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

As per coding guidelines, If copyright lines exceed 100 characters, add # pylint: disable=line-too-long comment instead of breaking into multiple lines.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/service/core/config/configmap_loader.py` around lines 1 - 17, The pylint
disable comment is inside the module docstring so it’s ignored; move the "#
pylint: disable=line-too-long" to a module-level comment before the opening
triple-quote so pylint sees it. Edit the top of the module (module docstring in
configmap_loader.py) to place the pylint directive on its own line above the
docstring, preserving the existing copyright text and license block.

227-271: ⚠️ Potential issue | 🟠 Major

Re-check drift after acquiring the reconcile lock.

drifted_configs is built before locking. Two replicas can both decide the same singleton is stale; one fixes it, then the other later acquires the lock and re-applies an already-correct config because it never re-reads the DB inside the guarded section. Revalidate each candidate after locking before calling _apply_singleton_config().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/service/core/config/configmap_loader.py` around lines 227 - 271,
drifted_configs is populated before taking the reconcile advisory lock which can
cause a race where another replica fixes drift first; after acquiring the lock
(the result of _postgres.execute_fetch_command), re-check each candidate's
current DB state by calling
self._postgres.get_configs(config_type).plaintext_dict(...) and comparing keys
to section['config'] (the same logic used when building drifted_configs) and
skip calling _apply_singleton_config if the DB already matches; keep using the
same identifiers (_SINGLETON_CONFIG_TYPES, _parse_managed_by,
ManagedByMode.CONFIGMAP, _apply_singleton_config,
_default_dataset_credential_endpoints) and preserve existing exception handling
and logging around each reconciliation attempt.

412-449: ⚠️ Potential issue | 🟠 Major

Secret recursion still skips list elements.

The recursion returns unless the current node is a dict, so any secret_file / secretName nested inside lists is ignored. That leaves common list-of-object shapes unresolved even though this helper is meant to work recursively.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/service/core/config/configmap_loader.py` around lines 412 - 449, The
function _resolve_secret_file_references currently only handles dict nodes and
returns for lists, so secret_file/secretName entries inside list elements are
ignored; update _resolve_secret_file_references to handle both dict and list: if
config_data is a list, iterate its indices and for each element either recurse
(if element is dict or list) or, if element is a dict that contains
secret_file/secretName, call _resolve_single_secret to replace/merge that
element (use index-based parent_key like f'{parent_key}[{i}]' for context). Keep
the existing dict-handling branch (_resolve_single_secret, recursion into nested
dicts) but add the list-iteration branch before/alongside the dict check so
list-contained secrets get resolved.

147-192: ⚠️ Potential issue | 🟠 Major

Don’t mark or publish a revision before it has actually applied.

Line 157 advances _last_file_hash before parse/validation, and Lines 174-189 replace _cached_managed_configs plus configmap_guard before load_dynamic_configs() runs. If this revision is malformed or any section fails inside _safe_apply(), the watcher stops retrying the same file and can start returning 409s for configs that were never successfully reconciled. Make load_dynamic_configs() report success, then update the hash/cache/guard only after a usable revision finishes applying.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/service/core/config/configmap_loader.py` around lines 147 - 192, The code
updates _last_file_hash, _cached_managed_configs, and configmap_guard before
ensuring the new revision has actually applied; change load_and_apply so it
parses and prepares managed_configs and computes a candidate file hash but does
NOT assign to self._last_file_hash, self._cached_managed_configs, or call
configmap_guard.set_managed_configs yet; call
load_dynamic_configs(self._config_file_path, self._postgres,
managed_configs=managed_configs) and only if it indicates success (or
returns/does not raise) then assign self._last_file_hash, set
self._cached_managed_configs = managed_configs, and call
configmap_guard.set_managed_configs(managed_configs); keep secret-resolution and
dataset endpoint defaults as-is before calling load_dynamic_configs so the same
resolved configs are applied.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/service/core/config/configmap_loader.py`:
- Around line 129-145: The current implementation only calls
_persist_managed_modes() from start(), leaving configmap_state stale after
subsequent reloads; modify the code so that after every successful
load_and_apply() call (including any other reload paths where load_and_apply is
invoked) you call _persist_managed_modes() to write current
_cached_managed_configs into the DB via _postgres.set_configmap_state;
additionally, update _persist_managed_modes() to detect removed keys (compare
current _cached_managed_configs to prior persisted keys) and clear or remove
entries for deleted sections (e.g., set to null or delete the
managed_by:<config_key> state) so removed sections are cleaned up, referencing
functions start(), load_and_apply(), _persist_managed_modes(), the
_cached_managed_configs map, and ManagedByMode.SEED.
- Around line 173-181: The loop that resolves secret file references only
inspects section.get('config'), missing secret placeholders stored under named
sections' 'items'; update the logic in the block that sets
self._cached_managed_configs to also handle section.get('items') by iterating
each item (e.g., in sections like backends, pools, roles, pod_templates,
backend_tests) and calling _resolve_secret_file_references on the item's payload
(either the item itself or item.get('config') when present), using the same type
checks as the existing code so all secret_file/secretName placeholders are
expanded before model construction and DB writes.
- Around line 75-90: The advisory-lock sequence currently uses
execute_fetch_command (which checks out and returns a pooled connection per
call), causing pg_try_advisory_lock and pg_advisory_unlock to run on different
sessions and leak locks; update configmap_loader.py to acquire a single
dedicated Postgres session via the connector's _get_connection() context for the
entire lock/apply/unlock sequence (wrap pg_try_advisory_lock,
_apply_all_configs(managed_configs, postgres) and pg_advisory_unlock in that one
connection scope) so the lock is held and released on the same session; make the
same change for the advisory-lock usage in reconcile_drift() (the blocks around
lines referencing reconcile_drift) or alternatively replace the advisory-lock
approach with a row-level lock mechanism that survives pool round-trips.
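The one-session requirement above can be sketched with a context manager; this is a hypothetical shape, with `get_connection` standing in for the connector's pooled-connection context and the SQL mirroring the advisory-lock calls quoted in the review.

```python
from contextlib import contextmanager

@contextmanager
def advisory_lock(get_connection, key: str):
    """Hold pg_try_advisory_lock and pg_advisory_unlock on ONE session.

    get_connection must be a context manager yielding a single pooled
    connection. Running lock and unlock through separate pool checkouts
    (as per-call helpers like execute_fetch_command do) acquires and
    releases on different sessions, so the lock leaks.
    """
    with get_connection() as conn:
        cur = conn.cursor()
        cur.execute("SELECT pg_try_advisory_lock(hashtext(%s))", (key,))
        acquired = cur.fetchone()[0]
        try:
            # Caller does its apply work inside this scope, on this session.
            yield acquired
        finally:
            if acquired:
                cur.execute("SELECT pg_advisory_unlock(hashtext(%s))", (key,))
```

The whole lock/apply/unlock sequence then runs inside one `with advisory_lock(...)` block, which is the property the reviewer says the current per-call pattern lacks.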

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2edc34c6-0822-4d2e-9aab-abde70bde33a

📥 Commits

Reviewing files that changed from the base of the PR and between 452c4bb and 42a2253.

📒 Files selected for processing (3)
  • deployments/charts/service/templates/_helpers.tpl
  • src/service/core/config/configmap_loader.py
  • src/service/core/config/tests/test_configmap_loader_unit.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • deployments/charts/service/templates/_helpers.tpl
  • src/service/core/config/tests/test_configmap_loader_unit.py

Comment on lines +129 to +145
def start(self) -> None:
"""Load configs immediately, then start background polling thread."""
self.load_and_apply()
self._persist_managed_modes()
thread = threading.Thread(
target=self._poll_loop, name='config-watcher', daemon=True)
thread.start()

def _persist_managed_modes(self) -> None:
"""Write managed_by modes to configmap_state table for API visibility."""
if not self._cached_managed_configs:
return
for config_key, section in self._cached_managed_configs.items():
if not isinstance(section, dict):
continue
mode = section.get('managed_by', ManagedByMode.SEED.value)
self._postgres.set_configmap_state(f'managed_by:{config_key}', mode)
⚠️ Potential issue | 🟡 Minor

Persist _managed_by state after each reload.

_persist_managed_modes() only runs from start(). After a later file change, load_and_apply() updates the in-memory state but never rewrites configmap_state, so _managed_by metadata stays stale until restart. This should run after every successful reload, and removed sections need a cleanup path too.

Also applies to: 203-207

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/service/core/config/configmap_loader.py` around lines 129 - 145, The
current implementation only calls _persist_managed_modes() from start(), leaving
configmap_state stale after subsequent reloads; modify the code so that after
every successful load_and_apply() call (including any other reload paths where
load_and_apply is invoked) you call _persist_managed_modes() to write current
_cached_managed_configs into the DB via _postgres.set_configmap_state;
additionally, update _persist_managed_modes() to detect removed keys (compare
current _cached_managed_configs to prior persisted keys) and clear or remove
entries for deleted sections (e.g., set to null or delete the
managed_by:<config_key> state) so removed sections are cleaned up, referencing
functions start(), load_and_apply(), _persist_managed_modes(), the
_cached_managed_configs map, and ManagedByMode.SEED.
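A reload-safe version of the persistence step can be sketched with injected callables in place of `_postgres.set_configmap_state`; note the delete path is an assumption, since the connector may not yet expose a delete helper for `configmap_state` rows.

```python
from typing import Callable, Dict, Set

def sync_managed_modes(current: Dict[str, str],
                       previously_persisted: Set[str],
                       set_state: Callable[[str, str], None],
                       delete_state: Callable[[str], None]) -> Set[str]:
    """Write managed_by modes for current sections and remove stale ones.

    current maps config_key -> mode ('seed' or 'configmap').
    previously_persisted holds the config keys written on the last sync.
    Returns the new key set, to be passed in on the next call.
    """
    for config_key, mode in current.items():
        set_state(f'managed_by:{config_key}', mode)
    # Sections dropped from the ConfigMap need their state rows cleaned up.
    for removed in previously_persisted - set(current):
        delete_state(f'managed_by:{removed}')
    return set(current)
```

Calling this at the end of every successful `load_and_apply()` (not just `start()`) keeps `_managed_by` metadata in step with the file.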

Comment on lines +173 to +181
# Cache managed configs for drift comparison and 409 guard lookups
self._cached_managed_configs = managed_configs
# Resolve secret references in cached configs so drift comparison
# uses the same format as DB values (resolved credentials, not file paths)
for section in managed_configs.values():
if isinstance(section, dict):
config_data = section.get('config')
if isinstance(config_data, dict):
_resolve_secret_file_references(config_data)
⚠️ Potential issue | 🟠 Major

Resolve secrets inside items sections too.

This pass only visits section["config"]. Named sections like backends, pools, roles, pod_templates, and backend_tests store their payload under items, so secret_file / secretName placeholders there are never expanded before model construction and DB writes.

Suggested fix
-        for section in managed_configs.values():
+        for section_name, section in managed_configs.items():
             if isinstance(section, dict):
                 config_data = section.get('config')
                 if isinstance(config_data, dict):
-                    _resolve_secret_file_references(config_data)
+                    _resolve_secret_file_references(config_data, section_name)
+                items = section.get('items')
+                if isinstance(items, dict):
+                    for item_name, item_config in items.items():
+                        _resolve_secret_file_references(
+                            item_config,
+                            f'{section_name}.{item_name}',
+                        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/service/core/config/configmap_loader.py` around lines 173 - 181, The loop
that resolves secret file references only inspects section.get('config'),
missing secret placeholders stored under named sections' 'items'; update the
logic in the block that sets self._cached_managed_configs to also handle
section.get('items') by iterating each item (e.g., in sections like backends,
pools, roles, pod_templates, backend_tests) and calling
_resolve_secret_file_references on the item's payload (either the item itself or
item.get('config') when present), using the same type checks as the existing
code so all secret_file/secretName placeholders are expanded before model
construction and DB writes.

vvnpn-nv and others added 7 commits April 7, 2026 16:17
…mode

Major redesign based on team feedback:

1. Single global toggle (no per-config managed_by seed/configmap)
   - dynamicConfig.enabled: true = ConfigMap mode (all writes blocked)
   - dynamicConfig.enabled: false = DB mode (CLI/API works normally)

2. In-memory config serving (standard K8s pattern)
   - ConfigMap file parsed on startup, cached in module-level dict
   - All config reads served from memory, not DB
   - DB only for runtime state (agent heartbeats, k8s_uid) + roles
   - Runtime fields (service_auth) injected from DB on first load

3. Watchdog file events replace 30s polling
   - Uses watchdog library (already a dep) with inotify
   - Watches parent dir for K8s ConfigMap symlink swaps
   - 2s debounce for atomic swap events

4. All-or-nothing validation before applying
   - Pydantic model validation for singleton configs
   - Structure validation for named config sections
   - On failure: previous config preserved, error logged

5. Simplified 409 write protection
   - Single reject_if_configmap_mode(username) function
   - 29 guard calls across config_service.py + helpers.py
   - Bug fix: added missing delete_dataset guard
   - Bug fix: simplified rollback guard

New: src/utils/configmap_state.py (dependency-free state module)
Modified: 11 files, ~900 lines added, ~2000 lines removed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
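The 2s debounce mentioned in point 3 can be sketched independently of the watchdog library; this is an illustrative pattern, not the PR's implementation, showing why a burst of symlink-swap events collapses into one reload.

```python
import threading

class Debouncer:
    """Coalesce bursts of file events into a single reload call.

    K8s swaps the ConfigMap symlink atomically, which fires several
    filesystem events in quick succession; only the last one should
    trigger a reload.
    """
    def __init__(self, delay_seconds: float, callback):
        self._delay = delay_seconds
        self._callback = callback
        self._timer = None
        self._lock = threading.Lock()

    def trigger(self) -> None:
        with self._lock:
            # Each new event resets the clock; the callback fires only
            # once the events have been quiet for the full delay.
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self._delay, self._callback)
            self._timer.daemon = True
            self._timer.start()
```

A watchdog event handler would simply call `debouncer.trigger()` from its `on_any_event`, leaving the actual reload to run once, after the swap settles.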
Updated to reflect redesigned ConfigMap config system:
- Global mode (no per-config managed_by)
- In-memory serving (standard K8s pattern)
- Watchdog file events (no polling)
- Config vs runtime data separation
- All-or-nothing validation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add --roles-file flag to authz_sidecar. When set, the sidecar reads
roles, external role mappings, and pool names from the ConfigMap-mounted
YAML file instead of PostgreSQL. This eliminates the DB dependency
from the authz_sidecar in ConfigMap mode.

New: src/utils/roles/file_loader.go
- FileRoleStore loads roles + external_roles from YAML
- In-memory reverse map: externalRole -> []osmoRole
- ResolveExternalRoles replaces SyncUserRoles SQL query
- Poll-based file watching (os.Stat every 30s)

Modified: authz_server.go
- NewFileBackedAuthzServer constructor (no pgClient needed)
- Check(): file-backed ResolveExternalRoles vs DB SyncUserRoles
- resolveRoles(): file store vs DB for role policy fetches
- computeAllowedPools(): file store vs DB for pool names
- MigrateRoles(): skipped in file-backed mode

Modified: main.go
- --roles-file flag for ConfigMap mode
- Conditional init: file-backed vs DB-backed server
- DB connection only created when needed

Modified: configmap_loader.py
- Removed _write_roles_to_db (authz_sidecar reads file directly)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
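The in-memory reverse map this commit describes is straightforward to sketch in Python for illustration (the real implementation is the Go FileRoleStore; the YAML shape assumed here — a role-to-spec dict with an `external_roles` list — is an assumption, not confirmed from the file format):

```python
from typing import Dict, List

def build_external_role_index(roles: Dict[str, dict]) -> Dict[str, List[str]]:
    """Invert role -> external_roles into externalRole -> [osmo roles].

    Built once per file load, this lets a ResolveExternalRoles-style
    lookup answer from memory instead of the SyncUserRoles SQL query.
    """
    index: Dict[str, List[str]] = {}
    for osmo_role, spec in roles.items():
        for external in spec.get('external_roles', []):
            index.setdefault(external, []).append(osmo_role)
    return index
```

Resolving a user's IDP groups then reduces to dict lookups plus a de-duplicating union over the matched role lists.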
Helm template conditionally passes --roles-file to the authz-sidecar
container and mounts the dynamic-config ConfigMap volume. When
dynamicConfig.enabled=true, postgres args are omitted (not needed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…_url

Chart values.yaml now ships with:
- Default workflow limits (max_num_tasks, timeouts)
- Default pod templates (ctrl, user)
- Default resource validations (cpu, memory)
- All 5 RBAC roles with policies (admin, user, default, ctrl, backend)

service_base_url auto-derived from services.service.hostname in the
dynamic-config template — no need to set it per deployment.

Per-deployment values only need site-specific configs:
- Dataset buckets + credentials
- Backend definitions
- Workflow image credentials
- CLI version info

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Matches configure_app() behavior: default backend with kai scheduler,
default pool referencing default backend with default platform.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Authz sidecar reads roles from ConfigMap file (FileRoleStore)
- IDP groups as source of truth for user role assignments
- Product defaults (roles, templates, validations, backend, pool) in chart
- Per-deployment values only for site-specific overrides
- Updated architecture diagram, component table, E2E test results

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vvnpn-nv and others added 2 commits April 8, 2026 14:22
- Role.list_from_db/fetch_from_db now check snapshot (was missing)
- ConfigMapWatcher.postgres is optional (None when DB not available)
- _inject_runtime_fields skips DB read when postgres is None
- All config types now served from in-memory when ConfigMap mode active

Remaining DB access in ConfigMap mode is ONLY:
- Backend heartbeats (agent writes, pool status reads)
- configure_app() startup (runs before ConfigMap mode activates)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove stale managed_by mode comments from values.yaml (v1/v2 leftover)
- Remove unused dynamic_config_poll_interval field from WorkflowServiceConfig
  (Python watcher uses watchdog events, Go sidecar has its own constant)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>