feat: add component contributor test harness #508

Open
ArangoGutierrez wants to merge 3 commits into NVIDIA:main from ArangoGutierrez:feature/component-test-harness

@ArangoGutierrez
Contributor

Summary

Validate AICR components end-to-end with a single command — no GPU hardware required for most components.

make component-test COMPONENT=cert-manager
  • Three test tiers (auto-detected from registry.yaml): scheduling (KWOK redirect), deploy (Kind + bundle + health check), gpu-aware (Kind + nvml-mock + deploy + health check)
  • nvml-mock integration using ghcr.io/nvidia/nvml-mock:0.1.0 for GPU simulation in Kind clusters (arm64 + amd64, includes nvidia-smi)
  • Bundler bugfix: deploy.sh template now conditionally includes --version flag — fixes broken helm commands for components without defaultVersion in registry (e.g., gpu-operator)

New files

  • tools/component-test/ — 7 scripts (detect-tier, ensure-cluster, setup-gpu-mock, deploy-component, run-health-check, cleanup), Kind config, nvml-mock manifest, README
  • Makefile targets: component-test, component-detect, component-cluster, component-deploy, component-health, component-cleanup
  • Documentation updates in DEVELOPMENT.md and CONTRIBUTING.md

Test Plan

  • make test — all unit tests pass (72.1% coverage)
  • make component-test COMPONENT=cert-manager — deploy tier end-to-end (build → deploy → health check → cleanup)
  • make component-test COMPONENT=gpu-operator TIER=gpu-aware — gpu-aware tier end-to-end (build → nvml-mock → deploy → health check → cleanup)
  • make component-test COMPONENT=cert-manager TIER=scheduling — scheduling tier redirects to KWOK
  • New tests: TestGenerateDeployScript_EmptyVersionOmitsFlag, TestGenerateDeployScript_WithVersionIncludesFlag

@kannon92

kannon92 commented Apr 8, 2026

So rather than go with mock GPUs is there a way we could have a CPU flavor?

I like that pattern for llama.cpp or vllm.

@ArangoGutierrez
Contributor Author

So rather than go with mock GPUs is there a way we could have a CPU flavor?

I like that pattern for llama.cpp or vllm.

Good question — the harness actually already has a GPU-free path. The deploy tier validates components in plain Kind without any GPU mock (cert-manager, kai-scheduler, etc. use this today).

The nvml-mock layer is specifically for components that gate on GPU presence during init (gpu-operator, nvidia-device-plugin, DRA driver): they won't even start their reconciliation loop unless they detect NVML libraries and device nodes on the host. There's no CPU flavor of those because their entire purpose is managing GPU hardware.

For inference workloads like llama.cpp or vLLM, a CPU flavor would make sense as a complementary pattern: deploy the serving stack with a CPU backend and validate the end-to-end request path. That's a higher-level integration test than what this harness targets (component deployment + health check), but it could be built on top of it.

So both patterns have a place:

  • nvml-mock: GPU infrastructure components that check for hardware at init
  • CPU flavors: inference/serving workloads that can run with CPU backends

Validate AICR components end-to-end with a single command:

  make component-test COMPONENT=cert-manager

Three test tiers, auto-detected from registry.yaml:
- scheduling: redirects to existing KWOK infrastructure
- deploy: Kind cluster + aicr bundle + chainsaw health check
- gpu-aware: Kind + nvml-mock DaemonSet + deploy + health check

New files:
- tools/component-test/{detect-tier,ensure-cluster,setup-gpu-mock,
  deploy-component,run-health-check,cleanup}.sh
- tools/component-test/{kind-config.yaml,manifests/nvml-mock.yaml,README.md}

Makefile targets: component-test, component-detect, component-cluster,
component-deploy, component-health, component-cleanup.

Uses ghcr.io/nvidia/nvml-mock:0.1.0 for GPU simulation in Kind clusters
(arm64+amd64, includes nvidia-smi).

Tested end-to-end:
- deploy tier: cert-manager (build → deploy → health check → cleanup)
- gpu-aware tier: gpu-operator (build → nvml-mock → deploy → health check → cleanup)

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

The deploy.sh template unconditionally included '--version {{ .Version }}'
which produced a broken helm command when Version was empty (e.g.,
gpu-operator has no defaultVersion in registry.yaml). Helm 4 treats
the empty --version as a missing required argument.

The template now conditionally includes --version only when Version
is non-empty, allowing components without pinned versions to install
the latest chart from the repository.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@ArangoGutierrez force-pushed the feature/component-test-harness branch from d84bc0a to 45ddbbe on April 8, 2026 19:08

@kannon92 kannon92 left a comment

Thanks for this! This should help me a lot with Kueue work.

@ArangoGutierrez
Contributor Author

CI is passing, ready for review @yuanchen8911 / @mchmarny

2 participants