feat: add component contributor test harness #508
ArangoGutierrez wants to merge 3 commits into NVIDIA:main
Conversation
So rather than going with mock GPUs, is there a way we could have a CPU flavor? I like that pattern for llama.cpp or vLLM.
Good question — the harness actually already has a GPU-free path. The deploy tier validates components in plain Kind without any GPU mock (cert-manager, kai-scheduler, etc. use this today). The nvml-mock layer is specifically for components that gate on GPU presence during init — gpu-operator, nvidia-device-plugin, the DRA driver — which won't even start their reconciliation loops unless they detect a GPU.

For inference workloads like llama.cpp or vLLM, a CPU flavor would make sense as a complementary pattern: deploy the serving stack with a CPU backend and validate the end-to-end request path. That exercises the workload itself rather than the GPU plumbing. So both patterns have a place.
Validate AICR components end-to-end with a single command:
make component-test COMPONENT=cert-manager
Three test tiers, auto-detected from registry.yaml:
- scheduling: redirects to existing KWOK infrastructure
- deploy: Kind cluster + aicr bundle + chainsaw health check
- gpu-aware: Kind + nvml-mock DaemonSet + deploy + health check
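The tier auto-detection could be sketched roughly as below — a hypothetical shell sketch, assuming a flat registry.yaml layout with a per-component `tier` field; the real detect-tier.sh and the actual registry schema may differ:

```shell
#!/usr/bin/env sh
# Hypothetical sketch of tier detection. Assumes a flat registry layout like:
#
#   cert-manager:
#     tier: deploy
#
detect_tier() {
  component="$1"
  registry="$2"
  awk -v comp="${component}:" '
    $1 == comp                { in_block = 1; next }  # entered the component block
    in_block && /^[^ ]/       { in_block = 0 }        # next top-level key ends it
    in_block && $1 == "tier:" { print $2; exit }      # emit the tier value
  ' "$registry"
}
```

With that in place, `detect_tier cert-manager registry.yaml` would print the tier name, which the harness can then dispatch on (KWOK redirect, plain Kind, or Kind + nvml-mock).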
New files:
- tools/component-test/{detect-tier,ensure-cluster,setup-gpu-mock,
deploy-component,run-health-check,cleanup}.sh
- tools/component-test/{kind-config.yaml,manifests/nvml-mock.yaml,README.md}
Makefile targets: component-test, component-detect, component-cluster,
component-deploy, component-health, component-cleanup.
Uses ghcr.io/nvidia/nvml-mock:0.1.0 for GPU simulation in Kind clusters
(arm64+amd64, includes nvidia-smi).
Tested end-to-end:
- deploy tier: cert-manager (build → deploy → health check → cleanup)
- gpu-aware tier: gpu-operator (build → nvml-mock → deploy → health check → cleanup)
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The deploy.sh template unconditionally included '--version {{ .Version }}'
which produced a broken helm command when Version was empty (e.g.,
gpu-operator has no defaultVersion in registry.yaml). Helm 4 treats
the empty --version as a missing required argument.
The template now conditionally includes --version only when Version
is non-empty, allowing components without pinned versions to install
the latest chart from the repository.
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
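The shape of the fix can be illustrated with a small shell sketch — hypothetical, not the actual template code (the real deploy.sh is generated from a Go template, and the `aicr/` chart repo name here is an assumption): the `--version` argument is appended only when a version is actually pinned.

```shell
#!/usr/bin/env sh
# Hypothetical sketch of the conditional --version fix. Chart/repo names
# are illustrative, not taken from the PR.
build_helm_cmd() {
  component="$1"
  version="$2"
  cmd="helm upgrade --install $component aicr/$component"
  # Only pin the chart version when one is set; an empty --version
  # would otherwise produce a broken helm invocation.
  if [ -n "$version" ]; then
    cmd="$cmd --version $version"
  fi
  printf '%s\n' "$cmd"
}
```

A component without a `defaultVersion` then installs the latest chart from the repository, while pinned components keep their exact version.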
Force-pushed from d84bc0a to 45ddbbe
kannon92 left a comment:
Thanks for this! This should help me with a lot of Kueue work.
CI is passing; ready for review @yuanchen8911 / @mchmarny
Summary
Validate AICR components end-to-end with a single command — no GPU hardware required for most components.
- Three test tiers, auto-detected from `registry.yaml`: `scheduling` (KWOK redirect), `deploy` (Kind + bundle + health check), `gpu-aware` (Kind + nvml-mock + deploy + health check)
- Uses `ghcr.io/nvidia/nvml-mock:0.1.0` for GPU simulation in Kind clusters (arm64 + amd64, includes nvidia-smi)
- The `deploy.sh` template now conditionally includes the `--version` flag — fixes broken helm commands for components without `defaultVersion` in the registry (e.g., gpu-operator)

New files

- `tools/component-test/` — 7 scripts (detect-tier, ensure-cluster, setup-gpu-mock, deploy-component, run-health-check, cleanup), Kind config, nvml-mock manifest, README
- Makefile targets: `component-test`, `component-detect`, `component-cluster`, `component-deploy`, `component-health`, `component-cleanup`
- Docs updated: `DEVELOPMENT.md` and `CONTRIBUTING.md`

Test Plan

- `make test` — all unit tests pass (72.1% coverage)
- `make component-test COMPONENT=cert-manager` — deploy tier end-to-end (build → deploy → health check → cleanup)
- `make component-test COMPONENT=gpu-operator TIER=gpu-aware` — gpu-aware tier end-to-end (build → nvml-mock → deploy → health check → cleanup)
- `make component-test COMPONENT=cert-manager TIER=scheduling` — scheduling tier redirects to KWOK
- New unit tests: `TestGenerateDeployScript_EmptyVersionOmitsFlag`, `TestGenerateDeployScript_WithVersionIncludesFlag`