Skip to content

add kueue components as an option#490

Merged
yuanchen8911 merged 1 commit intoNVIDIA:mainfrom
kannon92:issue-486-kueue-component
Apr 9, 2026
Merged

add kueue components as an option#490
yuanchen8911 merged 1 commit intoNVIDIA:mainfrom
kannon92:issue-486-kueue-component

Conversation

@kannon92
Copy link
Copy Markdown
Contributor

@kannon92 kannon92 commented Apr 4, 2026

Summary

Add kueue recipe.

Motivation / Context

Fixes: #486
Related:

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

Testing

# Commands run (prefer `make qualify` for non-trivial changes)
make qualify

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes:

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@kannon92 kannon92 requested a review from a team as a code owner April 4, 2026 03:27
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Apr 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Copy link
Copy Markdown
Contributor

@ArangoGutierrez ArangoGutierrez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Kevin, thanks for adding this. A couple things to fix before this is ready.

@kannon92
Copy link
Copy Markdown
Contributor Author

kannon92 commented Apr 6, 2026

I'm having a hard time figuring out how to test this on a kind cluster.

I was thinking that I could have a bundle and verify simple kind cluster for Kueue but can't figure out how to do that.

@kannon92 kannon92 force-pushed the issue-486-kueue-component branch from 9a20673 to 3f702e0 Compare April 7, 2026 00:30
@ArangoGutierrez
Copy link
Copy Markdown
Contributor

I'm having a hard time figuring out how to test this on a kind cluster.

I was thinking that I could have a bundle and verify simple kind cluster for Kueue but can't figure out how to do that.

In response to this comment, I have created #508 , with it, is easier for contributors to fully test the component in a controlled environment, even if the contributor doesn't have access to a GPU by using the NVML-Mock project.

Hope this helps @kannon92

@yuanchen8911
Copy link
Copy Markdown
Contributor

Cross-Review Summary for PR #490

Reviewers: Codex, CodeRabbit + Integration Analysis
Rounds: 1 + Codex follow-up
Consensus reached: Yes — no confirmed defects

Confirmed Issues

None. The PR is correct as written.

Positive Observations

  • Registry entry follows established patterns — displayName, version quoting, OCI URL, nodeScheduling structure all consistent with existing components
  • Health check mirrors kubeflow-trainer/kai-scheduler pattern (deployment exists + no unhealthy pods in namespace)
  • controllerManager.tolerations: [{operator: Exists}] matches kai-scheduler/nvsentinel/network-operator convention
  • Kueue webhook runs in the same controller-manager pod (verified against upstream chart v0.17.0), so wiring only controllerManager.* scheduling paths is correct
  • All 3 files automatically embedded via existing //go:embed directives — no Go code changes needed
  • No existing kueue references in codebase — clean integration

Cross-review by Claude Code + Codex + CodeRabbit

Copy link
Copy Markdown
Contributor

@yuanchen8911 yuanchen8911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@yuanchen8911
Copy link
Copy Markdown
Contributor

@kannon92 , can you rebase on main.

@yuanchen8911 yuanchen8911 self-requested a review April 9, 2026 18:30
@kannon92 kannon92 marked this pull request as ready for review April 9, 2026 18:37
@kannon92 kannon92 force-pushed the issue-486-kueue-component branch from 197b1fb to 7d7a177 Compare April 9, 2026 18:38
Copy link
Copy Markdown
Contributor

@yuanchen8911 yuanchen8911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cross-Review Summary (Claude Code + Codex + CodeRabbit + Integration Analysis)

Rounds: 1 | Consensus: Yes | Confirmed issues: 0

Result: No findings.

All four reviewers independently confirmed this PR follows established patterns exactly. The registry entry, values.yaml, and health check are consistent with existing components (kubeflow-trainer, kai-scheduler, etc.).

Residual Risk

Risk Notes
No overlay references kueue yet Component is inert until an overlay/mixin adds a componentRef — expected for new components
Chart version not verified oci://registry.k8s.io/kueue/charts/kueue:0.17.0 is an external artifact
Broad toleration (operator: Exists) Kueue controller schedules on any node; system nodeScheduling tolerations are redundant but harmless

Positive Observations

  • All three files follow established patterns (registry, values, chainsaw health check)
  • nodeScheduling paths internally consistent with values.yaml structure
  • Version string properly quoted to avoid YAML float interpretation

Copy link
Copy Markdown
Contributor

@yuanchen8911 yuanchen8911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@kannon92 kannon92 force-pushed the issue-486-kueue-component branch 2 times, most recently from 6b32aec to 1f32c34 Compare April 9, 2026 21:27
Signed-off-by: Kevin Hannon <kehannon@redhat.com>
@kannon92 kannon92 force-pushed the issue-486-kueue-component branch from 1f32c34 to f420487 Compare April 9, 2026 22:35
@yuanchen8911 yuanchen8911 merged commit 6bd821d into NVIDIA:main Apr 9, 2026
64 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: overlay support for Kueue

3 participants