diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md index 9b00f9797..5db212ad5 100644 --- a/.claude/CLAUDE.md +++ b/.claude/CLAUDE.md @@ -247,6 +247,27 @@ slog.Error("operation failed", "error", err, "component", "gpu-collector") **Note:** A component must have either `helm` OR `kustomize` configuration, not both. +**Using mixins for shared OS/platform content:** +```yaml +# Leaf overlay referencing mixins instead of duplicating content +spec: + base: h100-eks-ubuntu-training + mixins: + - os-ubuntu # Ubuntu constraints (defined once in recipes/mixins/) + - platform-kubeflow # kubeflow-trainer component (defined once in recipes/mixins/) + criteria: + service: eks + accelerator: h100 + os: ubuntu + intent: training + platform: kubeflow + constraints: + - name: K8s.server.version + value: ">= 1.32.4" +``` + +Mixins carry only `constraints` and `componentRefs` — no `criteria`, `base`, `mixins`, or `validation`. They live in `recipes/mixins/` with `kind: RecipeMixin`. + ## Error Wrapping Rules **Never return bare errors.** Every `return err` must wrap with context: @@ -467,6 +488,7 @@ ${AICR_BIN} validate -r recipe.yaml -s snapshot.yaml --no-cluster | `.settings.yaml` | Project settings: tool versions, quality thresholds, build/test config (single source of truth) | | `recipes/registry.yaml` | Declarative component configuration | | `recipes/overlays/*.yaml` | Recipe overlay definitions | +| `recipes/mixins/*.yaml` | Composable mixin fragments (OS constraints, platform components) | | `recipes/components/*/values.yaml` | Component Helm values | | `api/aicr/v1/server.yaml` | OpenAPI spec | | `.goreleaser.yaml` | Release configuration | diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md index 44ee4c4e0..baf59ec99 100644 --- a/DEVELOPMENT.md +++ b/DEVELOPMENT.md @@ -209,7 +209,7 @@ aicr/ - **Snapshot Mode**: Extract query from snapshot → Build recipe → Return recommendations - **Input**: OS, OS version, kernel, K8s service/version, GPU type, workload intent - **Output**: Recipe 
with matched rules and configuration measurements -- **Data Source**: Embedded YAML configuration (`recipes/overlays/*.yaml` including `base.yaml`) +- **Data Source**: Embedded YAML configuration (`recipes/overlays/*.yaml` including `base.yaml`, `recipes/mixins/*.yaml`) - **Query Extraction**: Parses K8s, OS, GPU measurements from snapshots to construct recipe queries #### Snapshotter diff --git a/docs/contributor/data.md b/docs/contributor/data.md index d0f46868d..7c26f3eb7 100644 --- a/docs/contributor/data.md +++ b/docs/contributor/data.md @@ -36,6 +36,10 @@ recipes/ │ ├── eks-training.yaml # EKS + training workloads (inherits from eks) │ ├── gb200-eks-ubuntu-training.yaml # GB200/EKS/Ubuntu/training (inherits from eks-training) │ └── h100-ubuntu-inference.yaml # H100/Ubuntu/inference +├── mixins/ # Composable mixin fragments (kind: RecipeMixin) +│ ├── os-ubuntu.yaml # Ubuntu OS constraints (shared by leaf overlays) +│ ├── platform-inference.yaml # Inference gateway components (shared by service-inference overlays) +│ └── platform-kubeflow.yaml # Kubeflow trainer component (shared by leaf overlays) └── components/ # Component values files ├── cert-manager/ │ └── values.yaml @@ -88,6 +92,9 @@ metadata: spec: base: # Optional - inherits from another recipe + mixins: # Optional - composable mixin fragments + - os-ubuntu # OS constraints (from recipes/mixins/) + - platform-kubeflow # Platform components (from recipes/mixins/) criteria: # When this recipe/overlay applies service: eks # Kubernetes platform @@ -118,6 +125,7 @@ spec: | `apiVersion` | Always `aicr.nvidia.com/v1alpha1` | | `metadata.name` | Unique recipe identifier | | `spec.base` | Parent recipe to inherit from (empty = inherits from `overlays/base.yaml`) | +| `spec.mixins` | List of mixin names to compose (e.g., `["os-ubuntu", "platform-kubeflow"]`) | | `spec.criteria` | Query parameters that select this recipe | | `spec.constraints` | Pre-flight validation rules | | `spec.componentRefs` | List of 
components to deploy | @@ -389,6 +397,52 @@ spec: | **Flexible Extension** | Add new leaf recipes without duplicating parent configs | | **Testable** | Each level can be validated independently | +### Mixin Composition + +Inheritance is single-parent (`spec.base`), which means cross-cutting concerns like OS constraints or platform components would need to be duplicated across leaf overlays. **Mixins** solve this by providing composable fragments that leaf overlays reference via `spec.mixins`. + +Mixin files live in `recipes/mixins/` and use `kind: RecipeMixin`: + +```yaml +# recipes/mixins/os-ubuntu.yaml +kind: RecipeMixin +apiVersion: aicr.nvidia.com/v1alpha1 +metadata: + name: os-ubuntu + +spec: + constraints: + - name: OS.release.ID + value: ubuntu + - name: OS.release.VERSION_ID + value: "24.04" + - name: OS.sysctl./proc/sys/kernel/osrelease + value: ">= 6.8" +``` + +Leaf overlays compose mixins alongside inheritance: + +```yaml +# recipes/overlays/h100-eks-ubuntu-training-kubeflow.yaml +spec: + base: h100-eks-training + mixins: + - os-ubuntu # Ubuntu constraints + - platform-kubeflow # Kubeflow trainer component + criteria: + service: eks + accelerator: h100 + os: ubuntu + intent: training + platform: kubeflow +``` + +**Mixin rules:** +- Mixins carry only `constraints` and `componentRefs` — no `criteria`, `base`, `mixins`, or `validation` +- Mixins are applied after inheritance chain merging but before constraint evaluation +- Conflict detection: a mixin constraint or component that conflicts with the inheritance chain or a previously applied mixin produces an error +- When a snapshot is provided, mixin constraints are evaluated against it after merging; if any fail, the entire composed candidate is invalid and falls back to base-only output. 
In plain query mode (no snapshot), mixin constraints are merged but not evaluated + ### Cycle Detection The system detects circular inheritance to prevent infinite loops: @@ -624,7 +678,7 @@ store, err := loadMetadataStore(ctx) - Embedded YAML files are parsed into Go structs - Cached in memory on first access (singleton pattern with `sync.Once`) -- Contains base recipe, all overlays, and component values files +- Contains base recipe, all overlays, mixins, and component values files ### Step 2: Find Matching Overlays @@ -679,7 +733,18 @@ func mergeComponentRef(base, overlay ComponentRef) ComponentRef { } ``` -### Step 5: Validate Dependencies +### Step 5: Apply Mixins + +```go +mixinConstraintNames, err := store.mergeMixins(mergedSpec) +``` + +- If the leaf overlay declares `spec.mixins`, each named mixin is loaded from `recipes/mixins/` +- Mixin constraints and componentRefs are appended to the merged spec +- Conflict detection prevents duplicates between the inheritance chain, previously applied mixins, and the current mixin +- When a snapshot evaluator is provided, mixin constraints are evaluated against it after merging; failure invalidates the entire composed candidate. 
In plain query mode (no snapshot), mixin constraints are merged but not evaluated + +### Step 6: Validate Dependencies ```go if err := mergedSpec.ValidateDependencies(); err != nil { @@ -690,7 +755,7 @@ if err := mergedSpec.ValidateDependencies(); err != nil { - Verify all `dependencyRefs` reference existing components - Detect circular dependencies -### Step 6: Compute Deployment Order +### Step 7: Compute Deployment Order ```go deployOrder, err := mergedSpec.TopologicalSort() @@ -699,7 +764,7 @@ deployOrder, err := mergedSpec.TopologicalSort() - Topologically sort components based on `dependencyRefs` - Ensures dependencies are deployed before dependents -### Step 7: Build RecipeResult +### Step 8: Build RecipeResult ```go return &RecipeResult{ diff --git a/docs/integrator/data-flow.md b/docs/integrator/data-flow.md index e32289d3e..bd96c9c3d 100644 --- a/docs/integrator/data-flow.md +++ b/docs/integrator/data-flow.md @@ -251,10 +251,15 @@ When a query matches a leaf recipe that has a `spec.base` reference, the system │ ├─ + gb200-eks-training (GB200 overrides) │ │ └─ + gb200-eks-ubuntu-training (Ubuntu specifics) │ │ │ -│ 4. Strip context (if !context) │ +│ 4. Apply mixins (if spec.mixins declared) │ +│ ├─ Load mixin files from recipes/mixins/ │ +│ ├─ Append mixin constraints and componentRefs │ +│ └─ If snapshot provided, evaluate mixin constraints│ +│ │ +│ 5. Strip context (if !context) │ │ └─ Remove context maps from all subtypes │ │ │ -│ 5. Return recipe │ +│ 6. 
Return recipe │ │ │ └────────────────────────────────────────────────────────┘ ``` @@ -812,7 +817,7 @@ X-RateLimit-Reset: 1735650000 ### Embedded Data **Recipe Data:** -- Location: `recipes/overlays/*.yaml` (including `base.yaml`) +- Location: `recipes/overlays/*.yaml` (including `base.yaml`), `recipes/mixins/*.yaml` - Embedded at compile time via `//go:embed` directives - Loaded once per process, cached in memory - TTL: 5 minutes (in-memory cache) diff --git a/docs/integrator/recipe-development.md b/docs/integrator/recipe-development.md index a232b0197..086986336 100644 --- a/docs/integrator/recipe-development.md +++ b/docs/integrator/recipe-development.md @@ -74,11 +74,12 @@ make qualify # Includes end to end tests before submitting ## Overview -Recipe metadata files define component configurations for GPU-accelerated Kubernetes deployments using a **base-plus-overlay architecture** with **multi-level inheritance**: +Recipe metadata files define component configurations for GPU-accelerated Kubernetes deployments using a **base-plus-overlay architecture** with **multi-level inheritance** and **mixin composition**: - **Base values** (`overlays/base.yaml`) - universal defaults - **Intermediate recipes** (`eks.yaml`, `eks-training.yaml`) - shared configurations for categories - **Leaf recipes** (`gb200-eks-ubuntu-training.yaml`) - hardware/workload-specific overrides +- **Mixins** (`mixins/*.yaml`) - composable fragments (OS constraints, platform components) that leaf overlays reference via `spec.mixins` instead of duplicating content - **Inline overrides** - per-recipe customization without new files Recipe files in `recipes/` are embedded at compile time. Integrators can extend or override using the `--data` flag (see [Advanced Topics](#advanced-topics)). 
@@ -125,12 +126,31 @@ spec: version: "580.82.07" # Hardware-specific override ``` -**Merge order:** `base.yaml` (lowest) → intermediate → leaf (highest) +**Leaf recipes with mixins** compose shared fragments: +```yaml +# h100-eks-ubuntu-training-kubeflow.yaml +spec: + base: h100-eks-ubuntu-training + mixins: + - os-ubuntu # Shared Ubuntu constraints (from recipes/mixins/) + - platform-kubeflow # Kubeflow trainer component (from recipes/mixins/) + criteria: + service: eks + accelerator: h100 + os: ubuntu + intent: training + platform: kubeflow +``` + +Mixins use `kind: RecipeMixin` and carry only `constraints` and `componentRefs`. They live in `recipes/mixins/` and are applied after inheritance chain merging. See [Data Architecture](../contributor/data.md#mixin-composition) for details. + +**Merge order:** `base.yaml` (lowest) → intermediate → leaf → mixins (highest) **Merge rules:** - Constraints: same-named overridden, new added - ComponentRefs: same-named merged field-by-field, new added - Criteria: not inherited (each recipe defines its own) +- Mixin constraints/components must not conflict with the inheritance chain or other mixins ### Component Types @@ -219,6 +239,8 @@ File names are for human readability—matching uses `spec.criteria`, not file n | Service + intent | `{service}-{intent}.yaml` | `eks-training.yaml` | | Full criteria | `{accel}-{service}-{os}-{intent}.yaml` | `gb200-eks-ubuntu-training.yaml` | | + platform | `{accel}-{service}-{os}-{intent}-{platform}.yaml` | `gb200-eks-ubuntu-training-kubeflow.yaml` | +| Mixin (OS) | `os-{os}.yaml` | `os-ubuntu.yaml` | +| Mixin (platform) | `platform-{platform}.yaml` | `platform-kubeflow.yaml` | | Component values | `values-{service}-{intent}.yaml` | `values-eks-training.yaml` | ## Constraints and Validation @@ -298,9 +320,10 @@ go test -v ./pkg/recipe/... -run TestConstraintPathsUseValidMeasurementTypes **Steps:** 1. Create overlay in `recipes/overlays/` with criteria and componentRefs -2. 
Create component values files if using `valuesFile` -3. Run tests: `make test` -4. Test generation: `aicr recipe --service eks --accelerator gb200 --format yaml` +2. If the recipe shares OS constraints or platform components with other overlays, reference existing mixins via `spec.mixins` instead of duplicating (or create new mixins in `recipes/mixins/`) +3. Create component values files if using `valuesFile` +4. Run tests: `make test` +5. Test generation: `aicr recipe --service eks --accelerator gb200 --format yaml` **Example:** ```yaml @@ -348,6 +371,7 @@ componentRefs: **Do:** - Use minimum criteria fields needed for matching - Keep base recipe universal and conservative +- Use mixins for shared OS constraints or platform components instead of duplicating across leaf overlays - Always explain why settings exist (1-2 sentences) - Follow naming conventions (`{accel}-{service}-{os}-{intent}-{platform}`) - Run `make test` before committing @@ -357,6 +381,7 @@ componentRefs: - Add environment-specific settings to base - Over-specify criteria (too narrow = fewer matches) - Create duplicate criteria combinations +- Duplicate OS or platform content across leaf overlays (use mixins instead) - Skip validation tests - Forget to update context when values change @@ -406,6 +431,8 @@ Integrators can extend or override embedded recipe data using the `--data` flag ├── registry.yaml # Extends/overrides component registry ├── overlays/ │ └── custom-recipe.yaml # New or override existing recipe +├── mixins/ +│ └── os-custom.yaml # Custom mixin fragments └── components/ └── my-operator/ └── values.yaml # Component values diff --git a/pkg/recipe/builder_test.go b/pkg/recipe/builder_test.go index 9699ad3cd..f1b47a9df 100644 --- a/pkg/recipe/builder_test.go +++ b/pkg/recipe/builder_test.go @@ -278,11 +278,9 @@ func TestGetEmbeddedFS(t *testing.T) { // TestConstraintWarning tests the ConstraintWarning struct. 
func TestConstraintWarning(t *testing.T) { - const k8sVersionConstraint = "K8s.server.version" - warning := ConstraintWarning{ Overlay: "h100-eks-ubuntu-training-kubeflow", - Constraint: k8sVersionConstraint, + Constraint: testK8sVersionConstant, Expected: ">= 1.32.4", Actual: "1.30.0", Reason: "expected >= 1.32.4, got 1.30.0", @@ -291,8 +289,8 @@ func TestConstraintWarning(t *testing.T) { if warning.Overlay != "h100-eks-ubuntu-training-kubeflow" { t.Errorf("expected overlay h100-eks-ubuntu-training-kubeflow, got %q", warning.Overlay) } - if warning.Constraint != k8sVersionConstraint { - t.Errorf("expected constraint %s, got %q", k8sVersionConstraint, warning.Constraint) + if warning.Constraint != testK8sVersionConstant { + t.Errorf("expected constraint %s, got %q", testK8sVersionConstant, warning.Constraint) } if warning.Expected != ">= 1.32.4" { t.Errorf("expected expression >= 1.32.4, got %q", warning.Expected) diff --git a/pkg/recipe/conformance_test.go b/pkg/recipe/conformance_test.go index 27449eb85..4d3ccdd5e 100644 --- a/pkg/recipe/conformance_test.go +++ b/pkg/recipe/conformance_test.go @@ -261,7 +261,7 @@ func TestConformanceRecipeInvariants(t *testing.T) { if tt.wantDRAConstraint { var hasDRAConstraint bool for _, c := range result.Constraints { - if c.Name == "K8s.server.version" && strings.Contains(c.Value, "1.34") { + if c.Name == testK8sVersionConstant && strings.Contains(c.Value, "1.34") { hasDRAConstraint = true break } diff --git a/pkg/recipe/metadata.go b/pkg/recipe/metadata.go index 56b7a94af..eea461d10 100644 --- a/pkg/recipe/metadata.go +++ b/pkg/recipe/metadata.go @@ -259,6 +259,12 @@ type RecipeMetadataSpec struct { // Only present in overlay files, not in base. Criteria *Criteria `json:"criteria,omitempty" yaml:"criteria,omitempty"` + // Mixins is a list of mixin names to compose into this overlay. + // Mixins are loaded from recipes/mixins/ and carry only constraints + // and componentRefs. 
This field is loader metadata and is stripped + // from the materialized recipe result. + Mixins []string `json:"mixins,omitempty" yaml:"mixins,omitempty"` + // Constraints are deployment assumptions/requirements. Constraints []Constraint `json:"constraints,omitempty" yaml:"constraints,omitempty"` @@ -270,6 +276,24 @@ type RecipeMetadataSpec struct { Validation *ValidationConfig `json:"validation,omitempty" yaml:"validation,omitempty"` } +// RecipeMixinKind is the kind value for mixin files. +const RecipeMixinKind = "RecipeMixin" + +// RecipeMixin represents a composable fragment that carries only constraints +// and componentRefs. Mixins live in recipes/mixins/ and are referenced by +// overlay spec.mixins fields. +type RecipeMixin struct { + Kind string `json:"kind" yaml:"kind"` + APIVersion string `json:"apiVersion" yaml:"apiVersion"` + Metadata struct { + Name string `json:"name" yaml:"name"` + } `json:"metadata" yaml:"metadata"` + Spec struct { + Constraints []Constraint `json:"constraints,omitempty" yaml:"constraints,omitempty"` + ComponentRefs []ComponentRef `json:"componentRefs,omitempty" yaml:"componentRefs,omitempty"` + } `json:"spec" yaml:"spec"` +} + // RecipeMetadataHeader contains the Kubernetes-style header fields. type RecipeMetadataHeader struct { // Kind is always "RecipeMetadata". @@ -422,6 +446,23 @@ func (s *RecipeMetadataSpec) Merge(other *RecipeMetadataSpec) { } } } + + // Accumulate mixins (deduplicated, preserving order). + // Both leaf and intermediate overlays can declare mixins. When an + // intermediate overlay (e.g., eks-inference) declares a mixin, it is + // accumulated into all descendants during inheritance chain merging. 
+ if len(other.Mixins) > 0 { + seen := make(map[string]bool) + for _, m := range s.Mixins { + seen[m] = true + } + for _, m := range other.Mixins { + if !seen[m] { + s.Mixins = append(s.Mixins, m) + seen[m] = true + } + } + } } // mergeComponentRef merges overlay into base, with overlay taking precedence diff --git a/pkg/recipe/metadata_store.go b/pkg/recipe/metadata_store.go index f8adc0a37..670bc53c1 100644 --- a/pkg/recipe/metadata_store.go +++ b/pkg/recipe/metadata_store.go @@ -15,6 +15,7 @@ package recipe import ( + "bytes" "context" "fmt" "io/fs" @@ -44,6 +45,9 @@ type MetadataStore struct { // Overlays is a list of overlay recipes indexed by name. Overlays map[string]*RecipeMetadata + // Mixins is a map of composable mixin fragments indexed by name. + Mixins map[string]*RecipeMixin + // ValuesFiles contains embedded values file contents indexed by filename. ValuesFiles map[string][]byte } @@ -56,6 +60,7 @@ func loadMetadataStore(_ context.Context) (*MetadataStore, error) { store := &MetadataStore{ Overlays: make(map[string]*RecipeMetadata), + Mixins: make(map[string]*RecipeMixin), ValuesFiles: make(map[string][]byte), } @@ -77,6 +82,34 @@ func loadMetadataStore(_ context.Context) (*MetadataStore, error) { return nil } + // Handle mixin files (files in the mixins/ directory) + if strings.HasPrefix(path, "mixins/") { + if !strings.HasSuffix(filename, ".yaml") { + return nil + } + content, readErr := provider.ReadFile(path) + if readErr != nil { + return aicrerrors.Wrap(aicrerrors.ErrCodeInternal, fmt.Sprintf("failed to read mixin %s", path), readErr) + } + var mixin RecipeMixin + decoder := yaml.NewDecoder(bytes.NewReader(content)) + decoder.KnownFields(true) + if parseErr := decoder.Decode(&mixin); parseErr != nil { + return aicrerrors.Wrap(aicrerrors.ErrCodeInvalidRequest, fmt.Sprintf("failed to parse mixin %s (unknown fields are not allowed)", path), parseErr) + } + if mixin.Kind != RecipeMixinKind { + return aicrerrors.New(aicrerrors.ErrCodeInvalidRequest, 
+ fmt.Sprintf("mixin file %s has wrong kind %q, expected %q", path, mixin.Kind, RecipeMixinKind)) + } + if _, exists := store.Mixins[mixin.Metadata.Name]; exists { + return aicrerrors.New(aicrerrors.ErrCodeInvalidRequest, + fmt.Sprintf("duplicate mixin name %q in %s", mixin.Metadata.Name, path)) + } + store.Mixins[mixin.Metadata.Name] = &mixin + slog.Debug("loaded mixin", "name", mixin.Metadata.Name, "path", path) + return nil + } + // Handle component files (files in the components/ directory) if strings.Contains(path, "components/") { content, readErr := provider.ReadFile(path) @@ -290,6 +323,241 @@ func (s *MetadataStore) filterToMaximalLeaves(matches []*RecipeMetadata) []*Reci return leaves } +// mergeMixins resolves and merges mixin fragments referenced by spec.mixins. +// Mixins are merged after the inheritance chain, contributing only constraints +// and componentRefs. Detects conflicts: duplicate constraint names or component +// names between a mixin and the already-merged spec are rejected. +// The Mixins field is cleared from the result afterward. +// Returns the set of mixin-contributed constraint names for post-compose evaluation. 
+func (s *MetadataStore) mergeMixins(mergedSpec *RecipeMetadataSpec) (map[string]bool, error) { + mixinConstraintNames := make(map[string]bool) + if len(mergedSpec.Mixins) == 0 { + return mixinConstraintNames, nil + } + + // Build index of existing constraint and component names for conflict detection + existingConstraints := make(map[string]bool) + for _, c := range mergedSpec.Constraints { + existingConstraints[c.Name] = true + } + existingComponents := make(map[string]bool) + for _, c := range mergedSpec.ComponentRefs { + existingComponents[c.Name] = true + } + + for _, mixinName := range mergedSpec.Mixins { + mixin, exists := s.Mixins[mixinName] + if !exists { + return nil, aicrerrors.New(aicrerrors.ErrCodeNotFound, + fmt.Sprintf("mixin %q not found in recipes/mixins/", mixinName)) + } + + // Detect conflicts: mixin constraint/component names vs inheritance chain + // and previously applied mixins (existingConstraints/existingComponents + // are updated after each mixin merge) + for _, c := range mixin.Spec.Constraints { + if existingConstraints[c.Name] { + return nil, aicrerrors.New(aicrerrors.ErrCodeInvalidRequest, + fmt.Sprintf("mixin %q constraint %q conflicts with inheritance chain or another mixin", mixinName, c.Name)) + } + } + for _, c := range mixin.Spec.ComponentRefs { + if existingComponents[c.Name] { + return nil, aicrerrors.New(aicrerrors.ErrCodeInvalidRequest, + fmt.Sprintf("mixin %q component %q conflicts with inheritance chain or another mixin", mixinName, c.Name)) + } + } + + // Merge mixin content + mixinSpec := RecipeMetadataSpec{ + Constraints: mixin.Spec.Constraints, + ComponentRefs: mixin.Spec.ComponentRefs, + } + mergedSpec.Merge(&mixinSpec) + + // Track mixin contributions for future conflict detection + for _, c := range mixin.Spec.Constraints { + existingConstraints[c.Name] = true + mixinConstraintNames[c.Name] = true + } + for _, c := range mixin.Spec.ComponentRefs { + existingComponents[c.Name] = true + } + + slog.Debug("merged 
mixin", "name", mixinName, + "constraints", len(mixin.Spec.Constraints), + "components", len(mixin.Spec.ComponentRefs)) + } + + // Strip mixins from the materialized result — loader metadata only + mergedSpec.Mixins = nil + return mixinConstraintNames, nil +} + +// mixinEvalResult holds the outcome of post-compose mixin constraint evaluation. +type mixinEvalResult struct { + // Failed is true if any mixin constraint failed evaluation. + Failed bool + // ExcludedOverlays are the overlays excluded due to the failure. + ExcludedOverlays []string + // Warnings are the constraint warnings for the failing constraints. + Warnings []ConstraintWarning + // Spec is the rebuilt spec (without mixin-bearing chains) if failed, or nil if all passed. + Spec *RecipeMetadataSpec + // AppliedOverlays is the surviving applied overlays if failed. + AppliedOverlays []string +} + +// evaluateMixinConstraints evaluates the fully composed constraint set +// (including mixin-contributed constraints) against the snapshot evaluator. +// This runs after mergeMixins so that constraints moved from inline overlay +// definitions to mixins are still validated against the snapshot. +// +// If any mixin constraint fails, only the overlay chains that contributed +// mixins are excluded. Independent overlays (e.g., monitoring-hpa) that do +// not declare mixins are preserved. This maintains the existing maximal-leaf +// filtering behavior for non-mixin overlays. 
+func (s *MetadataStore) evaluateMixinConstraints( + mergedSpec *RecipeMetadataSpec, + evaluator ConstraintEvaluatorFunc, + mixinConstraintNames map[string]bool, + appliedOverlays []string, +) mixinEvalResult { + + if evaluator == nil || len(mixinConstraintNames) == 0 { + return mixinEvalResult{} + } + + var failedConstraints []ConstraintWarning + for _, constraint := range mergedSpec.Constraints { + if !mixinConstraintNames[constraint.Name] { + continue // already evaluated per-overlay + } + result := evaluator(constraint) + if !result.Passed { + // Identify the overlay that declared the mixin contributing this constraint. + declaringOverlay := s.findMixinDeclaringOverlay(constraint.Name, appliedOverlays) + warning := ConstraintWarning{ + Overlay: declaringOverlay, + Constraint: constraint.Name, + Expected: constraint.Value, + Actual: result.Actual, + Reason: "mixin constraint failed post-compose evaluation", + } + if result.Error != nil { + warning.Reason = result.Error.Error() + } + failedConstraints = append(failedConstraints, warning) + } + } + + if len(failedConstraints) == 0 { + return mixinEvalResult{} + } + + // Identify which applied overlays declared mixins and which are in their + // inheritance chains. The mixin-declaring leaves and their ancestors are + // removed from the rebuilt spec. Only the leaf candidates themselves are + // reported in ExcludedOverlays (ancestors are shared infrastructure, not + // rejected candidates). 
+ mixinDeclarers := make(map[string]bool) // leaf overlays that declared mixins + mixinChainMembers := make(map[string]bool) // declarers + their ancestors (for spec rebuild) + for _, name := range appliedOverlays { + if name == baseRecipeName { + continue + } + overlay, exists := s.Overlays[name] + if !exists { + continue + } + if len(overlay.Spec.Mixins) > 0 { + mixinDeclarers[name] = true + mixinChainMembers[name] = true + s.markAncestors(name, mixinChainMembers) + } + } + + var excluded []string + var surviving []string + surviving = append(surviving, baseRecipeName) + for _, name := range appliedOverlays { + if name == baseRecipeName { + continue + } + if mixinChainMembers[name] { + // Only report leaf candidates in ExcludedOverlays, not ancestors + if mixinDeclarers[name] { + excluded = append(excluded, name) + } + } else { + surviving = append(surviving, name) + } + } + + // Rebuild the spec from surviving overlays only + rebuiltSpec, _ := s.initBaseMergedSpec() + for _, name := range surviving { + if name == baseRecipeName { + continue + } + if overlay, exists := s.Overlays[name]; exists { + rebuiltSpec.Merge(&overlay.Spec) + } + } + + slog.Warn("post-compose constraint evaluation failed, excluding mixin-bearing chains", + "failed_constraints", len(failedConstraints), + "excluded", excluded, + "surviving", surviving) + + return mixinEvalResult{ + Failed: true, + ExcludedOverlays: excluded, + Warnings: failedConstraints, + Spec: &rebuiltSpec, + AppliedOverlays: surviving, + } +} + +// findMixinDeclaringOverlay finds the overlay that declared the mixin +// contributing a given constraint name. +func (s *MetadataStore) findMixinDeclaringOverlay(constraintName string, appliedOverlays []string) string { + // Walk applied overlays in reverse (most specific first) to find + // which overlay declared a mixin that contains this constraint. 
+ for i := len(appliedOverlays) - 1; i >= 0; i-- { + name := appliedOverlays[i] + overlay, exists := s.Overlays[name] + if !exists { + continue + } + for _, mixinName := range overlay.Spec.Mixins { + mixin, mExists := s.Mixins[mixinName] + if !mExists { + continue + } + for _, c := range mixin.Spec.Constraints { + if c.Name == constraintName { + return name + } + } + } + } + return "post-compose" // fallback if mapping not found +} + +// markAncestors walks the inheritance chain of an overlay and marks all ancestors. +func (s *MetadataStore) markAncestors(name string, marked map[string]bool) { + overlay, exists := s.Overlays[name] + if !exists || overlay.Spec.Base == "" { + return + } + ancestor := overlay.Spec.Base + if !marked[ancestor] { + marked[ancestor] = true + s.markAncestors(ancestor, marked) + } +} + // initBaseMergedSpec creates a copy of the base spec for overlay merging. func (s *MetadataStore) initBaseMergedSpec() (RecipeMetadataSpec, []string) { mergedSpec := RecipeMetadataSpec{ @@ -384,6 +652,11 @@ func (s *MetadataStore) BuildRecipeResult(ctx context.Context, criteria *Criteri return nil, err } + // Merge mixin fragments referenced by overlays in the chain + if _, err := s.mergeMixins(&mergedSpec); err != nil { + return nil, err + } + if len(appliedOverlays) <= 1 { slog.Warn("no environment-specific overlays matched, using base configuration only", "criteria", criteria.String(), @@ -452,6 +725,25 @@ func (s *MetadataStore) BuildRecipeResultWithEvaluator(ctx context.Context, crit return nil, err } + // Merge mixin fragments referenced by overlays in the chain. + mixinConstraintNames, err := s.mergeMixins(&mergedSpec) + if err != nil { + return nil, err + } + + // Evaluate mixin-contributed constraints against the snapshot. + // Per-overlay constraints were evaluated before merge (above), but mixin + // constraints are only present after mergeMixins. 
Without this post-compose + // evaluation, a mixin constraint (e.g., kernel >= 6.8 from os-ubuntu) could + // fail against the snapshot but the candidate would still be selected. + mixinResult := s.evaluateMixinConstraints(&mergedSpec, evaluator, mixinConstraintNames, appliedOverlays) + if mixinResult.Failed { + excludedOverlays = append(excludedOverlays, mixinResult.ExcludedOverlays...) + constraintWarnings = append(constraintWarnings, mixinResult.Warnings...) + mergedSpec = *mixinResult.Spec + appliedOverlays = mixinResult.AppliedOverlays + } + if len(excludedOverlays) > 0 { slog.Warn("some overlays were excluded due to constraint failures", "excluded", excludedOverlays, diff --git a/pkg/recipe/metadata_store_test.go b/pkg/recipe/metadata_store_test.go index 1533e08d6..760dff318 100644 --- a/pkg/recipe/metadata_store_test.go +++ b/pkg/recipe/metadata_store_test.go @@ -15,16 +15,20 @@ package recipe import ( + "bytes" "context" "fmt" "reflect" "testing" + + "gopkg.in/yaml.v3" ) const ( - testRecipeBase = "base" - testOverlayEKS = "eks" - testOverlayEKSTraning = "eks-training" + testRecipeBase = "base" + testOverlayEKS = "eks" + testK8sVersionConstant = "K8s.server.version" + testOverlayEKSTraning = "eks-training" ) func TestMetadataStore_GetValuesFile(t *testing.T) { @@ -676,3 +680,148 @@ func TestEvaluatorFailingLeafExcludesCandidate(t *testing.T) { t.Error("monitoring-hpa should remain applied (independent non-ancestor leaf)") } } + +// TestMixinConstraintFailureExcludesCandidate verifies that when a mixin-contributed +// constraint fails evaluation (e.g., os-ubuntu kernel constraint against a snapshot +// with kernel < 6.8), the composed candidate is excluded and the result falls back +// to base-only output. This tests the post-compose evaluation path in +// evaluateMixinConstraints. 
+func TestMixinConstraintFailureExcludesCandidate(t *testing.T) { + ctx := context.Background() + store, err := loadMetadataStore(ctx) + if err != nil { + t.Fatalf("failed to load metadata store: %v", err) + } + + // Query that resolves to a leaf using the os-ubuntu mixin + criteria := &Criteria{ + Service: CriteriaServiceEKS, + Accelerator: CriteriaAcceleratorH100, + Intent: CriteriaIntentTraining, + OS: CriteriaOSUbuntu, + } + + // Evaluator that passes K8s constraint but fails OS/kernel constraints + // (simulates a snapshot where OS matches but kernel is too old) + selectiveEvaluator := func(c Constraint) ConstraintEvalResult { + if c.Name == testK8sVersionConstant { + return ConstraintEvalResult{Passed: true, Actual: "v1.35.0"} + } + // Fail OS-related constraints (these come from the os-ubuntu mixin) + if c.Name == "OS.sysctl./proc/sys/kernel/osrelease" { + return ConstraintEvalResult{Passed: false, Actual: "5.15.0"} + } + // Pass everything else + return ConstraintEvalResult{Passed: true, Actual: "ok"} + } + + result, err := store.BuildRecipeResultWithEvaluator(ctx, criteria, selectiveEvaluator) + if err != nil { + t.Fatalf("BuildRecipeResultWithEvaluator failed: %v", err) + } + + // The mixin constraint (kernel >= 6.8) should have failed post-compose, + // causing a fallback to base-only output + if len(result.Metadata.ExcludedOverlays) == 0 { + t.Fatal("expected excluded overlays from mixin constraint failure") + } + + // Applied overlays should be base-only (plus monitoring-hpa which has no + // mixin constraints and passes evaluation independently) + applied := make(map[string]bool) + for _, name := range result.Metadata.AppliedOverlays { + applied[name] = true + } + if !applied[baseRecipeName] { + t.Error("base should always be applied") + } + + // The EKS chain overlays should NOT be in applied (they were part of the + // composed candidate that failed post-compose evaluation) + for _, name := range []string{"h100-eks-ubuntu-training", 
"h100-eks-training", "eks-training", "eks"} { + if applied[name] { + t.Errorf("%q should not be applied after mixin constraint failure", name) + } + } + + // Constraint warnings should include the failing mixin constraint + foundKernelWarning := false + for _, w := range result.Metadata.ConstraintWarnings { + if w.Constraint == "OS.sysctl./proc/sys/kernel/osrelease" { + foundKernelWarning = true + } + } + if !foundKernelWarning { + t.Error("expected constraint warning for OS.sysctl./proc/sys/kernel/osrelease from mixin") + } + + t.Logf("excluded: %v", result.Metadata.ExcludedOverlays) + t.Logf("applied: %v", result.Metadata.AppliedOverlays) + t.Logf("warnings: %d", len(result.Metadata.ConstraintWarnings)) +} + +// TestMalformedMixinRejected verifies that mixin files with forbidden fields +// (base, criteria, mixins, validation) are rejected at load time by +// KnownFields(true) strict parsing. +func TestMalformedMixinRejected(t *testing.T) { + tests := []struct { + name string + content string + }{ + { + name: "mixin with forbidden base field", + content: `kind: RecipeMixin +apiVersion: aicr.nvidia.com/v1alpha1 +metadata: + name: bad-mixin +spec: + base: eks + constraints: + - name: test + value: "1.0" +`, + }, + { + name: "mixin with forbidden criteria field", + content: `kind: RecipeMixin +apiVersion: aicr.nvidia.com/v1alpha1 +metadata: + name: bad-mixin +spec: + criteria: + service: eks + constraints: + - name: test + value: "1.0" +`, + }, + { + name: "mixin with forbidden validation field", + content: `kind: RecipeMixin +apiVersion: aicr.nvidia.com/v1alpha1 +metadata: + name: bad-mixin +spec: + validation: + deployment: + checks: + - operator-health + constraints: + - name: test + value: "1.0" +`, + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + var mixin RecipeMixin + decoder := yaml.NewDecoder(bytes.NewReader([]byte(tt.content))) + decoder.KnownFields(true) + err := decoder.Decode(&mixin) + if err == nil { + t.Error("expected 
error for mixin with forbidden fields, got nil") + } + }) + } +} diff --git a/pkg/recipe/metadata_test.go b/pkg/recipe/metadata_test.go index 2da4b21b0..e590eeb68 100644 --- a/pkg/recipe/metadata_test.go +++ b/pkg/recipe/metadata_test.go @@ -532,7 +532,7 @@ func TestMergeValidationConfig(t *testing.T) { base := RecipeMetadataSpec{ Validation: &ValidationConfig{ Readiness: &ValidationPhase{ - Constraints: []Constraint{{Name: "K8s.server.version", Value: ">= 1.30"}}, + Constraints: []Constraint{{Name: testK8sVersionConstant, Value: ">= 1.30"}}, }, Deployment: &ValidationPhase{ Timeout: "5m", @@ -560,7 +560,7 @@ func TestMergeValidationConfig(t *testing.T) { if base.Validation.Readiness == nil { t.Fatal("readiness should be preserved from base") } - if base.Validation.Readiness.Constraints[0].Name != "K8s.server.version" { + if base.Validation.Readiness.Constraints[0].Name != testK8sVersionConstant { t.Error("readiness constraints should be preserved from base") } if base.Validation.Deployment.Timeout != "10m" { diff --git a/recipes/README.md b/recipes/README.md index c77333b0d..811696216 100644 --- a/recipes/README.md +++ b/recipes/README.md @@ -24,6 +24,10 @@ recipes/ │ ├── gb200-eks-ubuntu-training.yaml # Full criteria leaf recipe │ ├── h100-eks-ubuntu-training-kubeflow.yaml # H100 + EKS + Ubuntu + training + Kubeflow │ └── h100-ubuntu-inference.yaml # H100 inference overlay +├── mixins/ # Composable mixin fragments +│ ├── os-ubuntu.yaml # Ubuntu OS constraints (shared by 12 leaf overlays) +│ ├── platform-inference.yaml # Inference gateway components (shared by 5 service-inference overlays) +│ └── platform-kubeflow.yaml # Kubeflow trainer component (shared by 4 leaf overlays) └── components/ # Component value configurations ├── cert-manager/ ├── nvidia-dra-driver-gpu/ @@ -42,10 +46,11 @@ Examples: ## Overview -The recipe system uses a **base-plus-overlay architecture**: +The recipe system uses a **base-plus-overlay architecture** with **mixin composition**: - 
**Base values** (`overlays/base.yaml`) provide default configurations - **Overlay values** (e.g., `eks-gb200-training.yaml`) provide environment-specific optimizations +- **Mixins** (`mixins/*.yaml`) provide shared fragments (OS constraints, platform components) that leaf overlays compose via `spec.mixins` instead of duplicating content - **Inline overrides** allow per-recipe customization without creating new files All files in this directory are embedded into the CLI binary and API server at compile time. diff --git a/recipes/data.go b/recipes/data.go index 62691b9c5..8456663d2 100644 --- a/recipes/data.go +++ b/recipes/data.go @@ -18,5 +18,5 @@ import ( "embed" ) -//go:embed overlays/*.yaml registry.yaml validators/catalog.yaml components/*/*.yaml components/*/manifests/*.yaml checks/*/*.yaml +//go:embed overlays/*.yaml mixins/*.yaml registry.yaml validators/catalog.yaml components/*/*.yaml components/*/manifests/*.yaml checks/*/*.yaml var FS embed.FS diff --git a/recipes/mixins/os-ubuntu.yaml b/recipes/mixins/os-ubuntu.yaml new file mode 100644 index 000000000..207002cd6 --- /dev/null +++ b/recipes/mixins/os-ubuntu.yaml @@ -0,0 +1,26 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +kind: RecipeMixin +apiVersion: aicr.nvidia.com/v1alpha1 +metadata: + name: os-ubuntu +spec: + constraints: + - name: OS.release.ID + value: ubuntu + - name: OS.release.VERSION_ID + value: "24.04" + - name: OS.sysctl./proc/sys/kernel/osrelease + value: ">= 6.8" diff --git a/recipes/mixins/platform-inference.yaml b/recipes/mixins/platform-inference.yaml new file mode 100644 index 000000000..5f66ee0ca --- /dev/null +++ b/recipes/mixins/platform-inference.yaml @@ -0,0 +1,39 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +kind: RecipeMixin +apiVersion: aicr.nvidia.com/v1alpha1 +metadata: + name: platform-inference +spec: + componentRefs: + - name: kgateway-crds + type: Helm + source: oci://cr.kgateway.dev/kgateway-dev/charts + version: v2.0.0 + valuesFile: components/kgateway-crds/values.yaml + manifestFiles: + - components/kgateway-crds/manifests/gateway-api-crds.yaml + - components/kgateway-crds/manifests/inference-extension-crds.yaml + + - name: kgateway + type: Helm + source: oci://cr.kgateway.dev/kgateway-dev/charts + version: v2.0.0 + valuesFile: components/kgateway/values.yaml + manifestFiles: + - components/kgateway/manifests/inference-gateway.yaml + dependencyRefs: + - kgateway-crds + - cert-manager diff --git a/recipes/mixins/platform-kubeflow.yaml b/recipes/mixins/platform-kubeflow.yaml new file mode 100644 index 000000000..4ca877aac --- /dev/null +++ b/recipes/mixins/platform-kubeflow.yaml @@ -0,0 +1,29 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +kind: RecipeMixin +apiVersion: aicr.nvidia.com/v1alpha1 +metadata: + name: platform-kubeflow +spec: + componentRefs: + - name: kubeflow-trainer + type: Helm + valuesFile: components/kubeflow-trainer/values.yaml + manifestFiles: + - components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml + dependencyRefs: + - cert-manager + - kube-prometheus-stack + - gpu-operator diff --git a/recipes/overlays/aks-inference.yaml b/recipes/overlays/aks-inference.yaml index 87839fea7..1274e0c43 100644 --- a/recipes/overlays/aks-inference.yaml +++ b/recipes/overlays/aks-inference.yaml @@ -25,29 +25,14 @@ spec: service: aks intent: inference + # Inference gateway components via mixin + mixins: + - platform-inference + # Inference specific constraints for AKS workloads # Constraint names use fully qualified measurement paths: {type}.{subtype}.{key} constraints: - name: K8s.server.version value: ">= 1.30" - componentRefs: - - name: kgateway-crds - type: Helm - source: oci://cr.kgateway.dev/kgateway-dev/charts - version: v2.0.0 - valuesFile: components/kgateway-crds/values.yaml - manifestFiles: - - components/kgateway-crds/manifests/gateway-api-crds.yaml - - components/kgateway-crds/manifests/inference-extension-crds.yaml - - - name: kgateway - type: Helm - source: oci://cr.kgateway.dev/kgateway-dev/charts - version: v2.0.0 - valuesFile: components/kgateway/values.yaml - manifestFiles: - - components/kgateway/manifests/inference-gateway.yaml - dependencyRefs: - - kgateway-crds - - cert-manager + componentRefs: [] diff --git a/recipes/overlays/eks-inference.yaml b/recipes/overlays/eks-inference.yaml index fe47fb06f..bf18b7052 100644 --- a/recipes/overlays/eks-inference.yaml +++ b/recipes/overlays/eks-inference.yaml @@ -25,29 +25,14 @@ spec: service: eks intent: inference + # Inference gateway components via mixin + mixins: + - platform-inference + # Inference specific constraints for EKS workloads # Constraint names use fully qualified measurement paths: 
{type}.{subtype}.{key} constraints: - name: K8s.server.version value: ">= 1.30" - componentRefs: - - name: kgateway-crds - type: Helm - source: oci://cr.kgateway.dev/kgateway-dev/charts - version: v2.0.0 - valuesFile: components/kgateway-crds/values.yaml - manifestFiles: - - components/kgateway-crds/manifests/gateway-api-crds.yaml - - components/kgateway-crds/manifests/inference-extension-crds.yaml - - - name: kgateway - type: Helm - source: oci://cr.kgateway.dev/kgateway-dev/charts - version: v2.0.0 - valuesFile: components/kgateway/values.yaml - manifestFiles: - - components/kgateway/manifests/inference-gateway.yaml - dependencyRefs: - - kgateway-crds - - cert-manager + componentRefs: [] diff --git a/recipes/overlays/gb200-eks-ubuntu-inference.yaml b/recipes/overlays/gb200-eks-ubuntu-inference.yaml index 2a174fa79..890c15304 100644 --- a/recipes/overlays/gb200-eks-ubuntu-inference.yaml +++ b/recipes/overlays/gb200-eks-ubuntu-inference.yaml @@ -28,16 +28,13 @@ spec: os: ubuntu intent: inference - # GB200 + Ubuntu specific constraints for inference workloads - # Constraint names use fully qualified measurement paths: {type}.{subtype}.{key} + # Ubuntu OS constraints via mixin (defined once in recipes/mixins/os-ubuntu.yaml) + mixins: + - os-ubuntu + + # GB200 + EKS specific constraints (not covered by mixin) constraints: - name: K8s.server.version value: ">= 1.32.4" - - name: OS.release.ID - value: ubuntu - - name: OS.release.VERSION_ID - value: "24.04" - - name: OS.sysctl./proc/sys/kernel/osrelease - value: ">= 6.8" componentRefs: [] diff --git a/recipes/overlays/gb200-eks-ubuntu-training-kubeflow.yaml b/recipes/overlays/gb200-eks-ubuntu-training-kubeflow.yaml index 8006fcdc8..1185de63b 100644 --- a/recipes/overlays/gb200-eks-ubuntu-training-kubeflow.yaml +++ b/recipes/overlays/gb200-eks-ubuntu-training-kubeflow.yaml @@ -29,26 +29,14 @@ spec: intent: training platform: kubeflow - # Constraints for GB200 on EKS with Ubuntu for Kubeflow training workloads - # 
Constraint names use fully qualified measurement paths: {type}.{subtype}.{key} + # Ubuntu OS constraints + Kubeflow platform via mixins + mixins: + - os-ubuntu + - platform-kubeflow + + # GB200 + EKS specific constraints (not covered by mixin) constraints: - name: K8s.server.version value: ">= 1.32.4" - - name: OS.release.ID - value: ubuntu - - name: OS.release.VERSION_ID - value: "24.04" - - name: OS.sysctl./proc/sys/kernel/osrelease - value: ">= 6.8" - # Kubeflow Training Operator for TrainJob support - componentRefs: - - name: kubeflow-trainer - type: Helm - valuesFile: components/kubeflow-trainer/values.yaml - manifestFiles: - - components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml - dependencyRefs: - - cert-manager - - kube-prometheus-stack - - gpu-operator + componentRefs: [] diff --git a/recipes/overlays/gb200-eks-ubuntu-training.yaml b/recipes/overlays/gb200-eks-ubuntu-training.yaml index 17888c54e..4491a35a3 100644 --- a/recipes/overlays/gb200-eks-ubuntu-training.yaml +++ b/recipes/overlays/gb200-eks-ubuntu-training.yaml @@ -28,17 +28,14 @@ spec: os: ubuntu intent: training - # Constraints for GB200 on EKS with Ubuntu for training workloads - # Constraint names use fully qualified measurement paths: {type}.{subtype}.{key} + # Ubuntu OS constraints via mixin (defined once in recipes/mixins/os-ubuntu.yaml) + mixins: + - os-ubuntu + + # GB200 + EKS specific constraints (not covered by mixin) constraints: - name: K8s.server.version value: ">= 1.32.4" - - name: OS.release.ID - value: ubuntu - - name: OS.release.VERSION_ID - value: "24.04" - - name: OS.sysctl./proc/sys/kernel/osrelease - value: ">= 6.8" componentRefs: [] diff --git a/recipes/overlays/gb200-oke-ubuntu-inference.yaml b/recipes/overlays/gb200-oke-ubuntu-inference.yaml index 5a9db0949..e8ccc64ec 100644 --- a/recipes/overlays/gb200-oke-ubuntu-inference.yaml +++ b/recipes/overlays/gb200-oke-ubuntu-inference.yaml @@ -28,16 +28,13 @@ spec: os: ubuntu intent: inference + # 
Ubuntu OS constraints via mixin + mixins: + - os-ubuntu + # GB200 + Ubuntu specific constraints for inference workloads - # Constraint names use fully qualified measurement paths: {type}.{subtype}.{key} constraints: - name: K8s.server.version value: ">= 1.32.4" - - name: OS.release.ID - value: ubuntu - - name: OS.release.VERSION_ID - value: "24.04" - - name: OS.sysctl./proc/sys/kernel/osrelease - value: ">= 6.8" componentRefs: [] diff --git a/recipes/overlays/gb200-oke-ubuntu-training-kubeflow.yaml b/recipes/overlays/gb200-oke-ubuntu-training-kubeflow.yaml index d33c4ada8..12d56bd11 100644 --- a/recipes/overlays/gb200-oke-ubuntu-training-kubeflow.yaml +++ b/recipes/overlays/gb200-oke-ubuntu-training-kubeflow.yaml @@ -29,26 +29,14 @@ spec: intent: training platform: kubeflow + # Ubuntu OS constraints + Kubeflow trainer via mixins + mixins: + - os-ubuntu + - platform-kubeflow + # Constraints for GB200 on OKE with Ubuntu for Kubeflow training workloads - # Constraint names use fully qualified measurement paths: {type}.{subtype}.{key} constraints: - name: K8s.server.version value: ">= 1.32.4" - - name: OS.release.ID - value: ubuntu - - name: OS.release.VERSION_ID - value: "24.04" - - name: OS.sysctl./proc/sys/kernel/osrelease - value: ">= 6.8" - # Kubeflow Training Operator for TrainJob support - componentRefs: - - name: kubeflow-trainer - type: Helm - valuesFile: components/kubeflow-trainer/values.yaml - manifestFiles: - - components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml - dependencyRefs: - - cert-manager - - kube-prometheus-stack - - gpu-operator + componentRefs: [] diff --git a/recipes/overlays/gb200-oke-ubuntu-training.yaml b/recipes/overlays/gb200-oke-ubuntu-training.yaml index be4038ff3..e3bc92665 100644 --- a/recipes/overlays/gb200-oke-ubuntu-training.yaml +++ b/recipes/overlays/gb200-oke-ubuntu-training.yaml @@ -28,17 +28,14 @@ spec: os: ubuntu intent: training + # Ubuntu OS constraints via mixin + mixins: + - os-ubuntu + # 
Constraints for GB200 on OKE with Ubuntu for training workloads - # Constraint names use fully qualified measurement paths: {type}.{subtype}.{key} constraints: - name: K8s.server.version value: ">= 1.32.4" - - name: OS.release.ID - value: ubuntu - - name: OS.release.VERSION_ID - value: "24.04" - - name: OS.sysctl./proc/sys/kernel/osrelease - value: ">= 6.8" componentRefs: [] diff --git a/recipes/overlays/gke-cos-inference.yaml b/recipes/overlays/gke-cos-inference.yaml index 56cb0b68f..744e9b00c 100644 --- a/recipes/overlays/gke-cos-inference.yaml +++ b/recipes/overlays/gke-cos-inference.yaml @@ -26,28 +26,13 @@ spec: os: cos intent: inference + # Inference gateway components via mixin + mixins: + - platform-inference + # Inference specific constraints for GKE workloads constraints: - name: K8s.server.version value: ">= 1.30" - componentRefs: - - name: kgateway-crds - type: Helm - source: oci://cr.kgateway.dev/kgateway-dev/charts - version: v2.0.0 - valuesFile: components/kgateway-crds/values.yaml - manifestFiles: - - components/kgateway-crds/manifests/gateway-api-crds.yaml - - components/kgateway-crds/manifests/inference-extension-crds.yaml - - - name: kgateway - type: Helm - source: oci://cr.kgateway.dev/kgateway-dev/charts - version: v2.0.0 - valuesFile: components/kgateway/values.yaml - manifestFiles: - - components/kgateway/manifests/inference-gateway.yaml - dependencyRefs: - - kgateway-crds - - cert-manager + componentRefs: [] diff --git a/recipes/overlays/h100-aks-ubuntu-inference.yaml b/recipes/overlays/h100-aks-ubuntu-inference.yaml index 27b3b9407..5ddc91a03 100644 --- a/recipes/overlays/h100-aks-ubuntu-inference.yaml +++ b/recipes/overlays/h100-aks-ubuntu-inference.yaml @@ -28,16 +28,13 @@ spec: os: ubuntu intent: inference - # H100 + Ubuntu specific constraints for inference workloads - # Constraint names use fully qualified measurement paths: {type}.{subtype}.{key} + # Ubuntu OS constraints via mixin (defined once in recipes/mixins/os-ubuntu.yaml) + 
mixins: + - os-ubuntu + + # H100 + AKS specific constraints (not covered by mixin) constraints: - name: K8s.server.version value: ">= 1.32.4" - - name: OS.release.ID - value: ubuntu - - name: OS.release.VERSION_ID - value: "24.04" - - name: OS.sysctl./proc/sys/kernel/osrelease - value: ">= 6.8" componentRefs: [] diff --git a/recipes/overlays/h100-aks-ubuntu-training-kubeflow.yaml b/recipes/overlays/h100-aks-ubuntu-training-kubeflow.yaml index 22ab282a2..b98e2833c 100644 --- a/recipes/overlays/h100-aks-ubuntu-training-kubeflow.yaml +++ b/recipes/overlays/h100-aks-ubuntu-training-kubeflow.yaml @@ -29,26 +29,14 @@ spec: intent: training platform: kubeflow - # Constraints for H100 on AKS with Ubuntu for Kubeflow training workloads - # Constraint names use fully qualified measurement paths: {type}.{subtype}.{key} + # Ubuntu OS constraints + Kubeflow platform via mixins + mixins: + - os-ubuntu + - platform-kubeflow + + # H100 + AKS specific constraints (not covered by mixin) constraints: - name: K8s.server.version value: ">= 1.32.4" - - name: OS.release.ID - value: ubuntu - - name: OS.release.VERSION_ID - value: "24.04" - - name: OS.sysctl./proc/sys/kernel/osrelease - value: ">= 6.8" - # Kubeflow Training Operator for TrainJob support - componentRefs: - - name: kubeflow-trainer - type: Helm - valuesFile: components/kubeflow-trainer/values.yaml - manifestFiles: - - components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml - dependencyRefs: - - cert-manager - - kube-prometheus-stack - - gpu-operator + componentRefs: [] diff --git a/recipes/overlays/h100-aks-ubuntu-training.yaml b/recipes/overlays/h100-aks-ubuntu-training.yaml index 7e5952669..820e6e30a 100644 --- a/recipes/overlays/h100-aks-ubuntu-training.yaml +++ b/recipes/overlays/h100-aks-ubuntu-training.yaml @@ -28,17 +28,14 @@ spec: os: ubuntu intent: training - # Constraints for H100 on AKS with Ubuntu for training workloads - # Constraint names use fully qualified measurement paths: 
{type}.{subtype}.{key} + # Ubuntu OS constraints via mixin (defined once in recipes/mixins/os-ubuntu.yaml) + mixins: + - os-ubuntu + + # H100 + AKS specific constraints (not covered by mixin) constraints: - name: K8s.server.version value: ">= 1.32.4" - - name: OS.release.ID - value: ubuntu - - name: OS.release.VERSION_ID - value: "24.04" - - name: OS.sysctl./proc/sys/kernel/osrelease - value: ">= 6.8" componentRefs: [] diff --git a/recipes/overlays/h100-eks-ubuntu-inference.yaml b/recipes/overlays/h100-eks-ubuntu-inference.yaml index f453bac9e..818302851 100644 --- a/recipes/overlays/h100-eks-ubuntu-inference.yaml +++ b/recipes/overlays/h100-eks-ubuntu-inference.yaml @@ -28,16 +28,13 @@ spec: os: ubuntu intent: inference - # H100 + Ubuntu specific constraints for inference workloads - # Constraint names use fully qualified measurement paths: {type}.{subtype}.{key} + # Ubuntu OS constraints via mixin (defined once in recipes/mixins/os-ubuntu.yaml) + mixins: + - os-ubuntu + + # H100 + EKS specific constraints (not covered by mixin) constraints: - name: K8s.server.version value: ">= 1.32.4" - - name: OS.release.ID - value: ubuntu - - name: OS.release.VERSION_ID - value: "24.04" - - name: OS.sysctl./proc/sys/kernel/osrelease - value: ">= 6.8" componentRefs: [] diff --git a/recipes/overlays/h100-eks-ubuntu-training-kubeflow.yaml b/recipes/overlays/h100-eks-ubuntu-training-kubeflow.yaml index 15265fcff..ade70040a 100644 --- a/recipes/overlays/h100-eks-ubuntu-training-kubeflow.yaml +++ b/recipes/overlays/h100-eks-ubuntu-training-kubeflow.yaml @@ -29,26 +29,14 @@ spec: intent: training platform: kubeflow - # Constraints for H100 on EKS with Ubuntu for Kubeflow training workloads - # Constraint names use fully qualified measurement paths: {type}.{subtype}.{key} + # Ubuntu OS constraints + Kubeflow platform via mixins + mixins: + - os-ubuntu + - platform-kubeflow + + # H100 + EKS specific constraints (not covered by mixin) constraints: - name: K8s.server.version value: ">= 
1.32.4" - - name: OS.release.ID - value: ubuntu - - name: OS.release.VERSION_ID - value: "24.04" - - name: OS.sysctl./proc/sys/kernel/osrelease - value: ">= 6.8" - # Kubeflow Training Operator for TrainJob support - componentRefs: - - name: kubeflow-trainer - type: Helm - valuesFile: components/kubeflow-trainer/values.yaml - manifestFiles: - - components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml - dependencyRefs: - - cert-manager - - kube-prometheus-stack - - gpu-operator + componentRefs: [] diff --git a/recipes/overlays/h100-eks-ubuntu-training.yaml b/recipes/overlays/h100-eks-ubuntu-training.yaml index 06491b54a..45c3a5bd2 100644 --- a/recipes/overlays/h100-eks-ubuntu-training.yaml +++ b/recipes/overlays/h100-eks-ubuntu-training.yaml @@ -28,17 +28,14 @@ spec: os: ubuntu intent: training - # Constraints for H100 on EKS with Ubuntu for training workloads - # Constraint names use fully qualified measurement paths: {type}.{subtype}.{key} + # Ubuntu OS constraints via mixin (defined once in recipes/mixins/os-ubuntu.yaml) + mixins: + - os-ubuntu + + # H100 + EKS specific constraints (not covered by mixin) constraints: - name: K8s.server.version value: ">= 1.32.4" - - name: OS.release.ID - value: ubuntu - - name: OS.release.VERSION_ID - value: "24.04" - - name: OS.sysctl./proc/sys/kernel/osrelease - value: ">= 6.8" componentRefs: [] diff --git a/recipes/overlays/kind-inference.yaml b/recipes/overlays/kind-inference.yaml index 66e10f050..58e76e090 100644 --- a/recipes/overlays/kind-inference.yaml +++ b/recipes/overlays/kind-inference.yaml @@ -25,29 +25,14 @@ spec: service: kind intent: inference + # Inference gateway components via mixin + mixins: + - platform-inference + # Inference specific constraints for kind workloads # Constraint names use fully qualified measurement paths: {type}.{subtype}.{key} constraints: - name: K8s.server.version value: ">= 1.30" - componentRefs: - - name: kgateway-crds - type: Helm - source: 
oci://cr.kgateway.dev/kgateway-dev/charts - version: v2.0.0 - valuesFile: components/kgateway-crds/values.yaml - manifestFiles: - - components/kgateway-crds/manifests/gateway-api-crds.yaml - - components/kgateway-crds/manifests/inference-extension-crds.yaml - - - name: kgateway - type: Helm - source: oci://cr.kgateway.dev/kgateway-dev/charts - version: v2.0.0 - valuesFile: components/kgateway/values.yaml - manifestFiles: - - components/kgateway/manifests/inference-gateway.yaml - dependencyRefs: - - kgateway-crds - - cert-manager + componentRefs: [] diff --git a/recipes/overlays/oke-inference.yaml b/recipes/overlays/oke-inference.yaml index f214053a4..6ddcc13f1 100644 --- a/recipes/overlays/oke-inference.yaml +++ b/recipes/overlays/oke-inference.yaml @@ -25,29 +25,13 @@ spec: service: oke intent: inference + # Inference gateway components via mixin + mixins: + - platform-inference + # Inference specific constraints for OKE workloads - # Constraint names use fully qualified measurement paths: {type}.{subtype}.{key} constraints: - name: K8s.server.version value: ">= 1.30" - componentRefs: - - name: kgateway-crds - type: Helm - source: oci://cr.kgateway.dev/kgateway-dev/charts - version: v2.0.0 - valuesFile: components/kgateway-crds/values.yaml - manifestFiles: - - components/kgateway-crds/manifests/gateway-api-crds.yaml - - components/kgateway-crds/manifests/inference-extension-crds.yaml - - - name: kgateway - type: Helm - source: oci://cr.kgateway.dev/kgateway-dev/charts - version: v2.0.0 - valuesFile: components/kgateway/values.yaml - manifestFiles: - - components/kgateway/manifests/inference-gateway.yaml - dependencyRefs: - - kgateway-crds - - cert-manager + componentRefs: [] diff --git a/site/docs/getting-started/index.md b/site/docs/getting-started/index.md index 9ea02983c..8fdf5c491 100644 --- a/site/docs/getting-started/index.md +++ b/site/docs/getting-started/index.md @@ -18,6 +18,7 @@ NVIDIA AI Cluster Runtime (AICR) is a suite of tooling designed to automate 
the | **Recipe** | A generated configuration recommendation containing component references, constraints, and deployment order. Created by `aicr recipe` based on criteria or snapshot analysis. | | **Criteria** | Query parameters that define the target environment: `service` (eks/gke/aks/oke), `accelerator` (h100/gb200/a100/l40), `intent` (training/inference), `os` (ubuntu/rhel/cos), `platform` (kubeflow), and `nodes`. | | **Overlay** | A recipe metadata file that extends the base recipe for specific environments. Overlays are matched against criteria using asymmetric matching. | +| **Mixin** | A composable recipe fragment (`kind: RecipeMixin`) that carries shared constraints or components (e.g., OS requirements, platform components). Leaf overlays reference mixins via `spec.mixins` to avoid duplicating cross-cutting content. | | **Bundle** | Deployment artifacts generated from a recipe: Helm values files, Kubernetes manifests, installation scripts, and checksums. | | **Bundler** | A plugin that generates bundle artifacts for a specific component (e.g., GPU Operator bundler, Network Operator bundler). | | **Deployer** | A plugin that transforms bundle artifacts into deployment-specific formats: `helm` (Helm per-component bundles, default), `argocd` (Applications with sync-waves). 
|
diff --git a/tools/sync-site-docs b/tools/sync-site-docs
index f53a0bc6d..8ebefa233 100755
--- a/tools/sync-site-docs
+++ b/tools/sync-site-docs
@@ -34,10 +34,10 @@ sync_section "${DOCS}/user" "${SITE_DOCS}/user"
 sync_section "${DOCS}/integrator" "${SITE_DOCS}/integrator"
 sync_section "${DOCS}/contributor" "${SITE_DOCS}/contributor"
 
-# Conformance: flatten cncf/ prefix (docs/conformance/cncf/ → site/docs/conformance/)
-mkdir -p "${SITE_DOCS}/conformance"
-rm -rf "${SITE_DOCS}/conformance/evidence" "${SITE_DOCS}/conformance/index.md"
-cp -R "${DOCS}/conformance/cncf/evidence" "${SITE_DOCS}/conformance/evidence"
-cp "${DOCS}/conformance/cncf/index.md" "${SITE_DOCS}/conformance/index.md"
+# Conformance: flatten cncf/ prefix (docs/conformance/cncf/ → site/docs/conformance/).
+# Copies the entire cncf/ subtree so new subdirectories (e.g., v1.35/) are picked up
+# automatically without maintaining an explicit file list.
+rm -rf "${SITE_DOCS}/conformance"
+cp -R "${DOCS}/conformance/cncf" "${SITE_DOCS}/conformance"
 
 echo "OK: synced docs/ → site/docs/"
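
Note on the `tools/sync-site-docs` hunk: the new two-line form depends on how `cp -R` treats its destination. If the destination directory does not exist, the source tree is copied as that directory, which gives the intended flattening. If it already exists, the source lands inside it as a nested `cncf/`. That is presumably why the `rm -rf` comes first. The sketch below reproduces both cases in a scratch directory. The `flatten_copy` helper, the `WORK` and `NESTED` paths, and the file names under `cncf/` are illustrative assumptions, not part of the script.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Scratch layout standing in for docs/ and site/docs/ (illustrative paths only).
WORK="$(mktemp -d)"
DOCS="${WORK}/docs"
SITE_DOCS="${WORK}/site/docs"
mkdir -p "${DOCS}/conformance/cncf/evidence" "${DOCS}/conformance/cncf/v1.35" "${SITE_DOCS}"
echo "index" > "${DOCS}/conformance/cncf/index.md"
echo "evidence" > "${DOCS}/conformance/cncf/evidence/report.md"
echo "new" > "${DOCS}/conformance/cncf/v1.35/index.md"

# Hypothetical helper mirroring the two lines added in the hunk.
flatten_copy() {
  # Remove the destination first so cp -R creates it from the source tree.
  rm -rf "${SITE_DOCS}/conformance"
  cp -R "${DOCS}/conformance/cncf" "${SITE_DOCS}/conformance"
}

flatten_copy
flatten_copy   # a second run must not nest cncf/ inside conformance/

# Counter-example: copying into an existing directory nests the source.
NESTED="${WORK}/nested"
mkdir -p "${NESTED}/conformance"
cp -R "${DOCS}/conformance/cncf" "${NESTED}/conformance"

# List the resulting files for both cases.
find "${SITE_DOCS}" "${NESTED}" -type f | sort
```

Running it lists `conformance/index.md`, `conformance/evidence/report.md`, and `conformance/v1.35/index.md` under the site tree, while the counter-example tree shows the same files under `conformance/cncf/`. One trade-off to keep in mind: removing the whole destination also deletes any file that exists only under `site/docs/conformance/`.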