Commits (27 total; changes shown from 20 commits)
bd4c9cc  Added new feature to execute workflows without a kubernetes cluster, … (maufrancom, Apr 2, 2026)
cfbba1d  Add local.py and update dependencies in BUILD file (maufrancom, Apr 3, 2026)
7f16d4f  Add GPU passthrough support in LocalExecutor (maufrancom, Apr 3, 2026)
1c71133  Update Docker command construction in LocalExecutor (maufrancom, Apr 3, 2026)
77fa2ab  Add resume functionality to local workflow execution (maufrancom, Apr 3, 2026)
c44dbfb  Update .gitignore to include .venv directory (maufrancom, Apr 3, 2026)
ffcdd72  Enhance local workflow execution with Docker command support (maufrancom, Apr 3, 2026)
0bf8bd5  Enhance documentation and comments in local execution modules (maufrancom, Apr 4, 2026)
a79bca8  Refactor file handling in LocalExecutor for UTF-8 encoding (maufrancom, Apr 4, 2026)
27424cd  Enhance error handling and update documentation in local execution mo… (maufrancom, Apr 4, 2026)
313466b  Update copyright line in test_local_executor.py to comply with pylint… (maufrancom, Apr 4, 2026)
9459fec  Add shared memory size support for GPU tasks in local execution (maufrancom, Apr 4, 2026)
e5adf29  Add tutorial specs filegroup and enhance local executor tests (maufrancom, Apr 4, 2026)
429aa84  Implement file path validation in LocalExecutor to prevent directory … (maufrancom, Apr 4, 2026)
d08bf9b  Clear GPU device specification in Docker arguments for LocalExecutor (maufrancom, Apr 4, 2026)
c59a65d  Refactor shared memory size handling in LocalExecutor (maufrancom, Apr 4, 2026)
b50140f  Add spec_includes library and integrate into local_executor (maufrancom, Apr 4, 2026)
72c1990  Merge branch 'NVIDIA:main' into mfranco/spec-composition (maufrancom, Apr 6, 2026)
688ecc6  Enhance local execution capabilities and documentation (maufrancom, Apr 6, 2026)
ba9b994  Refactor default-values handling in local execution (maufrancom, Apr 6, 2026)
34936e2  Add compose command to CLI for workflow spec resolution (maufrancom, Apr 6, 2026)
37e5aa4  Update project configuration and enhance .gitignore (maufrancom, Apr 6, 2026)
1f37243  Refactor and enhance code documentation in local execution (maufrancom, Apr 6, 2026)
16f410a  Enhance LocalExecutor to support Docker Compose for workflow execution (maufrancom, Apr 6, 2026)
524f7cb  Enhance LocalExecutor with additional volume support and command vali… (maufrancom, Apr 7, 2026)
17d73e0  Refactor task dependency handling and enhance test coverage (maufrancom, Apr 7, 2026)
e707b27  Enhance workflow specification handling with environment variable res… (maufrancom, Apr 7, 2026)
2 changes: 2 additions & 0 deletions .gitignore
@@ -29,3 +29,5 @@ docs/**/domain_config.js
.ruff_cache

.lycheecache

.venv/
3 changes: 3 additions & 0 deletions AGENTS.md
@@ -120,6 +120,8 @@ Entry point: `service/core/service.py`. Framework: FastAPI + Uvicorn + OpenTelem…
| `utils/job/` | `Task`, `FrontendJob`, `K8sObjectFactory`, `PodGroupTopologyBuilder` | Workflow execution framework. Task → K8s spec generation. Gang scheduling via PodGroup. Topology constraints. Backend job definitions. |
| `utils/connectors/` | `ClusterConnector`, `PostgresConnector`, `RedisConnector` | K8s API wrapper, PostgreSQL operations, Redis job queue management. |
| `utils/secret_manager/` | `SecretManager` | JWE-based secret encryption/decryption. MEK/UEK key management. |
| `utils/local_executor.py` | `LocalExecutor`, `run_workflow_locally` | Local Docker-based workflow execution. Runs workflow specs without Kubernetes by mapping tasks to `docker run` commands with volume mounts for data flow. Supports DAG scheduling, resume (`--from-step`), and GPU passthrough. |
| `utils/spec_includes.py` | `resolve_includes` | Helpers to resolve and merge workflow spec `includes` directives into fully composed specs. Supports recursive inclusion, cycle detection, deep-merging, and `default-values` variable expansion. |
| `utils/progress_check/` | — | Liveness/progress tracking for long-running services. |
| `utils/metrics/` | — | Prometheus metrics collection and export. |
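The `utils/spec_includes.py` entry above describes `resolve_includes` as recursive include expansion with cycle detection and deep-merging. A minimal sketch of that composition logic, assuming a `load_spec` callback and hypothetical helper names (the real `resolve_includes` signature is not shown in this diff):

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into a copy of `base`; override wins on conflicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_includes(spec: dict, load_spec, _seen=None) -> dict:
    """Expand a spec's `includes` list into one fully composed spec.

    Included specs are resolved recursively and merged in order; the
    including spec's own keys are merged last so they take precedence.
    """
    seen = _seen or set()
    result: dict = {}
    for name in spec.get('includes', []):
        if name in seen:
            raise ValueError(f'Include cycle detected at {name!r}')
        included = resolve_includes(load_spec(name), load_spec, seen | {name})
        result = deep_merge(result, included)
    own = {key: value for key, value in spec.items() if key != 'includes'}
    return deep_merge(result, own)
```

This is a sketch of the behaviour the table describes, not the shipped implementation; the actual module also handles `default-values` variable expansion, which is omitted here.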

@@ -139,6 +141,7 @@ Entry point: `cli.py` → `main_parser.py` (argparse). Subcommand modules:
| `login.py` | Authentication |
| `pool.py`, `resources.py`, `user.py`, `credential.py`, `access_token.py`, `bucket.py`, `task.py`, `version.py` | Supporting commands |
| `backend.py` | Backend cluster management |
| `local.py` | Local workflow execution via Docker (`osmo local run`) |

Features: Tab completion (shtab), response formatting (`formatters.py`), spec editor (`editor.py`), PyInstaller packaging (`cli_builder.py`, `packaging/`).

5 changes: 5 additions & 0 deletions cookbook/tutorials/BUILD
@@ -0,0 +1,5 @@
filegroup(
name = "tutorial_specs",
srcs = glob(["*.yaml"]),
visibility = ["//src/utils/tests:__pkg__"],
)
2 changes: 2 additions & 0 deletions src/cli/BUILD
@@ -37,6 +37,7 @@ osmo_py_library(
"dataset.py",
"editor.py",
"formatters.py",
"local.py",
"login.py",
"main_parser.py",
"pool.py",
@@ -73,6 +74,7 @@ osmo_py_library(
"//src/lib/utils:validation",
"//src/lib/utils:version",
"//src/lib/utils:workflow",
"//src/utils:local_executor",
],
)

99 changes: 99 additions & 0 deletions src/cli/local.py
@@ -0,0 +1,99 @@
# pylint: disable=line-too-long
"""
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

SPDX-License-Identifier: Apache-2.0
"""

import argparse
import sys

import shtab

from src.utils import local_executor


def setup_parser(parser: argparse._SubParsersAction):
"""Register the 'local' subcommand and its nested 'run' action with the CLI argument parser."""
local_parser = parser.add_parser(
'local',
help='Run workflows locally using Docker (no Kubernetes cluster required).')
subparsers = local_parser.add_subparsers(dest='command')
subparsers.required = True

run_parser = subparsers.add_parser(
'run',
help='Execute a workflow spec locally using Docker containers.')
run_parser.add_argument(
'-f', '--file',
required=True,
dest='workflow_file',
help='Path to the workflow YAML spec file.').complete = shtab.FILE
run_parser.add_argument(
'--work-dir',
dest='work_dir',
default=None,
help='Directory for task inputs/outputs. Defaults to a temporary directory.')
run_parser.add_argument(
'--keep',
action='store_true',
default=False,
help='Keep the work directory after execution (always kept on failure).')
run_parser.add_argument(
'--docker',
dest='docker_cmd',
default='docker',
help='Docker-compatible command to use (e.g. podman). Default: docker.')
run_parser.add_argument(
'--resume',
action='store_true',
default=False,
help='Resume a previous run, skipping tasks that already completed successfully. '
'Requires --work-dir pointing to the previous run directory.')
run_parser.add_argument(
'--from-step',
dest='from_step',
default=None,
help='Resume from a specific task, re-running it and all downstream tasks. '
'Tasks upstream of the specified step are skipped if they completed '
'successfully. Requires --work-dir pointing to the previous run directory.')
run_parser.add_argument(
'--shm-size',
dest='shm_size',
default=None,
help='Shared memory size for GPU containers (e.g. 16g, 32g). '
'Defaults to 16g for tasks that request GPUs. '
'PyTorch DataLoader workers require large shared memory.')
run_parser.set_defaults(func=_run_local)


def _run_local(service_client, args: argparse.Namespace):
"""Execute a workflow locally via Docker using the parsed CLI arguments."""
try:
success = local_executor.run_workflow_locally(
spec_path=args.workflow_file,
work_dir=args.work_dir,
keep_work_dir=args.keep,
resume=args.resume,
from_step=args.from_step,
docker_cmd=args.docker_cmd,
shm_size=args.shm_size,
)
except (ValueError, FileNotFoundError, PermissionError) as error:
print(f'Error: {error}', file=sys.stderr)
sys.exit(1)

if not success:
sys.exit(1)
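The `--resume` and `--from-step` flags registered above imply a task-selection rule: skip tasks that already completed, but re-run the named step and everything downstream of it. A minimal sketch under those semantics, using hypothetical names (`tasks_to_run`, `downstream_of`) that are not taken from `LocalExecutor`:

```python
from __future__ import annotations


def downstream_of(step: str, deps: dict[str, set[str]]) -> set[str]:
    """All tasks that transitively depend on `step`, plus `step` itself."""
    result = {step}
    changed = True
    while changed:
        changed = False
        for task, parents in deps.items():
            if task not in result and parents & result:
                result.add(task)
                changed = True
    return result


def tasks_to_run(deps: dict[str, set[str]], completed: set[str],
                 from_step: str | None = None) -> set[str]:
    """Select which tasks to execute on a (re)run.

    `deps` maps each task name to its set of upstream task names.
    Plain resume skips anything in `completed`; `from_step` additionally
    forces the named task and all of its downstream tasks to re-run.
    """
    if from_step is not None:
        forced = downstream_of(from_step, deps)
        return forced | (set(deps) - completed)
    return set(deps) - completed
```

A fixed-point loop over the dependency map is enough here because workflow specs form a DAG; the real executor's scheduling and persistence of completion state are not shown in this diff.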
4 changes: 3 additions & 1 deletion src/cli/main_parser.py
@@ -28,6 +28,7 @@
credential,
data,
dataset,
local,
login,
pool,
profile,
@@ -55,7 +56,8 @@
profile.setup_parser,
pool.setup_parser,
user.setup_parser,
config.setup_parser
config.setup_parser,
local.setup_parser,
)


21 changes: 21 additions & 0 deletions src/utils/BUILD
@@ -98,3 +98,24 @@ osmo_py_library(
],
visibility = ["//visibility:public"],
)

osmo_py_library(
name = "spec_includes",
srcs = ["spec_includes.py"],
deps = [
requirement("pyyaml"),
"//src/lib/utils:osmo_errors",
],
visibility = ["//visibility:public"],
)

osmo_py_library(
name = "local_executor",
srcs = ["local_executor.py"],
deps = [
requirement("pyyaml"),
"//src/utils:spec_includes",
"//src/utils/job",
],
visibility = ["//visibility:public"],
)