fix(onboard): auto-cleanup orphaned gateway container on re-onboard#1615
fix(onboard): auto-cleanup orphaned gateway container on re-onboard#1615yanyunl1991 wants to merge 3 commits intomainfrom
Conversation
…1582) When Ctrl+C interrupts gateway start, the Docker container openshell-cluster-nemoclaw keeps running but OpenShell no longer tracks it. Re-onboard then fails with "Port 8080 is not available". Add cleanupOrphanedGatewayContainer() that detects and removes the orphaned container. Called in two places: - preflight port check: auto-cleanup before failing on port conflict - destroyGateway(): ensure containers are removed alongside volumes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
📝 WalkthroughWalkthroughDetects and removes an orphaned Docker gateway container during preflight and gateway teardown. On port conflict, the preflight now attempts best-effort container cleanup and retries port availability before failing. Changes
Sequence Diagram(s)sequenceDiagram
participant CLI as Onboard CLI
participant Preflight as preflight()
participant Docker as Docker Engine
participant PortCheck as checkPortAvailable()
CLI->>Preflight: start preflight
Preflight->>PortCheck: check port
alt port blocked
Preflight->>Docker: query container openshell-cluster-${GATEWAY_NAME}
Docker-->>Preflight: container found
Preflight->>Docker: stop container (best-effort)
Preflight->>Docker: remove container (best-effort)
Preflight->>PortCheck: re-check port
alt port freed
PortCheck-->>Preflight: available
Preflight-->>CLI: continue
else still blocked
PortCheck-->>Preflight: still blocked
Preflight-->>CLI: fail preflight
end
else port available
PortCheck-->>Preflight: available
Preflight-->>CLI: continue
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
bin/lib/onboard.js (2)
1565-1566: Prefer a singledocker rm -ffor cleanup.This removes the container in one call and avoids a split stop/remove path.
♻️ Proposed simplification
- run(`docker stop ${shellQuote(containerName)} 2>/dev/null || true`, { ignoreError: true }); - run(`docker rm ${shellQuote(containerName)} 2>/dev/null || true`, { ignoreError: true }); + run(`docker rm -f ${shellQuote(containerName)} 2>/dev/null || true`, { ignoreError: true });🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 1565 - 1566, Replace the two-step stop+rm cleanup with a single forced remove call: update the invocation that uses run and shellQuote with containerName so it calls docker rm -f on the container (still preserving the 2>/dev/null || true redirection and the { ignoreError: true } option). Locate the two lines using run(`docker stop ${shellQuote(containerName)} ...`) and run(`docker rm ${shellQuote(containerName)} ...`) and consolidate them into one run(...) that performs docker rm -f ${shellQuote(containerName)} with the same error suppression and options.
1733-1741: Scope orphan-container cleanup retry to port 8080 only.The orphaned gateway container fix is specific to gateway-port conflicts; applying it to other ports (like 18789) is unnecessary and can trigger unrelated Docker operations.
🎯 Proposed scope tightening
- if (cleanupOrphanedGatewayContainer()) { + if (port === 8080 && cleanupOrphanedGatewayContainer()) { portCheck = await checkPortAvailable(port); if (portCheck.ok) { console.log(` ✓ Port ${port} available after orphaned container cleanup (${label})`); continue; } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 1733 - 1741, The orphaned gateway container cleanup logic currently runs for every port; restrict it to only run when handling the gateway port (8080) to avoid unnecessary Docker operations on other ports. In the block that calls cleanupOrphanedGatewayContainer() and then re-checks with checkPortAvailable(port), guard that entire sequence with a check like if (port === 8080) (or compare against your GATEWAY_PORT constant) so cleanupOrphanedGatewayContainer() and the subsequent port re-check only execute for port 8080; keep the existing functions cleanupOrphanedGatewayContainer(), checkPortAvailable(), and the console log (which references port and label) unchanged otherwise.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@bin/lib/onboard.js`:
- Around line 1565-1566: Replace the two-step stop+rm cleanup with a single
forced remove call: update the invocation that uses run and shellQuote with
containerName so it calls docker rm -f on the container (still preserving the
2>/dev/null || true redirection and the { ignoreError: true } option). Locate
the two lines using run(`docker stop ${shellQuote(containerName)} ...`) and
run(`docker rm ${shellQuote(containerName)} ...`) and consolidate them into one
run(...) that performs docker rm -f ${shellQuote(containerName)} with the same
error suppression and options.
- Around line 1733-1741: The orphaned gateway container cleanup logic currently
runs for every port; restrict it to only run when handling the gateway port
(8080) to avoid unnecessary Docker operations on other ports. In the block that
calls cleanupOrphanedGatewayContainer() and then re-checks with
checkPortAvailable(port), guard that entire sequence with a check like if (port
=== 8080) (or compare against your GATEWAY_PORT constant) so
cleanupOrphanedGatewayContainer() and the subsequent port re-check only execute
for port 8080; keep the existing functions cleanupOrphanedGatewayContainer(),
checkPortAvailable(), and the console log (which references port and label)
unchanged otherwise.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7d27dee1-3f8a-46e7-bdbb-3c03c66ee6f8
📒 Files selected for processing (1)
bin/lib/onboard.js
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
bin/lib/onboard.js (1)
1558-1569: Add behavioral tests for cleanup execution and retry path.The current gateway cleanup tests are static string checks; they don’t validate that
cleanupOrphanedGatewayContainer()actually triggersdocker stop/rmand enables the secondcheckPortAvailable()pass in preflight.Also applies to: 1728-1742
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 1558 - 1569, Add behavioral tests that exercise cleanupOrphanedGatewayContainer() and the preflight retry path instead of just checking strings: mock/stub runCapture to return a non-empty container id to force the cleanup branch, and spy/mock run to verify docker stop and docker rm are invoked for the containerName used in cleanupOrphanedGatewayContainer(); additionally, test the preflight flow that calls checkPortAvailable() twice by simulating the first check failing and the second succeeding (or by advancing the mocked state after cleanup) so the code path that rechecks the port is executed; reference the functions cleanupOrphanedGatewayContainer, runCapture, run, and checkPortAvailable when wiring the mocks and assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/onboard.js`:
- Around line 1560-1563: The current use of runCapture with `docker ps -aq
--filter "name=^${containerName}$"` is unreliable because docker ps does
substring matching; replace that call with a deterministic inspect call: invoke
runCapture(`docker container inspect ${containerName}`, { ignoreError: true })
(or with a --format to return ID) and treat a successful inspect (non-empty
output / zero exit) as the container existing; update the variable `existing`
(and any downstream logic that uses it) to rely on the inspect result instead of
the substring-matching docker ps output while preserving error-ignore behavior
around `runCapture` and using the `containerName` identifier.
---
Nitpick comments:
In `@bin/lib/onboard.js`:
- Around line 1558-1569: Add behavioral tests that exercise
cleanupOrphanedGatewayContainer() and the preflight retry path instead of just
checking strings: mock/stub runCapture to return a non-empty container id to
force the cleanup branch, and spy/mock run to verify docker stop and docker rm
are invoked for the containerName used in cleanupOrphanedGatewayContainer();
additionally, test the preflight flow that calls checkPortAvailable() twice by
simulating the first check failing and the second succeeding (or by advancing
the mocked state after cleanup) so the code path that rechecks the port is
executed; reference the functions cleanupOrphanedGatewayContainer, runCapture,
run, and checkPortAvailable when wiring the mocks and assertions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d36b60a6-8aa1-4f7c-87f5-3867fe3b65ae
📒 Files selected for processing (1)
bin/lib/onboard.js
| const existing = runCapture( | ||
| `docker ps -aq --filter "name=^${containerName}$"`, | ||
| { ignoreError: true }, | ||
| ).trim(); |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
In Docker CLI, for docker ps --filter "name=...", what is the documented exact-match pattern behavior? Is ^container-name$always reliable, or is^/container-name$(ordocker container inspect ) recommended for exact name checks?
💡 Result:
- Documented behavior: For
docker ps/docker container ls, thenamefilter “matches on all or part of a container’s name” (i.e., substring match). Docker’s CLI docs do not document an “exact match” mode forname=. [1] - Regex anchoring (
^...$) status: Docker’s general filtering docs say some filters support regex, but you must rely on each command’s own reference to know what’s supported—and thedocker psdocs for thenamefilter only describe substring matching, not regex/exact semantics. [2][1]
⇒ So--filter "name=^container-name$"is not documented as a reliable exact-match contract. - Why people use
^/name$: In the Engine API, container “Names” are represented with a leading slash (e.g.,"/coolName"). [4] This is why, in practice, exact-match regex attempts often use^/container-name$—but that’s still implementation-dependent, not the CLI’s documented promise. [4][1] - Recommended exact-name existence check: Use
docker inspect --type=container <name>(ordocker container inspect <name>).docker inspectis documented to match objects by ID or name (and--typeavoids ambiguity with non-container objects). [3]
Bottom line: The documented docker ps --filter name=... behavior is substring matching, not exact matching; for a dependable exact-name check, prefer docker (container) inspect <name> (optionally --type=container). [1][3]
Sources: [1] Docker CLI docker container ls (name filter) docs; [2] Docker “Filter commands” docs; [3] Docker CLI docker inspect docs; [4] Docker Engine API swagger (ContainerSummary.Names examples include leading /).
Replace docker ps substring filter with docker container inspect for reliable exact-name matching (Line 1561).
docker ps --filter "name=..." is documented as substring matching only; regex anchors like ^...$ are not part of the CLI's documented contract and can fail to detect containers. Use docker container inspect <name> for deterministic exact-name/ID lookups.
Suggested hardening
function cleanupOrphanedGatewayContainer() {
const containerName = `openshell-cluster-${GATEWAY_NAME}`;
- const existing = runCapture(
- `docker ps -aq --filter "name=^${containerName}$"`,
- { ignoreError: true },
- ).trim();
- if (!existing) return false;
+ const exists = run(`docker container inspect ${shellQuote(containerName)} >/dev/null 2>&1`, {
+ ignoreError: true,
+ });
+ if (exists.status !== 0) return false;
console.log(` Cleaning up orphaned gateway container (${containerName})...`);
run(`docker stop ${shellQuote(containerName)} 2>/dev/null || true`, { ignoreError: true });
run(`docker rm ${shellQuote(containerName)} 2>/dev/null || true`, { ignoreError: true });
return true;
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bin/lib/onboard.js` around lines 1560 - 1563, The current use of runCapture
with `docker ps -aq --filter "name=^${containerName}$"` is unreliable because
docker ps does substring matching; replace that call with a deterministic
inspect call: invoke runCapture(`docker container inspect ${containerName}`, {
ignoreError: true }) (or with a --format to return ID) and treat a successful
inspect (non-empty output / zero exit) as the container existing; update the
variable `existing` (and any downstream logic that uses it) to rely on the
inspect result instead of the substring-matching docker ps output while
preserving error-ignore behavior around `runCapture` and using the
`containerName` identifier.
ericksoa
left a comment
There was a problem hiding this comment.
Nice fix — this addresses a real user pain point cleanly. Approving as-is.
A few suggestions for a follow-up PR:
-
docker ps --filter "name=^...$"vsdocker container inspect— The^...$regex anchoring on Docker's name filter works in practice but isn't documented behavior. Woulddocker container inspect openshell-cluster-nemoclaw >/dev/null 2>&1be a more future-proof way to check for the container? Just something to consider. -
Scope cleanup to port 8080 only — Right now the cleanup fires for every port that fails its availability check, but the orphaned gateway container only binds 8080. If 18789 is blocked by something else, the "Cleaning up orphaned gateway container..." message could mislead users. Scoping the call to
port === 8080would make the output clearer. -
Add a test for the new cleanup path —
gateway-cleanup.test.jscurrently only does static string matching. Could a follow-up add an assertion forcleanupOrphanedGatewayContainer, or better yet, a behavioral test that mocksrunCapture/run? -
Nit:
docker stop+docker rm→docker rm -f? — Is the graceful SIGTERM viastopintentional for the orphaned container, or woulddocker rm -fsimplify things?
Dismissing — CI checks are failing (checks and dco-check). Please fix before re-requesting review.
ericksoa
left a comment
There was a problem hiding this comment.
CI is failing — please fix before re-requesting review:
- dco-check: Commits need
--signofffor DCO compliance. - checks: Test suite is failing.
The code feedback from my earlier review still applies — happy to re-approve once CI is green.
|
Closing as duplicate of #1567, which has been merged. Thank you for the contribution @yanyunl1991 — both approaches addressed #1582. |
Summary
Fixes #1582
Ctrl+C during
nemoclaw onboardgateway start leaves an orphanedopenshell-cluster-nemoclawDocker container running. The nextnemoclaw onboardfails with "Port 8080 is not available" becauseOpenShell no longer tracks the container but it still holds the port.
cleanupOrphanedGatewayContainer()that detects and removesorphaned
openshell-cluster-nemoclawDocker containersdestroyGateway(): also clean up containers, not just volumesTest plan
test-double-onboard.sh,test-port8080-conflict.sh)Reproduction log
Before fix — re-onboard fails with port conflict:
After fix — auto-cleanup and continue:
Summary by CodeRabbit