fix(preflight): distinguish docker socket permission from daemon-down #1599

Closed
latenighthackathon wants to merge 1 commit into NVIDIA:main from latenighthackathon:fix/preflight-docker-socket-permission

@latenighthackathon
Contributor

@latenighthackathon latenighthackathon commented Apr 8, 2026

Summary

Problem

On DGX Spark (and any Linux host where the user isn't in the docker group), nemoclaw onboard reports that Docker is not reachable and suggests sudo systemctl start docker, even when docker.service is already active and running. The current preflight already collects dockerServiceActive via systemctl is-active docker but never uses it in the remediation logic — so a socket permission error and a daemon-down error get the same misleading message.

The reporter verified that systemctl status docker returned active (running) and was still told to start docker. This sends users down a dead end because re-running systemctl start docker does nothing when it's already up.
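
The two failure modes described above can be told apart with the two probes the PR mentions, `docker info` and `systemctl is-active docker`. The following is a minimal sketch of that idea, not the code in src/lib/preflight.ts: `CommandRunner`, `DockerState`, and `classifyDockerState` are hypothetical names introduced here for illustration, and the runner is injected so the logic can be exercised without a real Docker host.

```typescript
// Hypothetical sketch: these names are illustrative and are not NemoClaw's API.
type CommandResult = { ok: boolean; stdout: string };
type CommandRunner = (cmd: string) => CommandResult;
type DockerState = "reachable" | "socket_permission_likely" | "daemon_down_or_unknown";

function classifyDockerState(run: CommandRunner, platform: string): DockerState {
  // `docker info` only succeeds when the client can talk to the daemon.
  if (run("docker info").ok) {
    return "reachable";
  }
  // systemd is only consulted on Linux; other platforms fall through.
  if (platform === "linux") {
    const service = run("systemctl is-active docker");
    // `systemctl is-active` prints "active" and exits 0 for a running unit.
    if (service.ok && service.stdout.trim() === "active") {
      // Daemon is up but unreachable: most likely a socket permission problem.
      return "socket_permission_likely";
    }
  }
  return "daemon_down_or_unknown";
}

// Usage with a fake runner that mimics the reported DGX Spark situation:
// the service is active, yet `docker info` fails for the current user.
const sparkLikeHost: CommandRunner = (cmd) =>
  cmd === "systemctl is-active docker"
    ? { ok: true, stdout: "active\n" }
    : { ok: false, stdout: "" };

console.log(classifyDockerState(sparkLikeHost, "linux")); // prints "socket_permission_likely"
```

In a real preflight the runner would wrap an actual process call (for example Node's `child_process.spawnSync`); that wiring is left out here because it is an assumption about how NemoClaw invokes commands.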

Fix

In planHostRemediation, when dockerInstalled && !dockerReachable && dockerServiceActive === true && platform === "linux", emit a new remediation:

  • id: fix_docker_socket_permission
  • reason: "Docker daemon is running but NemoClaw could not talk to the Docker socket. This usually means your user does not have permission to access /var/run/docker.sock."
  • commands: sudo usermod -aG docker $USER → newgrp docker (or relogin) → docker info → nemoclaw onboard

The original start_docker remediation is preserved for the daemon-actually-down case.
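
The branching described above could look roughly like the sketch below. It is an illustration under stated assumptions rather than the actual diff: the `HostAssessment` and `RemediationAction` shapes and the function name `planDockerRemediation` are hypothetical, the `start_docker` reason text is invented for the example, and the real `planHostRemediation` covers more than Docker. The field names, action ids, socket-permission reason, and commands are taken from the description.

```typescript
// Hypothetical shapes: field names follow the PR description, the rest is assumed.
interface HostAssessment {
  platform: string;
  dockerInstalled: boolean;
  dockerReachable: boolean;
  dockerServiceActive: boolean | null; // null when the systemd state is unknown
}

interface RemediationAction {
  id: string;
  reason: string;
  commands: string[];
}

function planDockerRemediation(a: HostAssessment): RemediationAction[] {
  // Nothing to remediate here if Docker is missing (a different action) or healthy.
  if (!a.dockerInstalled || a.dockerReachable) {
    return [];
  }
  // New branch: the daemon is running, so starting it again cannot help.
  if (a.platform === "linux" && a.dockerServiceActive === true) {
    return [
      {
        id: "fix_docker_socket_permission",
        reason:
          "Docker daemon is running but NemoClaw could not talk to the Docker socket. " +
          "This usually means your user does not have permission to access /var/run/docker.sock.",
        commands: [
          "sudo usermod -aG docker $USER",
          "newgrp docker", // or log out and back in
          "docker info",
          "nemoclaw onboard",
        ],
      },
    ];
  }
  // Preserved branch for the daemon-actually-down case (reason text is assumed).
  return [
    {
      id: "start_docker",
      reason: "Docker is installed but not reachable.",
      commands: ["sudo systemctl start docker"],
    },
  ];
}
```

The strict `=== true` comparison matters in this sketch: an unknown service state (`null`) falls through to `start_docker` rather than being treated as a permission problem.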

Test plan

  • New regression test asserts:
    • The new action id is returned for the active-but-unreachable case
    • sudo usermod -aG docker $USER is in the commands
    • The misleading sudo systemctl start docker command is NOT present
    • The reason text mentions the socket so users understand the root cause
  • All 36 preflight tests pass (npx vitest run src/lib/preflight.test.ts)
  • Existing start_docker test still passes — the daemon-down branch is unchanged
  • Prettier clean, ESLint clean
  • Signed commit + DCO sign-off
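
The four regression assertions listed above can be expressed as a small checker over the planner's output. This is a framework-free sketch, not the vitest test in src/lib/preflight.test.ts: `checkSocketPermissionRemediation` and the `PlannedAction` shape are hypothetical, and the sample action is hand-written for illustration.

```typescript
// Hypothetical shape for a planned remediation action.
interface PlannedAction {
  id: string;
  reason: string;
  commands: string[];
}

// Returns a list of human-readable failures; an empty list means all checks pass.
function checkSocketPermissionRemediation(actions: PlannedAction[]): string[] {
  const failures: string[] = [];
  const matches = actions.filter((a) => a.id === "fix_docker_socket_permission");

  if (matches.length === 0) {
    failures.push("missing action id fix_docker_socket_permission");
  } else {
    const action = matches[0];
    if (action.commands.indexOf("sudo usermod -aG docker $USER") === -1) {
      failures.push("usermod command is not in the commands");
    }
    if (!/socket/i.test(action.reason)) {
      failures.push("reason does not mention the socket");
    }
  }

  // The misleading command must not appear in any returned action.
  const suggestsStart = actions.some(
    (a) => a.commands.indexOf("sudo systemctl start docker") !== -1,
  );
  if (suggestsStart) {
    failures.push("misleading 'sudo systemctl start docker' is present");
  }
  return failures;
}

// Hand-written sample output for the active-but-unreachable case.
const sampleActions: PlannedAction[] = [
  {
    id: "fix_docker_socket_permission",
    reason: "Docker daemon is running but NemoClaw could not talk to the Docker socket.",
    commands: ["sudo usermod -aG docker $USER", "newgrp docker", "docker info", "nemoclaw onboard"],
  },
];

console.log(checkSocketPermissionRemediation(sampleActions).length); // prints 0
```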

Closes #1574

Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>

Summary by CodeRabbit

  • Bug Fixes

    • Improved Docker troubleshooting: when the daemon is already running but inaccessible on Linux, the system now correctly guides users to grant socket permissions instead of attempting to restart Docker.
  • Tests

    • Added test coverage for Docker socket permission remediation scenarios.

When 'docker info' fails but 'systemctl is-active docker' reports the
service is active, the daemon is running and the real problem is that
the current user cannot access /var/run/docker.sock. The previous
remediation always suggested 'sudo systemctl start docker', which is
misleading and frustrating because the daemon is already running.

This was reported on DGX Spark in NVIDIA#1574 — the user verified docker.service
was active, then ran 'nemoclaw onboard' and got the wrong remediation.

When the assessment shows docker is installed, unreachable, AND systemd
reports the service is active, emit a new 'fix_docker_socket_permission'
remediation that points users at the docker group + newgrp/relogin
workflow instead.

Adds a regression test asserting:
- The new action id is returned for the active-but-unreachable case
- The misleading 'sudo systemctl start docker' command is NOT present
- The reason mentions the socket so users understand the root cause

Closes NVIDIA#1574

Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
@coderabbitai
Contributor

coderabbitai bot commented Apr 8, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 971246a8-219c-4f74-8e08-42be7eeb7fd0

📥 Commits

Reviewing files that changed from the base of the PR and between adbea05 and 73bcb86.

📒 Files selected for processing (2)
  • src/lib/preflight.test.ts
  • src/lib/preflight.ts

📝 Walkthrough

Updated the Docker preflight remediation logic to conditionally handle scenarios where Docker is installed and its service is active on Linux but unreachable to the current user. Instead of suggesting to start Docker, it now recommends fixing socket permissions by adding the user to the docker group. Added a test case validating this new remediation path.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Docker Remediation Logic (src/lib/preflight.ts, src/lib/preflight.test.ts) | Modified planHostRemediation to conditionally branch when Docker is unreachable: on Linux with active dockerService, now suggests fix_docker_socket_permission (user group access) instead of start_docker. Otherwise preserves existing "Start Docker" behavior. Test case added to cover the active-service-but-unreachable scenario with assertions on remediation action ID, commands, and reason field. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 A socket permission blocked the way,
But the Docker daemon was already at play!
Now the fix guides users to their proper place,
Adding them to the group with measured grace. 🔐

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(preflight): distinguish docker socket permission from daemon-down' clearly and concisely summarizes the main change: differentiating between Docker socket permission issues and daemon availability. |
| Linked Issues check | ✅ Passed | The PR implementation meets all coding requirements from #1574: detects daemon-active-but-unreachable scenario, returns new fix_docker_socket_permission action, preserves start_docker for actual daemon-down cases, includes appropriate usermod and newgrp commands, and adds regression test. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to the linked issue: test case added for the new scenario, remediation logic updated for Docker socket permission handling, and no unrelated modifications present. |


@wscurran wscurran added the Platform: DGX Spark, NemoClaw CLI, Docker, and fix labels Apr 8, 2026
@wscurran
Contributor

wscurran commented Apr 8, 2026

✨ Thanks for submitting this fix, which proposes a way to distinguish Docker socket permission errors from daemon-down states during preflight checks. This improves the onboarding experience on DGX Spark and other Linux hosts where users may not be in the docker group.


Possibly related open issues:

@latenighthackathon
Contributor Author

@ericksoa landed essentially the same fix in #1614 yesterday (closing #50) — the docker_group_permission remediation action on main covers exactly this case (Linux + dockerInstalled + !dockerReachable + dockerServiceActive). Closing in favor of that.

Thanks @ericksoa for the cleaner shape: the kind: "sudo" annotation is a nicer fit than my manual one, since the remediation is a one-shot usermod rather than a multi-step user-action sequence.

For @zNeill on the original report (#1574): the fix is already on main and will land in the next release. On Linux, when systemd reports docker.service active but docker info still fails, the wizard now prints "Add user to docker group" with sudo usermod -aG docker $USER && newgrp docker instead of the misleading "Start Docker" message.

Cheers!

@latenighthackathon latenighthackathon deleted the fix/preflight-docker-socket-permission branch April 9, 2026 01:49

Labels

Docker, fix, NemoClaw CLI, Platform: DGX Spark

Development

Successfully merging this pull request may close these issues.

[Nemoclaw][Spark] Spark: Docker preflight reports 'start docker' even when docker.service is running

3 participants