
fix: enable multi-GPU DDP training in Jupyter notebooks#928

Open
mfazrinizar wants to merge 15 commits into roboflow:develop from mfazrinizar:fix/ddp-notebook-cuda-init

Conversation

@mfazrinizar

What does this PR do?

Fixes multi-GPU DDP training (strategy="ddp_notebook" and strategy="ddp_spawn") which was completely broken in Jupyter (e.g. Kaggle) notebook environments. The fix addresses two layers of issues:

  1. CUDA early initialization: RFDETRBase() eagerly moved the model to CUDA during __init__(), and module-level torch.cuda.is_available() in config.py created a CUDA driver context at import time, making multi-process training impossible.

  2. OpenMP thread pool corruption after fork: Even after fixing CUDA init, PyTorch's OpenMP thread pool (created during model construction) cannot survive fork(). The worker threads become zombie handles, causing SIGABRT: Invalid thread pool! when the autograd engine initializes in forked children. Fixed by transparently replacing fork-based DDP with a spawn-based strategy.
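The deferred-placement idea behind the first fix can be sketched as follows. This is a minimal illustration of the pattern, not the PR's actual `_ensure_model_on_device()` implementation: construction touches only CPU tensors, and the move to the target device happens lazily on first use.

```python
import torch
import torch.nn as nn


class LazyDeviceModel:
    """Sketch of deferred device placement: no CUDA context is created
    at construction time, so the process stays safe to fork/spawn."""

    def __init__(self) -> None:
        # Construction stays on CPU; nothing here initializes CUDA.
        self.model = nn.Linear(4, 2)

    def _ensure_model_on_device(self, device: str) -> None:
        # No-op once the model already lives on the target device.
        target = torch.device(device)
        if next(self.model.parameters()).device != target:
            self.model.to(target)

    def predict(self, x: torch.Tensor, device: str = "cpu") -> torch.Tensor:
        self._ensure_model_on_device(device)
        with torch.no_grad():
            return self.model(x.to(device))


m = LazyDeviceModel()
out = m.predict(torch.randn(1, 4))
```

With this shape, `RFDETRBase()`-style construction never forces a CUDA context, and multi-process launchers can safely start workers before any device placement occurs.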

Related Issue(s): Fixes #923

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Testing

  • I have tested this change locally
  • I have added/updated tests for this change

Test details:

Unit tests (101 pass locally)

  • test_build_trainer.py: 52 tests covering precision resolution, strategy selection, ddp_notebook→spawn mapping, EMA guards, logger wiring
  • test_module_data.py: 49 tests including test_ddp_notebook_preserves_num_workers and test_other_strategy_preserves_num_workers

Integration test (Kaggle T4 x2)

Validated on Kaggle GPU T4 x2 accelerator (Python 3.12, PyTorch 2.10.0+cu128, PTL 2.6.1):

| Test | Result | Time |
| --- | --- | --- |
| CUDA not initialized after RFDETRBase() | ✅ PASS | |
| Model weights on CPU after construction | ✅ PASS | |
| strategy="ddp_notebook" training (3 epochs, 2×T4) | ✅ PASS | 84.3s |
| strategy="ddp_spawn" training (3 epochs, 2×T4) | ✅ PASS | 77.4s |
| Inference after DDP training | ✅ PASS | |

What This Fixes

| Scenario | Before | After |
| --- | --- | --- |
| model.train(devices=2, strategy="ddp_notebook") in notebook | ❌ CUDA re-init / SIGABRT | ✅ Works |
| model.train(devices=2, strategy="ddp_spawn") in notebook | ❌ CUDA re-init / MisconfigurationException | ✅ Works |
| model.train(devices=1) | ✅ Works | ✅ Works (no regression) |
| model.predict(img) | ✅ Works | ✅ Works (lazy device placement) |
| model.train() → model.predict(img) | ✅ Works | ✅ Works |
| model.export_onnx() / model.optimize_for_inference() | ✅ Works | ✅ Works |

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code where necessary, particularly in hard-to-understand areas
  • My changes generate no new warnings or errors
  • I have updated the documentation accordingly (if applicable)

Additional Context

The ddp_notebook → spawn conversion is transparent to users: they continue passing strategy="ddp_notebook" (or strategy="ddp_spawn") and training just works. An INFO log message is emitted:

[INFO] rf-detr - ddp_notebook → spawn-based DDP to avoid OpenMP thread pool corruption after fork.
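The user-facing part of that remapping can be sketched as a small resolver. This is a simplified illustration (the real PR substitutes a spawn-based strategy object, not just a string), and `resolve_ddp_strategy` is a hypothetical name for this sketch:

```python
import logging

logger = logging.getLogger("rf-detr")


def resolve_ddp_strategy(strategy: str) -> str:
    """Map the user-supplied strategy name to a fork-safe equivalent.

    Hypothetical sketch: ddp_notebook is fork-based and breaks on
    OpenMP thread pools, so it is transparently redirected to spawn.
    """
    if strategy == "ddp_notebook":
        logger.info(
            "ddp_notebook -> spawn-based DDP to avoid OpenMP "
            "thread pool corruption after fork."
        )
        return "ddp_spawn"
    return strategy
```

Users keep writing `strategy="ddp_notebook"`; only the log line reveals the substitution.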

The find_unused_parameters=True flag is required because RF-DETR's architecture has parameters in the detection head that may not contribute to every loss term (e.g. encoder-only auxiliary losses).

Technical Details

Two layers of CUDA initialization that had to be fixed

  1. Module-level (config.py): torch.cuda.is_available() creates a CUDA driver context at import time. Fixed with torch.accelerator.current_accelerator() which queries NVML without creating a primary context.

  2. Model construction (inference.py): nn_model.to("cuda") fully initializes the CUDA runtime. Fixed by keeping the model on CPU and deferring .to(device) to first predict()/export()/batch_size="auto" call via _ensure_model_on_device().
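The first fix can be sketched as a small availability check. This is an illustrative helper (the name `accelerator_available` is mine, not the PR's); it assumes `torch.accelerator` exists on PyTorch ≥ 2.6 and conservatively falls back to False on older versions rather than risking early CUDA initialization:

```python
import torch


def accelerator_available() -> bool:
    """Detect an accelerator without forcing CUDA driver initialization.

    torch.accelerator.current_accelerator() (PyTorch >= 2.6) answers the
    question with a lightweight query, unlike a module-level
    torch.cuda.is_available() call, which can create a CUDA context that
    then cannot be re-initialized in forked children.
    """
    if hasattr(torch, "accelerator"):
        return torch.accelerator.current_accelerator() is not None
    # Older PyTorch: prefer a safe default over eager CUDA init.
    return False
```

Moving such a check out of module scope (or making it lazy like this) is what keeps `import rfdetr` free of CUDA side effects.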

Why spawn instead of fork

PyTorch creates an OpenMP thread pool (default 8 threads) during the first tensor operation (here, model construction). fork() copies only the calling thread, so the OMP worker threads become zombie handles in the child. When the autograd engine in a forked child calls set_num_threads during thread_init, the OMP runtime finds an invalid pool state and aborts:

terminate called after throwing an instance of 'c10::Error'
  what(): pool INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/ParallelOpenMP.cpp":64

This is a fundamental fork+OMP incompatibility; as far as I know, there is no library-level workaround. The fix transparently replaces fork-based ddp_notebook with a spawn-based _NotebookSpawnDDPStrategy whose launcher is marked is_interactive_compatible = True, allowing PTL to accept it in notebook environments.
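The fork/spawn distinction the fix relies on is plain Python `multiprocessing` machinery. A minimal sketch of requesting a spawn context (the same primitive the spawn-based strategy builds on):

```python
import multiprocessing as mp

# fork() duplicates only the calling thread, so any OpenMP worker threads
# created by the parent's first tensor operation become dangling handles
# in the child. spawn starts a fresh interpreter instead, so each child
# builds its own thread pool from scratch and never sees the stale one.
ctx = mp.get_context("spawn")

# Processes created via ctx.Process(...) will use the spawn start method
# regardless of the platform default (fork on Linux).
```

In the PR, this is what `_NotebookSpawnDDPStrategy` arranges for the DDP worker processes, while its launcher's `is_interactive_compatible = True` flag tells PTL the strategy is acceptable inside a notebook.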

Performance impact

  • First predict() call: ~50-200ms one-time latency from the CPU→GPU model transfer. The cost is strictly one-time: _ensure_model_on_device() checks first_param.device != target and becomes a no-op once the model is on the GPU. After train(), the PTL-trained model is already on CUDA (synced at line 548), so even the first post-training predict() incurs zero transfer cost.
  • Subsequent predict() calls: Zero overhead (single next(parameters()).device comparison)
  • Production inference (RFDETRBase() → predict() without training): The one-time transfer happens on the very first call only. All subsequent calls, including batch evaluation loops, are zero-overhead.
  • Training: Zero impact (PTL builds its own model on CPU and handles device placement)
  • DDP spawn vs fork: ~12s additional startup for process spawn (one-time per training run)

@codecov

codecov bot commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 62.22222% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 57%. Comparing base (d9f6be3) to head (e67ac24).

❌ Your patch check has failed because the patch coverage (62%) is below the target coverage (95%). You can increase the patch coverage or adjust the target coverage.
❌ Your project check has failed because the head coverage (57%) is below the target coverage (95%). You can increase the head coverage or adjust the target coverage.

❗ There is a different number of reports uploaded between BASE (d9f6be3) and HEAD (e67ac24): HEAD has 24 fewer uploads than BASE.

| Flag | BASE (d9f6be3) | HEAD (e67ac24) |
| --- | --- | --- |
| cpu | 8 | 0 |
| Linux | 5 | 1 |
| py3.13 | 3 | 0 |
| py3.11 | 1 | 0 |
| py3.12 | 2 | 1 |
| py3.10 | 3 | 0 |
| Windows | 2 | 0 |
| macOS | 2 | 0 |
Additional details and impacted files
@@           Coverage Diff            @@
##           develop   #928     +/-   ##
========================================
- Coverage       79%    57%    -22%     
========================================
  Files           97     97             
  Lines         7793   7832     +39     
========================================
- Hits          6148   4454   -1694     
- Misses        1645   3378   +1733     


Development

Successfully merging this pull request may close these issues.

model.train(strategy="ddp_notebook") fails with "Cannot re-initialize CUDA in forked subprocess"