
fix: enable multi-GPU DDP training in Jupyter notebooks#928

Open
mfazrinizar wants to merge 15 commits into roboflow:develop from mfazrinizar:fix/ddp-notebook-cuda-init

Conversation

@mfazrinizar

What does this PR do?

Fixes multi-GPU DDP training (strategy="ddp_notebook" and strategy="ddp_spawn") which was completely broken in Jupyter (e.g. Kaggle) notebook environments. The fix addresses two layers of issues:

  1. CUDA early initialization: RFDETRBase() eagerly moved the model to CUDA during __init__(), and module-level torch.cuda.is_available() in config.py created a CUDA driver context at import time, making multi-process training impossible.

  2. OpenMP thread pool corruption after fork: Even after fixing CUDA init, PyTorch's OpenMP thread pool (created during model construction) cannot survive fork(). The worker threads become zombie handles, causing SIGABRT: Invalid thread pool! when the autograd engine initializes in forked children. Fixed by transparently replacing fork-based DDP with a spawn-based strategy.
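The deferred-placement idea behind the first fix can be sketched as follows. This is a minimal illustration of the pattern, not the PR's actual `_ensure_model_on_device()` implementation: construction touches only CPU tensors, and the move to the target device happens lazily on first use.

```python
import torch
import torch.nn as nn


class LazyDeviceModel:
    """Sketch of deferred device placement: no CUDA context is created
    at construction time, so the process stays safe to fork/spawn."""

    def __init__(self) -> None:
        # Construction stays on CPU; nothing here initializes CUDA.
        self.model = nn.Linear(4, 2)

    def _ensure_model_on_device(self, device: str) -> None:
        # No-op once the model already lives on the target device.
        target = torch.device(device)
        if next(self.model.parameters()).device != target:
            self.model.to(target)

    def predict(self, x: torch.Tensor, device: str = "cpu") -> torch.Tensor:
        self._ensure_model_on_device(device)
        with torch.no_grad():
            return self.model(x.to(device))


m = LazyDeviceModel()
out = m.predict(torch.randn(1, 4))
```

With this shape, `RFDETRBase()`-style construction never forces a CUDA context, and multi-process launchers can safely start workers before any device placement occurs.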

Related Issue(s): Fixes #923

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Testing

  • I have tested this change locally
  • I have added/updated tests for this change

Test details:

Unit tests (101 pass locally)

  • test_build_trainer.py: 52 tests covering precision resolution, strategy selection, ddp_notebook→spawn mapping, EMA guards, logger wiring
  • test_module_data.py: 49 tests including test_ddp_notebook_preserves_num_workers and test_other_strategy_preserves_num_workers

Integration test (Kaggle T4 x2)

Validated on Kaggle GPU T4 x2 accelerator (Python 3.12, PyTorch 2.10.0+cu128, PTL 2.6.1):

| Test | Result | Time |
| --- | --- | --- |
| CUDA not initialized after RFDETRBase() | ✅ PASS | |
| Model weights on CPU after construction | ✅ PASS | |
| strategy="ddp_notebook" training (3 epochs, 2×T4) | ✅ PASS | 84.3s |
| strategy="ddp_spawn" training (3 epochs, 2×T4) | ✅ PASS | 77.4s |
| Inference after DDP training | ✅ PASS | |

What This Fixes

| Scenario | Before | After |
| --- | --- | --- |
| model.train(devices=2, strategy="ddp_notebook") in notebook | ❌ CUDA re-init / SIGABRT | ✅ Works |
| model.train(devices=2, strategy="ddp_spawn") in notebook | ❌ CUDA re-init / MisconfigurationException | ✅ Works |
| model.train(devices=1) | ✅ Works | ✅ Works (no regression) |
| model.predict(img) | ✅ Works | ✅ Works (lazy device placement) |
| model.train() → model.predict(img) | ✅ Works | ✅ Works |
| model.export_onnx() / model.optimize_for_inference() | ✅ Works | ✅ Works |

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code where necessary, particularly in hard-to-understand areas
  • My changes generate no new warnings or errors
  • I have updated the documentation accordingly (if applicable)

Additional Context

The ddp_notebook → spawn conversion is transparent to users: they continue passing strategy="ddp_notebook" (or strategy="ddp_spawn") and training just works. An INFO log message is emitted:

[INFO] rf-detr - ddp_notebook → spawn-based DDP to avoid OpenMP thread pool corruption after fork.
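The user-facing part of that remapping can be sketched as a small resolver. This is a simplified illustration (the real PR substitutes a spawn-based strategy object, not just a string), and `resolve_ddp_strategy` is a hypothetical name for this sketch:

```python
import logging

logger = logging.getLogger("rf-detr")


def resolve_ddp_strategy(strategy: str) -> str:
    """Map the user-supplied strategy name to a fork-safe equivalent.

    Hypothetical sketch: ddp_notebook is fork-based and breaks on
    OpenMP thread pools, so it is transparently redirected to spawn.
    """
    if strategy == "ddp_notebook":
        logger.info(
            "ddp_notebook -> spawn-based DDP to avoid OpenMP "
            "thread pool corruption after fork."
        )
        return "ddp_spawn"
    return strategy
```

Users keep writing `strategy="ddp_notebook"`; only the log line reveals the substitution.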

The find_unused_parameters=True flag is required because RF-DETR's architecture has parameters in the detection head that may not contribute to every loss term (e.g. encoder-only auxiliary losses).

Technical Details

Two layers of CUDA initialization that had to be fixed

  1. Module-level (config.py): torch.cuda.is_available() creates a CUDA driver context at import time. Fixed with torch.accelerator.current_accelerator() which queries NVML without creating a primary context.

  2. Model construction (inference.py): nn_model.to("cuda") fully initializes the CUDA runtime. Fixed by keeping the model on CPU and deferring .to(device) to first predict()/export()/batch_size="auto" call via _ensure_model_on_device().
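The first fix can be sketched as a small availability check. This is an illustrative helper (the name `accelerator_available` is mine, not the PR's); it assumes `torch.accelerator` exists on PyTorch ≥ 2.6 and conservatively falls back to False on older versions rather than risking early CUDA initialization:

```python
import torch


def accelerator_available() -> bool:
    """Detect an accelerator without forcing CUDA driver initialization.

    torch.accelerator.current_accelerator() (PyTorch >= 2.6) answers the
    question with a lightweight query, unlike a module-level
    torch.cuda.is_available() call, which can create a CUDA context that
    then cannot be re-initialized in forked children.
    """
    if hasattr(torch, "accelerator"):
        return torch.accelerator.current_accelerator() is not None
    # Older PyTorch: prefer a safe default over eager CUDA init.
    return False
```

Moving such a check out of module scope (or making it lazy like this) is what keeps `import rfdetr` free of CUDA side effects.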

Why spawn instead of fork

PyTorch creates an OpenMP thread pool (default 8 threads) during the first tensor operation (here, model construction). fork() copies only the calling thread, so the OMP worker threads become zombie handles in the child. When the autograd engine in a forked child calls set_num_threads during thread_init, the OMP runtime finds an invalid pool state and aborts:

terminate called after throwing an instance of 'c10::Error'
  what(): pool INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/ParallelOpenMP.cpp":64

This is a fundamental fork+OMP incompatibility; as far as I know, there is no library-level workaround. The fix transparently replaces fork-based ddp_notebook with a spawn-based _NotebookSpawnDDPStrategy whose launcher is marked is_interactive_compatible = True, allowing PTL to accept it in notebook environments.
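The fork/spawn distinction the fix relies on is plain Python `multiprocessing` machinery. A minimal sketch of requesting a spawn context (the same primitive the spawn-based strategy builds on):

```python
import multiprocessing as mp

# fork() duplicates only the calling thread, so any OpenMP worker threads
# created by the parent's first tensor operation become dangling handles
# in the child. spawn starts a fresh interpreter instead, so each child
# builds its own thread pool from scratch and never sees the stale one.
ctx = mp.get_context("spawn")

# Processes created via ctx.Process(...) will use the spawn start method
# regardless of the platform default (fork on Linux).
```

In the PR, this is what `_NotebookSpawnDDPStrategy` arranges for the DDP worker processes, while its launcher's `is_interactive_compatible = True` flag tells PTL the strategy is acceptable inside a notebook.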

Performance impact

  • First predict() call: ~50-200ms one-time latency from the CPU→GPU model transfer. The cost is strictly one-time: _ensure_model_on_device() checks first_param.device != target and becomes a no-op once the model is on the GPU. After train(), the PTL-trained model is already on CUDA (synced at line 548), so even the first post-training predict() incurs zero transfer cost.
  • Subsequent predict() calls: Zero overhead (single next(parameters()).device comparison)
  • Production inference (RFDETRBase() → predict() without training): The one-time transfer happens on the very first call only. All subsequent calls, including batch evaluation loops, are zero-overhead.
  • Training: Zero impact (PTL builds its own model on CPU and handles device placement)
  • DDP spawn vs fork: ~12s additional startup for process spawn (one-time per training run)

@codecov

codecov bot commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 62.22222% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 57%. Comparing base (d9f6be3) to head (e67ac24).

❌ Your patch check has failed because the patch coverage (62%) is below the target coverage (95%). You can increase the patch coverage or adjust the target coverage.
❌ Your project check has failed because the head coverage (57%) is below the target coverage (95%). You can increase the head coverage or adjust the target coverage.

❗ There is a different number of reports uploaded between BASE (d9f6be3) and HEAD (e67ac24): HEAD has 24 fewer uploads than BASE.

| Flag | BASE (d9f6be3) | HEAD (e67ac24) |
| --- | --- | --- |
| cpu | 8 | 0 |
| Linux | 5 | 1 |
| py3.13 | 3 | 0 |
| py3.11 | 1 | 0 |
| py3.12 | 2 | 1 |
| py3.10 | 3 | 0 |
| Windows | 2 | 0 |
| macOS | 2 | 0 |
Additional details and impacted files
@@           Coverage Diff            @@
##           develop   #928     +/-   ##
========================================
- Coverage       79%    57%    -22%     
========================================
  Files           97     97             
  Lines         7793   7832     +39     
========================================
- Hits          6148   4454   -1694     
- Misses        1645   3378   +1733     


Development

Successfully merging this pull request may close these issues.

model.train(strategy="ddp_notebook") fails with "Cannot re-initialize CUDA in forked subprocess"