
Fix ONNX export for dynamic batch dimensions#871

Merged
Borda merged 10 commits into roboflow:develop from svengoluza:fix-onnx-dynamic-export
Mar 25, 2026

Conversation

@svengoluza
Contributor

What does this PR do?

ONNX export silently baked spatial dimensions and batch size into the graph as fixed constants. When users tried to run the exported model with a different batch size at inference time, it would fail because the ONNX graph contained hardcoded values derived from the export-time batch size.

Three root causes:

  1. LayerNorm.forward used x.size(3) instead of self.normalized_shape — The ONNX tracer captured the concrete integer from the export-time tensor shape and embedded it as a constant node. Switching to self.normalized_shape (already stored on the module) gives the tracer a static attribute it can reference symbolically.

  2. spatial_shapes was built as a Python list, then converted with torch.as_tensor() — The tracer never saw the h, w values flow through tensor operations, so it treated them as constants. Building spatial_shapes directly as a tensor with index assignment (spatial_shapes[lvl, 0] = h) lets the tracer track the symbolic relationship between the input spatial dims and downstream uses.

  3. gen_encoder_output_proposals created valid_H/valid_W via Python list comprehensions: torch.tensor([H_ for _ in range(N_)]) bakes the batch-dependent value N_ in as a fixed list length. Replacing this with H_.expand(N_) keeps the batch dimension symbolic in the graph.

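The three fixes above can be sketched as follows; the class and function names are illustrative stand-ins, not RF-DETR's actual code, and fix 3 shows the form originally proposed in this PR (a later commit in the same PR swaps it for torch.full):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TraceSafeLayerNorm(nn.Module):
    """Fix 1: normalize over a stored shape attribute, not x.size(3)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))
        self.eps = eps
        self.normalized_shape = (dim,)

    def forward(self, x):
        # x.size(3) would be traced as a concrete int constant;
        # self.normalized_shape is a static module attribute instead.
        return F.layer_norm(x, self.normalized_shape,
                            self.weight, self.bias, self.eps)

def build_spatial_shapes(srcs):
    """Fix 2: build the tensor by index assignment, not via a Python list."""
    spatial_shapes = torch.zeros(len(srcs), 2, dtype=torch.long)
    for lvl, src in enumerate(srcs):
        _, _, h, w = src.shape
        spatial_shapes[lvl, 0] = h  # tracked as tensor ops by the tracer
        spatial_shapes[lvl, 1] = w
    return spatial_shapes

def expand_valid_dim(H_, N_):
    """Fix 3 (as proposed here): broadcast a scalar tensor over the batch."""
    # Avoids materializing a batch-length Python list whose length would
    # be frozen into the graph.
    return H_.expand(N_)
```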
Additionally, a dynamic_batch parameter is added to RFDETR.export_onnx() and the standalone export CLI. When enabled, it marks the batch axis (dim 0) as dynamic on all input and output names, allowing the exported model to accept variable batch sizes at runtime.
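A minimal sketch of how such a flag can translate into the dynamic_axes mapping that torch.onnx.export accepts; the helper name and I/O names here are assumptions, not RF-DETR's actual API:

```python
def build_dynamic_axes(input_names, output_names, dynamic_batch):
    """Map a dynamic_batch flag to a torch.onnx.export dynamic_axes dict."""
    if not dynamic_batch:
        return None  # static export: every dim stays a fixed constant
    # Mark dim 0 ("batch") as dynamic on every input and output name.
    return {name: {0: "batch"}
            for name in list(input_names) + list(output_names)}
```

Passing the result as torch.onnx.export(..., input_names=..., output_names=..., dynamic_axes=build_dynamic_axes(...)) yields a graph whose batch axis is symbolic rather than fixed at the export-time batch size.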

Related Issue(s): #376 , #79

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Testing

  • I have tested this change locally

@CLAassistant

CLAassistant commented Mar 25, 2026

CLA assistant check
All committers have signed the CLA.

Contributor

Copilot AI left a comment


Pull request overview

Fixes RF-DETR’s ONNX export to better support dynamic batch sizes by removing shape-dependent Python constructs that get constant-folded into the exported graph.

Changes:

  • Replace shape-derived Python constructs with tensor/attribute-based equivalents to avoid baking batch/spatial dims into ONNX graphs.
  • Add dynamic_batch option to RFDETR.export() and the export CLI path to emit dynamic_axes for batch dimension.
  • Update projector LayerNorm to use self.normalized_shape instead of x.size(3) to prevent tracing fixed constants.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Reviewed files:

  • src/rfdetr/models/transformer.py: Reworks spatial_shapes construction and proposal generation to be more ONNX-trace-friendly.
  • src/rfdetr/models/backbone/projector.py: Prevents LayerNorm from embedding the export-time channel size as a constant.
  • src/rfdetr/export/main.py: Adds CLI support for dynamic batch via dynamic_axes.
  • src/rfdetr/detr.py: Adds the dynamic_batch parameter and wires dynamic_axes through the high-level export API.

@codecov

codecov bot commented Mar 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77%. Comparing base (c0fb944) to head (cacdd70).
⚠️ Report is 1 commit behind head on develop.

Additional details and impacted files
@@          Coverage Diff           @@
##           develop   #871   +/-   ##
======================================
  Coverage       77%    77%           
======================================
  Files           97     97           
  Lines         7530   7538    +8     
======================================
+ Hits          5793   5801    +8     
  Misses        1737   1737           

Borda added 5 commits March 25, 2026 13:53
The PR switched spatial_shapes from a Python list to a tensor for
ONNX tracing. But gen_encoder_output_proposals iterates over it and
uses H_/W_ as slice indices and torch.linspace steps, both of which
require Python ints, not scalar tensors — causing TypeError at runtime.

Reintroduce spatial_shapes_hw as a list[tuple[int, int]] built
alongside the tensor during the loop (h/w come from src.shape, so
they are already Python ints). Pass spatial_shapes_hw to
gen_encoder_output_proposals while the tensor form is still used
for MSDeformAttn and level_start_index.

Addresses review comment by @Copilot (PR roboflow#871)
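The dual representation this commit describes can be sketched like so (names follow the commit message; the loop context is illustrative):

```python
import torch

def build_shape_forms(srcs):
    """Keep both a tensor and a list[tuple[int, int]] of level shapes."""
    spatial_shapes = torch.zeros(len(srcs), 2, dtype=torch.long)
    spatial_shapes_hw = []  # int tuples for code that needs Python ints
    for lvl, src in enumerate(srcs):
        _, _, h, w = src.shape           # h, w are already Python ints
        spatial_shapes[lvl, 0] = h       # tensor form: MSDeformAttn,
        spatial_shapes[lvl, 1] = w       # level_start_index
        spatial_shapes_hw.append((h, w)) # int form: slice indices, linspace
    return spatial_shapes, spatial_shapes_hw
```

The int form matters because, for example, torch.linspace requires its steps argument to be a Python int; passing a scalar tensor raises TypeError, which is the failure the commit fixes.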
Add two test suites recommended by Copilot review:

1. TestCliExportMain.test_dynamic_batch_forwards_dynamic_axes — verifies
   that CLI main() passes dynamic_axes={name: {0: 'batch'}} to export_onnx
   for every I/O name when --dynamic_batch=True, and None when False.
   Also updates _make_args() to accept dynamic_batch and fake_export_onnx
   to capture dynamic_axes.

2. test_rfdetr_export_dynamic_batch_forwards_dynamic_axes — verifies that
   RFDETR.export(..., dynamic_batch=True) forwards a correctly keyed
   dynamic_axes dict into export_onnx, covering detection and segmentation
   model configs, plus static (False) baseline.

Addresses review comments by @Copilot (PR roboflow#871)
H_.expand(N_) assumed H_ is a tensor, but gen_encoder_output_proposals
now receives spatial_shapes as list[tuple[int, int]] from Transformer.forward().
Python ints have no .expand() method, causing AttributeError on the export
path (masks=None, two_stage=True).

Replace with torch.full((N_,), H_, ...) which accepts Python ints.

Also add regression test with list[tuple[int, int]] + masks=None to
lock in the int-tuple path and catch any future regressions.
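A sketch of the replacement: torch.full accepts a plain Python int as its fill value, whereas .expand() exists only on tensors (the variable values here are illustrative):

```python
import torch

H_, N_ = 16, 4  # Python ints on the export path (masks=None, two_stage=True)
valid_H = torch.full((N_,), H_, dtype=torch.long)  # was: H_.expand(N_)
```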
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@Borda Borda merged commit 7450c16 into roboflow:develop Mar 25, 2026
23 checks passed
@Borda Borda added the bug Something isn't working label Mar 25, 2026
@Borda Borda mentioned this pull request Mar 26, 2026
2 tasks
Borda added a commit that referenced this pull request Mar 27, 2026
* Fix ONNX export for dynamic batch dimensions
* fix: restore Python int pairs for gen_encoder_output_proposals
* test: add dynamic_batch coverage for CLI and RFDETR.export()
* fix: replace H_.expand(N_) with torch.full for Python int spatial dims
* Apply suggestions from code review

---------

Co-authored-by: jirka <6035284+Borda@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
