docs: comprehensive documentation overhaul by sunway513 · Pull Request #2706 · ROCm/aiter

sunway513 · 2026-04-12T19:12:39Z

Summary

Fix 22 factual errors identified in documentation audit (Feb 2026)
Rewrite API docs (gemm, attention, operators) to match actual codebase
Fix add_new_op tutorial: replace CUDA build system with HIP/Triton AITER JIT
Add new pages: ROCm compatibility matrix, supported models, GEMM tuning guide
Add MoE and normalization API documentation
Auto-detect version in conf.py (was hardcoded to 0.1.0, now 12 versions behind)
Deploy docs on every GitHub Release (was only on docs/ path changes)
Add Sphinx -W flag to catch broken references in CI
Add release notification workflow (Slack + downstream tracking issues)
Remove non-functional doc.aiter.amd.com CNAME reference
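
For reviewers who want a concrete picture of the conf.py change, here is a minimal sketch of version auto-detection. It is an assumption, not the code in this PR: the commit message says detection goes through setuptools_scm, while this sketch reads the installed package metadata with the standard library instead. The distribution name `amd-aiter` is taken from the install command in release-notify.yml, and the fallback string is hypothetical.

```python
import importlib.metadata as metadata


def detect_release(package: str, fallback: str = "0.0.0+unknown") -> str:
    """Return the installed version of `package`, or `fallback` if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return fallback


def short_version(release: str) -> str:
    """Reduce a full release such as '1.2.3.dev4+gabc' to 'major.minor'."""
    public = release.split("+", 1)[0]  # drop any local segment after '+'
    return ".".join(public.split(".")[:2])


# Sphinx reads these two module-level names from conf.py.
release = detect_release("amd-aiter")  # distribution name assumed from the PR
version = short_version(release)
```

With this shape a docs build on a machine without the package installed still succeeds and shows the fallback rather than a stale hardcoded number.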

Files Changed (18 files)

Infrastructure:

.github/workflows/docs.yml — release trigger, -W errors, linkcheck, remove CNAME
.github/workflows/release-notify.yml — new: Slack + downstream issue automation
.github/changelog-config.json — new: auto changelog categorization
docs/conf.py — version auto-detection

API docs rewritten:

docs/api/gemm.rst — removed 8 nonexistent functions, documented 25+ actual APIs
docs/api/attention.rst — removed fabricated GQA/MQA, documented PA/MHA/MLA
docs/api/operators.rst — documented actual norm/activation/rope/quant/sample/cache
docs/tutorials/add_new_op.rst — CUDA to HIP/Triton, CUDAExtension to AITER JIT

New pages:

docs/api/moe.rst, docs/api/normalization.rst
docs/compatibility.rst, docs/models.rst, docs/gemm_tuning.rst
docs/changelog.rst, docs/contributing.rst
docs/advanced/triton_kernels.rst, docs/performance/benchmarks.rst

Secrets Required (for release-notify.yml)

AITER_RELEASE_SLACK_WEBHOOK — Slack incoming webhook URL
CROSS_REPO_TOKEN — PAT with issues:write on ROCm/ATOM
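
To make the webhook secret less abstract: a Slack incoming webhook accepts a JSON body over HTTP POST. The sketch below builds a Block Kit payload with a fallback when the release body is empty. It is illustrative only. The workflow in this PR assembles its payload in YAML, and the function names, header text, and truncation limit here are hypothetical.

```python
import json
import urllib.request
from typing import Optional

FENCE = "`" * 3  # built at runtime so the example can live inside a fenced block
MAX_NOTES_CHARS = 2900  # stay below Slack's 3000-character cap on section text


def build_release_payload(tag: str, url: str, body: Optional[str]) -> dict:
    """Build a Block Kit payload announcing a release."""
    notes = (body or "").strip()
    if notes:
        # Show the (possibly truncated) release notes as a code block.
        text = f"{FENCE}\n{notes[:MAX_NOTES_CHARS]}\n{FENCE}"
    else:
        text = "See release page for details."
    return {
        "blocks": [
            {"type": "header",
             "text": {"type": "plain_text", "text": f"AITER {tag} released"}},
            {"type": "section", "text": {"type": "mrkdwn", "text": text}},
            {"type": "section",
             "text": {"type": "mrkdwn", "text": f"<{url}|Release page>"}},
        ]
    }


def post_to_slack(webhook_url: str, payload: dict) -> int:
    """POST the payload to an incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

In a workflow the webhook URL would come from the AITER_RELEASE_SLACK_WEBHOOK secret via an environment variable rather than being written into the script.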

Test plan

CI docs.yml build passes (Sphinx with -W)
Linkcheck reports no critical broken links
Manual: trigger workflow_dispatch and verify docs deploy to gh-pages
Review API docs against actual function signatures

Follow PyTorch's wheel naming convention (e.g. +rocm7.2.1) for AITER release wheels. This enables building distinct wheels for different ROCm versions from the same workflow.

Changes:
- Add rocm_version input (auto-detects from container if empty)
- Use SETUPTOOLS_SCM_PRETEND_VERSION for version+rocm suffix
- Include ROCm version in concurrency group to prevent cross-version cancellation
- Update artifact naming to include ROCm suffix
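
The suffixing rule in this commit can be sketched in a few lines of Python. This is a hedged illustration, not the workflow's code (the workflow does this in shell). The function names are hypothetical, and the validation and the empty-suffix handling are assumptions about what a careful implementation might want.

```python
import re

_ROCM_VERSION = re.compile(r"\d+(\.\d+)*")  # e.g. "7.2.1"


def wheel_version(base_version: str, rocm_version: str = "") -> str:
    """Append a '+rocmX.Y.Z' local version segment when a ROCm version is given."""
    rocm_version = rocm_version.strip()
    if not rocm_version:
        return base_version
    if "+" in base_version:
        raise ValueError(f"base version already has a local segment: {base_version}")
    if not _ROCM_VERSION.fullmatch(rocm_version):
        raise ValueError(f"unexpected ROCm version: {rocm_version!r}")
    return f"{base_version}+rocm{rocm_version}"


def artifact_suffix(rocm_version: str = "") -> str:
    """Suffix for artifact names; empty when no ROCm version is known."""
    rocm_version = rocm_version.strip()
    return f"rocm{rocm_version}" if rocm_version else ""
```

The result of `wheel_version("0.1.0", "7.2.1")` is the string a build step would export as SETUPTOOLS_SCM_PRETEND_VERSION before invoking the wheel build.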

- Default runner: aiter-k8s-build -> aiter-1gpu-runner (actually exists)
- Remove non-existent runners: aiter-mi300-1gpu, aiter-mi325-1gpu
- Fix runner typo: linux-aiter-mi355-1 -> linux-aiter-mi35x-1
- Fix Docker username: rocmshard -> rocmshared (missing 'e')

setuptools_scm 10.x moved to vcs_versioning package, breaking the build with ModuleNotFoundError. Pin to 9.x until pyproject.toml is updated.

…dd new pages, automate deployment

- Fix conf.py version to auto-detect from setuptools_scm (was hardcoded 0.1.0)
- Fix docs.yml: add release trigger, enable -W (warnings as errors), add linkcheck
- Remove doc.aiter.amd.com CNAME reference (DNS was never configured)
- Rewrite gemm.rst: remove 8 nonexistent functions, document actual 25+ GEMM APIs
- Rewrite attention.rst: remove fabricated GQA/MQA, document PA/MHA/MLA APIs
- Rewrite operators.rst: document actual norm/activation/rope/quant/sample/cache APIs
- Rewrite add_new_op.rst: replace CUDA build system with HIP/Triton AITER JIT pattern
- Add new pages: compatibility matrix, supported models, GEMM tuning guide
- Add new API docs: moe.rst, normalization.rst
- Add stubs: changelog, contributing, triton_kernels, benchmarks
- Add release-notify.yml: Slack webhook + downstream tracking issue automation
- Add changelog-config.json for auto-generated release notes

github-actions · 2026-04-12T19:13:00Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label	Tests
`ci:triton-355`	Run Triton tests on MI355 in addition to MI325
`ci:sglang`	SGLang integration tests
`ci:atom`	ATOM benchmark (DeepSeek-R1 + GPT-OSS)
`ci:vllm`	vLLM benchmark
`ci:all`	All of the above

Add labels via the sidebar or gh pr edit 2706 --add-label <label>

Copilot

Pull request overview

This PR overhauls AITER’s Sphinx documentation to better reflect the current ROCm/HIP + Triton JIT-based codebase, and updates CI/workflows to build/deploy docs on releases plus send release notifications.

Changes:

Rewrites major API docs (GEMM, attention, operators) and updates the “add new op” tutorial to match the @compile_ops JIT flow.
Adds new documentation pages (compatibility matrix, models, GEMM tuning guide, benchmarks, MoE/normalization APIs).
Updates GitHub Actions for docs builds (-W, linkcheck, release deploy) and adds a release notification workflow.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 15 comments.


File	Description
docs/tutorials/add_new_op.rst	Reworked tutorial to document HIP/Triton JIT flow and testing patterns
docs/performance/benchmarks.rst	New page describing performance/benchmark resources
docs/models.rst	New supported-model matrix and tuned-config references
docs/index.rst	Updates landing page messaging, install snippet, TOCs, and support matrix
docs/gemm_tuning.rst	New GEMM tuning workflow guide
docs/contributing.rst	New contributor setup/testing/linting guidance
docs/conf.py	Auto-detects docs version instead of hardcoding
docs/compatibility.rst	New ROCm + GPU compatibility/install matrix
docs/changelog.rst	New changelog landing page pointing to GitHub Releases
docs/api/operators.rst	Rewritten operator inventory aligned to actual modules/APIs
docs/api/normalization.rst	New normalization API page
docs/api/moe.rst	New MoE API page
docs/api/gemm.rst	Rewritten GEMM API docs across backends/precisions
docs/api/attention.rst	Rewritten attention API docs (MHA/PA/MLA)
docs/advanced/triton_kernels.rst	New Triton backend overview page
.github/workflows/release-notify.yml	Adds Slack + downstream tracking issue automation on releases
.github/workflows/docs.yml	Builds docs with `-W`, runs linkcheck, deploys on release
.github/workflows/aiter-release.yaml	Adds ROCm-version wheel suffixing + artifact naming adjustments
.github/changelog-config.json	New config for changelog categorization


docs/tutorials/add_new_op.rst

+   void my_op_fwd(torch::Tensor input, torch::Tensor output) {
+       int n = input.numel();
+       auto stream = at::cuda::getCurrentHIPStream().stream();
+       launch_my_op(input.data_ptr(), output.data_ptr(), n, stream);


docs/tutorials/add_new_op.rst

+   def gen_my_op_fake(input: Tensor) -> Tensor:
+       """Fake tensor impl for torch.compile tracing."""
+       return torch.empty_like(input)

-   # Build and install
-   python setup.py develop

-   # Or for production
-   python setup.py install
+   @compile_ops("module_my_op", gen_fake=gen_my_op_fake)
+   def my_op_fwd(input: Tensor) -> Tensor:
+       """My custom operator."""
+       ...


docs/tutorials/add_new_op.rst

+   # op_tests/test_my_op.py
+   import torch
+   import aiter
+   from aiter.test_common import checkAllclose, benchmark

-Print Kernel Launches
-^^^^^^^^^^^^^^^^^^^^^
+   @benchmark()
+   def test_my_op(m, n, dtype):
+       ret = {}
+       input = torch.randn(m, n, dtype=dtype, device="cuda")

-.. code-block:: bash
+       # Reference (PyTorch)
+       ref_output = input.clone()  # replace with actual reference

-   export HIP_VISIBLE_DEVICES=0
-   export AMD_LOG_LEVEL=3  # Verbose logging
+       # AITER op
+       output = torch.empty_like(input)
+       aiter.my_op_fwd(output, input)



docs/tutorials/add_new_op.rst

+5. **Run ruff and pytest before committing.** Lint and test locally before
+   pushing.


docs/performance/benchmarks.rst

+AITER includes kernel-level benchmarks in its test suite. To run them:
+
+.. code-block:: bash
+
+   pytest tests/ -k "benchmark"
+


docs/api/normalization.rst

+traffic and improve throughput.
+
+.. py:function:: rmsnorm2d_fwd_with_add(input, residual, weight, eps)
+
+   Fused residual addition and RMS normalization. Computes
+   ``rmsnorm(input + residual)`` in a single kernel.
+
+   :param input: Input tensor.
+   :param residual: Residual tensor to add before normalization.
+   :param weight: Learnable scale parameter.
+   :param eps: Numerical stability constant.
+
+.. py:function:: rmsnorm2d_fwd_with_smoothquant(...)
+
+   Fused RMS normalization with SmoothQuant. Applies per-channel smooth
+   quantization scales after normalization.
+
+.. py:function:: rmsnorm2d_fwd_with_dynamicquant(...)
+
+   Fused RMS normalization with dynamic quantization. Computes quantization
+   parameters on the fly and outputs quantized activations.
+
+.. py:function:: add_rmsnorm_quant(...)
+
+   Fused residual add + RMS normalization + quantization in a single kernel.
+   Combines three operations to minimize global memory round-trips.
+
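
For readers checking the `rmsnorm2d_fwd_with_add` description above, here is a plain-Python reference of the documented math, ``rmsnorm(input + residual)``, for a single row. It is a sketch of the semantics only, not AITER's kernel or calling convention. The function name is hypothetical, and returning the summed row alongside the output is an assumption about what a caller might want, not something the diff states.

```python
import math


def rmsnorm_with_add_ref(x, residual, weight, eps=1e-6):
    """Reference for one row: normalize (x + residual) by its RMS, then scale.

    Returns (normalized_row, summed_row).
    """
    summed = [a + b for a, b in zip(x, residual)]
    mean_sq = sum(v * v for v in summed) / len(summed)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    normalized = [v * inv_rms * w for v, w in zip(summed, weight)]
    return normalized, summed
```

A fused kernel computes the same values but reads `input` and `residual` once instead of materializing the sum in global memory between two launches.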


.github/workflows/release-notify.yml

+                  "type": "section",
+                  "text": {
+                    "type": "mrkdwn",
+                    "text": "${{ github.event.release.body && '```' || 'See release page for details.' }}"


.github/workflows/release-notify.yml

+                '### Install',
+                '```bash',
+                'pip install amd-aiter --find-links ' + url,
+                '```',


.github/workflows/aiter-release.yaml

+          if [ -n "$ROCM_VER" ]; then
+            FULL_VER="${BASE_VER}+rocm${ROCM_VER}"
+          else
+            FULL_VER="${BASE_VER}"
+          fi
+          echo "Wheel version: $FULL_VER (base=$BASE_VER, rocm=$ROCM_VER)"
+          echo "full_version=$FULL_VER" >> "$GITHUB_OUTPUT"
+          echo "rocm_suffix=rocm${ROCM_VER}" >> "$GITHUB_OUTPUT"
+


docs/api/moe.rst

+.. py:function:: fmoe(input, w1, w2, topk_weights, topk_ids, ...)
+
+   Main fused MoE forward pass. Dispatches tokens to selected experts, applies
+   expert weights (w1, w2), and combines results.
+
+   :param input: Hidden states of shape ``(num_tokens, hidden_dim)``.
+   :param w1: First expert weight matrix.
+   :param w2: Second expert weight matrix.
+   :param topk_weights: Per-token expert weights from gating.
+   :param topk_ids: Per-token expert indices from gating.
+
+.. py:function:: fmoe_g1u1(input, gate, up, down, topk_weights, topk_ids, ...)
+
+   Fused MoE with separate gate, up, and down projections (GLU-style).
+   Used by architectures that split the MoE FFN into gate/up/down matrices.
+
+.. py:function:: fmoe_int8_g1u0(...)
+
+   INT8 quantized fused MoE. Applies INT8 weight-only quantization to the
+   expert computations.
+
+.. py:function:: fmoe_fp8_blockscale_g1u1(...)
+
+   FP8 block-scale quantized fused MoE with gate/up/down projections.
+   Uses per-block scaling factors for FP8 computation.
+
+.. py:function:: fused_moe(hidden_states, w1, w2, gating_output, topk, ...)
+
+   High-level fused MoE entry point (from ``aiter/fused_moe.py``). Combines
+   gating and expert computation in a single call.
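
The dispatch-and-combine step these entries describe can be written as a slow reference in plain Python. This is a sketch under stated assumptions: each expert is an arbitrary callable standing in for the w1/w2 feed-forward computation, and the function name and argument layout are hypothetical rather than AITER's API.

```python
def moe_reference(tokens, experts, topk_ids, topk_weights):
    """Route each token to its selected experts and sum the weighted outputs.

    tokens:       list of rows (lists of floats)
    experts:      list of callables, each mapping a row to a row of equal length
    topk_ids:     per-token list of selected expert indices
    topk_weights: per-token list of gating weights, aligned with topk_ids
    """
    outputs = []
    for token, ids, weights in zip(tokens, topk_ids, topk_weights):
        acc = [0.0] * len(token)
        for expert_id, weight in zip(ids, weights):
            expert_out = experts[expert_id](token)
            acc = [a + weight * v for a, v in zip(acc, expert_out)]
        outputs.append(acc)
    return outputs
```

A reference like this is mainly useful as the comparison target in a correctness test; the fused kernels exist precisely to avoid this per-token Python loop.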


5 test categories to prevent doc rot:

1. API signature consistency: every autofunction/autoclass in RST must be importable
2. Code example syntax: Python code blocks must parse without SyntaxError
3. Sphinx build: build with -W (warnings as errors), catch broken refs
4. Version consistency: conf.py must use auto-detection, no hardcoded version
5. RST structure: no orphan API pages, no CUDA references in ROCm docs

Also adds test_docs.py to docs.yml CI trigger paths.
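
As an illustration of the code example syntax category, the check can be done with the standard library alone. This is a minimal sketch, not the contents of test_docs.py: the helper names are hypothetical, and it only recognizes `.. code-block:: python` directives with indented bodies.

```python
import ast
import textwrap


def python_blocks(rst: str) -> list:
    """Collect the bodies of `.. code-block:: python` directives."""
    lines = rst.splitlines()
    blocks = []
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.strip() != ".. code-block:: python":
            i += 1
            continue
        indent = len(line) - len(line.lstrip())
        i += 1
        body = []
        # The body is every following line that is blank or indented deeper.
        while i < len(lines):
            cur = lines[i]
            if cur.strip() and len(cur) - len(cur.lstrip()) <= indent:
                break
            body.append(cur)
            i += 1
        # Skip directive options such as ':linenos:' that precede the body.
        while body and body[0].strip().startswith(":"):
            body.pop(0)
        blocks.append(textwrap.dedent("\n".join(body)).strip("\n"))
    return blocks


def syntax_errors(rst: str) -> list:
    """Return (block_index, message) for each block that fails to parse."""
    errors = []
    for index, source in enumerate(python_blocks(rst)):
        try:
            ast.parse(source)
        except SyntaxError as exc:
            errors.append((index, str(exc)))
    return errors
```

Parsing with `ast.parse` checks syntax only, so examples that import GPU-only modules still pass on a CPU-only CI runner.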

…odoc imports

- Fix black formatting issues in test_docs.py and conf.py
- Fix ruff: remove unused os import, fix f-string without placeholders
- Remove autofunction/autoclass directives (fail on CPU-only CI, manual docs sufficient)
- Add autodoc_mock_imports for triton/ROCm modules in conf.py
- Remove deprecated display_version theme option
- Fix tutorials/index.rst: remove 9 references to nonexistent pages
- Fix quickstart.rst: point to existing pages instead of nonexistent tutorials
- Fix installation.rst: remove broken tutorials/triton_comms ref
- Fix basic_usage.rst: remove broken tutorial cross-references

sunway513 added 4 commits April 11, 2026 04:43

Pin setuptools_scm<10 to fix vcs_versioning import error

09d68fc

setuptools_scm 10.x moved to vcs_versioning package, breaking the build with ModuleNotFoundError. Pin to 9.x until pyproject.toml is updated.

sunway513 requested review from a team and Copilot April 12, 2026 19:12

Copilot started reviewing on behalf of sunway513 April 12, 2026 19:14

Copilot AI reviewed Apr 12, 2026


sunway513 added 4 commits April 12, 2026 19:27

fix: black formatting for ternary expressions in parametrize decorators

0c80444

fix: last black formatting nit — ids line for module_refs

a000d2e

sunway513 wants to merge 8 commits into main from docs/comprehensive-doc-overhaul

2 participants
