docs: comprehensive documentation overhaul #2706

Open
sunway513 wants to merge 8 commits into main from docs/comprehensive-doc-overhaul

@sunway513
Collaborator

Summary

  • Fix 22 factual errors identified in documentation audit (Feb 2026)
  • Rewrite API docs (gemm, attention, operators) to match actual codebase
  • Fix add_new_op tutorial: replace CUDA build system with HIP/Triton AITER JIT
  • Add new pages: ROCm compatibility matrix, supported models, GEMM tuning guide
  • Add MoE and normalization API documentation
  • Auto-detect version in conf.py (previously hardcoded to 0.1.0, which is now 12 versions behind)
  • Deploy docs on every GitHub Release (was only on docs/ path changes)
  • Add Sphinx -W flag to catch broken references in CI
  • Add release notification workflow (Slack + downstream tracking issues)
  • Remove non-functional doc.aiter.amd.com CNAME reference

Files Changed (18 files)

Infrastructure:

  • .github/workflows/docs.yml — release trigger, -W errors, linkcheck, remove CNAME
  • .github/workflows/release-notify.yml — new: Slack + downstream issue automation
  • .github/changelog-config.json — new: auto changelog categorization
  • docs/conf.py — version auto-detection
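
The version auto-detection in docs/conf.py could look roughly like the sketch below. It is an illustration only: the commit message says the real change reads the version from setuptools_scm, while this sketch swaps in the standard library's `importlib.metadata`. The distribution name `amd-aiter`, the fallback string, and the helper names `detect_release` and `short_version` are assumptions made for the example.

```python
from importlib.metadata import PackageNotFoundError, version as pkg_version


def detect_release(dist_name: str = "amd-aiter", fallback: str = "0.0.0+unknown") -> str:
    """Return the installed version of dist_name, or fallback if it is not installed.

    The distribution name and the fallback value are assumptions for this sketch.
    """
    try:
        return pkg_version(dist_name)
    except PackageNotFoundError:
        return fallback


def short_version(full: str) -> str:
    """Reduce a full version such as '0.1.5+rocm7.2.1' to its leading 'X.Y' part."""
    public = full.split("+", 1)[0]  # drop any local suffix like '+rocm7.2.1'
    return ".".join(public.split(".")[:2])


# Sphinx reads these two module-level names from conf.py.
release = detect_release()
version = short_version(release)
```

With a lookup like this, the docs no longer need a manual version bump at each release; the fallback only matters when the package is not installed in the docs build environment.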

API docs rewritten:

  • docs/api/gemm.rst — removed 8 nonexistent functions, documented 25+ actual APIs
  • docs/api/attention.rst — removed fabricated GQA/MQA, documented PA/MHA/MLA
  • docs/api/operators.rst — documented actual norm/activation/rope/quant/sample/cache
  • docs/tutorials/add_new_op.rst — CUDA to HIP/Triton, CUDAExtension to AITER JIT

New pages:

  • docs/api/moe.rst, docs/api/normalization.rst
  • docs/compatibility.rst, docs/models.rst, docs/gemm_tuning.rst
  • docs/changelog.rst, docs/contributing.rst
  • docs/advanced/triton_kernels.rst, docs/performance/benchmarks.rst

Secrets Required (for release-notify.yml)

  • AITER_RELEASE_SLACK_WEBHOOK — Slack incoming webhook URL
  • CROSS_REPO_TOKEN — PAT with issues:write on ROCm/ATOM

Test plan

  • CI docs.yml build passes (Sphinx with -W)
  • Linkcheck reports no critical broken links
  • Manual: trigger workflow_dispatch and verify docs deploy to gh-pages
  • Review API docs against actual function signatures

Follow PyTorch's wheel naming convention (e.g. +rocm7.2.1) for AITER
release wheels. This enables building distinct wheels for different
ROCm versions from the same workflow.

Changes:
- Add rocm_version input (auto-detects from container if empty)
- Use SETUPTOOLS_SCM_PRETEND_VERSION for version+rocm suffix
- Include ROCm version in concurrency group to prevent cross-version
  cancellation
- Update artifact naming to include ROCm suffix
- Default runner: aiter-k8s-build -> aiter-1gpu-runner (actually exists)
- Remove non-existent runners: aiter-mi300-1gpu, aiter-mi325-1gpu
- Fix runner typo: linux-aiter-mi355-1 -> linux-aiter-mi35x-1
- Fix Docker username: rocmshard -> rocmshared (missing 'e')
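
The wheel-version rule described above (base version plus an optional `+rocm<version>` local suffix) can be sketched in Python as follows. This is not the workflow's own code, which is a shell step; the function names `wheel_version` and `artifact_suffix` and the input validation are assumptions added for illustration.

```python
import re

# Accepts ROCm versions such as "7.2" or "7.2.1" (assumed format for this sketch).
_ROCM_RE = re.compile(r"^\d+(\.\d+){1,2}$")


def wheel_version(base_version: str, rocm_version: str = "") -> str:
    """Append a PyTorch-style '+rocmX.Y.Z' local suffix when a ROCm version is given."""
    rocm_version = rocm_version.strip()
    if not rocm_version:
        return base_version  # no ROCm version known: keep the plain version
    if not _ROCM_RE.match(rocm_version):
        raise ValueError(f"unexpected ROCm version: {rocm_version!r}")
    return f"{base_version}+rocm{rocm_version}"


def artifact_suffix(rocm_version: str = "") -> str:
    """Return 'rocmX.Y.Z' for artifact names, or an empty string when unknown."""
    rocm_version = rocm_version.strip()
    return f"rocm{rocm_version}" if rocm_version else ""
```

The value returned by `wheel_version` is what would be exported as `SETUPTOOLS_SCM_PRETEND_VERSION` before the build. Returning an empty suffix when no ROCm version is detected avoids artifact names that end in a bare `rocm`.
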

setuptools_scm 10.x moved to the vcs_versioning package, breaking the
build with ModuleNotFoundError. Pin to 9.x until pyproject.toml is
updated.
…dd new pages, automate deployment

- Fix conf.py version to auto-detect from setuptools_scm (was hardcoded 0.1.0)
- Fix docs.yml: add release trigger, enable -W (warnings as errors), add linkcheck
- Remove doc.aiter.amd.com CNAME reference (DNS was never configured)
- Rewrite gemm.rst: remove 8 nonexistent functions, document actual 25+ GEMM APIs
- Rewrite attention.rst: remove fabricated GQA/MQA, document PA/MHA/MLA APIs
- Rewrite operators.rst: document actual norm/activation/rope/quant/sample/cache APIs
- Rewrite add_new_op.rst: replace CUDA build system with HIP/Triton AITER JIT pattern
- Add new pages: compatibility matrix, supported models, GEMM tuning guide
- Add new API docs: moe.rst, normalization.rst
- Add stubs: changelog, contributing, triton_kernels, benchmarks
- Add release-notify.yml: Slack webhook + downstream tracking issue automation
- Add changelog-config.json for auto-generated release notes
@sunway513 sunway513 requested review from a team and Copilot April 12, 2026 19:12
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label          Tests
-------------  ----------------------------------------------
ci:triton-355  Run Triton tests on MI355 in addition to MI325
ci:sglang      SGLang integration tests
ci:atom        ATOM benchmark (DeepSeek-R1 + GPT-OSS)
ci:vllm        vLLM benchmark
ci:all         All of the above

Add labels via the sidebar or gh pr edit 2706 --add-label <label>

Contributor

Copilot AI left a comment

Pull request overview

This PR overhauls AITER’s Sphinx documentation to better reflect the current ROCm/HIP + Triton JIT-based codebase, and updates CI/workflows to build/deploy docs on releases plus send release notifications.

Changes:

  • Rewrites major API docs (GEMM, attention, operators) and updates the “add new op” tutorial to match the @compile_ops JIT flow.
  • Adds new documentation pages (compatibility matrix, models, GEMM tuning guide, benchmarks, MoE/normalization APIs).
  • Updates GitHub Actions for docs builds (-W, linkcheck, release deploy) and adds a release notification workflow.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 15 comments.

File                                  Description
------------------------------------  ----------------------------------------------------------------------
docs/tutorials/add_new_op.rst         Reworked tutorial to document HIP/Triton JIT flow and testing patterns
docs/performance/benchmarks.rst       New page describing performance/benchmark resources
docs/models.rst                       New supported-model matrix and tuned-config references
docs/index.rst                        Updates landing page messaging, install snippet, TOCs, and support matrix
docs/gemm_tuning.rst                  New GEMM tuning workflow guide
docs/contributing.rst                 New contributor setup/testing/linting guidance
docs/conf.py                          Auto-detects docs version instead of hardcoding
docs/compatibility.rst                New ROCm + GPU compatibility/install matrix
docs/changelog.rst                    New changelog landing page pointing to GitHub Releases
docs/api/operators.rst                Rewritten operator inventory aligned to actual modules/APIs
docs/api/normalization.rst            New normalization API page
docs/api/moe.rst                      New MoE API page
docs/api/gemm.rst                     Rewritten GEMM API docs across backends/precisions
docs/api/attention.rst                Rewritten attention API docs (MHA/PA/MLA)
docs/advanced/triton_kernels.rst      New Triton backend overview page
.github/workflows/release-notify.yml  Adds Slack + downstream tracking issue automation on releases
.github/workflows/docs.yml            Builds docs with -W, runs linkcheck, deploys on release
.github/workflows/aiter-release.yaml  Adds ROCm-version wheel suffixing + artifact naming adjustments
.github/changelog-config.json         New config for changelog categorization

Comment on lines +75 to +78
void my_op_fwd(torch::Tensor input, torch::Tensor output) {
    int n = input.numel();
    auto stream = at::cuda::getCurrentHIPStream().stream();
    launch_my_op(input.data_ptr(), output.data_ptr(), n, stream);
Comment on lines +125 to +133
def gen_my_op_fake(input: Tensor) -> Tensor:
    """Fake tensor impl for torch.compile tracing."""
    return torch.empty_like(input)

# Build and install
python setup.py develop

# Or for production
python setup.py install
@compile_ops("module_my_op", gen_fake=gen_my_op_fake)
def my_op_fwd(input: Tensor) -> Tensor:
    """My custom operator."""
    ...
Comment on lines +206 to 222
# op_tests/test_my_op.py
import torch
import aiter
from aiter.test_common import checkAllclose, benchmark

Print Kernel Launches
^^^^^^^^^^^^^^^^^^^^^
@benchmark()
def test_my_op(m, n, dtype):
    ret = {}
    input = torch.randn(m, n, dtype=dtype, device="cuda")

.. code-block:: bash
    # Reference (PyTorch)
    ref_output = input.clone()  # replace with actual reference

export HIP_VISIBLE_DEVICES=0
export AMD_LOG_LEVEL=3 # Verbose logging
    # AITER op
    output = torch.empty_like(input)
    aiter.my_op_fwd(output, input)

Comment on lines +275 to +276
5. **Run ruff and pytest before committing.** Lint and test locally before
   pushing.
Comment on lines +17 to +22
AITER includes kernel-level benchmarks in its test suite. To run them:

.. code-block:: bash

   pytest tests/ -k "benchmark"

Comment on lines +42 to +68
traffic and improve throughput.

.. py:function:: rmsnorm2d_fwd_with_add(input, residual, weight, eps)

   Fused residual addition and RMS normalization. Computes
   ``rmsnorm(input + residual)`` in a single kernel.

   :param input: Input tensor.
   :param residual: Residual tensor to add before normalization.
   :param weight: Learnable scale parameter.
   :param eps: Numerical stability constant.

.. py:function:: rmsnorm2d_fwd_with_smoothquant(...)

   Fused RMS normalization with SmoothQuant. Applies per-channel smooth
   quantization scales after normalization.

.. py:function:: rmsnorm2d_fwd_with_dynamicquant(...)

   Fused RMS normalization with dynamic quantization. Computes quantization
   parameters on the fly and outputs quantized activations.

.. py:function:: add_rmsnorm_quant(...)

   Fused residual add + RMS normalization + quantization in a single kernel.
   Combines three operations to minimize global memory round-trips.
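
For readers checking the quoted description, the expression ``rmsnorm(input + residual)`` can be spelled out with the textbook RMSNorm formula. The sketch below is a plain-Python reference over a single row of floats, not the AITER kernel; the function name `rmsnorm_with_add`, the default `eps`, and the list-based interface are assumptions for illustration.

```python
import math


def rmsnorm_with_add(x, residual, weight, eps=1e-6):
    """Reference rmsnorm(x + residual) for one row, using the textbook formula.

    y_i = (x_i + r_i) / sqrt(mean((x + r)^2) + eps) * weight_i
    """
    summed = [a + b for a, b in zip(x, residual)]           # residual add
    mean_sq = sum(v * v for v in summed) / len(summed)       # mean of squares
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)                 # 1 / RMS
    return [v * inv_rms * w for v, w in zip(summed, weight)]  # scale by weight
```

A fused kernel would aim to produce the same values while reading and writing the tensors once; a reference like this is mainly useful as the comparison target in a unit test.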

"type": "section",
"text": {
"type": "mrkdwn",
"text": "${{ github.event.release.body && '```' || 'See release page for details.' }}"
Comment on lines +78 to +81
'### Install',
'```bash',
'pip install amd-aiter --find-links ' + url,
'```',
Comment on lines +158 to +166
if [ -n "$ROCM_VER" ]; then
  FULL_VER="${BASE_VER}+rocm${ROCM_VER}"
else
  FULL_VER="${BASE_VER}"
fi
echo "Wheel version: $FULL_VER (base=$BASE_VER, rocm=$ROCM_VER)"
echo "full_version=$FULL_VER" >> "$GITHUB_OUTPUT"
echo "rocm_suffix=rocm${ROCM_VER}" >> "$GITHUB_OUTPUT"

Comment on lines +30 to +59
.. py:function:: fmoe(input, w1, w2, topk_weights, topk_ids, ...)

   Main fused MoE forward pass. Dispatches tokens to selected experts, applies
   expert weights (w1, w2), and combines results.

   :param input: Hidden states of shape ``(num_tokens, hidden_dim)``.
   :param w1: First expert weight matrix.
   :param w2: Second expert weight matrix.
   :param topk_weights: Per-token expert weights from gating.
   :param topk_ids: Per-token expert indices from gating.

.. py:function:: fmoe_g1u1(input, gate, up, down, topk_weights, topk_ids, ...)

   Fused MoE with separate gate, up, and down projections (GLU-style).
   Used by architectures that split the MoE FFN into gate/up/down matrices.

.. py:function:: fmoe_int8_g1u0(...)

   INT8 quantized fused MoE. Applies INT8 weight-only quantization to the
   expert computations.

.. py:function:: fmoe_fp8_blockscale_g1u1(...)

   FP8 block-scale quantized fused MoE with gate/up/down projections.
   Uses per-block scaling factors for FP8 computation.

.. py:function:: fused_moe(hidden_states, w1, w2, gating_output, topk, ...)

   High-level fused MoE entry point (from ``aiter/fused_moe.py``). Combines
   gating and expert computation in a single call.
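
The "dispatch to selected experts, then combine" step in the quoted description can be illustrated for a single token as below. This is a conceptual sketch in plain Python, not AITER's implementation: experts are modelled as hypothetical callables instead of `w1`/`w2` weight matrices, and the name `moe_combine` is an assumption.

```python
def moe_combine(token, experts, topk_ids, topk_weights):
    """Weighted sum of the selected experts' outputs for one token.

    token:        list of floats (one row of the hidden states)
    experts:      list of callables, each mapping a token to a same-length list
                  (a stand-in for the per-expert FFN; hypothetical interface)
    topk_ids:     indices of the experts chosen by gating for this token
    topk_weights: gating weights matching topk_ids
    """
    out = [0.0] * len(token)
    for expert_id, gate in zip(topk_ids, topk_weights):
        expert_out = experts[expert_id](token)  # run only the selected expert
        out = [acc + gate * val for acc, val in zip(out, expert_out)]
    return out
```

Only the experts listed in `topk_ids` are evaluated, which is the point of the routing; a fused kernel performs the same selection and weighted sum for all tokens at once.
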
5 test categories to prevent doc rot:
1. API signature consistency — every autofunction/autoclass in RST must be importable
2. Code example syntax — Python code blocks must parse without SyntaxError
3. Sphinx build — build with -W (warnings as errors), catch broken refs
4. Version consistency — conf.py must use auto-detection, no hardcoded version
5. RST structure — no orphan API pages, no CUDA references in ROCm docs

Also adds test_docs.py to docs.yml CI trigger paths.
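
Category 2 above (code example syntax) could be checked along the lines of the following sketch. It is not the contents of test_docs.py, which is not shown here; the helper names `python_blocks` and `syntax_errors` are hypothetical, and the parser only handles the common ``.. code-block:: python`` layout.

```python
import ast
import textwrap


def python_blocks(rst_text):
    """Return the dedented bodies of all '.. code-block:: python' directives."""
    lines = rst_text.splitlines()
    blocks = []
    i = 0
    while i < len(lines):
        line = lines[i]
        i += 1
        if line.strip() != ".. code-block:: python":
            continue
        directive_indent = len(line) - len(line.lstrip())
        body = []
        while i < len(lines):
            current = lines[i]
            indent = len(current) - len(current.lstrip())
            if current.strip() and indent <= directive_indent:
                break  # a non-blank line at or left of the directive ends the block
            body.append(current)
            i += 1
        # Skip directive options such as ':linenos:' that precede the code.
        while body and body[0].strip().startswith(":"):
            body.pop(0)
        blocks.append(textwrap.dedent("\n".join(body)).strip("\n"))
    return blocks


def syntax_errors(rst_text):
    """Return one message per Python code block that fails to parse."""
    errors = []
    for block in python_blocks(rst_text):
        try:
            ast.parse(block)
        except SyntaxError as exc:
            errors.append(f"line {exc.lineno}: {exc.msg}")
    return errors
```

A test would read each `.rst` file under `docs/` and assert that `syntax_errors` returns an empty list. Parsing with `ast` checks syntax only, so examples that need a GPU still pass on CPU-only CI.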
…odoc imports

- Fix black formatting issues in test_docs.py and conf.py
- Fix ruff: remove unused os import, fix f-string without placeholders
- Remove autofunction/autoclass directives (fail on CPU-only CI, manual docs sufficient)
- Add autodoc_mock_imports for triton/ROCm modules in conf.py
- Remove deprecated display_version theme option
- Fix tutorials/index.rst: remove 9 references to nonexistent pages
- Fix quickstart.rst: point to existing pages instead of nonexistent tutorials
- Fix installation.rst: remove broken tutorials/triton_comms ref
- Fix basic_usage.rst: remove broken tutorial cross-references
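
The `autodoc_mock_imports` change mentioned above might look like the conf.py fragment below. `autodoc_mock_imports` is a standard `sphinx.ext.autodoc` setting; the exact module list used in this PR is not shown, so the entries here are partly assumed.

```python
# docs/conf.py (illustrative fragment, not the PR's actual file)
extensions = ["sphinx.ext.autodoc"]

# Modules that Sphinx replaces with mock objects at import time, so the docs
# build does not need a GPU stack installed.
# "triton" is named in the change list; "torch" is an assumed extra entry.
autodoc_mock_imports = ["triton", "torch"]
```

Mocking lets a CPU-only CI runner import the documented packages without failing on missing ROCm or Triton dependencies.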