Follow PyTorch's wheel naming convention (e.g. +rocm7.2.1) for AITER release wheels. This enables building distinct wheels for different ROCm versions from the same workflow.
Changes:
- Add rocm_version input (auto-detects from container if empty)
- Use SETUPTOOLS_SCM_PRETEND_VERSION for version+rocm suffix
- Include ROCm version in concurrency group to prevent cross-version cancellation
- Update artifact naming to include ROCm suffix
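As a rough illustration of the suffixing described in this commit, the sketch below composes a PyTorch-style local version and hands it over through SETUPTOOLS_SCM_PRETEND_VERSION. The function name `compose_wheel_version`, the validation regex, and the base version `0.1.5` are assumptions made for illustration; they are not taken from the workflow.

```python
import re


def compose_wheel_version(base_version: str, rocm_version: str = "") -> str:
    """Append a '+rocm<version>' local segment when a ROCm version is given."""
    rocm_version = rocm_version.strip()
    if not rocm_version:
        # No ROCm version detected: keep the plain base version.
        return base_version
    if not re.fullmatch(r"\d+(\.\d+)*", rocm_version):
        raise ValueError(f"unexpected ROCm version: {rocm_version!r}")
    return f"{base_version}+rocm{rocm_version}"


# Environment a build step could pass to the wheel build (hypothetical base version).
build_env = {
    "SETUPTOOLS_SCM_PRETEND_VERSION": compose_wheel_version("0.1.5", "7.2.1"),
}
print(build_env["SETUPTOOLS_SCM_PRETEND_VERSION"])  # prints 0.1.5+rocm7.2.1
```

Returning the bare base version when no ROCm version is available avoids emitting a dangling `+rocm` suffix in the wheel or artifact name.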
- Default runner: aiter-k8s-build -> aiter-1gpu-runner (actually exists)
- Remove non-existent runners: aiter-mi300-1gpu, aiter-mi325-1gpu
- Fix runner typo: linux-aiter-mi355-1 -> linux-aiter-mi35x-1
- Fix Docker username: rocmshard -> rocmshared (missing 'e')
setuptools_scm 10.x moved to vcs_versioning package, breaking the build with ModuleNotFoundError. Pin to 9.x until pyproject.toml is updated.
…dd new pages, automate deployment
- Fix conf.py version to auto-detect from setuptools_scm (was hardcoded 0.1.0)
- Fix docs.yml: add release trigger, enable -W (warnings as errors), add linkcheck
- Remove doc.aiter.amd.com CNAME reference (DNS was never configured)
- Rewrite gemm.rst: remove 8 nonexistent functions, document actual 25+ GEMM APIs
- Rewrite attention.rst: remove fabricated GQA/MQA, document PA/MHA/MLA APIs
- Rewrite operators.rst: document actual norm/activation/rope/quant/sample/cache APIs
- Rewrite add_new_op.rst: replace CUDA build system with HIP/Triton AITER JIT pattern
- Add new pages: compatibility matrix, supported models, GEMM tuning guide
- Add new API docs: moe.rst, normalization.rst
- Add stubs: changelog, contributing, triton_kernels, benchmarks
- Add release-notify.yml: Slack webhook + downstream tracking issue automation
- Add changelog-config.json for auto-generated release notes
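One way the conf.py auto-detection could look, assuming the package is installed under the distribution name `amd-aiter` (the name used in the install command quoted later in this PR). This sketch reads the installed package metadata through `importlib.metadata` instead of calling setuptools_scm directly; the helper name `detect_release` and the fallback string are hypothetical.

```python
from importlib.metadata import PackageNotFoundError
from importlib.metadata import version as dist_version


def detect_release(dist_name: str = "amd-aiter", fallback: str = "0.0.0.dev0") -> str:
    """Return the installed distribution version, or a fallback if not installed."""
    try:
        return dist_version(dist_name)
    except PackageNotFoundError:
        # Docs are being built without the package installed.
        return fallback


# Sphinx reads these two module-level names from conf.py.
release = detect_release()
# Short X.Y version: drop any local segment such as '+rocm7.2.1' first.
version = ".".join(release.split("+")[0].split(".")[:2])
```

The fallback keeps a docs build from failing on machines where the package is not installed, at the cost of showing a placeholder version there.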
Pull request overview
This PR overhauls AITER’s Sphinx documentation to better reflect the current ROCm/HIP + Triton JIT-based codebase, and updates CI/workflows to build/deploy docs on releases plus send release notifications.
Changes:
- Rewrites major API docs (GEMM, attention, operators) and updates the “add new op” tutorial to match the `@compile_ops` JIT flow.
- Adds new documentation pages (compatibility matrix, models, GEMM tuning guide, benchmarks, MoE/normalization APIs).
- Updates GitHub Actions for docs builds (`-W`, linkcheck, release deploy) and adds a release notification workflow.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| docs/tutorials/add_new_op.rst | Reworked tutorial to document HIP/Triton JIT flow and testing patterns |
| docs/performance/benchmarks.rst | New page describing performance/benchmark resources |
| docs/models.rst | New supported-model matrix and tuned-config references |
| docs/index.rst | Updates landing page messaging, install snippet, TOCs, and support matrix |
| docs/gemm_tuning.rst | New GEMM tuning workflow guide |
| docs/contributing.rst | New contributor setup/testing/linting guidance |
| docs/conf.py | Auto-detects docs version instead of hardcoding |
| docs/compatibility.rst | New ROCm + GPU compatibility/install matrix |
| docs/changelog.rst | New changelog landing page pointing to GitHub Releases |
| docs/api/operators.rst | Rewritten operator inventory aligned to actual modules/APIs |
| docs/api/normalization.rst | New normalization API page |
| docs/api/moe.rst | New MoE API page |
| docs/api/gemm.rst | Rewritten GEMM API docs across backends/precisions |
| docs/api/attention.rst | Rewritten attention API docs (MHA/PA/MLA) |
| docs/advanced/triton_kernels.rst | New Triton backend overview page |
| .github/workflows/release-notify.yml | Adds Slack + downstream tracking issue automation on releases |
| .github/workflows/docs.yml | Builds docs with -W, runs linkcheck, deploys on release |
| .github/workflows/aiter-release.yaml | Adds ROCm-version wheel suffixing + artifact naming adjustments |
| .github/changelog-config.json | New config for changelog categorization |
Comment on lines +75 to +78
    void my_op_fwd(torch::Tensor input, torch::Tensor output) {
        int n = input.numel();
        auto stream = at::cuda::getCurrentHIPStream().stream();
        launch_my_op(input.data_ptr(), output.data_ptr(), n, stream);
Comment on lines +125 to +133
    + def gen_my_op_fake(input: Tensor) -> Tensor:
    +     """Fake tensor impl for torch.compile tracing."""
    +     return torch.empty_like(input)
    +
    - # Build and install
    - python setup.py develop
    +
    - # Or for production
    - python setup.py install
    + @compile_ops("module_my_op", gen_fake=gen_my_op_fake)
    + def my_op_fwd(input: Tensor) -> Tensor:
    +     """My custom operator."""
    +     ...
Comment on lines +206 to 222
    + # op_tests/test_my_op.py
    + import torch
    + import aiter
    + from aiter.test_common import checkAllclose, benchmark
    +
    - Print Kernel Launches
    - ^^^^^^^^^^^^^^^^^^^^^
    + @benchmark()
    + def test_my_op(m, n, dtype):
    +     ret = {}
    +     input = torch.randn(m, n, dtype=dtype, device="cuda")
    +
    - .. code-block:: bash
    +     # Reference (PyTorch)
    +     ref_output = input.clone()  # replace with actual reference
    +
    - export HIP_VISIBLE_DEVICES=0
    - export AMD_LOG_LEVEL=3  # Verbose logging
    +     # AITER op
    +     output = torch.empty_like(input)
    +     aiter.my_op_fwd(output, input)
Comment on lines +275 to +276
    5. **Run ruff and pytest before committing.** Lint and test locally before
       pushing.
Comment on lines +17 to +22
    AITER includes kernel-level benchmarks in its test suite. To run them:

    .. code-block:: bash

       pytest tests/ -k "benchmark"
Comment on lines +42 to +68
    traffic and improve throughput.

    .. py:function:: rmsnorm2d_fwd_with_add(input, residual, weight, eps)

       Fused residual addition and RMS normalization. Computes
       ``rmsnorm(input + residual)`` in a single kernel.

       :param input: Input tensor.
       :param residual: Residual tensor to add before normalization.
       :param weight: Learnable scale parameter.
       :param eps: Numerical stability constant.

    .. py:function:: rmsnorm2d_fwd_with_smoothquant(...)

       Fused RMS normalization with SmoothQuant. Applies per-channel smooth
       quantization scales after normalization.

    .. py:function:: rmsnorm2d_fwd_with_dynamicquant(...)

       Fused RMS normalization with dynamic quantization. Computes quantization
       parameters on the fly and outputs quantized activations.

    .. py:function:: add_rmsnorm_quant(...)

       Fused residual add + RMS normalization + quantization in a single kernel.
       Combines three operations to minimize global memory round-trips.

    "type": "section",
    "text": {
      "type": "mrkdwn",
      "text": "${{ github.event.release.body && '```' || 'See release page for details.' }}"
Comment on lines +78 to +81
    '### Install',
    '```bash',
    'pip install amd-aiter --find-links ' + url,
    '```',
Comment on lines +158 to +166
    if [ -n "$ROCM_VER" ]; then
      FULL_VER="${BASE_VER}+rocm${ROCM_VER}"
    else
      FULL_VER="${BASE_VER}"
    fi
    echo "Wheel version: $FULL_VER (base=$BASE_VER, rocm=$ROCM_VER)"
    echo "full_version=$FULL_VER" >> "$GITHUB_OUTPUT"
    echo "rocm_suffix=rocm${ROCM_VER}" >> "$GITHUB_OUTPUT"
Comment on lines +30 to +59
    .. py:function:: fmoe(input, w1, w2, topk_weights, topk_ids, ...)

       Main fused MoE forward pass. Dispatches tokens to selected experts, applies
       expert weights (w1, w2), and combines results.

       :param input: Hidden states of shape ``(num_tokens, hidden_dim)``.
       :param w1: First expert weight matrix.
       :param w2: Second expert weight matrix.
       :param topk_weights: Per-token expert weights from gating.
       :param topk_ids: Per-token expert indices from gating.

    .. py:function:: fmoe_g1u1(input, gate, up, down, topk_weights, topk_ids, ...)

       Fused MoE with separate gate, up, and down projections (GLU-style).
       Used by architectures that split the MoE FFN into gate/up/down matrices.

    .. py:function:: fmoe_int8_g1u0(...)

       INT8 quantized fused MoE. Applies INT8 weight-only quantization to the
       expert computations.

    .. py:function:: fmoe_fp8_blockscale_g1u1(...)

       FP8 block-scale quantized fused MoE with gate/up/down projections.
       Uses per-block scaling factors for FP8 computation.

    .. py:function:: fused_moe(hidden_states, w1, w2, gating_output, topk, ...)

       High-level fused MoE entry point (from ``aiter/fused_moe.py``). Combines
       gating and expert computation in a single call.
5 test categories to prevent doc rot:
1. API signature consistency: every autofunction/autoclass in RST must be importable
2. Code example syntax: Python code blocks must parse without SyntaxError
3. Sphinx build: build with -W (warnings as errors), catch broken refs
4. Version consistency: conf.py must use auto-detection, no hardcoded version
5. RST structure: no orphan API pages, no CUDA references in ROCm docs
Also adds test_docs.py to docs.yml CI trigger paths.
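The "code example syntax" category can be sketched roughly as follows. This is not the contents of test_docs.py; the helper names `python_blocks` and `syntax_errors` are hypothetical, and the extraction only handles `.. code-block:: python` directives whose body is indented deeper than the directive line.

```python
import ast
import textwrap


def python_blocks(rst_text):
    """Return the dedented source of every '.. code-block:: python' body."""
    lines = rst_text.splitlines()
    blocks = []
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.strip() != ".. code-block:: python":
            i += 1
            continue
        directive_indent = len(line) - len(line.lstrip())
        i += 1
        body = []
        # The body runs until the next non-blank line at or below the directive's indent.
        while i < len(lines):
            cur = lines[i]
            if cur.strip() and len(cur) - len(cur.lstrip()) <= directive_indent:
                break
            body.append(cur)
            i += 1
        # Drop directive options such as ':linenos:' that precede the code.
        while body and body[0].strip().startswith(":"):
            body.pop(0)
        blocks.append(textwrap.dedent("\n".join(body)).strip("\n"))
    return blocks


def syntax_errors(rst_text):
    """Return (block_index, message) for each block that fails to parse."""
    errors = []
    for index, source in enumerate(python_blocks(rst_text)):
        try:
            ast.parse(source)  # parses only; nothing is imported or executed
        except SyntaxError as exc:
            errors.append((index, exc.msg))
    return errors


SAMPLE = """\
Usage
-----

.. code-block:: python

   import torch
   x = torch.ones(2)

.. code-block:: python

   def broken(:
       pass

Done.
"""

bad_blocks = [index for index, _ in syntax_errors(SAMPLE)]
```

Because `ast.parse` never executes the examples, a check like this can run on CPU-only CI even when the snippets import GPU-only modules.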
…odoc imports
- Fix black formatting issues in test_docs.py and conf.py
- Fix ruff: remove unused os import, fix f-string without placeholders
- Remove autofunction/autoclass directives (fail on CPU-only CI, manual docs sufficient)
- Add autodoc_mock_imports for triton/ROCm modules in conf.py
- Remove deprecated display_version theme option
- Fix tutorials/index.rst: remove 9 references to nonexistent pages
- Fix quickstart.rst: point to existing pages instead of nonexistent tutorials
- Fix installation.rst: remove broken tutorials/triton_comms ref
- Fix basic_usage.rst: remove broken tutorial cross-references
Summary
Files Changed (18 files)
Infrastructure:
API docs rewritten:
New pages:
Secrets Required (for release-notify.yml)
Test plan