docs: add empirical energy efficiency data to quantization concept guide #2410
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thank you again for the merge, Régis! I've been thinking about extending this into a cross-backend energy benchmark (GPTQ/AWQ/GGUF). Would email or the HF community be the best way to share a brief proposal?
Thanks for the pointer to the serving documentation. One observation worth sharing: after we filed this issue, the PyTorch torchao team confirmed that energy efficiency is not a priority for torchao and that native Hugging Face inference is not their target path; their focus is on serving engines like vLLM/SGLang with torch.compile. This leaves an important gap: thousands of developers and researchers run inference directly through HF Transformers, many without realizing that their quantization choices may actually increase energy consumption rather than reduce it. Our benchmarks show that for models under ~4B parameters, NF4 quantization raises energy by 20–55%, and FP8 (torchao) can incur up to a +701% energy penalty in the current eager-mode path. This is exactly why we built EcoCompute AI: to provide systematic, hardware-aware energy benchmarks so developers can make informed deployment decisions. We believe energy efficiency deserves the same attention as throughput and latency in the ML ecosystem, and we'd welcome any collaboration or feedback from the HF team on integrating energy awareness into the Transformers ecosystem.
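The percentage figures above compare energy per inference run between quantized and FP16 baselines. As a minimal sketch of how such numbers can be derived (function names are illustrative, not taken from the EcoCompute code), sampled GPU power is integrated over the run to get joules, and the penalty is the relative difference:

```python
def energy_joules(samples):
    """Trapezoidal integration of (seconds, watts) power samples into joules."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total

def energy_penalty_pct(quantized_j, baseline_j):
    """Relative energy overhead of a quantized run vs. an FP16 baseline."""
    return 100.0 * (quantized_j - baseline_j) / baseline_j
```

For example, a constant 100 W draw sampled over 2 s integrates to 200 J, and a quantized run costing 310 J against a 200 J baseline would be reported as a +55% penalty.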
### Skill Info

- **Name**: ecocompute
- **ClawHub**: https://clawhub.ai/hongping-zh/ecocompute
- **GitHub (skills repo)**: https://github.com/openclaw/skills/tree/main/skills/hongping-zh/ecocompute
- **Version**: 2.5.0
- **Author**: Hongping Zhang

### What it does

EcoCompute is an LLM energy efficiency advisor powered by 113+ real GPU energy measurements across 3 NVIDIA architectures (RTX 5090 Blackwell, RTX 4090D Ada Lovelace, A800 Ampere) and 5 quantization methods (FP16, FP8, NF4, INT8-mixed, INT8-pure). It provides 5 protocols: OPTIMIZE, DIAGNOSE, COMPARE, ESTIMATE, and AUDIT, helping users avoid common energy waste patterns that cost 17–701% more energy than optimal configurations.

### Why it belongs in this list

- **Unique data**: World's first RTX 5090 five-precision energy benchmark
- **Officially referenced**: Data cited in Hugging Face Optimum's official quantization docs
- **Community impact**: torchao team (PyTorch) confirmed the FP8 energy findings (Issue #4094)
- **Peer-reviewed quality**: 113+ measurements, NVML at 10 Hz, n=3–10 runs, CV < 2%
- **Archived**: Zenodo DOI 10.5281/zenodo.18900289 + Hugging Face Hub dataset

### Key findings that agents get wrong without this skill

- Default `load_in_8bit=True` wastes 17–147% energy (fixable with one line)
- NF4 on small models (<3B) wastes 29% energy
- FP8 eager mode wastes 158–701% energy (torchao confirmed)
- Batch size 1 wastes up to 95.7% energy

### Links

- Paper & data: https://github.com/hongping-zh/ecocompute-ai
- HF Dataset: https://huggingface.co/datasets/hongpingzhang/ecocompute-energy-efficiency
- HF Optimum PR #2410: huggingface/optimum#2410

---

## Entry to add (in AI & LLMs section, alphabetical order)

```markdown
- [ecocompute](https://clawhub.ai/hongping-zh/ecocompute) - LLM energy efficiency advisor with 113+ RTX 5090/4090D/A800 measurements. Detects quantization energy traps (INT8 +147%, FP8 +701%) and saves 30-701% wasted GPU energy. Referenced in HuggingFace Optimum official docs.
```

---

## Checklist

- [x] Skill is published on ClawHub
- [x] Skill is in the openclaw/skills repo (auto-synced)
- [x] Not submitted within 3 hours of creation (published weeks ago)
- [x] Has real-world usage and external validation (HF official docs, torchao confirmation)
- [x] Entry includes a ClawHub link
- [x] Placed in the correct category (AI & LLMs)
- [x] Alphabetical order maintained
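The batch-size finding in the list above is an amortization effect: a step's fixed power draw is paid once regardless of how many requests share it. A toy model of that effect (the numbers and names below are illustrative, not taken from the dataset):

```python
def energy_per_request_j(idle_watts, dynamic_watts, step_seconds, batch_size):
    """Toy model: the whole step's energy (idle plus dynamic draw over the
    step duration) is split across the requests served in that step."""
    step_joules = (idle_watts + dynamic_watts) * step_seconds
    return step_joules / batch_size
```

At 100 W idle plus 200 W dynamic draw over a 1 s step, batch size 1 pays 300 J per request while batch size 8 pays 37.5 J; most of the batch-1 energy is shared overhead, which is the waste pattern the skill flags.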
Hi @regisss, hope you're doing well! Quick update on the energy efficiency work since you merged this PR, and a request.

### Progress since PR merge

The dataset behind this documentation has grown significantly:
This gap matters: thousands of developers use HF Transformers directly for quantized inference, and many don't realize their quantization choices may increase energy consumption. For models under ~4B parameters, NF4 raises energy by 20–55%, and FP8 can be 7x worse in the current eager-mode path.

### Request: brief feedback for future research

I'm preparing a research paper based on this work: "EcoCompute: Energy Efficiency Benchmark for Quantized Language Models". The paper formalizes the findings that are now part of the official Optimum documentation, including the quantization paradoxes and the five-precision benchmark methodology. Since you reviewed and merged the original data into Hugging Face's official docs, would you be open to providing brief feedback on the research direction? Even a simple confirmation like:
This would help establish credibility for the research direction, especially since the data is already trusted enough to be part of HF's official guides. I completely understand if this isn't possible; I'm happy to discuss via email: contact@hongping-zh.com. Thanks again for your support on this work, Régis! Looking forward to continuing the energy efficiency conversation.
Summary
Adds an "Energy efficiency in practice" section to the quantization concept guide (docs/source/concept_guides/quantization.mdx), providing empirical data that addresses the existing "(in theory)" qualifier in the introduction.

Motivation
The current documentation states that quantization "consumes less energy (in theory)." Through systematic benchmarking across multiple GPU architectures and model sizes, we found that this assumption does not always hold — particularly for small models (<3B parameters) where dequantization overhead can increase energy consumption by 25–56%.
This addition provides practitioners with concrete, data-backed guidance on when quantization helps (and when it doesn't) from an energy perspective.
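One way to make that guidance concrete is a simple decision rule. The threshold below is an assumption distilled from the numbers quoted in this PR, not part of the benchmark itself:

```python
def suggest_precision(param_count_billions):
    """Illustrative rule of thumb from this PR's findings: below roughly 3B
    parameters, dequantization overhead can outweigh the memory savings,
    so staying in FP16 is often the lower-energy choice."""
    if param_count_billions < 3:
        return "fp16"
    return "nf4"
```

A 1.1B model would map to FP16, while a 7B model would map to NF4; in practice the crossover depends on hardware and batch size, so this should be validated by measurement.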
Changes
Added a new subsection covering:
Data Source
Related
Notes
Follows the existing `.mdx` formatting style and uses the `<Tip>` component
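For context, `<Tip>` callouts in Optimum's `.mdx` docs take this general shape (the wording below is illustrative, not the PR's actual text):

```markdown
<Tip>

Quantization reduces memory use, but for small models the dequantization
overhead can increase energy consumption. Benchmark on your target hardware
before assuming energy savings.

</Tip>
```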