From a1b10be2ee4848ad2a665221d5e65d8aee31ba88 Mon Sep 17 00:00:00 2001
From: hongping
Date: Mon, 9 Mar 2026 11:38:14 +0800
Subject: [PATCH 1/2] docs: add empirical energy efficiency data to quantization concept guide

---
 docs/source/concept_guides/quantization.mdx | 28 +++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/docs/source/concept_guides/quantization.mdx b/docs/source/concept_guides/quantization.mdx
index da5ebf45c1..927bdfed49 100644
--- a/docs/source/concept_guides/quantization.mdx
+++ b/docs/source/concept_guides/quantization.mdx
@@ -172,6 +172,34 @@ counterparts.
 6. Evaluate the quantized model: is the accuracy good enough? If yes, stop here, otherwise start again at step 3 but with quantization aware training this time.
 
+## Energy efficiency in practice
+
+The introduction above notes that quantization "consumes less energy (in theory)." Systematic benchmarking across
+NVIDIA Ada Lovelace (RTX 4090D) and Blackwell (RTX 5090) architectures reveals that the relationship between
+quantization and energy consumption is more nuanced in practice:
+
+- **Large models (≥5B parameters)**: NF4 quantization achieves near-FP16 energy consumption with significant memory
+savings; the expected benefit holds.
+- **Small models (<3B parameters)**: NF4 quantization can *increase* energy consumption by 25–56% despite achieving
+75% memory compression. The dequantization overhead exceeds the memory bandwidth savings at this scale.
+- **INT8 mixed-precision**: The default `llm_int8_threshold=6.0` in `bitsandbytes` adds 17–33% energy overhead
+compared to FP16, which is a justified cost for maintaining model accuracy.
+- **Batch size effect**: Increasing batch size from 1 to 8–64 reduces per-token energy by 84–96%, often outweighing
+the impact of precision choice.
+
+
+
+These findings suggest that energy-optimal deployment depends on model size, precision format, batch size, and
+hardware generation. Quantization remains beneficial for memory reduction, but its energy impact should be validated
+empirically for each deployment scenario.
+
+
+
+For detailed benchmarks and interactive visualizations, see the
+[EcoCompute-AI toolkit](https://github.com/hongping-zh/ecocompute-ai) and
+[interactive dashboard](https://hongping-zh.github.io/ecocompute-dynamic-eval/).
+The full dataset is archived on [Zenodo](https://zenodo.org/records/18900289).
+
 ## Supported tools to perform quantization in 🤗 Optimum
 
 🤗 Optimum provides APIs to perform quantization using different tools for different targets:

From 11ab33e83617baa4886d699bae164acb6a5d64e9 Mon Sep 17 00:00:00 2001
From: hongping
Date: Wed, 11 Mar 2026 15:32:34 +0800
Subject: [PATCH 2/2] Add HF Hub dataset link per reviewer request

---
 docs/source/concept_guides/quantization.mdx | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/source/concept_guides/quantization.mdx b/docs/source/concept_guides/quantization.mdx
index 927bdfed49..e7d934ba98 100644
--- a/docs/source/concept_guides/quantization.mdx
+++ b/docs/source/concept_guides/quantization.mdx
@@ -198,7 +198,8 @@ empirically for each deployment scenario.
 For detailed benchmarks and interactive visualizations, see the
 [EcoCompute-AI toolkit](https://github.com/hongping-zh/ecocompute-ai) and
 [interactive dashboard](https://hongping-zh.github.io/ecocompute-dynamic-eval/).
-The full dataset is archived on [Zenodo](https://zenodo.org/records/18900289).
+The full dataset is available on the [Hugging Face Hub](https://huggingface.co/datasets/hongpingzhang/ecocompute-energy-efficiency)
+and archived on [Zenodo](https://zenodo.org/records/18900289).
 
 ## Supported tools to perform quantization in 🤗 Optimum