SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

Luca Benini; Michele Magno
2025

Abstract

Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). However, while uniform-precision quantization is computationally efficient, it often compromises model performance. To address this, we propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths group-wise. Our approach leverages the observation that important weights follow a structured distribution and introduces two key components: 1) Salience-Determined Bit Allocation, which adaptively assigns bit-widths to groups within each layer based on their salience; and 2) Salience-Weighted Quantizer Calibration, which optimizes quantizer parameters by incorporating element-level salience. With its structured partitioning, SliM-LLM provides a hardware-friendly solution that matches the efficiency of uniform quantization methods while improving accuracy. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths. For example, a 2-bit quantized LLaMA-7B model reduces memory usage by nearly 6x compared to the floating-point baseline, decreases perplexity by 48% compared to state-of-the-art gradient-free PTQ methods, and maintains GPU inference speed. Additionally, the extended version, SliM-LLM+, which incorporates gradient-based quantization, further reduces perplexity by 35.1%.
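To make the idea of group-wise bit allocation concrete, here is a minimal sketch in plain Python. It is not the authors' algorithm: the salience proxy (squared weight times mean squared input activation), the fixed number of promoted groups, and the function names `group_salience` and `allocate_bits` are all assumptions chosen for illustration. It only shows how pairing each promoted group with a demoted one keeps the average bit-width at the target.

```python
import random


def group_salience(weights, act_sq_mean, group_size):
    """Sum an assumed per-element salience proxy, w^2 * E[x^2], over each column group."""
    n_cols = len(weights[0])
    scores = []
    for start in range(0, n_cols, group_size):
        cols = range(start, min(start + group_size, n_cols))
        scores.append(
            sum(row[c] ** 2 * act_sq_mean[c] for row in weights for c in cols)
        )
    return scores


def allocate_bits(scores, target_bits, n_promoted):
    """Assign target_bits + 1 to the most salient groups and target_bits - 1
    to the least salient ones; all other groups keep target_bits.

    Promotions and demotions come in equal numbers, so the mean stays at target_bits.
    """
    if 2 * n_promoted > len(scores):
        raise ValueError("n_promoted must be at most half the number of groups")
    order = sorted(range(len(scores)), key=lambda g: scores[g])  # ascending salience
    bits = [target_bits] * len(scores)
    for g in order[:n_promoted]:
        bits[g] = target_bits - 1
    for g in order[len(order) - n_promoted:]:
        bits[g] = target_bits + 1
    return bits


# Toy layer: 4 output rows, 32 input columns, groups of 8 columns.
random.seed(0)
rows, cols, group_size = 4, 32, 8
weights = [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Hypothetical calibration statistics: the third group sees much larger activations.
act_sq_mean = [1.0] * cols
for c in range(16, 24):
    act_sq_mean[c] = 100.0

scores = group_salience(weights, act_sq_mean, group_size)
bits = allocate_bits(scores, target_bits=2, n_promoted=1)
print(bits)
```

In this toy setup the third group receives 3 bits, one low-salience group drops to 1 bit, and the layer still averages 2 bits per weight. Because every weight in a group shares one bit-width, the layout stays regular, which is what the abstract means by a structured, hardware-friendly partitioning.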
Forty-second International Conference on Machine Learning (ICML 2025)
Huang, W., Qin, H., Liu, Y., Li, Y., Liu, Q., Liu, X., et al. (2025). SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [10.48550/arxiv.2405.14917].
Huang, Wei; Qin, Haotong; Liu, Yangdong; Li, Yawei; Liu, Qinshuo; Liu, Xianglong; Benini, Luca; Magno, Michele; Zhang, Shiming; Qi, Xiaojuan
Use this identifier to cite or link to this document: https://hdl.handle.net/11585/1042816