Unverified Commit 4cefa9b4 authored by Simon Mo, committed by GitHub

[Docs] Update the AWQ documentation to highlight performance issue (#1883)

parent f86bd619
@@ -3,6 +3,12 @@
AutoAWQ
==================
.. warning::

   Please note that AWQ support in vLLM is under-optimized at the moment. We would recommend using the unquantized version of the model for better
   accuracy and higher throughput. Currently, you can use AWQ as a way to reduce memory footprint. As of now, it is more suitable for low-latency
   inference with a small number of concurrent requests. vLLM's AWQ implementation has lower throughput than the unquantized version.
To create a new 4-bit quantized model, you can leverage `AutoAWQ <https://github.com/casper-hansen/AutoAWQ>`_.
Quantizing reduces the model's precision from FP16 to INT4, which effectively reduces the file size by ~70%.
The main benefits are lower latency and memory usage.
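
As a quick orientation, below is a minimal sketch of quantizing a model with AutoAWQ. The model name, output path, and quantization settings are illustrative placeholders, and the AutoAWQ API may change between releases, so consult the AutoAWQ repository for the current interface.

.. code-block:: python

   from awq import AutoAWQForCausalLM
   from transformers import AutoTokenizer

   # Illustrative placeholders: any FP16 causal LM and an output directory.
   model_path = "facebook/opt-125m"
   quant_path = "opt-125m-awq"

   # Typical AWQ settings: 4-bit weights, group size 128, zero-point enabled.
   quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

   # Load the FP16 model and its tokenizer.
   model = AutoAWQForCausalLM.from_pretrained(model_path)
   tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

   # Quantize (AutoAWQ runs calibration internally) and save the INT4 checkpoint.
   model.quantize(tokenizer, quant_config=quant_config)
   model.save_quantized(quant_path)
   tokenizer.save_pretrained(quant_path)

The resulting directory can then be loaded by vLLM by passing ``quantization="awq"`` when constructing the engine, subject to the throughput caveat in the warning above.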