Commit 4cefa9b4 (unverified)
Authored Dec 02, 2023 by Simon Mo, committed by GitHub on Dec 02, 2023
[Docs] Update the AWQ documentation to highlight performance issue (#1883)
Parent: f86bd619
Showing 1 changed file with 6 additions and 0 deletions (+6, -0)
docs/source/quantization/auto_awq.rst (+6, -0)
@@ -3,6 +3,12 @@
 AutoAWQ
 ==================

+.. warning::
+
+   Please note that AWQ support in vLLM is under-optimized at the moment. We would recommend using the unquantized
+   version of the model for better accuracy and higher throughput. Currently, you can use AWQ as a way to reduce
+   memory footprint. As of now, it is more suitable for low latency inference with a small number of concurrent
+   requests. vLLM's AWQ implementation has lower throughput than the unquantized version.

 To create a new 4-bit quantized model, you can leverage `AutoAWQ <https://github.com/casper-hansen/AutoAWQ>`_.
 Quantizing reduces the model's precision from FP16 to INT4, which effectively reduces the file size by ~70%. The main benefits are lower latency and memory usage.
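For reference, producing such a 4-bit checkpoint with AutoAWQ looks roughly like the sketch below. It follows the API described in the AutoAWQ README; the model path, output directory, and quant_config values are illustrative placeholders, not part of this commit.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "lmsys/vicuna-7b-v1.5"   # example FP16 source checkpoint (placeholder)
quant_path = "vicuna-7b-v1.5-awq"     # where the INT4 model is written (placeholder)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and its tokenizer.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run AWQ calibration and quantize the weights to 4 bits.
model.quantize(tokenizer, quant_config=quant_config)

# Persist the quantized weights and tokenizer.
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)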
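On the vLLM side, an AWQ checkpoint is loaded by passing quantization="awq" to the LLM constructor. The snippet below is a minimal sketch; the checkpoint name and sampling settings are placeholders.

from vllm import LLM, SamplingParams

# Point vLLM at an AWQ-quantized checkpoint and select the AWQ kernels.
llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="awq")

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(["What does AWQ quantization do?"], sampling_params)

for output in outputs:
    print(output.outputs[0].text)

As the warning added in this commit notes, this trades throughput for a smaller memory footprint, so it is best suited to low-latency serving with a small number of concurrent requests.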