norm / vllm · Commits

Unverified commit 26c52a5e, authored Dec 17, 2023 by Woosuk Kwon, committed by GitHub on Dec 17, 2023

[Docs] Add CUDA graph support to docs (#2148)

Parent: c3372e87
Showing 2 changed files, with 4 additions and 2 deletions:

README.md (+2, -1)
docs/source/index.rst (+2, -1)
README.md
@@ -35,6 +35,7 @@ vLLM is fast with:
 - State-of-the-art serving throughput
 - Efficient management of attention key and value memory with **PagedAttention**
 - Continuous batching of incoming requests
+- Fast model execution with CUDA/HIP graph
 - Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [SqueezeLLM](https://arxiv.org/abs/2306.07629)
 - Optimized CUDA kernels

@@ -45,7 +46,7 @@ vLLM is flexible and easy to use with:
 - Tensor parallelism support for distributed inference
 - Streaming outputs
 - OpenAI-compatible API server
-- Support NVIDIA GPUs and AMD GPUs.
+- Support NVIDIA GPUs and AMD GPUs

 vLLM seamlessly supports many Hugging Face models, including the following architectures:
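The new bullet, "Fast model execution with CUDA/HIP graph", refers to a runtime feature that is enabled by default in vLLM's Python API, alongside the quantization support already listed in context. Below is a minimal usage sketch, assuming the `enforce_eager` and `quantization` arguments of `vllm.LLM` as they existed around this release; the model names are illustrative placeholders, not part of this commit.

from vllm import LLM, SamplingParams

# CUDA/HIP graph execution is the default mode: vLLM captures decoding steps
# into graphs to reduce kernel-launch overhead. Setting enforce_eager=True
# falls back to plain eager execution (parameter name assumed from this era).
llm = LLM(model="mistralai/Mistral-7B-v0.1", enforce_eager=False)

# Quantized checkpoints are selected with the `quantization` argument, e.g.
# an AWQ model (hypothetical model id, shown only for illustration):
# llm = LLM(model="TheBloke/Mistral-7B-v0.1-AWQ", quantization="awq")

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["The capital of France is"], params)
for out in outputs:
    print(out.outputs[0].text)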
docs/source/index.rst
@@ -30,6 +30,7 @@ vLLM is fast with:
 * State-of-the-art serving throughput
 * Efficient management of attention key and value memory with **PagedAttention**
 * Continuous batching of incoming requests
+* Fast model execution with CUDA/HIP graph
 * Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, `SqueezeLLM <https://arxiv.org/abs/2306.07629>`_
 * Optimized CUDA kernels

@@ -40,7 +41,7 @@ vLLM is flexible and easy to use with:
 * Tensor parallelism support for distributed inference
 * Streaming outputs
 * OpenAI-compatible API server
-* Support NVIDIA GPUs and AMD GPUs.
+* Support NVIDIA GPUs and AMD GPUs

 For more information, check out the following:
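The unchanged context in this second hunk (tensor parallelism, streaming outputs, the OpenAI-compatible API server) describes vLLM's serving path. A rough sketch of how those pieces fit together follows, assuming the `vllm.entrypoints.openai.api_server` entry point and its `--tensor-parallel-size` flag from this era; the model name, GPU count, and port are placeholders.

# Launch the OpenAI-compatible server first (shell command shown as a comment):
#   python -m vllm.entrypoints.openai.api_server \
#       --model mistralai/Mistral-7B-v0.1 --tensor-parallel-size 2
# --tensor-parallel-size shards the model across 2 GPUs for distributed inference.

import json
import requests

# Query the server through the OpenAI-style /v1/completions endpoint
# (port 8000 is the default; adjust to your deployment).
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "mistralai/Mistral-7B-v0.1",
        "prompt": "San Francisco is a",
        "max_tokens": 32,
        "temperature": 0.7,
    },
    timeout=60,
)
print(json.dumps(resp.json(), indent=2))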