OpenDAS / vllm_cscc

Commit 19bdaf32 (unverified), authored Jun 03, 2025 by SorenDreano; committed by GitHub on Jun 03, 2025
[Doc] Readme standardization (#18695)

Co-authored-by: Soren Dreano <soren@numind.ai>

Parent: 02f0c7b2
1 changed file (README.md) with 5 additions and 5 deletions

README.md
@@ -58,8 +58,8 @@ vLLM is fast with:
 - Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
 - Continuous batching of incoming requests
 - Fast model execution with CUDA/HIP graph
-- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [AutoRound](https://arxiv.org/abs/2309.05516),INT4, INT8, and FP8.
-- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
+- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [AutoRound](https://arxiv.org/abs/2309.05516), INT4, INT8, and FP8
+- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
 - Speculative decoding
 - Chunked prefill
@@ -72,14 +72,14 @@ vLLM is flexible and easy to use with:
 - Tensor parallelism and pipeline parallelism support for distributed inference
 - Streaming outputs
 - OpenAI-compatible API server
-- Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPU, and AWS Neuron.
+- Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPU, and AWS Neuron
 - Prefix caching support
 - Multi-LoRA support
 vLLM seamlessly supports most popular open-source models on HuggingFace, including:
 - Transformer-like LLMs (e.g., Llama)
 - Mixture-of-Expert LLMs (e.g., Mixtral, Deepseek-V2 and V3)
-- Embedding Models (e.g. E5-Mistral)
+- Embedding Models (e.g., E5-Mistral)
 - Multi-modal LLMs (e.g., LLaVA)
 Find the full list of supported models [here](https://docs.vllm.ai/en/latest/models/supported_models.html).
@@ -162,4 +162,4 @@ If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs
 ## Media Kit
-- If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit).
+- If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit)
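The five changed lines in this commit follow a few mechanical punctuation rules. As a minimal sketch, assuming the rules are exactly the three that can be read off this diff (bullet items end without a period, "e.g." is followed by a comma, and a comma is followed by a space), a hypothetical `standardize_bullet` helper could apply them line by line. It is not part of vLLM or of this commit; it only illustrates the normalization.

```python
import re

# Hypothetical helper, not vLLM tooling: rules inferred from this commit's diff.
_EG_WITHOUT_COMMA = re.compile(r"\be\.g\.(?!,)")
_COMMA_WITHOUT_SPACE = re.compile(r",(?=\S)")


def standardize_bullet(line: str) -> str:
    """Apply the punctuation rules seen in this diff to one README line.

    Only Markdown bullet items ("- ...") are touched; other lines, such as
    headings or the "Find the full list ..." sentence, are returned unchanged.
    """
    if not line.startswith("- "):
        return line
    # "e.g. X" -> "e.g., X"; an existing "e.g.," is left alone.
    line = _EG_WITHOUT_COMMA.sub("e.g.,", line)
    # ",INT4" -> ", INT4": insert a space after a comma glued to the next token.
    line = _COMMA_WITHOUT_SPACE.sub(", ", line)
    # Bullet items end without a trailing period.
    if line.endswith("."):
        line = line[:-1]
    return line


if __name__ == "__main__":
    print(standardize_bullet("- Embedding Models (e.g. E5-Mistral)"))
    # prints "- Embedding Models (e.g., E5-Mistral)"
```

Regular expressions keep the sketch short, but the comma rule would corrupt a URL that contains a comma; a real Markdown linter would parse the document instead of matching raw lines.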