Commit b942c094 authored by Harry Mellor, committed by GitHub

Stop using title frontmatter and fix doc that can only be reached by search (#20623)


Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
parent b4bab816
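
The per-file change in this commit, a title-only frontmatter block replaced by a top-level heading, can be sketched in Python as below. This is a hypothetical helper for illustration, not the tooling used for the commit: the function name `frontmatter_title_to_heading` and the regular expression are assumptions, and the sketch only handles frontmatter whose sole key is `title`.

```python
import re

# Assumption: the frontmatter block sits at the very start of the file and
# contains only a single `title:` key, as in the hunks of this commit.
FRONTMATTER_RE = re.compile(r"\A---\ntitle: (?P<title>.+)\n---\n")


def frontmatter_title_to_heading(text: str) -> str:
    """Replace a title-only frontmatter block with a Markdown H1 heading.

    Text without such a block is returned unchanged.
    """
    # A function is used as the replacement so that backslashes in the
    # title are not interpreted as regex escape sequences.
    return FRONTMATTER_RE.sub(
        lambda match: f"# {match.group('title')}\n", text, count=1
    )


if __name__ == "__main__":
    before = "---\ntitle: Quantized KV Cache\n---\n\n## FP8 KV Cache\n"
    print(frontmatter_title_to_heading(before))
```

Three lines are removed and one is added per file, which matches the two-line offset in the `@@ -23,7 +21,7 @@` hunk header further down.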
- ---
- title: Quantized KV Cache
- ---
+ # Quantized KV Cache
  ## FP8 KV Cache
......
- ---
- title: AMD Quark
- ---
+ # AMD Quark
  Quantization can effectively reduce memory and bandwidth usage, accelerate computation and improve
  throughput while with minimal accuracy loss. vLLM can leverage [Quark](https://quark.docs.amd.com/latest/),
......
- ---
- title: Supported Hardware
- ---
+ # Supported Hardware
  The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
......
- ---
- title: Reasoning Outputs
- ---
+ # Reasoning Outputs
  vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which are designed to generate outputs containing both reasoning steps and final conclusions.
......
- ---
- title: Speculative Decoding
- ---
+ # Speculative Decoding
  !!! warning
  Please note that speculative decoding in vLLM is not yet optimized and does
......
- ---
- title: Structured Outputs
- ---
+ # Structured Outputs
  vLLM supports the generation of structured outputs using
  [xgrammar](https://github.com/mlc-ai/xgrammar) or
......
- ---
- title: Installation
- ---
+ # Installation
  vLLM supports the following hardware platforms:
......
- ---
- title: Quickstart
- ---
+ # Quickstart
  This guide will help you quickly get started with vLLM to perform:
......
- ---
- title: Loading models with Run:ai Model Streamer
- ---
+ # Loading models with Run:ai Model Streamer
  Run:ai Model Streamer is a library to read tensors in concurrency, while streaming it to GPU memory.
  Further reading can be found in [Run:ai Model Streamer Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/README.md).
......
- ---
- title: Loading models with CoreWeave's Tensorizer
- ---
+ # Loading models with CoreWeave's Tensorizer
  vLLM supports loading models with [CoreWeave's Tensorizer](https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer).
  vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or S3 endpoint can be deserialized
......
- ---
- title: Generative Models
- ---
+ # Generative Models
  vLLM provides first-class support for generative models, which covers most of LLMs.
......
- ---
- title: TPU
- ---
+ # TPU
  # TPU Supported Models
  ## Text-only Language Models
......
- ---
- title: Pooling Models
- ---
+ # Pooling Models
  vLLM also supports pooling models, including embedding, reranking and reward models.
......
- ---
- title: Supported Models
- ---
+ # Supported Models
  vLLM supports [generative](./generative_models.md) and [pooling](./pooling_models.md) models across various tasks.
  If a model supports more than one task, you can set the task via the `--task` argument.
......
- ---
- title: Distributed Inference and Serving
- ---
+ # Distributed Inference and Serving
  ## How to decide the distributed inference strategy?
......
- ---
- title: LangChain
- ---
+ # LangChain
  vLLM is also available via [LangChain](https://github.com/langchain-ai/langchain) .
......
- ---
- title: LlamaIndex
- ---
+ # LlamaIndex
  vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index) .
......
- ---
- title: Offline Inference
- ---
+ # Offline Inference
  Offline inference is possible in your own code using vLLM's [`LLM`][vllm.LLM] class.
@@ -23,7 +21,7 @@ The available APIs depend on the model type:
  !!! info
  [API Reference][offline-inference-api]
- ### Ray Data LLM API
+ ## Ray Data LLM API
  Ray Data LLM is an alternative offline inference API that uses vLLM as the underlying engine.
  This API adds several batteries-included capabilities that simplify large-scale, GPU-efficient inference:
......
- ---
- title: OpenAI-Compatible Server
- ---
+ # OpenAI-Compatible Server
  vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more! This functionality lets you serve models and interact with them using an HTTP client.
......
- ---
- title: Frequently Asked Questions
- ---
+ # Frequently Asked Questions
  > Q: How can I serve multiple models on a single port using the OpenAI API?
......