Commit b942c094, authored by Harry Mellor, committed by GitHub

Stop using title frontmatter and fix doc that can only be reached by search (#20623)


Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
parent b4bab816
---
title: Quantized KV Cache
---
# Quantized KV Cache
## FP8 KV Cache
......
---
title: AMD Quark
---
# AMD Quark
Quantization can effectively reduce memory and bandwidth usage, accelerate computation and improve
throughput while with minimal accuracy loss. vLLM can leverage [Quark](https://quark.docs.amd.com/latest/),
......
---
title: Supported Hardware
---
# Supported Hardware
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
......
---
title: Reasoning Outputs
---
# Reasoning Outputs
vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which are designed to generate outputs containing both reasoning steps and final conclusions.
......
---
title: Speculative Decoding
---
# Speculative Decoding
!!! warning
Please note that speculative decoding in vLLM is not yet optimized and does
......
---
title: Structured Outputs
---
# Structured Outputs
vLLM supports the generation of structured outputs using
[xgrammar](https://github.com/mlc-ai/xgrammar) or
......
---
title: Installation
---
# Installation
vLLM supports the following hardware platforms:
......
---
title: Quickstart
---
# Quickstart
This guide will help you quickly get started with vLLM to perform:
......
---
title: Loading models with Run:ai Model Streamer
---
# Loading models with Run:ai Model Streamer
Run:ai Model Streamer is a library to read tensors in concurrency, while streaming it to GPU memory.
Further reading can be found in [Run:ai Model Streamer Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/README.md).
......
---
title: Loading models with CoreWeave's Tensorizer
---
# Loading models with CoreWeave's Tensorizer
vLLM supports loading models with [CoreWeave's Tensorizer](https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer).
vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or S3 endpoint can be deserialized
......
---
title: Generative Models
---
# Generative Models
vLLM provides first-class support for generative models, which covers most of LLMs.
......
---
title: TPU
---
# TPU
# TPU Supported Models
## Text-only Language Models
......
---
title: Pooling Models
---
# Pooling Models
vLLM also supports pooling models, including embedding, reranking and reward models.
......
---
title: Supported Models
---
# Supported Models
vLLM supports [generative](./generative_models.md) and [pooling](./pooling_models.md) models across various tasks.
If a model supports more than one task, you can set the task via the `--task` argument.
......
---
title: Distributed Inference and Serving
---
# Distributed Inference and Serving
## How to decide the distributed inference strategy?
......
---
title: LangChain
---
# LangChain
vLLM is also available via [LangChain](https://github.com/langchain-ai/langchain) .
......
---
title: LlamaIndex
---
# LlamaIndex
vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index) .
......
---
title: Offline Inference
---
# Offline Inference
Offline inference is possible in your own code using vLLM's [`LLM`][vllm.LLM] class.
......
@@ -23,7 +21,7 @@ The available APIs depend on the model type:
!!! info
[API Reference][offline-inference-api]
### Ray Data LLM API
## Ray Data LLM API
Ray Data LLM is an alternative offline inference API that uses vLLM as the underlying engine.
This API adds several batteries-included capabilities that simplify large-scale, GPU-efficient inference:
......
---
title: OpenAI-Compatible Server
---
# OpenAI-Compatible Server
vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more! This functionality lets you serve models and interact with them using an HTTP client.
......
---
title: Frequently Asked Questions
---
# Frequently Asked Questions
> Q: How can I serve multiple models on a single port using the OpenAI API?
......