Commit b942c094 (unverified), authored by Harry Mellor, committed by GitHub

Stop using title frontmatter and fix doc that can only be reached by search (#20623)


Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
parent b4bab816
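
The hunks below all follow one pattern: a leading `title` frontmatter block is dropped and the same text becomes a top-level `#` heading. A minimal sketch of that rewrite, assuming the frontmatter holds only a `title` key and sits at the very start of the file; the function name `frontmatter_to_heading` is hypothetical and is not taken from the commit:

```python
import re

# Matches a frontmatter block at the very start of the text whose only key is `title`.
_FRONTMATTER = re.compile(r"\A---\ntitle: (?P<title>.+)\n---\n")


def frontmatter_to_heading(text: str) -> str:
    """Replace a leading `title` frontmatter block with an H1 heading.

    Text without such a block is returned unchanged.
    """
    # A function replacement avoids backslash-escape handling in the title text.
    return _FRONTMATTER.sub(lambda m: f"# {m.group('title')}\n", text, count=1)


if __name__ == "__main__":
    sample = "---\ntitle: Quantized KV Cache\n---\n## FP8 KV Cache\n"
    print(frontmatter_to_heading(sample))
```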
---
title: Quantized KV Cache
---
# Quantized KV Cache
## FP8 KV Cache
......
---
title: AMD Quark
---
# AMD Quark
Quantization can effectively reduce memory and bandwidth usage, accelerate computation and improve
throughput while with minimal accuracy loss. vLLM can leverage [Quark](https://quark.docs.amd.com/latest/),
......
---
title: Supported Hardware
---
# Supported Hardware
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
......
---
title: Reasoning Outputs
---
# Reasoning Outputs
vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which are designed to generate outputs containing both reasoning steps and final conclusions.
......
---
title: Speculative Decoding
---
# Speculative Decoding
!!! warning
    Please note that speculative decoding in vLLM is not yet optimized and does
......
---
title: Structured Outputs
---
# Structured Outputs
vLLM supports the generation of structured outputs using
[xgrammar](https://github.com/mlc-ai/xgrammar) or
......
---
title: Installation
---
# Installation
vLLM supports the following hardware platforms:
......
---
title: Quickstart
---
# Quickstart
This guide will help you quickly get started with vLLM to perform:
......
---
title: Loading models with Run:ai Model Streamer
---
# Loading models with Run:ai Model Streamer
Run:ai Model Streamer is a library to read tensors in concurrency, while streaming it to GPU memory.
Further reading can be found in [Run:ai Model Streamer Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/README.md).
......
---
title: Loading models with CoreWeave's Tensorizer
---
# Loading models with CoreWeave's Tensorizer
vLLM supports loading models with [CoreWeave's Tensorizer](https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer).
vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or S3 endpoint can be deserialized
......
---
title: Generative Models
---
# Generative Models
vLLM provides first-class support for generative models, which covers most of LLMs.
......
---
title: TPU
---
# TPU
# TPU Supported Models
## Text-only Language Models
......
---
title: Pooling Models
---
# Pooling Models
vLLM also supports pooling models, including embedding, reranking and reward models.
......
---
title: Supported Models
---
# Supported Models
vLLM supports [generative](./generative_models.md) and [pooling](./pooling_models.md) models across various tasks.
If a model supports more than one task, you can set the task via the `--task` argument.
......
---
title: Distributed Inference and Serving
---
# Distributed Inference and Serving
## How to decide the distributed inference strategy?
......
---
title: LangChain
---
# LangChain
vLLM is also available via [LangChain](https://github.com/langchain-ai/langchain) .
......
---
title: LlamaIndex
---
# LlamaIndex
vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index) .
......
---
title: Offline Inference
---
# Offline Inference
Offline inference is possible in your own code using vLLM's [`LLM`][vllm.LLM] class.
......
@@ -23,7 +21,7 @@ The available APIs depend on the model type:
!!! info
    [API Reference][offline-inference-api]
### Ray Data LLM API
## Ray Data LLM API
Ray Data LLM is an alternative offline inference API that uses vLLM as the underlying engine.
This API adds several batteries-included capabilities that simplify large-scale, GPU-efficient inference:
......
---
title: OpenAI-Compatible Server
---
# OpenAI-Compatible Server
vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more! This functionality lets you serve models and interact with them using an HTTP client.
......
---
title: Frequently Asked Questions
---
# Frequently Asked Questions
> Q: How can I serve multiple models on a single port using the OpenAI API?
......