Stop using title frontmatter and fix doc that can only be reached by search (#20623)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

Unverified commit b942c094, authored Jul 08, 2025 by Harry Mellor and committed by GitHub Jul 08, 2025
20 changed files
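The per-file change in the hunks below (drop a title-only frontmatter block, add a level-1 heading with the same text) can be sketched as a small Python helper. This is an illustration only: the commit does not say how the edits were produced, and the name `frontmatter_title_to_heading` and the regular expression are assumptions, not code from the repository.

```python
import re

# Hypothetical pattern: a frontmatter block at the very start of the file
# whose only key is `title`. Files with other frontmatter keys are not matched.
FRONTMATTER_TITLE = re.compile(r"\A---\ntitle: (?P<title>.+)\n---\n")


def frontmatter_title_to_heading(text: str) -> str:
    """Replace a leading title-only frontmatter block with a `# Title` heading."""
    # A function replacement avoids backslash-escape surprises in the title text.
    return FRONTMATTER_TITLE.sub(lambda m: f"# {m.group('title')}\n", text, count=1)


if __name__ == "__main__":
    before = "---\ntitle: Quickstart\n---\n\nThis guide will help you quickly get started.\n"
    print(frontmatter_title_to_heading(before))
```

Text that does not start with such a block is returned unchanged, so the helper can be applied to a file more than once.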
--- a/docs/features/quantization/quantized_kvcache.md
+++ b/docs/features/quantization/quantized_kvcache.md
----
-title: Quantized KV Cache
----
+# Quantized KV Cache

 ## FP8 KV Cache


--- a/docs/features/quantization/quark.md
+++ b/docs/features/quantization/quark.md
----
-title: AMD Quark
----
+# AMD Quark

 Quantization can effectively reduce memory and bandwidth usage, accelerate computation and improve
 throughput while with minimal accuracy loss. vLLM can leverage [Quark](https://quark.docs.amd.com/latest/),

--- a/docs/features/quantization/supported_hardware.md
+++ b/docs/features/quantization/supported_hardware.md
----
-title: Supported Hardware
----
+# Supported Hardware

 The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:


--- a/docs/features/reasoning_outputs.md
+++ b/docs/features/reasoning_outputs.md
----
-title: Reasoning Outputs
----
+# Reasoning Outputs

 vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which are designed to generate outputs containing both reasoning steps and final conclusions.


--- a/docs/features/spec_decode.md
+++ b/docs/features/spec_decode.md
----
-title: Speculative Decoding
----
+# Speculative Decoding

 !!! warning
    Please note that speculative decoding in vLLM is not yet optimized and does

--- a/docs/features/structured_outputs.md
+++ b/docs/features/structured_outputs.md
----
-title: Structured Outputs
----
+# Structured Outputs

 vLLM supports the generation of structured outputs using
 [xgrammar](https://github.com/mlc-ai/xgrammar) or

--- a/docs/getting_started/installation/README.md
+++ b/docs/getting_started/installation/README.md
----
-title: Installation
----
+# Installation

 vLLM supports the following hardware platforms:


--- a/docs/getting_started/quickstart.md
+++ b/docs/getting_started/quickstart.md
----
-title: Quickstart
----
+# Quickstart

 This guide will help you quickly get started with vLLM to perform:


--- a/docs/models/extensions/runai_model_streamer.md
+++ b/docs/models/extensions/runai_model_streamer.md
----
-title: Loading models with Run:ai Model Streamer
----
+# Loading models with Run:ai Model Streamer

 Run:ai Model Streamer is a library to read tensors in concurrency, while streaming it to GPU memory.
 Further reading can be found in [Run:ai Model Streamer Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/README.md).

--- a/docs/models/extensions/tensorizer.md
+++ b/docs/models/extensions/tensorizer.md
----
-title: Loading models with CoreWeave's Tensorizer
----
+# Loading models with CoreWeave's Tensorizer

 vLLM supports loading models with [CoreWeave's Tensorizer](https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer).
 vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or S3 endpoint can be deserialized

--- a/docs/models/generative_models.md
+++ b/docs/models/generative_models.md
----
-title: Generative Models
----
+# Generative Models

 vLLM provides first-class support for generative models, which covers most of LLMs.


--- a/docs/models/hardware_supported_models/tpu.md
+++ b/docs/models/hardware_supported_models/tpu.md
----
-title: TPU
----
+# TPU

 # TPU Supported Models
 ## Text-only Language Models

--- a/docs/models/pooling_models.md
+++ b/docs/models/pooling_models.md
----
-title: Pooling Models
----
+# Pooling Models

 vLLM also supports pooling models, including embedding, reranking and reward models.


--- a/docs/models/supported_models.md
+++ b/docs/models/supported_models.md
----
-title: Supported Models
----
+# Supported Models

 vLLM supports [generative](./generative_models.md) and [pooling](./pooling_models.md) models across various tasks.
 If a model supports more than one task, you can set the task via the `--task` argument.

--- a/docs/serving/distributed_serving.md
+++ b/docs/serving/distributed_serving.md
----
-title: Distributed Inference and Serving
----
+# Distributed Inference and Serving

 ## How to decide the distributed inference strategy?


--- a/docs/serving/integrations/langchain.md
+++ b/docs/serving/integrations/langchain.md
----
-title: LangChain
----
+# LangChain

 vLLM is also available via [LangChain](https://github.com/langchain-ai/langchain) .


--- a/docs/serving/integrations/llamaindex.md
+++ b/docs/serving/integrations/llamaindex.md
----
-title: LlamaIndex
----
+# LlamaIndex

 vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index) .


--- a/docs/serving/offline_inference.md
+++ b/docs/serving/offline_inference.md
----
-title: Offline Inference
----
+# Offline Inference

 Offline inference is possible in your own code using vLLM's [`LLM`][vllm.LLM] class.

@@ -23,7 +21,7 @@ The available APIs depend on the model type:
 !!! info
    [API Reference][offline-inference-api]

-### Ray Data LLM API
+## Ray Data LLM API

 Ray Data LLM is an alternative offline inference API that uses vLLM as the underlying engine.
 This API adds several batteries-included capabilities that simplify large-scale, GPU-efficient inference:

--- a/docs/serving/openai_compatible_server.md
+++ b/docs/serving/openai_compatible_server.md
----
-title: OpenAI-Compatible Server
----
+# OpenAI-Compatible Server

 vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more! This functionality lets you serve models and interact with them using an HTTP client.


--- a/docs/usage/faq.md
+++ b/docs/usage/faq.md
----
-title: Frequently Asked Questions
----
+# Frequently Asked Questions

 > Q: How can I serve multiple models on a single port using the OpenAI API?