[Docs] Rewrite offline inference guide (#20594)

Signed-off-by: Ricardo Decal <rdecal@anyscale.com>

Commit 0d914c81 (unverified), authored Jul 07, 2025 by Ricardo Decal; committed by GitHub on Jul 07, 2025

Showing with 19 additions and 8 deletions

docs/serving/offline_inference.md +19 -8

--- a/docs/serving/offline_inference.md
+++ b/docs/serving/offline_inference.md
@@ -3,10 +3,7 @@ title: Offline Inference
 ---
 [](){ #offline-inference }
-You can run vLLM in your own code on a list of prompts.
+Offline inference is possible in your own code using vLLM's [`LLM`][vllm.LLM] class.
-The offline API is based on the [LLM][vllm.LLM] class.
-To initialize the vLLM engine, create a new instance of `LLM` and specify the model to run.
 For example, the following code downloads the [`facebook/opt-125m`](https://huggingface.co/facebook/opt-125m) model from HuggingFace
 and runs it in vLLM using the default configuration.
@@ -14,16 +11,30 @@ and runs it in vLLM using the default configuration.
 ```python
 from vllm import LLM
+# Initialize the vLLM engine.
 llm = LLM(model="facebook/opt-125m")
 ```
-After initializing the `LLM` instance, you can perform model inference using various APIs.
+After initializing the `LLM` instance, use the available APIs to perform model inference.
-The available APIs depend on the type of model that is being run:
+The available APIs depend on the model type:
 - [Generative models][generative-models] output logprobs which are sampled from to obtain the final output text.
 - [Pooling models][pooling-models] output their hidden states directly.
-Please refer to the above pages for more details about each API.
 !!! info
    [API Reference][offline-inference-api]
+### Ray Data LLM API
+Ray Data LLM is an alternative offline inference API that uses vLLM as the underlying engine.
+This API adds several batteries-included capabilities that simplify large-scale, GPU-efficient inference:
+- Streaming execution processes datasets that exceed aggregate cluster memory.
+- Automatic sharding, load balancing, and autoscaling distribute work across a Ray cluster with built-in fault tolerance.
+- Continuous batching keeps vLLM replicas saturated and maximizes GPU utilization.
+- Transparent support for tensor and pipeline parallelism enables efficient multi-GPU inference.
+The following example shows how to run batched inference with Ray Data and vLLM:
+<gh-file:examples/offline_inference/batch_llm_inference.py>
+For more information about the Ray Data LLM API, see the [Ray Data LLM documentation](https://docs.ray.io/en/latest/data/working-with-llms.html).
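
Reviewer note: the guide initializes `LLM` but stops before an inference call. A minimal sketch of the generative path might look like the following. It assumes the `generate` method and `SamplingParams` class from the `vllm` package and the `.prompt` / `.outputs[0].text` fields on each result. The helper names `generate_texts` and `demo` are hypothetical and not part of this commit, and the sampling values are arbitrary.

```python
def generate_texts(llm, prompts, sampling_params=None):
    """Run a batch of prompts and return (prompt, generated_text) pairs.

    `llm` is assumed to behave like `vllm.LLM`: `generate` returns one result
    per prompt, each with `.prompt` and a list `.outputs` whose first entry
    carries the generated `.text`.
    """
    results = llm.generate(prompts, sampling_params=sampling_params)
    return [(r.prompt, r.outputs[0].text) for r in results]


def demo():
    """Real usage; needs vLLM installed and downloads the model on first run."""
    # Imported lazily so that this module loads without vLLM present.
    from vllm import LLM, SamplingParams

    # Initialize the vLLM engine, as in the guide.
    llm = LLM(model="facebook/opt-125m")
    # Illustrative sampling settings, not recommendations.
    params = SamplingParams(temperature=0.8, top_p=0.95)
    for prompt, text in generate_texts(llm, ["Hello, my name is"], params):
        print(f"{prompt!r} -> {text!r}")
```

Keeping the helper separate from engine construction lets it be exercised with a stub object, while `demo()` shows the intended call against a real `LLM` instance.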