offline_inference.md 1.67 KB
Newer Older
1
2
3
---
title: Offline Inference
---
4

5
Offline inference is possible in your own code using vLLM's [`LLM`][vllm.LLM] class.
6
7
8
9
10

For example, the following code downloads the [`facebook/opt-125m`](https://huggingface.co/facebook/opt-125m) model from HuggingFace
and runs it in vLLM using the default configuration.

```python
Reid's avatar
Reid committed
11
12
from vllm import LLM

13
# Initialize the vLLM engine.
14
15
16
llm = LLM(model="facebook/opt-125m")
```

17
18
After initializing the `LLM` instance, use the available APIs to perform model inference.
The available APIs depend on the model type:
19

20
21
- [Generative models](../models/generative_models.md) output logprobs which are sampled from to obtain the final output text.
- [Pooling models](../models/pooling_models.md) output their hidden states directly.
22

23
24
!!! info
    [API Reference][offline-inference-api]
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39

### Ray Data LLM API

Ray Data LLM is an alternative offline inference API that uses vLLM as the underlying engine.
This API adds several batteries-included capabilities that simplify large-scale, GPU-efficient inference:

- Streaming execution processes datasets that exceed aggregate cluster memory.
- Automatic sharding, load balancing, and autoscaling distribute work across a Ray cluster with built-in fault tolerance.
- Continuous batching keeps vLLM replicas saturated and maximizes GPU utilization.
- Transparent support for tensor and pipeline parallelism enables efficient multi-GPU inference.

The following example shows how to run batched inference with Ray Data and vLLM:
<gh-file:examples/offline_inference/batch_llm_inference.py>

For more information about the Ray Data LLM API, see the [Ray Data LLM documentation](https://docs.ray.io/en/latest/data/working-with-llms.html).