offline_inference.md 1.65 KB
Newer Older
1
# Offline Inference
2

3
Offline inference is possible in your own code using vLLM's [`LLM`][vllm.LLM] class.
4
5
6
7
8

For example, the following code downloads the [`facebook/opt-125m`](https://huggingface.co/facebook/opt-125m) model from HuggingFace
and runs it in vLLM using the default configuration.

```python
Reid's avatar
Reid committed
9
10
from vllm import LLM

11
# Initialize the vLLM engine.
12
13
14
llm = LLM(model="facebook/opt-125m")
```

15
16
After initializing the `LLM` instance, use the available APIs to perform model inference.
The available APIs depend on the model type:
17

18
19
- [Generative models](../models/generative_models.md) output logprobs which are sampled from to obtain the final output text.
- [Pooling models](../models/pooling_models.md) output their hidden states directly.
20

21
22
!!! info
    [API Reference][offline-inference-api]
23

24
## Ray Data LLM API
25
26
27
28
29
30
31
32
33
34
35
36
37

Ray Data LLM is an alternative offline inference API that uses vLLM as the underlying engine.
This API adds several batteries-included capabilities that simplify large-scale, GPU-efficient inference:

- Streaming execution processes datasets that exceed aggregate cluster memory.
- Automatic sharding, load balancing, and autoscaling distribute work across a Ray cluster with built-in fault tolerance.
- Continuous batching keeps vLLM replicas saturated and maximizes GPU utilization.
- Transparent support for tensor and pipeline parallelism enables efficient multi-GPU inference.

The following example shows how to run batched inference with Ray Data and vLLM:
<gh-file:examples/offline_inference/batch_llm_inference.py>

For more information about the Ray Data LLM API, see the [Ray Data LLM documentation](https://docs.ray.io/en/latest/data/working-with-llms.html).