---
title: Offline Inference
---
[](){ #offline-inference }

You can run vLLM in your own code on a list of prompts.

The offline API is based on the [LLM][vllm.LLM] class.
To initialize the vLLM engine, create a new instance of `LLM` and specify the model to run.

For example, the following code downloads the [`facebook/opt-125m`](https://huggingface.co/facebook/opt-125m) model from the Hugging Face Hub
and runs it in vLLM using the default configuration.

```python
from vllm import LLM

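# Fetch facebook/opt-125m from the Hugging Face Hub (cached after the first run)
# and load it with vLLM's default engine configuration.
llm = LLM(model="facebook/opt-125m")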
```

After initializing the `LLM` instance, you can perform model inference using various APIs.
The available APIs depend on the type of model that is being run:

- [Generative models][generative-models] output log probabilities (logprobs) over the vocabulary, which are sampled from to produce the final output text.
- [Pooling models][pooling-models] output their hidden states directly.

Please refer to the above pages for more details about each API.

!!! info
    [API Reference][offline-inference-api]