Commit 0d914c81
Authored Jul 07, 2025 by Ricardo Decal; committed by GitHub, Jul 07, 2025

[Docs] Rewrite offline inference guide (#20594)

Signed-off-by: Ricardo Decal <rdecal@anyscale.com>

parent 6e428cdd
Showing 1 changed file with 19 additions and 8 deletions.

docs/serving/offline_inference.md (+19, -8)
@@ -3,10 +3,7 @@ title: Offline Inference
 ---
 [](){ #offline-inference }
 
-You can run vLLM in your own code on a list of prompts.
-
-The offline API is based on the [LLM][vllm.LLM] class.
-To initialize the vLLM engine, create a new instance of `LLM` and specify the model to run.
+Offline inference is possible in your own code using vLLM's [`LLM`][vllm.LLM] class.
 
 For example, the following code downloads the [`facebook/opt-125m`](https://huggingface.co/facebook/opt-125m) model from HuggingFace
 and runs it in vLLM using the default configuration.
@@ -14,16 +11,30 @@ and runs it in vLLM using the default configuration.
 ```python
 from vllm import LLM
 
+# Initialize the vLLM engine.
 llm = LLM(model="facebook/opt-125m")
 ```
 
-After initializing the `LLM` instance, you can perform model inference using various APIs.
-The available APIs depend on the type of model that is being run:
+After initializing the `LLM` instance, use the available APIs to perform model inference.
+The available APIs depend on the model type:
 
 - [Generative models][generative-models] output logprobs which are sampled from to obtain the final output text.
 - [Pooling models][pooling-models] output their hidden states directly.
 
-Please refer to the above pages for more details about each API.
-
 !!! info
     [API Reference][offline-inference-api]
+
+### Ray Data LLM API
+
+Ray Data LLM is an alternative offline inference API that uses vLLM as the underlying engine.
+This API adds several batteries-included capabilities that simplify large-scale, GPU-efficient inference:
+
+- Streaming execution processes datasets that exceed aggregate cluster memory.
+- Automatic sharding, load balancing, and autoscaling distribute work across a Ray cluster with built-in fault tolerance.
+- Continuous batching keeps vLLM replicas saturated and maximizes GPU utilization.
+- Transparent support for tensor and pipeline parallelism enables efficient multi-GPU inference.
+
+The following example shows how to run batched inference with Ray Data and vLLM:
+<gh-file:examples/offline_inference/batch_llm_inference.py>
+
+For more information about the Ray Data LLM API, see the [Ray Data LLM documentation](https://docs.ray.io/en/latest/data/working-with-llms.html).
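
Note on the generative-models line in the diff ("output logprobs which are sampled from to obtain the final output text"): the sampling step it describes can be sketched in plain Python. This is a minimal illustration, not vLLM's implementation. The helper `sample_from_logprobs` and the toy next-token distribution are hypothetical, and the sketch assumes plain categorical sampling with no temperature, top-k, or top-p adjustments.

```python
import math
import random


def sample_from_logprobs(logprobs, rng):
    """Draw one token from a {token: log-probability} mapping.

    Hypothetical helper for illustration only; not part of vLLM.
    """
    tokens = list(logprobs)
    # Subtract the largest logprob before exponentiating for numerical stability.
    peak = max(logprobs.values())
    weights = [math.exp(logprobs[token] - peak) for token in tokens]
    # random.choices normalizes the relative weights internally.
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy next-token distribution (made-up values, not real model output).
toy_logprobs = {
    "Paris": math.log(0.7),
    "London": math.log(0.2),
    "Rome": math.log(0.1),
}

rng = random.Random(0)  # Seeded so repeated runs give the same draws.
draws = [sample_from_logprobs(toy_logprobs, rng) for _ in range(1000)]
print({token: draws.count(token) for token in toy_logprobs})
```

Over many draws the counts track the underlying probabilities, which is why two runs of a generative model on the same prompt can produce different text unless the sampling is seeded or made greedy.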