Unverified Commit d85c47d6 authored by Harry Mellor's avatar Harry Mellor Committed by GitHub
Browse files

Replace "online inference" with "online serving" (#11923)


Signed-off-by: default avatarHarry Mellor <19981378+hmellor@users.noreply.github.com>
parent ef725fea
...@@ -61,7 +61,7 @@ function cpu_tests() { ...@@ -61,7 +61,7 @@ function cpu_tests() {
pytest -s -v -k cpu_model \ pytest -s -v -k cpu_model \
tests/basic_correctness/test_chunked_prefill.py" tests/basic_correctness/test_chunked_prefill.py"
# online inference # online serving
docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c " docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
set -e set -e
export VLLM_CPU_KVCACHE_SPACE=10 export VLLM_CPU_KVCACHE_SPACE=10
......
...@@ -5,7 +5,7 @@ ...@@ -5,7 +5,7 @@
vLLM supports the generation of structured outputs using [outlines](https://github.com/dottxt-ai/outlines), [lm-format-enforcer](https://github.com/noamgat/lm-format-enforcer), or [xgrammar](https://github.com/mlc-ai/xgrammar) as backends for the guided decoding. vLLM supports the generation of structured outputs using [outlines](https://github.com/dottxt-ai/outlines), [lm-format-enforcer](https://github.com/noamgat/lm-format-enforcer), or [xgrammar](https://github.com/mlc-ai/xgrammar) as backends for the guided decoding.
This document shows you some examples of the different options that are available to generate structured outputs. This document shows you some examples of the different options that are available to generate structured outputs.
## Online Inference (OpenAI API) ## Online Serving (OpenAI API)
You can generate structured outputs using the OpenAI's [Completions](https://platform.openai.com/docs/api-reference/completions) and [Chat](https://platform.openai.com/docs/api-reference/chat) API. You can generate structured outputs using the OpenAI's [Completions](https://platform.openai.com/docs/api-reference/completions) and [Chat](https://platform.openai.com/docs/api-reference/chat) API.
...@@ -239,7 +239,7 @@ The main available options inside `GuidedDecodingParams` are: ...@@ -239,7 +239,7 @@ The main available options inside `GuidedDecodingParams` are:
- `backend` - `backend`
- `whitespace_pattern` - `whitespace_pattern`
These parameters can be used in the same way as the parameters from the Online Inference examples above. These parameters can be used in the same way as the parameters from the Online Serving examples above.
One example for the usage of the `choices` parameter is shown below: One example for the usage of the `choices` parameter is shown below:
```python ```python
......
...@@ -83,7 +83,7 @@ $ python setup.py develop ...@@ -83,7 +83,7 @@ $ python setup.py develop
## Supported Features ## Supported Features
- [Offline inference](#offline-inference) - [Offline inference](#offline-inference)
- Online inference via [OpenAI-Compatible Server](#openai-compatible-server) - Online serving via [OpenAI-Compatible Server](#openai-compatible-server)
- HPU autodetection - no need to manually select device within vLLM - HPU autodetection - no need to manually select device within vLLM
- Paged KV cache with algorithms enabled for Intel Gaudi accelerators - Paged KV cache with algorithms enabled for Intel Gaudi accelerators
- Custom Intel Gaudi implementations of Paged Attention, KV cache ops, - Custom Intel Gaudi implementations of Paged Attention, KV cache ops,
...@@ -385,5 +385,5 @@ the below: ...@@ -385,5 +385,5 @@ the below:
completely. With HPU Graphs disabled, you are trading latency and completely. With HPU Graphs disabled, you are trading latency and
throughput at lower batches for potentially higher throughput on throughput at lower batches for potentially higher throughput on
higher batches. You can do that by adding `--enforce-eager` flag to higher batches. You can do that by adding `--enforce-eager` flag to
server (for online inference), or by passing `enforce_eager=True` server (for online serving), or by passing `enforce_eager=True`
argument to LLM constructor (for offline inference). argument to LLM constructor (for offline inference).
...@@ -5,7 +5,7 @@ ...@@ -5,7 +5,7 @@
This guide will help you quickly get started with vLLM to perform: This guide will help you quickly get started with vLLM to perform:
- [Offline batched inference](#quickstart-offline) - [Offline batched inference](#quickstart-offline)
- [Online inference using OpenAI-compatible server](#quickstart-online) - [Online serving using OpenAI-compatible server](#quickstart-online)
## Prerequisites ## Prerequisites
......
...@@ -118,7 +118,7 @@ print("Loaded chat template:", custom_template) ...@@ -118,7 +118,7 @@ print("Loaded chat template:", custom_template)
outputs = llm.chat(conversation, chat_template=custom_template) outputs = llm.chat(conversation, chat_template=custom_template)
``` ```
## Online Inference ## Online Serving
Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs: Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs:
......
...@@ -127,7 +127,7 @@ print(f"Score: {score}") ...@@ -127,7 +127,7 @@ print(f"Score: {score}")
A code example can be found here: <gh-file:examples/offline_inference/offline_inference_scoring.py> A code example can be found here: <gh-file:examples/offline_inference/offline_inference_scoring.py>
## Online Inference ## Online Serving
Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs: Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs:
......
...@@ -552,7 +552,7 @@ See [this page](#multimodal-inputs) on how to pass multi-modal inputs to the mod ...@@ -552,7 +552,7 @@ See [this page](#multimodal-inputs) on how to pass multi-modal inputs to the mod
````{important} ````{important}
To enable multiple multi-modal items per text prompt, you have to set `limit_mm_per_prompt` (offline inference) To enable multiple multi-modal items per text prompt, you have to set `limit_mm_per_prompt` (offline inference)
or `--limit-mm-per-prompt` (online inference). For example, to enable passing up to 4 images per text prompt: or `--limit-mm-per-prompt` (online serving). For example, to enable passing up to 4 images per text prompt:
Offline inference: Offline inference:
```python ```python
...@@ -562,7 +562,7 @@ llm = LLM( ...@@ -562,7 +562,7 @@ llm = LLM(
) )
``` ```
Online inference: Online serving:
```bash ```bash
vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt image=4 vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt image=4
``` ```
......
...@@ -199,7 +199,7 @@ for o in outputs: ...@@ -199,7 +199,7 @@ for o in outputs:
print(generated_text) print(generated_text)
``` ```
## Online Inference ## Online Serving
Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat). Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat).
......
"""An example showing how to use vLLM to serve multimodal models """An example showing how to use vLLM to serve multimodal models
and run online inference with OpenAI client. and run online serving with OpenAI client.
Launch the vLLM server with the following command: Launch the vLLM server with the following command:
...@@ -309,7 +309,7 @@ def main(args) -> None: ...@@ -309,7 +309,7 @@ def main(args) -> None:
if __name__ == "__main__": if __name__ == "__main__":
parser = FlexibleArgumentParser( parser = FlexibleArgumentParser(
description='Demo on using OpenAI client for online inference with ' description='Demo on using OpenAI client for online serving with '
'multimodal language models served with vLLM.') 'multimodal language models served with vLLM.')
parser.add_argument('--chat-type', parser.add_argument('--chat-type',
'-c', '-c',
......
...@@ -237,8 +237,8 @@ def test_models_with_multiple_audios(vllm_runner, audio_assets, dtype: str, ...@@ -237,8 +237,8 @@ def test_models_with_multiple_audios(vllm_runner, audio_assets, dtype: str,
@pytest.mark.asyncio @pytest.mark.asyncio
async def test_online_inference(client, audio_assets): async def test_online_serving(client, audio_assets):
"""Exercises online inference with/without chunked prefill enabled.""" """Exercises online serving with/without chunked prefill enabled."""
messages = [{ messages = [{
"role": "role":
......
...@@ -1068,7 +1068,7 @@ def input_processor_for_molmo(ctx: InputContext, inputs: DecoderOnlyInputs): ...@@ -1068,7 +1068,7 @@ def input_processor_for_molmo(ctx: InputContext, inputs: DecoderOnlyInputs):
trust_remote_code=model_config.trust_remote_code) trust_remote_code=model_config.trust_remote_code)
# NOTE: message formatting for raw text prompt is only applied for # NOTE: message formatting for raw text prompt is only applied for
# offline inference; for online inference, the prompt is always in # offline inference; for online serving, the prompt is always in
# instruction format and tokenized. # instruction format and tokenized.
if prompt is not None and re.match(r"^User:[\s\S]*?(Assistant:)*$", if prompt is not None and re.match(r"^User:[\s\S]*?(Assistant:)*$",
prompt): prompt):
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment