Replace "online inference" with "online serving" (#11923)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

Replace "online inference" with "online serving" (#11923)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
d85c47d6 · Harry Mellor · GitHub · ef725fea · d85c47d6 · d85c47d6
Unverified Commit d85c47d6 authored Jan 10, 2025 by Harry Mellor Committed by GitHub Jan 10, 2025
11 changed files
--- a/.buildkite/run-cpu-test.sh
+++ b/.buildkite/run-cpu-test.sh
@@ -61,7 +61,7 @@ function cpu_tests() {
    pytest -s -v -k cpu_model \
    tests/basic_correctness/test_chunked_prefill.py"  

-  # online inference
+  # online serving
  docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
    set -e
    export VLLM_CPU_KVCACHE_SPACE=10 

--- a/docs/source/features/structured_outputs.md
+++ b/docs/source/features/structured_outputs.md
@@ -5,7 +5,7 @@
 vLLM supports the generation of structured outputs using [outlines](https://github.com/dottxt-ai/outlines), [lm-format-enforcer](https://github.com/noamgat/lm-format-enforcer), or [xgrammar](https://github.com/mlc-ai/xgrammar) as backends for the guided decoding.
 This document shows you some examples of the different options that are available to generate structured outputs.

-## Online Inference (OpenAI API)
+## Online Serving (OpenAI API)

 You can generate structured outputs using the OpenAI's [Completions](https://platform.openai.com/docs/api-reference/completions) and [Chat](https://platform.openai.com/docs/api-reference/chat) API.

@@ -239,7 +239,7 @@ The main available options inside `GuidedDecodingParams` are:
 - `backend`
 - `whitespace_pattern`

-These parameters can be used in the same way as the parameters from the Online Inference examples above.
+These parameters can be used in the same way as the parameters from the Online Serving examples above.
 One example for the usage of the `choices` parameter is shown below:

 ```python

--- a/docs/source/getting_started/installation/hpu-gaudi.md
+++ b/docs/source/getting_started/installation/hpu-gaudi.md
@@ -83,7 +83,7 @@ $ python setup.py develop
 ## Supported Features

 - [Offline inference](#offline-inference)
- Online inference via [OpenAI-Compatible Server](#openai-compatible-server)
+- Online serving via [OpenAI-Compatible Server](#openai-compatible-server)
 - HPU autodetection - no need to manually select device within vLLM
 - Paged KV cache with algorithms enabled for Intel Gaudi accelerators
 - Custom Intel Gaudi implementations of Paged Attention, KV cache ops,
@@ -385,5 +385,5 @@ the below:
  completely. With HPU Graphs disabled, you are trading latency and
  throughput at lower batches for potentially higher throughput on
  higher batches. You can do that by adding `--enforce-eager` flag to
-  server (for online inference), or by passing `enforce_eager=True`
+  server (for online serving), or by passing `enforce_eager=True`
  argument to LLM constructor (for offline inference).
--- a/docs/source/getting_started/quickstart.md
+++ b/docs/source/getting_started/quickstart.md
@@ -5,7 +5,7 @@
 This guide will help you quickly get started with vLLM to perform:

 - [Offline batched inference](#quickstart-offline)
- [Online inference using OpenAI-compatible server](#quickstart-online)
+- [Online serving using OpenAI-compatible server](#quickstart-online)

 ## Prerequisites


--- a/docs/source/models/generative_models.md
+++ b/docs/source/models/generative_models.md
@@ -118,7 +118,7 @@ print("Loaded chat template:", custom_template)
 outputs = llm.chat(conversation, chat_template=custom_template)
 ```

-## Online Inference
+## Online Serving

 Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs:


--- a/docs/source/models/pooling_models.md
+++ b/docs/source/models/pooling_models.md
@@ -127,7 +127,7 @@ print(f"Score: {score}")

 A code example can be found here: <gh-file:examples/offline_inference/offline_inference_scoring.py>

-## Online Inference
+## Online Serving

 Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs:


--- a/docs/source/models/supported_models.md
+++ b/docs/source/models/supported_models.md
@@ -552,7 +552,7 @@ See [this page](#multimodal-inputs) on how to pass multi-modal inputs to the mod

 ````{important}
 To enable multiple multi-modal items per text prompt, you have to set `limit_mm_per_prompt` (offline inference)
-or `--limit-mm-per-prompt` (online inference). For example, to enable passing up to 4 images per text prompt:
+or `--limit-mm-per-prompt` (online serving). For example, to enable passing up to 4 images per text prompt:

 Offline inference:
 ```python
@@ -562,7 +562,7 @@ llm = LLM(
 )
 ```

-Online inference:
+Online serving:
 ```bash
 vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt image=4
 ```

--- a/docs/source/serving/multimodal_inputs.md
+++ b/docs/source/serving/multimodal_inputs.md
@@ -199,7 +199,7 @@ for o in outputs:
    print(generated_text)
 ```

-## Online Inference
+## Online Serving

 Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat).


--- a/examples/online_serving/openai_chat_completion_client_for_multimodal.py
+++ b/examples/online_serving/openai_chat_completion_client_for_multimodal.py
 """An example showing how to use vLLM to serve multimodal models 
-and run online inference with OpenAI client.
+and run online serving with OpenAI client.

 Launch the vLLM server with the following command:

@@ -309,7 +309,7 @@ def main(args) -> None:

 if __name__ == "__main__":
    parser = FlexibleArgumentParser(
-        description='Demo on using OpenAI client for online inference with '
+        description='Demo on using OpenAI client for online serving with '
        'multimodal language models served with vLLM.')
    parser.add_argument('--chat-type',
                        '-c',

--- a/tests/models/decoder_only/audio_language/test_ultravox.py
+++ b/tests/models/decoder_only/audio_language/test_ultravox.py
@@ -237,8 +237,8 @@ def test_models_with_multiple_audios(vllm_runner, audio_assets, dtype: str,


 @pytest.mark.asyncio
-async def test_online_inference(client, audio_assets):
-    """Exercises online inference with/without chunked prefill enabled."""
+async def test_online_serving(client, audio_assets):
+    """Exercises online serving with/without chunked prefill enabled."""

    messages = [{
        "role":

--- a/vllm/model_executor/models/molmo.py
+++ b/vllm/model_executor/models/molmo.py
@@ -1068,7 +1068,7 @@ def input_processor_for_molmo(ctx: InputContext, inputs: DecoderOnlyInputs):
        trust_remote_code=model_config.trust_remote_code)

    # NOTE: message formatting for raw text prompt is only applied for
-    # offline inference; for online inference, the prompt is always in
+    # offline inference; for online serving, the prompt is always in
    # instruction format and tokenized.
    if prompt is not None and re.match(r"^User:[\s\S]*?(Assistant:)*$",
                                       prompt):