Unverified commit aba8d6ee, authored by Harry Mellor, committed by GitHub

[Doc] Move examples into categories (#11840)


Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
parent 2a0596bc
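For readers with their own documents that still link to the old flat layout, the rename pattern in this commit can be sketched as a small link rewriter. This is only an illustration, not part of the commit: the names `MOVED` and `update_links` are hypothetical, and the mapping lists just a few of the moves visible in the diff below.

```python
import re

# Old-to-new locations, copied from path changes visible in this diff.
# The table is deliberately partial; extend it as needed.
MOVED = {
    "offline_inference.py": "offline_inference/offline_inference.py",
    "offline_inference_chat.py": "offline_inference/offline_inference_chat.py",
    "save_sharded_state.py": "offline_inference/save_sharded_state.py",
    "openai_completion_client.py": "online_serving/openai_completion_client.py",
    "run_cluster.sh": "online_serving/run_cluster.sh",
}

# Matches the <gh-file:examples/...> link syntax used in the docs.
GH_FILE = re.compile(r"<gh-file:examples/([^>]+)>")


def update_links(text: str) -> str:
    """Rewrite known old example paths; leave every other link untouched."""

    def swap(match: re.Match) -> str:
        old = match.group(1)
        return f"<gh-file:examples/{MOVED.get(old, old)}>"

    return GH_FILE.sub(swap, text)


before = "See the example script: <gh-file:examples/offline_inference.py>"
print(update_links(before))
```

Because already-categorized paths are not keys in the table, running the rewriter twice over the same text leaves it unchanged.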
@@ -31,7 +31,7 @@ For non-CUDA platforms, please refer [here](#installation-index) for specific in
 ## Offline Batched Inference
-With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script: <gh-file:examples/offline_inference.py>
+With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script: <gh-file:examples/offline_inference/offline_inference.py>
 The first line of this example imports the classes {class}`~vllm.LLM` and {class}`~vllm.SamplingParams`:
@@ -133,7 +133,7 @@ completion = client.completions.create(model="Qwen/Qwen2.5-1.5B-Instruct",
 print("Completion result:", completion)
 ```
-A more detailed client example can be found here: <gh-file:examples/openai_completion_client.py>
+A more detailed client example can be found here: <gh-file:examples/online_serving/openai_completion_client.py>
 ### OpenAI Chat Completions API with vLLM
......
@@ -24,7 +24,7 @@ To isolate the model downloading and loading issue, you can use the `--load-form
 ## Model is too large
-If the model is too large to fit in a single GPU, you might want to [consider tensor parallelism](#distributed-serving) to split the model across multiple GPUs. In that case, every process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism). You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
+If the model is too large to fit in a single GPU, you might want to [consider tensor parallelism](#distributed-serving) to split the model across multiple GPUs. In that case, every process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism). You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/offline_inference/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
 ## Enable more logging
......
@@ -9,7 +9,7 @@ shorter Pod startup times and CPU memory usage. Tensor encryption is also suppor
 For more information on CoreWeave's Tensorizer, please refer to
 [CoreWeave's Tensorizer documentation](https://github.com/coreweave/tensorizer). For more information on serializing a vLLM model, as well a general usage guide to using Tensorizer with vLLM, see
-the [vLLM example script](https://docs.vllm.ai/en/stable/getting_started/examples/tensorize_vllm_model.html).
+the [vLLM example script](https://docs.vllm.ai/en/stable/getting_started/examples/offline_inference/tensorize_vllm_model.html).
 ```{note}
 Note that to use this feature you will need to install `tensorizer` by running `pip install vllm[tensorizer]`.
......
@@ -46,7 +46,7 @@ for output in outputs:
 print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
 ```
-A code example can be found here: <gh-file:examples/offline_inference.py>
+A code example can be found here: <gh-file:examples/offline_inference/offline_inference.py>
 ### `LLM.beam_search`
@@ -103,7 +103,7 @@ for output in outputs:
 print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
 ```
-A code example can be found here: <gh-file:examples/offline_inference_chat.py>
+A code example can be found here: <gh-file:examples/offline_inference/offline_inference_chat.py>
 If the model doesn't have a chat template or you want to specify another one,
 you can explicitly pass a chat template:
......
@@ -65,7 +65,7 @@ embeds = output.outputs.embedding
 print(f"Embeddings: {embeds!r} (size={len(embeds)})")
 ```
-A code example can be found here: <gh-file:examples/offline_inference_embedding.py>
+A code example can be found here: <gh-file:examples/offline_inference/offline_inference_embedding.py>
 ### `LLM.classify`
@@ -80,7 +80,7 @@ probs = output.outputs.probs
 print(f"Class Probabilities: {probs!r} (size={len(probs)})")
 ```
-A code example can be found here: <gh-file:examples/offline_inference_classification.py>
+A code example can be found here: <gh-file:examples/offline_inference/offline_inference_classification.py>
 ### `LLM.score`
@@ -102,7 +102,7 @@ score = output.outputs.score
 print(f"Score: {score}")
 ```
-A code example can be found here: <gh-file:examples/offline_inference_scoring.py>
+A code example can be found here: <gh-file:examples/offline_inference/offline_inference_scoring.py>
 ## Online Inference
......
@@ -51,7 +51,7 @@ $ --pipeline-parallel-size 2
 If a single node does not have enough GPUs to hold the model, you can run the model using multiple nodes. It is important to make sure the execution environment is the same on all nodes, including the model path, the Python environment. The recommended way is to use docker images to ensure the same environment, and hide the heterogeneity of the host machines via mapping them into the same docker configuration.
-The first step, is to start containers and organize them into a cluster. We have provided the helper script <gh-file:examples/run_cluster.sh> to start the cluster. Please note, this script launches docker without administrative privileges that would be required to access GPU performance counters when running profiling and tracing tools. For that purpose, the script can have `CAP_SYS_ADMIN` to the docker container by using the `--cap-add` option in the docker run command.
+The first step, is to start containers and organize them into a cluster. We have provided the helper script <gh-file:examples/online_serving/run_cluster.sh> to start the cluster. Please note, this script launches docker without administrative privileges that would be required to access GPU performance counters when running profiling and tracing tools. For that purpose, the script can have `CAP_SYS_ADMIN` to the docker container by using the `--cap-add` option in the docker run command.
 Pick a node as the head node, and run the following command:
......
@@ -60,7 +60,7 @@ for o in outputs:
 print(generated_text)
 ```
-Full example: <gh-file:examples/offline_inference_vision_language.py>
+Full example: <gh-file:examples/offline_inference/offline_inference_vision_language.py>
 To substitute multiple images inside the same text prompt, you can pass in a list of images instead:
@@ -91,7 +91,7 @@ for o in outputs:
 print(generated_text)
 ```
-Full example: <gh-file:examples/offline_inference_vision_language_multi_image.py>
+Full example: <gh-file:examples/offline_inference/offline_inference_vision_language_multi_image.py>
 Multi-image input can be extended to perform video captioning. We show this with [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) as it supports videos:
@@ -125,13 +125,13 @@ for o in outputs:
 You can pass a list of NumPy arrays directly to the `'video'` field of the multi-modal dictionary
 instead of using multi-image input.
-Full example: <gh-file:examples/offline_inference_vision_language.py>
+Full example: <gh-file:examples/offline_inference/offline_inference_vision_language.py>
 ### Audio
 You can pass a tuple `(array, sampling_rate)` to the `'audio'` field of the multi-modal dictionary.
-Full example: <gh-file:examples/offline_inference_audio_language.py>
+Full example: <gh-file:examples/offline_inference/offline_inference_audio_language.py>
 ### Embedding
@@ -271,7 +271,7 @@ chat_response = client.chat.completions.create(
 print("Chat completion output:", chat_response.choices[0].message.content)
 ```
-Full example: <gh-file:examples/openai_chat_completion_client_for_multimodal.py>
+Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
 ```{tip}
 Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine,
@@ -342,7 +342,7 @@ result = chat_completion_from_url.choices[0].message.content
 print("Chat completion output from image url:", result)
 ```
-Full example: <gh-file:examples/openai_chat_completion_client_for_multimodal.py>
+Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
 ````{note}
 By default, the timeout for fetching videos through HTTP URL is `30` seconds.
@@ -445,7 +445,7 @@ result = chat_completion_from_url.choices[0].message.content
 print("Chat completion output from audio url:", result)
 ```
-Full example: <gh-file:examples/openai_chat_completion_client_for_multimodal.py>
+Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
 ````{note}
 By default, the timeout for fetching audios through HTTP URL is `10` seconds.
@@ -529,4 +529,4 @@ Also important, `MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of th
 example below for details.
 ```
-Full example: <gh-file:examples/openai_chat_embedding_client_for_multimodal.py>
+Full example: <gh-file:examples/online_serving/openai_chat_embedding_client_for_multimodal.py>
@@ -191,7 +191,7 @@ The order of priorities is `command line > config file values > defaults`.
 Our Completions API is compatible with [OpenAI's Completions API](https://platform.openai.com/docs/api-reference/completions);
 you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
-Code example: <gh-file:examples/openai_completion_client.py>
+Code example: <gh-file:examples/online_serving/openai_completion_client.py>
 #### Extra parameters
@@ -222,7 +222,7 @@ We support both [Vision](https://platform.openai.com/docs/guides/vision)- and
 see our [Multimodal Inputs](#multimodal-inputs) guide for more information.
 - *Note: `image_url.detail` parameter is not supported.*
-Code example: <gh-file:examples/openai_chat_completion_client.py>
+Code example: <gh-file:examples/online_serving/openai_chat_completion_client.py>
 #### Extra parameters
@@ -255,7 +255,7 @@ which will be treated as a single prompt to the model.
 This enables multi-modal inputs to be passed to embedding models, see [this page](#multimodal-inputs) for details.
 ```
-Code example: <gh-file:examples/openai_embedding_client.py>
+Code example: <gh-file:examples/online_serving/openai_embedding_client.py>
 #### Extra parameters
@@ -299,7 +299,7 @@ Our Pooling API encodes input prompts using a [pooling model](../models/pooling_
 The input format is the same as [Embeddings API](#embeddings-api), but the output data can contain an arbitrary nested list, not just a 1-D list of floats.
-Code example: <gh-file:examples/openai_pooling_client.py>
+Code example: <gh-file:examples/online_serving/openai_pooling_client.py>
 (score-api)=
 ### Score API
@@ -309,7 +309,7 @@ Usually, the score for a sentence pair refers to the similarity between two sent
 You can find the documentation for these kind of models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).
-Code example: <gh-file:examples/openai_cross_encoder_score.py>
+Code example: <gh-file:examples/online_serving/openai_cross_encoder_score.py>
 #### Single inference
......
@@ -3,7 +3,8 @@ Demonstrate prompting of text-to-text
 encoder/decoder models, specifically Florence-2
 '''
 # TODO(Isotr0py):
-# Move to offline_inference_vision_language.py after porting vision backbone
+# Move to offline_inference/offline_inference_vision_language.py
+# after porting vision backbone
 from vllm import LLM, SamplingParams
 dtype = "float"
......