Unverified commit 482cdc49, authored by Harry Mellor, committed by GitHub

[Doc] Rename offline inference examples (#11927)


Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
parent 20410b2f
@@ -30,7 +30,7 @@ function cpu_tests() {
  # offline inference
  docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2-"$NUMA_NODE" bash -c "
  set -e
- python3 examples/offline_inference/offline_inference.py"
+ python3 examples/offline_inference/basic.py"
  # Run basic model test
  docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
......
@@ -24,5 +24,5 @@ remove_docker_container
  # Run the image and test offline inference
  docker run --name gh200-test --gpus=all --entrypoint="" gh200-test bash -c '
- python3 examples/offline_inference/offline_inference.py
+ python3 examples/offline_inference/basic.py
  '
@@ -13,4 +13,4 @@ trap remove_docker_container EXIT
  remove_docker_container
  # Run the image and launch offline inference
- docker run --runtime=habana --name=hpu-test --network=host -e HABANA_VISIBLE_DEVICES=all -e VLLM_SKIP_WARMUP=true --entrypoint="" hpu-test-env python3 examples/offline_inference/offline_inference.py
\ No newline at end of file
+ docker run --runtime=habana --name=hpu-test --network=host -e HABANA_VISIBLE_DEVICES=all -e VLLM_SKIP_WARMUP=true --entrypoint="" hpu-test-env python3 examples/offline_inference/basic.py
\ No newline at end of file
@@ -51,4 +51,4 @@ docker run --rm -it --device=/dev/neuron0 --device=/dev/neuron1 --network host \
  -e "NEURON_COMPILE_CACHE_URL=${NEURON_COMPILE_CACHE_MOUNT}" \
  --name "${container_name}" \
  ${image_name} \
- /bin/bash -c "python3 /workspace/vllm/examples/offline_inference/offline_inference_neuron.py"
+ /bin/bash -c "python3 /workspace/vllm/examples/offline_inference/neuron.py"
@@ -13,4 +13,4 @@ trap remove_docker_container EXIT
  remove_docker_container
  # Run the image and launch offline inference
- docker run --network host --env VLLM_OPENVINO_KVCACHE_SPACE=1 --name openvino-test openvino-test python3 /workspace/examples/offline_inference/offline_inference.py
+ docker run --network host --env VLLM_OPENVINO_KVCACHE_SPACE=1 --name openvino-test openvino-test python3 /workspace/examples/offline_inference/basic.py
@@ -23,4 +23,4 @@ docker run --privileged --net host --shm-size=16G -it \
  && pytest -v -s /workspace/vllm/tests/tpu/test_custom_dispatcher.py \
  && python3 /workspace/vllm/tests/tpu/test_compilation.py \
  && python3 /workspace/vllm/tests/tpu/test_quantization_accuracy.py \
- && python3 /workspace/vllm/examples/offline_inference/offline_inference_tpu.py"
+ && python3 /workspace/vllm/examples/offline_inference/tpu.py"
@@ -14,6 +14,6 @@ remove_docker_container
  # Run the image and test offline inference/tensor parallel
  docker run --name xpu-test --device /dev/dri -v /dev/dri/by-path:/dev/dri/by-path --entrypoint="" xpu-test sh -c '
- python3 examples/offline_inference/offline_inference.py
- python3 examples/offline_inference/offline_inference_cli.py -tp 2
+ python3 examples/offline_inference/basic.py
+ python3 examples/offline_inference/cli.py -tp 2
  '
@@ -187,19 +187,19 @@ steps:
  - examples/
  commands:
  - pip install tensorizer # for tensorizer test
- - python3 offline_inference/offline_inference.py
+ - python3 offline_inference/basic.py
  - python3 offline_inference/cpu_offload.py
- - python3 offline_inference/offline_inference_chat.py
- - python3 offline_inference/offline_inference_with_prefix.py
+ - python3 offline_inference/chat.py
+ - python3 offline_inference/prefix_caching.py
  - python3 offline_inference/llm_engine_example.py
- - python3 offline_inference/offline_inference_vision_language.py
- - python3 offline_inference/offline_inference_vision_language_multi_image.py
+ - python3 offline_inference/vision_language.py
+ - python3 offline_inference/vision_language_multi_image.py
  - python3 other/tensorize_vllm_model.py --model facebook/opt-125m serialize --serialized-directory /tmp/ --suffix v1 && python3 other/tensorize_vllm_model.py --model facebook/opt-125m deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors
- - python3 offline_inference/offline_inference_encoder_decoder.py
- - python3 offline_inference/offline_inference_classification.py
- - python3 offline_inference/offline_inference_embedding.py
- - python3 offline_inference/offline_inference_scoring.py
- - python3 offline_inference/offline_profile.py --model facebook/opt-125m run_num_steps --num-steps 2
+ - python3 offline_inference/encoder_decoder.py
+ - python3 offline_inference/classification.py
+ - python3 offline_inference/embedding.py
+ - python3 offline_inference/scoring.py
+ - python3 offline_inference/profiling.py --model facebook/opt-125m run_num_steps --num-steps 2
  - label: Prefix Caching Test # 9min
  mirror_hardwares: [amd]
......
@@ -26,7 +26,7 @@ Set the env variable VLLM_RPC_TIMEOUT to a big number before you start the serve
  ### Offline Inference
- Refer to <gh-file:examples/offline_inference/offline_inference_with_profiler.py> for an example.
+ Refer to <gh-file:examples/offline_inference/simple_profiling.py> for an example.
  ### OpenAI Server
......
@@ -257,4 +257,4 @@ outputs = llm.generate(
  print(outputs[0].outputs[0].text)
  ```
- Full example: <gh-file:examples/offline_inference/offline_inference_structured_outputs.py>
+ Full example: <gh-file:examples/offline_inference/structured_outputs.py>
@@ -95,7 +95,7 @@ $ VLLM_TARGET_DEVICE=cpu python setup.py install
  $ sudo apt-get install libtcmalloc-minimal4 # install TCMalloc library
  $ find / -name *libtcmalloc* # find the dynamic link library path
  $ export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD # prepend the library to LD_PRELOAD
- $ python examples/offline_inference/offline_inference.py # run vLLM
+ $ python examples/offline_inference/basic.py # run vLLM
  ```
  - When using the online serving, it is recommended to reserve 1-2 CPU cores for the serving framework to avoid CPU oversubscription. For example, on a platform with 32 physical CPU cores, reserving CPU 30 and 31 for the framework and using CPU 0-29 for OpenMP:
@@ -132,7 +132,7 @@ CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ MHZ
  # On this platform, it is recommend to only bind openMP threads on logical CPU cores 0-7 or 8-15
  $ export VLLM_CPU_OMP_THREADS_BIND=0-7
- $ python examples/offline_inference/offline_inference.py
+ $ python examples/offline_inference/basic.py
  ```
  - If using vLLM CPU backend on a multi-socket machine with NUMA, be aware to set CPU cores using `VLLM_CPU_OMP_THREADS_BIND` to avoid cross NUMA node memory access.
......
@@ -40,7 +40,7 @@ For non-CUDA platforms, please refer [here](#installation-index) for specific in
  ## Offline Batched Inference
- With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script: <gh-file:examples/offline_inference/offline_inference.py>
+ With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script: <gh-file:examples/offline_inference/basic.py>
  The first line of this example imports the classes {class}`~vllm.LLM` and {class}`~vllm.SamplingParams`:
......
@@ -46,7 +46,7 @@ for output in outputs:
  print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
  ```
- A code example can be found here: <gh-file:examples/offline_inference/offline_inference.py>
+ A code example can be found here: <gh-file:examples/offline_inference/basic.py>
  ### `LLM.beam_search`
@@ -103,7 +103,7 @@ for output in outputs:
  print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
  ```
- A code example can be found here: <gh-file:examples/offline_inference/offline_inference_chat.py>
+ A code example can be found here: <gh-file:examples/offline_inference/chat.py>
  If the model doesn't have a chat template or you want to specify another one,
  you can explicitly pass a chat template:
......
@@ -88,7 +88,7 @@ embeds = output.outputs.embedding
  print(f"Embeddings: {embeds!r} (size={len(embeds)})")
  ```
- A code example can be found here: <gh-file:examples/offline_inference/offline_inference_embedding.py>
+ A code example can be found here: <gh-file:examples/offline_inference/embedding.py>
  ### `LLM.classify`
@@ -103,7 +103,7 @@ probs = output.outputs.probs
  print(f"Class Probabilities: {probs!r} (size={len(probs)})")
  ```
- A code example can be found here: <gh-file:examples/offline_inference/offline_inference_classification.py>
+ A code example can be found here: <gh-file:examples/offline_inference/classification.py>
  ### `LLM.score`
@@ -125,7 +125,7 @@ score = output.outputs.score
  print(f"Score: {score}")
  ```
- A code example can be found here: <gh-file:examples/offline_inference/offline_inference_scoring.py>
+ A code example can be found here: <gh-file:examples/offline_inference/scoring.py>
  ## Online Serving
......
@@ -60,7 +60,7 @@ for o in outputs:
  print(generated_text)
  ```
- Full example: <gh-file:examples/offline_inference/offline_inference_vision_language.py>
+ Full example: <gh-file:examples/offline_inference/vision_language.py>
  To substitute multiple images inside the same text prompt, you can pass in a list of images instead:
@@ -91,7 +91,7 @@ for o in outputs:
  print(generated_text)
  ```
- Full example: <gh-file:examples/offline_inference/offline_inference_vision_language_multi_image.py>
+ Full example: <gh-file:examples/offline_inference/vision_language_multi_image.py>
  Multi-image input can be extended to perform video captioning. We show this with [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) as it supports videos:
@@ -125,13 +125,13 @@ for o in outputs:
  You can pass a list of NumPy arrays directly to the `'video'` field of the multi-modal dictionary
  instead of using multi-image input.
- Full example: <gh-file:examples/offline_inference/offline_inference_vision_language.py>
+ Full example: <gh-file:examples/offline_inference/vision_language.py>
  ### Audio
  You can pass a tuple `(array, sampling_rate)` to the `'audio'` field of the multi-modal dictionary.
- Full example: <gh-file:examples/offline_inference/offline_inference_audio_language.py>
+ Full example: <gh-file:examples/offline_inference/audio_language.py>
  ### Embedding
......
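For anyone with scripts or docs that still point at the old example paths, the renames visible in the hunks above can be collected into a lookup table. The following is only a sketch: `RENAMES` and `update_references` are hypothetical names that are not part of this commit or of vLLM, and the table covers only the files that appear in the diff as shown here.

```python
import re

# Old -> new file names under examples/offline_inference/, transcribed from the
# hunks of this commit. Files the diff leaves alone (for example cpu_offload.py)
# are intentionally absent. This table is an illustration, not part of vLLM.
RENAMES = {
    "offline_inference.py": "basic.py",
    "offline_inference_neuron.py": "neuron.py",
    "offline_inference_tpu.py": "tpu.py",
    "offline_inference_cli.py": "cli.py",
    "offline_inference_chat.py": "chat.py",
    "offline_inference_with_prefix.py": "prefix_caching.py",
    "offline_inference_vision_language.py": "vision_language.py",
    "offline_inference_vision_language_multi_image.py": "vision_language_multi_image.py",
    "offline_inference_encoder_decoder.py": "encoder_decoder.py",
    "offline_inference_classification.py": "classification.py",
    "offline_inference_embedding.py": "embedding.py",
    "offline_inference_scoring.py": "scoring.py",
    "offline_profile.py": "profiling.py",
    "offline_inference_with_profiler.py": "simple_profiling.py",
    "offline_inference_structured_outputs.py": "structured_outputs.py",
    "offline_inference_audio_language.py": "audio_language.py",
}

# A single alternation (longest names first) so every reference is rewritten
# in one pass and a rewritten path is never matched a second time.
_PATTERN = re.compile(
    r"offline_inference/("
    + "|".join(re.escape(name) for name in sorted(RENAMES, key=len, reverse=True))
    + r")"
)


def update_references(text: str) -> str:
    """Rewrite references to the old example paths so they use the new names."""
    return _PATTERN.sub(lambda m: "offline_inference/" + RENAMES[m.group(1)], text)
```

Calling `update_references` on a line such as `python3 examples/offline_inference/offline_inference.py` yields the `basic.py` form, while paths the commit does not rename pass through unchanged.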