Commit ec5e299c authored by zhuwenwen

Merge tag 'v0.7.3' into v0.7.3-dev

parents 47bd229c ed6e9075
# Seed Parameter Behavior in vLLM
## Overview
The `seed` parameter in vLLM controls the random states of several random number generators. Because these states are global, setting it can also change the behavior of random operations in user code, especially code that runs alongside vLLM models.
## Default Behavior
By default, the `seed` parameter is set to `None`. When `seed` is `None`, vLLM does not set the global random states for `random`, `np.random`, or `torch` (via `torch.manual_seed`). Random operations in user code therefore behave as they would without vLLM, with no fixed random state.
## Specifying a Seed
If a specific seed value is provided, the global random states for `random`, `np.random`, and `torch` (via `torch.manual_seed`) are set accordingly. This is useful for reproducibility, because random operations then produce the same results across runs.
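The following is a minimal sketch of the seeding behavior described above. It uses only Python's built-in `random` module, so it runs without vLLM installed. `seed_global_rng` and `draw` are hypothetical helpers written for illustration; they are not vLLM functions, and the NumPy and PyTorch calls appear only in a comment.

```python
import random
from typing import List, Optional


def seed_global_rng(seed: Optional[int]) -> None:
    """Hypothetical helper: seed Python's global RNG only when a seed is given."""
    if seed is None:
        return  # leave the global random state untouched
    random.seed(seed)
    # The NumPy and PyTorch analogues would be np.random.seed(seed)
    # and torch.manual_seed(seed); they are omitted to keep this stdlib-only.


def draw(seed: Optional[int], n: int = 3) -> List[int]:
    """Apply the (optional) seed, then draw n integers from the global RNG."""
    seed_global_rng(seed)
    return [random.randint(0, 100) for _ in range(n)]


print(draw(42) == draw(42))  # same seed, same sequence
```

With a seed, repeated calls to `draw` return the same list. With `None`, the global state is left as it was, so results depend on whatever state the process already had.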
## Example Usage
### Without Specifying a Seed
```python
import random
from vllm import LLM
# Initialize a vLLM model without specifying a seed
model = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
# Try generating random numbers
print(random.randint(0, 100)) # Outputs different numbers across runs
```
### Specifying a Seed
```python
import random
from vllm import LLM
# Initialize a vLLM model with a specific seed
model = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", seed=42)
# Try generating random numbers
print(random.randint(0, 100)) # Outputs the same number across runs
```
## Important Notes
- If the `seed` parameter is not specified, the behavior of global random states remains unaffected.
- If a specific seed value is provided, the global random states for `random`, `np.random`, and `torch` (via `torch.manual_seed`) are set to that value.
- This behavior can be useful for reproducibility but may lead to non-intuitive behavior if the user is not explicitly aware of it.
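One way to keep user code independent of the global seeding noted above is to draw from a private generator instead of the module-level functions. The sketch below assumes plain Python only; `sample_private` is a hypothetical helper, and the `random.seed(42)` call merely stands in for a library reseeding the global state.

```python
import random
from typing import List


def sample_private(seed: int, n: int = 3) -> List[int]:
    """Draw n integers from a generator that is independent of the global state."""
    rng = random.Random(seed)  # private generator with its own state
    return [rng.randint(0, 100) for _ in range(n)]


before = sample_private(1234)
random.seed(42)  # stand-in for a library reseeding the global generator
after = sample_private(1234)

# The private stream depends only on its own seed, not on the global one.
print(before == after)
```

NumPy offers a comparable pattern through `np.random.default_rng(seed)`, and PyTorch through `torch.Generator`, should user code need isolated streams there as well.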
## Conclusion
Understanding the behavior of the `seed` parameter in vLLM is crucial for ensuring the expected behavior of random operations in your code. By default, the `seed` parameter is set to `None`, which means that the global random states are not affected. However, specifying a seed value can help achieve reproducibility in your experiments.
.vertical-table-header th.head:not(.stub) {
  writing-mode: sideways-lr;
  white-space: nowrap;
  max-width: 0;
  p {
    margin: 0;
  }
}
@@ -12,6 +12,7 @@
 # add these directories to sys.path here. If the directory is relative to the
 # documentation root, use os.path.abspath to make it absolute, like shown here.
+import datetime
 import inspect
 import logging
 import os
@@ -27,7 +28,7 @@ sys.path.append(os.path.abspath("../.."))
 # -- Project information -----------------------------------------------------
 project = 'vLLM'
-copyright = '2024, vLLM Team'
+copyright = f'{datetime.datetime.now().year}, vLLM Team'
 author = 'the vLLM Team'
 # -- General configuration ---------------------------------------------------
@@ -78,8 +79,12 @@ html_theme_options = {
     'use_repository_button': True,
     'use_edit_page_button': True,
 }
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
 html_static_path = ["_static"]
 html_js_files = ["custom.js"]
+html_css_files = ["custom.css"]
 myst_url_schemes = {
     'http': None,
@@ -121,11 +126,6 @@ if READTHEDOCS_VERSION_TYPE == "tag":
     if os.path.exists(header_file):
         os.remove(header_file)
-# Add any paths that contain custom static files (such as style sheets) here,
-# relative to this directory. They are copied after the builtin static files,
-# so a file named "default.css" will overwrite the builtin "default.css".
-# html_static_path = ['_static']
 # Generate additional rst documentation here.
 def setup(app):
......
# Profiling vLLM # Profiling vLLM
:::{warning}
Profiling is only intended for vLLM developers and maintainers to understand the proportion of time spent in different parts of the codebase. **vLLM end-users should never turn on profiling** as it will significantly slow down the inference.
:::
We support tracing vLLM workers using the `torch.profiler` module. You can enable tracing by setting the `VLLM_TORCH_PROFILER_DIR` environment variable to the directory where you want to save the traces: `VLLM_TORCH_PROFILER_DIR=/mnt/traces/` We support tracing vLLM workers using the `torch.profiler` module. You can enable tracing by setting the `VLLM_TORCH_PROFILER_DIR` environment variable to the directory where you want to save the traces: `VLLM_TORCH_PROFILER_DIR=/mnt/traces/`
The OpenAI server also needs to be started with the `VLLM_TORCH_PROFILER_DIR` environment variable set. The OpenAI server also needs to be started with the `VLLM_TORCH_PROFILER_DIR` environment variable set.
When using `benchmarks/benchmark_serving.py`, you can enable profiling by passing the `--profile` flag. When using `benchmarks/benchmark_serving.py`, you can enable profiling by passing the `--profile` flag.
:::{warning}
Only enable profiling in a development environment.
:::
Traces can be visualized using <https://ui.perfetto.dev/>. Traces can be visualized using <https://ui.perfetto.dev/>.
:::{tip} :::{tip}
......
@@ -66,7 +66,7 @@ This server can be started using the `vllm serve` command.
 vllm serve <model>
 ```
-The code for the `vllm` CLI can be found in <gh-file:vllm/scripts.py>.
+The code for the `vllm` CLI can be found in <gh-file:vllm/entrypoints/cli/main.py>.
 Sometimes you may see the API server entrypoint used directly instead of via the
 `vllm` CLI command. For example:
......
...@@ -20,93 +20,93 @@ The table below shows the compatibility of various quantization implementations ...@@ -20,93 +20,93 @@ The table below shows the compatibility of various quantization implementations
* AWS Inferentia * AWS Inferentia
* Google TPU * Google TPU
- * AWQ - * AWQ
* *
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* *
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* *
* *
- * GPTQ - * GPTQ
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* *
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* *
* *
- * Marlin (GPTQ/AWQ/FP8) - * Marlin (GPTQ/AWQ/FP8)
* *
* *
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* *
* *
* *
* *
* *
- * INT8 (W8A8) - * INT8 (W8A8)
* *
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* *
* *
* ✅︎ * ✅︎
* *
* *
- * FP8 (W8A8) - * FP8 (W8A8)
* *
* *
* *
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* *
* *
* *
* *
- * AQLM - * AQLM
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* *
* *
* *
* *
* *
- * bitsandbytes - * bitsandbytes
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* *
* *
* *
* *
* *
- * DeepSpeedFP - * DeepSpeedFP
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* *
* *
* *
* *
* *
- * GGUF - * GGUF
* ✅︎ * ✅︎
* ✅︎ * ✅︎
...@@ -114,16 +114,16 @@ The table below shows the compatibility of various quantization implementations ...@@ -114,16 +114,16 @@ The table below shows the compatibility of various quantization implementations
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* *
* *
* *
* *
::: :::
- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0. - Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- "✅︎" indicates that the quantization method is supported on the specified hardware. - ✅︎ indicates that the quantization method is supported on the specified hardware.
- "✗" indicates that the quantization method is not supported on the specified hardware. - indicates that the quantization method is not supported on the specified hardware.
:::{note} :::{note}
This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods. This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
......
@@ -45,7 +45,7 @@ To perform the same with an online mode launch the server:
 ```bash
 python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model facebook/opt-6.7b \
-    --seed 42 -tp 1 --speculative_model facebook/opt-125m --use-v2-block-manager \
+    --seed 42 -tp 1 --speculative_model facebook/opt-125m \
     --num_speculative_tokens 5 --gpu_memory_utilization 0.8
 ```
@@ -175,7 +175,7 @@ sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
 llm = LLM(
     model="meta-llama/Meta-Llama-3-8B-Instruct",
     tensor_parallel_size=4,
-    speculative_model="path/to/modified/eagle/model",
+    speculative_model="yuhuili/EAGLE-LLaMA3-Instruct-8B",
     speculative_draft_tensor_parallel_size=1,
 )
@@ -190,14 +190,12 @@ for output in outputs:
 A few important things to consider when using the EAGLE based draft models:
-1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) cannot be
-used directly with vLLM due to differences in the expected layer names and model definition.
-To use these models with vLLM, use the [following script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d)
-to convert them. Note that this script does not modify the model's weights.
-
-In the above example, use the script to first convert
-the [yuhuili/EAGLE-LLaMA3-Instruct-8B](https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-8B) model
-and then use the converted checkpoint as the draft model in vLLM.
+1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) should
+be able to be loaded and used directly by vLLM after [PR 12304](https://github.com/vllm-project/vllm/pull/12304).
+If you are using vllm version before [PR 12304](https://github.com/vllm-project/vllm/pull/12304), please use the
+[script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d) to convert the speculative model,
+and specify `speculative_model="path/to/modified/eagle/model"`. If weight-loading problems still occur when using
+the latest version of vLLM, please leave a comment or raise an issue.
 2. The EAGLE based draft models need to be run without tensor parallelism
 (i.e. speculative_draft_tensor_parallel_size is set to 1), although
......
 # Tool Calling
-vLLM currently supports named function calling, as well as the `auto` and `none` options for the `tool_choice` field in the chat completion API. The `tool_choice` option `required` is **not yet supported** but on the roadmap.
+vLLM currently supports named function calling, as well as the `auto` and `none` options for the `tool_choice` field in the chat completion API. The `tool_choice` option `required` is **not yet supported** but [on the roadmap](gh-issue:13002).
 ## Quickstart
......
@@ -147,7 +147,7 @@ class Example:
             return content
         content += "## Example materials\n\n"
-        for file in self.other_files:
+        for file in sorted(self.other_files):
             include = "include" if file.suffix == ".md" else "literalinclude"
             content += f":::{{admonition}} {file.relative_to(self.path)}\n"
             content += ":class: dropdown\n\n"
@@ -194,7 +194,7 @@ def generate_examples():
             path=EXAMPLE_DOC_DIR / "examples_offline_inference_index.md",
             title="Offline Inference",
             description=
-            "Offline inference examples demonstrate how to use vLLM in an offline setting, where the model is queried for predictions in batches.", # noqa: E501
+            "Offline inference examples demonstrate how to use vLLM in an offline setting, where the model is queried for predictions in batches. We recommend starting with <project:basic.md>.", # noqa: E501
             caption="Examples",
         ),
     }
......
@@ -10,7 +10,7 @@ Second, install Python packages for vLLM CPU backend building:
 ```console
 pip install --upgrade pip
-pip install cmake>=3.26 wheel packaging ninja "setuptools-scm>=8" numpy
+pip install "cmake>=3.26" wheel packaging ninja "setuptools-scm>=8" numpy
 pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
 ```
......