Commit ec5e299c authored by zhuwenwen's avatar zhuwenwen
Browse files

Merge tag 'v0.7.3' into v0.7.3-dev

parents 47bd229c ed6e9075
# Seed Parameter Behavior in vLLM
## Overview
The `seed` parameter in vLLM is used to control the random states for various random number generators. This parameter can affect the behavior of random operations in user code, especially when working with models in vLLM.
## Default Behavior
By default, the `seed` parameter is set to `None`. When the `seed` parameter is `None`, the global random states for `random`, `np.random`, and `torch.manual_seed` are not set. This means that the random operations will behave as expected, without any fixed random states.
## Specifying a Seed
If a specific seed value is provided, the global random states for `random`, `np.random`, and `torch.manual_seed` will be set accordingly. This can be useful for reproducibility, as it ensures that the random operations produce the same results across multiple runs.
## Example Usage
### Without Specifying a Seed
```python
import random
from vllm import LLM
# Initialize a vLLM model without specifying a seed
model = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
# Try generating random numbers
print(random.randint(0, 100)) # Outputs different numbers across runs
```
### Specifying a Seed
```python
import random
from vllm import LLM
# Initialize a vLLM model with a specific seed
model = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", seed=42)
# Try generating random numbers
print(random.randint(0, 100)) # Outputs the same number across runs
```
## Important Notes
- If the `seed` parameter is not specified, the behavior of global random states remains unaffected.
- If a specific seed value is provided, the global random states for `random`, `np.random`, and `torch.manual_seed` will be set to that value.
- This behavior can be useful for reproducibility but may lead to non-intuitive behavior if the user is not explicitly aware of it.
## Conclusion
Understanding the behavior of the `seed` parameter in vLLM is crucial for ensuring the expected behavior of random operations in your code. By default, the `seed` parameter is set to `None`, which means that the global random states are not affected. However, specifying a seed value can help achieve reproducibility in your experiments.
.vertical-table-header th.head:not(.stub) {
writing-mode: sideways-lr;
white-space: nowrap;
max-width: 0;
p {
margin: 0;
}
}
......@@ -12,6 +12,7 @@
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
import datetime
import inspect
import logging
import os
......@@ -27,7 +28,7 @@ sys.path.append(os.path.abspath("../.."))
# -- Project information -----------------------------------------------------
project = 'vLLM'
copyright = '2024, vLLM Team'
copyright = f'{datetime.datetime.now().year}, vLLM Team'
author = 'the vLLM Team'
# -- General configuration ---------------------------------------------------
......@@ -78,8 +79,12 @@ html_theme_options = {
'use_repository_button': True,
'use_edit_page_button': True,
}
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ["_static"]
html_js_files = ["custom.js"]
html_css_files = ["custom.css"]
myst_url_schemes = {
'http': None,
......@@ -121,11 +126,6 @@ if READTHEDOCS_VERSION_TYPE == "tag":
if os.path.exists(header_file):
os.remove(header_file)
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
# html_static_path = ['_static']
# Generate additional rst documentation here.
def setup(app):
......
# Profiling vLLM
:::{warning}
Profiling is only intended for vLLM developers and maintainers to understand the proportion of time spent in different parts of the codebase. **vLLM end-users should never turn on profiling** as it will significantly slow down the inference.
:::
We support tracing vLLM workers using the `torch.profiler` module. You can enable tracing by setting the `VLLM_TORCH_PROFILER_DIR` environment variable to the directory where you want to save the traces: `VLLM_TORCH_PROFILER_DIR=/mnt/traces/`
The OpenAI server also needs to be started with the `VLLM_TORCH_PROFILER_DIR` environment variable set.
When using `benchmarks/benchmark_serving.py`, you can enable profiling by passing the `--profile` flag.
:::{warning}
Only enable profiling in a development environment.
:::
Traces can be visualized using <https://ui.perfetto.dev/>.
:::{tip}
......
......@@ -66,7 +66,7 @@ This server can be started using the `vllm serve` command.
vllm serve <model>
```
The code for the `vllm` CLI can be found in <gh-file:vllm/scripts.py>.
The code for the `vllm` CLI can be found in <gh-file:vllm/entrypoints/cli/main.py>.
Sometimes you may see the API server entrypoint used directly instead of via the
`vllm` CLI command. For example:
......
......@@ -20,93 +20,93 @@ The table below shows the compatibility of various quantization implementations
* AWS Inferentia
* Google TPU
- * AWQ
*
*
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
*
* ✅︎
* ✅︎
*
*
*
*
- * GPTQ
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
*
* ✅︎
* ✅︎
*
*
*
*
- * Marlin (GPTQ/AWQ/FP8)
*
*
*
*
* ✅︎
* ✅︎
* ✅︎
*
*
*
*
*
*
*
*
*
*
- * INT8 (W8A8)
*
*
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
*
*
*
* ✅︎
*
*
*
*
- * FP8 (W8A8)
*
*
*
*
*
*
* ✅︎
* ✅︎
* ✅︎
*
*
*
*
*
*
*
*
- * AQLM
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
*
*
*
*
*
*
*
*
*
- * bitsandbytes
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
*
*
*
*
*
*
*
*
*
- * DeepSpeedFP
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
*
*
*
*
*
*
*
*
*
- * GGUF
* ✅︎
* ✅︎
......@@ -114,16 +114,16 @@ The table below shows the compatibility of various quantization implementations
* ✅︎
* ✅︎
* ✅︎
*
*
*
*
*
*
*
*
:::
- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- "✅︎" indicates that the quantization method is supported on the specified hardware.
- "✗" indicates that the quantization method is not supported on the specified hardware.
- ✅︎ indicates that the quantization method is supported on the specified hardware.
- indicates that the quantization method is not supported on the specified hardware.
:::{note}
This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
......
......@@ -45,7 +45,7 @@ To perform the same with an online mode launch the server:
```bash
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model facebook/opt-6.7b \
--seed 42 -tp 1 --speculative_model facebook/opt-125m --use-v2-block-manager \
--seed 42 -tp 1 --speculative_model facebook/opt-125m \
--num_speculative_tokens 5 --gpu_memory_utilization 0.8
```
......@@ -175,7 +175,7 @@ sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(
model="meta-llama/Meta-Llama-3-8B-Instruct",
tensor_parallel_size=4,
speculative_model="path/to/modified/eagle/model",
speculative_model="yuhuili/EAGLE-LLaMA3-Instruct-8B",
speculative_draft_tensor_parallel_size=1,
)
......@@ -190,14 +190,12 @@ for output in outputs:
A few important things to consider when using the EAGLE based draft models:
1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) cannot be
used directly with vLLM due to differences in the expected layer names and model definition.
To use these models with vLLM, use the [following script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d)
to convert them. Note that this script does not modify the model's weights.
In the above example, use the script to first convert
the [yuhuili/EAGLE-LLaMA3-Instruct-8B](https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-8B) model
and then use the converted checkpoint as the draft model in vLLM.
1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) should
be able to be loaded and used directly by vLLM after [PR 12304](https://github.com/vllm-project/vllm/pull/12304).
If you are using vllm version before [PR 12304](https://github.com/vllm-project/vllm/pull/12304), please use the
[script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d) to convert the speculative model,
and specify `speculative_model="path/to/modified/eagle/model"`. If weight-loading problems still occur when using
the latest version of vLLM, please leave a comment or raise an issue.
2. The EAGLE based draft models need to be run without tensor parallelism
(i.e. speculative_draft_tensor_parallel_size is set to 1), although
......
# Tool Calling
vLLM currently supports named function calling, as well as the `auto` and `none` options for the `tool_choice` field in the chat completion API. The `tool_choice` option `required` is **not yet supported** but on the roadmap.
vLLM currently supports named function calling, as well as the `auto` and `none` options for the `tool_choice` field in the chat completion API. The `tool_choice` option `required` is **not yet supported** but [on the roadmap](gh-issue:13002).
## Quickstart
......
......@@ -147,7 +147,7 @@ class Example:
return content
content += "## Example materials\n\n"
for file in self.other_files:
for file in sorted(self.other_files):
include = "include" if file.suffix == ".md" else "literalinclude"
content += f":::{{admonition}} {file.relative_to(self.path)}\n"
content += ":class: dropdown\n\n"
......@@ -194,7 +194,7 @@ def generate_examples():
path=EXAMPLE_DOC_DIR / "examples_offline_inference_index.md",
title="Offline Inference",
description=
"Offline inference examples demonstrate how to use vLLM in an offline setting, where the model is queried for predictions in batches.", # noqa: E501
"Offline inference examples demonstrate how to use vLLM in an offline setting, where the model is queried for predictions in batches. We recommend starting with <project:basic.md>.", # noqa: E501
caption="Examples",
),
}
......
......@@ -10,7 +10,7 @@ Second, install Python packages for vLLM CPU backend building:
```console
pip install --upgrade pip
pip install cmake>=3.26 wheel packaging ninja "setuptools-scm>=8" numpy
pip install "cmake>=3.26" wheel packaging ninja "setuptools-scm>=8" numpy
pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
```
......
This diff is collapsed.
This diff is collapsed.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment