Unverified Commit 43f3d9e6 authored by Rafael Vasquez's avatar Rafael Vasquez Committed by GitHub
Browse files

[CI/Build] Add markdown linter (#11857)


Signed-off-by: default avatarRafael Vasquez <rafvasq21@gmail.com>
parent b25cfab9
......@@ -13,7 +13,7 @@ on:
- "docs/**"
jobs:
sphinx-lint:
doc-lint:
runs-on: ubuntu-latest
strategy:
matrix:
......@@ -29,4 +29,4 @@ jobs:
python -m pip install --upgrade pip
pip install -r requirements-lint.txt
- name: Linting docs
run: tools/sphinx-lint.sh
run: tools/doc-lint.sh
......@@ -16,4 +16,5 @@ make html
```bash
python -m http.server -d build/html/
```
Launch your browser and open localhost:8000.
......@@ -9,4 +9,3 @@ interfaces_base
interfaces
adapters
```
......@@ -6,6 +6,7 @@ vLLM is a community project. Our compute resources for development and testing a
<!-- Note: Please keep these consistent with README.md. -->
Cash Donations:
- a16z
- Dropbox
- Sequoia Capital
......@@ -13,6 +14,7 @@ Cash Donations:
- ZhenFund
Compute Resources:
- AMD
- Anyscale
- AWS
......
......@@ -200,6 +200,7 @@ def get_mm_max_tokens_per_item(self, seq_len: int) -> Mapping[str, int]:
```{note}
Our [actual code](gh-file:vllm/model_executor/models/llava.py) is more abstracted to support vision encoders other than CLIP.
```
:::
::::
......@@ -248,6 +249,7 @@ def get_dummy_processor_inputs(
mm_data=mm_data,
)
```
:::
::::
......@@ -312,6 +314,7 @@ def _get_mm_fields_config(
Our [actual code](gh-file:vllm/model_executor/models/llava.py) additionally supports
pre-computed image embeddings, which can be passed to be model via the `image_embeds` argument.
```
:::
::::
......@@ -369,6 +372,7 @@ def _get_prompt_replacements(
),
]
```
:::
::::
......
......@@ -37,8 +37,6 @@ pytest tests/
Currently, the repository is not fully checked by `mypy`.
```
# Contribution Guidelines
## Issues
If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
......
......@@ -28,8 +28,8 @@ memory to share data between processes under the hood, particularly for tensor p
You can build and run vLLM from source via the provided <gh-file:Dockerfile>. To build vLLM:
```console
$ # optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2
$ DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai
# optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2
DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai
```
```{note}
......
......@@ -13,14 +13,14 @@ vLLM can be run on a cloud based GPU machine with [Cerebrium](https://www.cerebr
To install the Cerebrium client, run:
```console
$ pip install cerebrium
$ cerebrium login
pip install cerebrium
cerebrium login
```
Next, create your Cerebrium project, run:
```console
$ cerebrium init vllm-project
cerebrium init vllm-project
```
Next, to install the required packages, add the following to your cerebrium.toml:
......@@ -58,10 +58,10 @@ def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
Then, run the following code to deploy it to the cloud:
```console
$ cerebrium deploy
cerebrium deploy
```
If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case` /run`)
If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case`/run`)
```python
curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
......
......@@ -13,16 +13,16 @@ vLLM can be run on a cloud based GPU machine with [dstack](https://dstack.ai/),
To install dstack client, run:
```console
$ pip install "dstack[all]
$ dstack server
pip install "dstack[all]
dstack server
```
Next, to configure your dstack project, run:
```console
$ mkdir -p vllm-dstack
$ cd vllm-dstack
$ dstack init
mkdir -p vllm-dstack
cd vllm-dstack
dstack init
```
Next, to provision a VM instance with LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:
......
......@@ -334,12 +334,12 @@ run: |
1. Start the chat web UI:
```console
sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm)
```
```console
sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm)
```
2. Then, we can access the GUI at the returned gradio link:
```console
| INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
```
```console
| INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
```
......@@ -7,7 +7,7 @@ vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-sta
To install Llama Stack, run
```console
$ pip install llama-stack -q
pip install llama-stack -q
```
## Inference using OpenAI Compatible API
......
......@@ -14,234 +14,235 @@ Before you begin, ensure that you have the following:
## Deployment Steps
1. **Create a PVC , Secret and Deployment for vLLM**
PVC is used to store the model cache and it is optional, you can use hostPath or other storage options
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: mistral-7b
namespace: default
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 50Gi
storageClassName: default
volumeMode: Filesystem
```
Secret is optional and only required for accessing gated models, you can skip this step if you are not using gated models
```yaml
apiVersion: v1
kind: Secret
metadata:
name: hf-token-secret
namespace: default
type: Opaque
stringData:
token: "REPLACE_WITH_TOKEN"
```
Next to create the deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model.
Here are two examples for using NVIDIA GPU and AMD GPU.
- NVIDIA GPU
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: mistral-7b
namespace: default
labels:
app: mistral-7b
spec:
replicas: 1
selector:
matchLabels:
app: mistral-7b
template:
metadata:
labels:
app: mistral-7b
spec:
volumes:
- name: cache-volume
persistentVolumeClaim:
claimName: mistral-7b
# vLLM needs to access the host's shared memory for tensor parallel inference.
- name: shm
emptyDir:
medium: Memory
sizeLimit: "2Gi"
containers:
- name: mistral-7b
image: vllm/vllm-openai:latest
command: ["/bin/sh", "-c"]
args: [
"vllm serve mistralai/Mistral-7B-Instruct-v0.3 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
]
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
ports:
- containerPort: 8000
1. Create a PVC, Secret and Deployment for vLLM
PVC is used to store the model cache and it is optional, you can use hostPath or other storage options
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: mistral-7b
namespace: default
spec:
accessModes:
- ReadWriteOnce
resources:
limits:
cpu: "10"
memory: 20G
nvidia.com/gpu: "1"
requests:
cpu: "2"
memory: 6G
nvidia.com/gpu: "1"
volumeMounts:
- mountPath: /root/.cache/huggingface
name: cache-volume
- name: shm
mountPath: /dev/shm
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 5
```
- AMD GPU
You can refer to the `deployment.yaml` below if using AMD ROCm GPU like MI300X.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: mistral-7b
namespace: default
labels:
app: mistral-7b
spec:
replicas: 1
selector:
matchLabels:
app: mistral-7b
template:
metadata:
labels:
app: mistral-7b
spec:
volumes:
# PVC
- name: cache-volume
persistentVolumeClaim:
claimName: mistral-7b
# vLLM needs to access the host's shared memory for tensor parallel inference.
- name: shm
emptyDir:
medium: Memory
sizeLimit: "8Gi"
hostNetwork: true
hostIPC: true
containers:
- name: mistral-7b
image: rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
securityContext:
seccompProfile:
type: Unconfined
runAsGroup: 44
capabilities:
add:
- SYS_PTRACE
command: ["/bin/sh", "-c"]
args: [
"vllm serve mistralai/Mistral-7B-v0.3 --port 8000 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
]
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
storage: 50Gi
storageClassName: default
volumeMode: Filesystem
```
Secret is optional and only required for accessing gated models, you can skip this step if you are not using gated models
```yaml
apiVersion: v1
kind: Secret
metadata:
name: hf-token-secret
namespace: default
type: Opaque
stringData:
token: "REPLACE_WITH_TOKEN"
```
Next to create the deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model.
Here are two examples for using NVIDIA GPU and AMD GPU.
NVIDIA GPU:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: mistral-7b
namespace: default
labels:
app: mistral-7b
spec:
replicas: 1
selector:
matchLabels:
app: mistral-7b
template:
metadata:
labels:
app: mistral-7b
spec:
volumes:
- name: cache-volume
persistentVolumeClaim:
claimName: mistral-7b
# vLLM needs to access the host's shared memory for tensor parallel inference.
- name: shm
emptyDir:
medium: Memory
sizeLimit: "2Gi"
containers:
- name: mistral-7b
image: vllm/vllm-openai:latest
command: ["/bin/sh", "-c"]
args: [
"vllm serve mistralai/Mistral-7B-Instruct-v0.3 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
]
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
ports:
- containerPort: 8000
resources:
limits:
cpu: "10"
memory: 20G
nvidia.com/gpu: "1"
requests:
cpu: "2"
memory: 6G
nvidia.com/gpu: "1"
volumeMounts:
- mountPath: /root/.cache/huggingface
name: cache-volume
- name: shm
mountPath: /dev/shm
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 5
```
AMD GPU:
You can refer to the `deployment.yaml` below if using AMD ROCm GPU like MI300X.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: mistral-7b
namespace: default
labels:
app: mistral-7b
spec:
replicas: 1
selector:
matchLabels:
app: mistral-7b
template:
metadata:
labels:
app: mistral-7b
spec:
volumes:
# PVC
- name: cache-volume
persistentVolumeClaim:
claimName: mistral-7b
# vLLM needs to access the host's shared memory for tensor parallel inference.
- name: shm
emptyDir:
medium: Memory
sizeLimit: "8Gi"
hostNetwork: true
hostIPC: true
containers:
- name: mistral-7b
image: rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
securityContext:
seccompProfile:
type: Unconfined
runAsGroup: 44
capabilities:
add:
- SYS_PTRACE
command: ["/bin/sh", "-c"]
args: [
"vllm serve mistralai/Mistral-7B-v0.3 --port 8000 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
]
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
ports:
- containerPort: 8000
resources:
limits:
cpu: "10"
memory: 20G
amd.com/gpu: "1"
requests:
cpu: "6"
memory: 6G
amd.com/gpu: "1"
volumeMounts:
- name: cache-volume
mountPath: /root/.cache/huggingface
- name: shm
mountPath: /dev/shm
```
You can get the full example with steps and sample yaml files from <https://github.com/ROCm/k8s-device-plugin/tree/master/example/vllm-serve>.
2. Create a Kubernetes Service for vLLM
Next, create a Kubernetes Service file to expose the `mistral-7b` deployment:
```yaml
apiVersion: v1
kind: Service
metadata:
name: mistral-7b
namespace: default
spec:
ports:
- containerPort: 8000
resources:
limits:
cpu: "10"
memory: 20G
amd.com/gpu: "1"
requests:
cpu: "6"
memory: 6G
amd.com/gpu: "1"
volumeMounts:
- name: cache-volume
mountPath: /root/.cache/huggingface
- name: shm
mountPath: /dev/shm
```
You can get the full example with steps and sample yaml files from <https://github.com/ROCm/k8s-device-plugin/tree/master/example/vllm-serve>.
2. **Create a Kubernetes Service for vLLM**
Next, create a Kubernetes Service file to expose the `mistral-7b` deployment:
```yaml
apiVersion: v1
kind: Service
metadata:
name: mistral-7b
namespace: default
spec:
ports:
- name: http-mistral-7b
port: 80
protocol: TCP
targetPort: 8000
# The label selector should match the deployment labels & it is useful for prefix caching feature
selector:
app: mistral-7b
sessionAffinity: None
type: ClusterIP
```
3. **Deploy and Test**
Apply the deployment and service configurations using `kubectl apply -f <filename>`:
```console
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
```
To test the deployment, run the following `curl` command:
```console
curl http://mistral-7b.default.svc.cluster.local/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
```
If the service is correctly deployed, you should receive a response from the vLLM model.
- name: http-mistral-7b
port: 80
protocol: TCP
targetPort: 8000
# The label selector should match the deployment labels & it is useful for prefix caching feature
selector:
app: mistral-7b
sessionAffinity: None
type: ClusterIP
```
3. Deploy and Test
Apply the deployment and service configurations using `kubectl apply -f <filename>`:
```console
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
```
To test the deployment, run the following `curl` command:
```console
curl http://mistral-7b.default.svc.cluster.local/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
```
If the service is correctly deployed, you should receive a response from the vLLM model.
## Conclusion
......
......@@ -6,7 +6,7 @@ The core idea of [PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) is
To automatically cache the KV cache, we utilize the following key observation: Each KV block can be uniquely identified by the tokens within the block and the tokens in the prefix before the block.
```
```text
Block 1 Block 2 Block 3
[A gentle breeze stirred] [the leaves as children] [laughed in the distance]
Block 1: |<--- block tokens ---->|
......@@ -14,19 +14,16 @@ Block 2: |<------- prefix ------>| |<--- block tokens --->|
Block 3: |<------------------ prefix -------------------->| |<--- block tokens ---->|
```
In the example above, the KV cache in the first block can be uniquely identified with the tokens “A gentle breeze stirred”. The third block can be uniquely identified with the tokens in the block “laughed in the distance”, along with the prefix tokens “A gentle breeze stirred the leaves as children”. Therefore, we can build the following one-to-one mapping:
```
```text
hash(prefix tokens + block tokens) <--> KV Block
```
With this mapping, we can add another indirection in vLLM’s KV cache management. Previously, each sequence in vLLM maintained a mapping from their logical KV blocks to physical blocks. To achieve automatic caching of KV blocks, we map the logical KV blocks to their hash value and maintain a global hash table of all the physical blocks. In this way, all the KV blocks sharing the same hash value (e.g., shared prefix blocks across two requests) can be mapped to the same physical block and share the memory space.
This design achieves automatic prefix caching without the need of maintaining a tree structure among the KV blocks. More specifically, all of the blocks are independent of each other and can be allocated and freed by itself, which enables us to manages the KV cache as ordinary caches in operating system.
## Generalized Caching Policy
Keeping all the KV blocks in a hash table enables vLLM to cache KV blocks from earlier requests to save memory and accelerate the computation of future requests. For example, if a new request shares the system prompt with the previous request, the KV cache of the shared prompt can directly be used for the new request without recomputation. However, the total KV cache space is limited and we have to decide which KV blocks to keep or evict when the cache is full.
......@@ -41,5 +38,5 @@ Note that this eviction policy effectively implements the exact policy as in [Ra
However, the hash-based KV cache management gives us the flexibility to handle more complicated serving scenarios and implement more complicated eviction policies beyond the policy above:
- Multi-LoRA serving. When serving requests for multiple LoRA adapters, we can simply let the hash of each KV block to also include the LoRA ID the request is querying for to enable caching for all adapters. In this way, we can jointly manage the KV blocks for different adapters, which simplifies the system implementation and improves the global cache hit rate and efficiency.
- Multi-modal models. When the user input includes more than just discrete tokens, we can use different hashing methods to handle the caching of inputs of different modalities. For example, perceptual hashing for images to cache similar input images.
* Multi-LoRA serving. When serving requests for multiple LoRA adapters, we can simply let the hash of each KV block to also include the LoRA ID the request is querying for to enable caching for all adapters. In this way, we can jointly manage the KV blocks for different adapters, which simplifies the system implementation and improves the global cache hit rate and efficiency.
* Multi-modal models. When the user input includes more than just discrete tokens, we can use different hashing methods to handle the caching of inputs of different modalities. For example, perceptual hashing for images to cache similar input images.
......@@ -15,7 +15,7 @@ The main benefits are lower latency and memory usage.
You can quantize your own models by installing AutoAWQ or picking one of the [400+ models on Huggingface](https://huggingface.co/models?sort=trending&search=awq).
```console
$ pip install autoawq
pip install autoawq
```
After installing AutoAWQ, you are ready to quantize a model. Here is an example of how to quantize `mistralai/Mistral-7B-Instruct-v0.2`:
......@@ -47,7 +47,7 @@ print(f'Model is quantized and saved at "{quant_path}"')
To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command:
```console
$ python examples/offline_inference/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
python examples/offline_inference/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
```
AWQ models are also supported directly through the LLM entrypoint:
......
......@@ -9,7 +9,7 @@ Compared to other quantization methods, BitsAndBytes eliminates the need for cal
Below are the steps to utilize BitsAndBytes with vLLM.
```console
$ pip install bitsandbytes>=0.45.0
pip install bitsandbytes>=0.45.0
```
vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoint.
......@@ -17,7 +17,7 @@ vLLM reads the model's config file and supports both in-flight quantization and
You can find bitsandbytes quantized models on <https://huggingface.co/models?other=bitsandbytes>.
And usually, these repositories have a config.json file that includes a quantization_config section.
## Read quantized checkpoint.
## Read quantized checkpoint
```python
from vllm import LLM
......@@ -37,10 +37,11 @@ model_id = "huggyllama/llama-7b"
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, \
quantization="bitsandbytes", load_format="bitsandbytes")
```
## OpenAI Compatible Server
Append the following to your 4bit model arguments:
```
```console
--quantization bitsandbytes --load-format bitsandbytes
```
......@@ -41,7 +41,7 @@ Currently, we load the model at original precision before quantizing down to 8-b
To produce performant FP8 quantized models with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
```console
$ pip install llmcompressor
pip install llmcompressor
```
## Quantization Process
......@@ -98,7 +98,7 @@ tokenizer.save_pretrained(SAVE_DIR)
Install `vllm` and `lm-evaluation-harness`:
```console
$ pip install vllm lm-eval==0.4.4
pip install vllm lm-eval==0.4.4
```
Load and run the model in `vllm`:
......
......@@ -17,7 +17,7 @@ unquantized model through a quantizer tool (e.g. AMD quantizer or NVIDIA AMMO).
To install AMMO (AlgorithMic Model Optimization):
```console
$ pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo
pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo
```
Studies have shown that FP8 E4M3 quantization typically only minimally degrades inference accuracy. The most recent silicon
......
......@@ -13,16 +13,16 @@ Currently, vllm only supports loading single-file GGUF models. If you have a mul
To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:
```console
$ wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
$ # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
$ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
```
You can also add `--tensor-parallel-size 2` to enable tensor parallelism inference with 2 GPUs:
```console
$ # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
$ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
```
```{warning}
......
......@@ -16,7 +16,7 @@ INT8 computation is supported on NVIDIA GPUs with compute capability > 7.5 (Turi
To use INT8 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
```console
$ pip install llmcompressor
pip install llmcompressor
```
## Quantization Process
......
......@@ -192,11 +192,11 @@ A few important things to consider when using the EAGLE based draft models:
1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) cannot be
used directly with vLLM due to differences in the expected layer names and model definition.
To use these models with vLLM, use the [following script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d)
To use these models with vLLM, use the [following script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d)
to convert them. Note that this script does not modify the model's weights.
In the above example, use the script to first convert
the [yuhuili/EAGLE-LLaMA3-Instruct-8B](https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-8B) model
the [yuhuili/EAGLE-LLaMA3-Instruct-8B](https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-8B) model
and then use the converted checkpoint as the draft model in vLLM.
2. The EAGLE based draft models need to be run without tensor parallelism
......@@ -207,7 +207,6 @@ A few important things to consider when using the EAGLE based draft models:
reported in the reference implementation [here](https://github.com/SafeAILab/EAGLE). This issue is under
investigation and tracked here: [https://github.com/vllm-project/vllm/issues/9565](https://github.com/vllm-project/vllm/issues/9565).
A variety of EAGLE draft models are available on the Hugging Face hub:
| Base Model | EAGLE on Hugging Face | # EAGLE Parameters |
......@@ -224,7 +223,6 @@ A variety of EAGLE draft models are available on the Hugging Face hub:
| Qwen2-7B-Instruct | yuhuili/EAGLE-Qwen2-7B-Instruct | 0.26B |
| Qwen2-72B-Instruct | yuhuili/EAGLE-Qwen2-72B-Instruct | 1.05B |
## Lossless guarantees of Speculative Decoding
In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of
......@@ -250,8 +248,6 @@ speculative decoding, breaking down the guarantees into three key areas:
same request across runs. For more details, see the FAQ section
titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq).
**Conclusion**
While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
can occur due to following factors:
......@@ -259,8 +255,6 @@ can occur due to following factors:
- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
due to non-deterministic behavior in batched operations or numerical instability.
**Mitigation Strategies**
For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq).
## Resources for vLLM contributors
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment