Commit 43f3d9e6 authored by Rafael Vasquez, committed by GitHub

[CI/Build] Add markdown linter (#11857)


Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
parent b25cfab9
...@@ -13,7 +13,7 @@ on: ...@@ -13,7 +13,7 @@ on:
- "docs/**" - "docs/**"
jobs: jobs:
sphinx-lint: doc-lint:
runs-on: ubuntu-latest runs-on: ubuntu-latest
strategy: strategy:
matrix: matrix:
...@@ -29,4 +29,4 @@ jobs: ...@@ -29,4 +29,4 @@ jobs:
python -m pip install --upgrade pip python -m pip install --upgrade pip
pip install -r requirements-lint.txt pip install -r requirements-lint.txt
- name: Linting docs - name: Linting docs
run: tools/sphinx-lint.sh run: tools/doc-lint.sh
...@@ -16,4 +16,5 @@ make html ...@@ -16,4 +16,5 @@ make html
```bash ```bash
python -m http.server -d build/html/ python -m http.server -d build/html/
``` ```
Launch your browser and open localhost:8000. Launch your browser and open localhost:8000.
...@@ -9,4 +9,3 @@ interfaces_base ...@@ -9,4 +9,3 @@ interfaces_base
interfaces interfaces
adapters adapters
``` ```
...@@ -6,6 +6,7 @@ vLLM is a community project. Our compute resources for development and testing a ...@@ -6,6 +6,7 @@ vLLM is a community project. Our compute resources for development and testing a
<!-- Note: Please keep these consistent with README.md. --> <!-- Note: Please keep these consistent with README.md. -->
Cash Donations: Cash Donations:
- a16z - a16z
- Dropbox - Dropbox
- Sequoia Capital - Sequoia Capital
...@@ -13,6 +14,7 @@ Cash Donations: ...@@ -13,6 +14,7 @@ Cash Donations:
- ZhenFund - ZhenFund
Compute Resources: Compute Resources:
- AMD - AMD
- Anyscale - Anyscale
- AWS - AWS
......
...@@ -200,6 +200,7 @@ def get_mm_max_tokens_per_item(self, seq_len: int) -> Mapping[str, int]: ...@@ -200,6 +200,7 @@ def get_mm_max_tokens_per_item(self, seq_len: int) -> Mapping[str, int]:
```{note} ```{note}
Our [actual code](gh-file:vllm/model_executor/models/llava.py) is more abstracted to support vision encoders other than CLIP. Our [actual code](gh-file:vllm/model_executor/models/llava.py) is more abstracted to support vision encoders other than CLIP.
``` ```
::: :::
:::: ::::
...@@ -248,6 +249,7 @@ def get_dummy_processor_inputs( ...@@ -248,6 +249,7 @@ def get_dummy_processor_inputs(
mm_data=mm_data, mm_data=mm_data,
) )
``` ```
::: :::
:::: ::::
...@@ -312,6 +314,7 @@ def _get_mm_fields_config( ...@@ -312,6 +314,7 @@ def _get_mm_fields_config(
Our [actual code](gh-file:vllm/model_executor/models/llava.py) additionally supports Our [actual code](gh-file:vllm/model_executor/models/llava.py) additionally supports
pre-computed image embeddings, which can be passed to be model via the `image_embeds` argument. pre-computed image embeddings, which can be passed to be model via the `image_embeds` argument.
``` ```
::: :::
:::: ::::
...@@ -369,6 +372,7 @@ def _get_prompt_replacements( ...@@ -369,6 +372,7 @@ def _get_prompt_replacements(
), ),
] ]
``` ```
::: :::
:::: ::::
......
...@@ -37,8 +37,6 @@ pytest tests/ ...@@ -37,8 +37,6 @@ pytest tests/
Currently, the repository is not fully checked by `mypy`. Currently, the repository is not fully checked by `mypy`.
``` ```
# Contribution Guidelines
## Issues ## Issues
If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible. If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
......
...@@ -28,8 +28,8 @@ memory to share data between processes under the hood, particularly for tensor p ...@@ -28,8 +28,8 @@ memory to share data between processes under the hood, particularly for tensor p
You can build and run vLLM from source via the provided <gh-file:Dockerfile>. To build vLLM: You can build and run vLLM from source via the provided <gh-file:Dockerfile>. To build vLLM:
```console ```console
$ # optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2 # optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2
$ DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai
``` ```
```{note} ```{note}
......
...@@ -13,14 +13,14 @@ vLLM can be run on a cloud based GPU machine with [Cerebrium](https://www.cerebr ...@@ -13,14 +13,14 @@ vLLM can be run on a cloud based GPU machine with [Cerebrium](https://www.cerebr
To install the Cerebrium client, run: To install the Cerebrium client, run:
```console ```console
$ pip install cerebrium pip install cerebrium
$ cerebrium login cerebrium login
``` ```
Next, create your Cerebrium project, run: Next, create your Cerebrium project, run:
```console ```console
$ cerebrium init vllm-project cerebrium init vllm-project
``` ```
Next, to install the required packages, add the following to your cerebrium.toml: Next, to install the required packages, add the following to your cerebrium.toml:
...@@ -58,10 +58,10 @@ def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95): ...@@ -58,10 +58,10 @@ def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
Then, run the following code to deploy it to the cloud: Then, run the following code to deploy it to the cloud:
```console ```console
$ cerebrium deploy cerebrium deploy
``` ```
If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case` /run`) If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case`/run`)
```python ```python
curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \ curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
......
...@@ -13,16 +13,16 @@ vLLM can be run on a cloud based GPU machine with [dstack](https://dstack.ai/), ...@@ -13,16 +13,16 @@ vLLM can be run on a cloud based GPU machine with [dstack](https://dstack.ai/),
To install dstack client, run: To install dstack client, run:
```console ```console
$ pip install "dstack[all] pip install "dstack[all]
$ dstack server dstack server
``` ```
Next, to configure your dstack project, run: Next, to configure your dstack project, run:
```console ```console
$ mkdir -p vllm-dstack mkdir -p vllm-dstack
$ cd vllm-dstack cd vllm-dstack
$ dstack init dstack init
``` ```
Next, to provision a VM instance with LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`: Next, to provision a VM instance with LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:
......
...@@ -334,12 +334,12 @@ run: | ...@@ -334,12 +334,12 @@ run: |
1. Start the chat web UI: 1. Start the chat web UI:
```console ```console
sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm) sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm)
``` ```
2. Then, we can access the GUI at the returned gradio link: 2. Then, we can access the GUI at the returned gradio link:
```console ```console
| INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live | INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
``` ```
...@@ -7,7 +7,7 @@ vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-sta ...@@ -7,7 +7,7 @@ vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-sta
To install Llama Stack, run To install Llama Stack, run
```console ```console
$ pip install llama-stack -q pip install llama-stack -q
``` ```
## Inference using OpenAI Compatible API ## Inference using OpenAI Compatible API
......
...@@ -14,17 +14,17 @@ Before you begin, ensure that you have the following: ...@@ -14,17 +14,17 @@ Before you begin, ensure that you have the following:
## Deployment Steps ## Deployment Steps
1. **Create a PVC , Secret and Deployment for vLLM** 1. Create a PVC, Secret and Deployment for vLLM
PVC is used to store the model cache and it is optional, you can use hostPath or other storage options PVC is used to store the model cache and it is optional, you can use hostPath or other storage options
```yaml ```yaml
apiVersion: v1 apiVersion: v1
kind: PersistentVolumeClaim kind: PersistentVolumeClaim
metadata: metadata:
name: mistral-7b name: mistral-7b
namespace: default namespace: default
spec: spec:
accessModes: accessModes:
- ReadWriteOnce - ReadWriteOnce
resources: resources:
...@@ -32,36 +32,36 @@ spec: ...@@ -32,36 +32,36 @@ spec:
storage: 50Gi storage: 50Gi
storageClassName: default storageClassName: default
volumeMode: Filesystem volumeMode: Filesystem
``` ```
Secret is optional and only required for accessing gated models, you can skip this step if you are not using gated models Secret is optional and only required for accessing gated models, you can skip this step if you are not using gated models
```yaml ```yaml
apiVersion: v1 apiVersion: v1
kind: Secret kind: Secret
metadata: metadata:
name: hf-token-secret name: hf-token-secret
namespace: default namespace: default
type: Opaque type: Opaque
stringData: stringData:
token: "REPLACE_WITH_TOKEN" token: "REPLACE_WITH_TOKEN"
``` ```
Next to create the deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model. Next to create the deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model.
Here are two examples for using NVIDIA GPU and AMD GPU. Here are two examples for using NVIDIA GPU and AMD GPU.
- NVIDIA GPU NVIDIA GPU:
```yaml ```yaml
apiVersion: apps/v1 apiVersion: apps/v1
kind: Deployment kind: Deployment
metadata: metadata:
name: mistral-7b name: mistral-7b
namespace: default namespace: default
labels: labels:
app: mistral-7b app: mistral-7b
spec: spec:
replicas: 1 replicas: 1
selector: selector:
matchLabels: matchLabels:
...@@ -121,21 +121,21 @@ spec: ...@@ -121,21 +121,21 @@ spec:
port: 8000 port: 8000
initialDelaySeconds: 60 initialDelaySeconds: 60
periodSeconds: 5 periodSeconds: 5
``` ```
- AMD GPU AMD GPU:
You can refer to the `deployment.yaml` below if using AMD ROCm GPU like MI300X. You can refer to the `deployment.yaml` below if using AMD ROCm GPU like MI300X.
```yaml ```yaml
apiVersion: apps/v1 apiVersion: apps/v1
kind: Deployment kind: Deployment
metadata: metadata:
name: mistral-7b name: mistral-7b
namespace: default namespace: default
labels: labels:
app: mistral-7b app: mistral-7b
spec: spec:
replicas: 1 replicas: 1
selector: selector:
matchLabels: matchLabels:
...@@ -193,20 +193,21 @@ spec: ...@@ -193,20 +193,21 @@ spec:
mountPath: /root/.cache/huggingface mountPath: /root/.cache/huggingface
- name: shm - name: shm
mountPath: /dev/shm mountPath: /dev/shm
``` ```
You can get the full example with steps and sample yaml files from <https://github.com/ROCm/k8s-device-plugin/tree/master/example/vllm-serve>.
You can get the full example with steps and sample yaml files from <https://github.com/ROCm/k8s-device-plugin/tree/master/example/vllm-serve>.
2. **Create a Kubernetes Service for vLLM** 2. Create a Kubernetes Service for vLLM
Next, create a Kubernetes Service file to expose the `mistral-7b` deployment: Next, create a Kubernetes Service file to expose the `mistral-7b` deployment:
```yaml ```yaml
apiVersion: v1 apiVersion: v1
kind: Service kind: Service
metadata: metadata:
name: mistral-7b name: mistral-7b
namespace: default namespace: default
spec: spec:
ports: ports:
- name: http-mistral-7b - name: http-mistral-7b
port: 80 port: 80
...@@ -217,21 +218,21 @@ spec: ...@@ -217,21 +218,21 @@ spec:
app: mistral-7b app: mistral-7b
sessionAffinity: None sessionAffinity: None
type: ClusterIP type: ClusterIP
``` ```
3. **Deploy and Test** 3. Deploy and Test
Apply the deployment and service configurations using `kubectl apply -f <filename>`: Apply the deployment and service configurations using `kubectl apply -f <filename>`:
```console ```console
kubectl apply -f deployment.yaml kubectl apply -f deployment.yaml
kubectl apply -f service.yaml kubectl apply -f service.yaml
``` ```
To test the deployment, run the following `curl` command: To test the deployment, run the following `curl` command:
```console ```console
curl http://mistral-7b.default.svc.cluster.local/v1/completions \ curl http://mistral-7b.default.svc.cluster.local/v1/completions \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
-d '{ -d '{
"model": "mistralai/Mistral-7B-Instruct-v0.3", "model": "mistralai/Mistral-7B-Instruct-v0.3",
...@@ -239,9 +240,9 @@ curl http://mistral-7b.default.svc.cluster.local/v1/completions \ ...@@ -239,9 +240,9 @@ curl http://mistral-7b.default.svc.cluster.local/v1/completions \
"max_tokens": 7, "max_tokens": 7,
"temperature": 0 "temperature": 0
}' }'
``` ```
If the service is correctly deployed, you should receive a response from the vLLM model. If the service is correctly deployed, you should receive a response from the vLLM model.
## Conclusion ## Conclusion
......
...@@ -6,7 +6,7 @@ The core idea of [PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) is ...@@ -6,7 +6,7 @@ The core idea of [PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) is
To automatically cache the KV cache, we utilize the following key observation: Each KV block can be uniquely identified by the tokens within the block and the tokens in the prefix before the block. To automatically cache the KV cache, we utilize the following key observation: Each KV block can be uniquely identified by the tokens within the block and the tokens in the prefix before the block.
``` ```text
Block 1 Block 2 Block 3 Block 1 Block 2 Block 3
[A gentle breeze stirred] [the leaves as children] [laughed in the distance] [A gentle breeze stirred] [the leaves as children] [laughed in the distance]
Block 1: |<--- block tokens ---->| Block 1: |<--- block tokens ---->|
...@@ -14,19 +14,16 @@ Block 2: |<------- prefix ------>| |<--- block tokens --->| ...@@ -14,19 +14,16 @@ Block 2: |<------- prefix ------>| |<--- block tokens --->|
Block 3: |<------------------ prefix -------------------->| |<--- block tokens ---->| Block 3: |<------------------ prefix -------------------->| |<--- block tokens ---->|
``` ```
In the example above, the KV cache in the first block can be uniquely identified with the tokens “A gentle breeze stirred”. The third block can be uniquely identified with the tokens in the block “laughed in the distance”, along with the prefix tokens “A gentle breeze stirred the leaves as children”. Therefore, we can build the following one-to-one mapping: In the example above, the KV cache in the first block can be uniquely identified with the tokens “A gentle breeze stirred”. The third block can be uniquely identified with the tokens in the block “laughed in the distance”, along with the prefix tokens “A gentle breeze stirred the leaves as children”. Therefore, we can build the following one-to-one mapping:
``` ```text
hash(prefix tokens + block tokens) <--> KV Block hash(prefix tokens + block tokens) <--> KV Block
``` ```
With this mapping, we can add another indirection in vLLM’s KV cache management. Previously, each sequence in vLLM maintained a mapping from their logical KV blocks to physical blocks. To achieve automatic caching of KV blocks, we map the logical KV blocks to their hash value and maintain a global hash table of all the physical blocks. In this way, all the KV blocks sharing the same hash value (e.g., shared prefix blocks across two requests) can be mapped to the same physical block and share the memory space. With this mapping, we can add another indirection in vLLM’s KV cache management. Previously, each sequence in vLLM maintained a mapping from their logical KV blocks to physical blocks. To achieve automatic caching of KV blocks, we map the logical KV blocks to their hash value and maintain a global hash table of all the physical blocks. In this way, all the KV blocks sharing the same hash value (e.g., shared prefix blocks across two requests) can be mapped to the same physical block and share the memory space.
This design achieves automatic prefix caching without the need of maintaining a tree structure among the KV blocks. More specifically, all of the blocks are independent of each other and can be allocated and freed by itself, which enables us to manages the KV cache as ordinary caches in operating system. This design achieves automatic prefix caching without the need of maintaining a tree structure among the KV blocks. More specifically, all of the blocks are independent of each other and can be allocated and freed by itself, which enables us to manages the KV cache as ordinary caches in operating system.
## Generalized Caching Policy ## Generalized Caching Policy
Keeping all the KV blocks in a hash table enables vLLM to cache KV blocks from earlier requests to save memory and accelerate the computation of future requests. For example, if a new request shares the system prompt with the previous request, the KV cache of the shared prompt can directly be used for the new request without recomputation. However, the total KV cache space is limited and we have to decide which KV blocks to keep or evict when the cache is full. Keeping all the KV blocks in a hash table enables vLLM to cache KV blocks from earlier requests to save memory and accelerate the computation of future requests. For example, if a new request shares the system prompt with the previous request, the KV cache of the shared prompt can directly be used for the new request without recomputation. However, the total KV cache space is limited and we have to decide which KV blocks to keep or evict when the cache is full.
...@@ -41,5 +38,5 @@ Note that this eviction policy effectively implements the exact policy as in [Ra ...@@ -41,5 +38,5 @@ Note that this eviction policy effectively implements the exact policy as in [Ra
However, the hash-based KV cache management gives us the flexibility to handle more complicated serving scenarios and implement more complicated eviction policies beyond the policy above: However, the hash-based KV cache management gives us the flexibility to handle more complicated serving scenarios and implement more complicated eviction policies beyond the policy above:
- Multi-LoRA serving. When serving requests for multiple LoRA adapters, we can simply let the hash of each KV block to also include the LoRA ID the request is querying for to enable caching for all adapters. In this way, we can jointly manage the KV blocks for different adapters, which simplifies the system implementation and improves the global cache hit rate and efficiency. * Multi-LoRA serving. When serving requests for multiple LoRA adapters, we can simply let the hash of each KV block to also include the LoRA ID the request is querying for to enable caching for all adapters. In this way, we can jointly manage the KV blocks for different adapters, which simplifies the system implementation and improves the global cache hit rate and efficiency.
- Multi-modal models. When the user input includes more than just discrete tokens, we can use different hashing methods to handle the caching of inputs of different modalities. For example, perceptual hashing for images to cache similar input images. * Multi-modal models. When the user input includes more than just discrete tokens, we can use different hashing methods to handle the caching of inputs of different modalities. For example, perceptual hashing for images to cache similar input images.
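
The `hash(prefix tokens + block tokens) <--> KV Block` mapping described in the prefix caching hunks above can be sketched as follows. This is a minimal illustration, not vLLM's implementation: the names `block_hashes`, `BlockTable`, and `map_request`, the block size of 4, the use of SHA-256, and whitespace-split words standing in for token IDs are all assumptions made for the example.

```python
import hashlib
from typing import Dict, List, Sequence, Tuple

BLOCK_SIZE = 4  # tokens per KV block (hypothetical value for illustration)


def block_hashes(tokens: Sequence[str], block_size: int = BLOCK_SIZE) -> List[str]:
    """Return one hash per full block, computed over prefix tokens + block tokens."""
    hashes = []
    for i in range(len(tokens) // block_size):
        end = (i + 1) * block_size
        # tokens[:end] covers the prefix before the block and the block itself.
        payload = "\x1f".join(tokens[:end]).encode("utf-8")
        hashes.append(hashlib.sha256(payload).hexdigest())
    return hashes


class BlockTable:
    """Hypothetical global hash table from block hash to physical block id."""

    def __init__(self) -> None:
        self._hash_to_block: Dict[str, int] = {}
        self._next_id = 0

    def get_or_allocate(self, block_hash: str) -> Tuple[int, bool]:
        """Return (physical block id, cache hit?) for the given block hash."""
        if block_hash in self._hash_to_block:
            return self._hash_to_block[block_hash], True
        block_id = self._next_id
        self._next_id += 1
        self._hash_to_block[block_hash] = block_id
        return block_id, False


def map_request(table: BlockTable, tokens: Sequence[str]) -> List[Tuple[int, bool]]:
    """Map each full logical block of a request to a physical block."""
    return [table.get_or_allocate(h) for h in block_hashes(tokens)]


table = BlockTable()
req1 = "A gentle breeze stirred the leaves as children laughed in the distance".split()
req2 = "A gentle breeze stirred the leaves as children played by the river".split()
r1 = map_request(table, req1)
r2 = map_request(table, req2)
```

In this sketch the two requests share their first two blocks, so the second request reuses those physical blocks and allocates a new one only for its third block. Because each block is looked up independently by hash, no tree structure among blocks is needed, which is the property the design paragraph above relies on.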
...@@ -15,7 +15,7 @@ The main benefits are lower latency and memory usage. ...@@ -15,7 +15,7 @@ The main benefits are lower latency and memory usage.
You can quantize your own models by installing AutoAWQ or picking one of the [400+ models on Huggingface](https://huggingface.co/models?sort=trending&search=awq). You can quantize your own models by installing AutoAWQ or picking one of the [400+ models on Huggingface](https://huggingface.co/models?sort=trending&search=awq).
```console ```console
$ pip install autoawq pip install autoawq
``` ```
After installing AutoAWQ, you are ready to quantize a model. Here is an example of how to quantize `mistralai/Mistral-7B-Instruct-v0.2`: After installing AutoAWQ, you are ready to quantize a model. Here is an example of how to quantize `mistralai/Mistral-7B-Instruct-v0.2`:
...@@ -47,7 +47,7 @@ print(f'Model is quantized and saved at "{quant_path}"') ...@@ -47,7 +47,7 @@ print(f'Model is quantized and saved at "{quant_path}"')
To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command: To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command:
```console ```console
$ python examples/offline_inference/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq python examples/offline_inference/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
``` ```
AWQ models are also supported directly through the LLM entrypoint: AWQ models are also supported directly through the LLM entrypoint:
......
...@@ -9,7 +9,7 @@ Compared to other quantization methods, BitsAndBytes eliminates the need for cal ...@@ -9,7 +9,7 @@ Compared to other quantization methods, BitsAndBytes eliminates the need for cal
Below are the steps to utilize BitsAndBytes with vLLM. Below are the steps to utilize BitsAndBytes with vLLM.
```console ```console
$ pip install bitsandbytes>=0.45.0 pip install bitsandbytes>=0.45.0
``` ```
vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoint. vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoint.
...@@ -17,7 +17,7 @@ vLLM reads the model's config file and supports both in-flight quantization and ...@@ -17,7 +17,7 @@ vLLM reads the model's config file and supports both in-flight quantization and
You can find bitsandbytes quantized models on <https://huggingface.co/models?other=bitsandbytes>. You can find bitsandbytes quantized models on <https://huggingface.co/models?other=bitsandbytes>.
And usually, these repositories have a config.json file that includes a quantization_config section. And usually, these repositories have a config.json file that includes a quantization_config section.
## Read quantized checkpoint. ## Read quantized checkpoint
```python ```python
from vllm import LLM from vllm import LLM
...@@ -37,10 +37,11 @@ model_id = "huggyllama/llama-7b" ...@@ -37,10 +37,11 @@ model_id = "huggyllama/llama-7b"
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, \ llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, \
quantization="bitsandbytes", load_format="bitsandbytes") quantization="bitsandbytes", load_format="bitsandbytes")
``` ```
## OpenAI Compatible Server ## OpenAI Compatible Server
Append the following to your 4bit model arguments: Append the following to your 4bit model arguments:
``` ```console
--quantization bitsandbytes --load-format bitsandbytes --quantization bitsandbytes --load-format bitsandbytes
``` ```
...@@ -41,7 +41,7 @@ Currently, we load the model at original precision before quantizing down to 8-b ...@@ -41,7 +41,7 @@ Currently, we load the model at original precision before quantizing down to 8-b
To produce performant FP8 quantized models with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library: To produce performant FP8 quantized models with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
```console ```console
$ pip install llmcompressor pip install llmcompressor
``` ```
## Quantization Process ## Quantization Process
...@@ -98,7 +98,7 @@ tokenizer.save_pretrained(SAVE_DIR) ...@@ -98,7 +98,7 @@ tokenizer.save_pretrained(SAVE_DIR)
Install `vllm` and `lm-evaluation-harness`: Install `vllm` and `lm-evaluation-harness`:
```console ```console
$ pip install vllm lm-eval==0.4.4 pip install vllm lm-eval==0.4.4
``` ```
Load and run the model in `vllm`: Load and run the model in `vllm`:
......
...@@ -17,7 +17,7 @@ unquantized model through a quantizer tool (e.g. AMD quantizer or NVIDIA AMMO). ...@@ -17,7 +17,7 @@ unquantized model through a quantizer tool (e.g. AMD quantizer or NVIDIA AMMO).
To install AMMO (AlgorithMic Model Optimization): To install AMMO (AlgorithMic Model Optimization):
```console ```console
$ pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo
``` ```
Studies have shown that FP8 E4M3 quantization typically only minimally degrades inference accuracy. The most recent silicon Studies have shown that FP8 E4M3 quantization typically only minimally degrades inference accuracy. The most recent silicon
......
...@@ -13,16 +13,16 @@ Currently, vllm only supports loading single-file GGUF models. If you have a mul ...@@ -13,16 +13,16 @@ Currently, vllm only supports loading single-file GGUF models. If you have a mul
To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command: To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:
```console ```console
$ wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
$ # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion. # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
$ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
``` ```
You can also add `--tensor-parallel-size 2` to enable tensor parallelism inference with 2 GPUs: You can also add `--tensor-parallel-size 2` to enable tensor parallelism inference with 2 GPUs:
```console ```console
$ # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion. # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
$ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2 vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
``` ```
```{warning} ```{warning}
......
...@@ -16,7 +16,7 @@ INT8 computation is supported on NVIDIA GPUs with compute capability > 7.5 (Turi ...@@ -16,7 +16,7 @@ INT8 computation is supported on NVIDIA GPUs with compute capability > 7.5 (Turi
To use INT8 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library: To use INT8 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
```console ```console
$ pip install llmcompressor pip install llmcompressor
``` ```
## Quantization Process ## Quantization Process
......
...@@ -207,7 +207,6 @@ A few important things to consider when using the EAGLE based draft models: ...@@ -207,7 +207,6 @@ A few important things to consider when using the EAGLE based draft models:
reported in the reference implementation [here](https://github.com/SafeAILab/EAGLE). This issue is under reported in the reference implementation [here](https://github.com/SafeAILab/EAGLE). This issue is under
investigation and tracked here: [https://github.com/vllm-project/vllm/issues/9565](https://github.com/vllm-project/vllm/issues/9565). investigation and tracked here: [https://github.com/vllm-project/vllm/issues/9565](https://github.com/vllm-project/vllm/issues/9565).
A variety of EAGLE draft models are available on the Hugging Face hub: A variety of EAGLE draft models are available on the Hugging Face hub:
| Base Model | EAGLE on Hugging Face | # EAGLE Parameters | | Base Model | EAGLE on Hugging Face | # EAGLE Parameters |
...@@ -224,7 +223,6 @@ A variety of EAGLE draft models are available on the Hugging Face hub: ...@@ -224,7 +223,6 @@ A variety of EAGLE draft models are available on the Hugging Face hub:
| Qwen2-7B-Instruct | yuhuili/EAGLE-Qwen2-7B-Instruct | 0.26B | | Qwen2-7B-Instruct | yuhuili/EAGLE-Qwen2-7B-Instruct | 0.26B |
| Qwen2-72B-Instruct | yuhuili/EAGLE-Qwen2-72B-Instruct | 1.05B | | Qwen2-72B-Instruct | yuhuili/EAGLE-Qwen2-72B-Instruct | 1.05B |
## Lossless guarantees of Speculative Decoding ## Lossless guarantees of Speculative Decoding
In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of
...@@ -250,8 +248,6 @@ speculative decoding, breaking down the guarantees into three key areas: ...@@ -250,8 +248,6 @@ speculative decoding, breaking down the guarantees into three key areas:
same request across runs. For more details, see the FAQ section same request across runs. For more details, see the FAQ section
titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq). titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq).
**Conclusion**
While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
can occur due to following factors: can occur due to following factors:
...@@ -259,8 +255,6 @@ can occur due to following factors: ...@@ -259,8 +255,6 @@ can occur due to following factors:
- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially - **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
due to non-deterministic behavior in batched operations or numerical instability. due to non-deterministic behavior in batched operations or numerical instability.
**Mitigation Strategies**
For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq). For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq).
## Resources for vLLM contributors ## Resources for vLLM contributors
......