Merge tag 'v0.5.4' into v0.5.4-dtk24.04.1

e661d594 · zhuwenwen · 6b16ea2e · 4db5176d · e661d594 · e661d594
Commit e661d594 authored Aug 12, 2024 by zhuwenwen
20 changed files
--- a/docs/source/performance_benchmark/benchmarks.rst
+++ b/docs/source/performance_benchmark/benchmarks.rst
+.. _benchmarks:
+Benchmark suites of vLLM
+========================
+vLLM contains two sets of benchmarks:
+- **Performance benchmarks**: benchmark vLLM's performance under various workloads at a high frequency (whenever a pull request (PR) is merged into vLLM). See `vLLM performance dashboard <https://perf.vllm.ai>`_ for the latest performance results.
+- **Nightly benchmarks**: compare vLLM's performance against alternatives (tgi, trt-llm, and lmdeploy) when there are major updates of vLLM (e.g., bumping up to a new version). The latest results are available in the `vLLM GitHub README <https://github.com/vllm-project/vllm/blob/main/README.md>`_.
+Trigger a benchmark
+-------------------
+The performance benchmarks and nightly benchmarks can be triggered by submitting a PR to vLLM and labeling it with ``perf-benchmarks`` and ``nightly-benchmarks``, respectively.
+.. note::
+   Please refer to `vLLM performance benchmark descriptions <https://github.com/vllm-project/vllm/blob/main/.buildkite/nightly-benchmarks/tests/descriptions.md>`_ and `vLLM nightly benchmark descriptions <https://github.com/vllm-project/vllm/blob/main/.buildkite/nightly-benchmarks/nightly-descriptions.md>`_ for detailed descriptions on benchmark environment, workload and metrics.
--- a/docs/source/quantization/bnb.rst
+++ b/docs/source/quantization/bnb.rst
+.. _bits_and_bytes:
+BitsAndBytes
+==================
+vLLM now supports `BitsAndBytes <https://github.com/TimDettmers/bitsandbytes>`_ for more efficient model inference.
+BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy.
+Compared to other quantization methods, BitsAndBytes eliminates the need for calibrating the quantized model with input data.
+Below are the steps to utilize BitsAndBytes with vLLM.
+.. code-block:: console
+    $ pip install "bitsandbytes>=0.42.0"
+vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoints.
+You can find bitsandbytes quantized models on https://huggingface.co/models?other=bitsandbytes.
+And usually, these repositories have a config.json file that includes a quantization_config section.
+Read quantized checkpoint.
+--------------------------
+.. code-block:: python
+    from vllm import LLM
+    import torch
+    # unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
+    model_id = "unsloth/tinyllama-bnb-4bit"
+    llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, \
+    quantization="bitsandbytes", load_format="bitsandbytes")
+Inflight quantization: load as 4bit quantization
+------------------------------------------------
+.. code-block:: python
+    from vllm import LLM
+    import torch
+    model_id = "huggyllama/llama-7b"
+    llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, \
+    quantization="bitsandbytes", load_format="bitsandbytes")
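Note on the two loading modes above: one way to tell a pre-quantized checkpoint from a plain one is to look for the ``quantization_config`` section in ``config.json``. The sketch below is a minimal illustration only. The helper name ``is_prequantized_bnb`` is hypothetical, and the ``quant_method`` key is an assumption about how such configs are usually written; this is not vLLM's loader logic.

```python
import json


def is_prequantized_bnb(config: dict) -> bool:
    """Return True if a parsed config.json looks like a bitsandbytes checkpoint."""
    qcfg = config.get("quantization_config")
    if not isinstance(qcfg, dict):
        # No quantization section: treat as a plain checkpoint
        # (a candidate for in-flight quantization).
        return False
    # Assumption: the section names its method under "quant_method".
    return qcfg.get("quant_method") == "bitsandbytes"


# Two made-up config.json payloads for illustration.
prequantized = json.loads(
    '{"model_type": "llama", '
    '"quantization_config": {"quant_method": "bitsandbytes", "load_in_4bit": true}}'
)
plain = json.loads('{"model_type": "llama"}')

print(is_prequantized_bnb(prequantized))  # True
print(is_prequantized_bnb(plain))         # False
```

Either way, the ``LLM(...)`` call shown in the docs stays the same; the check only tells you which of the two paths a given repository would take.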
--- a/docs/source/serving/distributed_serving.rst
+++ b/docs/source/serving/distributed_serving.rst
@@ -50,7 +50,7 @@ You can also additionally specify :code:`--pipeline-parallel-size` to enable pip
    $     --pipeline-parallel-size 2
 .. note::
-    Pipeline parallel is a beta feature. It is only supported for online serving as well as LLaMa, GPT2, and Mixtral style models.
+    Pipeline parallel is a beta feature. It is supported only for online serving and only for LLaMa, GPT2, Mixtral, Qwen, Qwen2, and Nemotron style models.
 Multi-Node Inference and Serving
 --------------------------------
@@ -79,7 +79,7 @@ On the rest of the worker nodes, run the following command:
    $                   --worker \
    $                   /path/to/the/huggingface/home/in/this/node
-Then you get a ray cluster of containers. Note that you need to keep the shells running these commands alive to hold the cluster. Any shell disconnect will terminate the cluster.
+Then you get a ray cluster of containers. Note that you need to keep the shells running these commands alive to hold the cluster. Any shell disconnect will terminate the cluster. In addition, please note that the argument ``ip_of_head_node`` should be the IP address of the head node, which is accessible by all the worker nodes. A common misunderstanding is to use the IP address of the worker node, which is not correct.
 Then, on any node, use ``docker exec -it node /bin/bash`` to enter the container, execute ``ray status`` to check the status of the Ray cluster. You should see the right number of nodes and GPUs.
@@ -101,7 +101,7 @@ You can also use tensor parallel without pipeline parallel, just set the tensor
 To make tensor parallel performant, you should make sure the communication between nodes is efficient, e.g. using high-speed network cards like Infiniband. To correctly set up the cluster to use Infiniband, append additional arguments like ``--privileged -e NCCL_IB_HCA=mlx5`` to the ``run_cluster.sh`` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm if the Infiniband is working is to run vLLM with ``NCCL_DEBUG=TRACE`` environment variable set, e.g. ``NCCL_DEBUG=TRACE vllm serve ...`` and check the logs for the NCCL version and the network used. If you find ``[send] via NET/Socket`` in the logs, it means NCCL uses raw TCP Socket, which is not efficient for cross-node tensor parallel. If you find ``[send] via NET/IB/GDRDMA`` in the logs, it means NCCL uses Infiniband with GPU-Direct RDMA, which is efficient.
 .. warning::
-    After you start the Ray cluster, you'd better also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the `sanity check script <https://docs.vllm.ai/en/latest/getting_started/debugging.html>`_ for more information.
+    After you start the Ray cluster, you'd better also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the `sanity check script <https://docs.vllm.ai/en/latest/getting_started/debugging.html>`_ for more information. If you need to set some environment variables for the communication configuration, you can append them to the ``run_cluster.sh`` script, e.g. ``-e NCCL_SOCKET_IFNAME=eth0``. Note that setting environment variables in the shell (e.g. ``NCCL_SOCKET_IFNAME=eth0 vllm serve ...``) only works for the processes in the same node, not for the processes in the other nodes. Setting environment variables when you create the cluster is the recommended way. See the `discussion <https://github.com/vllm-project/vllm/issues/6803>`_ for more information.
 .. warning::

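Note on the ``NCCL_DEBUG=TRACE`` check described in this file: the log inspection can be automated with a plain substring scan. The sketch below uses only the two marker strings quoted in the docs. The function name ``classify_nccl_transport`` and the sample log lines are made up for illustration and are not real NCCL output.

```python
def classify_nccl_transport(log_text: str) -> str:
    """Classify the cross-node transport from NCCL_DEBUG=TRACE output.

    Any raw-socket send is reported first, because a single slow link
    is enough to hurt cross-node tensor parallel performance.
    """
    if "[send] via NET/Socket" in log_text:
        return "tcp-socket"
    if "[send] via NET/IB/GDRDMA" in log_text:
        return "infiniband-gdrdma"
    return "unknown"


# Placeholder log lines: only the markers come from the documentation.
socket_log = "<nccl trace prefix> [send] via NET/Socket\n"
ib_log = "<nccl trace prefix> [send] via NET/IB/GDRDMA\n"

print(classify_nccl_transport(socket_log))            # tcp-socket
print(classify_nccl_transport(ib_log))                # infiniband-gdrdma
print(classify_nccl_transport("no send lines here"))  # unknown
```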
--- a/docs/source/serving/run_on_sky.rst
+++ b/docs/source/serving/run_on_sky.rst
@@ -5,9 +5,9 @@ Deploying and scaling up with SkyPilot
 .. raw:: html
-    <p align="center">
+  <p align="center">
-        <img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/>
+    <img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/>
-    </p>
+  </p>
 vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with `SkyPilot <https://github.com/skypilot-org/skypilot>`__, an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3, Mixtral, etc, can be found in `SkyPilot AI gallery <https://skypilot.readthedocs.io/en/latest/gallery/index.html>`__.
@@ -21,8 +21,8 @@ Prerequisites
 .. code-block:: console
-    pip install skypilot-nightly
+  pip install skypilot-nightly
-    sky check
+  sky check
 Run on a single instance
@@ -32,64 +32,64 @@ See the vLLM SkyPilot YAML for serving, `serving.yaml <https://github.com/skypil
 .. code-block:: yaml
-    resources:
+  resources:
-        accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
+    accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
-        use_spot: True
+    use_spot: True
-        disk_size: 512  # Ensure model checkpoints can fit.
+    disk_size: 512  # Ensure model checkpoints can fit.
-        disk_tier: best
+    disk_tier: best
-        ports: 8081  # Expose to internet traffic.
+    ports: 8081  # Expose to internet traffic.
-    envs:
+  envs:
-        MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
+    MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
-        HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
+    HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
-    setup: |
+  setup: |
-        conda create -n vllm python=3.10 -y
+    conda create -n vllm python=3.10 -y
-        conda activate vllm
+    conda activate vllm
-        pip install vllm==0.4.0.post1
+    pip install vllm==0.4.0.post1
-        # Install Gradio for web UI.
+    # Install Gradio for web UI.
-        pip install gradio openai
+    pip install gradio openai
-        pip install flash-attn==2.5.7
+    pip install flash-attn==2.5.7
-    run: |
+  run: |
-        conda activate vllm
+    conda activate vllm
-        echo 'Starting vllm api server...'
+    echo 'Starting vllm api server...'
-        python -u -m vllm.entrypoints.openai.api_server \
+    python -u -m vllm.entrypoints.openai.api_server \
-            --port 8081 \
+      --port 8081 \
-            --model $MODEL_NAME \
+      --model $MODEL_NAME \
-            --trust-remote-code \
+      --trust-remote-code \
-            --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+      --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
-            2>&1 | tee api_server.log &
+      2>&1 | tee api_server.log &
-        echo 'Waiting for vllm api server to start...'
+    echo 'Waiting for vllm api server to start...'
-        while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
+    while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
-        echo 'Starting gradio server...'
+    echo 'Starting gradio server...'
-        git clone https://github.com/vllm-project/vllm.git || true
+    git clone https://github.com/vllm-project/vllm.git || true
-        python vllm/examples/gradio_openai_chatbot_webserver.py \
+    python vllm/examples/gradio_openai_chatbot_webserver.py \
-            -m $MODEL_NAME \
+      -m $MODEL_NAME \
-            --port 8811 \
+      --port 8811 \
-            --model-url http://localhost:8081/v1 \
+      --model-url http://localhost:8081/v1 \
-            --stop-token-ids 128009,128001
+      --stop-token-ids 128009,128001
 Start serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, ...):
 .. code-block:: console
-    HF_TOKEN="your-huggingface-token" sky launch serving.yaml --env HF_TOKEN
+  HF_TOKEN="your-huggingface-token" sky launch serving.yaml --env HF_TOKEN
 Check the output of the command. There will be a shareable gradio link (like the last line of the following). Open it in your browser to use the LLaMA model to do the text completion.
 .. code-block:: console
-    (task, pid=7431) Running on public URL: https://<gradio-hash>.gradio.live
+  (task, pid=7431) Running on public URL: https://<gradio-hash>.gradio.live
 **Optional**: Serve the 70B model instead of the default 8B and use more GPU:
 .. code-block:: console
-    HF_TOKEN="your-huggingface-token" sky launch serving.yaml --gpus A100:8 --env HF_TOKEN --env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct
+  HF_TOKEN="your-huggingface-token" sky launch serving.yaml --gpus A100:8 --env HF_TOKEN --env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct
 Scale up to multiple replicas
@@ -99,151 +99,212 @@ SkyPilot can scale up the service to multiple service replicas with built-in aut
 .. code-block:: yaml
-    service:
+  service:
-        replicas: 2
+    replicas: 2
-        # An actual request for readiness probe.
+    # An actual request for readiness probe.
-        readiness_probe:
+    readiness_probe:
-            path: /v1/chat/completions
+      path: /v1/chat/completions
-            post_data:
+      post_data:
-            model: $MODEL_NAME
+        model: $MODEL_NAME
-            messages:
+        messages:
-                - role: user
+          - role: user
-                content: Hello! What is your name?
+            content: Hello! What is your name?
-        max_tokens: 1
+        max_tokens: 1
 .. raw:: html
-    <details>
+  <details>
-    <summary>Click to see the full recipe YAML</summary>
+  <summary>Click to see the full recipe YAML</summary>
 .. code-block:: yaml
-    service:
+  service:
-        replicas: 2
+    replicas: 2
-        # An actual request for readiness probe.
+    # An actual request for readiness probe.
-        readiness_probe:
+    readiness_probe:
-            path: /v1/chat/completions
+      path: /v1/chat/completions
-            post_data:
+      post_data:
-            model: $MODEL_NAME
+        model: $MODEL_NAME
-            messages:
+        messages:
-                - role: user
+          - role: user
-                content: Hello! What is your name?
+            content: Hello! What is your name?
        max_tokens: 1
-    resources:
+  resources:
-        accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
+    accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
-        use_spot: True
+    use_spot: True
-        disk_size: 512  # Ensure model checkpoints can fit.
+    disk_size: 512  # Ensure model checkpoints can fit.
-        disk_tier: best
+    disk_tier: best
-        ports: 8081  # Expose to internet traffic.
+    ports: 8081  # Expose to internet traffic.
-    envs:
+  envs:
-        MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
+    MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
-        HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
+    HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
-    setup: |
+  setup: |
-        conda create -n vllm python=3.10 -y
+    conda create -n vllm python=3.10 -y
-        conda activate vllm
+    conda activate vllm
-        pip install vllm==0.4.0.post1
+    pip install vllm==0.4.0.post1
-        # Install Gradio for web UI.
+    # Install Gradio for web UI.
-        pip install gradio openai
+    pip install gradio openai
-        pip install flash-attn==2.5.7
+    pip install flash-attn==2.5.7
-    run: |
+  run: |
-        conda activate vllm
+    conda activate vllm
-        echo 'Starting vllm api server...'
+    echo 'Starting vllm api server...'
-        python -u -m vllm.entrypoints.openai.api_server \
+    python -u -m vllm.entrypoints.openai.api_server \
-            --port 8081 \
+      --port 8081 \
-            --model $MODEL_NAME \
+      --model $MODEL_NAME \
-            --trust-remote-code \
+      --trust-remote-code \
-            --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+      --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
-            2>&1 | tee api_server.log &
+      2>&1 | tee api_server.log
-        echo 'Waiting for vllm api server to start...'
-        while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
-        echo 'Starting gradio server...'
-        git clone https://github.com/vllm-project/vllm.git || true
-        python vllm/examples/gradio_openai_chatbot_webserver.py \
-            -m $MODEL_NAME \
-            --port 8811 \
-            --model-url http://localhost:8081/v1 \
-            --stop-token-ids 128009,128001
 .. raw:: html
-    </details>
+  </details>
 Start serving the Llama-3 8B model on multiple replicas:
 .. code-block:: console
-    HF_TOKEN="your-huggingface-token" sky serve up -n vllm serving.yaml --env HF_TOKEN
+  HF_TOKEN="your-huggingface-token" sky serve up -n vllm serving.yaml --env HF_TOKEN
 Wait until the service is ready:
 .. code-block:: console
-    watch -n10 sky serve status vllm
+  watch -n10 sky serve status vllm
 .. raw:: html
-    <details>
+  <details>
-    <summary>Example outputs:</summary>
+  <summary>Example outputs:</summary>
 .. code-block:: console
-    Services
+  Services
-    NAME  VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
+  NAME  VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
-    vllm  1        35s     READY   2/2       xx.yy.zz.100:30001
+  vllm  1        35s     READY   2/2       xx.yy.zz.100:30001
-    Service Replicas
+  Service Replicas
-    SERVICE_NAME  ID  VERSION  IP            LAUNCHED     RESOURCES          STATUS  REGION
+  SERVICE_NAME  ID  VERSION  IP            LAUNCHED     RESOURCES                STATUS  REGION
-    vllm          1   1        xx.yy.zz.121  18 mins ago  1x GCP({'L4': 1})  READY   us-east4
+  vllm          1   1        xx.yy.zz.121  18 mins ago  1x GCP([Spot]{'L4': 1})  READY   us-east4
-    vllm          2   1        xx.yy.zz.245  18 mins ago  1x GCP({'L4': 1})  READY   us-east4
+  vllm          2   1        xx.yy.zz.245  18 mins ago  1x GCP([Spot]{'L4': 1})  READY   us-east4
 .. raw:: html
-    </details>
+  </details>
 After the service is READY, you can find a single endpoint for the service and access the service with the endpoint:
 .. code-block:: console
-    ENDPOINT=$(sky serve status --endpoint 8081 vllm)
+  ENDPOINT=$(sky serve status --endpoint 8081 vllm)
-    curl -L http://$ENDPOINT/v1/chat/completions \
+  curl -L http://$ENDPOINT/v1/chat/completions \
-        -H "Content-Type: application/json" \
+    -H "Content-Type: application/json" \
-        -d '{
+    -d '{
-            "model": "meta-llama/Meta-Llama-3-8B-Instruct",
+      "model": "meta-llama/Meta-Llama-3-8B-Instruct",
-            "messages": [
+      "messages": [
-            {
+      {
-                "role": "system",
+        "role": "system",
-                "content": "You are a helpful assistant."
+        "content": "You are a helpful assistant."
-            },
+      },
-            {
+      {
-                "role": "user",
+        "role": "user",
-                "content": "Who are you?"
+        "content": "Who are you?"
-            }
+      }
-            ],
+      ],
-            "stop_token_ids": [128009,  128001]
+      "stop_token_ids": [128009,  128001]
-        }'
+    }'
-To enable autoscaling, you could specify additional configs in `services`:
+To enable autoscaling, you could replace the `replicas` with the following configs in `service`:
 .. code-block:: yaml
-    services:
+  service:
-        replica_policy:
+    replica_policy:
-            min_replicas: 0
+      min_replicas: 2
-            max_replicas: 3
+      max_replicas: 4
-        target_qps_per_replica: 2
+      target_qps_per_replica: 2
 This will scale the service up when the QPS exceeds 2 for each replica.
+.. raw:: html
+  <details>
+  <summary>Click to see the full recipe YAML</summary>
+.. code-block:: yaml
+  service:
+    replica_policy:
+      min_replicas: 2
+      max_replicas: 4
+      target_qps_per_replica: 2
+    # An actual request for readiness probe.
+    readiness_probe:
+      path: /v1/chat/completions
+      post_data:
+        model: $MODEL_NAME
+        messages:
+          - role: user
+            content: Hello! What is your name?
+        max_tokens: 1
+  resources:
+    accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
+    use_spot: True
+    disk_size: 512  # Ensure model checkpoints can fit.
+    disk_tier: best
+    ports: 8081  # Expose to internet traffic.
+  envs:
+    MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
+    HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
+  setup: |
+    conda create -n vllm python=3.10 -y
+    conda activate vllm
+    pip install vllm==0.4.0.post1
+    # Install Gradio for web UI.
+    pip install gradio openai
+    pip install flash-attn==2.5.7
+  run: |
+    conda activate vllm
+    echo 'Starting vllm api server...'
+    python -u -m vllm.entrypoints.openai.api_server \
+      --port 8081 \
+      --model $MODEL_NAME \
+      --trust-remote-code \
+      --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+      2>&1 | tee api_server.log
+.. raw:: html
+  </details>
+To update the service with the new config:
+.. code-block:: console
+  HF_TOKEN="your-huggingface-token" sky serve update vllm serving.yaml --env HF_TOKEN
+To stop the service:
+.. code-block:: console
+  sky serve down vllm
 **Optional**: Connect a GUI to the endpoint
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -253,58 +314,53 @@ It is also possible to access the Llama-3 service with a separate GUI frontend,
 .. raw:: html
-    <details>
+  <details>
-    <summary>Click to see the full GUI YAML</summary>
+  <summary>Click to see the full GUI YAML</summary>
 .. code-block:: yaml
-    envs:
+  envs:
-        MODEL_NAME: meta-llama/Meta-Llama-3-70B-Instruct
+    MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
-        ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm. 
+    ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm. 
-    resources:
+  resources:
-        cpus: 2
+    cpus: 2
-    setup: |
+  setup: |
-        conda activate vllm
+    conda create -n vllm python=3.10 -y
-        if [ $? -ne 0 ]; then
+    conda activate vllm
-            conda create -n vllm python=3.10 -y
-            conda activate vllm
+    # Install Gradio for web UI.
-        fi
+    pip install gradio openai
-        # Install Gradio for web UI.
+  run: |
-        pip install gradio openai
+    conda activate vllm
+    export PATH=$PATH:/sbin
-    run: |
-        conda activate vllm
+    echo 'Starting gradio server...'
-        export PATH=$PATH:/sbin
+    git clone https://github.com/vllm-project/vllm.git || true
-        WORKER_IP=$(hostname -I | cut -d' ' -f1)
+    python vllm/examples/gradio_openai_chatbot_webserver.py \
-        CONTROLLER_PORT=21001
+      -m $MODEL_NAME \
-        WORKER_PORT=21002
+      --port 8811 \
+      --model-url http://$ENDPOINT/v1 \
-        echo 'Starting gradio server...'
+      --stop-token-ids 128009,128001 | tee ~/gradio.log
-        git clone https://github.com/vllm-project/vllm.git || true
-        python vllm/examples/gradio_openai_chatbot_webserver.py \
-            -m $MODEL_NAME \
-            --port 8811 \
-            --model-url http://$ENDPOINT/v1 \
-            --stop-token-ids 128009,128001 | tee ~/gradio.log
 .. raw:: html
-    </details>
+  </details>
 1. Start the chat web UI:
 .. code-block:: console
-    sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm)
+  sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm)
 2. Then, we can access the GUI at the returned gradio link:
 .. code-block:: console
-    | INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
+  | INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
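Note on the autoscaling settings in this file (``min_replicas``, ``max_replicas``, ``target_qps_per_replica``): the arithmetic they imply can be sketched as a clamp. This is an assumption about the general idea only. SkyPilot's actual autoscaler may apply delays or smoothing that are not modeled here, and the function name ``target_replicas`` is hypothetical.

```python
import math


def target_replicas(qps: float, target_qps_per_replica: float,
                    min_replicas: int, max_replicas: int) -> int:
    """Clamp ceil(qps / target_qps_per_replica) into [min_replicas, max_replicas]."""
    if target_qps_per_replica <= 0:
        raise ValueError("target_qps_per_replica must be positive")
    wanted = math.ceil(qps / target_qps_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


# Values from the YAML in the diff: min 2, max 4, target 2 QPS per replica.
for qps in (1, 5, 7, 20):
    print(qps, target_replicas(qps, 2, 2, 4))
# 1 2
# 5 3
# 7 4
# 20 4
```

With these settings the service never drops below two replicas, and load above 8 QPS is absorbed by the four-replica cap rather than by further scaling.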
--- a/examples/api_client.py
+++ b/examples/api_client.py
@@ -31,7 +31,10 @@ def post_http_request(prompt: str,
        "max_tokens": 16,
        "stream": stream,
    }
-    response = requests.post(api_url, headers=headers, json=pload, stream=True)
+    response = requests.post(api_url,
+                             headers=headers,
+                             json=pload,
+                             stream=stream)
    return response
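Note on the ``stream=stream`` change: passing the caller's flag through means the body is only read incrementally when streaming was actually requested. The sketch below shows how a streamed body might then be consumed. It assumes the server emits NUL-delimited JSON chunks, each with a ``text`` list; that wire format is not shown in this diff, and the helper name ``iter_texts`` is hypothetical.

```python
import json
from typing import Iterable, Iterator, List


def iter_texts(chunks: Iterable[bytes]) -> Iterator[List[str]]:
    """Yield the "text" field of every non-empty JSON chunk."""
    for chunk in chunks:
        if chunk:  # splitting on the delimiter leaves an empty trailing piece
            yield json.loads(chunk.decode("utf-8"))["text"]


# Made-up body: two JSON objects, each terminated by a NUL byte (assumed format).
body = b'{"text": ["Hello"]}\x00{"text": ["Hello world"]}\x00'

print(list(iter_texts(body.split(b"\x00"))))
# [['Hello'], ['Hello world']]
```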

--- a/examples/fp8/quantizer/README.md
+++ b/examples/fp8/quantizer/README.md
@@ -16,7 +16,7 @@
 #### Run on H100 system for speed if FP8; number of GPUs depends on the model size
 #### Example: quantize Llama2-7b model from HF to FP8 with FP8 KV Cache:
-`python quantize.py --model_dir ./ll2-7b --dtype float16 --qformat fp8 --kv_cache_dtype fp8 --output_dir ./ll2_7b_fp8 --calib_size 512 --tp_size 1`
+`python quantize.py --model-dir ./ll2-7b --dtype float16 --qformat fp8 --kv-cache-dtype fp8 --output-dir ./ll2_7b_fp8 --calib-size 512 --tp-size 1`
 Outputs: model structure, quantized model & parameters (with scaling factors) are in JSON and Safetensors (npz is generated only for the reference)
 ```
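Note on the flag rename above: with Python's standard ``argparse``, hyphenated option names are still exposed as underscore attributes, so code that reads the parsed arguments does not need to change. The parser below is a hypothetical stand-in; the real ``quantize.py`` argument definitions are not part of this diff.

```python
import argparse

# Hypothetical stand-in for the script's parser, using the hyphenated flags.
parser = argparse.ArgumentParser(description="stand-in for quantize.py")
parser.add_argument("--model-dir", type=str, required=True)
parser.add_argument("--kv-cache-dtype", type=str, default=None)
parser.add_argument("--calib-size", type=int, default=512)
parser.add_argument("--tp-size", type=int, default=1)

args = parser.parse_args(
    ["--model-dir", "./ll2-7b", "--kv-cache-dtype", "fp8", "--tp-size", "1"])

# argparse converts "-" to "_" when deriving attribute names.
print(args.model_dir, args.kv_cache_dtype, args.calib_size, args.tp_size)
# ./ll2-7b fp8 512 1
```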

--- a/examples/fuyu_example.py
+++ b/examples/fuyu_example.py
-import requests
-from PIL import Image
-from vllm import LLM, SamplingParams
-def run_fuyu():
-    llm = LLM(model="adept/fuyu-8b", max_model_len=4096)
-    # single-image prompt
-    prompt = "What is the highest life expectancy at of male?\n"
-    url = "https://huggingface.co/adept/fuyu-8b/resolve/main/chart.png"
-    image = Image.open(requests.get(url, stream=True).raw)
-    sampling_params = SamplingParams(temperature=0, max_tokens=64)
-    outputs = llm.generate(
-        {
-            "prompt": prompt,
-            "multi_modal_data": {
-                "image": image
-            },
-        },
-        sampling_params=sampling_params)
-    for o in outputs:
-        generated_text = o.outputs[0].text
-        print(generated_text)
-if __name__ == "__main__":
-    run_fuyu()
--- a/examples/llava_example.py
+++ b/examples/llava_example.py
-from vllm import LLM
-from vllm.assets.image import ImageAsset
-def run_llava():
-    llm = LLM(model="llava-hf/llava-1.5-7b-hf")
-    prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
-    image = ImageAsset("stop_sign").pil_image
-    outputs = llm.generate({
-        "prompt": prompt,
-        "multi_modal_data": {
-            "image": image
-        },
-    })
-    for o in outputs:
-        generated_text = o.outputs[0].text
-        print(generated_text)
-if __name__ == "__main__":
-    run_llava()
--- a/examples/llava_next_example.py
+++ b/examples/llava_next_example.py
-from io import BytesIO
-import requests
-from PIL import Image
-from vllm import LLM, SamplingParams
-def run_llava_next():
-    llm = LLM(model="llava-hf/llava-v1.6-mistral-7b-hf", max_model_len=4096)
-    prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
-    url = "https://h2o-release.s3.amazonaws.com/h2ogpt/bigben.jpg"
-    image = Image.open(BytesIO(requests.get(url).content))
-    sampling_params = SamplingParams(temperature=0.8,
-                                     top_p=0.95,
-                                     max_tokens=100)
-    outputs = llm.generate(
-        {
-            "prompt": prompt,
-            "multi_modal_data": {
-                "image": image
-            }
-        },
-        sampling_params=sampling_params)
-    generated_text = ""
-    for o in outputs:
-        generated_text += o.outputs[0].text
-    print(f"LLM output:{generated_text}")
-if __name__ == "__main__":
-    run_llava_next()
--- a/examples/offline_inference_vision_language.py
+++ b/examples/offline_inference_vision_language.py
+"""
+This example shows how to use vLLM for running offline inference 
+with the correct prompt format on vision language models.
+For most models, the prompt format should follow corresponding examples
+on HuggingFace model repository.
+"""
+from transformers import AutoTokenizer
+from vllm import LLM, SamplingParams
+from vllm.assets.image import ImageAsset
+from vllm.utils import FlexibleArgumentParser
+# Input image and question
+image = ImageAsset("cherry_blossom").pil_image.convert("RGB")
+question = "What is the content of this image?"
+# LLaVA-1.5
+def run_llava(question):
+    prompt = f"USER: <image>\n{question}\nASSISTANT:"
+    llm = LLM(model="llava-hf/llava-1.5-7b-hf")
+    return llm, prompt
+# LLaVA-1.6/LLaVA-NeXT
+def run_llava_next(question):
+    prompt = f"[INST] <image>\n{question} [/INST]"
+    llm = LLM(model="llava-hf/llava-v1.6-mistral-7b-hf")
+    return llm, prompt
+# Fuyu
+def run_fuyu(question):
+    prompt = f"{question}\n"
+    llm = LLM(model="adept/fuyu-8b")
+    return llm, prompt
+# Phi-3-Vision
+def run_phi3v(question):
+    prompt = f"<|user|>\n<|image_1|>\n{question}<|end|>\n<|assistant|>\n"  # noqa: E501
+    # Note: The default setting of max_num_seqs (256) and
+    # max_model_len (128k) for this model may cause OOM.
+    # You may lower either to run this example on lower-end GPUs.
+    # In this example, we override max_num_seqs to 5 while
+    # keeping the original context length of 128k.
+    llm = LLM(
+        model="microsoft/Phi-3-vision-128k-instruct",
+        trust_remote_code=True,
+        max_num_seqs=5,
+    )
+    return llm, prompt
+# PaliGemma
+def run_paligemma(question):
+    # PaliGemma has special prompt format for VQA
+    prompt = "caption en"
+    llm = LLM(model="google/paligemma-3b-mix-224")
+    return llm, prompt
+# Chameleon
+def run_chameleon(question):
+    prompt = f"{question}<image>"
+    llm = LLM(model="facebook/chameleon-7b")
+    return llm, prompt
+# MiniCPM-V
+def run_minicpmv(question):
+    # 2.0
+    # The official repo doesn't work yet, so we need to use a fork for now
+    # For more details, see: https://github.com/vllm-project/vllm/pull/4087#issuecomment-2250397630 # noqa
+    # model_name = "HwwwH/MiniCPM-V-2"
+    # 2.5
+    model_name = "openbmb/MiniCPM-Llama3-V-2_5"
+    tokenizer = AutoTokenizer.from_pretrained(model_name,
+                                              trust_remote_code=True)
+    llm = LLM(
+        model=model_name,
+        trust_remote_code=True,
+    )
+    messages = [{
+        'role': 'user',
+        'content': f'(<image>./</image>)\n{question}'
+    }]
+    prompt = tokenizer.apply_chat_template(messages,
+                                           tokenize=False,
+                                           add_generation_prompt=True)
+    return llm, prompt
+# InternVL
+def run_internvl(question):
+    # Generally, InternVL can use chatml template for conversation
+    TEMPLATE = "<|im_start|>User\n{prompt}<|im_end|>\n<|im_start|>Assistant\n"
+    prompt = f"<image>\n{question}\n"
+    prompt = TEMPLATE.format(prompt=prompt)
+    llm = LLM(
+        model="OpenGVLab/InternVL2-4B",
+        trust_remote_code=True,
+        max_num_seqs=5,
+    )
+    return llm, prompt
+# BLIP-2
+def run_blip2(question):
+    # BLIP-2 prompt format is inaccurate on HuggingFace model repository.
+    # See https://huggingface.co/Salesforce/blip2-opt-2.7b/discussions/15#64ff02f3f8cf9e4f5b038262 #noqa
+    prompt = f"Question: {question} Answer:"
+    llm = LLM(model="Salesforce/blip2-opt-2.7b")
+    return llm, prompt
+model_example_map = {
+    "llava": run_llava,
+    "llava-next": run_llava_next,
+    "fuyu": run_fuyu,
+    "phi3_v": run_phi3v,
+    "paligemma": run_paligemma,
+    "chameleon": run_chameleon,
+    "minicpmv": run_minicpmv,
+    "blip-2": run_blip2,
+    "internvl_chat": run_internvl,
+}
+def main(args):
+    model = args.model_type
+    if model not in model_example_map:
+        raise ValueError(f"Model type {model} is not supported.")
+    llm, prompt = model_example_map[model](question)
+    # We set temperature to 0.2 so that outputs can be different
+    # even when all prompts are identical when running batch inference.
+    sampling_params = SamplingParams(temperature=0.2, max_tokens=64)
+    assert args.num_prompts > 0
+    if args.num_prompts == 1:
+        # Single inference
+        inputs = {
+            "prompt": prompt,
+            "multi_modal_data": {
+                "image": image
+            },
+        }
+    else:
+        # Batch inference
+        inputs = [{
+            "prompt": prompt,
+            "multi_modal_data": {
+                "image": image
+            },
+        } for _ in range(args.num_prompts)]
+    outputs = llm.generate(inputs, sampling_params=sampling_params)
+    for o in outputs:
+        generated_text = o.outputs[0].text
+        print(generated_text)
+if __name__ == "__main__":
+    parser = FlexibleArgumentParser(
+        description='Demo on using vLLM for offline inference with '
+        'vision language models')
+    parser.add_argument('--model-type',
+                        '-m',
+                        type=str,
+                        default="llava",
+                        choices=model_example_map.keys(),
+                        help='Huggingface "model_type".')
+    parser.add_argument('--num-prompts',
+                        type=int,
+                        default=1,
+                        help='Number of prompts to run.')
+    args = parser.parse_args()
+    main(args)
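The single-versus-batch branching in `main` can be isolated into a small helper, which makes the shape of the two input forms easier to see. This is a minimal sketch, not part of the commit: the name `build_inputs` is hypothetical, a plain `object()` stands in for the PIL image, and it raises `ValueError` where the example uses `assert`.

```python
from typing import Any, Dict, List, Union

PromptInput = Dict[str, Any]


def build_inputs(prompt: str, image: Any,
                 num_prompts: int) -> Union[PromptInput, List[PromptInput]]:
    """Return one prompt dict, or a list of identical dicts for a batch."""
    if num_prompts < 1:
        raise ValueError("num_prompts must be positive")

    def one() -> PromptInput:
        # Same structure as the dicts built in main() of the example.
        return {"prompt": prompt, "multi_modal_data": {"image": image}}

    return one() if num_prompts == 1 else [one() for _ in range(num_prompts)]


placeholder_image = object()  # assumption: stands in for a PIL image
batch = build_inputs("Question: What is this? Answer:", placeholder_image, 3)
print(len(batch))
```

In the example script, the returned value would be passed to `llm.generate` exactly as `inputs` is today.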
--- a/examples/openai_embedding_client.py
+++ b/examples/openai_embedding_client.py
@@ -13,11 +13,14 @@ client = OpenAI(
 models = client.models.list()
 model = models.data[0].id
-responses = client.embeddings.create(input=[
+responses = client.embeddings.create(
-    "Hello my name is",
+    input=[
-    "The best thing about vLLM is that it supports many different models"
+        "Hello my name is",
-],
+        "The best thing about vLLM is that it supports many different models"
-                                     model=model)
+    ],
+    model=model,
+    encoding_format="float",
+)
 for data in responses.data:
     print(data.embedding)  # list of floats of length 4096
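The `encoding_format="float"` argument added in this hunk asks the server for plain lists of floats. As a rough sketch of the alternative, assuming the server follows the OpenAI convention in which `encoding_format="base64"` returns each embedding as base64-encoded little-endian float32 values, a client could decode such a payload with the standard library alone. The helper name `decode_base64_embedding` is hypothetical and is not part of vLLM or the `openai` package.

```python
import base64
import struct
from typing import List


def decode_base64_embedding(payload: str) -> List[float]:
    """Decode a base64 string of packed little-endian float32 values."""
    raw = base64.b64decode(payload)
    if len(raw) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}f", raw))


# Round-trip check with made-up values that float32 represents exactly.
sample = base64.b64encode(struct.pack("<3f", 1.0, -0.5, 0.25)).decode("ascii")
print(decode_base64_embedding(sample))
```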
--- a/examples/openai_vision_api_client.py
+++ b/examples/openai_vision_api_client.py
@@ -42,6 +42,7 @@ chat_completion_from_url = client.chat.completions.create(
        ],
    }],
    model=model,
+    max_tokens=64,
 )
 result = chat_completion_from_url.choices[0].message.content
@@ -78,6 +79,7 @@ chat_completion_from_base64 = client.chat.completions.create(
        ],
    }],
    model=model,
+    max_tokens=64,
 )
 result = chat_completion_from_base64.choices[0].message.content

--- a/examples/paligemma_example.py
+++ b/examples/paligemma_example.py
-from vllm import LLM
-from vllm.assets.image import ImageAsset
-def run_paligemma():
-    llm = LLM(model="google/paligemma-3b-mix-224")
-    prompt = "caption es"
-    image = ImageAsset("stop_sign").pil_image
-    outputs = llm.generate({
-        "prompt": prompt,
-        "multi_modal_data": {
-            "image": image
-        },
-    })
-    for o in outputs:
-        generated_text = o.outputs[0].text
-        print(generated_text)
-if __name__ == "__main__":
-    run_paligemma()
--- a/examples/phi3v_example.py
+++ b/examples/phi3v_example.py
-from vllm import LLM, SamplingParams
-from vllm.assets.image import ImageAsset
-def run_phi3v():
-    model_path = "microsoft/Phi-3-vision-128k-instruct"
-    # Note: The default setting of max_num_seqs (256) and
-    # max_model_len (128k) for this model may cause OOM.
-    # You may lower either to run this example on lower-end GPUs.
-    # In this example, we override max_num_seqs to 5 while
-    # keeping the original context length of 128k.
-    llm = LLM(
-        model=model_path,
-        trust_remote_code=True,
-        max_num_seqs=5,
-    )
-    image = ImageAsset("cherry_blossom").pil_image
-    # single-image prompt
-    prompt = "<|user|>\n<|image_1|>\nWhat is the season?<|end|>\n<|assistant|>\n"  # noqa: E501
-    sampling_params = SamplingParams(temperature=0, max_tokens=64)
-    outputs = llm.generate(
-        {
-            "prompt": prompt,
-            "multi_modal_data": {
-                "image": image
-            },
-        },
-        sampling_params=sampling_params)
-    for o in outputs:
-        generated_text = o.outputs[0].text
-        print(generated_text)
-if __name__ == "__main__":
-    run_phi3v()
--- a/examples/template_blip2.jinja
+++ b/examples/template_blip2.jinja
+{%- for message in messages -%}
+    {%- if message['role'] == 'user' -%}
+        {{- 'Question: ' + message['content'] + ' ' -}}
+    {%- elif message['role'] == 'assistant' -%}
+        {{- 'Answer: ' + message['content'] + ' ' -}}
+    {%- endif -%}
+{%- endfor -%}
+{%- if add_generation_prompt -%}
+    {{- 'Answer:' -}}
+{% endif %}
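To see what prompt this template is meant to produce, the same logic can be written in plain Python. This is only an approximation for illustration: the function name `render_blip2_prompt` is hypothetical, and the sketch does not reproduce Jinja's whitespace handling, it simply concatenates the same pieces in the same order.

```python
from typing import Dict, List


def render_blip2_prompt(messages: List[Dict[str, str]],
                        add_generation_prompt: bool = True) -> str:
    """Concatenate chat messages in the BLIP-2 'Question: ... Answer:' style."""
    parts = []
    for message in messages:
        if message["role"] == "user":
            parts.append("Question: " + message["content"] + " ")
        elif message["role"] == "assistant":
            parts.append("Answer: " + message["content"] + " ")
        # Other roles (e.g. "system") are skipped, as in the template.
    if add_generation_prompt:
        parts.append("Answer:")
    return "".join(parts)


conversation = [{"role": "user", "content": "What is in the image?"}]
print(render_blip2_prompt(conversation))
```

For a single user turn this yields the same `Question: ... Answer:` shape that `run_blip2` builds by hand in the offline inference example.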
--- a/format.sh
+++ b/format.sh
@@ -96,23 +96,20 @@ echo 'vLLM yapf: Done'
 # Run mypy
 echo 'vLLM mypy:'
-mypy tests --config-file pyproject.toml
+mypy --follow-imports skip  # Note that this is less strict than CI
-mypy vllm/*.py --config-file pyproject.toml
+mypy tests --follow-imports skip
-mypy vllm/attention --config-file pyproject.toml
+mypy vllm/attention --follow-imports skip
-mypy vllm/core --config-file pyproject.toml
+mypy vllm/core --follow-imports skip
-mypy vllm/distributed --config-file pyproject.toml
+mypy vllm/distributed --follow-imports skip
-mypy vllm/engine  --config-file pyproject.toml
+mypy vllm/engine  --follow-imports skip
-mypy vllm/entrypoints --config-file pyproject.toml
+mypy vllm/entrypoints --follow-imports skip
-mypy vllm/executor --config-file pyproject.toml
+mypy vllm/executor --follow-imports skip
-mypy vllm/logging --config-file pyproject.toml
+mypy vllm/lora --follow-imports skip
-mypy vllm/lora --config-file pyproject.toml
+mypy vllm/model_executor  --follow-imports skip
-mypy vllm/model_executor  --config-file pyproject.toml
+mypy vllm/prompt_adapter --follow-imports skip
-mypy vllm/multimodal --config-file pyproject.toml
+mypy vllm/spec_decode --follow-imports skip
-mypy vllm/prompt_adapter --config-file pyproject.toml
+mypy vllm/worker --follow-imports skip
-mypy vllm/spec_decode --config-file pyproject.toml
+echo 'vLLM mypy: Done'
-mypy vllm/transformers_utils --config-file pyproject.toml
-mypy vllm/usage --config-file pyproject.toml
-mypy vllm/worker --config-file pyproject.toml
 # If git diff returns a file that is in the skip list, the file may be checked anyway:
@@ -131,7 +128,7 @@ spell_check_all(){
  codespell --toml pyproject.toml "${CODESPELL_EXCLUDES[@]}"
 }
-# Spelling  check of files that differ from main branch.
+# Spelling check of files that differ from main branch.
 spell_check_changed() {
    # The `if` guard ensures that the list of filenames is not empty, which
    # could cause ruff to receive 0 positional arguments, making it hang
@@ -245,12 +242,6 @@ echo 'vLLM isort: Done'
 # NOTE: Keep up to date with .github/workflows/clang-format.yml
 CLANG_FORMAT_EXCLUDES=(
    'csrc/moe/topk_softmax_kernels.cu'
-    'csrc/punica/bgmv/bgmv_bf16_bf16_bf16.cu'
-    'csrc/punica/bgmv/bgmv_config.h'
-    'csrc/punica/bgmv/bgmv_impl.cuh'
-    'csrc/punica/bgmv/vec_dtypes.cuh'
-    'csrc/punica/punica_ops.cu'
-    'csrc/punica/type_convert.h'
 )
 # Format specified files with clang-format

--- a/pyproject.toml
+++ b/pyproject.toml
@@ -5,7 +5,7 @@ requires = [
    "ninja",
    "packaging",
    "setuptools >= 49.4.0",
-    "torch == 2.3.1",
+    "torch == 2.4.0",
    "wheel",
 ]
 build-backend = "setuptools.build_meta"
@@ -48,9 +48,22 @@ python_version = "3.8"
 ignore_missing_imports = true
 check_untyped_defs = true
-follow_imports = "skip"
+follow_imports = "silent"
-files = "vllm"
+# After fixing type errors resulting from follow_imports: "skip" -> "silent",
+# move the directory here and remove it from format.sh and mypy.yaml
+files = [
+    "vllm/*.py",
+    "vllm/adapter_commons",
+    "vllm/assets",
+    "vllm/inputs",
+    "vllm/logging",
+    "vllm/multimodal",
+    "vllm/platforms",
+    "vllm/transformers_utils",
+    "vllm/triton_utils",
+    "vllm/usage",
+]
 # TODO(woosuk): Include the code from Megatron and HuggingFace.
 exclude = [
    "vllm/model_executor/parallel_utils/|vllm/model_executor/models/",

--- a/requirements-adag.txt
+++ b/requirements-adag.txt
+# Dependencies for Ray accelerated DAG
+cupy-cuda12x
+ray >= 2.32
\ No newline at end of file
--- a/requirements-build.txt
+++ b/requirements-build.txt
@@ -3,5 +3,5 @@ cmake>=3.21
 ninja
 packaging
 setuptools>=49.4.0
-torch==2.3.1
+torch==2.4.0
 wheel
--- a/requirements-common.txt
+++ b/requirements-common.txt
@@ -6,7 +6,7 @@ numpy < 2.0.0
 requests
 tqdm
 py-cpuinfo
-transformers >= 4.42.4  # Required for Gemma 2 and for additional chat template parameters.
+transformers >= 4.43.2  # Required for Chameleon and Llama 3.1 hotfix.
 tokenizers >= 0.19.1  # Required for Llama 3.
 fastapi
 aiohttp