Merge branch 'v0.5.4-dtk24.04.1'

Commit e7c1b7f3 authored Sep 06, 2024 by zhuwenwen
20 changed files
--- a/docs/source/quantization/bnb.rst
+++ b/docs/source/quantization/bnb.rst
+.. _bits_and_bytes:
+BitsAndBytes
+==================
+vLLM now supports `BitsAndBytes <https://github.com/TimDettmers/bitsandbytes>`_ for more efficient model inference.
+BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy.
+Compared to other quantization methods, BitsAndBytes eliminates the need for calibrating the quantized model with input data.
+Below are the steps to utilize BitsAndBytes with vLLM.
+.. code-block:: console
+    $ pip install "bitsandbytes>=0.42.0"
+vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoints.
+You can find bitsandbytes quantized models on https://huggingface.co/models?other=bitsandbytes.
+These repositories usually have a config.json file that includes a quantization_config section.
+Read quantized checkpoint.
+--------------------------
+.. code-block:: python
+    from vllm import LLM
+    import torch
+    # unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
+    model_id = "unsloth/tinyllama-bnb-4bit"
+    llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, \
+    quantization="bitsandbytes", load_format="bitsandbytes")
+Inflight quantization: load as 4bit quantization
+------------------------------------------------
+.. code-block:: python
+    from vllm import LLM
+    import torch
+    model_id = "huggyllama/llama-7b"
+    llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, \
+    quantization="bitsandbytes", load_format="bitsandbytes")
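Whether a checkpoint counts as pre-quantized comes down to whether its config.json carries a quantization_config section. The following is a minimal sketch of that check, assuming the Hugging Face convention of a ``quant_method`` field inside the section; the helper name ``is_prequantized_bnb`` is hypothetical and is not part of vLLM.

```python
import json


def is_prequantized_bnb(config_text: str) -> bool:
    """Return True if a config.json payload declares bitsandbytes quantization.

    Assumption: pre-quantized repositories store a ``quantization_config``
    section whose ``quant_method`` is "bitsandbytes".
    """
    config = json.loads(config_text)
    quant_config = config.get("quantization_config") or {}
    return quant_config.get("quant_method") == "bitsandbytes"


# A pre-quantized checkpoint ships the section; a plain checkpoint does not.
prequantized = '{"quantization_config": {"quant_method": "bitsandbytes", "load_in_4bit": true}}'
plain = '{"model_type": "llama"}'
print(is_prequantized_bnb(prequantized), is_prequantized_bnb(plain))
```

A plain checkpoint such as the second payload would be the in-flight quantization case, where the weights are quantized while loading.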
--- a/docs/source/quantization/fp8.rst
+++ b/docs/source/quantization/fp8.rst
@@ -3,7 +3,10 @@
 FP8
 ==================
-vLLM supports FP8 (8-bit floating point) computation using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x. Currently, only Hopper and Ada Lovelace GPUs are supported. Quantization of models with FP8 allows for a 2x reduction in model memory requirements and up to a 1.6x improvement in throughput with minimal impact on accuracy.
+vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x. 
+Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8. 
+Ampere GPUs are supported for W8A16 (weight-only FP8) utilizing Marlin kernels.
+Quantization of models with FP8 allows for a 2x reduction in model memory requirements and up to a 1.6x improvement in throughput with minimal impact on accuracy.
 Please visit the HF collection of `quantized FP8 checkpoints of popular LLMs ready to use with vLLM <https://huggingface.co/collections/neuralmagic/fp8-llms-for-vllm-666742ed2b78b7ac8df13127>`_.

--- a/docs/source/quantization/supported_hardware.rst
+++ b/docs/source/quantization/supported_hardware.rst
+.. _supported_hardware_for_quantization:
+Supported Hardware for Quantization Kernels
+===========================================
+The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
+==============  ======  =======  =======  =====  ======  =======  =========  =======  ==============  ==========
+Implementation  Volta   Turing   Ampere   Ada    Hopper  AMD GPU  Intel GPU  x86 CPU  AWS Inferentia  Google TPU
+==============  ======  =======  =======  =====  ======  =======  =========  =======  ==============  ==========
+AQLM            ✅      ✅       ✅       ✅     ✅      ❌        ❌         ❌       ❌              ❌
+AWQ             ❌      ✅       ✅       ✅     ✅      ❌        ❌         ❌       ❌              ❌
+DeepSpeedFP     ✅      ✅       ✅       ✅     ✅      ❌        ❌         ❌       ❌              ❌
+FP8             ❌      ❌       ✅       ✅     ✅      ❌        ❌         ❌       ❌              ❌
+Marlin          ❌      ❌       ✅       ✅     ✅      ❌        ❌         ❌       ❌              ❌
+GPTQ            ✅      ✅       ✅       ✅     ✅      ❌        ❌         ❌       ❌              ❌
+SqueezeLLM      ✅      ✅       ✅       ✅     ✅      ❌        ❌         ❌       ❌              ❌
+bitsandbytes    ✅      ✅       ✅       ✅     ✅      ❌        ❌         ❌       ❌              ❌
+==============  ======  =======  =======  =====  ======  =======  =========  =======  ==============  ==========
+Notes:
+^^^^^^
+- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
+- "✅" indicates that the quantization method is supported on the specified hardware.
+- "❌" indicates that the quantization method is not supported on the specified hardware.
+Please note that this compatibility chart may be subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
+For the most up-to-date information on hardware support and quantization methods, please check the `quantization directory <https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/layers/quantization>`_ or consult with the vLLM development team.
\ No newline at end of file
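The NVIDIA columns of the table can be read as a lookup keyed by compute capability. The sketch below encodes only the rows and the SM-to-architecture mapping given in the notes; the names ``SM_TO_ARCH``, ``SUPPORT``, and ``is_supported`` are illustrative and do not exist in vLLM.

```python
# Architecture names keyed by (major, minor) compute capability, per the notes.
SM_TO_ARCH = {
    (7, 0): "Volta",
    (7, 5): "Turing",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 9): "Ada",
    (9, 0): "Hopper",
}

ALL_NVIDIA = {"Volta", "Turing", "Ampere", "Ada", "Hopper"}

# NVIDIA columns of the table; every non-NVIDIA column is unsupported.
SUPPORT = {
    "AQLM": ALL_NVIDIA,
    "AWQ": ALL_NVIDIA - {"Volta"},
    "DeepSpeedFP": ALL_NVIDIA,
    "FP8": {"Ampere", "Ada", "Hopper"},
    "Marlin": {"Ampere", "Ada", "Hopper"},
    "GPTQ": ALL_NVIDIA,
    "SqueezeLLM": ALL_NVIDIA,
    "bitsandbytes": ALL_NVIDIA,
}


def is_supported(method: str, capability: tuple[int, int]) -> bool:
    """Return True if the table marks `method` as supported on this GPU."""
    arch = SM_TO_ARCH.get(capability)
    return arch is not None and arch in SUPPORT.get(method, set())


print(is_supported("AWQ", (7, 0)))  # Volta is not supported for AWQ
print(is_supported("FP8", (8, 9)))  # Ada is supported for FP8
```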
--- a/docs/source/serving/deploying_with_cerebrium.rst
+++ b/docs/source/serving/deploying_with_cerebrium.rst
+.. _deploying_with_cerebrium:
+Deploying with Cerebrium
+============================
+.. raw:: html
+    <p align="center">
+        <img src="https://i.ibb.co/hHcScTT/Screenshot-2024-06-13-at-10-14-54.png" alt="vLLM_plus_cerebrium"/>
+    </p>
+vLLM can be run on a cloud-based GPU machine with `Cerebrium <https://www.cerebrium.ai/>`__, a serverless AI infrastructure platform that makes it easier for companies to build and deploy AI-based applications.
+To install the Cerebrium client, run:
+.. code-block:: console
+    $ pip install cerebrium
+    $ cerebrium login
+Next, to create your Cerebrium project, run:
+.. code-block:: console
+    $ cerebrium init vllm-project
+Next, to install the required packages, add the following to your cerebrium.toml:
+.. code-block:: toml
+    [cerebrium.deployment]
+    docker_base_image_url = "nvidia/cuda:12.1.1-runtime-ubuntu22.04"
+    [cerebrium.dependencies.pip]
+    vllm = "latest"
+Next, add the code that handles inference for the LLM of your choice (`mistralai/Mistral-7B-Instruct-v0.1` for this example) to your `main.py`:
+.. code-block:: python
+    from vllm import LLM, SamplingParams
+    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")
+    def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
+        sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
+        outputs = llm.generate(prompts, sampling_params)
+        # Collect the prompt and generated text of each output.
+        results = []
+        for output in outputs:
+            prompt = output.prompt
+            generated_text = output.outputs[0].text
+            results.append({"prompt": prompt, "generated_text": generated_text})
+        return {"results": results}
+Then, run the following command to deploy it to the cloud:
+.. code-block:: console
+    $ cerebrium deploy
+If successful, you should be given a curl command that you can use to run inference. Remember to end the URL with the name of the function you are calling (in our case, /run):
+.. code-block:: console
+    curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
+     -H 'Content-Type: application/json' \
+     -H 'Authorization: <JWT TOKEN>' \
+     --data '{
+       "prompts": [
+         "Hello, my name is",
+         "The president of the United States is",
+         "The capital of France is",
+         "The future of AI is"
+       ]
+     }'
+You should get a response like:
+.. code-block:: json
+    {
+        "run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
+        "result": {
+            "result": [
+                {
+                    "prompt": "Hello, my name is",
+                    "generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of"
+                },
+                {
+                    "prompt": "The president of the United States is",
+                    "generated_text": " elected every four years. This is a democratic system.\n\n5. What"
+                },
+                {
+                    "prompt": "The capital of France is",
+                    "generated_text": " Paris.\n"
+                },
+                {
+                    "prompt": "The future of AI is",
+                    "generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective."
+                }
+            ]
+        },
+        "run_time_ms": 152.53663063049316
+    }
+You now have an autoscaling endpoint where you only pay for the compute you use!
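The same call can be made from Python. This is a sketch using only the standard library; the endpoint and token are the placeholders from the curl example and must be replaced, and the helper name ``build_request`` is an assumption for illustration.

```python
import json
import urllib.request

# Placeholder endpoint and token copied from the curl example; replace both.
ENDPOINT = "https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run"
TOKEN = "<JWT TOKEN>"


def build_request(prompts, endpoint=ENDPOINT, token=TOKEN):
    """Build the POST request that mirrors the curl command."""
    body = json.dumps({"prompts": prompts}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json", "Authorization": token},
        method="POST",
    )


request = build_request(["Hello, my name is", "The capital of France is"])
# Sending is left commented out because the URL above is a placeholder:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Only ``prompts`` is sent here, so ``temperature`` and ``top_p`` fall back to the defaults declared in ``run``.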
--- a/docs/source/serving/deploying_with_docker.rst
+++ b/docs/source/serving/deploying_with_docker.rst
@@ -3,9 +3,8 @@
 Deploying with Docker
 ============================
-vLLM offers official docker image for deployment.
+vLLM offers an official Docker image for deployment.
-The image can be used to run OpenAI compatible server.
+The image can be used to run an OpenAI-compatible server and is available on Docker Hub as `vllm/vllm-openai <https://hub.docker.com/r/vllm/vllm-openai/tags>`_.
-The image is available on Docker Hub as `vllm/vllm-openai <https://hub.docker.com/r/vllm/vllm-openai/tags>`_.
 .. code-block:: console
@@ -25,7 +24,7 @@ The image is available on Docker Hub as `vllm/vllm-openai <https://hub.docker.co
        memory to share data between processes under the hood, particularly for tensor parallel inference.
-You can build and run vLLM from source via the provided dockerfile. To build vLLM:
+You can build and run vLLM from source via the provided `Dockerfile <https://github.com/vllm-project/vllm/blob/main/Dockerfile>`_. To build vLLM:
 .. code-block:: console

--- a/docs/source/serving/deploying_with_dstack.rst
+++ b/docs/source/serving/deploying_with_dstack.rst
@@ -40,7 +40,7 @@ Next, to provision a VM instance with LLM of your choice(`NousResearch/Llama-2-7
        gpu: 24GB
    commands:
        - pip install vllm
-        - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
+        - vllm serve $MODEL --port 8000
    model:
        format: openai
        type: chat

--- a/docs/source/serving/distributed_serving.rst
+++ b/docs/source/serving/distributed_serving.rst
@@ -3,7 +3,26 @@
 Distributed Inference and Serving
 =================================
-vLLM supports distributed tensor-parallel inference and serving. Currently, we support `Megatron-LM's tensor parallel algorithm <https://arxiv.org/pdf/1909.08053.pdf>`_. We manage the distributed runtime with either `Ray <https://github.com/ray-project/ray>`_ or python native multiprocessing. Multiprocessing can be used when deploying on a single node, multi-node inferencing currently requires Ray.
+How to decide the distributed inference strategy?
+-------------------------------------------------
+Before going into the details of distributed inference and serving, let's first make clear when to use distributed inference and which strategies are available. The common practice is:
+- **Single GPU (no distributed inference)**: If your model fits in a single GPU, you probably don't need to use distributed inference. Just use the single GPU to run the inference.
+- **Single-Node Multi-GPU (tensor parallel inference)**: If your model is too large to fit in a single GPU, but it can fit in a single node with multiple GPUs, you can use tensor parallelism. The tensor parallel size is the number of GPUs you want to use. For example, if you have 4 GPUs in a single node, you can set the tensor parallel size to 4.
+- **Multi-Node Multi-GPU (tensor parallel plus pipeline parallel inference)**: If your model is too large to fit in a single node, you can use tensor parallelism together with pipeline parallelism. The tensor parallel size is the number of GPUs you want to use in each node, and the pipeline parallel size is the number of nodes you want to use. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2.
+In short, you should increase the number of GPUs and the number of nodes until you have enough GPU memory to hold the model. The tensor parallel size should be the number of GPUs in each node, and the pipeline parallel size should be the number of nodes.
+After adding enough GPUs and nodes to hold the model, you can run vLLM first, which will print some logs like ``# GPU blocks: 790``. Multiply the number by ``16`` (the block size), and you get roughly the maximum number of tokens that can be served with the current configuration. If this number is not sufficient, e.g. you want higher throughput, you can further increase the number of GPUs or nodes until the number of blocks is enough.
+.. note::
+    There is one edge case: if the model fits in a single node with multiple GPUs, but the number of GPUs cannot divide the model size evenly, you can use pipeline parallelism, which splits the model along layers and supports uneven splits. In this case, the tensor parallel size should be 1 and the pipeline parallel size should be the number of GPUs.
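The sizing rule and the block arithmetic above can be written down as a small worked example. The function names below are illustrative only; the block size of 16 and the sample value of 790 blocks come from the text.

```python
BLOCK_SIZE = 16  # tokens per GPU block, as stated above


def parallel_sizes(gpus_per_node: int, num_nodes: int) -> tuple[int, int]:
    """Return (tensor_parallel_size, pipeline_parallel_size) per the common practice."""
    return gpus_per_node, num_nodes


def approx_max_tokens(num_gpu_blocks: int, block_size: int = BLOCK_SIZE) -> int:
    """Rough upper bound on the tokens that fit in the reported GPU blocks."""
    return num_gpu_blocks * block_size


print(parallel_sizes(8, 2))    # (8, 2)
print(approx_max_tokens(790))  # 12640
```

The edge case in the note (uneven splits on one node) is the exception to this rule: there the tensor parallel size is 1 and the pipeline parallel size is the GPU count.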
+Details for Distributed Inference and Serving
+----------------------------------------------
+vLLM supports distributed tensor-parallel inference and serving. Currently, we support `Megatron-LM's tensor parallel algorithm <https://arxiv.org/pdf/1909.08053.pdf>`_. We also support pipeline parallelism as a beta feature for online serving. We manage the distributed runtime with either `Ray <https://github.com/ray-project/ray>`_ or Python-native multiprocessing. Multiprocessing can be used when deploying on a single node; multi-node inference currently requires Ray.
 Multiprocessing will be used by default when not running in a Ray placement group and if there are sufficient GPUs available on the same node for the configured :code:`tensor_parallel_size`, otherwise Ray will be used. This default can be overridden via the :code:`LLM` class :code:`distributed-executor-backend` argument or :code:`--distributed-executor-backend` API server argument. Set it to :code:`mp` for multiprocessing or :code:`ray` for Ray. It's not required for Ray to be installed for the multiprocessing case.
@@ -19,20 +38,73 @@ To run multi-GPU serving, pass in the :code:`--tensor-parallel-size` argument wh
 .. code-block:: console
-    $ python -m vllm.entrypoints.api_server \
+    $ vllm serve facebook/opt-13b \
-    $     --model facebook/opt-13b \
    $     --tensor-parallel-size 4
-To scale vLLM beyond a single machine, install and start a `Ray runtime <https://docs.ray.io/en/latest/ray-core/starting-ray.html>`_ via CLI before running vLLM:
+You can additionally specify :code:`--pipeline-parallel-size` to enable pipeline parallelism. For example, to run the API server on 8 GPUs with pipeline parallelism and tensor parallelism:
+.. code-block:: console
+    $ vllm serve gpt2 \
+    $     --tensor-parallel-size 4 \
+    $     --pipeline-parallel-size 2
+.. note::
+    Pipeline parallelism is a beta feature. It is only supported for online serving, and only for LLaMa, GPT2, Mixtral, Qwen, Qwen2, and Nemotron style models.
+Multi-Node Inference and Serving
+--------------------------------
+If a single node does not have enough GPUs to hold the model, you can run the model using multiple nodes. It is important to make sure the execution environment is the same on all nodes, including the model path and the Python environment. The recommended way is to use Docker images to ensure the same environment, and to hide the heterogeneity of the host machines by mapping them into the same Docker configuration.
+The first step is to start containers and organize them into a cluster. We have provided a helper `script <https://github.com/vllm-project/vllm/tree/main/examples/run_cluster.sh>`_ to start the cluster.
+Pick a node as the head node, and run the following command:
 .. code-block:: console
-    $ pip install ray
+    $ bash run_cluster.sh \
+    $                   vllm/vllm-openai \
+    $                   ip_of_head_node \
+    $                   --head \
+    $                   /path/to/the/huggingface/home/in/this/node
+On the rest of the worker nodes, run the following command:
+.. code-block:: console
+    $ bash run_cluster.sh \
+    $                   vllm/vllm-openai \
+    $                   ip_of_head_node \
+    $                   --worker \
+    $                   /path/to/the/huggingface/home/in/this/node
+Then you get a Ray cluster of containers. Note that you need to keep the shells running these commands alive to hold the cluster. Any shell disconnect will terminate the cluster. In addition, please note that the argument ``ip_of_head_node`` should be the IP address of the head node, which must be accessible by all the worker nodes. A common mistake is to use the IP address of the worker node, which is not correct.
+Then, on any node, use ``docker exec -it node /bin/bash`` to enter the container and execute ``ray status`` to check the status of the Ray cluster. You should see the right number of nodes and GPUs.
+After that, on any node, you can use vLLM as usual, just as if you had all the GPUs on one node. The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2:
+.. code-block:: console
+    $ vllm serve /path/to/the/model/in/the/container \
+    $     --tensor-parallel-size 8 \
+    $     --pipeline-parallel-size 2
+You can also use tensor parallelism without pipeline parallelism: just set the tensor parallel size to the number of GPUs in the cluster. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 16:
+.. code-block:: console
+    $ vllm serve /path/to/the/model/in/the/container \
+    $     --tensor-parallel-size 16
+To make tensor parallelism performant, you should make sure the communication between nodes is efficient, e.g. by using high-speed network cards like InfiniBand. To correctly set up the cluster to use InfiniBand, append additional arguments like ``--privileged -e NCCL_IB_HCA=mlx5`` to the ``run_cluster.sh`` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm that InfiniBand is working is to run vLLM with the ``NCCL_DEBUG=TRACE`` environment variable set, e.g. ``NCCL_DEBUG=TRACE vllm serve ...``, and check the logs for the NCCL version and the network used. If you find ``[send] via NET/Socket`` in the logs, NCCL uses raw TCP sockets, which is not efficient for cross-node tensor parallelism. If you find ``[send] via NET/IB/GDRDMA`` in the logs, NCCL uses InfiniBand with GPU-Direct RDMA, which is efficient.
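The log check described above can be automated with a simple string scan. This sketch recognizes only the two markers quoted in the paragraph; the function name and the returned labels are hypothetical, and real NCCL output contains more text around these markers.

```python
def classify_nccl_transport(log_text: str) -> str:
    """Classify the cross-node transport seen in NCCL_DEBUG=TRACE output.

    Only the two markers quoted in the documentation are recognized.
    """
    if "via NET/IB/GDRDMA" in log_text:
        return "infiniband-gdrdma"  # efficient for cross-node tensor parallelism
    if "via NET/Socket" in log_text:
        return "tcp-socket"  # raw TCP, inefficient across nodes
    return "unknown"


print(classify_nccl_transport("[send] via NET/Socket"))
print(classify_nccl_transport("[send] via NET/IB/GDRDMA"))
```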
+.. warning::
+    After you start the Ray cluster, you should also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the `sanity check script <https://docs.vllm.ai/en/latest/getting_started/debugging.html>`_ for more information. If you need to set some environment variables for the communication configuration, you can append them to the ``run_cluster.sh`` script, e.g. ``-e NCCL_SOCKET_IFNAME=eth0``. Note that setting environment variables in the shell (e.g. ``NCCL_SOCKET_IFNAME=eth0 vllm serve ...``) only works for the processes in the same node, not for the processes in the other nodes. Setting environment variables when you create the cluster is the recommended way. See the `discussion <https://github.com/vllm-project/vllm/issues/6803>`_ for more information.
-    $ # On head node
+.. warning::
-    $ ray start --head
-    $ # On worker nodes
+    Please make sure you downloaded the model to all the nodes (with the same path), or the model is downloaded to some distributed file system that is accessible by all nodes.
-    $ ray start --address=<ray-head-address>
-After that, you can run inference and serving on multiple machines by launching the vLLM process on the head node by setting :code:`tensor_parallel_size` to the number of GPUs to be the total number of GPUs across all machines.
+    When you use a Hugging Face repo ID to refer to the model, you should append your Hugging Face token to the ``run_cluster.sh`` script, e.g. ``-e HF_TOKEN=``. The recommended way is to download the model first, and then use the path to refer to the model.
\ No newline at end of file
--- a/docs/source/serving/env_vars.rst
+++ b/docs/source/serving/env_vars.rst
@@ -3,6 +3,11 @@ Environment Variables
 vLLM uses the following environment variables to configure the system:
+.. warning::
+    Please note that ``VLLM_PORT`` and ``VLLM_HOST_IP`` set the port and IP address for vLLM's **internal usage**. They are not the port and IP address of the API server. If you use ``--host $VLLM_HOST_IP`` and ``--port $VLLM_PORT`` to start the API server, it will not work.
+    All environment variables used by vLLM are prefixed with ``VLLM_``. **Special care should be taken for Kubernetes users**: please do not name the service as ``vllm``, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because `Kubernetes sets environment variables for each service with the capitalized service name as the prefix <https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables>`_.
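To see why a service named ``vllm`` is a problem, it helps to list the variables Kubernetes would inject for it. The sketch below follows the naming scheme in the linked Kubernetes documentation (upper-cased service name, dashes turned into underscores); the function names and the port value 8000 are assumptions for illustration.

```python
def kubernetes_service_env_names(service_name: str, port: int) -> list[str]:
    """Names of the env vars Kubernetes injects into pods for a Service."""
    prefix = service_name.upper().replace("-", "_")
    port_prefix = f"{prefix}_PORT_{port}_TCP"
    return [
        f"{prefix}_SERVICE_HOST",
        f"{prefix}_SERVICE_PORT",
        f"{prefix}_PORT",
        port_prefix,
        f"{port_prefix}_PROTO",
        f"{port_prefix}_PORT",
        f"{port_prefix}_ADDR",
    ]


def conflicting_names(service_name: str, port: int) -> list[str]:
    """Injected names that fall into vLLM's reserved ``VLLM_`` prefix."""
    names = kubernetes_service_env_names(service_name, port)
    return [name for name in names if name.startswith("VLLM_")]


print(conflicting_names("vllm", 8000))        # includes VLLM_PORT
print(conflicting_names("llm-server", 8000))  # []
```

The injected ``VLLM_PORT`` holds a URL-style value rather than a plain port number, so it collides directly with the variable vLLM reads for its internal port.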
 .. literalinclude:: ../../../vllm/envs.py
    :language: python
    :start-after: begin-env-vars-definition

--- a/docs/source/serving/faq.rst
+++ b/docs/source/serving/faq.rst
+Frequently Asked Questions
+===========================
+    Q: How can I serve multiple models on a single port using the OpenAI API?
+A: Assuming that you're referring to using the OpenAI-compatible server to serve multiple models at once, that is not currently supported. Instead, you can run multiple instances of the server (each serving a different model) at the same time, and add another layer that routes each incoming request to the correct server.
+----------------------------------------
+    Q: Which model to use for offline inference embedding?
+A: If you want to use an embedding model, try: https://huggingface.co/intfloat/e5-mistral-7b-instruct. In contrast, models such as Llama-3-8b and Mistral-7B-Instruct-v0.3 are generation models rather than embedding models.
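The routing layer mentioned in the first answer can be as simple as a lookup from the request's ``model`` field to the server instance that hosts it. This is a sketch only: the backend table, ports, and function name are hypothetical, and a real deployment would put this logic in a reverse proxy or gateway.

```python
# Hypothetical backends: one vLLM server instance per model, each on its own port.
BACKENDS = {
    "NousResearch/Meta-Llama-3-8B-Instruct": "http://localhost:8001",
    "mistralai/Mistral-7B-Instruct-v0.1": "http://localhost:8002",
}


def route(request_body: dict, path: str = "/v1/chat/completions") -> str:
    """Return the URL of the server instance that serves the requested model."""
    model = request_body.get("model")
    if model not in BACKENDS:
        raise KeyError(f"no server is configured for model {model!r}")
    return BACKENDS[model] + path


print(route({"model": "mistralai/Mistral-7B-Instruct-v0.1", "messages": []}))
```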
--- a/docs/source/serving/integrations.rst
+++ b/docs/source/serving/integrations.rst
@@ -8,6 +8,7 @@ Integrations
   deploying_with_kserve
   deploying_with_triton
   deploying_with_bentoml
+   deploying_with_cerebrium
   deploying_with_lws
   deploying_with_dstack
   serving_with_langchain
--- a/docs/source/serving/openai_compatible_server.md
+++ b/docs/source/serving/openai_compatible_server.md
@@ -4,7 +4,7 @@ vLLM provides an HTTP server that implements OpenAI's [Completions](https://plat
 You can start the server using Python, or using [Docker](deploying_with_docker.rst):
 ```bash
-python -m vllm.entrypoints.openai.api_server --model NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
+vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
 ```
 To call the server, you can use the official OpenAI Python client library, or any other HTTP client.
@@ -97,9 +97,7 @@ template, or the template in string form. Without a chat template, the server wi
 and all chat requests will error.
 ```bash
-python -m vllm.entrypoints.openai.api_server \
+vllm serve <model> --chat-template ./path-to-chat-template.jinja
-  --model ... \
-  --chat-template ./path-to-chat-template.jinja
 ```
 vLLM community provides a set of chat templates for popular models. You can find them in the examples
@@ -109,8 +107,8 @@ directory [here](https://github.com/vllm-project/vllm/tree/main/examples/)
 ```{argparse}
 :module: vllm.entrypoints.openai.cli_args
-:func: make_arg_parser
+:func: create_parser_for_docs
-:prog: -m vllm.entrypoints.openai.api_server
+:prog: vllm serve
 ```
 ## Tool calling in the chat completion API

--- a/docs/source/serving/run_on_sky.rst
+++ b/docs/source/serving/run_on_sky.rst
@@ -5,9 +5,9 @@ Deploying and scaling up with SkyPilot
 .. raw:: html
-    <p align="center">
+  <p align="center">
-        <img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/>
+    <img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/>
-    </p>
+  </p>
 vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with `SkyPilot <https://github.com/skypilot-org/skypilot>`__, an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3, Mixtral, etc, can be found in `SkyPilot AI gallery <https://skypilot.readthedocs.io/en/latest/gallery/index.html>`__.
@@ -21,8 +21,8 @@ Prerequisites
 .. code-block:: console
-    pip install skypilot-nightly
+  pip install skypilot-nightly
-    sky check
+  sky check
 Run on a single instance
@@ -32,64 +32,64 @@ See the vLLM SkyPilot YAML for serving, `serving.yaml <https://github.com/skypil
 .. code-block:: yaml
-    resources:
+  resources:
-        accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
+    accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
-        use_spot: True
+    use_spot: True
-        disk_size: 512  # Ensure model checkpoints can fit.
+    disk_size: 512  # Ensure model checkpoints can fit.
-        disk_tier: best
+    disk_tier: best
-        ports: 8081  # Expose to internet traffic.
+    ports: 8081  # Expose to internet traffic.
-    envs:
+  envs:
-        MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
+    MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
-        HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
+    HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
-    setup: |
+  setup: |
-        conda create -n vllm python=3.10 -y
+    conda create -n vllm python=3.10 -y
-        conda activate vllm
+    conda activate vllm
-        pip install vllm==0.4.0.post1
+    pip install vllm==0.4.0.post1
-        # Install Gradio for web UI.
+    # Install Gradio for web UI.
-        pip install gradio openai
+    pip install gradio openai
-        pip install flash-attn==2.5.7
+    pip install flash-attn==2.5.7
-    run: |
+  run: |
-        conda activate vllm
+    conda activate vllm
-        echo 'Starting vllm api server...'
+    echo 'Starting vllm api server...'
-        python -u -m vllm.entrypoints.openai.api_server \
+    python -u -m vllm.entrypoints.openai.api_server \
-            --port 8081 \
+      --port 8081 \
-            --model $MODEL_NAME \
+      --model $MODEL_NAME \
-            --trust-remote-code \
+      --trust-remote-code \
-            --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+      --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
-            2>&1 | tee api_server.log &
+      2>&1 | tee api_server.log &
-        echo 'Waiting for vllm api server to start...'
+    echo 'Waiting for vllm api server to start...'
-        while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
+    while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
-        echo 'Starting gradio server...'
+    echo 'Starting gradio server...'
-        git clone https://github.com/vllm-project/vllm.git || true
+    git clone https://github.com/vllm-project/vllm.git || true
-        python vllm/examples/gradio_openai_chatbot_webserver.py \
+    python vllm/examples/gradio_openai_chatbot_webserver.py \
-            -m $MODEL_NAME \
+      -m $MODEL_NAME \
-            --port 8811 \
+      --port 8811 \
-            --model-url http://localhost:8081/v1 \
+      --model-url http://localhost:8081/v1 \
-            --stop-token-ids 128009,128001
+      --stop-token-ids 128009,128001
 Start the serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, ...): 
 .. code-block:: console
-    HF_TOKEN="your-huggingface-token" sky launch serving.yaml --env HF_TOKEN
+  HF_TOKEN="your-huggingface-token" sky launch serving.yaml --env HF_TOKEN
 Check the output of the command. There will be a shareable gradio link (like the last line of the following). Open it in your browser to use the LLaMA model to do the text completion.
 .. code-block:: console
-    (task, pid=7431) Running on public URL: https://<gradio-hash>.gradio.live
+  (task, pid=7431) Running on public URL: https://<gradio-hash>.gradio.live
 **Optional**: Serve the 70B model instead of the default 8B and use more GPU:
 .. code-block:: console
-    HF_TOKEN="your-huggingface-token" sky launch serving.yaml --gpus A100:8 --env HF_TOKEN --env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct
+  HF_TOKEN="your-huggingface-token" sky launch serving.yaml --gpus A100:8 --env HF_TOKEN --env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct
 Scale up to multiple replicas
@@ -99,151 +99,212 @@ SkyPilot can scale up the service to multiple service replicas with built-in aut
 .. code-block:: yaml
-    service:
+  service:
-        replicas: 2
+    replicas: 2
-        # An actual request for readiness probe.
+    # An actual request for readiness probe.
-        readiness_probe:
+    readiness_probe:
-            path: /v1/chat/completions
+      path: /v1/chat/completions
-            post_data:
+      post_data:
-            model: $MODEL_NAME
+        model: $MODEL_NAME
-            messages:
+        messages:
-                - role: user
+          - role: user
-                content: Hello! What is your name?
+            content: Hello! What is your name?
-        max_tokens: 1
+        max_tokens: 1
 .. raw:: html
-    <details>
+  <details>
-    <summary>Click to see the full recipe YAML</summary>
+  <summary>Click to see the full recipe YAML</summary>
 .. code-block:: yaml
-    service:
+  service:
-        replicas: 2
+    replicas: 2
-        # An actual request for readiness probe.
+    # An actual request for readiness probe.
-        readiness_probe:
+    readiness_probe:
-            path: /v1/chat/completions
+      path: /v1/chat/completions
-            post_data:
+      post_data:
-            model: $MODEL_NAME
+        model: $MODEL_NAME
-            messages:
+        messages:
-                - role: user
+          - role: user
-                content: Hello! What is your name?
+            content: Hello! What is your name?
        max_tokens: 1
-    resources:
+  resources:
-        accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
+    accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
-        use_spot: True
+    use_spot: True
-        disk_size: 512  # Ensure model checkpoints can fit.
+    disk_size: 512  # Ensure model checkpoints can fit.
-        disk_tier: best
+    disk_tier: best
-        ports: 8081  # Expose to internet traffic.
+    ports: 8081  # Expose to internet traffic.
-    envs:
+  envs:
-        MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
+    MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
-        HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
+    HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
-    setup: |
+  setup: |
-        conda create -n vllm python=3.10 -y
+    conda create -n vllm python=3.10 -y
-        conda activate vllm
+    conda activate vllm
-        pip install vllm==0.4.0.post1
+    pip install vllm==0.4.0.post1
-        # Install Gradio for web UI.
+    # Install Gradio for web UI.
-        pip install gradio openai
+    pip install gradio openai
-        pip install flash-attn==2.5.7
+    pip install flash-attn==2.5.7
-    run: |
+  run: |
-        conda activate vllm
+    conda activate vllm
-        echo 'Starting vllm api server...'
+    echo 'Starting vllm api server...'
-        python -u -m vllm.entrypoints.openai.api_server \
+    python -u -m vllm.entrypoints.openai.api_server \
-            --port 8081 \
+      --port 8081 \
-            --model $MODEL_NAME \
+      --model $MODEL_NAME \
-            --trust-remote-code \
+      --trust-remote-code \
-            --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+      --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
-            2>&1 | tee api_server.log &
+      2>&1 | tee api_server.log
-        echo 'Waiting for vllm api server to start...'
-        while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
-        echo 'Starting gradio server...'
-        git clone https://github.com/vllm-project/vllm.git || true
-        python vllm/examples/gradio_openai_chatbot_webserver.py \
-            -m $MODEL_NAME \
-            --port 8811 \
-            --model-url http://localhost:8081/v1 \
-            --stop-token-ids 128009,128001
 .. raw:: html
-    </details>
+  </details>
 Start serving the Llama-3 8B model on multiple replicas:
 .. code-block:: console
-    HF_TOKEN="your-huggingface-token" sky serve up -n vllm serving.yaml --env HF_TOKEN
+  HF_TOKEN="your-huggingface-token" sky serve up -n vllm serving.yaml --env HF_TOKEN
 Wait until the service is ready:
 .. code-block:: console
-    watch -n10 sky serve status vllm
+  watch -n10 sky serve status vllm
 .. raw:: html
-    <details>
+  <details>
-    <summary>Example outputs:</summary>
+  <summary>Example outputs:</summary>
 .. code-block:: console
-    Services
+  Services
-    NAME  VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
+  NAME  VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
-    vllm  1        35s     READY   2/2       xx.yy.zz.100:30001
+  vllm  1        35s     READY   2/2       xx.yy.zz.100:30001
-    Service Replicas
+  Service Replicas
-    SERVICE_NAME  ID  VERSION  IP            LAUNCHED     RESOURCES          STATUS  REGION
+  SERVICE_NAME  ID  VERSION  IP            LAUNCHED     RESOURCES                STATUS  REGION
-    vllm          1   1        xx.yy.zz.121  18 mins ago  1x GCP({'L4': 1})  READY   us-east4
+  vllm          1   1        xx.yy.zz.121  18 mins ago  1x GCP([Spot]{'L4': 1})  READY   us-east4
-    vllm          2   1        xx.yy.zz.245  18 mins ago  1x GCP({'L4': 1})  READY   us-east4
+  vllm          2   1        xx.yy.zz.245  18 mins ago  1x GCP([Spot]{'L4': 1})  READY   us-east4
 .. raw:: html
-    </details>
+  </details>
 After the service is READY, you can find a single endpoint for the service and access the service with the endpoint:
 .. code-block:: console
-    ENDPOINT=$(sky serve status --endpoint 8081 vllm)
+  ENDPOINT=$(sky serve status --endpoint 8081 vllm)
-    curl -L http://$ENDPOINT/v1/chat/completions \
+  curl -L http://$ENDPOINT/v1/chat/completions \
-        -H "Content-Type: application/json" \
+    -H "Content-Type: application/json" \
-        -d '{
+    -d '{
-            "model": "meta-llama/Meta-Llama-3-8B-Instruct",
+      "model": "meta-llama/Meta-Llama-3-8B-Instruct",
-            "messages": [
+      "messages": [
-            {
+      {
-                "role": "system",
+        "role": "system",
-                "content": "You are a helpful assistant."
+        "content": "You are a helpful assistant."
-            },
+      },
-            {
+      {
-                "role": "user",
+        "role": "user",
-                "content": "Who are you?"
+        "content": "Who are you?"
-            }
+      }
-            ],
+      ],
-            "stop_token_ids": [128009,  128001]
+      "stop_token_ids": [128009,  128001]
-        }'
+    }'
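The curl request above can also be sketched in Python. This is a minimal illustration using only the standard library; the helper name `build_chat_request` and the placeholder endpoint are assumptions for the example, not part of the recipe. It only builds the request, so it can be inspected without a running service.

```python
import json
import urllib.request


def build_chat_request(endpoint: str) -> urllib.request.Request:
    """Build the same POST request that the curl command sends.

    `endpoint` is the host:port string returned by
    `sky serve status --endpoint 8081 vllm` (hypothetical helper).
    """
    payload = {
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who are you?"},
        ],
        "stop_token_ids": [128009, 128001],
    }
    return urllib.request.Request(
        f"http://{endpoint}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder address taken from the example output; replace with your own.
req = build_chat_request("xx.yy.zz.100:30001")
print(req.get_method(), req.full_url)
```

To actually send it, pass `req` to `urllib.request.urlopen`. Note that `urllib` does not follow a 307/308 redirect for a POST the way `curl -L` does, so a client library that does may be a better fit if the endpoint redirects.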
-To enable autoscaling, you could specify additional configs in `services`:
+To enable autoscaling, you could replace the `replicas` with the following configs in `service`:
 .. code-block:: yaml
-    services:
+  service:
-        replica_policy:
+    replica_policy:
-            min_replicas: 0
+      min_replicas: 2
-            max_replicas: 3
+      max_replicas: 4
-        target_qps_per_replica: 2
+      target_qps_per_replica: 2
 This will scale the service up when the QPS exceeds 2 for each replica.
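As a rough illustration of how these three settings interact, the sketch below assumes the autoscaler aims for `ceil(qps / target_qps_per_replica)` replicas, clamped to the `[min_replicas, max_replicas]` range. That formula and the function name `target_replicas` are assumptions made for the example; the real autoscaler may also apply delays or smoothing before it changes the replica count.

```python
import math


def target_replicas(qps: float, target_qps_per_replica: float,
                    min_replicas: int, max_replicas: int) -> int:
    """Hypothetical replica target under a simple QPS-based policy."""
    # Replicas needed so that each one stays at or below the target QPS.
    wanted = math.ceil(qps / target_qps_per_replica)
    # Never go below the minimum or above the maximum.
    return max(min_replicas, min(max_replicas, wanted))


# Values from the config above: min 2, max 4, target 2 QPS per replica.
for qps in (1, 5, 7, 20):
    print(qps, target_replicas(qps, 2, 2, 4))
```

Under this assumption, a load of 5 QPS asks for 3 replicas, while anything above 8 QPS stays capped at 4.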
+.. raw:: html
+  <details>
+  <summary>Click to see the full recipe YAML</summary>
+.. code-block:: yaml
+  service:
+    replica_policy:
+      min_replicas: 2
+      max_replicas: 4
+      target_qps_per_replica: 2
+    # An actual request for readiness probe.
+    readiness_probe:
+      path: /v1/chat/completions
+      post_data:
+        model: $MODEL_NAME
+        messages:
+          - role: user
+            content: Hello! What is your name?
+        max_tokens: 1
+  resources:
+    accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
+    use_spot: True
+    disk_size: 512  # Ensure model checkpoints can fit.
+    disk_tier: best
+    ports: 8081  # Expose to internet traffic.
+  envs:
+    MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
+    HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
+  setup: |
+    conda create -n vllm python=3.10 -y
+    conda activate vllm
+    pip install vllm==0.4.0.post1
+    # Install Gradio for web UI.
+    pip install gradio openai
+    pip install flash-attn==2.5.7
+  run: |
+    conda activate vllm
+    echo 'Starting vllm api server...'
+    python -u -m vllm.entrypoints.openai.api_server \
+      --port 8081 \
+      --model $MODEL_NAME \
+      --trust-remote-code \
+      --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+      2>&1 | tee api_server.log
+.. raw:: html
+  </details>
+To update the service with the new config:
+.. code-block:: console
+  HF_TOKEN="your-huggingface-token" sky serve update vllm serving.yaml --env HF_TOKEN
+To stop the service:
+.. code-block:: console
+  sky serve down vllm
 **Optional**: Connect a GUI to the endpoint
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -253,58 +314,53 @@ It is also possible to access the Llama-3 service with a separate GUI frontend,
 .. raw:: html
-    <details>
+  <details>
-    <summary>Click to see the full GUI YAML</summary>
+  <summary>Click to see the full GUI YAML</summary>
 .. code-block:: yaml
-    envs:
+  envs:
-        MODEL_NAME: meta-llama/Meta-Llama-3-70B-Instruct
+    MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
-        ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm. 
+    ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm. 
-    resources:
+  resources:
-        cpus: 2
+    cpus: 2
-    setup: |
+  setup: |
-        conda activate vllm
+    conda create -n vllm python=3.10 -y
-        if [ $? -ne 0 ]; then
+    conda activate vllm
-            conda create -n vllm python=3.10 -y
-            conda activate vllm
+    # Install Gradio for web UI.
-        fi
+    pip install gradio openai
-        # Install Gradio for web UI.
+  run: |
-        pip install gradio openai
+    conda activate vllm
+    export PATH=$PATH:/sbin
-    run: |
-        conda activate vllm
+    echo 'Starting gradio server...'
-        export PATH=$PATH:/sbin
+    git clone https://github.com/vllm-project/vllm.git || true
-        WORKER_IP=$(hostname -I | cut -d' ' -f1)
+    python vllm/examples/gradio_openai_chatbot_webserver.py \
-        CONTROLLER_PORT=21001
+      -m $MODEL_NAME \
-        WORKER_PORT=21002
+      --port 8811 \
+      --model-url http://$ENDPOINT/v1 \
-        echo 'Starting gradio server...'
+      --stop-token-ids 128009,128001 | tee ~/gradio.log
-        git clone https://github.com/vllm-project/vllm.git || true
-        python vllm/examples/gradio_openai_chatbot_webserver.py \
-            -m $MODEL_NAME \
-            --port 8811 \
-            --model-url http://$ENDPOINT/v1 \
-            --stop-token-ids 128009,128001 | tee ~/gradio.log
 .. raw:: html
-    </details>
+  </details>
 1. Start the chat web UI:
 .. code-block:: console
-    sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm)
+  sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm)
 2. Then, we can access the GUI at the returned gradio link:
 .. code-block:: console
-    | INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
+  | INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
--- a/docs/source/serving/tensorizer.rst
+++ b/docs/source/serving/tensorizer.rst
+.. _tensorizer:
+Loading Models with CoreWeave's Tensorizer
+==========================================
+vLLM supports loading models with `CoreWeave's Tensorizer <https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer>`_.
+vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or an S3 endpoint can be deserialized
+at runtime extremely quickly directly to the GPU, resulting in significantly
+shorter Pod startup times and lower CPU memory usage. Tensor encryption is also supported.
+For more information on CoreWeave's Tensorizer, please refer to
+`CoreWeave's Tensorizer documentation <https://github.com/coreweave/tensorizer>`_. For more information on serializing a vLLM model, as well as a general usage guide to using Tensorizer with vLLM, see
+the `vLLM example script <https://docs.vllm.ai/en/stable/getting_started/examples/tensorize_vllm_model.html>`_.
\ No newline at end of file
--- a/examples/api_client.py
+++ b/examples/api_client.py
-"""Example Python client for vllm.entrypoints.api_server"""
+"""Example Python client for `vllm.entrypoints.api_server`
+NOTE: The API server is used only for demonstration and simple performance
+benchmarks. It is not intended for production use.
+For production use, we recommend `vllm serve` and the OpenAI client API.
+"""
 import argparse
 import json
@@ -27,7 +31,10 @@ def post_http_request(prompt: str,
        "max_tokens": 16,
        "stream": stream,
    }
-    response = requests.post(api_url, headers=headers, json=pload, stream=True)
+    response = requests.post(api_url,
+                             headers=headers,
+                             json=pload,
+                             stream=stream)
    return response

--- a/examples/aqlm_example.py
+++ b/examples/aqlm_example.py
-import argparse
 from vllm import LLM, SamplingParams
+from vllm.utils import FlexibleArgumentParser
 def main():
-    parser = argparse.ArgumentParser(description='AQLM examples')
+    parser = FlexibleArgumentParser(description='AQLM examples')
    parser.add_argument('--model',
                        '-m',
@@ -17,7 +16,7 @@ def main():
                        type=int,
                        default=0,
                        help='known good models by index, [0-4]')
-    parser.add_argument('--tensor_parallel_size',
+    parser.add_argument('--tensor-parallel-size',
                        '-t',
                        type=int,
                        default=1,

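The rename from `--tensor_parallel_size` to `--tensor-parallel-size` changes what is typed on the command line but not the attribute name in the parsed namespace. The sketch below uses only the standard library `argparse` (it does not exercise vLLM's `FlexibleArgumentParser`). It shows that the dashed flag is still read as `args.tensor_parallel_size`, and that plain `argparse` rejects the old underscore spelling.

```python
import argparse

parser = argparse.ArgumentParser(description="dest-name demo")
parser.add_argument("--tensor-parallel-size", "-t", type=int, default=1)

# argparse derives the attribute name from the long flag and replaces
# dashes with underscores, so existing code reading the value is unchanged.
args = parser.parse_args(["--tensor-parallel-size", "2"])
print(args.tensor_parallel_size)

# With plain argparse, the old underscore spelling is an unknown argument.
try:
    parser.parse_args(["--tensor_parallel_size", "2"])
    underscore_accepted = True
except SystemExit:
    underscore_accepted = False
print(underscore_accepted)
```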
--- a/examples/cpu_offload.py
+++ b/examples/cpu_offload.py
+from vllm import LLM, SamplingParams
+# Sample prompts.
+prompts = [
+    "Hello, my name is",
+    "The president of the United States is",
+    "The capital of France is",
+    "The future of AI is",
+]
+# Create a sampling params object.
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+# Create an LLM.
+llm = LLM(model="meta-llama/Llama-2-13b-chat-hf", cpu_offload_gb=10)
+# Generate texts from the prompts. The output is a list of RequestOutput objects
+# that contain the prompt, generated text, and other information.
+outputs = llm.generate(prompts, sampling_params)
+# Print the outputs.
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
--- a/examples/fp8/extract_scales.py
+++ b/examples/fp8/extract_scales.py
@@ -2,7 +2,7 @@ import argparse
 import glob
 import json
 import os
-from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple
+from typing import Any, Callable, Dict, List, Optional, Tuple
 import numpy as np
 import torch
@@ -19,7 +19,7 @@ def _prepare_hf_weights(
    quantized_model_dir: str,
    load_format: str = "auto",
    fall_back_to_pt: bool = True,
-) -> Tuple[str, List[str], bool]:
+) -> Tuple[List[str], bool]:
    if not os.path.isdir(quantized_model_dir):
        raise FileNotFoundError(
            f"The quantized model directory `{quantized_model_dir}` "
@@ -94,7 +94,7 @@ def _hf_tensorfile_iterator(filename: str, load_format: str,
 def _kv_scales_extractor(
-        hf_tensor_files: Iterable[str],
+        hf_tensor_files: List[str],
        use_safetensors: bool,
        rank_keyword: str = "rank",
        expected_tp_size: Optional[int] = None) -> Dict[int, Dict[int, float]]:
@@ -115,7 +115,7 @@ def _kv_scales_extractor(
    for char in rank_keyword:
        assert not char.isdecimal(
        ), f"Rank keyword {rank_keyword} contains a numeric character!"
-    rank_scales_map = {}
+    rank_scales_map: Dict[int, Dict[int, float]] = {}
    for tensor_file in hf_tensor_files:
        try:
            rank_idx = tensor_file.find(rank_keyword)
@@ -141,7 +141,7 @@ def _kv_scales_extractor(
            raise
        if rank not in rank_scales_map:
-            layer_scales_map = {}
+            layer_scales_map: Dict[int, float] = {}
            rank_scales_map[rank] = layer_scales_map
        else:
            raise RuntimeError(
@@ -222,7 +222,7 @@ def _metadata_extractor(quantized_model_dir: str,
            "does not exist.")
    metadata_files = glob.glob(os.path.join(quantized_model_dir, "*.json"))
-    result = {}
+    result: Dict[str, Any] = {}
    for file in metadata_files:
        with open(file) as f:
            try:
@@ -327,7 +327,7 @@ if __name__ == "__main__":
        "--quantization-param-path <filename>). This is only used "
        "if the KV cache dtype is FP8 and on ROCm (AMD GPU).")
    parser.add_argument(
-        "--quantized_model",
+        "--quantized-model",
        help="Specify the directory containing a single quantized HF model. "
        "It is expected that the quantization format is FP8_E4M3, for use "
        "on ROCm (AMD GPU).",
@@ -339,18 +339,18 @@ if __name__ == "__main__":
        choices=["auto", "safetensors", "npz", "pt"],
        default="auto")
    parser.add_argument(
-        "--output_dir",
+        "--output-dir",
        help="Optionally specify the output directory. By default the "
        "KV cache scaling factors will be saved in the model directory, "
        "however you can override this behavior here.",
        default=None)
    parser.add_argument(
-        "--output_name",
+        "--output-name",
        help="Optionally specify the output filename.",
        # TODO: Change this once additional scaling factors are enabled
        default="kv_cache_scales.json")
    parser.add_argument(
-        "--tp_size",
+        "--tp-size",
        help="Optionally specify the tensor-parallel (TP) size that the "
        "quantized model should correspond to. If specified, during KV "
        "cache scaling factor extraction the observed TP size will be "

--- a/examples/fp8/quantizer/README.md
+++ b/examples/fp8/quantizer/README.md
@@ -16,7 +16,7 @@
 #### Run on H100 system for speed if FP8; number of GPUs depends on the model size
 #### Example: quantize Llama2-7b model from HF to FP8 with FP8 KV Cache:
-`python quantize.py --model_dir ./ll2-7b --dtype float16 --qformat fp8 --kv_cache_dtype fp8 --output_dir ./ll2_7b_fp8 --calib_size 512 --tp_size 1`
+`python quantize.py --model-dir ./ll2-7b --dtype float16 --qformat fp8 --kv-cache-dtype fp8 --output-dir ./ll2_7b_fp8 --calib-size 512 --tp-size 1`
 Outputs: model structure, quantized model & parameters (with scaling factors) are in JSON and Safetensors (npz is generated only for the reference)
 ```

--- a/examples/llava_example.py
+++ b/examples/llava_example.py
-import argparse
-import os
-import subprocess
-import torch
-from PIL import Image
-from vllm import LLM
-from vllm.multimodal.image import ImageFeatureData, ImagePixelData
-# The assets are located at `s3://air-example-data-2/vllm_opensource_llava/`.
-# You can use `.buildkite/download-images.sh` to download them
-def run_llava_pixel_values(*, disable_image_processor: bool = False):
-    llm = LLM(
-        model="llava-hf/llava-1.5-7b-hf",
-        image_input_type="pixel_values",
-        image_token_id=32000,
-        image_input_shape="1,3,336,336",
-        image_feature_size=576,
-        disable_image_processor=disable_image_processor,
-    )
-    prompt = "<image>" * 576 + (
-        "\nUSER: What is the content of this image?\nASSISTANT:")
-    if disable_image_processor:
-        image = torch.load("images/stop_sign_pixel_values.pt")
-    else:
-        image = Image.open("images/stop_sign.jpg")
-    outputs = llm.generate({
-        "prompt": prompt,
-        "multi_modal_data": ImagePixelData(image),
-    })
-    for o in outputs:
-        generated_text = o.outputs[0].text
-        print(generated_text)
-def run_llava_image_features():
-    llm = LLM(
-        model="llava-hf/llava-1.5-7b-hf",
-        image_input_type="image_features",
-        image_token_id=32000,
-        image_input_shape="1,576,1024",
-        image_feature_size=576,
-    )
-    prompt = "<image>" * 576 + (
-        "\nUSER: What is the content of this image?\nASSISTANT:")
-    image: torch.Tensor = torch.load("images/stop_sign_image_features.pt")
-    outputs = llm.generate({
-        "prompt": prompt,
-        "multi_modal_data": ImageFeatureData(image),
-    })
-    for o in outputs:
-        generated_text = o.outputs[0].text
-        print(generated_text)
-def main(args):
-    if args.type == "pixel_values":
-        run_llava_pixel_values()
-    else:
-        run_llava_image_features()
-if __name__ == "__main__":
-    parser = argparse.ArgumentParser(description="Demo on Llava")
-    parser.add_argument("--type",
-                        type=str,
-                        choices=["pixel_values", "image_features"],
-                        default="pixel_values",
-                        help="image input type")
-    args = parser.parse_args()
-    # Download from s3
-    s3_bucket_path = "s3://air-example-data-2/vllm_opensource_llava/"
-    local_directory = "images"
-    # Make sure the local directory exists or create it
-    os.makedirs(local_directory, exist_ok=True)
-    # Use AWS CLI to sync the directory, assume anonymous access
-    subprocess.check_call([
-        "aws",
-        "s3",
-        "sync",
-        s3_bucket_path,
-        local_directory,
-        "--no-sign-request",
-    ])
-    main(args)
--- a/examples/llm_engine_example.py
+++ b/examples/llm_engine_example.py
@@ -2,6 +2,7 @@ import argparse
 from typing import List, Tuple
 from vllm import EngineArgs, LLMEngine, RequestOutput, SamplingParams
+from vllm.utils import FlexibleArgumentParser
 def create_test_prompts() -> List[Tuple[str, SamplingParams]]:
@@ -55,7 +56,7 @@ def main(args: argparse.Namespace):
 if __name__ == '__main__':
-    parser = argparse.ArgumentParser(
+    parser = FlexibleArgumentParser(
        description='Demo on using the LLMEngine class directly')
    parser = EngineArgs.add_cli_args(parser)
    args = parser.parse_args()