Merge branch 'v0.6.2-eval' into v0.6.2-dev

2b1b8c7d · zhuwenwen · 3f42b83d · 367708c7
Commit 2b1b8c7d authored Nov 05, 2024 by zhuwenwen
5 changed files
--- a/README.md
+++ b/README.md
 # <div align="center"><strong>vLLM</strong></div>
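 ## Introduction
-vLLM is a fast and easy-to-use library for LLM inference and serving，with PagedAttention for efficient KV memory management，continuous batching of incoming requests，and support for many Hugging Face models，such as LLaMA & LLaMA-2, Qwen, Chatglm2 & Chatglm3.
+vLLM is a fast and easy-to-use library for LLM inference and serving, with PagedAttention for efficient KV memory management, continuous batching of incoming requests, and support for many Hugging Face models, such as LLaMA & LLaMA-2, Qwen, Chatglm2 & Chatglm3.
 ## Upstream features not yet supported
- **Quantized inference**：fp16 inference and gptq,awq-int4 inference are currently supported，while Marlin weight quantization and kv-cache fp8 inference are not yet supported
+- **Quantized inference**: fp16 inference and gptq,awq-int4 inference are currently supported, while Marlin weight quantization and kv-cache fp8 inference are not yet supported
- **Module support**：Sliding window attention is not currently supported
+- **Module support**: Sliding window attention is not currently supported
 ## Supported model architectures
 | Architecture | Models | Model parallelism | FP16 |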
 | :------: | :------: | :------: | :------: |
-| LlamaForCausalLM      | LLaMA、LLaMA-2、LLaMA-3、Codellama、deepseek、Yi | Yes | Yes |
+| LlamaForCausalLM      | Llama 3.1,Llama 3,Llama 2,Llama,Yi,Codellama、deepseek  | Yes | Yes |  
-| QWenLMHeadModel       | QWen、Qwen-VL                                   | Yes | Yes |
+| QWenLMHeadModel       | QWen,Qwen-VL                                               | Yes | Yes |
-| Qwen2ForCausalLM      | QWen1.5、CodeQwen1.5、QWen2                     | Yes | Yes |
+| Qwen2ForCausalLM      | QWen2,QWen1.5,CodeQwen1.5                                 | Yes | Yes |
-| ChatGLMModel          | chatglm2、chatglm3                              | Yes | Yes |
+| ChatGLMModel          | glm-4v-9b,chatglm3,chatglm2                               | Yes | Yes |
-| BaiChuanForCausalLM   | Baichuan、Baichuan2                             | Yes | Yes |
+| DeepseekV2ForCausalLM | DeepSeek-V2                                                 | Yes | Yes |
-| BloomForCausalLM      | BLOOM                                           | Yes | Yes |
+| BaiChuanForCausalLM   | Baichuan2,Baichuan                                         | Yes | Yes |
-| InternLMForCausalLM   | InternLM                                        | Yes | Yes |
+| BloomForCausalLM      | BLOOM                                                       | Yes | Yes |
-| InternLM2ForCausalLM  | InternLM2                                       | Yes | Yes |
+| InternLMForCausalLM   | InternLM                                                    | Yes | Yes |
-| DeepseekV2ForCausalLM | DeepSeek-V2                                     | Yes | Yes |
+| InternLM2ForCausalLM  | InternLM2                                                   | Yes | Yes |
-| MixtralForCausalLM    | Mixtral-8x7B                                    | Yes | Yes |
+| MiniCPMForCausalLM    | MiniCPM                                                     | Yes | Yes |
-| TeleChat12BForCausalLM (#TelechatForCausalLM) | TeleChat-12B            | Yes | Yes |
+| MiniCPM3ForCausalLM   | MiniCPM3                                                    | Yes | Yes |
+| MixtralForCausalLM    | Mixtral-8x7B,Mixtral-8x7B-Instruct                         | Yes | Yes |
+| TeleChat12BForCausalLM (#TelechatForCausalLM) | TeleChat-12B                        | Yes | Yes |
+| LlavaForConditionalGeneration       | LLaMA,LLaMA-2,LLaMA-3                       | Yes | Yes |
+| Qwen2VLForConditionalGeneration     | Qwen2-VL                                      | Yes | Yes |
+| MiniCPMV                            | MiniCPM-V                                     | Yes | Yes |
+| Phi3VForCausalLM                    | Phi-3.5-vision                                | Yes | Yes |
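 ## Installation
@@ -33,11 +39,11 @@ vLLM supports
 ### Install by building from source
 #### Preparing the build environment
-Two ways to prepare the environment are provided：
+Two ways to prepare the environment are provided:
-1. Use the 光源 pytorch2.3.0 base image：image download address：[https://sourcefind.cn/#/image/dcu/pytorch](https://sourcefind.cn/#/image/dcu/pytorch)，download the image version that matches your pytorch2.1.0, python, dtk and operating system.
+1. Use the 光源 pytorch2.3.0 base image: image download address: [https://sourcefind.cn/#/image/dcu/pytorch](https://sourcefind.cn/#/image/dcu/pytorch), download the image version that matches your pytorch2.1.0, python, dtk and operating system.
-2. Use an existing python environment：install pytorch2.3.0，pytorch whl download directory：[https://cancon.hpccube.com:65024/4/main/pytorch](https://cancon.hpccube.com:65024/4/main/pytorch)，based on your python and dtk versions, download the matching pytorch2.1.0 whl package. The install commands are as follows：
+2. Use an existing python environment: install pytorch2.3.0, pytorch whl download directory: [https://cancon.hpccube.com:65024/4/main/pytorch](https://cancon.hpccube.com:65024/4/main/pytorch), based on your python and dtk versions, download the matching pytorch2.1.0 whl package. The install commands are as follows: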
 ```shell
 pip install torch*  # the downloaded torch whl package
 pip install setuptools wheel
@@ -47,11 +53,11 @@ pip install setuptools wheel
 ```shell
 git clone http://developer.hpccube.com/codes/OpenDAS/vllm.git # switch to the branch you need
 ```
-Install dependencies：
+Install dependencies:
 ```shell
 pip install -r requirements-rocm.txt
 ```
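- Two ways to build from source are provided（run from inside the vllm directory）：
+- Two ways to build from source are provided (run from inside the vllm directory):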
 ```
 1. Build the whl package and install it
 VLLM_INSTALL_PUNICA_KERNELS=1 python setup.py bdist_wheel 
@@ -65,17 +71,17 @@ VLLM_INSTALL_PUNICA_KERNELS=1 python3 setup.py install
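 #### Preparing the base runtime environment
 1. Use the 光源 pytorch2.3.0 base image environment described above
-2. Download the dependency packages that match your pytorch2.3.0, python, dtk and operating system：
+2. Download the dependency packages that match your pytorch2.3.0, python, dtk and operating system: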
 - triton:[https://cancon.hpccube.com:65024/4/main/triton](https://cancon.hpccube.com:65024/4/main/triton/)
 - xformers:[https://cancon.hpccube.com:65024/4/main/xformers](https://cancon.hpccube.com:65024/4/main/xformers)
 - flash_attn: [https://cancon.hpccube.com:65024/4/main/flash_attn](https://cancon.hpccube.com:65024/4/main/flash_attn)
 - lmslim: [https://cancon.hpccube.com:65024/4/main/lmslim](https://cancon.hpccube.com:65024/4/main/lmslim)
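 #### Notes
-+ If pip install downloads are too slow，you can add a mirror：-i https://pypi.tuna.tsinghua.edu.cn/simple/
+ If pip install downloads are too slow, you can add a mirror: -i https://pypi.tuna.tsinghua.edu.cn/simple/
 ## Verification
- python -c "import vllm; print(vllm.\_\_version__)"，prints the installed version number，which tracks the upstream release，for example 0.6.2；
+- python -c "import vllm; print(vllm.\_\_version__)", prints the installed version number, which tracks the upstream release, for example 0.6.2;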
 ## Known Issue
 - None
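
The verification step in the README diff prints `vllm.__version__` after importing the package. As a rough alternative sketch, the installed version can also be read from distribution metadata without importing the package itself. The helper name `installed_version` is hypothetical, and the distribution name `vllm` is an assumption: this fork may publish its wheel under a different name.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "vllm" is an assumed distribution name; adjust it to the wheel you built.
    version = installed_version("vllm")
    print(version if version is not None else "vllm is not installed")
```

This reads package metadata only, so it does not confirm that the compiled extensions load; the import-based check in the README remains the stronger test.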

--- a/vllm/model_executor/model_loader/utils.py
+++ b/vllm/model_executor/model_loader/utils.py
@@ -23,7 +23,7 @@ def get_model_architecture(
        model_config: ModelConfig) -> Tuple[Type[nn.Module], str]:
    architectures = getattr(model_config.hf_config, "architectures", [])
    visions = getattr(model_config.hf_config, "visual", []) or getattr(model_config.hf_config, "vision_config", [])
-    support_nn_architectures = ['LlamaForCausalLM', 'QWenLMHeadModel', 'Qwen2ForCausalLM', 'ChatGLMModel', 'BaichuanForCausalLM', 'BloomForCausalLM', 'MedusaModel']  
+    support_nn_architectures = ['LlamaForCausalLM', 'QWenLMHeadModel', 'Qwen2ForCausalLM', 'Qwen2VLForConditionalGeneration', 'ChatGLMModel', 'BaichuanForCausalLM', 'BloomForCausalLM', 'MedusaModel']  
    if any(arch in architectures for arch in support_nn_architectures): 
        if os.getenv('LLAMA_NN') != '0': 
             if (architectures == ['QWenLMHeadModel'] or architectures == ['ChatGLMModel'] ) and visions != []:
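
The hunk above enables the custom path when `os.getenv('LLAMA_NN') != '0'`, while the `qwen2_vl.py` hunk in this same commit tests `os.environ.get('LLAMA_NN') == '1'`. The two checks agree when the variable is `0` or `1` but disagree when it is unset. A minimal sketch of that difference, using hypothetical helper names that do not appear in the commit:

```python
from typing import Mapping


def enabled_unless_zero(env: Mapping[str, str], name: str = "LLAMA_NN") -> bool:
    """Mirrors `os.getenv(name) != '0'`: on unless explicitly set to '0'."""
    return env.get(name) != "0"


def enabled_only_if_one(env: Mapping[str, str], name: str = "LLAMA_NN") -> bool:
    """Mirrors `os.environ.get(name) == '1'`: off unless explicitly set to '1'."""
    return env.get(name) == "1"


if __name__ == "__main__":
    for value in (None, "0", "1"):
        env = {} if value is None else {"LLAMA_NN": value}
        print(repr(value), enabled_unless_zero(env), enabled_only_if_one(env))
```

With the variable unset, the first check is true and the second is false, so under these two tests the architecture would be routed to the custom path while the weight layout conversion in `load_weights` would be skipped. Whether that combination is intended is not stated in the commit.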

--- a/vllm/model_executor/models/llama.py
+++ b/vllm/model_executor/models/llama.py
@@ -577,7 +577,6 @@ class LlamaForCausalLM(nn.Module, SupportsLoRA):
                "mlp.down_proj.weight",
                "lm_head.weight"
            ]
            combined_words = "|".join(lay_key_words)
            lay_qkv_words = ["self_attn.qkv_proj.weight"]   

--- a/vllm/model_executor/models/qwen2_vl.py
+++ b/vllm/model_executor/models/qwen2_vl.py
@@ -71,6 +71,11 @@ from vllm.utils import is_cpu
 from .utils import (PPMissingLayer, is_pp_missing_parameter,
                    make_empty_intermediate_tensors_factory)
+import os
+import re
+from vllm import _custom_ops as ops
+from vllm.model_executor.utils import pad_weight, gemm_bank_conf
 logger = init_logger(__name__)
 # === Vision Inputs === #
@@ -889,6 +894,16 @@ class Qwen2VLForConditionalGeneration(nn.Module, SupportsMultiModal):
        self.make_empty_intermediate_tensors = (
            make_empty_intermediate_tensors_factory(
                ["hidden_states", "residual"], config.hidden_size))
+        self.quant_method = None
+        if quant_config is not None:
+            self.quant_method=quant_config.get_name()
+            self.quant_config=quant_config
+        self.use_llama_nn = os.environ.get('LLAMA_NN') == '1'
+        self.use_gemm_pad = os.environ.get('GEMM_PAD') == '1'
+        self.use_fa_pad = os.environ.get('FA_PAD') == '1'
+        self.use_awq_pad = os.environ.get('AWQ_PAD') == '1'
    def _validate_and_reshape_mm_tensor(self,
                                        mm_input: Union[torch.Tensor,
@@ -1119,3 +1134,46 @@ class Qwen2VLForConditionalGeneration(nn.Module, SupportsMultiModal):
                weight_loader = getattr(param, "weight_loader",
                                        default_weight_loader)
                weight_loader(param, loaded_weight)
+        if self.use_llama_nn and self.quant_method is None:
+            lay_key_words = [
+                "attn.qkv.weight",
+                "attn.proj.weight",
+                "mlp.fc1.weight",
+                "mlp.fc2.weight",
+                "mlp.0.weight",
+                "mlp.2.weight",
+                "self_attn.qkv_proj.weight",
+                "self_attn.o_proj.weight",
+                "mlp.gate_up_proj.weight",
+                "mlp.down_proj.weight",
+                "lm_head.weight",
+            ]
+            combined_words = "|".join(lay_key_words)
+            lay_qkv_words = ["attn.qkv.weight"]   
+            qkv_words = "|".join(lay_qkv_words)  
+            lay_qkv_bias_words = ["attn.qkv.bias"]   
+            qkv_bias_words = "|".join(lay_qkv_bias_words) 
+            for layername, weight in params_dict.items():
+                if self.use_fa_pad and (re.findall(qkv_bias_words, layername)):
+                    weight.data = pad_weight(weight.data, 32)
+                matches = re.findall(combined_words, layername)
+                if matches:   
+                    if self.use_gemm_pad and gemm_bank_conf(weight.data.shape[0]):
+                        weight.data = pad_weight(weight.data, 32)  
+                    if self.use_fa_pad and (re.findall(qkv_words, layername)):
+                        if not gemm_bank_conf(weight.data.shape[0]):
+                            weight.data = pad_weight(weight.data, 32)
+                    _weight = torch.zeros_like(weight.data)
+                    ori_shape =_weight.shape
+                    ops.trans_w16_gemm(_weight, weight.data, _weight.shape[0], _weight.shape[1])
+                    weight.data.copy_(_weight)
+                    weight.data=weight.data.reshape(ori_shape[1],-1)
\ No newline at end of file
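
The weight post-processing above selects layers by joining key words with `|` and passing the result to `re.findall`. Because the key words contain `.`, each dot acts as a regex wildcard and not as a literal dot. The sketch below shows the same selection with the key words escaped. The helper names and the sample layer names are hypothetical and serve only to illustrate the matching; the padding and `trans_w16_gemm` steps are left out.

```python
import re
from typing import Iterable, List

# Key words copied from the qwen2_vl.py hunk above.
LAY_KEY_WORDS = [
    "attn.qkv.weight",
    "attn.proj.weight",
    "mlp.fc1.weight",
    "mlp.fc2.weight",
    "mlp.0.weight",
    "mlp.2.weight",
    "self_attn.qkv_proj.weight",
    "self_attn.o_proj.weight",
    "mlp.gate_up_proj.weight",
    "mlp.down_proj.weight",
    "lm_head.weight",
]


def build_matcher(key_words: Iterable[str]) -> re.Pattern:
    """Build one alternation pattern, escaping '.' so it matches literally."""
    return re.compile("|".join(re.escape(word) for word in key_words))


def select_layers(layer_names: Iterable[str], matcher: re.Pattern) -> List[str]:
    """Return the layer names that contain any of the key words."""
    return [name for name in layer_names if matcher.search(name)]


if __name__ == "__main__":
    # Hypothetical parameter names, for illustration only.
    names = [
        "visual.blocks.0.attn.qkv.weight",
        "visual.blocks.0.attn.qkv.bias",
        "model.layers.0.self_attn.qkv_proj.weight",
        "model.layers.0.input_layernorm.weight",
    ]
    print(select_layers(names, build_matcher(LAY_KEY_WORDS)))
```

For parameter names that use literal dots as separators, the escaped and unescaped patterns select the same layers; escaping only rules out accidental matches on names where another character sits in a dot's position.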
--- a/vllm/multimodal/utils.py
+++ b/vllm/multimodal/utils.py
@@ -6,6 +6,7 @@ from typing import Any, List, Optional, Tuple, TypeVar, Union
 import numpy as np
 import numpy.typing as npt
 from PIL import Image
+import os
 from vllm.connections import global_http_connection
 from vllm.envs import VLLM_AUDIO_FETCH_TIMEOUT, VLLM_IMAGE_FETCH_TIMEOUT
@@ -30,6 +31,10 @@ def _load_image_from_data_url(image_url: str):
    return load_image_from_base64(image_base64)
+def _load_image_from_file(file_path: str) -> Image.Image:  
+    # Load image directly from file  
+    return Image.open(file_path) 
 def fetch_image(image_url: str, *, image_mode: str = "RGB") -> Image.Image:
    """
    Load a PIL image from a HTTP or base64 data URL.
@@ -43,6 +48,11 @@ def fetch_image(image_url: str, *, image_mode: str = "RGB") -> Image.Image:
    elif image_url.startswith('data:image'):
        image = _load_image_from_data_url(image_url)
+    elif os.path.isfile(image_url):  
+        # Load image from local file path  
+        image = _load_image_from_file(image_url) 
    else:
        raise ValueError("Invalid 'image_url': A valid 'image_url' must start "
                         "with either 'data:image' or 'http'.")
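
The `fetch_image` hunk above adds a local-file branch after the `data:image` check. A minimal sketch of that dispatch order follows, using only the standard library. `classify_image_source` is a hypothetical helper that is not part of the commit, and the leading `http` check is assumed from the error message because the hunk does not show that branch.

```python
import os


def classify_image_source(image_url: str) -> str:
    """Return 'http', 'data', or 'file', following the branch order in the diff."""
    if image_url.startswith("http"):
        return "http"
    if image_url.startswith("data:image"):
        return "data"
    if os.path.isfile(image_url):
        # Branch added by this commit: an existing local path is accepted.
        return "file"
    raise ValueError(
        "Invalid 'image_url': expected an 'http' URL, a 'data:image' URL, "
        "or a path to an existing local file."
    )


if __name__ == "__main__":
    print(classify_image_source("http://example.com/cat.png"))
    print(classify_image_source("data:image/png;base64,AAAA"))
```

The error text in this sketch mentions local paths, which the unchanged message in the commit does not. Because the file check runs last, a relative path whose name begins with `http` or `data:image` would never reach it.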