Commit bc3c64aa authored by xiabo

Adapt to ROCm: do not use FlashAttention2

parent b97b62b7
@@ -314,7 +314,7 @@ add_library(transformer-shared SHARED
     $<TARGET_OBJECTS:BaseSamplingLayer>
     $<TARGET_OBJECTS:DynamicDecodeLayer>
     # $<TARGET_OBJECTS:llama_fmha>
-    $<TARGET_OBJECTS:flash_attention2>
+    # $<TARGET_OBJECTS:flash_attention2>
     $<TARGET_OBJECTS:Llama>
     $<TARGET_OBJECTS:LlamaTritonBackend>
     # $<TARGET_OBJECTS:gemm_s4_f16>
# <div align="center"><strong>LMdeploy</strong></div>

## Introduction

LMDeploy is developed jointly by the [MMDeploy](https://github.com/open-mmlab/mmdeploy) and [MMRazor](https://github.com/open-mmlab/mmrazor) teams. It is a full suite of lightweight compression, deployment, and serving solutions for LLM tasks.
This powerful toolbox provides the following core features:

- **Efficient inference engine TurboMind**: based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), we implemented the efficient inference engine TurboMind, which supports inference of InternLM, LLaMA, Vicuna and other models on NVIDIA GPUs.
- **Interactive inference mode**: by caching the attention k/v of multi-turn dialogues, the engine remembers the dialogue history and avoids re-processing historical sessions (a toy illustration follows this list).
- **Multi-GPU deployment and quantization**: we provide comprehensive model deployment and quantization support, validated at different model scales.
- **Persistent batch inference**: further improves model execution efficiency.

The official LMdeploy GitHub repository: [https://github.com/InternLM/lmdeploy](https://github.com/InternLM/lmdeploy)
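To make the interactive inference mode above concrete, here is a toy, assumption-laden sketch of how caching attention k/v across dialogue turns avoids re-encoding the history. The `toy_attention` helper and the random projections are placeholders for illustration only; this is not TurboMind's actual implementation.

```python
import numpy as np

# Toy single-head attention over cached keys/values (illustration only).
def toy_attention(q, k_cache, v_cache):
    scores = q @ k_cache.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v_cache

class InteractiveSession:
    """Keeps the k/v of previous turns so history is not re-encoded."""

    def __init__(self, dim=8):
        self.dim = dim
        self.k_cache = np.empty((0, dim))
        self.v_cache = np.empty((0, dim))

    def step(self, new_tokens):
        # Only the new tokens are projected; the cached history is reused.
        k_new = np.random.randn(len(new_tokens), self.dim)  # stand-in for the K projection
        v_new = np.random.randn(len(new_tokens), self.dim)  # stand-in for the V projection
        self.k_cache = np.concatenate([self.k_cache, k_new])
        self.v_cache = np.concatenate([self.v_cache, v_new])
        q = np.random.randn(len(new_tokens), self.dim)       # stand-in for the Q projection
        return toy_attention(q, self.k_cache, self.v_cache)

session = InteractiveSession()
session.step(["Hello", ","])               # turn 1: cache now holds 2 positions
out = session.step(["how", "are", "you"])  # turn 2: attends over the cached turn-1 k/v
print(out.shape)                           # (3, 8)
```

The point is that turn 2 only projects its three new tokens while still attending over the keys and values cached from turn 1.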
## Installation

### Installing from source

#### Preparing the build environment

Two ways of preparing the environment are provided:

1. Based on the Guangyuan (sourcefind.cn) PyTorch base image. Image download address: [https://sourcefind.cn/#/image/dcu/pytorch](https://sourcefind.cn/#/image/dcu/pytorch); download the image version matching your PyTorch, Python, DTK and operating system. Then install the remaining dependencies:

```shell
pip install -r requirements.txt
pip install transformers==4.33.2
pip install urllib3==1.24
yum install rapidjson
```
2. Based on an existing Python environment: install PyTorch from the whl package directory [https://cancon.hpccube.com:65024/4/main/pytorch/dtk23.04](https://cancon.hpccube.com:65024/4/main/pytorch/dtk23.04), choosing the whl that matches your Python and DTK versions. The install commands are as follows:

```shell
pip install torch*   # the downloaded torch whl package
pip install -r requirements.txt
pip install transformers==4.33.2
pip install urllib3==1.24
yum install rapidjson
```
#### Building and installing from source

- Download the code:

```shell
git clone http://10.0.54.20/xiabo/lmdeploy.git   # switch branches as needed; the default branch is develop
```

- Two build-and-install options are provided (run inside the lmdeploy directory):

```shell
# 1. Build from source and install directly
mkdir build && cd build
sh ../generate.sh
make -j 32 && make install
cd .. && python3 setup.py install

# 2. Build a whl package and install it
mkdir build && cd build
sh ../generate.sh
make -j 32 && make install
cd .. && python3 setup.py bdist_wheel
cd dist && pip3 install lmdeploy*
```
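Either way, a quick post-install sanity check (a suggestion, not part of the original build steps) is to import the package from Python and print its version:

```python
# Fails with ImportError if lmdeploy (or its compiled extension) did not
# install into the active environment.
import lmdeploy

print(lmdeploy.__version__)  # e.g. 0.0.6, kept in sync with the upstream release
```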
## Model serving

### Deploying a [LLaMA-2](https://github.com/facebookresearch/llama) service

Download the llama2 model from [here](https://huggingface.co/meta-llama), then deploy the service with the following commands.
Taking 7B as an example:

```shell
# 1. Convert the model
python3 -m lmdeploy.serve.turbomind.deploy llama2 path/to/chinese-llama2-7b-hf hf path/to/chinese-llama2-7b-hf/tokenizer.model ./workspace_llama
# 2. Run
# - in the command-line interface:
python3 -m lmdeploy.turbomind.chat ./workspace_llama
# - as a web server:
python3 -m lmdeploy.serve.gradio.app ./workspace_llama 10.6.10.67
# then open 10.6.10.67:6006 in a browser
```
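If the convert-and-run steps need to be repeated for several checkpoints, the documented commands above can also be driven from a small script. This is only a convenience sketch wrapped around those exact commands; the model name and paths are placeholders for your local checkout.

```python
import subprocess

# Placeholders: point these at your local checkpoint and output directory.
MODEL_NAME = "llama2"
MODEL_PATH = "path/to/chinese-llama2-7b-hf"
WORKSPACE = "./workspace_llama"

# 1. Convert the HF checkpoint into TurboMind's workspace format.
subprocess.run(
    ["python3", "-m", "lmdeploy.serve.turbomind.deploy",
     MODEL_NAME, MODEL_PATH, "hf", f"{MODEL_PATH}/tokenizer.model", WORKSPACE],
    check=True,
)

# 2. Start an interactive chat session on the converted workspace.
subprocess.run(["python3", "-m", "lmdeploy.turbomind.chat", WORKSPACE], check=True)
```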
### Deploying an [internlm](https://huggingface.co/internlm/) service

Download the internlm model from [here](https://huggingface.co/internlm), then deploy the service with the following commands.
Taking 7B as an example:

```shell
# 1. Convert the model
python3 -m lmdeploy.serve.turbomind.deploy path/to/internlm-chat-7b internlm-chat-7b hf None ./workspace_intern
# 2. Run
# - in the command-line interface:
python3 -m lmdeploy.turbomind.chat ./workspace_intern
# - as a web server:
python3 -m lmdeploy.serve.gradio.app ./workspace_intern 10.6.10.67
# then open 10.6.10.67:6006 in a browser
```
### For more details, see the [docs](./docs/zh_cn/serving.md)

## Checking the version

- `python -c "import lmdeploy; print(lmdeploy.__version__)"` reports this package's version, which is kept in sync with the upstream release, e.g. 0.0.6.

## Known Issue

-

## Note

+ If `pip install` is slow, add the Tsinghua PyPI mirror: `-i https://pypi.tuna.tsinghua.edu.cn/simple/`

## Other references

- [README_origin](README_origin.md)
- [README_zh-CN](README_zh-CN.md)
<div align="center">
<img src="resources/lmdeploy-logo.svg" width="450"/>
[![docs](https://img.shields.io/badge/docs-latest-blue)](https://lmdeploy.readthedocs.io/en/latest/)
[![badge](https://github.com/InternLM/lmdeploy/workflows/lint/badge.svg)](https://github.com/InternLM/lmdeploy/actions)
[![PyPI](https://img.shields.io/pypi/v/lmdeploy)](https://pypi.org/project/lmdeploy)
[![license](https://img.shields.io/github/license/InternLM/lmdeploy.svg)](https://github.com/InternLM/lmdeploy/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/InternLM/lmdeploy)](https://github.com/InternLM/lmdeploy/issues)
[![open issues](https://img.shields.io/github/issues-raw/InternLM/lmdeploy)](https://github.com/InternLM/lmdeploy/issues)
English | [简体中文](README_zh-CN.md)
</div>
<p align="center">
👋 join us on <a href="https://twitter.com/intern_lm" target="_blank">Twitter</a>, <a href="https://discord.gg/xa29JuW87d" target="_blank">Discord</a> and <a href="https://r.vansin.top/?r=internwx" target="_blank">WeChat</a>
</p>
______________________________________________________________________
## News 🎉
- \[2023/09\] TurboMind supports Qwen-14B
- \[2023/09\] TurboMind supports InternLM-20B
- \[2023/09\] TurboMind supports all features of Code Llama: code completion, infilling, chat / instruct, and python specialist. Click [here](./docs/en/supported_models/codellama.md) for deployment guide
- \[2023/09\] TurboMind supports Baichuan2-7B
- \[2023/08\] TurboMind supports flash-attention2.
- \[2023/08\] TurboMind supports Qwen-7B, dynamic NTK-RoPE scaling and dynamic logN scaling
- \[2023/08\] TurboMind supports Windows (tp=1)
- \[2023/08\] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation🚀. Check [this](./docs/en/w4a16.md) guide for detailed info
- \[2023/08\] LMDeploy has launched on the [HuggingFace Hub](https://huggingface.co/lmdeploy), providing ready-to-use 4-bit models.
- \[2023/08\] LMDeploy supports 4-bit quantization using the [AWQ](https://arxiv.org/abs/2306.00978) algorithm.
- \[2023/07\] TurboMind supports Llama-2 70B with GQA.
- \[2023/07\] TurboMind supports Llama-2 7B/13B.
- \[2023/07\] TurboMind supports tensor-parallel inference of InternLM.
______________________________________________________________________
## Introduction
LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. It has the following core features:
- **Efficient Inference Engine (TurboMind)**: Based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), we have implemented an efficient inference engine - TurboMind, which supports the inference of LLaMA and its variant models on NVIDIA GPUs.
- **Interactive Inference Mode**: By caching the k/v of attention during multi-round dialogue processes, it remembers dialogue history, thus avoiding repetitive processing of historical sessions.
- **Multi-GPU Model Deployment and Quantization**: We provide comprehensive model deployment and quantization support, validated at different scales.
- **Persistent Batch Inference**: Further optimization of model execution efficiency (a toy scheduling sketch follows the figure below).
![PersistentBatchInference](https://github.com/InternLM/lmdeploy/assets/67539920/e3876167-0671-44fc-ac52-5a0f9382493e)
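To illustrate the persistent batch idea, here is a toy scheduler in which finished requests leave the running batch and waiting requests are admitted between decode steps, so the batch never drains. It is a conceptual sketch only, not TurboMind's actual scheduler; `decode_step` is a stand-in for a real forward pass.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    generated: list = field(default_factory=list)

def decode_step(batch):
    # Stand-in for one forward pass that appends one token per running request.
    for req in batch:
        req.generated.append("<tok>")

def persistent_batch_loop(requests, max_batch_size=4):
    waiting = deque(requests)
    running = []
    while waiting or running:
        # Admit new requests into free slots instead of waiting for the batch to drain.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        decode_step(running)
        # Retire finished requests immediately, freeing their slots for waiting ones.
        finished = [r for r in running if len(r.generated) >= r.max_new_tokens]
        running = [r for r in running if r not in finished]
        for r in finished:
            yield r

reqs = [Request(f"prompt-{i}", max_new_tokens=2 + i % 3) for i in range(6)]
for done in persistent_batch_loop(reqs):
    print(done.prompt, len(done.generated))
```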
## Supported Models
`LMDeploy` has two inference backends, `Pytorch` and `TurboMind`.
### TurboMind
> **Note**<br />
> W4A16 inference requires an NVIDIA GPU with Ampere architecture or above.
| Models | Tensor Parallel | FP16 | KV INT8 | W4A16 | W8A8 |
| :----------: | :-------------: | :--: | :-----: | :---: | :--: |
| Llama | Yes | Yes | Yes | Yes | No |
| Llama2 | Yes | Yes | Yes | Yes | No |
| SOLAR | Yes | Yes | Yes | Yes | No |
| InternLM-7B | Yes | Yes | Yes | Yes | No |
| InternLM-20B | Yes | Yes | Yes | Yes | No |
| QWen-7B | Yes | Yes | Yes | No | No |
| QWen-14B | Yes | Yes | Yes | No | No |
| Baichuan-7B | Yes | Yes | Yes | Yes | No |
| Baichuan2-7B | Yes | Yes | No | No | No |
| Code Llama | Yes | Yes | No | No | No |
### Pytorch
| Models | Tensor Parallel | FP16 | KV INT8 | W4A16 | W8A8 |
| :---------: | :-------------: | :--: | :-----: | :---: | :--: |
| Llama | Yes | Yes | No | No | No |
| Llama2 | Yes | Yes | No | No | No |
| InternLM-7B | Yes | Yes | No | No | No |
## Performance
**Case I**: output token throughput with fixed input token and output token number (1, 2048)
**Case II**: request throughput with real conversation data
Test Setting: LLaMA-7B, NVIDIA A100(80G)
The output token throughput of TurboMind exceeds 2000 tokens/s, which is about 5% - 15% higher than DeepSpeed overall and outperforms huggingface transformers by up to 2.3x.
The request throughput of TurboMind is 30% higher than that of vLLM.
![benchmark](https://github.com/InternLM/lmdeploy/assets/4560679/7775c518-608e-4e5b-be73-7645a444e774)
## Quick Start
### Installation
Install lmdeploy with pip (Python 3.8+) or [from source](./docs/en/build.md)
```shell
pip install lmdeploy
```
### Deploy InternLM
#### Get InternLM model
```shell
# 1. Download InternLM model
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/internlm/internlm-chat-7b-v1_1 /path/to/internlm-chat-7b
# if you want to clone without large files – just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1
# 2. Convert InternLM model to turbomind's format, which will be in "./workspace" by default
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b
```
#### Inference by TurboMind
```shell
lmdeploy chat turbomind ./workspace
```
> **Note**<br />
> When inferring with FP16 precision, the InternLM-7B model requires at least 15.7G of GPU memory overhead on TurboMind. <br />
> It is recommended to use NVIDIA cards such as 3090, V100, A100, etc.
> Disabling GPU ECC can free up 10% of GPU memory; try `sudo nvidia-smi --ecc-config=0` and reboot the system.
> **Note**<br />
> Tensor parallelism is available for inference on multiple GPUs. Add `--tp=<num_gpu>` to the `chat` command to enable runtime TP.
#### Serving with gradio
```shell
lmdeploy serve gradio ./workspace
```
![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
#### Serving with Restful API
Launch inference server by:
```shell
lmdeploy serve api_server ./workspace --instance_num 32 --tp 1
```
Then, you can communicate with it by command line,
```shell
# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
lmdeploy serve api_client restful_api_url
```
or webui,
```shell
# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
# server_ip and server_port here are for gradio ui
# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006 --restful_api True
lmdeploy serve gradio restful_api_url --server_name ${server_ip} --server_port ${server_port} --restful_api True
```
Refer to [restful_api.md](docs/en/restful_api.md) for more details.
#### Serving with Triton Inference Server
Launch inference server by:
```shell
bash workspace/service_docker_up.sh
```
Then, you can communicate with the inference server by command line,
```shell
lmdeploy serve triton_client {server_ip_addresss}:33337
```
or webui,
```shell
lmdeploy serve gradio {server_ip_addresss}:33337
```
For the deployment of other supported models, such as LLaMA, LLaMA-2, Vicuna and so on, you can find the guide [here](docs/en/serving.md).
### Inference with PyTorch
For detailed instructions on running inference with PyTorch models, see [here](docs/en/pytorch.md).
#### Single GPU
```shell
lmdeploy chat torch $NAME_OR_PATH_TO_HF_MODEL \
--max_new_tokens 64 \
--temperature 0.8 \
--top_p 0.95 \
--seed 0
```
#### Tensor Parallel with DeepSpeed
```shell
deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
$NAME_OR_PATH_TO_HF_MODEL \
--max_new_tokens 64 \
--temperature 0.8 \
--top_p 0.95 \
--seed 0
```
You need to install deepspeed first to use this feature.
```
pip install deepspeed
```
## Quantization
#### Weight INT4 Quantization
LMDeploy uses the [AWQ](https://arxiv.org/abs/2306.00978) algorithm for model weight quantization.
[Click here](./docs/en/w4a16.md) to view the test results for weight int4 usage.
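For intuition about what W4 (int4) weight storage means, the sketch below packs eight 4-bit codes into one 32-bit word and unpacks them again; this mirrors the 8:1 element-count relationship checked in the turbomind bindings later in this commit (`src_count * 8 == dst_count`). The exact bit layout, scales and zero points used by LMDeploy/AWQ are not shown here; this is an assumption-laden illustration only.

```python
def pack_int4(values):
    """Pack unsigned 4-bit values (0..15) into 32-bit words, 8 per word."""
    assert len(values) % 8 == 0 and all(0 <= v < 16 for v in values)
    words = []
    for i in range(0, len(values), 8):
        word = 0
        for j, v in enumerate(values[i:i + 8]):
            word |= v << (4 * j)   # value j occupies bits [4j, 4j+4)
        words.append(word)
    return words

def unpack_int4(words):
    """Inverse of pack_int4: each 32-bit word yields 8 unsigned 4-bit values."""
    return [(w >> (4 * j)) & 0xF for w in words for j in range(8)]

vals = list(range(16))       # 16 quantized weight codes
packed = pack_int4(vals)     # 2 words -> 8x fewer elements, as in src_count * 8 == dst_count
assert unpack_int4(packed) == vals
```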
#### KV Cache INT8 Quantization
[Click here](./docs/en/kv_int8.md) to view the usage method, implementation formula, and test results for kv int8.
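The kv_int8 document linked above describes the actual formula; as a rough, assumption-based illustration, per-tensor int8 quantization of a cached k/v block can look like the following (symmetric scaling only; LMDeploy's exact scheme, e.g. per-channel scales or zero points, may differ).

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: x ~= q * scale."""
    max_abs = np.abs(x).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

kv = np.random.randn(4, 128).astype(np.float32)   # toy cached k/v block
q, scale = quantize_int8(kv)
kv_hat = dequantize_int8(q, scale)
print("memory: %d -> %d bytes, max abs error %.4f"
      % (kv.nbytes, q.nbytes, np.abs(kv - kv_hat).max()))
```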
> **Warning**<br />
> Runtime tensor parallelism for quantized models is not available. Please set `--tp` on `deploy` to enable static TP.
## Contributing
We appreciate all contributions to LMDeploy. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guideline.
## Acknowledgement
- [FasterTransformer](https://github.com/NVIDIA/FasterTransformer)
- [llm-awq](https://github.com/mit-han-lab/llm-awq)
## License
This project is released under the [Apache 2.0 license](LICENSE).
@@ -45,8 +45,8 @@ target_link_libraries(Llama PUBLIC cudart
                       # llama_fmha)
 if (NOT MSVC)
-    add_subdirectory(flash_attention2)
-    target_link_libraries(Llama PUBLIC flash_attention2)
+    # add_subdirectory(flash_attention2)
+    # target_link_libraries(Llama PUBLIC flash_attention2)
 endif()
 add_executable(llama_gemm llama_gemm.cc)
@@ -737,49 +737,49 @@ void invokeGatherOutput(int* output_ids,
     } \
     }()
-template<typename T>
-FlashAttentionOp<T>::FlashAttentionOp(int batch_size, int head_num, int key_len, int seq_len, int size_per_head):
-    batch_size_(batch_size), head_num_(head_num), key_len_(key_len), seq_len_(seq_len), size_per_head_(size_per_head)
-{
-#ifdef _MSC_VER
-    op_version_ = 1;
-#else
-    op_version_ = std::is_same<half, typename std::decay<T>::type>::value ? 2 : 1;
-    if (op_version_ == 2 && getSMVersion() < 80) {
-        op_version_ = 1;
-    }
-#endif
-}
-template<typename T>
-int FlashAttentionOp<T>::get_workspace_size() const
-{
-#ifdef _MSC_VER
-    FlashAttentionOpImpl<T, 1> attention_op(batch_size_, head_num_, key_len_, seq_len_, size_per_head_);
-    return attention_op.get_workspace_size();
-#else
-    return VERSION_SWITCH(op_version_, OP_VERSION, [&]() {
-        FlashAttentionOpImpl<T, OP_VERSION> attention_op(batch_size_, head_num_, key_len_, seq_len_, size_per_head_);
-        return attention_op.get_workspace_size();
-    });
-#endif
-}
-template<typename T>
-void FlashAttentionOp<T>::operator()(Params& params, cudaStream_t st) const
-{
-#ifdef _MSC_VER
-    FlashAttentionOpImpl<T, 1> attention_op(batch_size_, head_num_, key_len_, seq_len_, size_per_head_);
-    return attention_op(params, st);
-#else
-    return VERSION_SWITCH(op_version_, OP_VERSION, [&]() {
-        FlashAttentionOpImpl<T, OP_VERSION> attention_op(batch_size_, head_num_, key_len_, seq_len_, size_per_head_);
-        return attention_op(params, st);
-    });
-#endif
-}
-template class FlashAttentionOp<float>;
-template class FlashAttentionOp<half>;
+// template<typename T>
+// FlashAttentionOp<T>::FlashAttentionOp(int batch_size, int head_num, int key_len, int seq_len, int size_per_head):
+//     batch_size_(batch_size), head_num_(head_num), key_len_(key_len), seq_len_(seq_len), size_per_head_(size_per_head)
+// {
+// #ifdef _MSC_VER
+//     op_version_ = 1;
+// #else
+//     op_version_ = std::is_same<half, typename std::decay<T>::type>::value ? 2 : 1;
+//     if (op_version_ == 2 && getSMVersion() < 80) {
+//         op_version_ = 1;
+//     }
+// #endif
+// }
+// template<typename T>
+// int FlashAttentionOp<T>::get_workspace_size() const
+// {
+// #ifdef _MSC_VER
+//     FlashAttentionOpImpl<T, 1> attention_op(batch_size_, head_num_, key_len_, seq_len_, size_per_head_);
+//     return attention_op.get_workspace_size();
+// #else
+//     return VERSION_SWITCH(op_version_, OP_VERSION, [&]() {
+//         FlashAttentionOpImpl<T, OP_VERSION> attention_op(batch_size_, head_num_, key_len_, seq_len_, size_per_head_);
+//         return attention_op.get_workspace_size();
+//     });
+// #endif
+// }
+// template<typename T>
+// void FlashAttentionOp<T>::operator()(Params& params, cudaStream_t st) const
+// {
+// #ifdef _MSC_VER
+//     FlashAttentionOpImpl<T, 1> attention_op(batch_size_, head_num_, key_len_, seq_len_, size_per_head_);
+//     return attention_op(params, st);
+// #else
+//     return VERSION_SWITCH(op_version_, OP_VERSION, [&]() {
+//         FlashAttentionOpImpl<T, OP_VERSION> attention_op(batch_size_, head_num_, key_len_, seq_len_, size_per_head_);
+//         return attention_op(params, st);
+//     });
+// #endif
+// }
+// template class FlashAttentionOp<float>;
+// template class FlashAttentionOp<half>;
 }  // namespace turbomind
@@ -462,6 +462,6 @@ PYBIND11_MODULE(_turbomind, m)
             auto src_count = std::accumulate(src_tensor.shape, src_tensor.shape + src_tensor.ndim, size_t{1});
             auto dst_count = std::accumulate(dst_tensor.shape, dst_tensor.shape + dst_tensor.ndim, size_t{1});
             turbomind::FT_CHECK(src_count * 8 == dst_count);
-            turbomind::dequantize_s4((uint4*)dst_tensor.data, (uint32_t*)src_tensor.data, src_count, nullptr);
+            // turbomind::dequantize_s4((uint4*)dst_tensor.data, (uint32_t*)src_tensor.data, src_count, nullptr);
         });
 }