👋 Join us on <a href="https://twitter.com/intern_lm" target="_blank">Twitter</a>, <a href="https://discord.gg/xa29JuW87d" target="_blank">Discord</a> and <a href="https://r.vansin.top/?r=internwx" target="_blank">WeChat</a>
- \[2023/09\] TurboMind supports all features of Code Llama: code completion, infilling, chat / instruct, and python specialist. Click [here](./docs/en/supported_models/codellama.md) for the deployment guide
- \[2023/08\] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation 🚀. Check [this](./docs/en/w4a16.md) guide for details
- \[2023/08\] LMDeploy has launched on the [HuggingFace Hub](https://huggingface.co/lmdeploy), providing ready-to-use 4-bit models.
- \[2023/08\] LMDeploy supports 4-bit quantization using the [AWQ](https://arxiv.org/abs/2306.00978) algorithm.
- \[2023/07\] TurboMind supports Llama-2 70B with GQA.
- \[2023/07\] TurboMind supports Llama-2 7B/13B.
- \[2023/07\] TurboMind supports tensor-parallel inference of InternLM.
LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. It has the following core features (a minimal usage sketch follows the list):
- **Efficient Inference Engine (TurboMind)**: Based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), we have implemented an efficient inference engine, TurboMind, which supports the inference of LLaMA and its variant models on NVIDIA GPUs.
- **Interactive Inference Mode**: By caching the attention k/v across multi-round dialogues, the engine remembers dialogue history and avoids re-processing historical sessions.
- **Multi-GPU Model Deployment and Quantization**: We provide comprehensive model deployment and quantization support, validated across different model scales.
- **Persistent Batch Inference**: Further optimization of model execution efficiency by continuously batching incoming requests.
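
The sketch below illustrates what using these features can look like from Python. It is a minimal example under stated assumptions: it uses the `pipeline` API that LMDeploy exposes in later releases (the entry points available at the time of this README may differ; see the linked docs), and `internlm/internlm-chat-7b` is only an illustrative model id.

```python
# Minimal sketch, assuming LMDeploy's `pipeline` API from later releases.
from lmdeploy import pipeline

# Build a TurboMind-backed pipeline for a LLaMA-variant chat model.
# The model id is illustrative; any supported chat model works the same way.
pipe = pipeline("internlm/internlm-chat-7b")

# Persistent batching schedules these prompts together on the engine, and the
# attention k/v cache keeps dialogue history from being re-processed.
responses = pipe([
    "Introduce yourself in one sentence.",
    "Explain tensor parallelism briefly.",
])
for r in responses:
    print(r.text)
```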
**Case I**: output token throughput with fixed input and output token numbers (1 input token, 2048 output tokens)
**Case II**: request throughput with real conversation data
Test setting: LLaMA-7B, NVIDIA A100 (80GB)
In Case I, the output token throughput of TurboMind exceeds 2000 tokens/s, which is about 5%-15% higher than DeepSpeed overall and up to 2.3x that of Hugging Face Transformers.
In Case II, the request throughput of TurboMind is 30% higher than that of vLLM.
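
For context on how Case I-style numbers can be produced, here is an illustrative measurement sketch. It is not the benchmark script behind the figures above; it assumes the `pipeline` and `GenerationConfig` APIs from later LMDeploy releases, and the model id is again only an example.

```python
# Illustrative throughput sketch (not the official benchmark), assuming the
# `pipeline` / `GenerationConfig` APIs from later LMDeploy releases.
import time
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline("internlm/internlm-chat-7b")  # example model id
# Force a fixed output length of 2048 tokens, ignoring EOS, to mimic Case I.
gen_cfg = GenerationConfig(max_new_tokens=2048, ignore_eos=True)

start = time.perf_counter()
# A single very short prompt approximates the "1 input token" setting.
outputs = pipe(["Hi"], gen_config=gen_cfg)
elapsed = time.perf_counter() - start

total_output_tokens = sum(o.generate_token_len for o in outputs)
print(f"output token throughput: {total_output_tokens / elapsed:.1f} tokens/s")
```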