Commit bc3c64aa authored by xiabo

Adapt to ROCm: do not use FlashAttention2

parent b97b62b7
@@ -314,7 +314,7 @@ add_library(transformer-shared SHARED
     $<TARGET_OBJECTS:BaseSamplingLayer>
     $<TARGET_OBJECTS:DynamicDecodeLayer>
     # $<TARGET_OBJECTS:llama_fmha>
-    $<TARGET_OBJECTS:flash_attention2>
+    # $<TARGET_OBJECTS:flash_attention2>
     $<TARGET_OBJECTS:Llama>
     $<TARGET_OBJECTS:LlamaTritonBackend>
     # $<TARGET_OBJECTS:gemm_s4_f16>
# <div align="center"><strong>LMdeploy</strong></div>

## Introduction

LMDeploy is developed jointly by the [MMDeploy](https://github.com/open-mmlab/mmdeploy) and [MMRazor](https://github.com/open-mmlab/mmrazor) teams. It is a full suite of lightweight compression, deployment, and serving solutions for LLM tasks.
This powerful toolbox provides the following core features:

- **Efficient inference engine TurboMind**: based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), we implemented the efficient inference engine TurboMind, which supports inference of InternLM, LLaMA, Vicuna and other models on NVIDIA GPUs.
- **Interactive inference mode**: by caching the attention k/v of multi-turn dialogues, the engine remembers the dialogue history and avoids re-processing historical sessions (a toy illustration follows this list).
- **Multi-GPU deployment and quantization**: we provide comprehensive model deployment and quantization support, validated at different model scales.
- **Persistent batch inference**: further improves model execution efficiency.

The official LMdeploy GitHub repository: [https://github.com/InternLM/lmdeploy](https://github.com/InternLM/lmdeploy)
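To make the interactive inference mode above concrete, here is a toy, assumption-laden sketch of how caching attention k/v across dialogue turns avoids re-encoding the history. The `toy_attention` helper and the random projections are placeholders for illustration only; this is not TurboMind's actual implementation.

```python
import numpy as np

# Toy single-head attention over cached keys/values (illustration only).
def toy_attention(q, k_cache, v_cache):
    scores = q @ k_cache.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v_cache

class InteractiveSession:
    """Keeps the k/v of previous turns so history is not re-encoded."""

    def __init__(self, dim=8):
        self.dim = dim
        self.k_cache = np.empty((0, dim))
        self.v_cache = np.empty((0, dim))

    def step(self, new_tokens):
        # Only the new tokens are projected; the cached history is reused.
        k_new = np.random.randn(len(new_tokens), self.dim)  # stand-in for the K projection
        v_new = np.random.randn(len(new_tokens), self.dim)  # stand-in for the V projection
        self.k_cache = np.concatenate([self.k_cache, k_new])
        self.v_cache = np.concatenate([self.v_cache, v_new])
        q = np.random.randn(len(new_tokens), self.dim)       # stand-in for the Q projection
        return toy_attention(q, self.k_cache, self.v_cache)

session = InteractiveSession()
session.step(["Hello", ","])               # turn 1: cache now holds 2 positions
out = session.step(["how", "are", "you"])  # turn 2: attends over the cached turn-1 k/v
print(out.shape)                           # (3, 8)
```

The point is that turn 2 only projects its three new tokens while still attending over the keys and values cached from turn 1.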
## Installation

### Installing from source

#### Preparing the build environment

Two ways of preparing the environment are provided:

1. Based on the Guangyuan (sourcefind.cn) PyTorch base image. Image download address: [https://sourcefind.cn/#/image/dcu/pytorch](https://sourcefind.cn/#/image/dcu/pytorch); download the image version matching your PyTorch, Python, DTK and operating system. Then install the remaining dependencies:

```shell
pip install -r requirements.txt
pip install transformers==4.33.2
pip install urllib3==1.24
yum install rapidjson
```
2. Based on an existing Python environment: install PyTorch from the whl package directory [https://cancon.hpccube.com:65024/4/main/pytorch/dtk23.04](https://cancon.hpccube.com:65024/4/main/pytorch/dtk23.04), choosing the whl that matches your Python and DTK versions. The install commands are as follows:

```shell
pip install torch*   # the downloaded torch whl package
pip install -r requirements.txt
pip install transformers==4.33.2
pip install urllib3==1.24
yum install rapidjson
```
#### Building and installing from source

- Download the code:

```shell
git clone http://10.0.54.20/xiabo/lmdeploy.git   # switch branches as needed; the default branch is develop
```

- Two build-and-install options are provided (run inside the lmdeploy directory):

```shell
# 1. Build from source and install directly
mkdir build && cd build
sh ../generate.sh
make -j 32 && make install
cd .. && python3 setup.py install

# 2. Build a whl package and install it
mkdir build && cd build
sh ../generate.sh
make -j 32 && make install
cd .. && python3 setup.py bdist_wheel
cd dist && pip3 install lmdeploy*
```
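Either way, a quick post-install sanity check (a suggestion, not part of the original build steps) is to import the package from Python and print its version:

```python
# Fails with ImportError if lmdeploy (or its compiled extension) did not
# install into the active environment.
import lmdeploy

print(lmdeploy.__version__)  # e.g. 0.0.6, kept in sync with the upstream release
```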
## Model serving

### Deploying a [LLaMA-2](https://github.com/facebookresearch/llama) service

Download the llama2 model from [here](https://huggingface.co/meta-llama), then deploy the service with the following commands.
Taking 7B as an example:

```shell
# 1. Convert the model
python3 -m lmdeploy.serve.turbomind.deploy llama2 path/to/chinese-llama2-7b-hf hf path/to/chinese-llama2-7b-hf/tokenizer.model ./workspace_llama
# 2. Run
# - in the command-line interface:
python3 -m lmdeploy.turbomind.chat ./workspace_llama
# - as a web server:
python3 -m lmdeploy.serve.gradio.app ./workspace_llama 10.6.10.67
# then open 10.6.10.67:6006 in a browser
```
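If the convert-and-run steps need to be repeated for several checkpoints, the documented commands above can also be driven from a small script. This is only a convenience sketch wrapped around those exact commands; the model name and paths are placeholders for your local checkout.

```python
import subprocess

# Placeholders: point these at your local checkpoint and output directory.
MODEL_NAME = "llama2"
MODEL_PATH = "path/to/chinese-llama2-7b-hf"
WORKSPACE = "./workspace_llama"

# 1. Convert the HF checkpoint into TurboMind's workspace format.
subprocess.run(
    ["python3", "-m", "lmdeploy.serve.turbomind.deploy",
     MODEL_NAME, MODEL_PATH, "hf", f"{MODEL_PATH}/tokenizer.model", WORKSPACE],
    check=True,
)

# 2. Start an interactive chat session on the converted workspace.
subprocess.run(["python3", "-m", "lmdeploy.turbomind.chat", WORKSPACE], check=True)
```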
### Deploying an [internlm](https://huggingface.co/internlm/) service

Download the internlm model from [here](https://huggingface.co/internlm), then deploy the service with the following commands.
Taking 7B as an example:

```shell
# 1. Convert the model
python3 -m lmdeploy.serve.turbomind.deploy path/to/internlm-chat-7b internlm-chat-7b hf None ./workspace_intern
# 2. Run
# - in the command-line interface:
python3 -m lmdeploy.turbomind.chat ./workspace_intern
# - as a web server:
python3 -m lmdeploy.serve.gradio.app ./workspace_intern 10.6.10.67
# then open 10.6.10.67:6006 in a browser
```
### For more details, see the [docs](./docs/zh_cn/serving.md)

## Checking the version

- `python -c "import lmdeploy; print(lmdeploy.__version__)"` reports this package's version, which is kept in sync with the upstream release, e.g. 0.0.6.

## Known Issue

-

## Note

+ If `pip install` is slow, add the Tsinghua PyPI mirror: `-i https://pypi.tuna.tsinghua.edu.cn/simple/`

## Other references

- [README_origin](README_origin.md)
- [README_zh-CN](README_zh-CN.md)
<div align="center">
<img src="resources/lmdeploy-logo.svg" width="450"/>
[![docs](https://img.shields.io/badge/docs-latest-blue)](https://lmdeploy.readthedocs.io/en/latest/)
[![badge](https://github.com/InternLM/lmdeploy/workflows/lint/badge.svg)](https://github.com/InternLM/lmdeploy/actions)
[![PyPI](https://img.shields.io/pypi/v/lmdeploy)](https://pypi.org/project/lmdeploy)
[![license](https://img.shields.io/github/license/InternLM/lmdeploy.svg)](https://github.com/InternLM/lmdeploy/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/InternLM/lmdeploy)](https://github.com/InternLM/lmdeploy/issues)
[![open issues](https://img.shields.io/github/issues-raw/InternLM/lmdeploy)](https://github.com/InternLM/lmdeploy/issues)
English | [简体中文](README_zh-CN.md)
</div>
<p align="center">
👋 join us on <a href="https://twitter.com/intern_lm" target="_blank">Twitter</a>, <a href="https://discord.gg/xa29JuW87d" target="_blank">Discord</a> and <a href="https://r.vansin.top/?r=internwx" target="_blank">WeChat</a>
</p>
______________________________________________________________________
## News 🎉
- \[2023/09\] TurboMind supports Qwen-14B
- \[2023/09\] TurboMind supports InternLM-20B
- \[2023/09\] TurboMind supports all features of Code Llama: code completion, infilling, chat / instruct, and python specialist. Click [here](./docs/en/supported_models/codellama.md) for deployment guide
- \[2023/09\] TurboMind supports Baichuan2-7B
- \[2023/08\] TurboMind supports flash-attention2.
- \[2023/08\] TurboMind supports Qwen-7B, dynamic NTK-RoPE scaling and dynamic logN scaling
- \[2023/08\] TurboMind supports Windows (tp=1)
- \[2023/08\] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation🚀. Check [this](./docs/en/w4a16.md) guide for detailed info
- \[2023/08\] LMDeploy has launched on the [HuggingFace Hub](https://huggingface.co/lmdeploy), providing ready-to-use 4-bit models.
- \[2023/08\] LMDeploy supports 4-bit quantization using the [AWQ](https://arxiv.org/abs/2306.00978) algorithm.
- \[2023/07\] TurboMind supports Llama-2 70B with GQA.
- \[2023/07\] TurboMind supports Llama-2 7B/13B.
- \[2023/07\] TurboMind supports tensor-parallel inference of InternLM.
______________________________________________________________________
## Introduction
LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. It has the following core features:
- **Efficient Inference Engine (TurboMind)**: Based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), we have implemented an efficient inference engine - TurboMind, which supports the inference of LLaMA and its variant models on NVIDIA GPUs.
- **Interactive Inference Mode**: By caching the k/v of attention during multi-round dialogue processes, it remembers dialogue history, thus avoiding repetitive processing of historical sessions.
- **Multi-GPU Model Deployment and Quantization**: We provide comprehensive model deployment and quantization support, validated at different scales.
- **Persistent Batch Inference**: Further optimization of model execution efficiency (a toy scheduling sketch follows the figure below).
![PersistentBatchInference](https://github.com/InternLM/lmdeploy/assets/67539920/e3876167-0671-44fc-ac52-5a0f9382493e)
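To illustrate the persistent batch idea, here is a toy scheduler in which finished requests leave the running batch and waiting requests are admitted between decode steps, so the batch never drains. It is a conceptual sketch only, not TurboMind's actual scheduler; `decode_step` is a stand-in for a real forward pass.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    generated: list = field(default_factory=list)

def decode_step(batch):
    # Stand-in for one forward pass that appends one token per running request.
    for req in batch:
        req.generated.append("<tok>")

def persistent_batch_loop(requests, max_batch_size=4):
    waiting = deque(requests)
    running = []
    while waiting or running:
        # Admit new requests into free slots instead of waiting for the batch to drain.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        decode_step(running)
        # Retire finished requests immediately, freeing their slots for waiting ones.
        finished = [r for r in running if len(r.generated) >= r.max_new_tokens]
        running = [r for r in running if r not in finished]
        for r in finished:
            yield r

reqs = [Request(f"prompt-{i}", max_new_tokens=2 + i % 3) for i in range(6)]
for done in persistent_batch_loop(reqs):
    print(done.prompt, len(done.generated))
```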
## Supported Models
`LMDeploy` has two inference backends, `Pytorch` and `TurboMind`.
### TurboMind
> **Note**<br />
> W4A16 inference requires an NVIDIA GPU with Ampere architecture or above.
| Models | Tensor Parallel | FP16 | KV INT8 | W4A16 | W8A8 |
| :----------: | :-------------: | :--: | :-----: | :---: | :--: |
| Llama | Yes | Yes | Yes | Yes | No |
| Llama2 | Yes | Yes | Yes | Yes | No |
| SOLAR | Yes | Yes | Yes | Yes | No |
| InternLM-7B | Yes | Yes | Yes | Yes | No |
| InternLM-20B | Yes | Yes | Yes | Yes | No |
| QWen-7B | Yes | Yes | Yes | No | No |
| QWen-14B | Yes | Yes | Yes | No | No |
| Baichuan-7B | Yes | Yes | Yes | Yes | No |
| Baichuan2-7B | Yes | Yes | No | No | No |
| Code Llama | Yes | Yes | No | No | No |
### Pytorch
| Models | Tensor Parallel | FP16 | KV INT8 | W4A16 | W8A8 |
| :---------: | :-------------: | :--: | :-----: | :---: | :--: |
| Llama | Yes | Yes | No | No | No |
| Llama2 | Yes | Yes | No | No | No |
| InternLM-7B | Yes | Yes | No | No | No |
## Performance
**Case I**: output token throughput with fixed input token and output token number (1, 2048)
**Case II**: request throughput with real conversation data
Test Setting: LLaMA-7B, NVIDIA A100(80G)
The output token throughput of TurboMind exceeds 2000 tokens/s, which is about 5% - 15% higher than DeepSpeed overall and outperforms huggingface transformers by up to 2.3x.
The request throughput of TurboMind is 30% higher than that of vLLM.
![benchmark](https://github.com/InternLM/lmdeploy/assets/4560679/7775c518-608e-4e5b-be73-7645a444e774)
## Quick Start
### Installation
Install lmdeploy with pip (Python 3.8+) or [from source](./docs/en/build.md)
```shell
pip install lmdeploy
```
### Deploy InternLM
#### Get InternLM model
```shell
# 1. Download InternLM model
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/internlm/internlm-chat-7b-v1_1 /path/to/internlm-chat-7b
# if you want to clone without large files – just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1
# 2. Convert InternLM model to turbomind's format, which will be in "./workspace" by default
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b
```
#### Inference by TurboMind
```shell
lmdeploy chat turbomind ./workspace
```
> **Note**<br />
> When inferring with FP16 precision, the InternLM-7B model requires at least 15.7G of GPU memory overhead on TurboMind. <br />
> It is recommended to use NVIDIA cards such as 3090, V100, A100, etc.
> Disabling GPU ECC can free up 10% of GPU memory; try `sudo nvidia-smi --ecc-config=0` and reboot the system.
> **Note**<br />
> Tensor parallelism is available for inference on multiple GPUs. Add `--tp=<num_gpu>` to the `chat` command to enable runtime TP.
#### Serving with gradio
```shell
lmdeploy serve gradio ./workspace
```
![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
#### Serving with Restful API
Launch inference server by:
```shell
lmdeploy serve api_server ./workspace --instance_num 32 --tp 1
```
Then, you can communicate with it by command line,
```shell
# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
lmdeploy serve api_client restful_api_url
```
or webui,
```shell
# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
# server_ip and server_port here are for gradio ui
# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006 --restful_api True
lmdeploy serve gradio restful_api_url --server_name ${server_ip} --server_port ${server_port} --restful_api True
```
Refer to [restful_api.md](docs/en/restful_api.md) for more details.
#### Serving with Triton Inference Server
Launch inference server by:
```shell
bash workspace/service_docker_up.sh
```
Then, you can communicate with the inference server by command line,
```shell
lmdeploy serve triton_client {server_ip_addresss}:33337
```
or webui,
```shell
lmdeploy serve gradio {server_ip_addresss}:33337
```
For the deployment of other supported models, such as LLaMA, LLaMA-2, Vicuna and so on, you can find the guide [here](docs/en/serving.md).
### Inference with PyTorch
For detailed instructions on running inference with PyTorch models, see [here](docs/en/pytorch.md).
#### Single GPU
```shell
lmdeploy chat torch $NAME_OR_PATH_TO_HF_MODEL \
--max_new_tokens 64 \
--temperature 0.8 \
--top_p 0.95 \
--seed 0
```
#### Tensor Parallel with DeepSpeed
```shell
deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
$NAME_OR_PATH_TO_HF_MODEL \
--max_new_tokens 64 \
--temperature 0.8 \
--top_p 0.95 \
--seed 0
```
You need to install deepspeed first to use this feature.
```
pip install deepspeed
```
## Quantization
#### Weight INT4 Quantization
LMDeploy uses the [AWQ](https://arxiv.org/abs/2306.00978) algorithm for model weight quantization.
[Click here](./docs/en/w4a16.md) to view the test results for weight int4 usage.
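For intuition about what W4 (int4) weight storage means, the sketch below packs eight 4-bit codes into one 32-bit word and unpacks them again; this mirrors the 8:1 element-count relationship checked in the turbomind bindings later in this commit (`src_count * 8 == dst_count`). The exact bit layout, scales and zero points used by LMDeploy/AWQ are not shown here; this is an assumption-laden illustration only.

```python
def pack_int4(values):
    """Pack unsigned 4-bit values (0..15) into 32-bit words, 8 per word."""
    assert len(values) % 8 == 0 and all(0 <= v < 16 for v in values)
    words = []
    for i in range(0, len(values), 8):
        word = 0
        for j, v in enumerate(values[i:i + 8]):
            word |= v << (4 * j)   # value j occupies bits [4j, 4j+4)
        words.append(word)
    return words

def unpack_int4(words):
    """Inverse of pack_int4: each 32-bit word yields 8 unsigned 4-bit values."""
    return [(w >> (4 * j)) & 0xF for w in words for j in range(8)]

vals = list(range(16))       # 16 quantized weight codes
packed = pack_int4(vals)     # 2 words -> 8x fewer elements, as in src_count * 8 == dst_count
assert unpack_int4(packed) == vals
```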
#### KV Cache INT8 Quantization
[Click here](./docs/en/kv_int8.md) to view the usage method, implementation formula, and test results for kv int8.
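The kv_int8 document linked above describes the actual formula; as a rough, assumption-based illustration, per-tensor int8 quantization of a cached k/v block can look like the following (symmetric scaling only; LMDeploy's exact scheme, e.g. per-channel scales or zero points, may differ).

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: x ~= q * scale."""
    max_abs = np.abs(x).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

kv = np.random.randn(4, 128).astype(np.float32)   # toy cached k/v block
q, scale = quantize_int8(kv)
kv_hat = dequantize_int8(q, scale)
print("memory: %d -> %d bytes, max abs error %.4f"
      % (kv.nbytes, q.nbytes, np.abs(kv - kv_hat).max()))
```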
> **Warning**<br />
> Runtime tensor parallelism for quantized models is not available. Please set `--tp` on `deploy` to enable static TP.
## Contributing
We appreciate all contributions to LMDeploy. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guideline.
## Acknowledgement
- [FasterTransformer](https://github.com/NVIDIA/FasterTransformer)
- [llm-awq](https://github.com/mit-han-lab/llm-awq)
## License
This project is released under the [Apache 2.0 license](LICENSE).
@@ -45,8 +45,8 @@ target_link_libraries(Llama PUBLIC cudart
                       # llama_fmha)
 if (NOT MSVC)
-    add_subdirectory(flash_attention2)
-    target_link_libraries(Llama PUBLIC flash_attention2)
+    # add_subdirectory(flash_attention2)
+    # target_link_libraries(Llama PUBLIC flash_attention2)
 endif()
 add_executable(llama_gemm llama_gemm.cc)
@@ -737,49 +737,49 @@ void invokeGatherOutput(int* output_ids,
     } \
     }()
-template<typename T>
-FlashAttentionOp<T>::FlashAttentionOp(int batch_size, int head_num, int key_len, int seq_len, int size_per_head):
-    batch_size_(batch_size), head_num_(head_num), key_len_(key_len), seq_len_(seq_len), size_per_head_(size_per_head)
-{
-#ifdef _MSC_VER
-    op_version_ = 1;
-#else
-    op_version_ = std::is_same<half, typename std::decay<T>::type>::value ? 2 : 1;
-    if (op_version_ == 2 && getSMVersion() < 80) {
-        op_version_ = 1;
-    }
-#endif
-}
-template<typename T>
-int FlashAttentionOp<T>::get_workspace_size() const
-{
-#ifdef _MSC_VER
-    FlashAttentionOpImpl<T, 1> attention_op(batch_size_, head_num_, key_len_, seq_len_, size_per_head_);
-    return attention_op.get_workspace_size();
-#else
-    return VERSION_SWITCH(op_version_, OP_VERSION, [&]() {
-        FlashAttentionOpImpl<T, OP_VERSION> attention_op(batch_size_, head_num_, key_len_, seq_len_, size_per_head_);
-        return attention_op.get_workspace_size();
-    });
-#endif
-}
-template<typename T>
-void FlashAttentionOp<T>::operator()(Params& params, cudaStream_t st) const
-{
-#ifdef _MSC_VER
-    FlashAttentionOpImpl<T, 1> attention_op(batch_size_, head_num_, key_len_, seq_len_, size_per_head_);
-    return attention_op(params, st);
-#else
-    return VERSION_SWITCH(op_version_, OP_VERSION, [&]() {
-        FlashAttentionOpImpl<T, OP_VERSION> attention_op(batch_size_, head_num_, key_len_, seq_len_, size_per_head_);
-        return attention_op(params, st);
-    });
-#endif
-}
-template class FlashAttentionOp<float>;
-template class FlashAttentionOp<half>;
+// template<typename T>
+// FlashAttentionOp<T>::FlashAttentionOp(int batch_size, int head_num, int key_len, int seq_len, int size_per_head):
+//     batch_size_(batch_size), head_num_(head_num), key_len_(key_len), seq_len_(seq_len), size_per_head_(size_per_head)
+// {
+// #ifdef _MSC_VER
+//     op_version_ = 1;
+// #else
+//     op_version_ = std::is_same<half, typename std::decay<T>::type>::value ? 2 : 1;
+//     if (op_version_ == 2 && getSMVersion() < 80) {
+//         op_version_ = 1;
+//     }
+// #endif
+// }
+// template<typename T>
+// int FlashAttentionOp<T>::get_workspace_size() const
+// {
+// #ifdef _MSC_VER
+//     FlashAttentionOpImpl<T, 1> attention_op(batch_size_, head_num_, key_len_, seq_len_, size_per_head_);
+//     return attention_op.get_workspace_size();
+// #else
+//     return VERSION_SWITCH(op_version_, OP_VERSION, [&]() {
+//         FlashAttentionOpImpl<T, OP_VERSION> attention_op(batch_size_, head_num_, key_len_, seq_len_, size_per_head_);
+//         return attention_op.get_workspace_size();
+//     });
+// #endif
+// }
+// template<typename T>
+// void FlashAttentionOp<T>::operator()(Params& params, cudaStream_t st) const
+// {
+// #ifdef _MSC_VER
+//     FlashAttentionOpImpl<T, 1> attention_op(batch_size_, head_num_, key_len_, seq_len_, size_per_head_);
+//     return attention_op(params, st);
+// #else
+//     return VERSION_SWITCH(op_version_, OP_VERSION, [&]() {
+//         FlashAttentionOpImpl<T, OP_VERSION> attention_op(batch_size_, head_num_, key_len_, seq_len_, size_per_head_);
+//         return attention_op(params, st);
+//     });
+// #endif
+// }
+// template class FlashAttentionOp<float>;
+// template class FlashAttentionOp<half>;
 }  // namespace turbomind
@@ -462,6 +462,6 @@ PYBIND11_MODULE(_turbomind, m)
             auto src_count = std::accumulate(src_tensor.shape, src_tensor.shape + src_tensor.ndim, size_t{1});
             auto dst_count = std::accumulate(dst_tensor.shape, dst_tensor.shape + dst_tensor.ndim, size_t{1});
             turbomind::FT_CHECK(src_count * 8 == dst_count);
-            turbomind::dequantize_s4((uint4*)dst_tensor.data, (uint32_t*)src_tensor.data, src_count, nullptr);
+            // turbomind::dequantize_s4((uint4*)dst_tensor.data, (uint32_t*)src_tensor.data, src_count, nullptr);
         });
 }