<div align="center"><strong>Text Generation Inference</strong></div>

## Introduction
Text Generation Inference (TGI) is a framework written in Rust and Python for deploying and serving LLM inference. TGI provides high-performance inference serving for many large models, such as LLaMA, Falcon, BLOOM, Baichuan, and Qwen.

## Verified Model Architectures

|     Model      | Model Parallelism | FP16 |
| :-----------: | :---------------: | :--: |
|    LLaMA-2        |   Not verified    | Yes  |
|    Deepseek2      |   Not verified    | Yes  |
|    Baichuan2-7B   |   Not verified    | Yes  |
|    Baichuan2-13B  |   Not verified    | Yes  |


## Requirements
+ Python 3.10
+ DTK 24.04.3
+ torch 2.3.0

### Installing from Source

#### Preparing the Build Environment

##### Option 1

Use the Guangyuan (sourcefind.cn) PyTorch 2.3.0 base image. Download it from [https://sourcefind.cn/#/image/dcu/pytorch](https://sourcefind.cn/#/image/dcu/pytorch), choosing the image version that matches PyTorch 2.3.0 and your Python, DTK, and OS versions. The PyTorch 2.3.0 image already includes triton and flash-attn.

1. Install Rust
```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

2. Install Protoc
```shell
PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP
```

3. Install the TGI Service
```bash
# switch to the branch you need
git clone http://developer.hpccube.com/codes/wangkx1/text_generation_server-dcu.git -b v2.4.0  
cd text-generation-inference
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install -r pre_requirements.txt

# install rocm-server
cd server
make install-rocm       

cd ..  # back to the project root
source $HOME/.cargo/env

# the router depends on the latest openssl release packages;
# git clone https://github.com/openssl/openssl.git -b openssl-3.4.0  # install if needed
make install-router -j8

# install the text-generation launcher
BUILD_EXTENSIONS=True make install-launcher -j8 
```
4. Install the benchmark tool
```bash
cd text-generation-inference
make install-benchmark
```

Additionally, if `cargo install` is too slow, you can speed it up by adding a mirror source in `~/.cargo/config`.
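
As one example, the crates.io registry can be redirected to a mirror. The snippet below assumes the USTC mirror is reachable from your network and that your cargo version supports sparse registry indexes; older cargo versions need the git-protocol form instead.

```toml
# ~/.cargo/config: replace crates.io with the USTC mirror (assumed reachable)
[source.crates-io]
replace-with = 'ustc'

[source.ustc]
registry = "sparse+https://mirrors.ustc.edu.cn/crates.io-index/"
```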

## Checking the Installed Version
```bash
text-generation-launcher -V  # the version number is kept in sync with the upstream release
```

## Usage Example

```bash
export LD_LIBRARY_PATH=/root/miniconda3/envs/tgi2.4.0/lib:/root/miniconda3/envs/tgi2.4.0/lib/python3.10/site-packages/torch/lib:$LD_LIBRARY_PATH

# PYTORCH_TUNABLEOP_ENABLED is not supported
export PYTORCH_TUNABLEOP_ENABLED=0
# export CUDA_GRAPHS=1
# export ATTENTION=flashdecoding
# export ATTENTION=flashinfer

# only ATTENTION=paged is supported
export ATTENTION=paged

# model=DeepSeek-V2-Lite-Chat
# model=baichuan-inc/Baichuan2-7B-Chat
# model=baichuan-inc/Baichuan2-13B-Chat
model=llama/meta-llama_Llama-2-7b-chat

export HIP_VISIBLE_DEVICES=4,5
text-generation-launcher --dtype=float16 \
--model-id ${model} \
--trust-remote-code --port 8080 

# --max-client-batch-size 16 
# --num-shard 2

```

## Testing the Service

```shell
curl 127.0.0.1:8080/generate     -X POST     -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":200}}'     -H 'Content-Type: application/json'
```
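
The same request can also be issued from Python. Below is a minimal client sketch using only the standard library; the endpoint, port, and payload mirror the curl call above (adjust the host and port for your deployment):

```python
import json
from urllib import request

def build_generate_request(prompt: str, max_new_tokens: int = 200) -> request.Request:
    """Build a POST request for TGI's /generate endpoint."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    return request.Request(
        "http://127.0.0.1:8080/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With the server running, send the request and print the completion:
# with request.urlopen(build_generate_request("What is Deep Learning?")) as resp:
#     print(json.loads(resp.read())["generated_text"])
```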

## Known Issues

-

## References
- [README_ORIGIN](README_ORIGIN.md)
- [https://github.com/huggingface/text-generation-inference](https://github.com/huggingface/text-generation-inference)