<div align="center"><strong>Text Generation Inference </strong></div>

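## Introduction
Text Generation Inference (TGI) is a framework written in Rust and Python for deploying and serving LLM inference. TGI provides high-performance inference for many large models, such as Llama, Falcon, BLOOM, Baichuan, and Qwen.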

## Verified model architectures

|     Model      | Model parallelism | FP16 |
| :------------: | :---------------: | :--: |
|    LLaMA-2        |   Not verified    | Yes  |
|    Deepseek2      |   Not verified    | Yes  |
|    Baichuan2-7B   |   Not verified    | Yes  |
|    Baichuan2-13B  |   Not verified    | Yes  |


## Requirements
+ Python 3.10
+ DTK 24.04.3
+ torch 2.3.0

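### Installing from source

#### Preparing the build environment

##### Option 1:

Start from the SourceFind (光源) PyTorch 2.3.0 base image. Image download: [https://sourcefind.cn/#/image/dcu/pytorch](https://sourcefind.cn/#/image/dcu/pytorch). Pick the image version that matches PyTorch 2.3.0 and your Python, DTK, and operating system versions. The PyTorch 2.3.0 image already has triton and flash-attn installed.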

1. Install Rust
```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

2. Install Protoc
```shell
PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP
```
3. Install the TGI service
```bash
# Check out the branch you need; the explicit target directory matches the `cd` below
git clone http://developer.hpccube.com/codes/wangkx1/text_generation_server-dcu.git -b v2.4.0 text-generation-inference

cd text-generation-inference
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install -r pre_requirements.txt

# Install the ROCm server
cd server
make install-rocm

cd ..  # Back to the project root
source $HOME/.cargo/env

# The router depends on a recent OpenSSL release; build and install it first if your system lacks one:
# git clone https://github.com/openssl/openssl.git -b openssl-3.4.0
make install-router -j8

# Install the text-generation launcher
BUILD_EXTENSIONS=True make install-launcher -j8
```
4. Install the benchmark tool
```bash
cd text-generation-inference
make install-benchmark
```

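If `cargo install` is too slow, you can speed it up by adding a mirror source in `~/.cargo/config` (named `~/.cargo/config.toml` in newer Cargo versions).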
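
The mirror tip can be sketched as follows. This is a minimal example, not the project's official setup: the TUNA crates.io mirror URL and the source name `mirror` are assumptions chosen for illustration, so substitute whichever mirror is reachable from your network. The snippet writes to `config.toml`, the file name newer Cargo versions prefer, and skips the write if a `crates-io` replacement is already configured.

```shell
# Locate the Cargo configuration file (CARGO_HOME defaults to ~/.cargo).
CARGO_DIR="${CARGO_HOME:-$HOME/.cargo}"
CONFIG_FILE="$CARGO_DIR/config.toml"
mkdir -p "$CARGO_DIR"

# Only append the mirror settings if crates-io has not been replaced yet,
# so running the snippet twice does not create duplicate TOML tables.
if ! grep -q '^\[source\.crates-io\]' "$CONFIG_FILE" 2>/dev/null; then
  cat >> "$CONFIG_FILE" <<'EOF'
[source.crates-io]
replace-with = "mirror"

# Assumption: TUNA sparse index, used here only as an example mirror.
[source.mirror]
registry = "sparse+https://mirrors.tuna.tsinghua.edu.cn/crates.io-index/"
EOF
fi
```

After this, `cargo install` and the `make install-*` targets that call Cargo should fetch crates from the configured mirror instead of crates.io.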

## Checking the installed version
```bash
text-generation-launcher -V  # The version number tracks the upstream release
```

## Usage example

```bash
export LD_LIBRARY_PATH=/root/miniconda3/envs/tgi2.4.0/lib:/root/miniconda3/envs/tgi2.4.0/lib/python3.10/site-packages/torch/lib:$LD_LIBRARY_PATH

# PYTORCH_TUNABLEOP_ENABLED is not supported, so keep it disabled
export PYTORCH_TUNABLEOP_ENABLED=0
# export CUDA_GRAPHS=1
# export ATTENTION=flashdecoding
# export ATTENTION=flashinfer

# Only ATTENTION=paged is supported
export ATTENTION=paged

# model=DeepSeek-V2-Lite-Chat
# model=baichuan-inc/Baichuan2-7B-Chat
# model=baichuan-inc/Baichuan2-13B-Chat
model=llama/meta-llama_Llama-2-7b-chat

export HIP_VISIBLE_DEVICES=4,5
text-generation-launcher --dtype=float16 \
--model-id ${model} \
--trust-remote-code --port 8080 

# --max-client-batch-size 16 
# --num-shard 2

```
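
The launch command above exposes two devices through `HIP_VISIBLE_DEVICES` while `--num-shard` stays commented out. If you experiment with sharding, one way to keep the two values in sync is to derive the shard count from the device list. The sketch below is a dry run that only prints the command. `count_devices` is a hypothetical helper, and model parallelism is listed as not verified in the table above, so treat this as a starting point rather than a tested configuration.

```shell
# Hypothetical helper: count the comma-separated entries in a device list.
count_devices() {
  printf '%s\n' "$1" | tr ',' '\n' | grep -c .
}

# Defaults mirror the usage example; override them in your environment.
HIP_VISIBLE_DEVICES="${HIP_VISIBLE_DEVICES:-4,5}"
model="${model:-llama/meta-llama_Llama-2-7b-chat}"
num_shard="$(count_devices "$HIP_VISIBLE_DEVICES")"

# Dry run: print the command instead of executing it.
echo text-generation-launcher --dtype=float16 \
  --model-id "$model" \
  --trust-remote-code --port 8080 \
  --num-shard "$num_shard"
```

Remove the leading `echo` to actually start the launcher once the printed command looks right.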

## Testing the service

```shell
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":200}}' \
    -H 'Content-Type: application/json'
```
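
Loading a model can take a while, so the request above may fail if it is sent too early. The following sketch waits for the server's `/health` endpoint before sending a request, and streams tokens through `/generate_stream` instead of waiting for the full response. The function names and the `TGI_URL` variable are illustrative assumptions, and `build_payload` does no JSON escaping, so it only suits prompts without double quotes or backslashes.

```shell
# Assumption: the server listens where the usage example started it.
TGI_URL="${TGI_URL:-http://127.0.0.1:8080}"

# Build the JSON request body.
# $1: prompt (no double quotes or backslashes; nothing is escaped here)
# $2: max_new_tokens (defaults to 200, as in the request above)
build_payload() {
  printf '{"inputs":"%s","parameters":{"max_new_tokens":%s}}' "$1" "${2:-200}"
}

# Poll /health every 5 seconds until it succeeds or the attempts run out.
# $1: maximum number of attempts (defaults to 60)
wait_for_tgi() {
  attempts_left="${1:-60}"
  while [ "$attempts_left" -gt 0 ]; do
    if curl -sf -o /dev/null "$TGI_URL/health"; then
      return 0
    fi
    attempts_left=$((attempts_left - 1))
    sleep 5
  done
  return 1
}

# Stream generated tokens as server-sent events.
generate_stream() {
  curl -sN "$TGI_URL/generate_stream" \
    -X POST \
    -H 'Content-Type: application/json' \
    -d "$(build_payload "$1" "${2:-200}")"
}
```

With the server starting in another terminal, `wait_for_tgi && generate_stream "What is Deep Learning?" 50` blocks until the model is ready and then prints events as tokens arrive.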

## Known Issues

- None

## References
- [README_ORIGIN](README_ORIGIN.md)
- [https://github.com/huggingface/text-generation-inference](https://github.com/huggingface/text-generation-inference)