 <div align="center"><strong>Text Generation Inference</strong></div>

## Introduction
Text Generation Inference (TGI) is a framework written in Rust and Python for deploying and serving LLM inference. TGI provides high-performance inference services for many large models, such as Llama, Falcon, BLOOM, Baichuan, and Qwen.

## 支持模型列表

- [Deepseek V2](https://huggingface.co/deepseek-ai/DeepSeek-V2)
- [Idefics 2](https://huggingface.co/HuggingFaceM4/idefics2-8b) (Multimodal)
- [Llava Next (1.6)](https://huggingface.co/llava-hf/llava-v1.6-vicuna-13b-hf) (Multimodal)
- [Llama](https://huggingface.co/collections/meta-llama/llama-31-669fc079a0c406a149a5738f)
- [Phi 3](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)
- [Granite](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct)
- [Gemma](https://huggingface.co/google/gemma-7b)
- [PaliGemma](https://huggingface.co/google/paligemma-3b-pt-224)
- [Gemma2](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315)
- [Cohere](https://huggingface.co/CohereForAI/c4ai-command-r-plus)
- [Dbrx](https://huggingface.co/databricks/dbrx-instruct)
- [Mamba](https://huggingface.co/state-spaces/mamba-2.8b-slimpj)
- [Mistral](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407)
- [Mixtral](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1)
- [Gpt Bigcode](https://huggingface.co/bigcode/gpt_bigcode-santacoder)
- [Phi](https://huggingface.co/microsoft/phi-1_5)
- [PhiMoe](https://huggingface.co/microsoft/Phi-3.5-MoE-instruct)
- [Baichuan](https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat)
- [Falcon](https://huggingface.co/tiiuae/falcon-7b-instruct)
- [StarCoder 2](https://huggingface.co/bigcode/starcoder2-15b-instruct-v0.1)
- [Qwen 2](https://huggingface.co/collections/Qwen/qwen2-6659360b33528ced941e557f)
- [Qwen 2 VL](https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d)
- [Opt](https://huggingface.co/facebook/opt-6.7b)
- [T5](https://huggingface.co/google/flan-t5-xxl)
- [Galactica](https://huggingface.co/facebook/galactica-120b)
- [SantaCoder](https://huggingface.co/bigcode/santacoder)
- [Bloom](https://huggingface.co/bigscience/bloom-560m)
- [Mpt](https://huggingface.co/mosaicml/mpt-7b-instruct)
- [Gpt2](https://huggingface.co/openai-community/gpt2)
- [Gpt Neox](https://huggingface.co/EleutherAI/gpt-neox-20b)
- [Gptj](https://huggingface.co/EleutherAI/gpt-j-6b)
- [Idefics](https://huggingface.co/HuggingFaceM4/idefics-9b) (Multimodal)
- [Mllama](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) (Multimodal)


## Requirements
+ Python 3.10
+ DTK 24.04.3
+ torch 2.1.0

### Installing from Source

#### Preparing the Build Environment

There are two ways to prepare the environment.
##### Option 1:

### **TODO**

##### Option 2:

Use the SourceFind (光源) pytorch2.1.0 base image. Image download page: [https://sourcefind.cn/#/image/dcu/pytorch](https://sourcefind.cn/#/image/dcu/pytorch); choose the image version matching pytorch2.1.0, your Python, DTK, and OS. The pytorch2.1.0 image already has Triton and flash-attn installed.

1. Install Rust
```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

2. Install Protoc
```shell
PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP
```
3. Install the TGI service
```bash
git clone http://developer.hpccube.com/codes/OpenDAS/text-generation-inference.git  # add -b v3.0.0 to check out that branch
cd text-generation-inference
# install the exllamav2 kernels
cd server/exllamav2_kernels
python setup.py install
cd ../..  # back to the project root
source $HOME/.cargo/env
BUILD_EXTENSIONS=True make install  # install the text-generation service
```
4. Install the benchmark tool
```bash
cd text-generation-inference
make install-benchmark
```

Note: if installation is slow, you can switch the default pip index to speed it up:
```bash
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
```
If `cargo install` is also slow, you can speed it up by adding a mirror source in `~/.cargo/config`.
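As a minimal sketch, a source-replacement entry in `~/.cargo/config` might look like the following (the TUNA mirror URL here is one example of a crates.io mirror; substitute whichever mirror is reachable from your network):

```toml
# Replace the default crates.io registry with a mirror.
[source.crates-io]
replace-with = 'mirror'

[source.mirror]
registry = "sparse+https://mirrors.tuna.tsinghua.edu.cn/crates.io-index/"
```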

## Checking the Installed Version
```bash
text-generation-launcher -V  # the version number tracks the upstream release
```

## Before Use

Set the following environment variables:
```bash
export PYTORCH_TUNABLEOP_ENABLED=0
export ATTENTION=paged
```

## Usage

Start the TGI service and send it a request:
```bash
# start the tgi service
HIP_VISIBLE_DEVICES=2 text-generation-launcher --dtype=float16 --model-id /path/to/model --trust-remote-code --port 3001
# query the service
curl 127.0.0.1:3001/generate \
    -X POST \
    -d '{"inputs":"What is deep learning?","parameters":{"max_new_tokens":100,"temperature":0.7}}' \
    -H 'Content-Type: application/json'
```

Calling the service from Python:
```python
import requests

headers = {
    "Content-Type": "application/json",
}

data = {
    'inputs': 'What is Deep Learning?',
    'parameters': {
        'max_new_tokens': 20,
    },
}

response = requests.post('http://127.0.0.1:3001/generate', headers=headers, json=data)
print(response.json())
# {'generated_text': ' Deep Learning is a subset of machine learning where neural networks are trained deep within a hierarchy of layers instead'}
```
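TGI also exposes a streaming endpoint, `/generate_stream`, which returns tokens incrementally as server-sent events (`data: {json}` lines). As a minimal sketch, assuming that standard SSE line format and a server already running on the port used above, the stream can be consumed like this:

```python
import json
import requests


def parse_sse_line(line: bytes):
    """Parse one server-sent-event line; return its JSON payload.

    Lines that are not `data:` events (comments, keep-alives, blanks)
    yield None.
    """
    text = line.decode("utf-8").strip()
    if not text.startswith("data:"):
        return None
    return json.loads(text[len("data:"):].strip())


def stream_generate(url: str, prompt: str, max_new_tokens: int = 20):
    """Yield token payloads from a running TGI /generate_stream endpoint."""
    data = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    with requests.post(url, json=data, stream=True) as resp:
        for line in resp.iter_lines():
            event = parse_sse_line(line) if line else None
            if event is not None:
                yield event


# usage (requires a running service):
# for event in stream_generate("http://127.0.0.1:3001/generate_stream",
#                              "What is deep learning?"):
#     print(event["token"]["text"], end="", flush=True)
```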


## Known Issue

-

## References
- [README_ORIGIN](README_ORIGIN.md)
- [https://github.com/huggingface/text-generation-inference](https://github.com/huggingface/text-generation-inference)