

# LLAMA

## Paper
- [https://arxiv.org/pdf/2302.13971.pdf](https://arxiv.org/pdf/2302.13971.pdf)

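## Model Architecture
The LLaMA network is based on the Transformer architecture and adopts several improvements that were proposed later and used in other models such as PaLM. The main differences from the original architecture are:
- **Pre-normalization.** To improve training stability, the input of each transformer sub-layer is normalized instead of the output, using the RMSNorm normalization function.
- **SwiGLU activation function [PaLM].** The ReLU non-linearity is replaced with the SwiGLU activation function to improve performance, with a dimension of 2/3 · 4d instead of the 4d used in PaLM.
- **Rotary embeddings.** Absolute positional embeddings are removed, and rotary positional embeddings (RoPE) are added at each layer of the network instead.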
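
The three changes listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the model's actual implementation: the function names, the `eps` value, the RoPE base of 10000, the interleaved pairing of dimensions, and the rounding of the feed-forward width up to a multiple of 256 are all assumptions that do not come from this README.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then apply a per-element weight."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * scale for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish), the gate non-linearity inside SwiGLU: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu_hidden_dim(d, multiple_of=256):
    """Feed-forward width of 2/3 * 4d; rounding up to a multiple is an assumption."""
    hidden = int(2 * 4 * d / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)


def apply_rope(x, position, base=10000.0):
    """Rotate each pair (x[i], x[i+1]), i even, by the angle position * base**(-i/d)."""
    d = len(x)  # assumed to be even
    out = []
    for i in range(0, d, 2):
        angle = position * base ** (-i / d)
        cos, sin = math.cos(angle), math.sin(angle)
        out.append(x[i] * cos - x[i + 1] * sin)
        out.append(x[i] * sin + x[i + 1] * cos)
    return out
```

Because RoPE is a pure rotation, it leaves the vector norm unchanged and is the identity at position 0. With the rounding assumed here, `swiglu_hidden_dim(4096)` evaluates to 11008.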

![img](./docs/llama_str.png)

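## Algorithm Overview
LLaMA is a collection of foundation language models ranging from 7B to 65B parameters. The models are trained on trillions of tokens and show that state-of-the-art models can be trained using publicly available datasets exclusively, without relying on proprietary and inaccessible datasets.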

![img](./docs/llama_pri.png)

## Environment Setup

### Docker (Method 1)

**TODO**

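### Build from Source (Method 2)

This method uses the pytorch2.1.0 base image from 光源 (SourceFind). Image download page: [https://sourcefind.cn/#/image/dcu/pytorch](https://sourcefind.cn/#/image/dcu/pytorch). Choose the image version that matches your pytorch2.1.0, python, dtk, and operating system. The pytorch2.1.0 image already has triton and flash-attn installed.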

1. Install Rust

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

2. Install Protoc

```shell
PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP
```

3. Install the TGI service

```bash
cd llama_tgi
git clone http://developer.hpccube.com/codes/OpenDAS/text-generation-inference.git # switch to the branch you need, e.g. -b v2.1.1
cd text-generation-inference
# Install exllama
cd server
make install-exllama # install the exllama kernels
make install-exllamav2 # install the exllamav2 kernels
cd .. # back to the project root
source $HOME/.cargo/env
BUILD_EXTENSIONS=True make install # install the text-generation service
```

4. Install the benchmark

```bash
cd text-generation-inference
make install-benchmark
```

Note: if installation is too slow, switch the default pip index to a mirror with the following command.

```bash
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
```

Likewise, if `cargo install` is too slow, you can speed it up by adding a mirror source in `~/.cargo/config`.

## Check the Installed Version

```bash
text-generation-launcher -V  # the version number tracks the official release
```
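
As a complement to the version check, the following sketch reports whether the command-line tools installed in the steps above are visible on `PATH`. The helper name `find_tools` is hypothetical; only the command names come from this README.

```python
import shutil


def find_tools(names=("cargo", "protoc", "text-generation-launcher", "text-generation-benchmark")):
    """Map each command name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

If `cargo` is reported as missing right after installing Rust, the current shell may not have sourced `$HOME/.cargo/env` yet.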

## Before Use

```bash
export PYTORCH_TUNABLEOP_ENABLED=0
```

## Dataset
None

## Inference

### Model Download

| Base model                                                   | Chat model                                                   | GPTQ model                                                   |
| ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| [Llama-2-7b-hf](http://113.200.138.88:18080/aimodels/Llama-2-7b-hf/-/archive/main/Llama-2-7b-hf-main.tar.gz) | [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) | [Llama-2-7B-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GPTQ/tree/gptq-4bit-128g-actorder_True) |
| [Llama-2-13b-hf](http://113.200.138.88:18080/aimodels/Llama-2-13b-hf/-/archive/main/Llama-2-13b-hf-main.tar.gz) | [Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) | [Llama-2-13B-GPTQ](https://huggingface.co/TheBloke/Llama-2-13B-GPTQ/tree/gptq-4bit-128g-actorder_True) |
| [Llama-2-70b-hf](http://113.200.138.88:18080/aimodels/meta-llama/Llama-2-70b-hf/-/archive/main/Llama-2-70b-hf-main.tar.gz) | [Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) | [Llama-2-70B-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-70B-Chat-GPTQ/tree/gptq-4bit-128g-actorder_True) |
| [Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) | [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) |                                                              |
| [Meta-Llama-3-70B](http://113.200.138.88:18080/aimodels/meta-llama/Meta-Llama-3.1-70B/-/archive/main/Meta-Llama-3.1-70B-main.tar.gz) | [Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) |                                                              |


### Deploy TGI
#### 1. Start the TGI service
```
HIP_VISIBLE_DEVICES=3 text-generation-launcher --dtype=float16 --model-id /path/to/Llama-2-7b-chat-hf --port 3001
```
To see more options, run:
```
text-generation-launcher --help
```
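
If the service is started from a Python script instead of a shell, the launch command can be assembled as below. This is a sketch under assumptions: the helper names `build_launch` and `launch` are hypothetical, and it folds the `PYTORCH_TUNABLEOP_ENABLED=0` setting from earlier in this README into the same environment. The flags and the device index are the ones used in the shell command above.

```python
import os
import subprocess


def build_launch(model_path, port=3001, device="3", dtype="float16"):
    """Return (argv, env) mirroring the text-generation-launcher shell command."""
    env = dict(os.environ)
    env["HIP_VISIBLE_DEVICES"] = device      # which DCU/GPU the service may use
    env["PYTORCH_TUNABLEOP_ENABLED"] = "0"   # same setting as the export shown earlier
    argv = [
        "text-generation-launcher",
        f"--dtype={dtype}",
        "--model-id", model_path,
        "--port", str(port),
    ]
    return argv, env


def launch(model_path, **kwargs):
    """Start the service as a child process and return the Popen handle."""
    argv, env = build_launch(model_path, **kwargs)
    return subprocess.Popen(argv, env=env)
```

Keeping command construction separate from process creation makes it possible to inspect or log the exact command before anything is started.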
#### 2. Send requests

Using curl:
```
curl 127.0.0.1:3001/generate \
    -X POST \
    -d '{"inputs":"What is deep learning?","parameters":{"max_new_tokens":100,"temperature":0.7}}' \
    -H 'Content-Type: application/json'
```
Using Python:
```
import requests

headers = {
    "Content-Type": "application/json",
}

data = {
    'inputs': 'What is Deep Learning?',
    'parameters': {
        'max_new_tokens': 20,
    },
}

response = requests.post('http://127.0.0.1:3001/generate', headers=headers, json=data)
print(response.json())
# {'generated_text': '\n\nDeep Learning is a subset of Machine Learning that is concerned with the development of algorithms that can'}
```
For more API details, see [https://huggingface.github.io/text-generation-inference](https://huggingface.github.io/text-generation-inference)
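
Besides `/generate`, upstream TGI also exposes a streaming endpoint, `/generate_stream`, which returns server-sent events. The sketch below uses only the standard library and assumes each event arrives as a `data:` line whose JSON payload carries a `token` object with a `text` field. That event shape and the helper names are assumptions to verify against the API documentation for the installed version.

```python
import json
import urllib.request


def parse_sse_line(line):
    """Decode one server-sent-event line; return its JSON payload, or None for other lines."""
    text = line.decode("utf-8").strip() if isinstance(line, bytes) else line.strip()
    if not text.startswith("data:"):
        return None
    return json.loads(text[len("data:"):].strip())


def stream_tokens(prompt, url="http://127.0.0.1:3001/generate_stream", max_new_tokens=20):
    """Yield token texts as they arrive from the streaming endpoint."""
    body = json.dumps(
        {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    ).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        for raw_line in response:  # the response object iterates line by line
            event = parse_sse_line(raw_line)
            if event and "token" in event:
                yield event["token"]["text"]
```

With the service running, `for piece in stream_tokens("What is Deep Learning?"): print(piece, end="", flush=True)` prints the answer incrementally.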
#### 3. TGI benchmark
Example:
```
text-generation-benchmark -s 32 -d 128 --runs 10 --tokenizer-name /path/to/Llama-2-7b-chat-hf
```
Note: the TGI service must be running before you use the TGI benchmark. In addition, `--tokenizer-name` must match the model used by the service.
To see more options, run:
```
text-generation-benchmark --help
```
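
Since the benchmark depends on a running service, a script can wait for the service before starting it. The sketch below polls the `/health` route that upstream TGI provides. The helper name, the timeout values, and the use of port 3001 from the launch example are assumptions.

```python
import time
import urllib.request


def wait_until_ready(base_url="http://127.0.0.1:3001", timeout_s=300.0, interval_s=2.0):
    """Poll GET {base_url}/health until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=interval_s) as response:
                if response.status == 200:
                    return True
        except OSError:
            pass  # connection refused, timed out, or an HTTP error status: not ready yet
        time.sleep(interval_s)
    return False
```

Loading large model weights can take several minutes, so a generous timeout is a reasonable starting point.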

### Inference Results

![img1](./readme_images/img1.png)

## Application Scenarios

### Algorithm Category
Dialogue and question answering

### Key Application Industries
Finance, scientific research, education

## Source Repository and Issue Feedback
* [https://developer.hpccube.com/codes/modelzoo/llama_tgi](https://developer.hpccube.com/codes/modelzoo/llama_tgi)

## References
* [https://github.com/huggingface/text-generation-inference](https://github.com/huggingface/text-generation-inference)