# LLama_FT

## Paper

- [https://arxiv.org/pdf/2302.13971.pdf](https://arxiv.org/pdf/2302.13971.pdf)

## Model architecture

The LLaMA network is based on the Transformer architecture and incorporates various improvements that were subsequently proposed and used in other models such as PaLM. The main differences from the original architecture are:

* Pre-normalization. To improve training stability, the input of each Transformer sub-layer is normalized instead of the output, using the RMSNorm normalization function.
* SwiGLU activation function [PaLM]. The ReLU non-linearity is replaced with the SwiGLU activation function to improve performance, using a dimension of 2/3 · 4d instead of the 4d used in PaLM.
* Rotary embeddings. Absolute positional embeddings are removed and replaced with rotary positional embeddings (RoPE) at every layer of the network.

## Model overview

LLaMA is a collection of foundation language models ranging from 7B to 65B parameters. The models are trained on trillions of tokens and show that state-of-the-art models can be trained exclusively on publicly available datasets, without relying on proprietary and inaccessible data.

## Environment setup

A Docker image for inference can be pulled from [光源 (SourceFind)](https://www.sourcefind.cn/#/service-details):

```
docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:fastertransformer-dtk23.04-latest

# <image_id>: ID of the image pulled above
# <host_path>: path on the host; <container_path>: path inside the container
docker run -it --name llama --shm-size=32G --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v <host_path>:<container_path> <image_id> /bin/bash
```

Image version dependencies:

* DTK driver: dtk23.04
* PyTorch: 1.10
* Python: 3.8

Activate the environment inside the image: `source /opt/dtk-23.04/env.sh`

## Datasets

Training datasets: CCNet [67%], C4 [15%], GitHub [4.5%], Wikipedia [4.5%], Books [4.5%], ArXiv [2.5%], Stack Exchange [2%]. The Wikipedia and Books data include the following languages: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk.

Evaluation datasets: BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC, OpenBookQA, NaturalQuestions, TriviaQA, RACE, MMLU, BIG-bench hard, GSM8k, RealToxicityPrompts, WinoGender, CrowS-Pairs.

## Inference

### Build

```
mkdir build
cd build
cmake -DSM=70 -DCMAKE_BUILD_TYPE=Release -DBUILD_MULTI_GPU=ON -DCMAKE_CXX_COMPILER=nvcc ..
make -j12
```

### Model download

* [llama 7B](https://huggingface.co/decapoda-research/llama-7b-hf)
* [llama 13B](https://huggingface.co/decapoda-research/llama-13b-hf)
* [llama 30B](https://huggingface.co/decapoda-research/llama-30b-hf)
* [llama 65B](https://huggingface.co/decapoda-research/llama-65b-hf)

Model conversion:

```bash
python ../examples/cpp/llama/huggingface_llama_convert.py \
    -saved_dir=/data/models/llama-7b-infer/ \
    -in_file=/data/models/llama-7b-hf/ \
    -infer_gpu_num=1 -weight_data_type=fp16 -model_name=llama_7b
```

The example above converts llama-7b: `-in_file` is the input model path, `-saved_dir` is the output path, `-infer_gpu_num` is the tensor-parallel (TP) size used for inference, `-weight_data_type` is the inference data type, and `-model_name` is the model name. For other models, adjust the paths and `-model_name` accordingly.

### Running LLama-7b

1. Generate the `gemm_config.in` file, with data_type = 0 (FP32) or 1 (FP16):

   ```bash
   ./bin/gpt_gemm 1 1 20 32 128 11008 32000 1 1
   ```

   The arguments above correspond to:

   ```bash
   ./bin/gpt_gemm <batch_size> <beam_width> <max_input_len> <head_num> <size_per_head> <inter_size> <vocab_size> <data_type> <tensor_para_size>
   ```

2. Configure `../examples/cpp/llama/llama_config.ini`: if `data_type = 1` was used above, set `data_type = fp16`; if `data_type = 0`, set `data_type = fp32`. `tensor_para_size` must match the TP size used during model conversion, `model_name = llama_7B`, and `model_dir` points to the converted model weights. `request_batch_size` is the inference batch size and `request_output_len` is the output length. The starting input ids can be changed in `../examples/cpp/llama/start_ids.csv`.

3. Run:

   ```bash
   ./bin/llama_example
   ```

   The program reads the ids in `../examples/cpp/llama/start_ids.csv` as input tokens; the generated results are saved to the `out` file.

### Running LLama-13b

```bash
./bin/gpt_gemm 1 1 20 40 128 13824 32000 1 1
./bin/llama_example
```

### Running LLama-33b

```bash
./bin/gpt_gemm 1 1 20 52 128 17920 32000 1 2
mpirun --allow-run-as-root -np 2 ./bin/llama_example
```

### Running LLama-65b

```bash
./bin/gpt_gemm 1 1 20 64 128 22016 32000 1 8
mpirun --allow-run-as-root -np 8 ./bin/llama_example
```

### Parameter configuration notes

The llama-33b model (the `llama-30b-hf` checkpoint) requires 2 cards (32 GB each) for fp16 inference; the llama-65b model requires 8 cards (32 GB each).
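The `gpt_gemm` argument lists used above can be derived mechanically from each model's `config.json`. The following is a small illustrative sketch, not a script shipped with this repository; the `config.json` field names follow HuggingFace LLaMA checkpoints (they match the mapping table in the next paragraph), and the path and default values mirror the llama-7b invocation above:

```python
# Hypothetical helper (not part of this repository): derive the gpt_gemm
# arguments from a HuggingFace config.json, following the argument order
# documented in "Running LLama-7b" above.
import json

def gpt_gemm_args(config_path, batch_size=1, beam_width=1,
                  max_input_len=20, data_type=1, tensor_para_size=1):
    with open(config_path) as f:
        cfg = json.load(f)
    head_num = cfg["num_attention_heads"]
    size_per_head = cfg["hidden_size"] // head_num
    return [batch_size, beam_width, max_input_len, head_num, size_per_head,
            cfg["intermediate_size"], cfg["vocab_size"], data_type,
            tensor_para_size]

if __name__ == "__main__":
    # For llama-7b this prints: ./bin/gpt_gemm 1 1 20 32 128 11008 32000 1 1
    args = gpt_gemm_args("/data/models/llama-7b-hf/config.json")
    print("./bin/gpt_gemm " + " ".join(str(a) for a in args))
```

For llama-65b, for example, passing `tensor_para_size=8` and the llama-65b `config.json` reproduces the `1 1 20 64 128 22016 32000 1 8` invocation shown above.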
More generally, for a LLaMA model downloaded from HuggingFace, inspect its `config.json` file; the left-hand side below is the FasterTransformer parameter and the right-hand side is the corresponding field in `config.json`:

```bash
head_num         = num_attention_heads
size_per_head    = hidden_size / num_attention_heads
inter_size       = intermediate_size
num_layer        = num_hidden_layers
rotary_embedding = size_per_head
layernorm_eps    = rms_norm_eps
vocab_size       = vocab_size
```

## Result

The generated results are written to the `out` file under the `build/` directory:

```
build/
  out
```

Run the following commands to decode the ids in `out` back into text (a minimal decoding sketch is also included at the end of this document):

```bash
pip install sentencepiece
python ../examples/cpp/llama/llama_tokenizer.py <tokenizer>
```

where `<tokenizer>` is the path to the original model.

## Accuracy

Test input: "I believe the meaning of life is" (token ids: 306, 4658, 278, 6593, 310, 2834, 338). Accelerator used: 1× DCU-Z100L-32G.

Test configuration:

| Data type | Batch size | Temperature | Input length | Output length |
| :------: | :------: | :------: | :------: | :------: |
| fp16 | 1 | 0 | 7 | 256 |

Output:

```
I believe the meaning of life is to live it to the fullest. I believe that we are all here for a reason and that we are all here to help each other. I believe that we are all here to learn and grow and that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here
```

## Application scenarios

### Algorithm category

NLP

### Key application industries

Finance, research, education

## Source repository and issue feedback

* https://developer.hpccube.com/codes/modelzoo/llama_ft

## References

* [https://github.com/NVIDIA/FasterTransformer](https://github.com/NVIDIA/FasterTransformer)
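As a companion to the Result section above, here is a minimal decoding sketch. It is not the actual `llama_tokenizer.py`; it assumes that the `out` file contains one sequence of token ids per line (whitespace- or comma-separated) and that the original model directory contains a sentencepiece `tokenizer.model`:

```python
# Minimal decoding sketch (an assumption-laden stand-in, not the actual
# llama_tokenizer.py): read token ids from the out file and decode them
# with the original model's sentencepiece tokenizer.
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("/data/models/llama-7b-hf/tokenizer.model")  # tokenizer of the original model

with open("out") as f:  # run from the build/ directory
    for line in f:
        ids = [int(tok) for tok in line.replace(",", " ").split()]
        if ids:
            print(sp.DecodeIds(ids))
```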