
# LLama_FT

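## Model Introduction
LLaMA is a collection of foundation language models ranging from 7B to 65B parameters. The models are trained on trillions of tokens and show that state-of-the-art models can be trained using publicly available datasets exclusively, without relying on proprietary and inaccessible datasets.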

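## Model Architecture

The LLaMA network is based on the Transformer architecture and adopts several improvements that were proposed later and are also used in other models such as PaLM. The main differences from the original architecture are:

* Pre-normalization. To improve training stability, the input of each transformer sub-layer is normalized instead of the output. The RMSNorm normalization function is used.
* SwiGLU activation function [PaLM]. The ReLU non-linearity is replaced by the SwiGLU activation function to improve performance. The hidden dimension is 2/3 · 4d instead of the 4d used in PaLM.
* Rotary embeddings. The absolute positional embeddings are removed, and rotary positional embeddings (RoPE) are added at every layer of the network instead.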
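
The first two points can be illustrated with a minimal pure-Python sketch. It is not the FasterTransformer kernel code; the function names are hypothetical, and rounding the feed-forward width up to a multiple of 256 is an assumption borrowed from the public LLaMA reference implementation rather than something stated in this README. RoPE is not covered here.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then apply a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def pre_norm_sublayer(x, sublayer, weight):
    """Pre-normalization: normalize the sub-layer input, then add the residual."""
    y = sublayer(rms_norm(x, weight))
    return [a + b for a, b in zip(x, y)]

def swiglu_hidden_dim(d, multiple_of=256):
    """Feed-forward width 2/3 * 4d, rounded up to a multiple of `multiple_of` (assumption)."""
    hidden = int(2 * (4 * d) / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

print(swiglu_hidden_dim(4096))  # 11008
```

Under this rounding assumption, hidden sizes 5120, 6656 and 8192 give 13824, 17920 and 22016, which are the `inter_size` values passed to `gpt_gemm` for the larger models later in this document.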

## Inference

### Environment Setup

Pull the inference Docker image from [SourceFind (光源)](https://www.sourcefind.cn/#/service-details):
* Inference image: `docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:fastertransformer-dtk23.04-latest`

Activate the image environment:
`source /opt/dtk-23.04/env.sh`


### Build

```bash
source /opt/dtk-23.04/env.sh
mkdir build
cd build
cmake -DSM=70 -DCMAKE_BUILD_TYPE=Release -DBUILD_MULTI_GPU=ON -DCMAKE_CXX_COMPILER=nvcc ..
make -j12
```

### Model Download

* [LLaMA 7B](https://huggingface.co/decapoda-research/llama-7b-hf)
* [LLaMA 13B](https://huggingface.co/decapoda-research/llama-13b-hf)
* [LLaMA 30B](https://huggingface.co/decapoda-research/llama-30b-hf)
* [LLaMA 65B](https://huggingface.co/decapoda-research/llama-65b-hf)

Model conversion:

```bash
python ../examples/cpp/llama/huggingface_llama_convert.py \
-saved_dir=/data/models/llama-7b-infer/ \
-in_file=/data/models/llama-7b-hf/ \
-infer_gpu_num=1 -weight_data_type=fp16 -model_name=llama_7b
```

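The example above converts llama-7b: `-in_file` is the input model path, `-saved_dir` is the output path, `-infer_gpu_num` is the tensor-parallel (TP) size used for inference, `-weight_data_type` is the inference data type, and `-model_name` is the model name. For other models, change the paths and `-model_name` accordingly.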
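
To adapt the conversion to another model, it can help to build the command from its parameters. The sketch below only assembles the argument list and does not run anything. The helper name `build_convert_command`, the 13B paths, and the `llama_13b` model name are hypothetical and simply follow the pattern of the 7B example.

```python
import shlex

def build_convert_command(in_dir, out_dir, model_name, tp=1, dtype="fp16",
                          script="../examples/cpp/llama/huggingface_llama_convert.py"):
    """Return the argv list for the conversion script, using the flags shown above."""
    return [
        "python", script,
        f"-saved_dir={out_dir}",
        f"-in_file={in_dir}",
        f"-infer_gpu_num={tp}",          # must match the TP size used at inference time
        f"-weight_data_type={dtype}",
        f"-model_name={model_name}",
    ]

# Hypothetical paths that mirror the 7B example.
cmd = build_convert_command("/data/models/llama-13b-hf/",
                            "/data/models/llama-13b-infer/",
                            "llama_13b")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the `build` directory.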

### Run LLama-7b

1. Generate the `gemm_config.in` file

`data_type` = 0 (FP32) or 1 (FP16)

```bash
./bin/gpt_gemm 1 1 20 32 128 11008 32000 1 1
```

The arguments above correspond to:

```bash 
./bin/gpt_gemm <batch_size> <beam_width> <max_input_len> <head_number> <size_per_head> <inter_size> <vocab_size> <data_type> <tensor_para_size> 
```
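
Because `gpt_gemm` takes nine positional arguments, it is easy to swap two of them. One way to keep them straight is to name each field and let a helper produce the command line. The `GemmArgs` class below is a hypothetical convenience wrapper, not part of FasterTransformer; the values are the LLaMA-13B ones used later in this document.

```python
from dataclasses import dataclass, astuple

@dataclass
class GemmArgs:
    """Positional arguments of gpt_gemm, in the order listed above."""
    batch_size: int
    beam_width: int
    max_input_len: int
    head_number: int
    size_per_head: int
    inter_size: int
    vocab_size: int
    data_type: int          # 0 = FP32, 1 = FP16
    tensor_para_size: int

    def command(self, binary="./bin/gpt_gemm"):
        # astuple() preserves the field declaration order.
        return " ".join([binary] + [str(v) for v in astuple(self)])

llama_13b = GemmArgs(batch_size=1, beam_width=1, max_input_len=20,
                     head_number=40, size_per_head=128, inter_size=13824,
                     vocab_size=32000, data_type=1, tensor_para_size=1)
print(llama_13b.command())
```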

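2. Configure `../examples/cpp/llama/llama_config.ini`

When the `data_type` passed to `gpt_gemm` is 1, set `data_type = fp16`; when it is 0, set `data_type = fp32`. `tensor_para_size` must match the TP size used during model conversion. Set `model_name=llama_7B`; `model_dir` points to the converted model weights; `request_batch_size` is the inference batch size; `request_output_len` is the output length. Edit `../examples/cpp/llama/start_ids.csv` to change the input start IDs.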
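
The consistency rules above (data type and TP size) can be checked with a few lines of Python before launching. This is a sketch only: the INI excerpt, its section names (`ft_instance_hyperparameter`, `request`) and the `request_output_len` value are assumptions made for illustration and may differ from the actual `llama_config.ini` in this repository.

```python
import configparser

# Hypothetical excerpt; the section layout and values are assumptions for illustration.
SAMPLE_INI = """
[ft_instance_hyperparameter]
data_type = fp16
tensor_para_size = 1
model_name = llama_7B
model_dir = /data/models/llama-7b-infer/

[request]
request_batch_size = 1
request_output_len = 32
"""

# gpt_gemm data_type value -> data_type string expected in the config file
GEMM_DATA_TYPES = {0: "fp32", 1: "fp16"}

def check_config(ini_text, gemm_data_type, convert_tp):
    """Return a list of mismatches between the config and the earlier steps."""
    cfg = configparser.ConfigParser()
    cfg.read_string(ini_text)
    inst = cfg["ft_instance_hyperparameter"]
    problems = []
    expected = GEMM_DATA_TYPES[gemm_data_type]
    if inst["data_type"] != expected:
        problems.append(f"data_type should be {expected}")
    if inst.getint("tensor_para_size") != convert_tp:
        problems.append(f"tensor_para_size should be {convert_tp}")
    return problems

print(check_config(SAMPLE_INI, gemm_data_type=1, convert_tp=1))  # []
```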

3. Run

```bash
./bin/llama_example
```
The program reads the IDs in `../examples/cpp/llama/start_ids.csv` as input tokens and saves the generated result to `.out`.
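
A custom input file can be produced with Python's `csv` module. The sketch below assumes that each row of `start_ids.csv` holds the comma-separated token IDs of one request; that layout is an assumption, so compare it with the file shipped in the repository. The token IDs are arbitrary placeholders, not the output of a real tokenizer.

```python
import csv
import io

def write_start_ids(rows, fileobj):
    """Write one request per line as comma-separated token IDs (assumed layout)."""
    writer = csv.writer(fileobj, lineterminator="\n")
    for ids in rows:
        writer.writerow(ids)

# Placeholder token IDs for two requests.
requests = [[1, 100, 200], [1, 300]]

buf = io.StringIO()
write_start_ids(requests, buf)
print(buf.getvalue(), end="")
```

To write the real file, replace `buf` with `open("../examples/cpp/llama/start_ids.csv", "w", newline="")`.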


### Run LLama-13b

```bash
./bin/gpt_gemm 1 1 20 40 128 13824 32000 1 1
./bin/llama_example
```

### Run LLama-33b

```bash
./bin/gpt_gemm 1 1 20 52 128 17920 32000 1 2
mpirun --allow-run-as-root -np 2 ./bin/llama_example
```

### Run LLama-65b

```bash
./bin/gpt_gemm 1 1 20 64 128 22016 32000 1 8
mpirun --allow-run-as-root -np 8 ./bin/llama_example 
```
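
In the multi-card commands above, the `-np` value passed to `mpirun` equals the last `gpt_gemm` argument (`tensor_para_size`). The hypothetical helper below builds the launch line from the TP size. It also checks that the head count is divisible by the TP size, which is a common requirement of tensor parallelism and is assumed, not confirmed, to apply here.

```python
def launch_command(head_num, tensor_para_size, binary="./bin/llama_example"):
    """Build the launch line; one MPI rank per tensor-parallel shard."""
    if head_num % tensor_para_size != 0:
        # Assumed constraint: attention heads are split evenly across cards.
        raise ValueError("head_num must be divisible by tensor_para_size")
    if tensor_para_size == 1:
        return binary
    return f"mpirun --allow-run-as-root -np {tensor_para_size} {binary}"

print(launch_command(head_num=64, tensor_para_size=8))
```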

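### Parameter Configuration Notes

With fp16 inference, the llama-33b model requires 2 cards (32G each) and the llama-65b model requires 8 cards (32G each).

After downloading a llama model from huggingface, check its config.json file. In the mapping below, the left side is the fastertransformer parameter and the right side is the corresponding parameter in config.json.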

```bash
head_num=num_attention_heads
size_per_head=hidden_size / num_attention_heads
inter_size=intermediate_size
num_layer=num_hidden_layers
rotary_embedding=size_per_head
layernorm_eps=rms_norm_eps
vocab_size=vocab_size
```
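
The mapping above can be applied mechanically. The sketch below derives the fastertransformer parameters from a parsed config.json; the function name `ft_params_from_hf` is hypothetical, and the JSON excerpt uses the values commonly published for LLaMA-7B as an illustration, so check them against the file you actually downloaded.

```python
import json

def ft_params_from_hf(config):
    """Map Hugging Face config.json fields to fastertransformer parameters."""
    head_num = config["num_attention_heads"]
    size_per_head = config["hidden_size"] // head_num
    return {
        "head_num": head_num,
        "size_per_head": size_per_head,
        "inter_size": config["intermediate_size"],
        "num_layer": config["num_hidden_layers"],
        "rotary_embedding": size_per_head,
        "layernorm_eps": config["rms_norm_eps"],
        "vocab_size": config["vocab_size"],
    }

# Illustrative excerpt of a 7B config.json (values are assumptions to verify).
hf_config = json.loads("""
{
  "hidden_size": 4096,
  "intermediate_size": 11008,
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "rms_norm_eps": 1e-06,
  "vocab_size": 32000
}
""")

params = ft_params_from_hf(hf_config)
print(params["head_num"], params["size_per_head"], params["inter_size"])  # 32 128 11008
```

The `head_num`, `size_per_head`, `inter_size` and `vocab_size` values are the ones passed to `gpt_gemm`.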

## Source Repository and Issue Feedback
* https://developer.hpccube.com/codes/modelzoo/llama_ft

## References
* [https://github.com/NVIDIA/FasterTransformer](https://github.com/NVIDIA/FasterTransformer)