README.md

# Open and Efficient Foundation Language Models(LLAMA)


## 模型介绍
LLaMA，这是一个基础语言模型的集合，参数范围从7B到65B。在数万亿的tokens上训练出的模型，并表明可以专门使用公开可用的数据集来训练最先进的模型，而不依赖于专有的和不可访问的数据集。特别是，llama 13B在大多数基准测试中优于GPT-3 (175B)， LLaMA 65B与最好的模型Chinchilla-70B和PaLM-540B具有竞争力。
## 模型结构
LLAMA网络基于 Transformer 架构。提出了各种改进，并用于不同的模型，例如 PaLM。以下是与原始架构的主要区别：

**预归一化**。为了提高训练稳定性，对每个transformer 子层的输入进行归一化，而不是对输出进行归一化。使用 RMSNorm 归一化函数。

**SwiGLU 激活函数 [PaLM]**。使用 SwiGLU 激活函数替换 ReLU 非线性以提高性能。使用 2 /3 4d 的维度而不是 PaLM 中的 4d。

**旋转嵌入**。移除了绝对位置嵌入，而是添加了旋转位置嵌入 (RoPE)，在网络的每一层。

以下是llama-13B的主要网络参数配置：

```
"hidden_act": "silu", 
"hidden_size": 5120, 
"intermediate_size": 13824, 
"initializer_range": 0.02, 
"max_sequence_length": 2048, 
"model_type": "llama", 
"num_attention_heads": 40, 
"num_hidden_layers": 40, 
"rms_norm_eps": 1e-06, 
"torch_dtype": "float16", 
"vocab_size": 32000
```

## 数据集
我们在Fastchat目录下集成了英文对话数据集供用户快速验证：

    ./FastChat-main/playground/data/alpaca-data-conversation.json


## LLAMA-13B微调（slurm）

### 环境配置

要求DCU集群Slurm环境正常。

依赖开发者社区torch1.10，deepspeed 0.6.3，apex0.1（可选）：https://developer.hpccube.com/tool/

推荐用户使用预编译好的python3.8包来快速建立python3虚拟环境：

    cp -r slurm/* ./
    根据当前系统更改env.sh中相关路径
    virtualenv -p /python_bin_path/python3 --system-site-packages venv_torch3.8
    source env.sh	#进入venv_torch3.8虚拟环境
    
    pip3 install --upgrade pip -i https://mirrors.aliyun.com/pypi/simple	#更新pip
    cd FastChat-main
    pip3 install -e .
    cd ../transformers-main
    pip3 install -e .
    cd ..
    pip3 install  torch-1.10.0a0+git2040069.dtk2210-cp38-cp38-manylinux2014_x86_64.whl
    pip3 install  deepspeed-0.6.3+1b2721a.dtk2210-cp38-cp38-manylinux2014_x86_64.whl
    pip3 install  apex-0.1+gitdb7007a.dtk2210-cp38-cp38-manylinux2014_x86_64.whl（可选）

### 训练

该训练脚本需要8节点，每节点4张DCU-Z100-16G。

并行配置采用zero3，使用fp16精度微调，如果想使能apex adamw_apex_fused优化器，更改./FastChat-main/fastchat/train/train.py:55行优化器改成adamw_apex_fused。deepspeed config.json如下：

```
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps":4,
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 3,
    "cpu_offload": false,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": false,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients" : true
  }
}
```

进入登陆节点，微调命令：

    source submit_job.sh
    tail -f log/xxx.out.log	#查看输出log
    tail -f log/xxx.err.log	#查看错误log


## LLAMA-13B微调（无slurm，使用mpi）

### 环境配置

2节点16卡Z00L裸金属节点，要求dtk22.10.1环境正常，mpirun文件夹下包含预编译好的openmpi库mpi4.tar.gz，可直接使用：

```
cp -r mpirun/* ./
根据当前系统更改env.sh中相关路径
cd FastChat-main
pip3 install -e .
cd ../transformers-main
pip3 install -e .
cd ..
pip3 install  torch-1.10.0a0+git2040069.dtk2210-cp38-cp38-manylinux2014_x86_64.whl
pip3 install  deepspeed-0.6.3+1b2721a.dtk2210-cp38-cp38-manylinux2014_x86_64.whl
pip3 install  apex-0.1+gitdb7007a.dtk2210-cp38-cp38-manylinux2014_x86_64.whl（可选）
```

### 训练

该训练脚本需要2节点，每节点8张DCU-Z100L-32G。

并行配置采用zero3，使用fp16精度微调，如果想使能apex adamw_apex_fused优化器，更改./FastChat-main/fastchat/train/train.py:55行优化器改成adamw_apex_fused。deepspeed config.json如下：

```
{
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps":16,
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 3,
    "cpu_offload": false,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": false,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients" : true
  }
}
```

进入节点1，根据环境修改hostfile，保证两节点文件路径一致，配置相同，修改mpi_job.sh中--mca btl_tcp_if_include enp97s0f1，enp97s0f1改为ip -a命令后对应节点ip的网卡名，numa可以根据当前节点拓扑更改绑定，微调命令：

```
source mpi_job.sh
```

### 模型精度

训练数据：[./FastChat-main/playground/data/alpaca-data-conversation.json](链接)

使用的GPGPU：16张DCU-Z100L-32G。

模型精度（max_sequence_length: 2048）：
| 卡数 | 分布式工具 | 收敛性 |
| :------: | :------: |:------: |
| 16 | deepspeed | total_loss: 0.62/150 steps |
## 源码仓库及问题反馈

- https://developer.hpccube.com/codes/modelzoo/llama_torch

## 参考

* https://huggingface.co/decapoda-research/llama-13b-hf
* https://github.com/lm-sys/FastChat