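# GLM (General Language Model Pretraining with Autoregressive Blank Infilling)
## Model Introduction
In 2017, Google proposed the Transformer architecture. Pretrained models such as BERT, GPT, and T5 followed and kept setting new SOTA records across tasks. In 2022, Tsinghua University proposed the GLM model (https://github.com/THUDM/GLM). Unlike those architectures, GLM uses an autoregressive blank infilling method and achieves good results on the three main categories of NLP tasks: natural language understanding, unconditional generation, and conditional generation.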

LiBai mainly implements the inference part of GLM.

## GLM-Inference
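When a model grows so large that its parameters no longer fit on a single GPU, the need for convenient distributed training and inference follows, and the industry has released tools to meet it.

LiBai, a model library built on OneFlow, lowers the barrier to distributed execution as far as possible: users do not need to care how the model is distributed across GPU devices and only have to change a few configuration values to select a different distributed strategy. It also delivers strong acceleration performance.

A GLM built with LiBai can easily run model parallel + pipeline parallel inference, which solves the problem of a large model not fitting on a single card.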

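### Natural Advantages of Distributed Inference

A model's parameters are just a collection of tensors, that is, matrices, so the parameters of a large model are large matrices. A parallel strategy splits a large matrix into several smaller ones and assigns them to different cards or devices. The basic linear layer is implemented in LiBai as follows: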

```python
class Linear1D(nn.Module):
    def __init__(self, in_features, out_features, parallel="data", layer_idx=0, ...):
        super().__init__()

        if parallel == "col":
            weight_sbp = dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.split(0)])
        elif parallel == "row":
            weight_sbp = dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.split(1)])
        elif parallel == "data":
            weight_sbp = dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast])
        else:
            raise KeyError(f"{parallel} is not supported! Only support ('data', 'row' and 'col')")

        self.weight = flow.nn.Parameter(
            flow.empty(
                (out_features, in_features),
                dtype=flow.float32,
                placement=dist.get_layer_placement(layer_idx),  # for pipeline parallelism placement
                sbp=weight_sbp,
            )
        )
        init_method(self.weight)
        ...
    
    def forward(self, x):
        ...
```

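Here the user chooses how to split the Linear layer's weight matrix and how to split the data matrix. OneFlow's SBP controls whether a matrix is split vertically, horizontally, or in some other way (model parallelism, data parallelism), while Placement controls which card this linear layer is placed on (pipeline parallelism).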

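Because of how the layers in LiBai are designed, and because every OneFlow tensor carries SBP and Placement attributes, a model built by the user can apply data parallelism, model parallelism, and pipeline parallelism with little effort.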
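
To make the two split directions concrete, here is a minimal pure-Python sketch that does not use OneFlow or LiBai. It assumes a weight of shape `(out_features, in_features)` as in `Linear1D` and simulates two devices with list slices. The helper `linear` and the toy values are illustrative only and are not part of LiBai.

```python
# Pure-Python sketch (no OneFlow): splitting a Linear weight across two "devices".
# The weight has shape (out_features, in_features) and the layer computes y = x @ W^T.

def linear(x, w):
    """Compute y[j] = sum_k x[k] * w[j][k] for a single input vector x."""
    return [sum(xk * wjk for xk, wjk in zip(x, row)) for row in w]

x = [1.0, 2.0, 3.0, 4.0]          # in_features = 4
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0]]        # out_features = 2

# Reference result on a single device.
full = linear(x, w)

# Split along dimension 0 (out_features): each device owns some output rows
# and sees the whole input. The per-device outputs are concatenated.
col_out = linear(x, w[:1]) + linear(x, w[1:])

# Split along dimension 1 (in_features): each device owns half of every row
# and the matching half of the input. The partial outputs are summed.
part_a = linear(x[:2], [row[:2] for row in w])
part_b = linear(x[2:], [row[2:] for row in w])
row_out = [a + b for a, b in zip(part_a, part_b)]

print(full, col_out, row_out)
```

Splitting along dimension 0 requires concatenating the per-device outputs, while splitting along dimension 1 requires summing partial results. In this sketch both steps are written by hand; in a real run the SBP signatures describe the layout and the framework handles the data movement.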

## GLM-10B-chinese Inference
### Environment Setup
A docker image for training and inference can be pulled from [SourceFind (光源)](https://www.sourcefind.cn/#/service-details): image.sourcefind.cn:5000/dcu/admin/base/oneflow:0.9.1-centos7.6-dtk-22.10.1-py39-latest

    cd libai
    pip3 install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple
    pip3 install pybind11 -i https://mirrors.aliyun.com/pypi/simple
    pip3 install -e . -i https://mirrors.aliyun.com/pypi/simple

Prepare the model weights first: https://huggingface.co/THUDM/glm-10b-chinese/tree/main

### GLM-10B-chinese File Structure

```
$ tree data
path/to/glm-10b-chinese
├── added_tokens.json
├── cog-pretrain.model
├── config.json
└── pytorch_model.bin
```
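
Before launching inference it can help to confirm that the download is complete. The following is a small sketch using only the Python standard library; the helper name `missing_files` is hypothetical, and the expected file names are simply those in the listing above.

```python
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED_FILES = (
    "added_tokens.json",
    "cog-pretrain.model",
    "config.json",
    "pytorch_model.bin",
)

def missing_files(model_dir):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]

if __name__ == "__main__":
    # Replace the placeholder path with the real weight directory.
    missing = missing_files("path/to/glm-10b-chinese")
    print("missing files:", missing or "none")
```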

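### Inference

Inference uses one node with 4 DCU-Z100-16G cards and a parallel configuration of tp=2 (tensor parallel size) and pp=2 (pipeline parallel size).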
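
As a rough illustration of what tp=2 and pp=2 mean on 4 cards with 48 layers, the sketch below divides devices and layers into contiguous pipeline stages. The function `plan_pipeline` is hypothetical, and the contiguous layout is an assumption made for illustration; the actual assignment is decided inside LiBai and may differ.

```python
def plan_pipeline(num_devices, tensor_parallel_size, pipeline_parallel_size, num_layers):
    """Return (data_parallel_size, stages) for a simple contiguous layout (an assumption)."""
    model_parallel = tensor_parallel_size * pipeline_parallel_size
    if num_devices % model_parallel != 0:
        raise ValueError("num_devices must be divisible by tp * pp")
    data_parallel_size = num_devices // model_parallel

    devices_per_stage = num_devices // pipeline_parallel_size
    # Ceiling division so that every layer is assigned to some stage.
    layers_per_stage = -(-num_layers // pipeline_parallel_size)

    stages = []
    for stage in range(pipeline_parallel_size):
        first = stage * layers_per_stage
        last = min(first + layers_per_stage, num_layers) - 1
        devices = list(range(stage * devices_per_stage, (stage + 1) * devices_per_stage))
        stages.append({"devices": devices, "layers": (first, last)})
    return data_parallel_size, stages

dp, stages = plan_pipeline(
    num_devices=4,
    tensor_parallel_size=2,
    pipeline_parallel_size=2,
    num_layers=2 * 24,
)
print("data_parallel_size:", dp)
for index, stage in enumerate(stages):
    print(f"stage {index}: devices {stage['devices']}, layers {stage['layers']}")
```

Under these assumptions each pipeline stage holds 24 layers on 2 cards, and the weights within a stage are split across those 2 cards by tensor parallelism.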

Run the following commands:

    cd projects/GLM
    # Before running, set `pad_token_id=50000, eos_token_id=50007, bos_token_id=None` in configs/glm_inference.py
    python3 -m oneflow.distributed.launch --nproc_per_node 4 demo.py

demo.py is as follows:

    # model parallel + pipeline parallel demo
    
    import oneflow as flow
    from projects.GLM.tokenizer.glm_tokenizer import GLMChineseTokenzier
    from libai.utils import distributed as dist
    from projects.GLM.configs.glm_inference import cfg
    from projects.GLM.modeling_glm import GLMForConditionalGeneration
    from projects.GLM.utils.glm_loader import GLMLoaderHuggerFace
    from omegaconf import DictConfig
    import time
    
    # Only a simple parallel configuration is needed
    parallel_config = DictConfig(
        dict(
            data_parallel_size=1,
            tensor_parallel_size=2,
            pipeline_parallel_size=2,
            pipeline_num_layers=2 * 24
        )
    )
    dist.setup_dist_util(parallel_config)
    
    tokenizer = GLMChineseTokenzier.from_pretrained("glm-10b-chinese")
    input_ids = tokenizer.encode(
        [
            "冬天，中国哪座城市最适合避寒？问题描述：能推荐一些国内适合冬天避寒的城市吗？回答用户：旅游爱好者 回答： [gMASK]"
        ],
        return_tensors="of",
    )
    inputs = {"input_ids": input_ids, "attention_mask": flow.ones(input_ids.size())}
    inputs = tokenizer.build_inputs_for_generation(inputs, max_gen_length=128)
    
    sbp = dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast])
    placement = dist.get_layer_placement(0)
    
    loader = GLMLoaderHuggerFace(
        GLMForConditionalGeneration, 
        cfg, 
        "glm-10b-chinese",
        embedding_dropout_prob=0,
        attention_dropout_prob=0,
        output_dropout_prob=0,
    )
    
    T1 = time.time()
    model = loader.load()
    T2 = time.time()
    if dist.is_main_process():
        print('模型加载时间:%s秒' % (T2 - T1))
    
    T3 = time.time()
    outputs = model.generate(
        inputs=inputs['input_ids'].to_global(sbp=sbp, placement=placement), 
        position_ids=inputs['position_ids'].to_global(sbp=sbp, placement=placement), 
        generation_attention_mask=inputs['generation_attention_mask'].to_global(sbp=sbp, placement=placement), 
        max_length=128
    )
    T4 = time.time()
    if dist.is_main_process():
        print('model.generate: %s秒' % (T4 - T3))
    
    T5 = time.time()
    res = tokenizer.decode(outputs[0])
    T6 = time.time()
    if dist.is_main_process():
        print('tokenizer.decode: %s秒' % (T6 - T5))
    
    if dist.is_main_process():
        print(res)

Output:

```
>>>Total number of model parameters: 9,879,633,920
模型加载时间:59.47162699699402秒
model.generate: 72.28496813774109秒
tokenizer.decode: 0.0698804759979248秒
[CLS] 冬天,中国哪座城市最适合避寒?问题描述:能推荐一些国内适合冬天避寒的城市吗?回答用户:旅游爱好者 回答: [gMASK] <|endoftext|> <|startofpiece|> 避寒,当然是去海南呀!<n><n>海南的冬天,阳光明媚,温度适宜,而且空气清新,没有雾霾,没有沙尘暴,没有雾霾,没有雾霾!<n><n>海南的冬天,阳光明媚,温度适宜,而且空气清新,没有雾霾,没有沙尘暴,没有雾霾!<n><n>海南的冬天,阳光明媚,温度适宜,而且空气清新,没有雾霾,没有沙尘暴,没有雾霾!
```

## Performance and Accuracy Data

Accelerators used: 4 DCU-Z100-16G cards.

| bs | max_input_length | max_gen_length | model.generate time (s) |
| :------: | :------: | :------: | :------: |
| 1 | 128 | 128 | 72.2 |
| 1 | 512 | 512 | 201.3 |
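
The timings above can be turned into an approximate generation rate. The sketch below assumes that exactly `max_gen_length` tokens were generated in each run and that the measured time covers the whole `model.generate` call, including prompt processing; neither is stated in the table, so the figures are rough estimates only.

```python
# (max_gen_length, model.generate time in seconds) copied from the table above.
measurements = [(128, 72.2), (512, 201.3)]

def tokens_per_second(generated_tokens, seconds):
    """Average generation rate, assuming all generated_tokens were produced."""
    return generated_tokens / seconds

rates = [tokens_per_second(tokens, seconds) for tokens, seconds in measurements]
for (tokens, seconds), rate in zip(measurements, rates):
    ms_per_token = 1000 * seconds / tokens
    print(f"max_gen_length={tokens}: {rate:.2f} tokens/s, {ms_per_token:.0f} ms/token")
```
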

## References
* https://github.com/Oneflow-Inc/libai
* https://github.com/Oneflow-Inc/one-glm