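NVIDIA NeMo is an open-source training framework built on PyTorch and PyTorch Lightning; its source code is publicly available on GitHub. NeMo's main goal is to let AI developers quickly build conversational AI models and develop related applications.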

It currently supports pre-training and fine-tuning (SFT, LoRA, etc.) of GPT-style models.

# 1. Docker setup

Latest available image: torch2.4.1-py3.10-dtk25.04-beta-das-alpha (image ID `ce83b4a462d9`; it ships with transformer_engine 1.8, so no separate installation is needed).

Clone the project: `git clone http://developer.sourcefind.cn/codes/sugon_wxj/nemo.git`

Start the container:
```bash
docker run -it \
    --shm-size=32G \
    --device=/dev/kfd \
    --device=/dev/mkfd \
    --device=/dev/dri \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --ulimit memlock=-1:-1 \
    --ipc=host \
    --network=host \
    --group-add video \
    --privileged \
    --name nemo_dtk25.4 \
    -v /opt/hyhal:/opt/hyhal \
    -v /path/to/data/:/data \
    -v /path/to/workspace/:/workspace \
    ce83b4a462d9 \
    /bin/bash
```

Install the dependencies:
```bash
cd /workspace/nemo
# Install the dependencies and NeMo
cd nemo_dtk25-2.0.0.rc0.beta
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install . -i https://pypi.tuna.tsinghua.edu.cn/simple

# Install Megatron-LM core
cd .. && cd Megatron-LM-core_r0.7.0.beta
pip install . -i https://pypi.tuna.tsinghua.edu.cn/simple
```
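After installation, it can help to confirm that the packages are importable from inside the container. The snippet below is a minimal sketch, not part of the project: the module names `nemo`, `megatron.core`, and `transformer_engine` are assumed import names for the packages installed above, and `is_installed` and `report` are hypothetical helpers.

```python
import importlib.util

# Assumed import names for the packages installed in the steps above.
EXPECTED_MODULES = ["torch", "nemo", "megatron.core", "transformer_engine"]


def is_installed(module_name: str) -> bool:
    """Return True if the module can be located on the current Python path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except Exception:
        # Looking up a submodule imports its parent package first, so a
        # missing or broken parent raises here instead of returning None.
        return False


def report(modules=EXPECTED_MODULES) -> dict:
    """Map each module name to whether it was found."""
    return {name: is_installed(name) for name in modules}


if __name__ == "__main__":
    for name, found in report().items():
        print(f"{name:20s} {'OK' if found else 'MISSING'}")
```

A `MISSING` entry only means the module could not be located; it does not say why the installation failed.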

# 2. Download and convert the model weights

Download the `llama2-7b-hf` model weights from `ModelScope` or `Hugging Face`, then convert them with the checkpoint conversion script that NeMo provides:

```bash
python ./NeMo-2.0.0.rc0.beta/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
    --input_name_or_path=./llama2-7b-hf/ \
    --output_path=./llama2-7b.nemo
```
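To sanity-check the converted checkpoint, one option is to list what it contains. The sketch below assumes the `.nemo` file is a tar archive, which is how NeMo typically packages checkpoints; `list_nemo_archive` is a hypothetical helper, not a NeMo API.

```python
import tarfile


def list_nemo_archive(path: str) -> list[str]:
    """Return the member names of a .nemo file, assuming it is a tar archive."""
    if not tarfile.is_tarfile(path):
        raise ValueError(f"{path} is not a tar archive; the conversion may have failed")
    # "r:*" lets tarfile detect the compression (if any) automatically.
    with tarfile.open(path, "r:*") as archive:
        return archive.getnames()
```

For example, `print(list_nemo_archive("./llama2-7b.nemo"))` would print the files stored in the checkpoint produced by the command above.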

# 3. Download and process the dataset

Download the `databricks-dolly-15k` dataset from `ModelScope` or `Hugging Face`, then process it with the dataset preprocessing script that NeMo provides.

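Dataset preprocessing script: https://github.com/NVIDIA/NeMo-Framework-Launcher/blob/main/launcher_scripts/nemo_launcher/collections/dataprep_scripts/dolly_dataprep/preprocess.py
The script converts each record from the raw Dolly fields (`instruction`, `context`, `response`, `category`) to the form `{'input': '', 'output': '', 'category': ''}`: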
```bash
python ./NeMo-2.0.0.rc0.beta/scripts/dataset_processing/nlp/dolly_dataprep/preprocess.py \
    --input databricks-dolly-15k/databricks-dolly-15k.jsonl
```

The first line of the output file may look like this:
```bash
head -n 1 databricks-dolly-15k/databricks-dolly-15k-output.jsonl
{"input": "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.\n\nWhen did Virgin Australia start operating?", "output": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.", "category": "closed_qa"}
```
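The mapping behind the sample line above can be sketched as follows. This is an illustrative reimplementation, not the actual `preprocess.py`: the helpers `dolly_to_sft` and `convert_lines` are hypothetical, and placing the context before the instruction is an assumption taken from the sample (the real script may order them differently).

```python
import json


def dolly_to_sft(record: dict) -> dict:
    """Map one raw Dolly record to the input/output layout shown above."""
    instruction = record["instruction"].strip()
    context = record.get("context", "").strip()
    # Assumption: when a context exists it is placed before the instruction,
    # separated by a blank line, as in the sample output line.
    prompt = f"{context}\n\n{instruction}" if context else instruction
    return {
        "input": prompt,
        "output": record["response"],
        "category": record["category"],
    }


def convert_lines(lines):
    """Yield one converted JSON string per non-empty input line."""
    for line in lines:
        if line.strip():
            yield json.dumps(dolly_to_sft(json.loads(line)))
```

Passing an open file handle for `databricks-dolly-15k.jsonl` to `convert_lines` would yield one JSON string per record, ready to be written out line by line.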

Then split the dataset with the splitting script (in an 80:15:5 ratio):
```bash
python ./NeMo-2.0.0.rc0.beta/scripts/dataset_processing/nlp/dolly_dataprep/dolly_dataspilt.py \
    --input ./databricks-dolly-15k/
```
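An 80:15:5 split can be sketched as below. This is a minimal illustration rather than the project's script: `split_records` is a hypothetical helper, the shuffle and fixed seed are assumptions, and so is the mapping of the three shares to training, validation, and test.

```python
import random


def split_records(records, percents=(80, 15, 5), seed=0):
    """Shuffle records and split them into (training, validation, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    # Integer arithmetic avoids floating-point rounding in the split sizes.
    n_train = len(shuffled) * percents[0] // 100
    n_valid = len(shuffled) * percents[1] // 100
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # the remainder, so no record is dropped
    return training, validation, test
```

Assigning the remainder to the test split means every record lands in exactly one split, even when the record count is not a multiple of 100.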

This leaves five JSONL files in total:
```bash
# ls /data/nemo_dataset/databricks-dolly-15k
databricks-dolly-15k.jsonl
databricks-dolly-15k-output.jsonl
training.jsonl
validation.jsonl
test.jsonl
```

# 4. Run the SFT fine-tuning script

Edit the `MODEL`, `TRAIN_DS`, `VALID_DS`, `TEST_DS`, and related variables in `K100AI_finetune.sh` so that they point to your actual paths.

Run the fine-tuning script:
Single node, eight cards: `bash K100AI_finetune.sh >& K100AI_finetune.log`
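
Before launching a long run, it may be worth checking that the files referenced by `TRAIN_DS`, `VALID_DS`, and `TEST_DS` are well-formed. The sketch below is a hypothetical pre-flight check, not part of the project; it assumes each line should be a JSON object with `input` and `output` keys, as in the preprocessed sample.

```python
import json


def validate_sft_jsonl(path: str, required=("input", "output")) -> int:
    """Return the number of records; raise ValueError on the first bad line."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc
            if not isinstance(record, dict):
                raise ValueError(f"{path}:{line_no}: expected a JSON object")
            missing = [key for key in required if key not in record]
            if missing:
                raise ValueError(f"{path}:{line_no}: missing keys {missing}")
            count += 1
    return count
```

Calling it on `training.jsonl`, `validation.jsonl`, and `test.jsonl` in turn gives the record count of each split, which should roughly match the 80:15:5 ratio.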


