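NVIDIA NeMo is an open-source training framework built on PyTorch and PyTorch Lightning, with its source code fully available on GitHub. Its main goal is to let AI developers quickly build conversational AI models and develop related applications.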

This repository currently supports pre-training and fine-tuning (SFT, LoRA, etc.) of GPT-style models.

# 1. Docker setup

Latest available image: torch2.4.1-py3.10-dtk25.04-beta-das-alpha (image ID ce83b4a462d9; it ships with transformer_engine 1.8, so no extra installation is needed)

Clone the project with git: `git clone http://developer.sourcefind.cn/codes/sugon_wxj/nemo.git`

Start the container:
```bash
docker run -it \
    --shm-size=32G \
    --device=/dev/kfd \
    --device=/dev/mkfd \
    --device=/dev/dri \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --ulimit memlock=-1:-1 \
    --ipc=host \
    --network=host \
    --group-add video \
    --privileged \
    --name nemo_dtk25.4 \
    -v /opt/hyhal:/opt/hyhal \
    -v /path/to/data/:/data \
    -v /path/to/workspace/:/workspace \
    ce83b4a462d9 \
    /bin/bash
```

Install the dependencies:
```bash
cd /workspace/nemo
# Install the dependencies and NeMo
cd nemo_dtk25-2.0.0.rc0.beta
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install . -i https://pypi.tuna.tsinghua.edu.cn/simple

# Install Megatron-LM core
cd .. && cd Megatron-LM-core_r0.7.0.beta
pip install . -i https://pypi.tuna.tsinghua.edu.cn/simple
```

# 2. Download and convert the model weights

Download the `llama2-7b-hf` model weights from `ModelScope` or `Hugging Face`, then convert them with the checkpoint conversion script that NeMo provides.
```bash
python ./NeMo-2.0.0.rc0.beta/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
    --input_name_or_path=./llama2-7b-hf/ \
    --output_path=./llama2-7b.nemo
```

# 3. Download and preprocess the dataset

Download the `databricks-dolly-15k` dataset from `ModelScope` or `Hugging Face`, then preprocess it with the data preparation script that NeMo provides.

Preprocessing script: https://github.com/NVIDIA/NeMo-Framework-Launcher/blob/main/launcher_scripts/nemo_launcher/collections/dataprep_scripts/dolly_dataprep/preprocess.py
The script simply converts the record format from `{'context': ''}` to `{'input': '', 'output': ''}`.
```bash
python ./NeMo-2.0.0.rc0.beta/scripts/dataset_processing/nlp/dolly_dataprep/preprocess.py \
    --input databricks-dolly-15k/databricks-dolly-15k.jsonl
```

The first line of the output file may look like this:
```bash
head -n 1 databricks-dolly-15k/databricks-dolly-15k-output.jsonl
{"input": "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.\n\nWhen did Virgin Australia start operating?", "output": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.", "category": "closed_qa"}
```
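
Judging from the sample line above, the conversion places the context passage before the question and moves the answer into `output`. The following is a minimal sketch of that mapping, not the actual `preprocess.py`. It assumes the raw records use the public dataset's field names `instruction`, `context`, `response` and `category`; the function names `convert_record` and `convert_file` are hypothetical.

```python
import json


def convert_record(record: dict) -> dict:
    """Map one raw Dolly record to the input/output layout of the sample line."""
    instruction = record["instruction"]
    context = record.get("context", "").strip()
    # Assumption: when a context passage exists, it precedes the instruction
    # and is separated from it by a blank line, as in the sample output.
    prompt = f"{context}\n\n{instruction}" if context else instruction
    return {
        "input": prompt,
        "output": record["response"],
        "category": record["category"],
    }


def convert_file(src_path: str, dst_path: str) -> int:
    """Convert a JSONL file line by line and return the number of records written."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            dst.write(json.dumps(convert_record(json.loads(line))) + "\n")
            count += 1
    return count
```

Records without a context passage keep only the instruction as `input` in this sketch; whether the real script behaves the same way should be checked against its source.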

Then split the dataset with the splitting script (in an 80:15:5 ratio):
```bash
python ./NeMo-2.0.0.rc0.beta/scripts/dataset_processing/nlp/dolly_dataprep/dolly_dataspilt.py \
    --input ./databricks-dolly-15k/
```
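
An 80:15:5 split can be sketched as follows. This is an illustration only, not the logic of `dolly_dataspilt.py`: the shuffle, the seed value, and the assignment of the three shares to training, validation and test (in that order) are assumptions, and `split_records` is a hypothetical name.

```python
import random


def split_records(records, ratios=(0.80, 0.15, 0.05), seed=1234):
    """Shuffle records and split them into train/validation/test lists."""
    shuffled = list(records)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_valid = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    # Whatever remains after rounding goes to the test split.
    test = shuffled[n_train + n_valid:]
    return train, valid, test
```

Giving the remainder to the last split guarantees that every record lands in exactly one of the three files, even when the ratios do not divide the dataset size evenly.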

This leaves five JSONL files in total:
```bash
# ls /data/nemo_dataset/databricks-dolly-15k
databricks-dolly-15k.jsonl
databricks-dolly-15k-output.jsonl
training.jsonl
validation.jsonl
test.jsonl
```
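
Before launching training, it can help to confirm that the three split files exist and are not empty. The helper below is a small sketch for that check; `count_records` is a hypothetical name, and the file names are taken from the listing above.

```python
from pathlib import Path

# Split files named in the directory listing above.
EXPECTED = ("training.jsonl", "validation.jsonl", "test.jsonl")


def count_records(dataset_dir):
    """Return {file name: non-empty line count} for the three split files.

    Raises FileNotFoundError if one of the split files is missing.
    """
    counts = {}
    for name in EXPECTED:
        path = Path(dataset_dir) / name
        with path.open(encoding="utf-8") as f:
            counts[name] = sum(1 for line in f if line.strip())
    return counts
```

For example, `count_records("/data/nemo_dataset/databricks-dolly-15k")` returns one count per split, which can be compared against the intended 80:15:5 ratio.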

# 4. Run the SFT fine-tuning script

Edit the MODEL, TRAIN_DS, VALID_DS, TEST_DS and related variables in the K100AI_finetune.sh script so that they point to your actual directories.

Run the fine-tuning script.
Single node, eight cards: `bash K100AI_finetune.sh >& K100AI_finetune.log`
