Commit 917e35e3 authored by Rayyyyy

add 70B and xtuner finetune.

parent 225f11a9
@@ -3,7 +3,11 @@
[llama3](https://llama.meta.com/llama3/)
## Model Architecture
Llama 3 uses a relatively standard decoder-only transformer architecture. Compared with Llama 2, it makes several key improvements:
- Trained on more than 15T tokens of data, over seven times the size of the Llama 2 dataset, which strengthens reasoning, code generation, and instruction following;
- Supports 8K-token context (up from 4K); the improved tokenizer has a vocabulary of 128K tokens, encoding language more efficiently and substantially improving model performance;
- Adopts grouped query attention (GQA) and masking techniques, helping developers get excellent performance at low energy cost (a minimal GQA sketch follows this list);
- The model is trained on sequences of 8,192 tokens, with a mask ensuring that self-attention does not cross document boundaries.
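
To make the GQA point concrete, here is a minimal PyTorch sketch of grouped query attention. It is illustrative only, not the implementation used in Llama 3 or this repository, and the class and variable names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Toy GQA layer: n_kv_heads K/V heads shared among n_heads query heads."""

    def __init__(self, dim: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, seqlen, _ = x.shape
        q = self.wq(x).view(bsz, seqlen, self.n_heads, self.head_dim)
        k = self.wk(x).view(bsz, seqlen, self.n_kv_heads, self.head_dim)
        v = self.wv(x).view(bsz, seqlen, self.n_kv_heads, self.head_dim)
        # Each K/V head serves a group of n_heads // n_kv_heads query heads.
        rep = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(rep, dim=2)
        v = v.repeat_interleave(rep, dim=2)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # -> (bsz, heads, seq, head_dim)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(bsz, seqlen, -1))

x = torch.randn(2, 16, 512)
print(GroupedQueryAttention(dim=512, n_heads=8, n_kv_heads=2)(x).shape)  # (2, 16, 512)
```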
## Algorithm Principles
@@ -22,6 +26,8 @@ docker run -it -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/op
cd /your_code_path/llama3_pytorch
pip install -e .
pip install -U xtuner
pip install bitsandbytes-0.43.0-py3-none-any.whl
```
### Dockerfile (Method 2)
@@ -33,27 +39,38 @@ docker run -it -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/op
cd /your_code_path/llama3_pytorch
pip install -e .
pip install -U xtuner
pip install bitsandbytes-0.43.0-py3-none-any.whl
```
### Anaconda (Method 3)
The special deep learning libraries required for DCU GPUs in this project can be downloaded and installed from the [光合](https://developer.hpccube.com/tool/) developer community.
```bash
DTK driver: dtk23.10.1
python: 3.8
torch: 2.1.0
```
`Tips: the versions of the DCU-related tools above (DTK driver, python, torch) must match one another exactly` (a quick environment-check sketch follows the install commands below)
Other non-deep-learning libraries can be installed as follows:
```bash
pip install -e .
pip install -U xtuner
pip install bitsandbytes-0.43.0-py3-none-any.whl
```
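
As a quick sanity check that the installed versions match the requirements above, a minimal sketch follows; it only prints versions and visible devices, and assumes the DCU build of PyTorch exposes accelerators through the usual `torch.cuda` API, as such builds typically do.

```python
import torch

# Quick environment sanity check; prints versions and visible accelerators only.
print("torch  :", torch.__version__)          # expected: 2.1.0 per the table above
print("devices:", torch.cuda.device_count())  # DCU builds of PyTorch typically expose
print("usable :", torch.cuda.is_available())  # devices through the torch.cuda API
```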
## Dataset
No official dataset is provided.
## Training
### Fine-tuning with xtuner
1. In [llama3_8b_instruct_qlora_alpaca_e3_M.py](./llama3_8b_instruct_qlora_alpaca_e3_M.py), change `pretrained_model_name_or_path` and `data_path` to the corresponding local model and data paths;
2. Adjust `max_length`, `batch_size`, `accumulative_counts`, `max_epochs`, `lr`, `save_steps`, `evaluation_freq`, and the `r` and `lora_alpha` parameters under `model.lora` to match your hardware environment and training needs;
3. Set `${DCU_NUM}` to the number of DCU cards to use;
4. Run the following command (a hedged sketch of loading the resulting adapter for a quick test follows the command):
```bash
NPROC_PER_NODE=${DCU_NUM} xtuner train ./llama3_8b_instruct_qlora_alpaca_e3_M.py --deepspeed deepspeed_zero2
```
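
After training, the checkpoint written by xtuner is an mmengine-style `.pth` file; to test it with standard HuggingFace tooling it generally has to be converted to a HuggingFace-format LoRA adapter first (xtuner provides a `pth_to_hf` conversion utility for this). The sketch below assumes such a converted adapter directory; the adapter path is a hypothetical placeholder, and the base-model path matches the one used in the config.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_path = "/home/llama3/Meta-Llama-3-8B-Instruct"       # same path as in the config
adapter_path = "/home/llama3/work_dirs/llama3_qlora_hf"   # hypothetical converted-adapter dir

tokenizer = AutoTokenizer.from_pretrained(base_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base_path, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True)
model = PeftModel.from_pretrained(model, adapter_path)  # attach the fine-tuned LoRA weights

prompt = "我最近总是感到很焦虑,怎么办?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```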
## Inference
For how to download the pretrained models, refer to the [预训练权重](#预训练权重) (pretrained weights) section below. Different models require different model-parallel (MP) values, as shown in the following table:
@@ -61,6 +78,7 @@ pip install -e .
| Model | MP |
|--------|----|
| 8B | 1 |
| 70B | 8 |
All models support sequence lengths of up to 8,192 tokens, but the cache is pre-allocated according to the `max_seq_len` and `max_batch_size` values; set them according to your hardware.
@@ -69,9 +87,9 @@ pip install -e .
- Set the `max_seq_len` and `max_batch_size` parameters as needed (see the sketch below).
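
The MP value in the table is the `--nproc_per_node` value passed to `torchrun`. Below is a hedged sketch of how `max_seq_len` and `max_batch_size` are passed when building the generator, assuming the upstream llama repository's `Llama.build` / `text_completion` API; the prompt and sizes are illustrative only.

```python
from llama import Llama

# MP from the table above == --nproc_per_node passed to torchrun when launching this script.
generator = Llama.build(
    ckpt_dir="Meta-Llama-3-8B/original/",
    tokenizer_path="Meta-Llama-3-8B/original/tokenizer.model",
    max_seq_len=512,      # the cache is pre-allocated from this value ...
    max_batch_size=4,     # ... and from this one; size both to fit your hardware
)
results = generator.text_completion(["The theory of relativity states that"], max_gen_len=64)
print(results[0]["generation"])
```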
### Pretrained models
These models are not fine-tuned for chat or Q&A. See the example in `example_text_completion.py`.
- Meta-Llama-3-8B example; for Meta-Llama-3-70B, simply point `--ckpt_dir` and `--tokenizer_path` at the corresponding model directory.
```bash
torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir Meta-Llama-3-8B/original/ \
@@ -80,15 +98,15 @@ torchrun --nproc_per_node 1 example_text_completion.py \
```
### Instruction-tuned models
The fine-tuned models are trained for dialogue applications. To obtain their expected features and performance, the specific format defined in [`ChatFormat`](llama/tokenizer.py#L202) must be followed (a short sketch assembling such a prompt follows this list):
- The prompt starts with the special token `<|begin_of_text|>`, followed by one or more messages
- Each message starts with the `<|start_header_id|>` tag, then the role (`system`, `user`, or `assistant`), and ends with the `<|end_header_id|>` tag
- The message content follows after a double newline `\n\n`
- The end of each message is marked by the `<|eot_id|>` token.
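
The following is an illustrative helper that assembles a prompt string in exactly this format; in practice the repository's `ChatFormat` class should be used, and the function name here is hypothetical.

```python
def build_llama3_chat_prompt(messages):
    """Assemble a raw Llama 3 chat prompt from (role, content) pairs."""
    parts = ["<|begin_of_text|>"]
    for role, content in messages:
        parts.append(f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>")
    # Leave an open assistant header so the model generates the reply after it.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

print(build_llama3_chat_prompt([
    ("system", "You are a helpful assistant."),
    ("user", "What is the capital of France?"),
]))
```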
You can also deploy additional classifiers to filter inputs and outputs deemed unsafe. See the [llama-recipes repo](https://github.com/meta-llama/llama-recipes/blob/main/recipes/inference/local_inference/inference.py) for how to add a safety checker to the inputs and outputs of the inference code.
- Meta-Llama-3-8B-Instruct example; for Meta-Llama-3-70B-Instruct, simply point `--ckpt_dir` and `--tokenizer_path` at the corresponding model directory.
```bash
torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir Meta-Llama-3-8B-Instruct/original/ \
@@ -139,17 +157,53 @@ mkdir Meta-Llama-3-8B-Instruct
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --include "original/*" --local-dir Meta-Llama-3-8B-Instruct --token hf_*
```
- Meta-Llama-3-70B model
```bash
mkdir Meta-Llama-3-70B
huggingface-cli download meta-llama/Meta-Llama-3-70B --include "original/*" --local-dir Meta-Llama-3-70B --token hf_*
```
- Meta-Llama-3-70B-Instruct model
```bash
mkdir Meta-Llama-3-70B-Instruct
huggingface-cli download meta-llama/Meta-Llama-3-70B-Instruct --include "original/*" --local-dir Meta-Llama-3-70B-Instruct --token hf_*
```
The model directory structure is as follows:
```bash
├── llama3_pytorch
│   ├── Meta-Llama-3-8B
│   │   └── original
│   │       ├── consolidated.00.pth
│   │       ├── params.json
│   │       └── tokenizer.model
│   ├── Meta-Llama-3-8B-Instruct
│   │   └── original
│   │       ├── consolidated.00.pth
│   │       ├── params.json
│   │       └── tokenizer.model
│   ├── Meta-Llama-3-70B
│   │   └── original
│   │       ├── consolidated.00.pth
│   │       ├── consolidated.01.pth
│   │       ├── consolidated.02.pth
│   │       ├── consolidated.03.pth
│   │       ├── consolidated.04.pth
│   │       ├── consolidated.05.pth
│   │       ├── consolidated.06.pth
│   │       ├── consolidated.07.pth
│   │       ├── params.json
│   │       └── tokenizer.model
│   └── Meta-Llama-3-70B-Instruct
│       └── original
│           ├── consolidated.00.pth
│           ├── consolidated.01.pth
│           ├── consolidated.02.pth
│           ├── consolidated.03.pth
│           ├── consolidated.04.pth
│           ├── consolidated.05.pth
│           ├── consolidated.06.pth
│           ├── consolidated.07.pth
│           ├── params.json
│           └── tokenizer.model
```
@@ -159,3 +213,5 @@ huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --include "original
## References
- https://github.com/meta-llama/llama3
- https://github.com/InternLM/xtuner
- https://github.com/SmartFlowAI/EmoLLM
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
                            LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
                                 VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = '/home/llama3/Meta-Llama-3-8B-Instruct'
use_varlen_attn = False  # variable-length (packed) attention args; disabled in this config
# Data
data_path = '/home/llama3/datasets/multi_turn_dataset_2.json'
prompt_template = PROMPT_TEMPLATE.llama3_chat
max_length = 2048
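# Pack multiple short samples into each max_length sequence to reduce padding
# and improve training throughput.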
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 16 # per_device
accumulative_counts = 1
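# Scale gradient accumulation with sequence parallelism so the effective global
# batch size stays unchanged when sequence_parallel_size > 1.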
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 1e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
# SYSTEM = SYSTEM_TEMPLATE.alpaca
# evaluation_inputs = [
# '请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
# ]
SYSTEM = "你由EmoLLM团队打造的中文领域心理健康助手, 是一个研究过无数具有心理健康问题的病人与心理健康医生对话的心理专家, 在心理方面拥有广博的知识储备和丰富的研究咨询经验,接下来你将只使用中文来回答和咨询问题。"
evaluation_inputs = [
    '我最近总是感到很焦虑,尤其是在学业上。我有个特别崇拜的同学,他好像在各方面都比我优秀,我总觉得自己怎么努力也追不上他,这让我压力特别大。',
    '我知道应该理性看待,但就是忍不住会去比较。我甚至晚上会因为这个睡不着觉,总想着怎样才能像他那样出色。',
    '我今天心情不好,感觉不开心,很烦。'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    padding_side='right')
model = dict(
    type=SupervisedFinetune,
    use_varlen_attn=use_varlen_attn,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16,
        # quantization_config=dict(
        #     type=BitsAndBytesConfig,
        #     load_in_4bit=False,
        #     load_in_8bit=False,
        #     llm_int8_threshold=6.0,
        #     llm_int8_has_fp16_weight=False,
        #     bnb_4bit_compute_dtype=torch.float16,
        #     bnb_4bit_use_double_quant=False,
        #     bnb_4bit_quant_type='nf4')
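        # NOTE: with the quantization_config above left commented out, the base model
        # is loaded in fp16 and only the LoRA weights are trained; uncomment it and set
        # load_in_4bit=True to fine-tune with 4-bit QLoRA instead.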
    ),
    lora=dict(
        type=LoraConfig,
        r=32,  # changed from 64
        lora_alpha=64,  # changed from 16
        lora_dropout=0.1,
        bias='none',
        task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
    type=process_hf_dataset,
    # dataset=dict(type=load_dataset, path=alpaca_en_path),
    dataset=dict(type=load_dataset, path='json', data_files=dict(train=data_path)),
    tokenizer=tokenizer,
    max_length=max_length,
    # dataset_map_fn=alpaca_map_fn,
    dataset_map_fn=None,
    template_map_fn=dict(
        type=template_map_fn_factory, template=prompt_template),
    remove_unused_columns=True,
    shuffle_before_pack=True,
    pack_to_max_length=pack_to_max_length,
    use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
    if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
    batch_size=batch_size,
    num_workers=dataloader_num_workers,
    dataset=alpaca_en,
    # sampler=dict(type=sampler, shuffle=True),
    sampler=dict(type=DefaultSampler, shuffle=True),
    collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
    type=AmpOptimWrapper,
    optimizer=dict(
        type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
    clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
    accumulative_counts=accumulative_counts,
    loss_scale='dynamic',
    dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
    dict(
        type=LinearLR,
        start_factor=1e-5,
        by_epoch=True,
        begin=0,
        end=warmup_ratio * max_epochs,
        convert_to_iter_based=True),
    dict(
        type=CosineAnnealingLR,
        eta_min=0.0,
        by_epoch=True,
        begin=warmup_ratio * max_epochs,
        end=max_epochs,  # T_max=max_epochs,
        convert_to_iter_based=True)
]
# train, val, test setting
# train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
train_cfg = dict(by_epoch=True, max_epochs=max_epochs, val_interval=1)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
    dict(type=DatasetInfoHook, tokenizer=tokenizer),
    dict(
        type=EvaluateChatHook,
        tokenizer=tokenizer,
        every_n_iters=evaluation_freq,
        evaluation_inputs=evaluation_inputs,
        system=SYSTEM,
        prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
    # record the time of every iteration.
    timer=dict(type=IterTimerHook),
    # print log every 10 iterations.
    logger=dict(type=LoggerHook, interval=10,
                log_metric_by_epoch=False),
    # enable the parameter scheduler.
    param_scheduler=dict(type=ParamSchedulerHook),
    # save checkpoint per `save_steps`.
    checkpoint=dict(
        type=CheckpointHook,
        # by_epoch=False,
        interval=save_steps,
        max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
    sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
    # whether to enable cudnn benchmark
    cudnn_benchmark=False,
    # set multi process parameters
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    # set distributed parameters
    dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)