Commit ab643c4f authored by Rayyyyy

Add llama-factory for llama3 scripts and update README

parent 5985275a
@@ -14,7 +14,6 @@ Llama-3 adopts a relatively standard decoder-only transformer architecture. Compared with Ll
<img src="./doc/method.png"/>
</div>
## Environment Setup
Modify the -v mount paths, docker_name, and imageID according to your actual setup
**Note**: the bitsandbytes library is not fully functional; quantization-related features are not yet supported
@@ -45,6 +44,7 @@ DTK driver: dtk24.04
python: python3.10
torch: 2.1.0
xtuner: 0.1.18
llama-factory: 0.6.3
```
`Tips: the dtk driver, python, torch, and other DCU-related tool versions above must correspond strictly one-to-one`
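To confirm the environment matches the versions listed above, a quick check such as the following can help. This is a minimal sketch: the pip package name for llama-factory v0.6.3 is assumed to be `llmtuner` and may differ in your installation.

```bash
# Sanity check of DCU-related tool versions (sketch; adjust to your setup)
python --version                                    # expect python3.10
python -c "import torch; print(torch.__version__)"  # expect 2.1.0
pip list 2>/dev/null | grep -Ei "xtuner|llmtuner|llama.factory"  # expect xtuner 0.1.18, llama-factory 0.6.3
```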
@@ -93,19 +93,17 @@ or
NPROC_PER_NODE=${DCU_NUM} xtuner train ./llama3_8b_instruct_qlora_alpaca_e3_M.py --deepspeed deepspeed_zero2 --work-dir /path/of/saves
```
### Llama Factory Fine-tuning Method (Recommended)
1. Install the training library (**outside the llama3_pytorch directory**); the version to install is **v0.6.3**. For the detailed `Llama-Factory` installation procedure, see the repository README (a minimal install sketch follows this list).
```
git clone -b v0.6.3 http://developer.hpccube.com/codes/OpenDAS/llama-factory.git
```
2. Download a pretrained model via [Pretrained Weights](#预训练权重); the current example uses the [Meta-Llama-3-8B-Instruct](http://113.200.138.88:18080/aimodels/Meta-Llama-3-8B-Instruct) model;
3. Training scripts for `llama3/3.1` can be found [here](./llama-factory/examples/); in particular, when launching with `single_node.sh`, make sure the `num_processes` parameter in `single_config.yaml` matches the number of GPUs configured.
4. For multi-node multi-GPU training with **deepspeed**, first install **pdsh** (skip if already installed) and make sure the servers can **communicate over passwordless SSH** (see the sketch after this list)
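A minimal sketch of steps 1, 3, and 4 above, assuming the clone lands in a local `llama-factory` directory, that an editable pip install is sufficient (follow the repository README if it differs), and that `user` and the host IP are placeholders taken from the example hostfile further below:

```bash
# Step 1: install Llama-Factory v0.6.3 (sketch; see the repo README for the authoritative steps)
git clone -b v0.6.3 http://developer.hpccube.com/codes/OpenDAS/llama-factory.git
cd llama-factory
pip install -e .    # editable install of the training library

# Step 3: confirm num_processes matches the number of GPUs you plan to use
# (path relative to the example scripts; adjust to where your single_config.yaml lives)
grep num_processes examples/accelerate/single_config.yaml

# Step 4: passwordless SSH between nodes, required by deepspeed/pdsh multi-node launches
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa   # skip if a key already exists
ssh-copy-id user@10.5.32.246               # repeat for every other node
ssh user@10.5.32.246 hostname              # should return without a password prompt
```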
#### Full-Parameter Fine-tuning
@@ -129,8 +127,8 @@ cd /your_code_path/llama_factory/examples/lora_multi_gpu
Parameter descriptions are the same as in [Full-Parameter Fine-tuning](#全参微调)
## Inference
Pretrained model download
Please refer to the [Pretrained Weights](#预训练权重) section below to download the pretrained models. Different models require different model-parallel (MP) values, as shown in the following table:
| Model | MP |
|--------|----|
@@ -144,6 +142,7 @@ cd /your_code_path/llama_factory/examples/lora_multi_gpu
- Set the `max_seq_len` and `max_batch_size` parameters as needed.
### Pretrained Models
These models have not been fine-tuned for chat or Q&A. See the usage example in `example_text_completion.py`.
- Example for the Meta-Llama-3-8B model; for Meta-Llama-3-70B, only `--nproc_per_node`, `--ckpt_dir`, and `--tokenizer_path` need to be changed to the corresponding model paths (a sketch is shown after the code block below).
@@ -155,6 +154,7 @@ torchrun --nproc_per_node 1 example_text_completion.py \
```
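For reference, a sketch of the corresponding Meta-Llama-3-70B invocation. The checkpoint path is a placeholder, `--nproc_per_node 8` follows the MP value for the 70B model in the table above, and `--max_seq_len`/`--max_batch_size` are illustrative values that can be adjusted as needed:

```bash
torchrun --nproc_per_node 8 example_text_completion.py \
    --ckpt_dir /path/of/Meta-Llama-3-70B/ \
    --tokenizer_path /path/of/Meta-Llama-3-70B/tokenizer.model \
    --max_seq_len 128 --max_batch_size 4
```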
### Instruction-tuned Models
The fine-tuned models are trained for dialogue applications. To obtain the expected behavior and performance, the specific format defined in [`ChatFormat`](llama/tokenizer.py#L202) must be followed (an example prompt is shown after the launch command below):
- A prompt starts with the special token `<|begin_of_text|>`, followed by one or more messages.
- Each message starts with the tag `<|start_header_id|>`, gives the role (`system`, `user`, or `assistant`), and ends with the tag `<|end_header_id|>`.
@@ -171,6 +171,35 @@ torchrun --nproc_per_node 1 example_chat_completion.py \
--max_seq_len 512 --max_batch_size 6
```
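To make the format above concrete, an illustrative prompt for a single user turn might look like the following. The system and user messages are placeholders; see `ChatFormat` in `llama/tokenizer.py` for the authoritative construction:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

```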
### Inference with Models in .safetensors Format
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
# path to the model files
model_path = 'meta-llama/Meta-Llama-3-8B-Instruct'
prompt = '你好'
input_query = {"role": "user", "content": prompt}
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
model_path, torch_dtype="auto", device_map="auto")
input_ids = tokenizer.apply_chat_template(
[input_query,], add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(
input_ids,
max_new_tokens=1024,
)
response = outputs[0][input_ids.shape[-1]:]
generated_text = tokenizer.decode(response, skip_special_tokens=True)
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
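If the weights were obtained from the links in the Pretrained Weights section below, `model_path` can simply point at the local directory containing the `.safetensors` shards and tokenizer files instead of a Hugging Face repo id.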
### Multi-turn Dialogue
1. Make sure the environment is set up and the model has been downloaded;
2. In [chat.sh](./chat.sh), change the `--ckpt_dir` and `--tokenizer_path` parameters to your local model paths; adjust `--max_seq_len` as needed: increasing it extends the model's memory across dialogue turns, but note that this may also increase computation time and memory requirements;
@@ -233,6 +262,10 @@ python eval.py --model hf --model_args pretrained=/home/llama3/Meta-Llama-3-8B-I
- [Meta-Llama-3-8B-Instruct](http://113.200.138.88:18080/aimodels/Meta-Llama-3-8B-Instruct)
- [Meta-Llama-3-70B](http://113.200.138.88:18080/aimodels/Meta-Llama-3-70B)
- [Meta-Llama-3-70B-Instruct](http://113.200.138.88:18080/aimodels/Meta-Llama-3-70B-Instruct)
- [Meta-Llama-3.1-8B](http://113.200.138.88:18080/aimodels/meta-llama/Meta-Llama-3.1-8B)
- [Meta-Llama-3.1-8B-Instruct](http://113.200.138.88:18080/aimodels/Meta-Llama-3.1-8B-Instruct)
- [Meta-Llama-3.1-70B](http://113.200.138.88:18080/aimodels/meta-llama/Meta-Llama-3.1-70B)
- [Meta-Llama-3.1-70B-Instruct](http://113.200.138.88:18080/aimodels/meta-llama/Meta-Llama-3.1-70B-Instruct)
The model directory structure is as follows:
```bash
@@ -328,3 +361,4 @@ python eval.py --model hf --model_args pretrained=/home/llama3/Meta-Llama-3-8B-I
- https://github.com/meta-llama/llama3
- https://github.com/InternLM/xtuner
- https://github.com/meta-llama/llama-recipes
- https://github.com/hiyouga/LLaMA-Factory/tree/v0.6.3
PYTHON_VERSION=3.10
NCCL_SOCKET_IFNAME=ens38f0
NCCL_IB_DISABLE=1
HSA_FORCE_FINE_GRAIN_PCIE=1
MIOPEN_COMPILE_PARALLEL_LEVEL=1
NCCL_PATH=/opt/dtk/rccl
NCCL_DEBUG=DEBUG
10.5.32.245 slots=8
10.5.32.246 slots=8
\ No newline at end of file
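In the hostfile above, each `slots=8` entry tells the deepspeed launcher how many GPUs it may use on that host; the NCCL/RCCL variables (socket interface, InfiniBand disabled, debug level) should be adjusted to match your cluster's network setup.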
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
MASTER_ADDR=''  # set to the master node IP address
# Multi-node, multi-GPU + deepspeed
deepspeed --hostfile=./hostfile \
--num_nodes 2 \
--master_addr $MASTER_ADDR \
--master_port 12345 \
../../src/train_bash.py \
--deepspeed ../deepspeed/ds_z3_config.json \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Meta-Llama-3-70B-Instruct \
--dataset alpaca_zh \
--dataset_dir ../../data \
--template llama3 \
--finetuning_type full \
--output_dir saves/LLaMA3-70B-Instruct/full/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 8192 \
--preprocessing_num_workers 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--save_steps 500 \
--eval_steps 100 \
--evaluation_strategy steps \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--max_samples 3000 \
--val_size 0.1 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
main_process_ip: 10.5.32.245
main_process_port: 12345
main_training_function: main
mixed_precision: fp16
num_machines: 2 # the number of nodes
num_processes: 16 # the number of GPUs in all nodes
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
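Here `num_processes: 16` is simply `num_machines` times the GPUs per node (2 nodes × 8 GPUs, matching the `slots=8` entries in the hostfile); if you change the number of nodes or visible GPUs, update both values accordingly.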
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
# Multi-node, multi-GPU (node rank 0)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
--config_file ./master_config.yaml \
--machine_rank 0 \
../../src/train_bash.py \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
--dataset alpaca_zh \
--dataset_dir ../../data \
--template llama3 \
--finetuning_type full \
--output_dir saves/LLaMA3-8B/full/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 8192 \
--preprocessing_num_workers 16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--save_steps 100 \
--eval_steps 100 \
--evaluation_strategy steps \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--max_samples 3000 \
--val_size 0.1 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
# Multi-node, multi-GPU (node rank 1)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
--config_file ./master_config.yaml \
--machine_rank 1 \
../../src/train_bash.py \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
--dataset alpaca_zh \
--dataset_dir ../../data \
--template llama3 \
--finetuning_type full \
--output_dir saves/LLaMA3-8B/full/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 8192 \
--preprocessing_num_workers 16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--save_steps 100 \
--eval_steps 100 \
--evaluation_strategy steps \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--max_samples 3000 \
--val_size 0.1 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
# Single-node, multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
--config_file ../accelerate/single_config.yaml \
../../src/train_bash.py \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
--dataset alpaca_zh \
--dataset_dir ../../data \
--template llama3 \
--finetuning_type full \
--output_dir saves/LLaMA3-8B-Instruct/full/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 8192 \
--preprocessing_num_workers 16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--save_steps 1000 \
--eval_steps 100 \
--evaluation_strategy steps \
--learning_rate 5e-5 \
--num_train_epochs 1.0 \
--max_samples 1000 \
--val_size 0.1 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
# Single-node, multi-GPU + deepspeed
deepspeed --num_gpus 4 ../../src/train_bash.py \
--deepspeed ../deepspeed/ds_z3_config.json \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
--dataset alpaca_zh \
--dataset_dir ../../data \
--template llama3 \
--finetuning_type full \
--output_dir saves/LLaMA3-8B-Instruct/full/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 8192 \
--preprocessing_num_workers 16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--save_steps 1000 \
--eval_steps 100 \
--evaluation_strategy steps \
--learning_rate 5e-5 \
--num_train_epochs 1.0 \
--max_samples 1000 \
--val_size 0.1 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16
NCCL_SOCKET_IFNAME=ens38f0
NCCL_IB_DISABLE=1
HSA_FORCE_FINE_GRAIN_PCIE=1
MIOPEN_COMPILE_PARALLEL_LEVEL=1
NCCL_PATH=/opt/dtk/rccl
NCCL_DEBUG=DEBUG
10.5.32.245 slots=8
10.5.32.246 slots=8
\ No newline at end of file
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
# machine_rank: 0
main_process_ip: 10.5.32.245
main_process_port: 12345
main_training_function: main
mixed_precision: fp16
num_machines: 2 # the number of nodes
num_processes: 16 # the number of GPUs in all nodes
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
\ No newline at end of file
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=ens38f0
# LoRA + multi-node, multi-GPU (node rank 0)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
--config_file ./master_config.yaml \
--machine_rank 0 \
../../src/train_bash.py \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Meta-Llama-3-70B-Instruct \
--dataset alpaca_zh \
--dataset_dir ../../data \
--template llama3 \
--finetuning_type lora \
--lora_target q_proj,v_proj \
--output_dir saves/LLaMA3-70B/lora/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 8192 \
--preprocessing_num_workers 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--save_steps 100 \
--eval_steps 100 \
--evaluation_strategy steps \
--load_best_model_at_end \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--max_samples 1000 \
--val_size 0.1 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=ens38f0
# LoRA + multi-node, multi-GPU (node rank 1)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
--config_file ./master_config.yaml \
--machine_rank 1 \
../../src/train_bash.py \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Meta-Llama-3-70B-Instruct \
--dataset alpaca_zh \
--dataset_dir ../../data \
--template llama3 \
--finetuning_type lora \
--lora_target q_proj,v_proj \
--output_dir saves/LLaMA3-70B/lora/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 8192 \
--preprocessing_num_workers 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--save_steps 100 \
--eval_steps 100 \
--evaluation_strategy steps \
--load_best_model_at_end \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--max_samples 1000 \
--val_size 0.1 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
#--gradient_accumulation_steps Number of updates steps to accumulate before performing a backward/update pass. (default: 1)
#--preprocessing_num_workers The number of processes to use for the pre-processing. (default: None)
MASTER_ADDR=''  # set to the master node IP address
# LoRA + multi-node, multi-GPU + deepspeed
deepspeed --hostfile=./hostfile \
--num_nodes 2 \
--master_addr $MASTER_ADDR \
--master_port 12345 \
../../src/train_bash.py \
--deepspeed ../deepspeed/ds_z3_config.json \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Meta-Llama-3-70B-Instruct \
--dataset alpaca_gpt4_zh,alpaca_zh \
--dataset_dir ../../data \
--template llama3 \
--finetuning_type lora \
--lora_target q_proj,v_proj \
--output_dir saves/LLaMA3-70B-Instruct/lora/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 8192 \
--preprocessing_num_workers 16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--warmup_ratio 0.1 \
--save_steps 500 \
--eval_steps 100 \
--evaluation_strategy steps \
--load_best_model_at_end \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--max_samples 3000 \
--val_size 0.1 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
# LoRA + single-node, multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
--config_file ../accelerate/single_config.yaml \
../../src/train_bash.py \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Meta-Llama-3-70B-Instruct \
--dataset alpaca_zh \
--dataset_dir ../../data \
--template llama3 \
--finetuning_type lora \
--lora_target q_proj,v_proj \
--output_dir saves/LLaMA3-70B/lora/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 8192 \
--preprocessing_num_workers 16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--save_steps 100 \
--eval_steps 100 \
--evaluation_strategy steps \
--load_best_model_at_end \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--max_samples 1000 \
--val_size 0.1 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
#--gradient_accumulation_steps: Number of updates steps to accumulate before performing a backward/update pass. (default: 1)
#--preprocessing_num_workers: The number of processes to use for the pre-processing. (default: None)
MASTER_ADDR=''  # set to the master node IP address
# LoRA + single-node, multi-GPU + deepspeed
deepspeed --master_addr $MASTER_ADDR \
--master_port 12345 \
../../src/train_bash.py \
--deepspeed ../deepspeed/ds_z3_config.json \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Meta-Llama-3-70B-Instruct \
--dataset alpaca_gpt4_zh,alpaca_zh \
--dataset_dir ../../data \
--template llama3 \
--finetuning_type lora \
--lora_target q_proj,v_proj \
--output_dir saves/LLaMA3-70B-Instruct/lora/sft-single-node \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 8192 \
--preprocessing_num_workers 16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--warmup_ratio 0.1 \
--save_steps 500 \
--eval_steps 100 \
--evaluation_strategy steps \
--load_best_model_at_end \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--max_samples 3000 \
--val_size 0.1 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
# LoRA + single-node, multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
--config_file ../accelerate/single_config.yaml \
../../src/train_bash.py \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
--dataset alpaca_zh \
--dataset_dir ../../data \
--template llama3 \
--finetuning_type lora \
--lora_target q_proj,v_proj \
--output_dir saves/LLaMA3-8B/lora/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 8192 \
--preprocessing_num_workers 16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--save_steps 100 \
--eval_steps 100 \
--evaluation_strategy steps \
--load_best_model_at_end \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--max_samples 1000 \
--val_size 0.1 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16