Commit ab643c4f authored by Rayyyyy

Add llama-factory for llama3 scripts and update README

parent 5985275a
@@ -14,7 +14,6 @@ Llama-3 adopts a relatively standard decoder-only transformer architecture. Compared with Ll
<img src="./doc/method.png"/>
</div>
## Environment Setup
Modify the -v mount paths, docker_name, and imageID according to your actual setup
**Note**: the bitsandbytes library is not fully functional; quantization-related features are not yet supported
@@ -45,6 +44,7 @@ DTK driver: dtk24.04
python: python3.10
torch: 2.1.0
xtuner: 0.1.18
llama-factory: 0.6.3
```
`Tips: the dtk driver, python, torch, and other DCU-related tool versions above must correspond strictly one-to-one`
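To confirm the environment matches the versions listed above, a quick check such as the following can help. This is a minimal sketch: the pip package name for llama-factory v0.6.3 is assumed to be `llmtuner` and may differ in your installation.

```bash
# Sanity check of DCU-related tool versions (sketch; adjust to your setup)
python --version                                    # expect python3.10
python -c "import torch; print(torch.__version__)"  # expect 2.1.0
pip list 2>/dev/null | grep -Ei "xtuner|llmtuner|llama.factory"  # expect xtuner 0.1.18, llama-factory 0.6.3
```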
@@ -93,19 +93,17 @@ or
NPROC_PER_NODE=${DCU_NUM} xtuner train ./llama3_8b_instruct_qlora_alpaca_e3_M.py --deepspeed deepspeed_zero2 --work-dir /path/of/saves
```
### Llama Factory Fine-tuning Method (Recommended)
1. Install the training library (**outside the llama3_pytorch directory**); the version to install is **v0.6.3**. For the detailed `Llama-Factory` installation procedure, see the repository README (a minimal install sketch follows this list).
```
git clone -b v0.6.3 http://developer.hpccube.com/codes/OpenDAS/llama-factory.git
```
2. Download a pretrained model via [Pretrained Weights](#预训练权重); the current example uses the [Meta-Llama-3-8B-Instruct](http://113.200.138.88:18080/aimodels/Meta-Llama-3-8B-Instruct) model;
3. Training scripts for `llama3/3.1` can be found [here](./llama-factory/examples/); in particular, when launching with `single_node.sh`, make sure the `num_processes` parameter in `single_config.yaml` matches the number of GPUs configured.
4. For multi-node multi-GPU training with **deepspeed**, first install **pdsh** (skip if already installed) and make sure the servers can **communicate over passwordless SSH** (see the sketch after this list)
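A minimal sketch of steps 1, 3, and 4 above, assuming the clone lands in a local `llama-factory` directory, that an editable pip install is sufficient (follow the repository README if it differs), and that `user` and the host IP are placeholders taken from the example hostfile further below:

```bash
# Step 1: install Llama-Factory v0.6.3 (sketch; see the repo README for the authoritative steps)
git clone -b v0.6.3 http://developer.hpccube.com/codes/OpenDAS/llama-factory.git
cd llama-factory
pip install -e .    # editable install of the training library

# Step 3: confirm num_processes matches the number of GPUs you plan to use
# (path relative to the example scripts; adjust to where your single_config.yaml lives)
grep num_processes examples/accelerate/single_config.yaml

# Step 4: passwordless SSH between nodes, required by deepspeed/pdsh multi-node launches
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa   # skip if a key already exists
ssh-copy-id user@10.5.32.246               # repeat for every other node
ssh user@10.5.32.246 hostname              # should return without a password prompt
```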
#### Full-Parameter Fine-tuning
@@ -129,8 +127,8 @@ cd /your_code_path/llama_factory/examples/lora_multi_gpu
Parameter descriptions are the same as in [Full-Parameter Fine-tuning](#全参微调)
## Inference
Pretrained model download
Please refer to the [Pretrained Weights](#预训练权重) section below to download the pretrained models. Different models require different model-parallel (MP) values, as shown in the following table:
| Model | MP |
|--------|----|
@@ -144,6 +142,7 @@ cd /your_code_path/llama_factory/examples/lora_multi_gpu
- Set the `max_seq_len` and `max_batch_size` parameters as needed.
### Pretrained Models
These models have not been fine-tuned for chat or Q&A. See the usage example in `example_text_completion.py`.
- Example for the Meta-Llama-3-8B model; for Meta-Llama-3-70B, only `--nproc_per_node`, `--ckpt_dir`, and `--tokenizer_path` need to be changed to the corresponding model paths (a sketch is shown after the code block below).
@@ -155,6 +154,7 @@ torchrun --nproc_per_node 1 example_text_completion.py \
```
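For reference, a sketch of the corresponding Meta-Llama-3-70B invocation. The checkpoint path is a placeholder, `--nproc_per_node 8` follows the MP value for the 70B model in the table above, and `--max_seq_len`/`--max_batch_size` are illustrative values that can be adjusted as needed:

```bash
torchrun --nproc_per_node 8 example_text_completion.py \
    --ckpt_dir /path/of/Meta-Llama-3-70B/ \
    --tokenizer_path /path/of/Meta-Llama-3-70B/tokenizer.model \
    --max_seq_len 128 --max_batch_size 4
```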
### Instruction-tuned Models
The fine-tuned models are trained for dialogue applications. To obtain the expected behavior and performance, the specific format defined in [`ChatFormat`](llama/tokenizer.py#L202) must be followed (an example prompt is shown after the launch command below):
- A prompt starts with the special token `<|begin_of_text|>`, followed by one or more messages.
- Each message starts with the tag `<|start_header_id|>`, gives the role (`system`, `user`, or `assistant`), and ends with the tag `<|end_header_id|>`.
@@ -171,6 +171,35 @@ torchrun --nproc_per_node 1 example_chat_completion.py \
--max_seq_len 512 --max_batch_size 6
```
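To make the format above concrete, an illustrative prompt for a single user turn might look like the following. The system and user messages are placeholders; see `ChatFormat` in `llama/tokenizer.py` for the authoritative construction:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

```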
### Inference with Models in .safetensors Format
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
# path to the model files
model_path = 'meta-llama/Meta-Llama-3-8B-Instruct'
prompt = '你好'
input_query = {"role": "user", "content": prompt}
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
model_path, torch_dtype="auto", device_map="auto")
input_ids = tokenizer.apply_chat_template(
[input_query,], add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(
input_ids,
max_new_tokens=1024,
)
response = outputs[0][input_ids.shape[-1]:]
generated_text = tokenizer.decode(response, skip_special_tokens=True)
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
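If the weights were obtained from the links in the Pretrained Weights section below, `model_path` can simply point at the local directory containing the `.safetensors` shards and tokenizer files instead of a Hugging Face repo id.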
### Multi-turn Dialogue
1. Make sure the environment is set up and the model has been downloaded;
2. In [chat.sh](./chat.sh), change the `--ckpt_dir` and `--tokenizer_path` parameters to your local model paths; adjust `--max_seq_len` as needed: increasing it extends the model's memory across dialogue turns, but note that this may also increase computation time and memory requirements;
@@ -233,6 +262,10 @@ python eval.py --model hf --model_args pretrained=/home/llama3/Meta-Llama-3-8B-I
- [Meta-Llama-3-8B-Instruct](http://113.200.138.88:18080/aimodels/Meta-Llama-3-8B-Instruct)
- [Meta-Llama-3-70B](http://113.200.138.88:18080/aimodels/Meta-Llama-3-70B)
- [Meta-Llama-3-70B-Instruct](http://113.200.138.88:18080/aimodels/Meta-Llama-3-70B-Instruct)
- [Meta-Llama-3.1-8B](http://113.200.138.88:18080/aimodels/meta-llama/Meta-Llama-3.1-8B)
- [Meta-Llama-3.1-8B-Instruct](http://113.200.138.88:18080/aimodels/Meta-Llama-3.1-8B-Instruct)
- [Meta-Llama-3.1-70B](http://113.200.138.88:18080/aimodels/meta-llama/Meta-Llama-3.1-70B)
- [Meta-Llama-3.1-70B-Instruct](http://113.200.138.88:18080/aimodels/meta-llama/Meta-Llama-3.1-70B-Instruct)
The model directory structure is as follows:
```bash
@@ -328,3 +361,4 @@ python eval.py --model hf --model_args pretrained=/home/llama3/Meta-Llama-3-8B-I
- https://github.com/meta-llama/llama3
- https://github.com/InternLM/xtuner
- https://github.com/meta-llama/llama-recipes
- https://github.com/hiyouga/LLaMA-Factory/tree/v0.6.3
PYTHON_VERSION=3.10
NCCL_SOCKET_IFNAME=ens38f0
NCCL_IB_DISABLE=1
HSA_FORCE_FINE_GRAIN_PCIE=1
MIOPEN_COMPILE_PARALLEL_LEVEL=1
NCCL_PATH=/opt/dtk/rccl
NCCL_DEBUG=DEBUG
10.5.32.245 slots=8
10.5.32.246 slots=8
\ No newline at end of file
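In the hostfile above, each `slots=8` entry tells the deepspeed launcher how many GPUs it may use on that host; the NCCL/RCCL variables (socket interface, InfiniBand disabled, debug level) should be adjusted to match your cluster's network setup.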
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
MASTER_ADDR=''  # set to the master node IP address
# Multi-node, multi-GPU + deepspeed
deepspeed --hostfile=./hostfile \
--num_nodes 2 \
--master_addr $MASTER_ADDR \
--master_port 12345 \
../../src/train_bash.py \
--deepspeed ../deepspeed/ds_z3_config.json \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Meta-Llama-3-70B-Instruct \
--dataset alpaca_zh \
--dataset_dir ../../data \
--template llama3 \
--finetuning_type full \
--output_dir saves/LLaMA3-70B-Instruct/full/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 8192 \
--preprocessing_num_workers 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--save_steps 500 \
--eval_steps 100 \
--evaluation_strategy steps \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--max_samples 3000 \
--val_size 0.1 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
main_process_ip: 10.5.32.245
main_process_port: 12345
main_training_function: main
mixed_precision: fp16
num_machines: 2 # the number of nodes
num_processes: 16 # the number of GPUs in all nodes
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
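Here `num_processes: 16` is simply `num_machines` times the GPUs per node (2 nodes × 8 GPUs, matching the `slots=8` entries in the hostfile); if you change the number of nodes or visible GPUs, update both values accordingly.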
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
# Multi-node, multi-GPU (node rank 0)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
--config_file ./master_config.yaml \
--machine_rank 0 \
../../src/train_bash.py \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
--dataset alpaca_zh \
--dataset_dir ../../data \
--template llama3 \
--finetuning_type full \
--output_dir saves/LLaMA3-8B/full/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 8192 \
--preprocessing_num_workers 16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--save_steps 100 \
--eval_steps 100 \
--evaluation_strategy steps \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--max_samples 3000 \
--val_size 0.1 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
# Multi-node, multi-GPU (node rank 1)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
--config_file ./master_config.yaml \
--machine_rank 1 \
../../src/train_bash.py \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
--dataset alpaca_zh \
--dataset_dir ../../data \
--template llama3 \
--finetuning_type full \
--output_dir saves/LLaMA3-8B/full/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 8192 \
--preprocessing_num_workers 16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--save_steps 100 \
--eval_steps 100 \
--evaluation_strategy steps \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--max_samples 3000 \
--val_size 0.1 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
# Single-node, multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
--config_file ../accelerate/single_config.yaml \
../../src/train_bash.py \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
--dataset alpaca_zh \
--dataset_dir ../../data \
--template llama3 \
--finetuning_type full \
--output_dir saves/LLaMA3-8B-Instruct/full/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 8192 \
--preprocessing_num_workers 16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--save_steps 1000 \
--eval_steps 100 \
--evaluation_strategy steps \
--learning_rate 5e-5 \
--num_train_epochs 1.0 \
--max_samples 1000 \
--val_size 0.1 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
# Single-node, multi-GPU + deepspeed
deepspeed --num_gpus 4 ../../src/train_bash.py \
--deepspeed ../deepspeed/ds_z3_config.json \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
--dataset alpaca_zh \
--dataset_dir ../../data \
--template llama3 \
--finetuning_type full \
--output_dir saves/LLaMA3-8B-Instruct/full/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 8192 \
--preprocessing_num_workers 16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--save_steps 1000 \
--eval_steps 100 \
--evaluation_strategy steps \
--learning_rate 5e-5 \
--num_train_epochs 1.0 \
--max_samples 1000 \
--val_size 0.1 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16
NCCL_SOCKET_IFNAME=ens38f0
NCCL_IB_DISABLE=1
HSA_FORCE_FINE_GRAIN_PCIE=1
MIOPEN_COMPILE_PARALLEL_LEVEL=1
NCCL_PATH=/opt/dtk/rccl
NCCL_DEBUG=DEBUG
10.5.32.245 slots=8
10.5.32.246 slots=8
\ No newline at end of file
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
# machine_rank: 0
main_process_ip: 10.5.32.245
main_process_port: 12345
main_training_function: main
mixed_precision: fp16
num_machines: 2 # the number of nodes
num_processes: 16 # the number of GPUs in all nodes
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
\ No newline at end of file
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=ens38f0
# LoRA + multi-node, multi-GPU (node rank 0)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
--config_file ./master_config.yaml \
--machine_rank 0 \
../../src/train_bash.py \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Meta-Llama-3-70B-Instruct \
--dataset alpaca_zh \
--dataset_dir ../../data \
--template llama3 \
--finetuning_type lora \
--lora_target q_proj,v_proj \
--output_dir saves/LLaMA3-70B/lora/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 8192 \
--preprocessing_num_workers 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--save_steps 100 \
--eval_steps 100 \
--evaluation_strategy steps \
--load_best_model_at_end \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--max_samples 1000 \
--val_size 0.1 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=ens38f0
# LoRA + multi-node, multi-GPU (node rank 1)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
--config_file ./master_config.yaml \
--machine_rank 1 \
../../src/train_bash.py \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Meta-Llama-3-70B-Instruct \
--dataset alpaca_zh \
--dataset_dir ../../data \
--template llama3 \
--finetuning_type lora \
--lora_target q_proj,v_proj \
--output_dir saves/LLaMA3-70B/lora/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 8192 \
--preprocessing_num_workers 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--save_steps 100 \
--eval_steps 100 \
--evaluation_strategy steps \
--load_best_model_at_end \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--max_samples 1000 \
--val_size 0.1 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
#--gradient_accumulation_steps Number of updates steps to accumulate before performing a backward/update pass. (default: 1)
#--preprocessing_num_workers The number of processes to use for the pre-processing. (default: None)
MASTER_ADDR=''  # set to the master node IP address
# LoRA + multi-node, multi-GPU + deepspeed
deepspeed --hostfile=./hostfile \
--num_nodes 2 \
--master_addr $MASTER_ADDR \
--master_port 12345 \
../../src/train_bash.py \
--deepspeed ../deepspeed/ds_z3_config.json \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Meta-Llama-3-70B-Instruct \
--dataset alpaca_gpt4_zh,alpaca_zh \
--dataset_dir ../../data \
--template llama3 \
--finetuning_type lora \
--lora_target q_proj,v_proj \
--output_dir saves/LLaMA3-70B-Instruct/lora/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 8192 \
--preprocessing_num_workers 16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--warmup_ratio 0.1 \
--save_steps 500 \
--eval_steps 100 \
--evaluation_strategy steps \
--load_best_model_at_end \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--max_samples 3000 \
--val_size 0.1 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
# LoRA + single-node, multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
--config_file ../accelerate/single_config.yaml \
../../src/train_bash.py \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Meta-Llama-3-70B-Instruct \
--dataset alpaca_zh \
--dataset_dir ../../data \
--template llama3 \
--finetuning_type lora \
--lora_target q_proj,v_proj \
--output_dir saves/LLaMA3-70B/lora/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 8192 \
--preprocessing_num_workers 16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--save_steps 100 \
--eval_steps 100 \
--evaluation_strategy steps \
--load_best_model_at_end \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--max_samples 1000 \
--val_size 0.1 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
#--gradient_accumulation_steps: Number of updates steps to accumulate before performing a backward/update pass. (default: 1)
#--preprocessing_num_workers: The number of processes to use for the pre-processing. (default: None)
MASTER_ADDR=''  # set to the master node IP address
# LoRA + single-node, multi-GPU + deepspeed
deepspeed --master_addr $MASTER_ADDR \
--master_port 12345 \
../../src/train_bash.py \
--deepspeed ../deepspeed/ds_z3_config.json \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Meta-Llama-3-70B-Instruct \
--dataset alpaca_gpt4_zh,alpaca_zh \
--dataset_dir ../../data \
--template llama3 \
--finetuning_type lora \
--lora_target q_proj,v_proj \
--output_dir saves/LLaMA3-70B-Instruct/lora/sft-single-node \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 8192 \
--preprocessing_num_workers 16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--warmup_ratio 0.1 \
--save_steps 500 \
--eval_steps 100 \
--evaluation_strategy steps \
--load_best_model_at_end \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--max_samples 3000 \
--val_size 0.1 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
# LoRA + single-node, multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
--config_file ../accelerate/single_config.yaml \
../../src/train_bash.py \
--stage sft \
--do_train \
--model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
--dataset alpaca_zh \
--dataset_dir ../../data \
--template llama3 \
--finetuning_type lora \
--lora_target q_proj,v_proj \
--output_dir saves/LLaMA3-8B/lora/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 8192 \
--preprocessing_num_workers 16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--save_steps 100 \
--eval_steps 100 \
--evaluation_strategy steps \
--load_best_model_at_end \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--max_samples 1000 \
--val_size 0.1 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16