ModelZoo / llama3_pytorch · Commit ab643c4f
Authored Jul 31, 2024 by Rayyyyy

Add llama-factory for llama3 scripts and update README

Parent: 5985275a
Showing 18 changed files with 517 additions and 8 deletions
README.md (+42, -8)
llama-factory/examples/full_multi_gpu/70B/.deepspeed_env (+7, -0)
llama-factory/examples/full_multi_gpu/70B/hostfile (+2, -0)
llama-factory/examples/full_multi_gpu/70B/multi_node_deepspeed.sh (+39, -0)
llama-factory/examples/full_multi_gpu/8B/master_config.yaml (+17, -0)
llama-factory/examples/full_multi_gpu/8B/multi_node_0.sh (+36, -0)
llama-factory/examples/full_multi_gpu/8B/multi_node_1.sh (+36, -0)
llama-factory/examples/full_multi_gpu/8B/single_node.sh (+35, -0)
llama-factory/examples/full_multi_gpu/8B/single_node_deepspeed.sh (+34, -0)
llama-factory/examples/lora_multi_gpu/70B/.deepspeed_env (+6, -0)
llama-factory/examples/lora_multi_gpu/70B/hostfile (+2, -0)
llama-factory/examples/lora_multi_gpu/70B/master_config.yaml (+18, -0)
llama-factory/examples/lora_multi_gpu/70B/multi_node_0.sh (+41, -0)
llama-factory/examples/lora_multi_gpu/70B/multi_node_1.sh (+41, -0)
llama-factory/examples/lora_multi_gpu/70B/multi_node_deepspeed.sh (+45, -0)
llama-factory/examples/lora_multi_gpu/70B/single_node.sh (+37, -0)
llama-factory/examples/lora_multi_gpu/70B/single_node_deepspeed.sh (+42, -0)
llama-factory/examples/lora_multi_gpu/8B/single_node.sh (+37, -0)
README.md
...
...
@@ -14,7 +14,6 @@ Llama-3 adopts a fairly standard decoder-only transformer architecture. Compared with Ll
<img src="./doc/method.png" />
</div>
## Environment Setup
The -v paths, docker_name, and imageID should be adjusted to your actual environment.
**Note**: the bitsandbytes library is not fully functional yet; quantization-related features are not supported for now.
...
...
@@ -45,6 +44,7 @@ DTK driver: dtk24.04
python: python3.10
torch: 2.1.0
xtuner: 0.1.18
llama-factory: 0.6.3
```
`Tips: the DTK driver, python, torch, and other DCU-related tool versions above must correspond strictly one-to-one.`
...
...
@@ -93,19 +93,17 @@ or
NPROC_PER_NODE=${DCU_NUM} xtuner train ./llama3_8b_instruct_qlora_alpaca_e3_M.py --deepspeed deepspeed_zero2 --work-dir /path/of/saves
```
### Llama Factory fine-tuning method
### Llama Factory fine-tuning method (recommended)
1. Install the training library (**outside the llama3_pytorch directory**); the version to install is **v0.6.3**
1. Install the training library (**outside the llama3_pytorch directory**); the version to install is **v0.6.3**. For the detailed installation procedure of `Llama-Factory`, refer to that repository's README (a minimal install sketch is given after this list).
```
git clone -b v0.6.3 http://developer.hpccube.com/codes/OpenDAS/llama-factory.git
```
For the detailed installation procedure, refer to the Llama-Factory repository's README.
2. Download a pretrained model via [Pretrained weights](#预训练权重); the current example uses the [Meta-Llama-3-8B-Instruct](http://113.200.138.88:18080/aimodels/Meta-Llama-3-8B-Instruct) model;
3. When launching with `single_node.sh`, make sure the `num_processes` parameter in `single_config.yaml` matches the number of GPUs you configured.
3. Training scripts for `llama3/3.1` can be found [here](./llama-factory/examples/). In particular, when launching with `single_node.sh`, make sure the `num_processes` parameter in `single_config.yaml` matches the number of GPUs you configured.
4. For multi-node multi-GPU training with **deepspeed**, first install **pdsh** (skip if it is already installed) and make sure **passwordless SSH** is configured between the servers (see the sketch after this list).
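For step 1, a minimal installation sketch; this is an outline under stated assumptions, not the authoritative procedure from the Llama-Factory README. The editable install and the `llmtuner` package name are assumptions based on LLaMA-Factory v0.6.x.

```bash
# Minimal sketch: install LLaMA-Factory v0.6.3 outside the llama3_pytorch directory.
# Assumes pip and the DCU build of torch 2.1.0 are already available in the environment.
git clone -b v0.6.3 http://developer.hpccube.com/codes/OpenDAS/llama-factory.git
cd llama-factory
pip install -e .                 # editable install; check the repository README for any required extras
python -c "import llmtuner"      # assumed package name for v0.6.x; verifies the install is importable
```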
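For step 4, a minimal sketch of installing pdsh and enabling passwordless SSH between the nodes. The IP addresses are taken from the example hostfile; the root user and the Debian-style package manager are assumptions to adapt to your cluster.

```bash
# Install pdsh on every node (Debian/Ubuntu shown; use your distro's package manager otherwise)
sudo apt-get install -y pdsh
export PDSH_RCMD_TYPE=ssh        # make pdsh use ssh instead of rsh

# On the master node: create a key pair (if absent) and push it to each node in the hostfile
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
ssh-copy-id root@10.5.32.245
ssh-copy-id root@10.5.32.246

# Both checks should print the remote hostname without asking for a password
ssh root@10.5.32.246 hostname
pdsh -w 10.5.32.245,10.5.32.246 hostname
```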
#### Full-parameter fine-tuning
...
...
@@ -129,8 +127,8 @@ cd /your_code_path/llama_factory/examples/lora_multi_gpu
Parameter explanations are the same as in [Full-parameter fine-tuning](#全参微调)
## Inference
Pretrained model download: refer to the [Pretrained weights](#预训练权重) section below; different models require different model parallel (MP) values, as shown in the following table:
Refer to the [Pretrained weights](#预训练权重) section below to download the pretrained models; different models require different model parallel (MP) values, as shown in the following table:
| Model | MP |
|--------|----|
...
...
@@ -144,6 +142,7 @@ cd /your_code_path/llama_factory/examples/lora_multi_gpu
- Set the `max_seq_len` and `max_batch_size` parameters as needed.
### Pretrained models
None of these models have been fine-tuned for chat or Q&A. See the example in `example_text_completion.py`.
- Example for the Meta-Llama-3-8B model; for Meta-Llama-3-70B, simply replace --nproc_per_node, --ckpt_dir, and --tokenizer_path with the values for that model.
...
...
@@ -155,6 +154,7 @@ torchrun --nproc_per_node 1 example_text_completion.py \
```
### Instruction-tuned models
The fine-tuned models are trained for dialogue applications. To obtain their expected features and performance, the specific format defined in [`ChatFormat`](llama/tokenizer.py#L202) must be followed:
- The prompt starts with the special token `<|begin_of_text|>`, followed by one or more messages.
- Each message starts with the tag `<|start_header_id|>`, contains the role `system`, `user`, or `assistant`, and ends with the tag `<|end_header_id|>`.
...
...
@@ -171,6 +171,35 @@ torchrun --nproc_per_node 1 example_chat_completion.py \
    --max_seq_len 512 --max_batch_size 6
```
### Inference for models in .safetensors format
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Path to the model files
model_path = 'meta-llama/Meta-Llama-3-8B-Instruct'
prompt = '你好'
input_query = {"role": "user", "content": prompt}
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto", device_map="auto")
input_ids = tokenizer.apply_chat_template([input_query], add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(input_ids, max_new_tokens=1024)
response = outputs[0][input_ids.shape[-1]:]
generated_text = tokenizer.decode(response, skip_special_tokens=True)
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
### Multi-turn dialogue
1. Confirm that the environment is installed and the model has been downloaded;
2. In [chat.sh](./chat.sh), change the `--ckpt_dir` and `--tokenizer_path` parameters to the local model path, and adjust `--max_seq_len` as needed; increasing this value extends how much conversation history the model retains in multi-turn dialogue, but note that it may also increase compute time and memory requirements;
...
...
@@ -233,6 +262,10 @@ python eval.py --model hf --model_args pretrained=/home/llama3/Meta-Llama-3-8B-I
- [Meta-Llama-3-8B-Instruct](http://113.200.138.88:18080/aimodels/Meta-Llama-3-8B-Instruct)
- [Meta-Llama-3-70B](http://113.200.138.88:18080/aimodels/Meta-Llama-3-70B)
- [Meta-Llama-3-70B-Instruct](http://113.200.138.88:18080/aimodels/Meta-Llama-3-70B-Instruct)
- [Meta-Llama-3.1-8B](http://113.200.138.88:18080/aimodels/meta-llama/Meta-Llama-3.1-8B)
- [Meta-Llama-3.1-8B-Instruct](http://113.200.138.88:18080/aimodels/Meta-Llama-3.1-8B-Instruct)
- [Meta-Llama-3.1-70B](http://113.200.138.88:18080/aimodels/meta-llama/Meta-Llama-3.1-70B)
- [Meta-Llama-3.1-70B-Instruct](http://113.200.138.88:18080/aimodels/meta-llama/Meta-Llama-3.1-70B-Instruct)
The model directory structure is as follows:
```bash
...
...
@@ -328,3 +361,4 @@ python eval.py --model hf --model_args pretrained=/home/llama3/Meta-Llama-3-8B-I
- https://github.com/meta-llama/llama3
- https://github.com/InternLM/xtuner
- https://github.com/meta-llama/llama-recipes
- https://github.com/hiyouga/LLaMA-Factory/tree/v0.6.3
llama-factory/examples/full_multi_gpu/70B/.deepspeed_env (new file, mode 100644)
PYTHON_VERSION=3.10
NCCL_SOCKET_IFNAME=ens38f0
NCCL_IB_DISABLE=1
HSA_FORCE_FINE_GRAIN_PCIE=1
MIOPEN_COMPILE_PARALLEL_LEVEL=1
NCCL_PATH=/opt/dtk/rccl
NCCL_DEBUG=DEBUG
llama-factory/examples/full_multi_gpu/70B/hostfile (new file, mode 100644)
10.5.32.245 slots=8
10.5.32.246 slots=8
\ No newline at end of file
llama-factory/examples/full_multi_gpu/70B/multi_node_deepspeed.sh (new file, mode 100644)
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
MASTER_ADDR=''
# Multi-node, multi-GPU + deepspeed
deepspeed --hostfile=./hostfile \
    --num_nodes 2 \
    --master_addr $MASTER_ADDR \
    --master_port 12345 \
    ../../src/train_bash.py \
    --deepspeed ../deepspeed/ds_z3_config.json \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Meta-Llama-3-70B-Instruct \
    --dataset alpaca_zh \
    --dataset_dir ../../data \
    --template llama3 \
    --finetuning_type full \
    --output_dir saves/LLaMA3-70B-Instruct/full/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 8192 \
    --preprocessing_num_workers 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 500 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16
llama-factory/examples/full_multi_gpu/8B/master_config.yaml (new file, mode 100644)
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
main_process_ip: 10.5.32.245
main_process_port: 12345
main_training_function: main
mixed_precision: fp16
num_machines: 2   # the number of nodes
num_processes: 16 # the number of GPUs in all nodes
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
llama-factory/examples/full_multi_gpu/8B/multi_node_0.sh (new file, mode 100644)
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
# Multi-node, multi-GPU: node 0
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
    --config_file ./master_config.yaml \
    --machine_rank 0 \
    ../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset alpaca_zh \
    --dataset_dir ../../data \
    --template llama3 \
    --finetuning_type full \
    --output_dir saves/LLaMA3-8B/full/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 8192 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16
llama-factory/examples/full_multi_gpu/8B/multi_node_1.sh (new file, mode 100644)
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
# Multi-node, multi-GPU: node 1
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
    --config_file ./master_config.yaml \
    --machine_rank 1 \
    ../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset alpaca_zh \
    --dataset_dir ../../data \
    --template llama3 \
    --finetuning_type full \
    --output_dir saves/LLaMA3-8B/full/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 8192 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16
llama-factory/examples/full_multi_gpu/8B/single_node.sh (new file, mode 100644)
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
# Single-node, multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
    --config_file ../accelerate/single_config.yaml \
    ../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset alpaca_zh \
    --dataset_dir ../../data \
    --template llama3 \
    --finetuning_type full \
    --output_dir saves/LLaMA3-8B-Instruct/full/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 8192 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 1000 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --learning_rate 5e-5 \
    --num_train_epochs 1.0 \
    --max_samples 1000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16
llama-factory/examples/full_multi_gpu/8B/single_node_deepspeed.sh (new file, mode 100644)
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
# Single-node, multi-GPU + deepspeed
deepspeed --num_gpus 4 ../../src/train_bash.py \
    --deepspeed ../deepspeed/ds_z3_config.json \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset alpaca_zh \
    --dataset_dir ../../data \
    --template llama3 \
    --finetuning_type full \
    --output_dir saves/LLaMA3-8B-Instruct/full/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 8192 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 1000 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --learning_rate 5e-5 \
    --num_train_epochs 1.0 \
    --max_samples 1000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16
llama-factory/examples/lora_multi_gpu/70B/.deepspeed_env (new file, mode 100644)
NCCL_SOCKET_IFNAME=ens38f0
NCCL_IB_DISABLE=1
HSA_FORCE_FINE_GRAIN_PCIE=1
MIOPEN_COMPILE_PARALLEL_LEVEL=1
NCCL_PATH=/opt/dtk/rccl
NCCL_DEBUG=DEBUG
llama-factory/examples/lora_multi_gpu/70B/hostfile (new file, mode 100644)
10.5.32.245 slots=8
10.5.32.246 slots=8
\ No newline at end of file
llama-factory/examples/lora_multi_gpu/70B/master_config.yaml (new file, mode 100644)
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
# machine_rank: 0
main_process_ip: 10.5.32.245
main_process_port: 12345
main_training_function: main
mixed_precision: fp16
num_machines: 2   # the number of nodes
num_processes: 16 # the number of GPUs in all nodes
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
\ No newline at end of file
llama-factory/examples/lora_multi_gpu/70B/multi_node_0.sh (new file, mode 100644)
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=ens38f0
# LoRA + multi-node, multi-GPU: node 0
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
    --config_file ./master_config.yaml \
    --machine_rank 0 \
    ../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Meta-Llama-3-70B-Instruct \
    --dataset alpaca_zh \
    --dataset_dir ../../data \
    --template llama3 \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir saves/LLaMA3-70B/lora/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 8192 \
    --preprocessing_num_workers 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 1000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16
llama-factory/examples/lora_multi_gpu/70B/multi_node_1.sh (new file, mode 100644)
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=ens38f0
# LoRA + multi-node, multi-GPU: node 1
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
    --config_file ./master_config.yaml \
    --machine_rank 1 \
    ../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Meta-Llama-3-70B-Instruct \
    --dataset alpaca_zh \
    --dataset_dir ../../data \
    --template llama3 \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir saves/LLaMA3-70B/lora/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 8192 \
    --preprocessing_num_workers 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 1000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16
llama-factory/examples/lora_multi_gpu/70B/multi_node_deepspeed.sh (new file, mode 100644)
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
#--gradient_accumulation_steps Number of updates steps to accumulate before performing a backward/update pass. (default: 1)
#--preprocessing_num_workers The number of processes to use for the pre-processing. (default: None)
MASTER_ADDR=''
# LoRA + multi-node, multi-GPU + deepspeed
deepspeed --hostfile=./hostfile \
    --num_nodes 2 \
    --master_addr $MASTER_ADDR \
    --master_port 12345 \
    ../../src/train_bash.py \
    --deepspeed ../deepspeed/ds_z3_config.json \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Meta-Llama-3-70B-Instruct \
    --dataset alpaca_gpt4_zh,alpaca_zh \
    --dataset_dir ../../data \
    --template llama3 \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir saves/LLaMA3-70B-Instruct/lora/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 8192 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --warmup_ratio 0.1 \
    --save_steps 500 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16
llama-factory/examples/lora_multi_gpu/70B/single_node.sh (new file, mode 100644)
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
# LoRA + single-node, multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
    --config_file ../accelerate/single_config.yaml \
    ../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Meta-Llama-3-70B-Instruct \
    --dataset alpaca_zh \
    --dataset_dir ../../data \
    --template llama3 \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir saves/LLaMA3-70B/lora/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 8192 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 1000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16
llama-factory/examples/lora_multi_gpu/70B/single_node_deepspeed.sh (new file, mode 100644)
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
#--gradient_accumulation_steps: Number of updates steps to accumulate before performing a backward/update pass. (default: 1)
#--preprocessing_num_workers: The number of processes to use for the pre-processing. (default: None)
MASTER_ADDR=''
# LoRA + single-node, multi-GPU + deepspeed
deepspeed --master_addr $MASTER_ADDR \
    --master_port 12345 \
    ../../src/train_bash.py \
    --deepspeed ../deepspeed/ds_z3_config.json \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Meta-Llama-3-70B-Instruct \
    --dataset alpaca_gpt4_zh,alpaca_zh \
    --dataset_dir ../../data \
    --template llama3 \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir saves/LLaMA3-70B-Instruct/lora/sft-single-node \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 8192 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --warmup_ratio 0.1 \
    --save_steps 500 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16
llama-factory/examples/lora_multi_gpu/8B/single_node.sh (new file, mode 100644)
#!/bin/bash
export HSA_FORCE_FINE_GRAIN_PCIE=1
# LoRA + single-node, multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
    --config_file ../accelerate/single_config.yaml \
    ../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset alpaca_zh \
    --dataset_dir ../../data \
    --template llama3 \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir saves/LLaMA3-8B/lora/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 8192 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 1000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16