update readme

Commit b1818b8e authored Apr 20, 2024 by Rayyyyy

Showing with 55 additions and 15 deletions

README.md README.md +50 -10

test.sh test.sh +5 -5

--- a/README.md
+++ b/README.md
@@ -41,7 +41,7 @@ DTK driver: dtk23.10.1
 python：python3.8
 torch:2.1.0
 ```
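-`Tips: the versions of the DTK driver, python, paddle, and other DCU-related tools listed above must match one another exactly`
+`Tips: the versions of the DTK driver, python, torch, and other DCU-related tools listed above must match one another exactly`
 Other non-deep-learning libraries are installed as follows: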
 ```bash
@@ -55,16 +55,23 @@ pip install -e .
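 Not available yet
 ## Inference
-For how to download the pretrained models, see the [Pretrained weights](#预训练权重) section below. Once the model is ready, run
+For how to download the pretrained models, see the [Pretrained weights](#预训练权重) section below. Different models require different model-parallel (MP) values, as shown in the following table:
- Meta-Llama-3-8B-Instruct model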
-```bash
-torchrun --nproc_per_node 1 example_chat_completion.py \
-    --ckpt_dir Meta-Llama-3-8B-Instruct/original/ \
-    --tokenizer_path Meta-Llama-3-8B-Instruct/original/tokenizer.model \
-    --max_seq_len 512 --max_batch_size 6
-```
- Meta-Llama-3-8B model
+|  Model | MP |
+|--------|----|
+| 8B     | 1  |
+| 70B    | 8  |
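+All models support sequence lengths of up to 8192 tokens, but the cache is pre-allocated according to the `max_seq_len` and `max_batch_size` values, so set these according to your hardware.
+**Tips:**
+- `--nproc_per_node` must be set to the model's MP value.
+- Set the `max_seq_len` and `max_batch_size` parameters as needed.
+### Pretrained models
+These models are not fine-tuned for chat or Q&A. They should be prompted so that the expected answer is the natural continuation of the prompt. See example_text_completion.py for examples; the command below runs it with the llama-3-8b model.
+- Meta-Llama-3-8B model example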
 ```bash
 torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir Meta-Llama-3-8B/original/ \
@@ -72,6 +79,23 @@ torchrun --nproc_per_node 1 example_text_completion.py \
    --max_seq_len 128 --max_batch_size 4
 ```
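+### Instruction-tuned models
+The fine-tuned models were trained for dialogue applications. To get the expected features and performance, the specific format defined in [`ChatFormat`](https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py#L202) must be followed:
+- The prompt begins with the `<|begin_of_text|>` special token, followed by one or more messages.
+- Each message starts with the `<|start_header_id|>` tag, the role `system`, `user`, or `assistant`, and the `<|end_header_id|>` tag.
+- The message content follows after a double newline `\n\n`.
+- The end of each message is marked by the `<|eot_id|>` token.
+You can also deploy additional classifiers to filter out inputs and outputs that are deemed unsafe. See the [llama-recipes repo](https://github.com/meta-llama/llama-recipes/blob/main/recipes/inference/local_inference/inference.py) for how to add a safety checker to the inputs and outputs of the inference code.
+- Meta-Llama-3-8B-Instruct model example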
+```bash
+torchrun --nproc_per_node 1 example_chat_completion.py \
+    --ckpt_dir Meta-Llama-3-8B-Instruct/original/ \
+    --tokenizer_path Meta-Llama-3-8B-Instruct/original/tokenizer.model \
+    --max_seq_len 512 --max_batch_size 6
+```
 ## result
 - Meta-Llama-3-8B-Instruct
 <div align=center>
@@ -107,12 +131,28 @@ export HF_ENDPOINT=https://hf-mirror.com
 mkdir Meta-Llama-3-8B-Instruct
 huggingface-cli download --resume-download meta-llama/Meta-Llama-3-8B-Instruct --include "original/*" --local-dir Meta-Llama-3-8B-Instruct --token hf_*
 ```
 - Meta-Llama-3-8B model
 ```bash
 mkdir Meta-Llama-3-8B
 huggingface-cli download --resume-download meta-llama/Meta-Llama-3-8B --include "original/*" --local-dir Meta-Llama-3-8B --token hf_*
 ```
+The model directory structure is as follows:
+```bash
+├── llama3_pytorch
+│   ├── Meta-Llama-3-8B-Instruct
+│       ├── original
+│           ├── consolidated.00.pth
+│           ├── params.json
+│           └── tokenizer.model
+│   ├── Meta-Llama-3-8B
+│       ├── original
+│           ├── consolidated.00.pth
+│           ├── params.json
+│           └── tokenizer.model
+```
 ## Source repository and issue feedback
 - https://developer.hpccube.com/codes/modelzoo/llama3_pytorch
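
Note on the Instruction-tuned section added above: the four formatting rules can be illustrated with a small string-level sketch. This is not the upstream `ChatFormat` class, which operates on token IDs through the tokenizer; the function name `format_dialog` and the `add_generation_header` parameter are hypothetical, introduced only for illustration. The trailing assistant header is an assumption about how the prompt is left open for the model to complete.

```python
from typing import Dict, List

# Roles allowed in a message header, as listed in the README.
ALLOWED_ROLES = ("system", "user", "assistant")


def format_dialog(
    messages: List[Dict[str, str]],
    add_generation_header: bool = True,  # hypothetical switch, not an upstream flag
) -> str:
    """Render a dialog as plain text following the README's four rules."""
    # Rule 1: the prompt begins with the <|begin_of_text|> special token.
    parts = ["<|begin_of_text|>"]
    for message in messages:
        role = message["role"]
        if role not in ALLOWED_ROLES:
            raise ValueError(f"unsupported role: {role!r}")
        # Rule 2: each message starts with a header naming the role.
        parts.append(f"<|start_header_id|>{role}<|end_header_id|>")
        # Rule 3: the content follows a double newline.
        parts.append("\n\n" + message["content"].strip())
        # Rule 4: the end of each message is marked by <|eot_id|>.
        parts.append("<|eot_id|>")
    if add_generation_header:
        # Assumption: leave an open assistant header for the model to complete.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


if __name__ == "__main__":
    dialog = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is model parallelism?"},
    ]
    print(format_dialog(dialog))
```

In practice example_chat_completion.py handles this formatting itself; the sketch is only useful for checking by eye what a correctly formatted prompt looks like.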

--- a/test.sh
+++ b/test.sh
 #!/bin/bash
 echo "Export params ..."
-export HIP_VISIBLE_DEVICES=0,1,2,3 # change to the card IDs and number of cards used for training
+export HIP_VISIBLE_DEVICES=0 # change to the card IDs and number of cards used for training
 export HSA_FORCE_FINE_GRAIN_PCIE=1
 export USE_MIOPEN_BATCHNORM=1
@@ -14,7 +14,7 @@ torchrun --nproc_per_node 1 example_chat_completion.py \
    --max_seq_len 512 --max_batch_size 6
 # Meta-Llama-3-8B model
-torchrun --nproc_per_node 1 example_text_completion.py \
+# torchrun --nproc_per_node 1 example_text_completion.py \
-    --ckpt_dir Meta-Llama-3-8B/original/ \
+#     --ckpt_dir Meta-Llama-3-8B/original/ \
-    --tokenizer_path Meta-Llama-3-8B/original/tokenizer.model \
+#     --tokenizer_path Meta-Llama-3-8B/original/tokenizer.model \
-    --max_seq_len 128 --max_batch_size 4
+#     --max_seq_len 128 --max_batch_size 4
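
Note on the launch settings: the README's MP table and the `--nproc_per_node` tip can be tied together in a small helper that derives both the process count and a matching `HIP_VISIBLE_DEVICES` value. This is a minimal sketch, not part of the commit. The names `MP_BY_MODEL`, `visible_devices`, and `build_torchrun_command` are hypothetical, and it assumes one process per card with consecutively numbered cards starting at 0.

```python
import shlex
from typing import List

# Model-parallel (MP) values from the README table.
MP_BY_MODEL = {"8B": 1, "70B": 8}


def visible_devices(model_size: str, first_device: int = 0) -> str:
    """Return a HIP_VISIBLE_DEVICES value with one card per MP rank.

    Assumption: cards are numbered consecutively from `first_device`.
    """
    mp = MP_BY_MODEL[model_size]
    return ",".join(str(first_device + i) for i in range(mp))


def build_torchrun_command(
    model_size: str,
    script: str,
    ckpt_dir: str,
    tokenizer_path: str,
    max_seq_len: int,
    max_batch_size: int,
) -> List[str]:
    """Build the torchrun argument list with --nproc_per_node set to MP."""
    return [
        "torchrun",
        "--nproc_per_node", str(MP_BY_MODEL[model_size]),
        script,
        "--ckpt_dir", ckpt_dir,
        "--tokenizer_path", tokenizer_path,
        "--max_seq_len", str(max_seq_len),
        "--max_batch_size", str(max_batch_size),
    ]


if __name__ == "__main__":
    cmd = build_torchrun_command(
        "8B",
        "example_chat_completion.py",
        "Meta-Llama-3-8B-Instruct/original/",
        "Meta-Llama-3-8B-Instruct/original/tokenizer.model",
        max_seq_len=512,
        max_batch_size=6,
    )
    # Print the settings instead of executing them, so the sketch is safe to run.
    print(f"export HIP_VISIBLE_DEVICES={visible_devices('8B')}")
    print(shlex.join(cmd))
```

For the 8B model this reproduces the single-card setting that test.sh now uses; a 70B run would need eight visible cards under the same assumption.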