Add solution for DeepSpeed CPU offload stage 3 error

Commit e92143e3 authored Dec 25, 2024 by chenych

Showing with 3 additions and 0 deletions

README.md +3 -0

--- a/README.md
+++ b/README.md
@@ -43,6 +43,9 @@ LLaMA Factory is a large language model training and inference framework that supports ModelScope
 > 2. `XVERSE` has a compatibility issue with `tokenizer > 0.19` that raises `Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrappe`. To fix it, replace the corresponding tokenizer files in the original model files with `tokenizer_config.json.update`/`tokenizer.json.update` from [XVERSE-13B-256K-hf](https://huggingface.co/xverse/XVERSE-13B-256K/tree/main). For details, see [xverse-ai/XVERSE-7B issues](https://github.com/xverse-ai/XVERSE-7B/issues/1)
 >
 > 3. `Qwen2` training only supports bf16; **with fp16, both the loss and the lr drop to 0**. See [issues](https://github.com/hiyouga/LLaMA-Factory/issues/4848)
+>
+> 4. `deepspeed-cpu-offload-stage3` fails with `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!`. This is a bug in DeepSpeed itself; for the workaround, see the official [issue](https://github.com/microsoft/DeepSpeed/issues/5634)
 ## Installing by Building from Source
 ### Environment Setup
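The XVERSE tokenizer fix comes down to overwriting two files. The following is a minimal sketch. It assumes the `.update` files from XVERSE-13B-256K-hf have been downloaded to a local directory, and that each one replaces the model file with the same name minus the `.update` suffix. The helper `replace_tokenizer_files`, the `UPDATE_FILES` mapping, and the `.bak` backup are illustrative additions, not part of LLaMA Factory or XVERSE.

```python
import shutil
from pathlib import Path

# Hypothetical mapping (assumption): each ".update" file shipped with
# XVERSE-13B-256K-hf replaces the same-named tokenizer file in the model directory.
UPDATE_FILES = {
    "tokenizer_config.json.update": "tokenizer_config.json",
    "tokenizer.json.update": "tokenizer.json",
}


def replace_tokenizer_files(update_dir: str, model_dir: str, backup: bool = True) -> list:
    """Copy the *.update tokenizer files over the model's originals; return the replaced paths."""
    src_root, dst_root = Path(update_dir), Path(model_dir)
    replaced = []
    for src_name, dst_name in UPDATE_FILES.items():
        src, dst = src_root / src_name, dst_root / dst_name
        if not src.is_file():
            raise FileNotFoundError(f"missing update file: {src}")
        if backup and dst.exists():
            # keep the original tokenizer file next to the new one as <name>.bak
            shutil.copy2(dst, dst.with_name(dst.name + ".bak"))
        shutil.copyfile(src, dst)
        replaced.append(dst)
    return replaced
```

For example, you could call `replace_tokenizer_files("XVERSE-13B-256K-hf", "path/to/xverse-model")`, where both paths are placeholders. Keeping a `.bak` copy makes the swap easy to undo.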
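The `deepspeed-cpu-offload-stage3` note refers to DeepSpeed ZeRO stage 3 with CPU offload. That setup typically enables both `offload_optimizer` and `offload_param` with `device` set to `"cpu"`. The sketch below builds such a config in Python and writes it to JSON, which is the kind of setup in which the error is reported. It does not fix the bug; the workaround is the one discussed in the linked DeepSpeed issue. The helper names and the output file name are hypothetical. The `"auto"` values assume the config is consumed through the Hugging Face Trainer DeepSpeed integration.

```python
import json
from pathlib import Path


def zero3_cpu_offload_config() -> dict:
    """DeepSpeed ZeRO stage 3 config with optimizer state and parameters offloaded to CPU."""
    return {
        # "auto" is resolved by the Hugging Face Trainer integration from its own arguments
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "bf16": {"enabled": "auto"},
        "zero_optimization": {
            "stage": 3,
            # optimizer states are kept in pinned host (CPU) memory
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
            # model parameters are offloaded to CPU memory as well
            "offload_param": {"device": "cpu", "pin_memory": True},
            "overlap_comm": True,
            "contiguous_gradients": True,
            "stage3_gather_16bit_weights_on_model_save": True,
        },
    }


def write_config(path: str) -> Path:
    """Serialize the config to JSON at `path` and return the written path."""
    out = Path(path)
    out.write_text(json.dumps(zero3_cpu_offload_config(), indent=2))
    return out
```

A file written with, for example, `write_config("ds_z3_offload_config.json")` can then be passed as the `deepspeed` argument of the training configuration. Hugging Face `TrainingArguments.deepspeed` accepts a path to such a JSON file.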