Commit e92143e3 authored by chenych's avatar chenych
Browse files

Add solutions of deepspeed cpu offload stage3

parent 1a7440bc
...@@ -43,6 +43,9 @@ LLaMA Factory是一个大语言模型训练和推理的框架,支持了魔搭 ...@@ -43,6 +43,9 @@ LLaMA Factory是一个大语言模型训练和推理的框架,支持了魔搭
> 2. `XVERSE`在`tokenizer > 0.19`的版本下有兼容性问题报错`Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrappe`,需要使用[XVERSE-13B-256K-hf](https://huggingface.co/xverse/XVERSE-13B-256K/tree/main)中的`tokenizer_config.json.update`/`tokenizer.json.update`替换原有模型文件中的对应tokenizer文件,具体解决方法参考[xverse-ai/XVERSE-7B issues](https://github.com/xverse-ai/XVERSE-7B/issues/1) > 2. `XVERSE`在`tokenizer > 0.19`的版本下有兼容性问题报错`Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrappe`,需要使用[XVERSE-13B-256K-hf](https://huggingface.co/xverse/XVERSE-13B-256K/tree/main)中的`tokenizer_config.json.update`/`tokenizer.json.update`替换原有模型文件中的对应tokenizer文件,具体解决方法参考[xverse-ai/XVERSE-7B issues](https://github.com/xverse-ai/XVERSE-7B/issues/1)
> >
> 3. `Qwen2`训练仅支持bf16格式,**fp16会出现loss为0,lr为0的问题**,参考[issues](https://github.com/hiyouga/LLaMA-Factory/issues/4848) > 3. `Qwen2`训练仅支持bf16格式,**fp16会出现loss为0,lr为0的问题**,参考[issues](https://github.com/hiyouga/LLaMA-Factory/issues/4848)
>
> 4. `deepspeed-cpu-offload-stage3`出现`RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!`错误,是deepspeed本身bug,解决办法参考官方[issuse](https://github.com/microsoft/DeepSpeed/issues/5634)
## 使用源码编译方式安装 ## 使用源码编译方式安装
### 环境准备 ### 环境准备
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment