Unverified commit 761de498 authored by Atream, committed by GitHub

Merge pull request #751 from kvcache-ai/Atream-patch-2

Update DeepseekR1_V3_tutorial.md
parents bd33a59e 735873a3
@@ -160,6 +160,7 @@ is sped up, which is inspiring. So our showcase makes use of this finding*
### V0.2.2 longer context & FP8 kernel
#### longer context
To use this feature, [install flashinfer](https://github.com/flashinfer-ai/flashinfer) first.
Note: the latest MLA kernel in FlashInfer still has a few minor issues that are being fixed on the main branch. If you are using FlashInfer, please install it from the main-branch source code.
If you want to use a long context (longer than 20K tokens) for prefill, enable matrix-absorbed MLA during the prefill phase, which significantly reduces the size of the KV cache. Modify the yaml file like this:
@@ -173,6 +174,8 @@ If you want to use a long context (longer than 20K tokens) for prefill, enable the matrix
prefill_device: "cuda"
absorb_for_prefill: True # change this to True to enable long context (prefill may be slower).
```
If VRAM is still insufficient, try reducing the `chunk_prefill_size` parameter (default 8192) to further reduce the memory used by intermediate results during chunked prefill.
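
For orientation, here is a minimal sketch of the kind of optimize-rule entry this setting lives in. The match regex and operator class path below are illustrative assumptions, not taken from this diff; change `absorb_for_prefill` in the attention rule your yaml already contains rather than adding a new entry:

```yaml
# Illustrative optimize-rule entry for DeepSeek MLA attention.
# The regex and class path are assumptions and may differ in your
# version of ktransformers; only absorb_for_prefill needs to change.
- match:
    name: "^model\\.layers\\..*\\.self_attn$"   # target the attention modules
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention  # assumed MLA operator name
    kwargs:
      generate_device: "cuda"
      prefill_device: "cuda"
      absorb_for_prefill: True   # matrix-absorbed MLA during prefill (prefill may be slower)
```
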
#### FP8 kernel
The DeepSeek-AI team provides FP8 safetensors for the DeepSeek-R1/V3 models. We achieve performance optimization through the following work:
...