Commit bb6920ed authored by Azure

update doc

parent 9c71bcb0
@@ -16,7 +16,9 @@
- [Memory consumptions:](#memory-consumptions)
- [Benchmark results](#benchmark-results-2)
- [How to Run](#how-to-run)
- [V0.2.2 longer context \& FP8 kernel](#v022-longer-context--fp8-kernel)
- [longer context](#longer-context)
- [FP8 kernel](#fp8-kernel)
- [V0.2 \& V0.2.1 Showcase](#v02--v021-showcase)
- [Single socket version (32 cores)](#single-socket-version-32-cores)
- [Dual socket version (64 cores)](#dual-socket-version-64-cores)
@@ -155,7 +157,11 @@ the output quality doesn't change. But the speed of decoding and prefill
is sped up, which is inspiring. So our showcase makes use of this finding*
## How to Run
### V0.2.2 longer context & FP8 kernel
#### longer context
To use this feature, [install flashinfer](https://github.com/flashinfer-ai/flashinfer) first.
If you want to use long context (longer than 20K tokens) for prefill, enable the matrix absorption MLA during the prefill phase, which will significantly reduce the size of the KV cache. Modify the YAML file like this:
```
- match:
@@ -167,6 +173,18 @@ If you want to use long context (longer than 20K) for prefill, enable the matrix
prefill_device: "cuda"
absorb_for_prefill: True # change this to True to enable long context (prefill may be slower).
```
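If you prefer not to edit the rule file by hand, the same flag can be flipped with a small script. The sketch below is only illustrative: the rule-file name is a placeholder, and it assumes the file is plain YAML containing an `absorb_for_prefill` key somewhere inside, as in the snippet above.
```python
# Illustrative helper: set absorb_for_prefill to True in an optimize-rule YAML file.
# The file name below is a placeholder; point it at your own rule file.
# Note: yaml.safe_dump rewrites the file and drops any comments it contained.
import yaml  # pip install pyyaml

RULE_FILE = "DeepSeek-V3-Chat.yaml"  # hypothetical file name

def enable_absorb(node):
    """Recursively set absorb_for_prefill to True wherever the key appears."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "absorb_for_prefill":
                node[key] = True
            else:
                enable_absorb(value)
    elif isinstance(node, list):
        for item in node:
            enable_absorb(item)

with open(RULE_FILE) as f:
    rules = yaml.safe_load(f)

enable_absorb(rules)

with open(RULE_FILE, "w") as f:
    yaml.safe_dump(rules, f, sort_keys=False)
```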
#### FP8 kernel
The DeepSeek-AI team provides FP8 safetensors for the DeepSeek-R1/V3 models. We achieve performance optimization through the following work:
- **FP8 GPU Kernel Integration**: FP8 linear-layer acceleration kernels are integrated into KTransformers
- **Hybrid Quantization Architecture**:
  - Attention and Shared-Expert modules use FP8 precision (improves computational accuracy)
  - Expert modules retain GGML quantization (GGUF format, residing on the CPU to save GPU memory)
So those pursuing the best performance can use the FP8 linear kernel for DeepSeek-V3/R1.
The detailed guide is [here](./fp8_kernel.md).
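To make the hybrid layout above concrete, here is a minimal illustrative sketch of the placement policy it describes. This is not KTransformers code; the module-name patterns, format labels, and helper are assumptions chosen purely for illustration.
```python
# Illustrative sketch of the hybrid quantization/placement scheme described above.
# NOT the KTransformers implementation; all names below are hypothetical.
from dataclasses import dataclass

@dataclass
class Placement:
    quant: str   # weight format used for this module
    device: str  # where the module's weights live

def place_module(module_name: str) -> Placement:
    """Pick a format and device per module, mirroring the scheme above."""
    if ".mlp.experts." in module_name:
        # Routed experts: keep GGML (GGUF) quantization and run on the CPU
        # to save GPU memory.
        return Placement(quant="ggml_q4_k_m", device="cpu")
    if ".self_attn." in module_name or ".mlp.shared_experts." in module_name:
        # Attention and shared experts: FP8 linear kernels on the GPU.
        return Placement(quant="fp8", device="cuda")
    # Everything else (embeddings, norms, lm_head, ...): default precision on GPU.
    return Placement(quant="bf16", device="cuda")

print(place_module("model.layers.3.mlp.experts.7.down_proj"))  # GGML on CPU
print(place_module("model.layers.3.self_attn.kv_b_proj"))      # FP8 on CUDA
```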
### V0.2 & V0.2.1 Showcase
#### Single socket version (32 cores)
Our local_chat test command is:
...

fp8_kernel.md
# FP8 Linear Kernel for DeepSeek-V3/R1
## Overview
The DeepSeek-AI team provides FP8 safetensors for the DeepSeek-R1/V3 models. We achieve performance optimization through the following work:
@@ -17,8 +17,8 @@ So those who are pursuing the best performance can use the FP8 linear kernel for
### Using Pre-Merged Weights
Pre-merged weights are available on Hugging Face:
[KVCache-ai/DeepSeek-V3-GGML-FP8-Hybrid](https://huggingface.co/KVCache-ai/DeepSeek-V3)
[KVCache-ai/DeepSeek-R1-GGML-FP8-Hybrid](https://huggingface.co/KVCache-ai/DeepSeek-R1)
> Please confirm the weights are fully uploaded before downloading. The large file size may extend Hugging Face upload time.
@@ -29,7 +29,7 @@ pip install -U huggingface_hub
# Optional: Use the HF Mirror for faster downloads in certain regions.
# export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download KVCache-ai/DeepSeek-V3-GGML-FP8-Hybrid --local-dir <local_dir>
```
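If you would rather script the download, the same repository can be fetched with the `huggingface_hub` Python API; a minimal sketch follows (the `<local_dir>` placeholder is the same as in the command above):
```python
# Minimal sketch: fetch the pre-merged weights with the huggingface_hub API.
# Re-running the call skips files that are already fully downloaded.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="KVCache-ai/DeepSeek-V3-GGML-FP8-Hybrid",
    local_dir="<local_dir>",  # placeholder: replace with your target directory
)
```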
### Using Merge Scripts
If you have local DeepSeek-R1/V3 FP8 safetensors and Q4_K_M GGUF weights, you can merge them using the following scripts.
...