description: Create a report to help us reproduce and fix the bug
title: "[Bug]"
labels: ['Bug']

body:
- type: checkboxes
  attributes:
    label: Checklist
    options:
      - label: 1. I have searched related issues but cannot get the expected help.
      - label: 2. The bug has not been fixed in the latest version.
      - label: 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
      - label: 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
      - label: 5. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-Chinese/English content without translation may be closed.
- type: textarea
  attributes:
    label: Describe the bug
    description: A clear and concise description of what the bug is.
  validations:
    required: true
- type: textarea
  attributes:
    label: Reproduction
    description: |
      What command or script did you run? Which **model** are you using?
    placeholder: |
      A placeholder for the command.
  validations:
    required: true
- type: textarea
  attributes:
    label: Environment
    description: |
      Please provide the necessary environment information here (e.g. OS/GPU/CPU). Otherwise the issue will be closed.
- type: checkboxes
  attributes:
    label: Checklist
    options:
      - label: 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
      - label: 2. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-English/Chinese content without translation may be closed.
- type: textarea
  attributes:
    label: Motivation
    description: |
      A clear and concise description of the motivation of the feature.
  validations:
    required: true
- type: textarea
  attributes:
    label: Related resources
    description: |
      If there is an official code release or third-party implementations, please also provide the information here, which would be very helpful.

### V0.2.2 longer context & FP8 kernel

#### longer context

To use this feature, [install flashinfer](https://github.com/flashinfer-ai/flashinfer) first.
Note: The latest MLA kernel in FlashInfer still has a few minor issues. They are continuously being fixed on the main branch, so if you are using FlashInfer, please install it from the main-branch source code.
If you want to use long context (longer than 20K tokens) for prefill, enable matrix-absorption MLA during the prefill phase; this significantly reduces the size of the KV cache. Modify the yaml file like this:
```yaml
...
prefill_device: "cuda"
absorb_for_prefill: True # change this to True to enable long context (prefill may be slower).
```
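For orientation, these two keys normally sit inside the `kwargs` of an attention-rule entry in a ktransformers optimize-rule yaml. The sketch below shows one plausible shape of such an entry; the match pattern, class path, and `generate_device` line are assumptions for illustration, not copied from the repository, while `prefill_device` and `absorb_for_prefill` come from the snippet above.

```yaml
# Illustrative sketch only. The match regex, class path, and generate_device are assumptions;
# check the optimize-rule yaml shipped with your ktransformers version for the real values.
- match:
    name: "^model\\.layers\\..*\\.self_attn$"      # assumed pattern selecting the attention modules
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention  # assumed optimized MLA operator
    kwargs:
      generate_device: "cuda"
      prefill_device: "cuda"
      absorb_for_prefill: True   # matrix-absorbed MLA during prefill, for long context
```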
If the VRAM is still insufficient, try reducing the `chunk_prefill_size` parameter (default: 8192) to further reduce the memory used by intermediate results during chunked prefill.
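Where exactly `chunk_prefill_size` is configured depends on how you launch ktransformers (it may be a launch argument or a config entry in your version). Purely as a hypothetical illustration, if it were exposed as an extra kwarg in the same rule entry, lowering it would look like this:

```yaml
# Hypothetical placement: assumes chunk_prefill_size can be set as a kwarg here.
# Check where your ktransformers version actually exposes this parameter.
    kwargs:
      prefill_device: "cuda"
      absorb_for_prefill: True
      chunk_prefill_size: 4096   # lowered from the default 8192 to shrink intermediate results
```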

#### FP8 kernel

The DeepSeek-AI team provides FP8 safetensors for DeepSeek-R1/V3 models. We achieve performance optimization through the following works: