# FAQ
## Install
### Q: ImportError: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.32' not found
On Ubuntu 22.04 you need to upgrade `libstdc++6` from the toolchain PPA:

```bash
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install --only-upgrade libstdc++6
```
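
You can verify that the upgraded library now provides the missing symbol (the path below is the one from the error message; adjust it if yours differs):

```bash
strings /lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX_3.4.32
```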
From https://github.com/kvcache-ai/ktransformers/issues/117#issuecomment-2647542979
### Q: DeepSeek-R1 not outputting initial `<think>` token

> From the DeepSeek-R1 docs:<br>
> Additionally, we have observed that the DeepSeek-R1 series models tend to bypass thinking pattern (i.e., outputting "\<think>\n\n\</think>") when responding to certain queries, which can adversely affect the model's performance. To ensure that the model engages in thorough reasoning, we recommend enforcing the model to initiate its response with "\<think>\n" at the beginning of every output.

We fix this by manually appending the "\<think>\n" token at the end of the prompt (see local_chat.py).
Passing the arg `--force_think true` makes local_chat initiate the response with "\<think>\n".
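
A minimal invocation (the model and GGUF paths below are placeholders; substitute your own):

```bash
python ktransformers/local_chat.py \
    --model_path deepseek-ai/DeepSeek-R1 \
    --gguf_path /path/to/DeepSeek-R1-gguf \
    --force_think true
```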

From https://github.com/kvcache-ai/ktransformers/issues/129#issue-2842799552

## Usage
### Q: If I have more VRAM than the model requires, how can I fully utilize it?

1. Use a larger context.
   1. local_chat.py: You can increase the context window size by setting `--max_new_tokens` to a larger value (see the example commands after this list).
   2. server: Increase `--cache_lens` to a larger value.
2. Move more weights to the GPU.
    Refer to `ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-4.yaml`:
    ```yaml
    - match:
        name: "^model\\.layers\\.([4-9]|10)\\.mlp\\.experts$" # inject experts in layers 4~10 as marlin experts
      replace:
        class: ktransformers.operators.experts.KTransformersExperts
        kwargs:
          generate_device: "cuda:0" # run on cuda:0; marlin only supports GPU
          generate_op: "KExpertsMarlin" # use the marlin expert kernel
      recursive: False
    ```
    You can modify the layer range as you want, e.g. change `name: "^model\\.layers\\.([4-9]|10)\\.mlp\\.experts$"` to `name: "^model\\.layers\\.([4-9]|1[0-2])\\.mlp\\.experts$"` to move the experts of layers 11 and 12 to the GPU as well.

    > Note: The first matching rule in the yaml file is applied. For example, if two rules match the same layer, only the first rule's replacement takes effect.
    > Note: Currently, executing experts on the GPU conflicts with CUDA Graph, and running without CUDA Graph causes a significant slowdown. Therefore, unless you have a substantial amount of VRAM (placing a single layer of experts for DeepSeek-V3/R1 on the GPU requires at least 5.6GB of VRAM), we do not recommend enabling this feature. We are actively working on optimizing it.
    > Note: `KExpertsTorch` is untested.
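
For option 1, a minimal sketch of both invocations (the model/GGUF paths and the values are placeholders, and the server entry point assumed here is `ktransformers/server/main.py`):

```bash
# local_chat.py: allow longer generations
python ktransformers/local_chat.py \
    --model_path deepseek-ai/DeepSeek-V3 \
    --gguf_path /path/to/DeepSeek-V3-gguf \
    --max_new_tokens 8192

# server: enlarge the KV cache
python ktransformers/server/main.py \
    --model_path deepseek-ai/DeepSeek-V3 \
    --gguf_path /path/to/DeepSeek-V3-gguf \
    --cache_lens 16384
```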


### Q: If I don't have enough VRAM, but I have multiple GPUs, how can I utilize them?

Use `--optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml` to load the two-GPU optimized rule yaml file. You may also use it as an example to write your own optimized rule yaml file for 4 or 8 GPUs.
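
A minimal sketch (the model and GGUF paths are placeholders):

```bash
python ktransformers/local_chat.py \
    --model_path deepseek-ai/DeepSeek-V3 \
    --gguf_path /path/to/DeepSeek-V3-gguf \
    --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml
```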

> Note: ktransformers' multi-GPU strategy is pipelining, which does not speed up inference; it only distributes the model's weights across GPUs.

### Q: How to get the best performance?

Set `--cpu_infer` to the number of CPU cores you want to use. In general, more cores make the model run faster, but more is not always better: set it slightly below the actual number of physical cores on your machine.
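
For example, on a machine with 32 physical cores, a reasonable starting point (the model and GGUF paths are placeholders):

```bash
python ktransformers/local_chat.py \
    --model_path deepseek-ai/DeepSeek-V3 \
    --gguf_path /path/to/DeepSeek-V3-gguf \
    --cpu_infer 30
```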

### Q: My DeepSeek-R1 model is not thinking.

According to DeepSeek, you need to force the model to initiate its response with "\<think>\n" at the beginning of every output by passing the arg `--force_think true`.

### Q: Error when loading gguf files

Make sure that:
1. The `.gguf` file is in the directory passed as `--gguf_path`.
2. The directory only contains gguf files from one model. If you have multiple models, separate them into different directories.
3. The folder name itself does not end with `.gguf`, e.g. `Deep-gguf` is correct, `Deep.gguf` is wrong.
4. The file itself is not corrupted; you can verify this by checking that its sha256sum matches the one published on huggingface, modelscope, or hf-mirror (see the sketch after this list).
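
A minimal sketch of these checks (the directory and the gguf filename are placeholders):

```bash
GGUF_DIR=/path/to/DeepSeek-R1-gguf   # folder name must not end with .gguf

# checks 1-2: the directory should list gguf files from exactly one model
ls "$GGUF_DIR"/*.gguf

# check 4: compare against the checksum published on
# huggingface / modelscope / hf-mirror
sha256sum "$GGUF_DIR"/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf
```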

### Q: Version `GLIBCXX_3.4.30' not found
The detailed error:
> ImportError: /mnt/data/miniconda3/envs/xxx/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /home/xxx/xxx/ktransformers/./cpuinfer_ext.cpython-312-x86_64-linux-gnu.so)

Running `conda install -c conda-forge libstdcxx-ng` in the affected conda environment can solve the problem.
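
To confirm that the environment now provides the required symbol, you can check the bundled library directly (a sketch, assuming the environment is activated):

```bash
strings "$CONDA_PREFIX/lib/libstdc++.so.6" | grep GLIBCXX_3.4.30
```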