"tests/git@developer.sourcefind.cn:OpenDAS/fairscale.git" did not exist on "73f73120fed1bc806d7487dcc4e999ad63e12947"
Commit 101db0e9 authored by Azure

Merge branch 'main' into update-yaml

parents 3897f001 7e58f9d2
WeChatGrouop.jpg replaced (175 KB → 168 KB)
# FAQ
## Install
### Q: ImportError: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.32' not found
On Ubuntu 22.04 you need to add the toolchain PPA and upgrade libstdc++:
```
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install --only-upgrade libstdc++6
```
from https://github.com/kvcache-ai/ktransformers/issues/117#issuecomment-2647542979
### Q: DeepSeek-R1 not outputting initial <think> token
> from the DeepSeek-R1 doc:<br>
> Additionally, we have observed that the DeepSeek-R1 series models tend to bypass thinking pattern (i.e., outputting "\<think>\n\n\</think>") when responding to certain queries, which can adversely affect the model's performance. To ensure that the model engages in thorough reasoning, we recommend enforcing the model to initiate its response with "\<think>\n" at the beginning of every output.

... and passing the arg `--force_think true` lets local_chat initiate the response with "\<think>\n".

from https://github.com/kvcache-ai/ktransformers/issues/129#issue-2842799552
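A minimal sketch of what that flag does conceptually (illustrative only, not ktransformers' exact code; it assumes the prompt has already been run through the chat template — the actual change is in the code diff further below):

```python
# Illustrative sketch: append the "<think>\n" token ids to the already-templated
# prompt so the model's reply necessarily starts inside the thinking block.
import torch

def force_think(input_ids: torch.Tensor, think_token_ids: list[int]) -> torch.Tensor:
    # input_ids: shape (1, prompt_len)
    # think_token_ids: e.g. tokenizer.encode("<think>\n", add_special_tokens=False)
    think = torch.tensor([think_token_ids], device=input_ids.device)
    return torch.cat([input_ids, think], dim=1)
```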
## Usage
### Q: If I got more VRAM than the model's requirement, how can I fully utilize it?
1. Get larger context.
   1. local_chat.py: You can increase the context window size by setting `--max_new_tokens` to a larger value.
...
> Note: The first matched rule in the yaml will be applied. For example, if you have two rules that match the same layer, only the first rule's replacement will be valid.
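To make the first-match behavior concrete, here is a hedged sketch of how such rule resolution works (the rule layout mirrors the shipped optimize-rule yaml files, but the replacement class names below are invented for illustration):

```python
# Hedged sketch of "first matched rule wins" when applying optimize-rule yaml entries.
import re

def pick_rule(module_name: str, rules: list[dict]) -> dict | None:
    for rule in rules:
        if re.search(rule["match"]["name"], module_name):
            return rule                      # first matching rule is applied...
    return None                              # ...later matches are never reached

rules = [
    {"match": {"name": r"\.mlp\.experts$"}, "replace": {"class": "ExpertsOnCPU"}},
    {"match": {"name": r"\.mlp\."},         "replace": {"class": "MlpOnGPU"}},
]
# Both rules match this module, but only the first one's replacement is used:
print(pick_rule("model.layers.3.mlp.experts", rules)["replace"]["class"])  # ExpertsOnCPU
```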
### Q: If I don't have enough VRAM, but I have multiple GPUs, how can I utilize them?
Use `--optimize_rule_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml` to load the two-GPU optimized rule yaml file. You may also use it as an example to write your own 4/8-GPU optimized rule yaml file.
> Note: The ktransformers multi-GPU strategy is pipeline parallelism, which does not speed up the model's inference; it only distributes the model's weights across GPUs.
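As a rough illustration of that pipeline split (a hypothetical helper, not ktransformers code): layers are divided into contiguous blocks and each block is pinned to one GPU, so VRAM is pooled across cards but a single request still passes through the blocks one after another.

```python
# Hypothetical illustration of a pipeline-style split: contiguous layer blocks per GPU.
def assign_devices(num_layers: int, gpus: list[str]) -> dict[int, str]:
    per_gpu = (num_layers + len(gpus) - 1) // len(gpus)   # ceil division
    return {layer: gpus[layer // per_gpu] for layer in range(num_layers)}

print(assign_devices(61, ["cuda:0", "cuda:1"]))  # layers 0-30 on cuda:0, 31-60 on cuda:1
```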
### Q: How to get the best performance?
You have to set `--cpu_infer` to the number of cores you want to use. The more cores you use, the faster the model will run, but more is not always better: set it slightly below your actual number of cores.
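A rough starting point for picking that value (illustrative heuristic only; `os.cpu_count()` reports logical cores, so halve it when SMT/hyper-threading is enabled and leave a little headroom):

```python
# Illustrative heuristic for a starting --cpu_infer value; tune from here.
import os

logical_cores = os.cpu_count() or 1
physical_guess = max(1, logical_cores // 2)      # assume SMT doubles the logical count
suggested = max(1, physical_guess - 2)           # leave headroom for the main process/IO
print(f"try: --cpu_infer {suggested}")
```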
### Q: My DeepSeek-R1 model is not thinking.
According to DeepSeek, you need to enforce the model to initiate its response with "\<think>\n" at the beginning of every output by passing the arg `--force_think true`.
### Q: Loading gguf error
Make sure you:
1. Have the `gguf` file in the `--gguf_path` directory.
2. The directory only contains gguf files from one model. If you have multiple models, you need to separate them into different directories.
3. The folder name itself should not end with `.gguf`, e.g. `Deep-gguf` is correct, `Deep.gguf` is wrong.
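A quick way to sanity-check a directory against the three points above (illustrative helper, not part of ktransformers; the shard-prefix heuristic assumes the usual `xxx-00001-of-00009.gguf` naming):

```python
# Illustrative sanity check for a --gguf_path directory (not part of ktransformers).
from pathlib import Path

def check_gguf_dir(gguf_path: str) -> None:
    p = Path(gguf_path)
    if p.name.endswith(".gguf"):
        raise ValueError("the folder name itself must not end with .gguf")
    ggufs = sorted(p.glob("*.gguf"))
    if not ggufs:
        raise ValueError(f"no .gguf files found in {p}")
    # shards of a single model usually share a prefix like 'xxx-00001-of-00009.gguf'
    prefixes = {f.name.split("-0000")[0] for f in ggufs}
    if len(prefixes) > 1:
        raise ValueError(f"directory seems to mix models: {sorted(prefixes)}")
    print(f"OK: {len(ggufs)} gguf file(s) from a single model")
```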
### Q: Version `GLIBCXX_3.4.30' not found
The detailed error:
> ImportError: /mnt/data/miniconda3/envs/xxx/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /home/xxx/xxx/ktransformers/./cpuinfer_ext.cpython-312-x86_64-linux-gnu.so)

This usually means the libstdc++ inside your conda env does not provide that version. First leave the env with `conda deactivate` and run `whereis libstdc++.so.6` to find the system library's path, then re-enter the env and copy the .so over the env's copy with `cp <path of outer libstdc++> <path of your conda env libstdc++>`.
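The same workaround scripted in Python, as a hedged sketch (run it from inside the conda env you want to fix, and double-check the source path before overwriting anything):

```python
# Hedged sketch of the workaround above: copy the newer system libstdc++.so.6
# over the older copy inside the active conda env (sys.prefix points at the env).
import os, shutil, subprocess, sys

env_copy = os.path.join(sys.prefix, "lib", "libstdc++.so.6")
out = subprocess.run(["whereis", "libstdc++.so.6"], capture_output=True, text=True)
candidates = [p for p in out.stdout.split()[1:] if os.path.isfile(p) and p != env_copy]
if not candidates:
    raise SystemExit("system libstdc++.so.6 not found")
print(f"copying {candidates[0]} -> {env_copy}")
shutil.copy2(candidates[0], env_copy)
```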
```diff
@@ -160,7 +160,7 @@ def local_chat(
             messages, add_generation_prompt=True, return_tensors="pt"
         )
         if force_think:
-            token_thinks = torch.tensor([tokenizer.encode("<think>\\n",add_special_tokens=False)])
+            token_thinks = torch.tensor([tokenizer.encode("<think>\\n",add_special_tokens=False)],device=input_tensor.device)
             input_tensor = torch.cat(
                 [input_tensor, token_thinks], dim=1
             )
```
```diff
@@ -90,6 +90,7 @@ class ArgumentParser:
         # user config
         parser.add_argument("--user_secret_key", type=str, default=self.cfg.user_secret_key)
         parser.add_argument("--user_algorithm", type=str, default=self.cfg.user_algorithm)
+        parser.add_argument("--force_think", type=bool, default=self.cfg.user_force_think)
         # web config
         parser.add_argument("--web_cross_domain", type=bool, default=self.cfg.web_cross_domain)
@@ -121,4 +122,5 @@ class ArgumentParser:
         self.cfg.server_ip = args.host
         self.cfg.server_port = args.port
         self.cfg.backend_type = args.type
+        self.cfg.user_force_think = args.force_think
         return args
```
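One standard-library caveat about the `type=bool` pattern added above (general argparse behavior, not specific to this code): `bool("false")` is `True`, so any non-empty string passed on the command line enables the flag.

```python
# Plain-argparse demonstration: type=bool turns every non-empty string into True.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--force_think", type=bool, default=False)
print(parser.parse_args(["--force_think", "false"]).force_think)  # True
print(parser.parse_args([]).force_think)                          # False (default)
```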
```diff
@@ -10,6 +10,7 @@ from transformers import (
     BitsAndBytesConfig,
 )
+from ktransformers.server.config.config import Config
 from ktransformers.server.schemas.base import ObjectID
 from ktransformers.server.utils.multi_timer import Profiler
 import torch
@@ -323,10 +324,19 @@ class TransformersInterface(BackendInterfaceBase):
             #input_ids = torch.tensor([[6366]], device=input_ids.device)
         else:
             raise ValueError("local_messages should be List or str")
+        if Config().user_force_think:
+            token_thinks = torch.tensor([self.tokenizer.encode("<think>\\n",add_special_tokens=False)],device=input_ids.device)
+            input_ids = torch.cat(
+                [input_ids, token_thinks], dim=1
+            )
         self.profiler.pause_timer("tokenize")
         self.profiler.create_and_start_timer("prefill")
+        if Config().user_force_think:
+            t = "<think>\n"
+            print(t,end="",flush=True)
+            yield t
         for t in self.prefill(input_ids, self.check_is_new(thread_id)):
             if t is not None:
                 print(t, end="",flush=True)
@@ -337,7 +347,7 @@ class TransformersInterface(BackendInterfaceBase):
         for t in self.generate():
             if t is not None:
                 print(t, end="",flush=True)
                 yield t
         print("")
         self.profiler.pause_timer("decode")
         self.report_last_time_performance()
```
```diff
@@ -83,6 +83,7 @@ class Config(metaclass=Singleton):
         self.user_config: dict = cfg.get("user", {})
         self.user_secret_key = self.user_config.get("secret_key", "")
         self.user_algorithm = self.user_config.get("algorithm", "")
+        self.user_force_think = self.user_config.get("force_think", False)
         # model config
         self.model: dict = cfg.get("model", {})
```
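For context, a minimal sketch of the Singleton metaclass pattern that `Config` relies on above (illustrative reimplementation, not the ktransformers source): every `Config()` call returns the same instance, which is why the argument parser can set `user_force_think` once and the inference interface can later read it via `Config().user_force_think`.

```python
# Minimal Singleton metaclass sketch: all Config() calls share one instance.
class Singleton(type):
    _instances: dict = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]

class Config(metaclass=Singleton):
    def __init__(self) -> None:
        self.user_force_think = False

Config().user_force_think = True      # e.g. set from parsed CLI args
print(Config().user_force_think)      # True: the same shared instance is returned
```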