# This file is distributed under the same license as the Qwen package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2024.
#
msgid ""
msgstr ""
"Project-Id-Version: Qwen \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-04-28 19:42+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.17.0\n"
#: ../../Qwen/source/framework/qwen_agent.rst:2
#: aaed24d3edd64e6ab1f20188f3d5ba24
msgid "Qwen-Agent"
msgstr "Qwen-Agent"
#: ../../Qwen/source/framework/qwen_agent.rst:5
#: 1cbbb8d342f243c58e0d66a3e44daac8
msgid "To be updated for Qwen3."
msgstr "仍需为Qwen3更新。"
#: ../../Qwen/source/framework/qwen_agent.rst:7
#: 3e1dbee121bc4a6c91a26618e27c0d86
msgid "`Qwen-Agent <https://github.com/QwenLM/Qwen-Agent>`__ is a framework for developing LLM applications based on the instruction following, tool usage, planning, and memory capabilities of Qwen. It also comes with example applications such as Browser Assistant, Code Interpreter, and Custom Assistant."
msgid "Qwen-Agent provides atomic components such as LLMs and prompts, as well as high-level components such as Agents. The example below uses the Assistant component as an illustration, demonstrating how to add custom tools and quickly develop an agent that uses tools."
msgid "The framework also provides more atomic components for developers to combine. For additional showcases, please refer to `examples <https://github.com/QwenLM/Qwen-Agent/tree/main/examples>`__."
msgid "Qwen (Chinese: 通义千问; pinyin: _Tongyi Qianwen_) is the large language model and large multimodal model series of the Qwen Team, Alibaba Group. Qwen is capable of natural language understanding, text generation, vision understanding, audio understanding, tool use, role play, playing as AI agent, etc. Both language models and multimodal models are pre-trained on large-scale multilingual and multimodal data and post-trained on quality data for aligning to human preferences."
msgid "You can learn more about them at Alibaba Cloud Model Studio ([China Site](https://help.aliyun.com/zh/model-studio/getting-started/models#9f8890ce29g5u) \\[zh\\], [International Site](https://www.alibabacloud.com/en/product/modelstudio))."
msgid "Causal language models, also known as autoregressive language models or decoder-only language models, are a type of machine learning model designed to predict the next token in a sequence based on the preceding tokens. In other words, they generate text one token at a time, using the previously generated tokens as context. The \"causal\" aspect refers to the fact that the model only considers the past context (the already generated tokens) when predicting the next token, not any future tokens."
msgstr "因果语言模型 (causal Language Models),也被称为自回归语言模型 (autoregressive language models) 或仅解码器语言模型 (decoder-only language models) ,是一种机器学习模型,旨在根据序列中的前导 token 预测下一个 token 。换句话说,它使用之前生成的 token 作为上下文,一次生成一个 token 的文本。\"因果\"方面指的是模型在预测下一个 token 时只考虑过去的上下文(即已生成的 token ),而不考虑任何未来的 token 。"
msgid "Causal language models are widely used for various natural language processing tasks involving text completion and generation. They have been particularly successful in generating coherent and contextually relevant text, making them a cornerstone of modern natural language understanding and generation systems."
msgid "Sequence-to-sequence models use both an encoder to capture the entire input sequence and a decoder to generate an output sequence. They are widely used for tasks like machine translation, text summarization, etc."
msgid "Bidirectional models can access both past and future context in a sequence during training. They cannot generate sequential outputs in real-time due to the need for future context. They are widely used as embedding models and subsequently used for text classification."
msgid "Causal language models operate unidirectionally in a strictly forward direction, predicting each subsequent word based only on the previous words in the sequence. This unidirectional nature ensures that the model's predictions do not rely on future context, making them suitable for tasks like text completion and generation."
msgid "Base language models are foundational models trained on extensive corpora of text to predict the next word in a sequence. Their main goal is to capture the statistical patterns and structures of language, enabling them to generate coherent and contextually relevant text. These models are versatile and can be adapted to various natural language processing tasks through fine-tuning. While adept at producing fluent text, they may require in-context learning or additional training to follow specific instructions or perform complex reasoning tasks effectively. For Qwen models, the base models are those without \"-Instruct\" indicators, such as Qwen2.5-7B and Qwen2.5-72B."
msgid "Instruction-tuned language models are specialized models designed to understand and execute specific instructions in conversational styles. These models are fine-tuned to interpret user commands accurately and can perform tasks such as summarization, translation, and question answering with improved accuracy and consistency. Unlike base models, which are trained on large corpora of text, instruction-tuned models undergo additional training using datasets that contain examples of instructions and their desired outcomes, often in multiple turns. This kind of training makes them ideal for applications requiring targeted functionalities while maintaining the ability to generate fluent and coherent text. For Qwen models, the instruction-tuned models are those with the \"-Instruct\" suffix, such as Qwen2.5-7B-Instruct and Qwen2.5-72B-Instruct. [^instruct-chat]"
msgid "Tokens represent the fundamental units that models process and generate. They can represent texts in human languages (regular tokens) or represent specific functionality like keywords in programming languages (control tokens [^special]). Typically, a tokenizer is used to split text into regular tokens, which can be words, subwords, or characters depending on the specific tokenization scheme employed, and furnish the token sequence with control tokens as needed. The vocabulary size, or the total number of unique tokens a model recognizes, significantly impacts its performance and versatility. Larger language models often use sophisticated tokenization methods to handle the vast diversity of human language while keeping the vocabulary size manageable. Qwen use a relatively large vocabulary of 151,646 tokens in total."
msgid "Qwen adopts a subword tokenization method called Byte Pair Encoding (BPE), which attempts to learn the composition of tokens that can represent the text with the fewest tokens. For example, the string \" tokenization\" is decomposed as \" token\" and \"ization\" (note that the space is part of the token). Especially, the tokenization of Qwen ensures that there is no unknown words and all texts can be transformed to token sequences."
msgid "There are 151,643 tokens as a result of BPE in the vocabulary of Qwen, which is a large vocabulary efficient for diverse languages. As a rule of thumb, 1 token is 3~4 characters for English texts and 1.5~1.8 characters for Chinese texts."
msgid "Qwen uses byte-level BPE (BBPE) on UTF-8 encoded texts. It starts by treating each byte as a token and then iteratively merges the most frequent pairs of tokens occurring the texts into larger tokens until the desired vocabulary size is met."
msgid "In byte-level BPE, minimum 256 tokens are needed to tokenize every piece of text and avoid the out of vocabulary (OOV) problem. In comparison, character-level BPE needs every Unicode character in its vocabulary to avoid OOV and the Unicode Standard contains 154,998 characters as of Unicode Version 16.0."
msgid "One limitation to keep in mind for byte-level BPE is that the individual tokens in the vocabulary may not be seemingly semantically meaningful or even valid UTF-8 byte sequences, and in certain aspects, they should be viewed as a text compression scheme."
msgid "Control tokens are special tokens inserted into the sequence that signifies meta information. For example, in pre-training, multiple documents may be packed into a single sequence. For Qwen, the control token \"<|endoftext|>\" is inserted after each document to signify that the document has ended and a new document will proceed."
msgid "Chat templates provide a structured format for conversational interactions, where predefined placeholders or prompts are used to elicit responses from the model that adhere to a desired dialogue flow or context. Different models may use different kinds of chat template to format the conversations. It is crucial to use the designated one to ensure the precise control over the LLM's generation process."
msgid "The user input take the role of `user` and the model generation takes the role of `assistant`. Qwen also supports the meta message that instruct the model to perform specific actions or generate text with certain characteristics, such as altering tone, style, or content, which takes the role of `system` and the content defaults to \"You are Qwen, created by Alibaba Cloud. You are a helpful assistant.\""
msgstr "用户输入扮演 `user` 的 role ,而模型生成则承担 `assistant` 的 role 。 Qwen 还支持元消息,该消息指导模型执行特定操作或生成具有特定特性的文本,例如改变语气、风格或内容,这将承担 `system` 的 role,且内容默认为 \"You are Qwen, created by Alibaba Cloud. You are a helpful assistant.\" 。"
msgid "Starting from Qwen2.5, the Qwen model family including multimodal and specialized models will use a unified vocabulary, which contains control tokens from all subfamilies. There are 22 control tokens in the vocabulary of Qwen2.5, making the vocabulary size totaling 151,665:"
msgid "As Qwen models are causal language models, in theory there is only one length limit of the entire sequence. However, since there is often packing in training and each sequence may contain multiple individual pieces of texts. **How long the model can generate or complete ultimately depends on the use case and in that case how long each document (for pre-training) or each turn (for post-training) is in training.**"
msgid "For Qwen2.5, the packed sequence length in training is 32,768 tokens.[^yarn] The maximum document length in pre-training is this length. The maximum message length for user and assistant is different in post-training. In general, the assistant message could be up to 8192 tokens."
msgid "Previously, they are known as the chat models and with the \"-Chat\" suffix. Starting from Qwen2, the name is changed to follow the common practice. For Qwen, \"-Instruct\" and \"-Chat\" should be regarded as synonymous."
msgid "Control tokens can be called special tokens. However, the meaning of special tokens need to be interpreted based on the contexts: special tokens may contain extra regular tokens."
msgid "For historical reference only, ChatML is first described by the OpenAI Python SDK. The last available version is [this](https://github.com/openai/openai-python/blob/v0.28.1/chatml.md). Please also be aware that that document lists use cases intended for OpenAI models. For Qwen2.5 models, please only use as in our guide."
msgid "The sequence length can be extended to 131,072 tokens for Qwen2.5-7B, Qwen2.5-14B, Qwen2.5-32B, and Qwen2.5-72B models with YaRN. Please refer to the model card on how to enable YaRN in vLLM."
#~ msgid "There is the proprietary version hosted exclusively at [Alibaba Cloud \\[zh\\]](https://help.aliyun.com/zh/model-studio/developer-reference/tongyi-qianwen-llm/) and the open-weight version."
msgid "This guide helps you quickly start using Qwen3. We provide examples of [Hugging Face Transformers](https://github.com/huggingface/transformers) as well as [ModelScope](https://github.com/modelscope/modelscope), and [vLLM](https://github.com/vllm-project/vllm) for deployment."
msgid "You can find Qwen3 models in [the Qwen3 collection](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f) at HuggingFace Hub and [the Qwen3 collection](https://www.modelscope.cn/collections/Qwen3-9743180bdc6b48) at ModelScope."
msgid "To get a quick start with Qwen3, you can try the inference with `transformers` first. Make sure that you have installed `transformers>=4.51.0`. We advise you to use Python 3.10 or higher, and PyTorch 2.6 or higher."
msgid "Qwen3 will think before respond, similar to QwQ models. This means the model will use its reasoning abilities to enhance the quality of generated responses. The model will first generate thinking content wrapped in a `<think>...</think>` block, followed by the final response."
msgid "Hard Switch: To strictly disable the model's thinking behavior, aligning its functionality with the previous Qwen2.5-Instruct models, you can set `enable_thinking=False` when formatting the text."
msgid "Soft Switch: Qwen3 also understands the user's instruction on its thinking behaviour, in particular, the soft switch `/think` and `/no_think`. You can add them to user prompts or system messages to switch the model's thinking mode from turn to turn. The model will follow the most recent instruction in multi-turn conversations."
msgid "For thinking mode, use Temperature=0.6, TopP=0.95, TopK=20, and MinP=0 (the default setting in `generation_config.json`). DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions. For more detailed guidance, please refer to the Best Practices section."
msgid "To tackle with downloading issues, we advise you to try [ModelScope](https://github.com/modelscope/modelscope). Before starting, you need to install `modelscope` with `pip`."
msgid "`modelscope` adopts a programmatic interface similar (but not identical) to `transformers`. For basic usage, you can simply change the first line of code above to the following:"
msgid "To deploy Qwen3, we advise you to use vLLM. vLLM is a fast and easy-to-use framework for LLM inference and serving. In the following, we demonstrate how to build a OpenAI-API compatible API service with vLLM."
msgid "Then, you can use the [create chat interface](https://platform.openai.com/docs/api-reference/chat/completions/create) to communicate with Qwen:"
msgid "While the soft switch is always available, the hard switch is also availabe in vLLM through the following configuration to the API call. To disable thinking, use"
msgstr "虽然软开关始终可用,但硬开关也可以通过以下 API 调用配置在 vLLM 中使用。要禁用思考,请使用"
msgid "This section reports the speed performance of bf16 models, quantized models (including GPTQ-Int4, GPTQ-Int8 and AWQ) of the Qwen2.5 series. Specifically, we report the inference speed (tokens/s) as well as memory footprint (GB) under the conditions of different context lengths."
msgid "For vLLM, the memory usage is not reported because it pre-allocates all GPU memory. We use ``gpu_memory_utilization=0.9 max_model_len=32768 enforce_eager=False`` by default."
msgid "[new sample config]: for vLLM, set the following sampling parameters: SamplingParams(temperature=0.7,top_p=0.8,top_k=20,repetition_penalty=1,presence_penalty=0,frequency_penalty=0,max_tokens=out_length)"
msgid "Qwen is the large language model and large multimodal model series of the Qwen Team, Alibaba Group. Both language models and multimodal models are pretrained on large-scale multilingual and multimodal data and post-trained on quality data for aligning to human preferences. Qwen is capable of natural language understanding, text generation, vision understanding, audio understanding, tool use, role play, playing as AI agent, etc."
msgid "**Seamless switching between thinking mode** (for complex logical reasoning, math, and coding) and **non-thinking mode** (for efficient, general-purpose chat) **within a single model**, ensuring optimal performance across various scenarios."
msgid "**Significantly enhancement in reasoning capabilities**, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning."
msgid "**Superior human preference alignment**, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience."
msgid "**Expertise in agent capabilities**, enabling precise integration with external tools in both thinking and unthinking modes and achieving leading performance among open-source models in complex agent-based tasks."
msgid "Join our community by joining our `Discord <https://discord.gg/yPEP2vHTu4>`__ and `WeChat <https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png>`__ group. We are looking forward to seeing you there!"
#~ msgid "Dense, easy-to-use, decoder-only language models, available in **0.5B**, **1.5B**, **3B**, **7B**, **14B**, **32B**, and **72B** sizes, and base and instruct variants."
#~ msgid "Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g, tables), and generating structured outputs especially JSON."
#~ msgid "Multilingual support for over **29** languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more."
# This file is distributed under the same license as the Qwen package.
#
msgid ""
msgstr ""
"Project-Id-Version: Qwen \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-04-28 19:42+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.17.0\n"
#: ../../Qwen/source/inference/transformers.md:1
#: 0614c94c5d284106b6157f7b89fa087f
msgid "Transformers"
msgstr ""
#: ../../Qwen/source/inference/transformers.md:3
#: d3760c125a4049b9848d4c98d60104f8
msgid "Transformers is a library of pretrained natural language processing for inference and training. Developers can use Transformers to train models on their data, build inference applications, and generate texts with large language models."
msgid "In general, the pipeline interface requires less boilerplate code, which is shown here. The following shows a basic example using pipeline for mult-iturn conversations:"
msgid "To download model files to a local directory, you could use"
msgstr "要将模型文件下载到本地目录,可以使用"
#: ../../Qwen/source/inference/transformers.md:51
#: ee52b7ed97df4d3d8eee53879369a8a0
msgid "You can also download model files using ModelScope if you are in mainland China"
msgstr "如果您在中国大陆,还可以使用 ModelScope 下载模型文件"
#: ../../Qwen/source/inference/transformers.md:55
#: 0fe0e5f3d5d044f584d89008a0f89b0e
msgid "**Device Placement**: `device_map=\"auto\"` will load the model parameters to multiple devices automatically, if available. It relies on the `accelerate` pacakge. If you would like to use a single device, you can pass `device` instead of device_map. `device=-1` or `device=\"cpu\"` indicates using CPU, `device=\"cuda\"` indicates using the current GPU, and `device=\"cuda:1\"` or `device=1` indicates using the second GPU. Do not use `device_map` and `device` at the same time!"
msgid "**Compute Precision**: `torch_dtype=\"auto\"` will determine automatically the data type to use based on the original precision of the checkpoint and the precision your device supports. For modern devices, the precision determined will be `bfloat16`."
msgid "Calls to the text generation pipleine will use the generation configuration from the model file, e.g., `generation_config.json`. Those configuration could be overridden by passing arguments directly to the call. The default is equivalent to"
msgid "For the best practices in configuring generation parameters, please see the model card."
msgstr "有关配置生成参数的最佳实践,请参阅模型卡片。"
#: ../../Qwen/source/inference/transformers.md:75
#: c69916ddab134eadbb509748b73bb515
msgid "Thinking & Non-Thinking Mode"
msgstr "思考与非思考模式"
#: ../../Qwen/source/inference/transformers.md:77
#: e23485edb6654b588965a22a20332dce
msgid "By default, Qwen3 model will think before response. It is also true for the `pipeline()` interface. To switch between thinking and non-thinking mode, two methods can be used"
msgid "Append a final assistant message, containing only `<think>\\n\\n</think>\\n\\n`. This method is stateless, meaning it will only work for that single turn. It will also strictly prevented the model from generating thinking content. For example,"
msgid "Add to the user (or the system) message, `/no_think` to disable thinking and `/think` to enable thinking. This method is stateful, meaning the model will follow the most recent instruction in multi-turn conversations. You can also use instructions in natural language."
msgid "If you would like a more structured assistant message format, you can use the following function to extract the thinking content into a field named `reasoning_content` which is similar to the format used by vLLM, SGLang, etc."
msgid "Qwen3 comes with two types of pre-quantized models, FP8 and AWQ. The command serving those models are the same as the original models except for the name change:"
msgid "As of 4.51.0, there are issues with Tranformers when running those checkpoints **across GPUs**. The following method could be used to work around those issues:"
msgid "Uncomment [this line](https://github.com/huggingface/transformers/blob/0720e206c6ba28887e4d60ef60a6a089f6c1cc76/src/transformers/integrations/finegrained_fp8.py#L340) in your local installation of `transformers`."
msgid "The maximum context length in pre-training for Qwen3 models is 32,768 tokens. It can be extended to 131,072 tokens with RoPE scaling techniques. We have validated the performance with YaRN."
msgid "Transformers supports YaRN, which can be enabled either by modifying the model files or overriding the default arguments when loading the model."
msgid "Transformers implements static YaRN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts.** We advise adding the `rope_scaling` configuration only when processing long contexts is required. It is also recommended to modify the `factor` as needed. For example, if the typical context length for your application is 65,536 tokens, it would be better to set `factor` as 2.0."
msgid "With the help of `TextStreamer`, you can modify your chatting with Qwen3 to streaming mode. It will print the response as being generated to the console or the terminal."
msgid "Besides using `TextStreamer`, we can also use `TextIteratorStreamer` which stores print-ready text in a queue, to be used by a downstream application as an iterator:"
msgid "You may find distributed inference with Transformers is not as fast as you would imagine. Transformers with `device_map=\"auto\"` does not apply tensor parallelism and it only uses one GPU at a time. For Transformers with tensor parallelism, please refer to [its documentation](https://huggingface.co/docs/transformers/v4.51.3/en/perf_infer_gpu_multi)."
#~ msgid "The most significant but also the simplest usage of Qwen2.5 is to chat with it using the `transformers` library. In this document, we show how to chat with `Qwen2.5-7B-Instruct`, in either streaming mode or not."
#~ msgid "You can just write several lines of code with `transformers` to chat with Qwen2.5-Instruct. Essentially, we build the tokenizer and the model with `from_pretrained` method, and we use `generate` method to perform chatting with the help of chat template provided by the tokenizer. Below is an example of how to chat with Qwen2.5-7B-Instruct:"
#~ msgid "To continue the chat, simply append the response to the messages with the role assistant and repeat the procedure. The following shows and example:"
#~ msgstr "如要继续对话,只需将回复内容以 assistant 为 role 加入 messages ,然后重复以上流程即可。下面为示例:"
#~ msgid "Note that the previous method in the original Qwen repo `chat()` is now replaced by `generate()`. The `apply_chat_template()` function is used to convert the messages into a format that the model can understand. The `add_generation_prompt` argument is used to add a generation prompt, which refers to `<|im_start|>assistant\\n` to the input. Notably, we apply ChatML template for chat models following our previous practice. The `max_new_tokens` argument is used to set the maximum length of the response. The `tokenizer.batch_decode()` function is used to decode the response. In terms of the input, the above `messages` is an example to show how to format your dialog history and system prompt. By default, if you do not specify system prompt, we directly use `You are Qwen, created by Alibaba Cloud. You are a helpful assistant.`."
#~ msgstr "请注意,原 Qwen 仓库中的旧方法 `chat()` 现在已被 `generate()` 方法替代。这里使用了 `apply_chat_template()` 函数将消息转换为模型能够理解的格式。其中的 `add_generation_prompt` 参数用于在输入中添加生成提示,该提示指向 `<|im_start|>assistant\\n` 。尤其需要注意的是,我们遵循先前实践,对 chat 模型应用 ChatML 模板。而 `max_new_tokens` 参数则用于设置响应的最大长度。此外,通过 `tokenizer.batch_decode()` 函数对响应进行解码。关于输入部分,上述的 `messages` 是一个示例,展示了如何格式化对话历史记录和系统提示。默认情况下,如果您没有指定系统提示,我们将直接使用 `You are Qwen, created by Alibaba Cloud. You are a helpful assistant.` 作为系统提示。"
#~ msgid "`transformers` provides a functionality called \"pipeline\" that encapsulates the many operations in common tasks. You can chat with the model in just 4 lines of code:"
#~ msgid "To continue the chat, simply append the response to the messages with the role assistant and repeat the procedure. The following shows and example:"
#~ msgstr "如要继续对话,只需将回复内容以 assistant 为 role 加入 messages ,然后重复以上流程即可。下面为示例:"
#~ msgid "Batching"
#~ msgstr "批处理"
#~ msgid "All common `transformers` methods support batched input and output. For basic usage, the following is an example:"
#~ msgstr "`transformers` 常用方法均支持批处理。以下为基本用法的示例:"
#~ msgid "With pipeline, it is simpler:"
#~ msgstr "使用流水线功能,实现批处理代码更简单:"
#~ msgid "Using Flash Attention 2 to Accelerate Generation"
#~ msgstr "使用 Flash Attention 2 加速生成"
#~ msgid "With the latest `transformers` and `torch`, Flash Attention 2 will be applied by default if applicable.[^fa2] You do not need to request the use of Flash Attention 2 in `transformers` or install the `flash_attn` package. The following is intended for users that cannot use the latest versions for various reasons."
#~ msgid "If you would like to apply Flash Attention 2, you need to install an appropriate version of `flash_attn`. You can find pre-built wheels at [its GitHub repository](https://github.com/Dao-AILab/flash-attention/releases), and you should make sure the Python version, the torch version, and the CUDA version of torch are a match. Otherwise, you need to install from source. Please follow the guides at [its GitHub README](https://github.com/Dao-AILab/flash-attention)."
#~ msgid "Normally, memory usage after loading the model can be roughly taken as twice the parameter count. For example, a 7B model will take 14GB memory to load. It is because for large language models, the compute dtype is often 16-bit floating point number. Of course, you will need more memory in inference to store the activations."
#~ msgid "For `transformers`, `torch_dtype=\"auto\"` is recommended and the model will be loaded in `bfloat16` automatically. Otherwise, the model will be loaded in `float32` and it will need double memory. You can also pass `torch.bfloat16` or `torch.float16` as `torch_dtype` explicitly."
#~ msgid "`transformers` relies on `accelerate` for multi-GPU inference and the implementation is a kind of naive model parallelism: different GPUs computes different layers of the model. It is enabled by the use of `device_map=\"auto\"` or a customized `device_map` for multiple GPUs."
#~ msgid "However, this kind of implementation is not efficient as for a single request, only one GPU computes at the same time and the other GPUs just wait. To use all the GPUs, you need to arrange multiple sequences as on a pipeline, making sure each GPU has some work to do. However, that will require concurrency management and load balancing, which is out of the scope of `transformers`. Even if all things are implemented, you can make use of concurrency to improve the total throughput but the latency for each request is not great."
#~ msgid "`RuntimeError: CUDA error: device-side assert triggered`, `Assertion -sizes[i] <= index && index < sizes[i] && \"index out of bounds\" failed.`"
#~ msgstr "`RuntimeError: CUDA error: device-side assert triggered`, `Assertion -sizes[i] <= index && index < sizes[i] && \"index out of bounds\" failed.`"
#~ msgid "If it works with single GPU but not multiple GPUs, especially if there are PCI-E switches in your system, it could be related to drivers."
#~ msgid "For data center GPUs (e.g., A800, H800, and L40s), please use the data center GPU drivers and upgrade to the latest subrelease, e.g., 535.104.05 to 535.183.01. You can check the release note at <https://docs.nvidia.com/datacenter/tesla/index.html>, where the issues fixed and known issues are presented."
#~ msgid "For consumer GPUs (e.g., RTX 3090 and RTX 4090), their GPU drivers are released more frequently and focus more on gaming optimization. There are online reports that 545.29.02 breaks `vllm` and `torch` but 545.29.06 works. Their release notes are also less helpful in identifying the real issues. However, in general, the advice is still upgrading the GPU driver."
#~ msgid "Try disabling P2P for process hang, but it has negative effect on speed."
#~ msgstr "尝试禁用 P2P 以解决进程挂起的问题,但这会对速度产生负面影响。"
#~ msgid "Next Step"
#~ msgstr "下一步"
#~ msgid "Now you can chat with Qwen2.5 in either streaming mode or not. Continue to read the documentation and try to figure out more advanced usages of model inference!"
#~ msgid "The attention module for a model in `transformers` typically has three variants: `sdpa`, `flash_attention_2`, and `eager`. The first two are wrappers around related functions in the `torch` and the `flash_attn` packages. It defaults to `sdpa` if available."
#~ msgid "In addition, `torch` has integrated three implementations for `sdpa`: `FLASH_ATTENTION` (indicating Flash Attention 2 since version 2.2), `EFFICIENT_ATTENTION` (Memory Efficient Attention), and `MATH`. It attempts to automatically select the most optimal implementation based on the inputs. You don't need to install extra packages to use them."
#~ msgid "If you wish to explicitly select the implementations in `torch`, refer to [this tutorial](https://pytorch.org/tutorials/intermediate/scaled_dot_product_attention_tutorial.html)."
msgid "For quantized models, one of our recommendations is the usage of [AWQ](https://arxiv.org/abs/2306.00978) with [AutoAWQ](https://github.com/casper-hansen/AutoAWQ)."
msgid "**AutoAWQ** is an easy-to-use Python library for 4-bit quantized models. AutoAWQ speeds up models by 3x and reduces memory requirements by 3x compared to FP16. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs."
msgid "Now, `transformers` has officially supported AutoAWQ, which means that you can directly use the quantized model with `transformers`. The following is a very simple code snippet showing how to run `Qwen2.5-7B-Instruct-AWQ` with the quantized model:"
msgid "vLLM has supported AWQ, which means that you can directly use our provided AWQ models or those quantized with `AutoAWQ` with vLLM. We recommend using the latest version of vLLM (`vllm>=0.6.1`) which brings performance improvements to AWQ models; otherwise, the performance might not be well-optimized."
msgid "Actually, the usage is the same with the basic usage of vLLM. We provide a simple example of how to launch OpenAI-API compatible API with vLLM and `Qwen2.5-7B-Instruct-AWQ`:"
msgid "Suppose you have finetuned a model based on `Qwen2.5-7B`, which is named `Qwen2.5-7B-finetuned`, with your own dataset, e.g., Alpaca. To build your own AWQ quantized model, you need to use the training data for calibration. Below, we provide a simple demonstration for you to run:"
msgid "Then you need to prepare your data for calibration. What you need to do is just put samples into a list, each of which is a text. As we directly use our finetuning data for calibration, we first format it with ChatML template. For example,"
msgid "[GPTQ](https://arxiv.org/abs/2210.17323) is a quantization method for GPT-like LLMs, which uses one-shot weight quantization based on approximate second-order information. In this document, we show you how to use the quantized model with Hugging Face `transformers` and also how to quantize your own model with [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ)."
msgid "To use the official Qwen2.5 GPTQ models with `transformers`, please ensure that `optimum>=1.20.0` and compatible versions of `transformers` and `auto_gptq` are installed."
msgid "Now, `transformers` has officially supported AutoGPTQ, which means that you can directly use the quantized model with `transformers`. For each size of Qwen2.5, we provide both Int4 and Int8 GPTQ quantized models. The following is a very simple code snippet showing how to run `Qwen2.5-7B-Instruct-GPTQ-Int4`:"
msgid "vLLM has supported GPTQ, which means that you can directly use our provided GPTQ models or those trained with `AutoGPTQ` with vLLM. If possible, it will automatically use the GPTQ Marlin kernel, which is more efficient."
msgid "Actually, the usage is the same with the basic usage of vLLM. We provide a simple example of how to launch OpenAI-API compatible API with vLLM and `Qwen2.5-7B-Instruct-GPTQ-Int4`:"
msgid "If you want to quantize your own model to GPTQ quantized models, we advise you to use AutoGPTQ. It is suggested installing the latest version of the package by installing from source code:"
msgid "Suppose you have finetuned a model based on `Qwen2.5-7B`, which is named `Qwen2.5-7B-finetuned`, with your own dataset, e.g., Alpaca. To build your own GPTQ quantized model, you need to use the training data for calibration. Below, we provide a simple demonstration for you to run:"
msgid "Then you need to prepare your data for calibration. What you need to do is just put samples into a list, each of which is a text. As we directly use our finetuning data for calibration, we first format it with ChatML template. For example,"
msgid "It is unfortunate that the `save_quantized` method does not support sharding. For sharding, you need to load the model and use `save_pretrained` from transformers to save and shard the model. Except for this, everything is so simple. Enjoy!"
msgid "Generation cannot stop properly. Continual generation after where it should stop, then repeated texts, either single character, a phrase, or paragraphs, are generated."
msgid "`auto_gptq` fails to find a fused CUDA kernel compatible with your environment and falls back to a plain implementation. Follow its [installation guide](https://github.com/AutoGPTQ/AutoGPTQ/blob/main/docs/INSTALLATION.md) to install a pre-built wheel or try installing `auto_gptq` from source."
msgid "Self-quantized Qwen2.5-72B-Instruct-GPTQ with `vllm`, `ValueError: ... must be divisible by ...` is raised. The intermediate size of the self-quantized model is different from the official Qwen2.5-72B-Instruct-GPTQ models."
msgstr "`vllm` 使用自行量化的 Qwen2.5-72B-Instruct-GPTQ 时,会引发 `ValueError: ... must be divisible by ...` 错误。自量化的模型的 intermediate size 与官方的 Qwen2.5-72B-Instruct-GPTQ 模型不同。"
msgid "After quantization the size of the quantized weights are divided by the group size, which is typically 128. The intermediate size for the FFN blocks in Qwen2.5-72B is 29568. Unfortunately, {math}`29568 \\div 128 = 231`. Since the number of attention heads and the dimensions of the weights must be divisible by the tensor parallel size, it means you can only run the quantized model with `tensor_parallel_size=1`, i.e., one GPU card."
msgid "A workaround is to make the intermediate size divisible by {math}`128 \\times 8 = 1024`. To achieve that, the weights should be padded with zeros. While it is mathematically equivalent before and after zero-padding the weights, the results may be slightly different in reality."
msgid "This will save the padded checkpoint to the specified directory. Then, copy other files from the original checkpoint to the new directory and modify the `intermediate_size` in `config.json` to `29696`. Finally, you can quantize the saved model checkpoint."
# This file is distributed under the same license as the Qwen package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2024.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: Qwen \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-04-28 19:42+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.17.0\n"
#: ../../Qwen/source/quantization/llama.cpp.md:1
#: 2cde165afca34e508b163ca4d513c50c
msgid "llama.cpp"
msgstr ""
#: ../../Qwen/source/quantization/llama.cpp.md:3
#: c6369d5467e449719f2b30253bfdcb99
msgid "Quantization is a major topic for local inference of LLMs, as it reduces the memory footprint. Undoubtably, llama.cpp natively supports LLM quantization and of course, with flexibility as always."
msgid "At high-level, all quantization supported by llama.cpp is weight quantization: Model parameters are quantized into lower bits, and in inference, they are dequantized and used in computation."
msgid "In addition, you can mix different quantization data types in a single quantized model, e.g., you can quantize the embedding weights using a quantization data type and other weights using a different one. With an adequate mixture of quantization types, much lower quantization error can be attained with just a slight increase of bit-per-weight. The example program `llama-quantize` supports many quantization presets, such as Q4_K_M and Q8_0."
msgid "If you find the quantization errors still more than expected, you can bring your own scales, e.g., as computed by AWQ, or use calibration data to compute an importance matrix using `llama-imatrix`, which can then be used during quantization to enhance the quality of the quantized models."
msgid "In this document, we demonstrate the common way to quantize your model and evaluate the performance of the quantized model. We will assume you have the example programs from llama.cpp at your hand. If you don't, check our guide [here](../run_locally/llama.cpp.html#getting-the-program){.external}."
msgid "Sometimes, it may be better to use fp32 as the start point for quantization. In that case, use"
msgstr "有时,可能最好将fp32作为量化的起点。在这种情况下,使用"
#: ../../Qwen/source/quantization/llama.cpp.md:33
#: d54b89e59e214e1baeba025ecd971e30
msgid "Quantizing the GGUF without Calibration"
msgstr "无校准量化GGUF"
#: ../../Qwen/source/quantization/llama.cpp.md:35
#: a6d57166997a4a1bad8a28eb4cc5593c
msgid "For the simplest way, you can directly quantize the model to lower-bits based on your requirements. An example of quantizing the model to 8 bits is shown below:"
msgid "`Q8_0` is a code for a quantization preset. You can find all the presets in [the source code of `llama-quantize`](https://github.com/ggml-org/llama.cpp/blob/master/examples/quantize/quantize.cpp). Look for the variable `QUANT_OPTIONS`. Common ones used for 7B models include `Q8_0`, `Q5_0`, and `Q4_K_M`. The letter case doesn't matter, so `q8_0` or `q4_K_m` are perfectly fine."
msgid "However, the accuracy of the quantized model could be lower than expected occasionally, especially for lower-bit quantization. The program may even prevent you from doing that."
msgid "There are several ways to improve quality of quantized models. A common way is to use a calibration dataset in the target domain to identify the weights that really matter and quantize the model in a way that those weights have lower quantization errors, as introduced in the next two methods."
msgid "To improve the quality of your quantized models, one possible solution is to apply the AWQ scale, following [this script](https://github.com/casper-hansen/AutoAWQ/blob/main/docs/examples.md#gguf-export). First, when you run `model.quantize()` with `autoawq`, remember to add `export_compatible=True` as shown below:"
msgid "The above code will not actually quantize the weights. Instead, it adjusts weights based on a dataset so that they are \"easier\" to quantize.[^AWQ]"
msgid "Finally, you can quantize the model as in the last example:"
msgstr "最后,你可以像最后一个例子那样量化模型:"
#: ../../Qwen/source/quantization/llama.cpp.md:89
#: c92ab12879be4e1c98ef49dcdb66e3e0
msgid "In this way, it should be possible to achieve similar quality with lower bit-per-weight."
msgstr "这样,应该有可能以更低的bpw实现相似的质量。"
#: ../../Qwen/source/quantization/llama.cpp.md:95
#: 95d0914f02b44bacb160815e8f6400c3
msgid "Quantizing the GGUF with Importance Matrix"
msgstr "使用重要性矩阵量化GGUF"
#: ../../Qwen/source/quantization/llama.cpp.md:97
#: 35543f118a84404ca6e5c52e3c51b8f7
msgid "Another possible solution is to use the \"important matrix\"[^imatrix], following [this](https://github.com/ggml-org/llama.cpp/tree/master/examples/imatrix)."
msgid "The text is cut in chunks of length `--chunk` for computation. Preferably, the text should be representative of the target domain. The final results will be saved in a file named `qwen3-8b-imatrix.dat` (`-o`), which can then be used:"
msgid "For lower-bit quantization mixtures for 1-bit or 2-bit, if you do not provide `--imatrix`, a helpful warning will be printed by `llama-quantize`."
msgid "`llama.cpp` provides an example program for us to calculate the perplexity, which evaluate how unlikely the given text is to the model. It should be mostly used for comparisons: the lower the perplexity, the better the model remembers the given text."
msgid "Wait for some time and you will get the perplexity of the model. There are some numbers of different kinds of quantization mixture [here](https://github.com/ggml-org/llama.cpp/blob/master/examples/perplexity/README.md). It might be helpful to look at the difference and grab a sense of how that kind of quantization might perform."
msgid "In this guide, we demonstrate how to conduct quantization and evaluate the perplexity with llama.cpp. For more information, please visit the [llama.cpp GitHub repo](https://github.com/ggml-org/llama.cpp)."
msgid "We usually quantize the fp16 model to 4, 5, 6, and 8-bit models with different quantization mixtures, but sometimes a particular mixture just does not work, so we don't provide those in our HuggingFace Hub. However, others in the community may have success, so if you haven't found what you need in our repos, look around."
msgid "If you are interested in what this means, refer to [the AWQ paper](https://arxiv.org/abs/2306.00978). Basically, important weights (called salient weights in the paper) are identified based on activations across data examples. The weights are scaled accordingly such that the salient weights are protected even after quantization."
msgid "Here, the importance matrix keeps record of how weights affect the output: the weight should be important is a slight change in its value causes huge difference in the results, akin to the [GPTQ](https://arxiv.org/abs/2210.17323) algorithm."
msgid "It is not a good evaluation dataset for instruct models though, but it is very common and easily accessible. You probably want to use a dataset similar to your target domain."
msgid "Before starting, let's first discuss what is llama.cpp and what you should expect, and why we say \"use\" llama.cpp, with \"use\" in quotes. llama.cpp is essentially a different ecosystem with a different design philosophy that targets light-weight footprint, minimal external dependency, multi-platform, and extensive, flexible hardware support:"
msgid "Python is an interpreted language: The code you write is executed line-by-line on-the-fly by an interpreter. You can run the example code snippet or script with an interpreter or a natively interactive interpreter shell. In addition, Python is learner friendly, and even if you don't know much before, you can tweak the source code here and there."
msgid "C++ is a compiled language: The source code you write needs to be compiled beforehand, and it is translated to machine code and an executable program by a compiler. The overhead from the language side is minimal. You do have source code for example programs showcasing how to use the library. But it is not very easy to modify the source code if you are not verse in C++ or C."
msgstr "C++ 是一种编译型语言:你编写的源代码需要预先编译,由编译器将其转换为机器码和可执行程序,来自语言层面的开销微乎其微。llama.cpp也提供了示例程序的源代码,展示了如何使用该库。但是,如果你不精通 C++ 或 C 语言,修改源代码并不容易。"
#: ../../Qwen/source/run_locally/llama.cpp.md:29
#: 8d2bc05e1031475f9d97d5dddc1a31c7
msgid "To use llama.cpp means that you use the llama.cpp library in your own program, like writing the source code of [Ollama](https://ollama.com/), [LM Studio](https://lmstudio.ai/), [GPT4ALL](https://www.nomic.ai/gpt4all), [llamafile](https://llamafile.ai/) etc. But that's not what this guide is intended or could do. Instead, here we introduce how to use the `llama-cli` example program, in the hope that you know that llama.cpp does support Qwen2.5 models and how the ecosystem of llama.cpp generally works."
msgid "In this guide, we will show how to \"use\" [llama.cpp](https://github.com/ggml-org/llama.cpp) to run models on your local machine, in particular, the `llama-cli` and the `llama-server` example program, which comes with the library."
msgid "You can get the programs in various ways. For optimal efficiency, we recommend compiling the programs locally, so you get the CPU optimizations for free. However, if you don't have C++ compilers locally, you can also install using package managers or downloading pre-built binaries. They could be less efficient but for non-production example use, they are fine."
msgid "Here, we show the basic command to compile `llama-cli` locally on **macOS** or **Linux**. For Windows or GPU users, please refer to [the guide from llama.cpp](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md)."
msgid "To build locally, a C++ compiler and a build system tool are required. To see if they have been installed already, type `cc --version` or `cmake --version` in a terminal window."
msgid "If installed, the build configuration of the tool will be printed to the terminal, and you are good to go!"
msgstr "如果已安装,工具的构建配置信息将被打印到终端,那么你就可以开始了!"
#: ../../Qwen/source/run_locally/llama.cpp.md:66
#: 58e9dc44f8df4a25a647b68d54c934f4
msgid "If errors are raised, you need to first install the related tools:"
msgstr "如果出现错误,说明你需要先安装相关工具:"
#: ../../Qwen/source/run_locally/llama.cpp.md:67
#: 9c92dddd2e3d44c78c448767851a75b2
msgid "On macOS, install with the command `xcode-select --install`"
msgstr "在macOS上,使用命令`xcode-select --install`来安装。"
#: ../../Qwen/source/run_locally/llama.cpp.md:68
#: 9e5163c275244ec4b1ac423afd7a1446
msgid "On Ubuntu, install with the command `sudo apt install build-essential`. For other Linux distributions, the command may vary; the essential packages needed for this guide are `gcc` and `cmake`."
msgid "For the first step, clone the repo and enter the directory:"
msgstr "第一步是克隆仓库并进入该目录:"
#: ../../Qwen/source/run_locally/llama.cpp.md:81
#: 26e2257b7d264ff098b9e9aac386e8bf
msgid "Then, build llama.cpp using CMake:"
msgstr "随后,使用 CMake 执行 llama.cpp 构建:"
#: ../../Qwen/source/run_locally/llama.cpp.md:87
#: f3dd6fd3218844cf87e77bd8d131bd9a
msgid "The first command will check the local environment and determine which backends and features should be included. The second command will actually build the programs."
msgid "Here, we show how to install `llama-cli` and `llama-server` with Homebrew. For other package managers, please check the instructions [here](https://github.com/ggml-org/llama.cpp/blob/master/docs/install.md)."
msgid "Ensure that Homebrew is available on your operating system. If you don't have Homebrew, you can install it as in [its website](https://brew.sh/)."
msgid "Note that the installed binaries might not be built with the optimal compile options for your hardware, which can lead to poor performance. They also don't support GPU on Linux systems."
msgstr "请注意,安装的二进制文件可能并未针对您的硬件优化编译选项,这可能导致性能不佳。此外,在 Linux 系统上它们也不支持 GPU。"
msgid "You can also download pre-built binaries from [GitHub Releases](https://github.com/ggml-org/llama.cpp/releases). Please note that those pre-built binaries files are architecture-, backend-, and os-specific. If you are not sure what those mean, you probably don't want to use them and running with incompatible versions will most likely fail or lead to poor performance."
msgid "`<version>`: the version of llama.cpp. The latest is preferred, but as llama.cpp is updated and released frequently, the latest may contain bugs. If the latest version does not work, try the previous release until it works."
msgid "`<arch>`: the system architecture. `x64` for `x86_64`, e.g., most Intel and AMD systems, including Intel Mac; `arm64` for `arm64`, e.g., Apple Silicon or Snapdragon-based systems."
msgid "Running on GPU: We suggest try the `cu<cuda_verison>` one for NVIDIA GPUs, `kompute` for AMD GPUs, and `sycl` for Intel GPUs first. Ensure that you have related drivers installed."
msgid "`cu<cuda_verison>`: NVIDIA GPUs, CUDA runtime is not included. You can download the `cudart-llama-bin-win-cu<cuda_version>-x64.zip` and unzip it to the same directory if you don't have the corresponding CUDA toolkit installed."
msgid "macOS: `llama-<version>-bin-macos-x64.zip` for Intel Mac with no GPU support; `llama-<version>-bin-macos-arm64.zip` for Apple Silicon with GPU support."
msgid "After downloading the `.zip` file, unzip them into a directory and open a terminal at that directory."
msgstr "下载`.zip`文件后,将其解压到一个目录中,并在该目录下打开终端。"
#: ../../Qwen/source/run_locally/llama.cpp.md:156
#: fb60a22681e6451b9cbf8b0d581f75b5
msgid "Getting the GGUF"
msgstr "获取 GGUF"
#: ../../Qwen/source/run_locally/llama.cpp.md:158
#: 21b68c30d5c349d3b19ba3aa68be1ee0
msgid "GGUF[^GGUF] is a file format for storing information needed to run a model, including but not limited to model weights, model hyperparameters, default generation configuration, and tokenizer."
msgid "We provide a series of GGUF models in our HuggingFace organization, and to search for what you need you can search the repo names with `-GGUF`."
msgid "This will download the Qwen3-8B model in GGUF format quantized with the scheme Q4_K_M."
msgstr "这将下载采用 Q4_K_M 方案量化的 GGUF 格式的 Qwen3-8B model 模型。"
#: ../../Qwen/source/run_locally/llama.cpp.md:178
#: bd5c9e3ec9994ecba81c83e1f9077427
msgid "Preparing Your Own GGUF"
msgstr "准备您自己的 GGUF"
#: ../../Qwen/source/run_locally/llama.cpp.md:180
#: 6e7a23375e09420fb100575c510ad291
msgid "Model files from HuggingFace Hub can be converted to GGUF, using the `convert-hf-to-gguf.py` Python script. It does require you to have a working Python environment with at least `transformers` installed."
msgid "The first argument to the script refers to the path to the HF model directory or the HF model name, and the second argument refers to the path of your output GGUF file. Remember to create the output directory before you run the command."
msgid "The fp16 model could be a bit heavy for running locally, and you can quantize the model as needed. We introduce the method of creating and quantizing GGUF files in [this guide](../quantization/llama.cpp). You can refer to that document for more information."
msgid "Regarding switching between thinking and non-thinking modes, while the soft switch is always available, the hard switch implemented in the chat template is not exposed in llama.cpp. The quick workaround is to pass a custom chat template equivalennt to always `enable_thinking=False` via `--chat-template-file`."
msgid "[llama-cli](https://github.com/ggml-org/llama.cpp/tree/master/examples/main) is a console program which can be used to chat with LLMs. Simple run the following command where you place the llama.cpp programs:"
msgid "CPU: llama-cli by default will use CPU and you can change `-t` to specify how many threads you would like it to use, e.g., `-t 8` means using 8 threads."
msgid "GPU: If the programs are bulit with GPU support, you can use `-ngl`, which allows offloading some layers to the GPU for computation. If there are multiple GPUs, it will offload to all the GPUs. You can use `-dev` to control the devices used and `-sm` to control which kinds of parallelism is used. For example, `-ngl 99 -dev cuda0,cuda1 -sm row` means offload all layers to GPU 0 and GPU1 using the split mode row. Adding `-fa` may also speed up the generation."
msgid "**Sampling Parameters**: llama.cpp supports [a variety of sampling methods](https://github.com/ggml-org/llama.cpp/tree/master/examples/main#generation-flags) and has default configuration for many of them. It is recommended to adjust those parameters according to the actual case and the recommended parameters from Qwen3 modelcard could be used as a reference. If you encounter repetition and endless generation, it is recommended to pass in addition `--presence-penalty` up to `2.0`."
msgid "**Context Management**: llama.cpp adopts the \"rotating\" context management by default. The `-c` controls the maximum context length (default 4096, 0 means loaded from model), and `-n` controls the maximum generation length each time (default -1 means infinite until ending, -2 means until context full). When the context is full but the generation doesn't end, the first `--keep` tokens (default 0, -1 means all) from the initial prompt is kept, and the first half of the rest is discarded. Then, the model continues to generate based on the new context tokens. You can set `--no-context-shift` to prevent this rotating behaviour and the generation will stop once `-c` is reached."
msgid "**Chat**: `--jinja` indicates using the chat template embedded in the GGUF which is prefered and `--color` indicates coloring the texts so that user input and model output can be better differentiated. If there is a chat template, like in Qwen3 models, llama-cli will enter chat mode automatically. To stop generation or exit press \"Ctrl+C\". You can use `-sys` to add a system prompt."
msgid "[llama-server](https://github.com/ggml-org/llama.cpp/tree/master/examples/server) is a simple HTTP server, including a set of LLM REST APIs and a simple web front end to interact with LLMs using llama.cpp."
msgstr "[llama-server](https://github.com/ggml-org/llama.cpp/tree/master/examples/server) 是一个简单的 HTTP 服务器,包含一组 LLM REST API 和一个简单的 Web 前端,用于通过 llama.cpp 与大型语言模型交互。"
#: ../../Qwen/source/run_locally/llama.cpp.md:253
#: c8c19cde1d2a4d9894b7438e00dbb7b5
msgid "The core command is similar to that of llama-cli. In addition, it supports thinking content parsing and tool call parsing."
msgid "By default the server will listen at `http://localhost:8080` which can be changed by passing `--host` and `--port`. The web front end can be assess from a browser at `http://localhost:8080/`. The OpenAI compatible API is at `http://localhost:8080/v1/`."
msgid "If you still find it difficult to use llama.cpp, don't worry, just check out other llama.cpp-based applications. For example, Qwen3 has already been officially part of Ollama and LM Studio, which are platforms for your to search and run local LLMs."
#~ msgid "Previously, Qwen2 models generate nonsense like `GGGG...` with `llama.cpp` on GPUs. The workaround is to enable flash attention (`-fa`), which uses a different implementation, and offload the whole model to the GPU (`-ngl 80`) due to broken partial GPU offloading with flash attention."
#~ msgid "Remember that `llama-cli` is an example program, not a full-blown application. Sometimes it just does not work in the way you would like. This guide could also get quite technical sometimes. If you would like a smooth experience, check out the application mentioned above, which are much easier to \"use\"."
#~ msgid "The command will only compile the parts needed for `llama-cli`. On macOS, it will enable Metal and Accelerate by default, so you can run with GPUs. On Linux, you won't get GPU support by default, but SIMD-optimization is enabled if available."
#~ msgid "There are other [example programs](https://github.com/ggerganov/llama.cpp/tree/master/examples) in llama.cpp. You can build them at once with simply (it may take some time):"
#~ msgid "or you can also compile only the one you need, for example:"
#~ msgstr "你也可以只编译你需要的,例如:"
#~ msgid "Running the Model"
#~ msgstr "运行模型"
#~ msgid "Due to random sampling and source code updates, the generated content with the same command as given in this section may be different from what is shown in the examples."
#~ msgid "`llama-cli` provide multiple \"mode\" to \"interact\" with the model. Here, we demonstrate three ways to run the model, with increasing difficulty."
#~ msgid "For users, to achieve chatbot-like experience, it is recommended to commence in the conversation mode"
#~ msgstr "对于普通用户来说,为了获得类似聊天机器人的体验,建议从对话模式开始。"
#~ msgid "The program will first print metadata to the screen until you see the following:"
#~ msgstr "程序首先会在屏幕上打印元数据,直到你看到以下内容:"
#~ msgid "Now, the model is waiting for your input, and you can chat with the model:"
#~ msgstr "现在,模型正在等待你的输入,你可以与模型进行对话:"
#~ msgid "That's something, isn't it? You can stop the model generation anytime by Ctrl+C or Command+. However, if the model generation is ended and the control is returned to you, pressing the combination will exit the program."
#~ msgid "So what does the command we used actually do? Let's explain a little:"
#~ msgstr "那么,我们使用的命令实际上做了什么呢?让我们来解释一下:"
#~ msgid "-m or --model"
#~ msgstr "-m 或 --model"
#~ msgid "Model path, obviously."
#~ msgstr "显然,这是模型路径。"
#~ msgid "-co or --color"
#~ msgstr "-co 或 --color"
#~ msgid "Colorize output to distinguish prompt and user input from generations. Prompt text is dark yellow; user text is green; generated text is white; error text is red."
#~ msgid "Layers to the GPU for computation if the program is compiled with GPU support."
#~ msgstr "如果程序编译时支持 GPU,则将这么多层分配给 GPU 进行计算。"
#~ msgid "-n or --predict"
#~ msgstr "-n 或 --predict"
#~ msgid "Number of tokens to predict."
#~ msgstr "要预测的token数量。"
#~ msgid "You can also explore other options by"
#~ msgstr "你也可以通过以下方式探索其他选项:"
#~ msgid "Interactive Mode"
#~ msgstr "互动模式"
#~ msgid "The conversation mode hides the inner workings of LLMs. With interactive mode, you are made aware how LLMs work in the way to completion or continuation. The workflow is like"
#~ msgid "Append new texts (with optional prefix and suffix), and then let the model continues the generation."
#~ msgstr "添加新文本(可选前缀和后缀),然后让模型继续生成。"
#~ msgid "Repeat Step 2. and Step 3."
#~ msgstr "重复步骤2和步骤3。"
#~ msgid "This workflow requires a different set of options, since you have to mind the chat template yourselves. To proper run the Qwen2.5 models, try the following:"
#~ msgid "Enter interactive mode. You can interrupt model generation and append new texts."
#~ msgstr "进入互动模式。你可以中断模型生成并添加新文本。"
#~ msgid "-if or --interactive-first"
#~ msgstr "-if 或 --interactive-first"
#~ msgid "Immediately wait for user input. Otherwise, the model will run at once and generate based on the prompt."
#~ msgstr "立即等待用户输入。否则,模型将立即运行并根据提示生成文本。"
#~ msgid "In interactive mode, it is the contexts based on which the model predicts the continuation."
#~ msgstr "在互动模式下,这是模型续写用的上文。"
#~ msgid "--in-prefix"
#~ msgstr ""
#~ msgid "String to prefix user inputs with."
#~ msgstr "用户输入附加的前缀字符串。"
#~ msgid "--in-suffix"
#~ msgstr ""
#~ msgid "String to suffix after user inputs with."
#~ msgstr "用户输入附加的后缀字符串。"
#~ msgid "The result is like this:"
#~ msgstr "结果如下:"
#~ msgid "We use `prompt`, `in-prefix`, and `in-suffix` together to implement the chat template (ChatML-like) used by Qwen2.5 with a system message. So the experience is very similar to the conversation mode: you just need to type in the things you want to ask the model and don't need to worry about the chat template once the program starts. Note that, there should not be a new line after user input according to the template, so remember to end your input with `/`."
#~ msgid "Interactive mode can achieve a lot more flexible workflows, under the condition that the chat template is maintained properly throughout. The following is an example:"
#~ msgid "In the above example, I set `--reverse-prompt` to `\"LLM\"` so that the generation is interrupted whenever the model generates `\"LLM\"`[^rp]. The in prefix and in suffix are also set to empty so that I can add content exactly I want. After every generation of `\"LLM\"`, I added the part `\"...not what you think...\"` which are not likely to be generated by the model. Yet the model can continue generation just as fluent, although the logic is broken the second time around. I think it's fun to play around."
#~ msgstr "在上面的例子中,我将 `--reverse-prompt` 设置为 `\"LLM\"`,以便每当模型生成 `\"LLM\"` 时中断生成过程[^rp]。前缀和后缀也被设置为空,这样我可以精确地添加想要的内容。每次生成 `\"LLM\"` 后,我添加了 `\"...not what you think...\"` 的部分,这部分不太可能由模型生成。然而,模型仍能继续流畅生成,尽管第二次逻辑被破坏。这很有趣,值得探索。"
#~ msgid "Non-interactive Mode"
#~ msgstr "非交互模式"
#~ msgid "You can also use `llama-cli` for text completion by using just the prompt. However, it also means you have to format the input properly and only one turn can be generated."
#~ msgid "There are some gotchas in using `--reverse-prompt` as it matches tokens instead of strings. Since the same string can be tokenized differently in different contexts in BPE tokenization, some reverse prompts are never matched even though the string does exist in generation."
msgid "[mlx-lm](https://github.com/ml-explore/mlx-examples/tree/main/llms) helps you run LLMs locally on Apple Silicon. It is available at MacOS. It has already supported Qwen models and this time, we have also provided checkpoints that you can directly use with it."
msgid "We provide model checkpoints with `mlx-lm` in our Hugging Face organization, and to search for what you need you can search the repo names with `-MLX`."
msgid "[Ollama](https://ollama.com/) helps you run LLMs locally with only a few commands. It is available at MacOS, Linux, and Windows. Now, Qwen2.5 is officially on Ollama, and you can run it with one command:"
msgid "Visit the official website [Ollama](https://ollama.com/) and click download to install Ollama on your device. You can also search models on the website, where you can find the Qwen2.5 models. Except for the default one, you can choose to run Qwen2.5-Instruct models of different sizes by:"
msgid "Sometimes you don't want to pull models and you just want to use Ollama with your own GGUF files. Suppose you have a GGUF file of Qwen2.5, `qwen2.5-7b-instruct-q5_0.gguf`. For the first step, you need to create a file called `Modelfile`. The content of the file is shown below:"
msgid "Tool use is now support Ollama and you should be able to run Qwen2.5 models with it. For more details, see our [function calling guide](../framework/function_call)."
# This file is distributed under the same license as the Qwen package.
#
msgid ""
msgstr ""
"Project-Id-Version: Qwen \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-04-28 19:42+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.17.0\n"
#: ../../Qwen/source/training/llama_factory.rst:2
#: 7a9018d9e7ee41858ac5c59723365a63
msgid "LLaMA-Factory"
msgstr ""
#: ../../Qwen/source/training/llama_factory.rst:5
#: 6e90d8f392914d029783ed85b510063f
msgid "To be updated for Qwen3."
msgstr "仍需为Qwen3更新。"
#: ../../Qwen/source/training/llama_factory.rst:7
#: e82fbe9827774824a4259372afda3240
msgid "Here we provide a script for supervised finetuning Qwen2.5 with `LLaMA-Factory <https://github.com/hiyouga/LLaMA-Factory>`__. This script for supervised finetuning (SFT) has the following features:"
msgid "LLaMA-Factory provides several training datasets in ``data`` folder, you can use it directly. If you are using a custom dataset, please prepare your dataset as follows."
msgid "Organize your data in a **json** file and put your data in ``data`` folder. LLaMA-Factory supports dataset in ``alpaca`` or ``sharegpt`` format."
msgid "and enjoy the training process. To make changes to your training, you can modify the arguments in the training command to adjust the hyperparameters. One argument to note is ``cutoff_len``, which is the maximum length of the training data. Control this parameter to avoid OOM error."
msgid "If you train your model with LoRA, you probably need to merge adapter parameters to the main branch. Run the following command to perform the merging of LoRA adapters."
msgid "ModelScope SWIFT (ms-swift) is the official large model and multimodal model training and deployment framework provided by the ModelScope community."
msgstr "ModelScope SWIFT (ms-swift) 是 ModelScope 社区提供的官方大型模型和多模态模型训练与部署框架。"
msgid "Pre-built datasets are available at: `Supported Datasets <https://swift.readthedocs.io/en/latest/Instruction/Supported-models-and-datasets.html#datasets>`__"
msgid "For detailed support information, please refer to: `Supported Features <https://swift.readthedocs.io/en/latest/Instruction/Pre-training-and-Fine-tuning.html#pre-training-and-fine-tuning>`__"
msgid "ms-swift has built-in preprocessing logic for several datasets, which can be directly used for training via the ``--dataset`` parameter. For supported datasets, please refer to: `Supported Datasets <https://swift.readthedocs.io/en/latest/Instruction/Supported-models-and-datasets.html#datasets>`__"
msgid "When using the built-in accuracy/cosine reward, the dataset must include a ``solution`` column to compute accuracy. The other columns in the dataset will also be passed to the `kwargs` of the reward function."
msgid "Customizing the Reward Function: To tailor the reward function to your specific needs, you can refer to the following resource: `external reward plugin <https://github.com/modelscope/ms-swift/tree/main/examples/train/grpo/plugin>`__"