Commit a52e53db authored by chenzk (v1.0)
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
version: 2
build:
os: ubuntu-22.04
tools:
python: "3"
sphinx:
configuration: docs/source/conf.py
# If using Sphinx, optionally build your docs in additional formats such as PDF
# formats:
# - pdf
# Optionally declare the Python requirements required to build your docs
python:
install:
- requirements: docs/requirements-docs.txt
---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-8B/blob/main/LICENSE
pipeline_tag: text-generation
base_model:
- Qwen/Qwen3-8B-Base
---
# Qwen3-8B
<a href="https://chat.qwen.ai/" target="_blank" style="margin: 2px;">
<img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/>
</a>
## Qwen3 Highlights
Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction following, agent capabilities, and multilingual support, with the following key features:
- **Unique support for seamless switching between thinking mode** (for complex logical reasoning, math, and coding) and **non-thinking mode** (for efficient, general-purpose dialogue) **within a single model**, ensuring optimal performance across various scenarios.
- **Significant enhancement in reasoning capabilities**, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- **Superior human preference alignment**, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- **Expertise in agent capabilities**, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- **Support of 100+ languages and dialects** with strong capabilities for **multilingual instruction following** and **translation**.
## Model Overview
**Qwen3-8B** has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 8.2B
- Number of Parameters (Non-Embedding): 6.95B
- Number of Layers: 36
- Number of Attention Heads (GQA): 32 for Q and 8 for KV
- Context Length: 32,768 tokens natively and [131,072 tokens with YaRN](#processing-long-texts).
For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our [blog](https://qwenlm.github.io/blog/qwen3/), [GitHub](https://github.com/QwenLM/Qwen3), and [Documentation](https://qwen.readthedocs.io/en/latest/).
## Quickstart
The code for Qwen3 has been merged into the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.
With `transformers<4.51.0`, you will encounter the following error:
```
KeyError: 'qwen3'
```
The following contains a code snippet illustrating how to use the model to generate content based on the given inputs.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Qwen/Qwen3-8B"
# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# conduct text completion
generated_ids = model.generate(
**model_inputs,
max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0
thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
print("thinking content:", thinking_content)
print("content:", content)
```
For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.4` to create an OpenAI-compatible API endpoint:
- SGLang:
```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-8B --reasoning-parser qwen3
```
- vLLM:
```shell
vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser deepseek_r1
```
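Once the server is up, the endpoint can be queried with any OpenAI-compatible client. A minimal sketch with the `openai` Python SDK (the port and the `EMPTY` API key are assumptions matching the default vLLM launch above; SGLang listens on port 30000 by default):
```python
# Minimal sketch: query the OpenAI-compatible endpoint started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=32768,
)
print(response.choices[0].message.content)
```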
For local use, applications such as llama.cpp, Ollama, LMStudio, and MLX-LM also support Qwen3.
## Switching Between Thinking and Non-Thinking Mode
> [!TIP]
> The `enable_thinking` switch is also available in APIs created by SGLang and vLLM.
> Please refer to our documentation for [SGLang](https://qwen.readthedocs.io/en/latest/deployment/sglang.html#thinking-non-thinking-modes) and [vLLM](https://qwen.readthedocs.io/en/latest/deployment/vllm.html#thinking-non-thinking-modes) users.
### `enable_thinking=True`
By default, Qwen3 has thinking capabilities enabled, similar to QwQ-32B. This means the model will use its reasoning abilities to enhance the quality of generated responses. For example, when explicitly setting `enable_thinking=True` or leaving it as the default value in `tokenizer.apply_chat_template`, the model will engage its thinking mode.
```python
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True # True is the default value for enable_thinking
)
```
In this mode, the model will generate thinking content wrapped in a `<think>...</think>` block, followed by the final response.
> [!NOTE]
> For thinking mode, use `Temperature=0.6`, `TopP=0.95`, `TopK=20`, and `MinP=0` (the default setting in `generation_config.json`). **DO NOT use greedy decoding**, as it can lead to performance degradation and endless repetitions. For more detailed guidance, please refer to the [Best Practices](#best-practices) section.
### `enable_thinking=False`
We provide a hard switch to strictly disable the model's thinking behavior, aligning its functionality with the previous Qwen2.5-Instruct models. This mode is particularly useful in scenarios where disabling thinking is essential for enhancing efficiency.
```python
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=False # Setting enable_thinking=False disables thinking mode
)
```
In this mode, the model will not generate any think content and will not include a `<think>...</think>` block.
> [!NOTE]
> For non-thinking mode, we suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0`. For more detailed guidance, please refer to the [Best Practices](#best-practices) section.
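Both sets of recommendations map directly onto the standard sampling arguments of `model.generate` in `transformers`; a minimal sketch, continuing from the Quickstart snippet above (`min_p` support assumes a reasonably recent `transformers` release):
```python
# Sketch: pass the recommended sampling settings explicitly (thinking mode shown;
# for non-thinking mode use temperature=0.7, top_p=0.8).
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768,
    do_sample=True,    # never use greedy decoding
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    min_p=0.0,         # requires a recent transformers release
)
```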
### Advanced Usage: Switching Between Thinking and Non-Thinking Modes via User Input
We provide a soft switch mechanism that allows users to dynamically control the model's behavior when `enable_thinking=True`. Specifically, you can add `/think` and `/no_think` to user prompts or system messages to switch the model's thinking mode from turn to turn. The model will follow the most recent instruction in multi-turn conversations.
Here is an example of a multi-turn conversation:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-8B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]
        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        inputs = self.tokenizer(text, return_tensors="pt")
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)
        # Update history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})
        return response

# Example Usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (without /think or /no_think tags, thinking mode is enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")
```
> [!NOTE]
> For API compatibility, when `enable_thinking=True`, regardless of whether the user uses `/think` or `/no_think`, the model will always output a block wrapped in `<think>...</think>`. However, the content inside this block may be empty if thinking is disabled.
> When `enable_thinking=False`, the soft switches are not valid. Regardless of any `/think` or `/no_think` tags input by the user, the model will not generate think content and will not include a `<think>...</think>` block.
## Agentic Use
Qwen3 excels in tool-calling capabilities. We recommend using [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent) to make the best use of the agentic abilities of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity.
To define the available tools, you can use an MCP configuration file, use the integrated tools of Qwen-Agent, or integrate other tools by yourself.
```python
from qwen_agent.agents import Assistant
# Define LLM
llm_cfg = {
'model': 'Qwen3-8B',
# Use the endpoint provided by Alibaba Model Studio:
# 'model_type': 'qwen_dashscope',
# 'api_key': os.getenv('DASHSCOPE_API_KEY'),
# Use a custom endpoint compatible with OpenAI API:
'model_server': 'http://localhost:8000/v1', # api_base
'api_key': 'EMPTY',
# Other parameters:
# 'generate_cfg': {
# # Add: when the response content is `<think>this is the thought</think>this is the answer`;
# # Do not add: When the response has been separated by reasoning_content and content.
# 'thought_in_content': True,
# },
}
# Define Tools
tools = [
{'mcpServers': { # You can specify the MCP configuration file
'time': {
'command': 'uvx',
'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
},
"fetch": {
"command": "uvx",
"args": ["mcp-server-fetch"]
}
}
},
'code_interpreter', # Built-in tools
]
# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)
# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```
## Processing Long Texts
Qwen3 natively supports context lengths of up to 32,768 tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques to handle long texts effectively. We have validated the model's performance on context lengths of up to 131,072 tokens using the [YaRN](https://arxiv.org/abs/2309.00071) method.
YaRN is currently supported by several inference frameworks, e.g., `transformers` and `llama.cpp` for local use, `vllm` and `sglang` for deployment. In general, there are two approaches to enabling YaRN for supported frameworks:
- Modifying the model files:
In the `config.json` file, add the `rope_scaling` fields:
```json
{
...,
"rope_scaling": {
"type": "yarn",
"factor": 4.0,
"original_max_position_embeddings": 32768
}
}
```
For `llama.cpp`, you need to regenerate the GGUF file after the modification.
- Passing command line arguments:
For `vllm`, you can use
```shell
vllm serve ... --rope-scaling '{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072
```
For `sglang`, you can use
```shell
python -m sglang.launch_server ... --json-model-override-args '{"rope_scaling":{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
```
For `llama-server` from `llama.cpp`, you can use
```shell
llama-server ... --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
```
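For `transformers`, an alternative to editing `config.json` on disk is to override the configuration at load time. A minimal sketch (assumes `transformers>=4.51.0`, as noted below):
```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"

# Attach the YaRN rope_scaling settings to the config instead of editing config.json.
config = AutoConfig.from_pretrained(model_name)
config.rope_scaling = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}
config.max_position_embeddings = 131072  # raise the usable context window accordingly

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    torch_dtype="auto",
    device_map="auto",
)
```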
> [!IMPORTANT]
> If you encounter the following warning
> ```
> Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
> ```
> please upgrade `transformers>=4.51.0`.
> [!NOTE]
> All the notable open-source frameworks implement static YaRN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts.**
> We advise adding the `rope_scaling` configuration only when processing long contexts is required.
> It is also recommended to modify the `factor` as needed. For example, if the typical context length for your application is 65,536 tokens, it would be better to set `factor` as 2.0.
> [!NOTE]
> The default `max_position_embeddings` in `config.json` is set to 40,960. This allocation includes reserving 32,768 tokens for outputs and 8,192 tokens for typical prompts, which is sufficient for most scenarios involving short text processing. If the average context length does not exceed 32,768 tokens, we do not recommend enabling YaRN in this scenario, as it may potentially degrade model performance.
> [!TIP]
> The endpoint provided by Alibaba Model Studio supports dynamic YaRN by default and no extra configuration is needed.
## Best Practices
To achieve optimal performance, we recommend the following settings:
1. **Sampling Parameters**:
- For thinking mode (`enable_thinking=True`), use `Temperature=0.6`, `TopP=0.95`, `TopK=20`, and `MinP=0`. **DO NOT use greedy decoding**, as it can lead to performance degradation and endless repetitions.
- For non-thinking mode (`enable_thinking=False`), we suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0`.
- For supported frameworks, you can adjust the `presence_penalty` parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
2. **Adequate Output Length**: We recommend using an output length of 32,768 tokens for most queries. For benchmarking on highly complex problems, such as those found in math and programming competitions, we suggest setting the max output length to 38,912 tokens. This provides the model with sufficient space to generate detailed and comprehensive responses, thereby enhancing its overall performance.
3. **Standardize Output Format**: We recommend using prompts to standardize model outputs when benchmarking.
- **Math Problems**: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
- **Multiple-Choice Questions**: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`."
4. **No Thinking Content in History**: In multi-turn conversations, the historical model output should include only the final output part, not the thinking content. This is already implemented in the provided Jinja2 chat template. However, for frameworks that do not directly use the Jinja2 chat template, it is up to the developers to ensure that this best practice is followed (see the sketch below).
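For example, a minimal sketch of stripping the thinking block before it enters the history (the helper name and regular expression are illustrative, based on the `<think>...</think>` format described above):
```python
import re

def strip_thinking(assistant_text: str) -> str:
    """Remove the <think>...</think> block so only the final answer enters the history."""
    return re.sub(r"<think>.*?</think>", "", assistant_text, flags=re.DOTALL).strip()

# Illustrative usage with a raw model output:
raw_output = '<think>Count the r\'s one by one...</think>There are three r\'s in "strawberries".'
history = [{"role": "user", "content": "How many r's in strawberries?"}]
history.append({"role": "assistant", "content": strip_thinking(raw_output)})
print(history[-1]["content"])
```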
### Citation
If you find our work helpful, feel free to cite it.
```
@misc{qwen3,
title = {Qwen3},
url = {https://qwenlm.github.io/blog/qwen3/},
author = {Qwen Team},
month = {April},
year = {2025}
}
```
# Qwen3
Qwen3 is the latest generation of the Qwen large language model series, offering dense and mixture-of-experts (MoE) models that switch seamlessly between thinking and non-thinking modes and are suited to dialogue, reasoning, coding, and agent applications.
## Paper
`None`
## Model Structure
Qwen3 adopts a standard decoder-only architecture and introduces MoE to improve performance. It is the first "hybrid reasoning model" series, integrating "fast thinking" and "slow thinking" into a single model.
<div align=center>
<img src="./doc/qwen.png"/>
</div>
## Algorithm Principle
The input is embedded and then passed through attention and FFN layers to extract features. Finally, Softmax converts the unnormalized score vector (logits) produced by the decoder's last layer into a probability distribution, where each element is the probability of generating the corresponding token; the model can then sample from this distribution or pick the most likely token as the prediction.
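A minimal sketch of this final decoding step (illustrative tensors only; the vocabulary size and scores are made up):
```python
import torch
import torch.nn.functional as F

# logits: unnormalized scores from the decoder's last layer, shape (vocab_size,)
logits = torch.randn(151936)                 # illustrative vocabulary size
probs = F.softmax(logits, dim=-1)            # convert scores into a probability distribution
next_token_id = torch.argmax(probs).item()   # greedy pick of the most likely token
# In practice, sampling (temperature / top-p / top-k) is usually used instead of argmax.
print(next_token_id)
```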
## Environment Setup
```
mv Qwen3_pytorch Qwen3 # drop the framework-name suffix
```
### Docker (Method 1)
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.4.1-ubuntu22.04-dtk25.04-py3.10-fixpy
# Replace <your IMAGE ID> with the ID of the docker image pulled above; for this image it is e77c15729879
docker run -it --shm-size=64G -v $PWD/Qwen3:/home/Qwen3 -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name qwen3 <your IMAGE ID> bash
cd /home/Qwen3
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple
```
### Dockerfile (Method 2)
```
cd /home/Qwen3/docker
docker build --no-cache -t qwen3:latest .
docker run --shm-size=64G --name qwen3 -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v $PWD/../../Qwen3:/home/Qwen3 -it qwen3 bash
# If installing the environment through the Dockerfile takes a long time, comment out the pip install inside it and install the Python libraries after starting the container: pip install -r requirements.txt
```
### Anaconda (Method 3)
1. The special deep learning libraries required by this project for DCU GPUs can be downloaded and installed from the Hygon (光合) developer community:
- https://developer.hpccube.com/tool/
```
DTK driver: dtk2504
python:python3.10
torch:2.4.1
torchvision:0.19.1
triton:3.0.0
vllm:0.6.2
flash-attn:2.6.1
deepspeed:0.14.2
apex:1.4.0
transformers:4.51.0
```
`Tips: the DTK driver, python, torch, and other DCU-related tool versions listed above must match one another exactly.`
2. Install the remaining non-special libraries according to requirements.txt
```
cd /home/Qwen3
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple
```
## Dataset
`None`
## Training
## Inference
Pretrained weights directory structure:
```
/home/Qwen3/
└── Qwen/Qwen3-8B
```
### Single Node, Multiple GPUs
```
# This project uses Qwen3-8B as an example; other Qwen3 models work analogously.
cd /home/Qwen3
python infer_transformers.py
# vllm>=0.8.4 is currently being adapted; vLLM-based inference will be released later.
```
For more information, refer to [`README_orgin`](./README_orgin.md) in the upstream project.
## Result
`Input:`
```
prompt: "Give me a short introduction to large language models."
```
`Output:`
```
<think>
Okay, the user wants a short introduction to large language models. Let me start by defining what they are. I should mention they're AI systems trained on massive text data. Maybe include how they process and generate human-like text. Also, touch on their applications like answering questions, creating content, coding. Need to keep it concise but cover the key points. Oh, and maybe mention their size, like parameters, but not too technical. Avoid jargon. Make sure it's easy to understand. Let me check if I'm missing anything important. Oh, maybe a sentence about their training process? Or just stick to the basics. Alright, structure: definition, training data, capabilities, applications. Keep each part brief. That should work.
</think>
Large language models (LLMs) are advanced artificial intelligence systems trained on vast amounts of text data to understand and generate human-like language. They can process and respond to complex queries, create written content, code, and even engage in conversations. These models, often with billions of parameters, excel at tasks like answering questions, summarizing information, and translating languages, making them versatile tools for various applications, from customer service to research and creative writing.
```
### Accuracy
The accuracy on DCU is consistent with that on GPU; inference framework: pytorch.
## Application Scenarios
### Algorithm Category
`Dialogue / Q&A`
### Key Application Industries
`Manufacturing, media, finance, energy, healthcare, smart home, education`
## Pretrained Weights
Download from the ModelScope community: [Qwen/Qwen3-8B](https://www.modelscope.cn/Qwen/Qwen3-8B.git)
## Source Repository and Issue Feedback
- http://developer.sourcefind.cn/codes/modelzoo/Qwen3_pytorch.git
## References
- https://github.com/QwenLM/Qwen3.git
# Qwen3
<p align="center">
<img src="https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/logo_qwen3.png" width="400"/>
<p>
<p align="center">
💜 <a href="https://chat.qwen.ai/"><b>Qwen Chat</b></a>&nbsp&nbsp | &nbsp&nbsp🤗 <a href="https://huggingface.co/Qwen">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/organization/qwen">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp 📑 Paper &nbsp&nbsp | &nbsp&nbsp 📑 <a href="https://qwenlm.github.io/blog/qwen3/">Blog</a> &nbsp&nbsp | &nbsp&nbsp📖 <a href="https://qwen.readthedocs.io/">Documentation</a>
<br>
🖥️ <a href="https://huggingface.co/spaces/Qwen/Qwen3-Demo">Demo</a>&nbsp&nbsp | &nbsp&nbsp💬 <a href="https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png">WeChat (微信)</a>&nbsp&nbsp | &nbsp&nbsp🫨 <a href="https://discord.gg/CV4E9rpNSD">Discord</a>&nbsp&nbsp
</p>
Visit our Hugging Face or ModelScope organization (click links above), search checkpoints with names starting with `Qwen3-` or visit the [Qwen3 collection](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f), and you will find all you need! Enjoy!
To learn more about Qwen3, feel free to read our documentation \[[EN](https://qwen.readthedocs.io/en/latest/)|[ZH](https://qwen.readthedocs.io/zh-cn/latest/)\]. Our documentation consists of the following sections:
- Quickstart: the basic usages and demonstrations;
- Inference: the guidance for the inference with Transformers, including batch inference, streaming, etc.;
- Run Locally: the instructions for running LLM locally on CPU and GPU, with frameworks like llama.cpp and Ollama;
- Deployment: the demonstration of how to deploy Qwen for large-scale inference with frameworks like SGLang, vLLM, TGI, etc.;
- Quantization: the practice of quantizing LLMs with GPTQ, AWQ, as well as the guidance for how to make high-quality quantized GGUF files;
- Training: the instructions for post-training, including SFT and RLHF (TODO) with frameworks like Axolotl, LLaMA-Factory, etc.
- Framework: the usage of Qwen with frameworks for application, e.g., RAG, Agent, etc.
## Introduction
We are excited to announce the release of Qwen3, the latest addition to the Qwen family of large language models.
These models represent our most advanced and intelligent systems to date, building on our experience with QwQ and Qwen2.5.
We are making the weights of Qwen3 available to the public, including both dense and Mixture-of-Experts (MoE) models.
The highlights from Qwen3 include:
- **Dense and Mixture-of-Experts (MoE) models of various sizes**, available in 0.6B, 1.7B, 4B, 8B, 14B, 32B and 30B-A3B, 235B-A22B.
- **Seamless switching between thinking mode** (for complex logical reasoning, math, and coding) and **non-thinking mode** (for efficient, general-purpose chat), ensuring optimal performance across various scenarios.
- **Significant enhancement in reasoning capabilities**, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- **Superior human preference alignment**, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- **Expertise in agent capabilities**, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- **Support of 100+ languages and dialects** with strong capabilities for **multilingual instruction following** and **translation**.
> [!IMPORTANT]
> Qwen3 models adopt a different naming scheme.
>
> The post-trained models do not use the "-Instruct" suffix any more. For example, Qwen3-32B is the newer version of Qwen2.5-32B-Instruct.
>
> The base models now have names ending with "-Base".
## News
- 2025.04.29: We released the Qwen3 series. Check our [blog](https://qwenlm.github.io/blog/qwen3) for more details!
- 2024.09.19: We released the Qwen2.5 series. This time there are 3 extra model sizes: 3B, 14B, and 32B for more possibilities. Check our [blog](https://qwenlm.github.io/blog/qwen2.5) for more!
- 2024.06.06: We released the Qwen2 series. Check our [blog](https://qwenlm.github.io/blog/qwen2/)!
- 2024.03.28: We released the first MoE model of Qwen: Qwen1.5-MoE-A2.7B! Temporarily, only HF transformers and vLLM support the model. We will soon add the support of llama.cpp, mlx-lm, etc. Check our [blog](https://qwenlm.github.io/blog/qwen-moe/) for more information!
- 2024.02.05: We released the Qwen1.5 series.
## Performance
Detailed evaluation results are reported in this <a href="https://qwenlm.github.io/blog/qwen3/"> 📑 blog</a>.
For requirements on GPU memory and the respective throughput, see the results [here](https://qwen.readthedocs.io/en/latest/getting_started/speed_benchmark.html).
## Run Qwen3
### 🤗 Transformers
Transformers is a library of pretrained natural language processing models for inference and training.
The latest version of `transformers` is recommended and `transformers>=4.51.0` is required.
The following contains a code snippet illustrating how to use the model to generate content based on the given inputs.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Qwen/Qwen3-8B"
# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# conduct text completion
generated_ids = model.generate(
**model_inputs,
max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
# the result will begin with thinking content in <think></think> tags, followed by the actual response
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```
By default, Qwen3 models will think before responding.
This can be controlled by
- `enable_thinking=False`: Passing `enable_thinking=False` to `tokenizer.apply_chat_template` will strictly prevent the model from generating thinking content.
- `/think` and `/no_think` instructions: Use these words in the system or user message to signify whether Qwen3 should think. In multi-turn conversations, the latest instruction is followed.
### ModelScope
We strongly advise users, especially those in mainland China, to use ModelScope.
ModelScope adopts a Python API similar to Transformers.
The CLI tool `modelscope download` can help you solve issues concerning downloading checkpoints.
### llama.cpp
[`llama.cpp`](https://github.com/ggml-org/llama.cpp) enables LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware.
`llama.cpp>=b5092` is required.
To use the CLI, run the following in a terminal:
```shell
./llama-cli -hf Qwen/Qwen3-8B-GGUF:Q8_0 --jinja --color -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -c 40960 -n 32768 --no-context-shift
# CTRL+C to exit
```
To use the API server, run the following in a terminal:
```shell
./llama-server -hf Qwen/Qwen3-8B-GGUF:Q8_0 --jinja --reasoning-format deepseek -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -c 40960 -n 32768 --no-context-shift --port 8080
```
A simple web front end will be at `http://localhost:8080` and an OpenAI-compatible API will be at `http://localhost:8080/v1`.
For additional guides, please refer to [our documentation](https://qwen.readthedocs.io/en/latest/run_locally/llama.cpp.html).
### Ollama
After [installing ollama](https://ollama.com/), you can initiate the ollama service with the following command:
```shell
ollama serve
# You need to keep this service running whenever you are using ollama
```
To pull a model checkpoint and run the model, use the `ollama run` command. You can specify a model size by adding a suffix to `qwen3`, such as `:8b` or `:30b-a3b`:
```shell
ollama run qwen3:8b
# To exit, type "/bye" and press ENTER
```
You can also access the ollama service via its OpenAI-compatible API.
Please note that you need to (1) keep `ollama serve` running while using the API, and (2) execute `ollama run qwen3:8b` before utilizing this API to ensure that the model checkpoint is prepared.
The API is at `http://localhost:11434/v1/` by default.
For additional details, please visit [ollama.ai](https://ollama.com/).
### LMStudio
Qwen3 has already been supported by [lmstudio.ai](https://lmstudio.ai/). You can directly use LMStudio with our GGUF files.
### MLX-LM
If you are running on Apple Silicon, [`mlx-lm`](https://github.com/ml-explore/mlx-lm) also supports Qwen3 (`mlx-lm>=0.24.0`).
Look for models ending with MLX on HuggingFace Hub.
<!-- ### OpenVINO
Qwen2.5 has already been supported by [OpenVINO toolkit](https://github.com/openvinotoolkit). You can install and run this [chatbot example](https://github.com/OpenVINO-dev-contest/Qwen2.openvino) with Intel CPU, integrated GPU or discrete GPU. -->
<!-- ### Text generation web UI
You can directly use [`text-generation-webui`](https://github.com/oobabooga/text-generation-webui) for creating a web UI demo. If you use GGUF, remember to install the latest wheel of `llama.cpp` with the support of Qwen2.5. -->
<!-- ### llamafile
Clone [`llamafile`](https://github.com/Mozilla-Ocho/llamafile), run source install, and then create your own llamafile with the GGUF file following the guide [here](https://github.com/Mozilla-Ocho/llamafile?tab=readme-ov-file#creating-llamafiles). You are able to run one line of command, say `./qwen.llamafile`, to create a demo. -->
## Deploy Qwen3
Qwen3 is supported by multiple inference frameworks.
Here we demonstrate the usage of `SGLang` and `vLLM`.
You can also find Qwen3 models from various inference providers, e.g., [Alibaba Cloud Model Studio](https://www.alibabacloud.com/en/product/modelstudio).
### SGLang
[SGLang](https://github.com/sgl-project/sglang) is a fast serving framework for large language models and vision language models.
SGLang can be used to launch a server with an OpenAI-compatible API service.
`sglang>=0.4.6.post1` is required.
It is as easy as
```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-8B --port 30000 --reasoning-parser qwen3
```
An OpenAI-compatible API will be available at `http://localhost:30000/v1`.
### vLLM
[vLLM](https://github.com/vllm-project/vllm) is a high-throughput and memory-efficient inference and serving engine for LLMs.
`vllm>=0.8.4` is required.
```shell
vllm serve Qwen/Qwen3-8B --port 8000 --enable-reasoning --reasoning-parser deepseek_r1
```
An OpenAI-compatible API will be available at `http://localhost:8000/v1`.
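With the reasoning parser enabled, the server returns the thinking content in a separate `reasoning_content` field alongside `content` (a non-standard extension to the OpenAI API). A minimal sketch with the `openai` Python SDK, assuming the vLLM command above:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "How many r's are in 'strawberries'?"}],
)
message = completion.choices[0].message
# Thinking content parsed out by the server (non-standard field; may be absent).
print("reasoning:", getattr(message, "reasoning_content", None))
print("answer:", message.content)
```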
### MindIE
For deployment on Ascend NPUs, please visit [Modelers](https://modelers.cn/) and search for Qwen3.
<!--
### OpenLLM
[OpenLLM](https://github.com/bentoml/OpenLLM) allows you to easily run Qwen2.5 as OpenAI-compatible APIs. You can start a model server using `openllm serve`. For example:
```bash
openllm serve qwen2.5:7b
```
The server is active at `http://localhost:3000/`, providing OpenAI-compatible APIs. You can create an OpenAI client to call its chat API. For more information, refer to [our documentation](https://qwen.readthedocs.io/en/latest/deployment/openllm.html). -->
## Build with Qwen3
### Tool Use
For tool use capabilities, we recommend taking a look at [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent), which provides a wrapper around these APIs to support tool use or function calling with MCP support.
Tool use with Qwen3 can also be conducted with SGLang, vLLM, Transformers, llama.cpp, Ollama, etc.
Follow guides in our documentation to see how to enable the support.
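When serving through an OpenAI-compatible endpoint, tools are declared in the request body. The sketch below shows the request shape only; the tool definition is hypothetical, and the server must be launched with the appropriate tool-call parsing options for `tool_calls` to be populated:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical tool definition, for illustration only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_current_temperature",
        "description": "Get the current temperature for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

completion = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "What is the temperature in Beijing?"}],
    tools=tools,
)
# tool_calls is filled only when the server parses tool calls from the generation.
print(completion.choices[0].message.tool_calls)
```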
### Finetuning
We advise you to use training frameworks, including [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl), [unsloth](https://github.com/unslothai/unsloth), [Swift](https://github.com/modelscope/swift), [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory), etc., to finetune your models with SFT, DPO, GRPO, etc.
## License Agreement
All our open-source models are licensed under Apache 2.0.
You can find the license files in the respective Hugging Face repositories.
## Citation
If you find our work helpful, feel free to cite it.
```
@article{qwen2.5,
title = {Qwen2.5 Technical Report},
author = {An Yang and Baosong Yang and Beichen Zhang and Binyuan Hui and Bo Zheng and Bowen Yu and Chengyuan Li and Dayiheng Liu and Fei Huang and Haoran Wei and Huan Lin and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Yang and Jiaxi Yang and Jingren Zhou and Junyang Lin and Kai Dang and Keming Lu and Keqin Bao and Kexin Yang and Le Yu and Mei Li and Mingfeng Xue and Pei Zhang and Qin Zhu and Rui Men and Runji Lin and Tianhao Li and Tingyu Xia and Xingzhang Ren and Xuancheng Ren and Yang Fan and Yang Su and Yichang Zhang and Yu Wan and Yuqiong Liu and Zeyu Cui and Zhenru Zhang and Zihan Qiu},
journal = {arXiv preprint arXiv:2412.15115},
year = {2024}
}
@article{qwen2,
title = {Qwen2 Technical Report},
author = {An Yang and Baosong Yang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Zhou and Chengpeng Li and Chengyuan Li and Dayiheng Liu and Fei Huang and Guanting Dong and Haoran Wei and Huan Lin and Jialong Tang and Jialin Wang and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Ma and Jin Xu and Jingren Zhou and Jinze Bai and Jinzheng He and Junyang Lin and Kai Dang and Keming Lu and Keqin Chen and Kexin Yang and Mei Li and Mingfeng Xue and Na Ni and Pei Zhang and Peng Wang and Ru Peng and Rui Men and Ruize Gao and Runji Lin and Shijie Wang and Shuai Bai and Sinan Tan and Tianhang Zhu and Tianhao Li and Tianyu Liu and Wenbin Ge and Xiaodong Deng and Xiaohuan Zhou and Xingzhang Ren and Xinyu Zhang and Xipin Wei and Xuancheng Ren and Yang Fan and Yang Yao and Yichang Zhang and Yu Wan and Yunfei Chu and Yuqiong Liu and Zeyu Cui and Zhenru Zhang and Zhihao Fan},
journal = {arXiv preprint arXiv:2407.10671},
year = {2024}
}
```
## Contact Us
If you are interested to leave a message to either our research team or product team, join our [Discord](https://discord.gg/z3GAxXZ9Ce) or [WeChat groups](assets/wechat.png)!
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.4.1-ubuntu22.04-dtk25.04-py3.10-fixpy
ENV DEBIAN_FRONTEND=noninteractive
# RUN yum update && yum install -y git cmake wget build-essential
# RUN source /opt/dtk-dtk25.04/env.sh
# # Install pip dependencies
COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
transformers>=4.51.0
ARG CUDA_VERSION=12.1.0
ARG from=nvidia/cuda:${CUDA_VERSION}-cudnn8-devel-ubuntu20.04
FROM ${from} as base
RUN <<EOF
apt update -y && apt upgrade -y && apt install -y --no-install-recommends \
git \
git-lfs \
python3 \
python3-pip \
python3-dev \
wget \
vim \
&& rm -rf /var/lib/apt/lists/*
EOF
RUN ln -s /usr/bin/python3 /usr/bin/python
RUN git lfs install
FROM base as dev
WORKDIR /
RUN mkdir -p /data/shared/Qwen
WORKDIR /data/shared/Qwen/
FROM dev as bundle_req
RUN pip install --no-cache-dir networkx==3.1
RUN pip3 install --no-cache-dir torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu121
RUN pip3 install --no-cache-dir transformers==4.40.2 accelerate tiktoken einops scipy
FROM bundle_req as bundle_finetune
ARG BUNDLE_FINETUNE=true
RUN <<EOF
if [ "$BUNDLE_FINETUNE" = "true" ]; then
cd /data/shared/Qwen
# Full-finetune / LoRA.
pip3 install --no-cache-dir "deepspeed==0.14.2" "peft==0.11.1"
# Q-LoRA.
apt update -y && DEBIAN_FRONTEND=noninteractive apt install -y --no-install-recommends \
libopenmpi-dev openmpi-bin \
&& rm -rf /var/lib/apt/lists/*
pip3 install --no-cache-dir "optimum==1.20.0" "auto-gptq==0.7.1" "autoawq==0.2.5" mpi4py
fi
EOF
FROM bundle_finetune as bundle_vllm
ARG BUNDLE_VLLM=true
RUN <<EOF
if [ "$BUNDLE_VLLM" = "true" ]; then
cd /data/shared/Qwen
pip3 install --no-cache-dir vllm==0.4.3 "fschat[model_worker,webui]==0.2.36"
fi
EOF
FROM bundle_vllm as bundle_flash_attention
ARG BUNDLE_FLASH_ATTENTION=true
RUN <<EOF
if [ "$BUNDLE_FLASH_ATTENTION" = "true" ]; then
pip3 install --no-cache-dir flash-attn==2.5.8 --no-build-isolation
fi
EOF
FROM bundle_flash_attention as final
COPY ../examples/sft/* ./
COPY ../examples/demo/* ./
EXPOSE 80
#!/usr/bin/env bash
#
# This script will automatically pull the docker image from DockerHub and start a container to run the Qwen-Chat cli-demo.
IMAGE_NAME=qwenllm/qwen:2-cu121
QWEN_CHECKPOINT_PATH=/path/to/Qwen-Instruct
CONTAINER_NAME=qwen2
function usage() {
echo '
Usage: bash docker/docker_cli_demo.sh [-i IMAGE_NAME] -c [/path/to/Qwen-Instruct] [-n CONTAINER_NAME]
'
}
while [[ "$1" != "" ]]; do
case $1 in
-i | --image-name )
shift
IMAGE_NAME=$1
;;
-c | --checkpoint )
shift
QWEN_CHECKPOINT_PATH=$1
;;
-n | --container-name )
shift
CONTAINER_NAME=$1
;;
-h | --help )
usage
exit 0
;;
* )
echo "Unknown argument ${1}"
exit 1
;;
esac
shift
done
if [ ! -e ${QWEN_CHECKPOINT_PATH}/config.json ]; then
echo "Checkpoint config.json file not found in ${QWEN_CHECKPOINT_PATH}, exit."
exit 1
fi
sudo docker pull ${IMAGE_NAME} || {
echo "Pulling image ${IMAGE_NAME} failed, exit."
exit 1
}
sudo docker run --gpus all --rm --name ${CONTAINER_NAME} \
--mount type=bind,source=${QWEN_CHECKPOINT_PATH},target=/data/shared/Qwen/Qwen-Instruct \
-it ${IMAGE_NAME} \
python cli_demo.py -c /data/shared/Qwen/Qwen-Instruct/
#!/usr/bin/env bash
#
# This script will automatically pull the docker image from DockerHub and start a daemon container to run the Qwen-Chat web-demo.
IMAGE_NAME=qwenllm/qwen:2-cu121
QWEN_CHECKPOINT_PATH=/path/to/Qwen-Instruct
PORT=8901
CONTAINER_NAME=qwen2
function usage() {
echo '
Usage: bash docker/docker_web_demo.sh [-i IMAGE_NAME] -c [/path/to/Qwen-Instruct] [-n CONTAINER_NAME] [--port PORT]
'
}
while [[ "$1" != "" ]]; do
case $1 in
-i | --image-name )
shift
IMAGE_NAME=$1
;;
-c | --checkpoint )
shift
QWEN_CHECKPOINT_PATH=$1
;;
-n | --container-name )
shift
CONTAINER_NAME=$1
;;
--port )
shift
PORT=$1
;;
-h | --help )
usage
exit 0
;;
* )
echo "Unknown argument ${1}"
exit 1
;;
esac
shift
done
if [ ! -e ${QWEN_CHECKPOINT_PATH}/config.json ]; then
echo "Checkpoint config.json file not found in ${QWEN_CHECKPOINT_PATH}, exit."
exit 1
fi
sudo docker pull ${IMAGE_NAME} || {
echo "Pulling image ${IMAGE_NAME} failed, exit."
exit 1
}
sudo docker run --gpus all -d --restart always --name ${CONTAINER_NAME} \
-v /var/run/docker.sock:/var/run/docker.sock -p ${PORT}:80 \
--mount type=bind,source=${QWEN_CHECKPOINT_PATH},target=/data/shared/Qwen/Qwen-Instruct \
-it ${IMAGE_NAME} \
python web_demo.py --server-port 80 --server-name 0.0.0.0 -c /data/shared/Qwen/Qwen-Instruct/ && {
echo "Successfully started web demo. Open 'http://localhost:${PORT}' to try!
Run \`docker logs ${CONTAINER_NAME}\` to check demo status.
Run \`docker rm -f ${CONTAINER_NAME}\` to stop and remove the demo."
}
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
# Qwen Documentation
This is the source of the documentation at <https://qwen.readthedocs.io>.
## Quick Start
We use `sphinx` to manage the documentation and use the `furo` theme.
To get started, simply run
```bash
pip install -r requirements-docs.txt
```
Then run `make html` or `sphinx-build -M html source build` and it will compile the docs and put them under the `build/html` directory.
## Translation
The documentation is available in both English and Simplified Chinese. We use
`sphinx-intl` to work with Sphinx translation flow, following [this article](https://www.sphinx-doc.org/en/master/usage/advanced/intl.html).
You need to install the Python package `sphinx-intl` before starting.
1. After updating the English documentation, run `make gettext`, and the pot files will be placed in the `build/gettext` directory. `make gettext` can be slow if the doc is long.
2. Use the generated pot files to update the po files:
```bash
sphinx-intl update -p build/gettext -l zh_CN -w 0
```
3. Translate the po files in `locales/zh_CN/LC_MESSAGES`. Pay attention to fuzzy matches (messages after `#, fuzzy`). Please be careful not to break reST notation.
4. Build translated document: `make -e SPHINXOPTS="-D language='zh_CN'" html` or `sphinx-build -M html source build -D language=zh_CN`
## Auto Build
```bash
pip install sphinx-autobuild
```
To autobuild the default version:
```bash
sphinx-autobuild source build/html
```
To autobuild the translated version:
```bash
sphinx-autobuild source build/html -D language=zh_CN --watch locales/zh_CN
```
By default, the doc is at `http://127.0.0.1:8000`
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2024, Qwen Team
# This file is distributed under the same license as the Qwen package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2024.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: Qwen \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-04-28 19:42+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.17.0\n"
#: ../../Qwen/source/deployment/openllm.rst:2 986ea00cb5af4a0d82f974ed79a82430
msgid "OpenLLM"
msgstr "OpenLLM"
#: ../../Qwen/source/deployment/openllm.rst:5 78be03fbdccb429892b03bf84596411b
msgid "To be updated for Qwen3."
msgstr "仍需为Qwen3更新。"
#: ../../Qwen/source/deployment/openllm.rst:7 a001f11d1c5440188121d20b3baf59db
msgid "OpenLLM allows developers to run Qwen2.5 models of different sizes as OpenAI-compatible APIs with a single command. It features a built-in chat UI, state-of-the-art inference backends, and a simplified workflow for creating enterprise-grade cloud deployment with Qwen2.5. Visit `the OpenLLM repository <https://github.com/bentoml/OpenLLM/>`_ to learn more."
msgstr "OpenLLM 允许开发者通过一个命令运行不同大小的 Qwen2.5 模型,提供 OpenAI 兼容的 API。它具有内置的聊天 UI,先进的推理后端,以及简化的工作流程来使用 Qwen2.5 创建企业级云部署。访问 `OpenLLM 仓库 <https://github.com/bentoml/OpenLLM/>`_ 了解更多信息。"
#: ../../Qwen/source/deployment/openllm.rst:10 229f89c3be65442bbe15905d75a0d13d
msgid "Installation"
msgstr "安装"
#: ../../Qwen/source/deployment/openllm.rst:12 79421f700fbc426cb6ce9841aff67503
msgid "Install OpenLLM using ``pip``."
msgstr "使用 ``pip`` 安装 OpenLLM。"
#: ../../Qwen/source/deployment/openllm.rst:18 69cfd6fe2e274173ad4065be91b71472
msgid "Verify the installation and display the help information:"
msgstr "验证安装并显示帮助信息:"
#: ../../Qwen/source/deployment/openllm.rst:25 503cae99b14c4ef4b322b8ec0bd2d32d
msgid "Quickstart"
msgstr "快速开始"
#: ../../Qwen/source/deployment/openllm.rst:27 0ea788c801404d8780404611c87644b0
msgid "Before you run any Qwen2.5 model, ensure your model repository is up to date by syncing it with OpenLLM's latest official repository."
msgstr "在运行任何 Qwen2.5 模型之前,确保您的模型仓库与 OpenLLM 的最新官方仓库同步。"
#: ../../Qwen/source/deployment/openllm.rst:33 8852ff46ecdb45b2bfc9885bbfaacb02
msgid "List the supported Qwen2.5 models:"
msgstr "列出支持的 Qwen2.5 模型:"
#: ../../Qwen/source/deployment/openllm.rst:39 3e4f6c11396844adb30d4e5812339484
msgid "The results also display the required GPU resources and supported platforms:"
msgstr "结果还会显示所需的 GPU 资源和支持的平台:"
#: ../../Qwen/source/deployment/openllm.rst:57 ac4c0db02f5249d5882940820779db9a
msgid "To start a server with one of the models, use ``openllm serve`` like this:"
msgstr "要使用其中一个模型来启动服务器,请使用 ``openllm serve`` 命令,例如:"
#: ../../Qwen/source/deployment/openllm.rst:63 0a1d3ec35c684e3bb3e971c916aa9be7
msgid "By default, the server starts at ``http://localhost:3000/``."
msgstr "默认情况下,服务器启动在 http://localhost:3000/。"
#: ../../Qwen/source/deployment/openllm.rst:66 2e787de9a62f4342bdf8f88ee0df5379
msgid "Interact with the model server"
msgstr "与模型服务器交互"
#: ../../Qwen/source/deployment/openllm.rst:68 b22802ad9027458bb30ea0da665fea36
msgid "With the model server up and running, you can call its APIs in the following ways:"
msgstr "服务器运行后,可以通过以下方式调用其 API:"
#: ../../Qwen/source/deployment/openllm.rst 76214ea690094930899d6f2eddcc1454
msgid "CURL"
msgstr "CURL"
#: ../../Qwen/source/deployment/openllm.rst:74 42775a3df58f474782d29f2f82707bd9
msgid "Send an HTTP request to its ``/generate`` endpoint via CURL:"
msgstr "通过 CURL 向其 ``/generate`` 端点发送 HTTP 请求:"
#: ../../Qwen/source/deployment/openllm.rst 4f0ff3eee2ab49dda5a72bd611a9d45e
msgid "Python client"
msgstr "Python 客户端"
#: ../../Qwen/source/deployment/openllm.rst:91 ce2e11a46e434798947b1e74ce82a19c
msgid "Call the OpenAI-compatible endpoints with frameworks and tools that support the OpenAI API protocol. Here is an example:"
msgstr "使用支持 OpenAI API 协议的框架和工具来调用。例如:"
#: ../../Qwen/source/deployment/openllm.rst 107921d1a855430ca70c8c163d37c7f2
msgid "Chat UI"
msgstr "聊天 UI"
#: ../../Qwen/source/deployment/openllm.rst:118
#: b92df2759cd54c2b8316e2a160ede656
msgid "OpenLLM provides a chat UI at the ``/chat`` endpoint for the LLM server at http://localhost:3000/chat."
msgstr "OpenLLM 为 LLM 服务器提供的聊天 UI 位于 ``/chat`` 端点,地址为 http://localhost:3000/chat。"
#: ../../Qwen/source/deployment/openllm.rst:123
#: 0d3fa679178f443caf9c87623001be1f
msgid "Model repository"
msgstr "模型仓库"
#: ../../Qwen/source/deployment/openllm.rst:125
#: 54d6a9bdcc064aeb95a23b60d3d575ab
msgid "A model repository in OpenLLM represents a catalog of available LLMs. You can add your own repository to OpenLLM with custom Qwen2.5 variants for your specific needs. See our `documentation to learn details <https://github.com/bentoml/OpenLLM?tab=readme-ov-file#model-repository>`_."
msgstr "OpenLLM 中的模型仓库表示可用的 LLM 目录。您可以为 OpenLLM 添加自定义的 Qwen2.5 模型仓库,以满足您的特定需求。请参阅 `我们的文档 <https://github.com/bentoml/OpenLLM?tab=readme-ov-file#model-repository>`_ 了解详细信息。"
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2024, Qwen Team
# This file is distributed under the same license as the Qwen package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2025.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: Qwen \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-04-28 19:42+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.17.0\n"
#: ../../Qwen/source/deployment/sglang.md:1 4886c9be510e44ba968bba79c7e01e2b
msgid "SGLang"
msgstr ""
#: ../../Qwen/source/deployment/sglang.md:3 fa388b3c599c454bbe22dc7c831723c1
msgid "[SGLang](https://github.com/sgl-project/sglang) is a fast serving framework for large language models and vision language models."
msgstr "[SGLang](https://github.com/sgl-project/sglang) 是一个用于大型语言模型和视觉语言模型的快速推理框架。"
#: ../../Qwen/source/deployment/sglang.md:5 43fe1ab3622b4d619de1ba451ff5b5c4
msgid "To learn more about SGLang, please refer to the [documentation](https://docs.sglang.ai/)."
msgstr "要了解更多关于 SGLang 的信息,请参阅[官方文档](https://docs.sglang.ai/)。"
#: ../../Qwen/source/deployment/sglang.md:7 4e7093847f104f5c91bf12495db0e2df
msgid "Environment Setup"
msgstr "环境配置"
#: ../../Qwen/source/deployment/sglang.md:9 404501b6bb754a01afa398ce270f4ad6
msgid "By default, you can install `sglang` with pip in a clean environment:"
msgstr "默认情况下,你可以通过 pip 在新环境中安装 `sglang` : "
#: ../../Qwen/source/deployment/sglang.md:15 8794cc70acd141eeaef4717a190b11f4
msgid "Please note that `sglang` relies on `flashinfer-python` and has strict dependencies on `torch` and its CUDA versions. Check the note in the official document for installation ([link](https://docs.sglang.ai/start/install.html)) for more help."
msgstr "请留意预构建的 `sglang` 依赖 `flashinfer-python`,并对`torch`和其CUDA版本有强依赖。请查看[官方文档](https://docs.sglang.ai/start/install.html)中的注意事项以获取有关安装的帮助。"
#: ../../Qwen/source/deployment/sglang.md:18 06e04edfe3094363bcbc5b8758c8b16c
msgid "API Service"
msgstr "API 服务"
#: ../../Qwen/source/deployment/sglang.md:20 5969d8121d8a4af99d790844c4b348c5
msgid "It is easy to build an OpenAI-compatible API service with SGLang, which can be deployed as a server that implements OpenAI API protocol. By default, it starts the server at `http://localhost:30000`. You can specify the address with `--host` and `--port` arguments. Run the command as shown below:"
msgstr "借助 SGLang ,构建一个与OpenAI API兼容的API服务十分简便,该服务可以作为实现OpenAI API协议的服务器进行部署。默认情况下,它将在 `http://localhost:30000` 启动服务器。您可以通过 `--host` 和 `--port` 参数来自定义地址。请按照以下所示运行命令:"
#: ../../Qwen/source/deployment/sglang.md:28 32a52bb639634b9b9c196696dc20e2c5
msgid "By default, if the `--model-path` does not point to a valid local directory, it will download the model files from the HuggingFace Hub. To download model from ModelScope, set the following before running the above command:"
msgstr "默认情况下,如果模型未指向有效的本地目录,它将从 HuggingFace Hub 下载模型文件。要从 ModelScope 下载模型,请在运行上述命令之前设置以下内容:"
#: ../../Qwen/source/deployment/sglang.md:34 cd33984af3e045549668c8ad682f7612
msgid "For distrbiuted inference with tensor parallelism, it is as simple as"
msgstr "对于使用张量并行的分布式推理,操作非常简单:"
#: ../../Qwen/source/deployment/sglang.md:38 4db95581ffd046a9b6d532933403d985
msgid "The above command will use tensor parallelism on 4 GPUs. You should change the number of GPUs according to your demand."
msgstr "上述命令将在 4 块 GPU 上使用张量并行。您应根据需求调整 GPU 的数量。"
#: ../../Qwen/source/deployment/sglang.md:41 765cc12e934b4ab6881f0a71693fcc3d
msgid "Basic Usage"
msgstr "基本用法"
#: ../../Qwen/source/deployment/sglang.md:43 51032557dac94cb3b14c3842076192a8
msgid "Then, you can use the [create chat interface](https://platform.openai.com/docs/api-reference/chat/completions/create) to communicate with Qwen:"
msgstr "然后,您可以利用 [create chat interface](https://platform.openai.com/docs/api-reference/chat/completions/create) 来与Qwen进行对话:"
#: ../../Qwen/source/deployment/sglang.md 0708e8d2e6a44e94956e44f3a83bb4d8
#: 3bfc1bfe04ea4b49bfd1d5c6b5af52d7
msgid "curl"
msgstr ""
#: ../../Qwen/source/deployment/sglang.md 3b62fc6e456d44a6ba9cc8f5519fc3c6
#: ab964c5641584f7a9ef4252ecf0428cb
msgid "Python"
msgstr ""
#: ../../Qwen/source/deployment/sglang.md:63
#: ../../Qwen/source/deployment/sglang.md:130 18da82bbe0db4a59aa430b68b91db904
#: a2fccb4d7c164911a35e5ff6f30d98df
msgid "You can use the API client with the `openai` Python SDK as shown below:"
msgstr "或者您可以如下面所示使用 `openai` Python SDK中的 API 客户端:"
#: ../../Qwen/source/deployment/sglang.md:91 2bff40bb9f104cf2b19e6cf8169bf18d
msgid "While the default sampling parameters would work most of the time for thinking mode, it is recommended to adjust the sampling parameters according to your application, and always pass the sampling parameters to the API."
msgstr "虽然默认的采样参数在大多数情况下适用于思考模式,但建议根据您的应用调整采样参数,并始终将采样参数传递给 API。"
#: ../../Qwen/source/deployment/sglang.md:97 e10b0bbcaa7c4e54a59f7a30fa8760ef
msgid "Thinking & Non-Thinking Modes"
msgstr "思考与非思考模式"
#: ../../Qwen/source/deployment/sglang.md:100 ff0a121d43d5494597e8fc3b832f4893
msgid "This feature has not been released. For more information, please see this [pull request](https://github.com/sgl-project/sglang/pull/5551)."
msgstr "此功能尚未发布。更多信息,请参阅此[pull request](https://github.com/sgl-project/sglang/pull/5551)。"
#: ../../Qwen/source/deployment/sglang.md:104 8ba3c8c378ed4df7acb28f04e41bf067
msgid "Qwen3 models will think before respond. This behaviour could be controled by either the hard switch, which could disable thinking completely, or the soft switch, where the model follows the instruction of the user on whether or not it should think."
msgstr "Qwen3 模型会在回复前进行思考。这种行为可以通过硬开关(完全禁用思考)或软开关(模型遵循用户关于是否应该思考的指令)来控制。"
#: ../../Qwen/source/deployment/sglang.md:107 dcc39b3925704aee927b220cbf9b341d
msgid "The hard switch is availabe in SGLang through the following configuration to the API call. To disable thinking, use"
msgstr "硬开关在 vLLM 中可以通过以下 API 调用配置使用。要禁用思考,请使用"
#: ../../Qwen/source/deployment/sglang.md:159 952c90f4f1c84daba9cb66bfeb32725f
msgid "It is recommended to set sampling parameters differently for thinking and non-thinking modes."
msgstr "建议为思考模式和非思考模式分别设置不同的采样参数。"
#: ../../Qwen/source/deployment/sglang.md:162 750cee1281d74246bc7cf47ac9e0d502
msgid "Parsing Thinking Content"
msgstr "解析思考内容"
#: ../../Qwen/source/deployment/sglang.md:164 4f1f6c5d59134ea1bf6a625cd5081c51
msgid "SGLang supports parsing the thinking content from the model generation into structured messages:"
msgstr "SGLang 支持将模型生成的思考内容解析为结构化消息:"
#: ../../Qwen/source/deployment/sglang.md:169 0517d0a9cf694f6caabcbe69e3e1e845
msgid "The response message will have a field named `reasoning_content` in addition to `content`, containing the thinking content generated by the model."
msgstr "响应消息除了包含 `content` 字段外,还会有一个名为 `reasoning_content` 的字段,其中包含模型生成的思考内容。"
#: ../../Qwen/source/deployment/sglang.md:172 0225706aa7fe441c82d34f81b348fd42
msgid "Please note that this feature is not OpenAI API compatible."
msgstr "请注意,此功能与 OpenAI API 规范不一致。"
#: ../../Qwen/source/deployment/sglang.md:175 45a5f606e86543c08eacf7686b5a2def
msgid "Parsing Tool Calls"
msgstr "解析工具调用"
#: ../../Qwen/source/deployment/sglang.md:177 0aa3be18c7a5476cb915d6686c58387d
msgid "SGLang supports parsing the tool calling content from the model generation into structured messages:"
msgstr "SGLang 支持将模型生成的工具调用内容解析为结构化消息:"
#: ../../Qwen/source/deployment/sglang.md:182 dc096c7fb79c4b9ca0dd2c9cdd7ec890
msgid "For more information, please refer to [our guide on Function Calling](../framework/function_call.md)."
msgstr "详细信息,请参阅[函数调用的指南](../framework/function_call.md#vllm)。"
#: ../../Qwen/source/deployment/sglang.md:184 a58bba52efc44663af792d859bd3b410
msgid "Structured/JSON Output"
msgstr "结构化/JSON输出"
#: ../../Qwen/source/deployment/sglang.md:186 518f257e9d6d4080b41f980467573f7f
msgid "SGLang supports structured/JSON output. Please refer to [SGLang's documentation](https://docs.sglang.ai/backend/structured_outputs.html#OpenAI-Compatible-API). Besides, it is also recommended to instruct the model to generate the specific format in the system message or in your prompt."
msgstr "SGLang 支持结构化/JSON 输出。请参阅[SGLan文档](https://docs.sglang.ai/backend/structured_outputs.html#OpenAI-Compatible-API)。此外,还建议在系统消息或您的提示中指示模型生成特定格式。"
#: ../../Qwen/source/deployment/sglang.md:190 3a6d08a831584d6b8392da2650e8bf0b
msgid "Serving Quantized models"
msgstr "部署量化模型"
#: ../../Qwen/source/deployment/sglang.md:192 b2fe212a02b84349940f4c0c30cde88d
msgid "Qwen3 comes with two types of pre-quantized models, FP8 and AWQ."
msgstr "Qwen3 提供了两种类型的预量化模型:FP8 和 AWQ。"
#: ../../Qwen/source/deployment/sglang.md:194 efc85fc46a564483bdb872dbf5d61f3c
msgid "The command serving those models are the same as the original models except for the name change:"
msgstr "部署这些模型的命令与原始模型相同,只是名称有所更改:"
#: ../../Qwen/source/deployment/sglang.md:203 11a6d1bb983d4e60a55f5d579f1eb76b
msgid "Context Length"
msgstr "上下文长度"
#: ../../Qwen/source/deployment/sglang.md:205 de0293719e06477fbde6afc533973b1a
msgid "The context length for Qwen3 models in pretraining is up to 32,768 tokenns. To handle context length substantially exceeding 32,768 tokens, RoPE scaling techniques should be applied. We have validated the performance of [YaRN](https://arxiv.org/abs/2309.00071), a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts."
msgstr "Qwen3 模型在预训练中的上下文长度最长为 32,768 个 token。为了处理显著超过 32,768 个 token 的上下文长度,应应用 RoPE 缩放技术。我们已经验证了 [YaRN](https://arxiv.org/abs/2309.00071) 的性能,这是一种增强模型长度外推的技术,可确保在长文本上的最佳性能。"
#: ../../Qwen/source/deployment/sglang.md:209 0ee16aabbc794331a329e52ab2ca40e7
msgid "SGLang supports YaRN, which can be configured as"
msgstr "SGLang 支持 YaRN,可以配置为"
#: ../../Qwen/source/deployment/sglang.md:215 c3ba0a9b3502462795dfd887912e9357
msgid "SGLang implements static YaRN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts.** We advise adding the `rope_scaling` configuration only when processing long contexts is required. It is also recommended to modify the `factor` as needed. For example, if the typical context length for your application is 65,536 tokens, it would be better to set `factor` as 2.0."
msgstr "SGLang 实现了静态 YaRN,这意味着无论输入长度如何,缩放因子都保持不变,**这可能会对较短文本的性能产生影响。** 我们建议仅在需要处理长上下文时添加 `rope_scaling` 配置。还建议根据需要调整 `factor`。例如,如果您的应用程序的典型上下文长度为 65,536 个 token,则最好将 `factor` 设置为 2.0。"
#: ../../Qwen/source/deployment/sglang.md:221 398c3e38c94e446aa9922dd04dce609c
msgid "The default `max_position_embeddings` in `config.json` is set to 40,960, which is used by SGLang. This allocation includes reserving 32,768 tokens for outputs and 8,192 tokens for typical prompts, which is sufficient for most scenarios involving short text processing and leave adequate room for model thinking. If the average context length does not exceed 32,768 tokens, we do not recommend enabling YaRN in this scenario, as it may potentially degrade model performance."
msgstr "`config.json` 中的默认 `max_position_embeddings` 被设置为 40,960,SGLang 将使用该值。此分配包括为输出保留 32,768 个 token,为典型提示保留 8,192 个 token,这足以应对大多数涉及短文本处理的场景,并为模型思考留出充足空间。如果平均上下文长度不超过 32,768 个 token,我们不建议在此场景中启用 YaRN,因为这可能会降低模型性能。"
# Copyright (C) 2024, Qwen Team, Alibaba Group.
# This file is distributed under the same license as the Qwen package.
#
msgid ""
msgstr ""
"Project-Id-Version: Qwen \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-04-28 19:42+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.17.0\n"
#: ../../Qwen/source/deployment/skypilot.rst:2 795ad4f30e27494d93675f71bb1a5cc4
msgid "SkyPilot"
msgstr ""
#: ../../Qwen/source/deployment/skypilot.rst:5 aad807db94a24d868c9c1b364b47e152
msgid "To be updated for Qwen3."
msgstr "仍需为Qwen3更新。"
#: ../../Qwen/source/deployment/skypilot.rst:8 d6bbf736584f4bbfa9c300d50a2ed669
msgid "What is SkyPilot"
msgstr "SkyPilot 是什么"
#: ../../Qwen/source/deployment/skypilot.rst:10
#: b66facae41bf493880e43044e2915a45
msgid "SkyPilot is a framework for running LLMs, AI, and batch jobs on any cloud, offering maximum cost savings, the highest GPU availability, and managed execution. Its features include:"
msgstr "SkyPilot 是一个可以在任何云上运行 LLM 、 AI 应用以及批量任务的框架,旨在实现最大程度的成本节省、最高的 GPU 可用性以及受管理的执行过程。其特性包括:"
#: ../../Qwen/source/deployment/skypilot.rst:14
#: 621f021163c549d0aadb1c911a3a3ef5
msgid "Get the best GPU availability by utilizing multiple resources pools across multiple regions and clouds."
msgstr "通过跨区域和跨云充分利用多个资源池,以获得最佳的 GPU 可用性。"
#: ../../Qwen/source/deployment/skypilot.rst:16
#: ea1723c3b5be454cad3219836f4386d8
msgid "Pay absolute minimum — SkyPilot picks the cheapest resources across regions and clouds. No managed solution markups."
msgstr "把费用降到最低—— SkyPilot 在各区域和云平台中为您挑选最便宜的资源。无需任何托管解决方案的额外加价。"
#: ../../Qwen/source/deployment/skypilot.rst:18
#: e479693ecf08411ca35d8d0727c8f441
msgid "Scale up to multiple replicas across different locations and accelerators, all served with a single endpoint"
msgstr "将服务扩展到多个副本上,所有副本通过单一 endpoint 对外提供服务"
#: ../../Qwen/source/deployment/skypilot.rst:20
#: 1f9cdd2ae2544d1faa8a4c463ee0e42c
msgid "Everything stays in your cloud account (your VMs & buckets)"
msgstr "所有内容均保存在您的云账户中(包括您的虚拟机和 bucket )"
#: ../../Qwen/source/deployment/skypilot.rst:21
#: 5bb9b617764942d989e5093463a359f0
msgid "Completely private - no one else sees your chat history"
msgstr "完全私密 - 没有其他人能看到您的聊天记录"
#: ../../Qwen/source/deployment/skypilot.rst:24
#: cf0c456ac72f40ac98790c11dc243317
msgid "Install SkyPilot"
msgstr "安装 SkyPilot"
#: ../../Qwen/source/deployment/skypilot.rst:26
#: 78d86c1fa8104b138b01aed640b262fc
msgid "We advise you to follow the `instruction <https://skypilot.readthedocs.io/en/latest/getting-started/installation.html>`__ to install SkyPilot. Here we provide a simple example of using ``pip`` for the installation as shown below."
msgstr "我们建议您按照 `指示 <https://skypilot.readthedocs.io/en/latest/getting-started/installation.html>`__ 安装 SkyPilot 。以下为您提供了一个使用 ``pip`` 进行安装的简单示例:"
#: ../../Qwen/source/deployment/skypilot.rst:38
#: a7c88265bf404f55b85388c81a240199
msgid "After that, you need to verify cloud access with a command like:"
msgstr "随后,您需要用如下命令确认是否能使用云:"
#: ../../Qwen/source/deployment/skypilot.rst:44
#: 72025dfba0144f63a720f6da0dd39bfa
msgid "For more information, check the `official document <https://skypilot.readthedocs.io/en/latest/getting-started/installation.html>`__ and see if you have set up your cloud accounts correctly."
msgstr "若需更多信息,请查阅官方文档,确认您的云账户设置是否正确无误。"
#: ../../Qwen/source/deployment/skypilot.rst:47
#: 61be006061554e5ea40d55497e11e192
msgid "Alternatively, you can also use the official docker image with SkyPilot master branch automatically cloned by running:"
msgstr "或者,您也可以使用官方提供的 docker 镜像,可以自动克隆 SkyPilot 的主分支:"
#: ../../Qwen/source/deployment/skypilot.rst:63
#: 4ae89fb44c6643a3a82fca5cee622af4
msgid "Running Qwen2.5-72B-Instruct with SkyPilot"
msgstr "使用 SkyPilot 运行 Qwen2.5-72B-Instruct "
#: ../../Qwen/source/deployment/skypilot.rst:65
#: 1bc4973c2eb745689ded0af54ba33e0e
msgid "Start serving Qwen2.5-72B-Instruct on a single instance with any available GPU in the list specified in `serve-72b.yaml <https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/serve-72b.yaml>`__ with a vLLM-powered OpenAI-compatible endpoint:"
msgstr "`serve-72b.yaml <https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/serve-72b.yaml>`__ 中列出了支持的 GPU 。您可使用配备这类 GPU 的单个运算实例来部署 Qwen2.5-72B-Instruct 服务。该服务由 vLLM 搭建,并与 OpenAI API 兼容。以下为部署方法:"
#: ../../Qwen/source/deployment/skypilot.rst:74
#: ../../Qwen/source/deployment/skypilot.rst:123
#: ac3692ed16974facbd58b6886cd111af b325de015e7b4bb0a91491d3f7418792
msgid "**Before launching, make sure you have changed Qwen/Qwen2-72B-Instruct to Qwen/Qwen2.5-72B-Instruct in the YAML file.**"
msgstr "**在启动之前,请先将 YAML 文件中的 Qwen/Qwen2-72B-Instruct 修改为 Qwen/Qwen2.5-72B-Instruct。**"
#: ../../Qwen/source/deployment/skypilot.rst:76
#: 6046b3c86fae4a43878fbadbeb33fbd8
msgid "Send a request to the endpoint for completion:"
msgstr "向该 endpoint 发送续写请求:"
#: ../../Qwen/source/deployment/skypilot.rst:90
#: 2ec56c2028a94f568fd2c1a65063d25a
msgid "Send a request for chat completion:"
msgstr "向该 endpoint 发送对话续写请求"
#: ../../Qwen/source/deployment/skypilot.rst:112
#: c8e140ddfd914ff5a460621a7ca1891e
msgid "Scale up the service with SkyPilot Serve"
msgstr "使用 SkyPilot Serve 扩展服务规模"
#: ../../Qwen/source/deployment/skypilot.rst:114
#: 0db304ab396d45adb6017d78cd1ee4a2
msgid "With `SkyPilot Serve <https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html>`__, a serving library built on top of SkyPilot, scaling up the Qwen service is as simple as running:"
msgstr "使用 `SkyPilot Serve <https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html>`__ 扩展 Qwen 的服务规模非常容易,只需运行:"
#: ../../Qwen/source/deployment/skypilot.rst:125
#: 25bbbf9e49be44d3899074ff97202d71
msgid "This will start the service with multiple replicas on the cheapest available locations and accelerators. SkyServe will automatically manage the replicas, monitor their health, autoscale based on load, and restart them when needed."
msgstr "这将启动服务,使用多个副本部署在最经济的可用位置和加速器上。 SkyServe 将自动管理这些副本,监控其健康状况,根据负载进行自动伸缩,并在必要时重启它们。"
#: ../../Qwen/source/deployment/skypilot.rst:130
#: bda628bab7ef41a0918dc4b80a9b3cfe
msgid "A single endpoint will be returned and any request sent to the endpoint will be routed to the ready replicas."
msgstr "将返回一个 endpoint ,所有发送至该endpoint的请求都将被路由至就绪状态的副本。"
#: ../../Qwen/source/deployment/skypilot.rst:133
#: b232dbbdcf674d56bcf9c0331c020864
msgid "To check the status of the service, run:"
msgstr "运行如下命令检查服务的状态:"
#: ../../Qwen/source/deployment/skypilot.rst:139
#: 556b854caf7243fb93f253ebe2dc9033
msgid "After a while, you will see the following output:"
msgstr "很快,您将看到如下输出:"
#: ../../Qwen/source/deployment/skypilot.rst:152
#: 5a6055c5a42c4b2db6693c1095688de8
msgid "As shown, the service is now backed by 2 replicas, one on Azure and one on GCP, and the accelerator type is chosen to be **the cheapest available one** on the clouds. That said, it maximizes the availability of the service while minimizing the cost."
msgstr "如下所示:该服务现由两个副本提供支持,一个位于 Azure 平台,另一个位于 GCP 平台。同时,已为服务选择云服务商提供的 **最经济实惠** 的加速器类型。这样既最大限度地提升了服务的可用性,又尽可能降低了成本。"
#: ../../Qwen/source/deployment/skypilot.rst:157
#: a18533d33dc54a1091ded0b4bba0a1eb
msgid "To access the model, we use a ``curl -L`` command (``-L`` to follow redirect) to send the request to the endpoint:"
msgstr "要访问模型,我们使用带有 ``curl -L`` (用于跟随重定向),将请求发送到 endpoint :"
#: ../../Qwen/source/deployment/skypilot.rst:182
#: 34cd50fd79e24d8895075f7841b025e4
msgid "Accessing Qwen2.5 with Chat GUI"
msgstr "使用 Chat GUI 调用 Qwen2.5"
#: ../../Qwen/source/deployment/skypilot.rst:184
#: ca6994cda1cb469e83ce8c026bb67e42
msgid "It is also possible to access the Qwen2.5 service with GUI by connecting a `FastChat GUI server <https://github.com/lm-sys/FastChat>`__ to the endpoint launched above (see `gui.yaml <https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/gui.yaml>`__)."
msgstr "可以通过 `FastChat <https://github.com/lm-sys/FastChat>`__ 来使用 GUI 调用 Qwen2.5 的服务:"
#: ../../Qwen/source/deployment/skypilot.rst:188
#: 99a63e55ab5c46258c20ab89cdfa39dc
msgid "Start the Chat Web UI:"
msgstr "开启一个 Chat Web UI"
#: ../../Qwen/source/deployment/skypilot.rst:194
#: e61593a092c146f8a06af896d6af17f2
msgid "**Before launching, make sure you have changed Qwen/Qwen1.5-72B-Chat to Qwen/Qwen2.5-72B-Instruct in the YAML file.**"
msgstr "**在启动之前,请先将 YAML 文件中的 Qwen/Qwen1.5-72B-Chat 修改为 Qwen/Qwen2.5-72B-Instruct。**"
#: ../../Qwen/source/deployment/skypilot.rst:196
#: 9631068a8b424aa8af6dc6911daac7a9
msgid "Then, we can access the GUI at the returned gradio link:"
msgstr "随后,我们可以通过返回的 gradio 链接来访问 GUI :"
#: ../../Qwen/source/deployment/skypilot.rst:202
#: 1464a56dcd06404aafbe6d7d2c72212b
msgid "Note that you may get better results by using a different temperature and top_p value."
msgstr "你可以通过使用不同的温度和 top_p 值来尝试取得更好的结果。"
#: ../../Qwen/source/deployment/skypilot.rst:205
#: d257f49d835e4c12b28bc680bb78a9cb
msgid "Summary"
msgstr "总结"
#: ../../Qwen/source/deployment/skypilot.rst:207
#: 06b9684a19774eaba4f69862332c5166
msgid "With SkyPilot, it is easy for you to deploy Qwen2.5 on any cloud. We advise you to read the official doc for more usages and updates. Check `this <https://skypilot.readthedocs.io/>`__ out!"
msgstr "通过 SkyPilot ,你可以轻松地在任何云上部署 Qwen2.5 。我们建议您阅读 `官方文档 <https://skypilot.readthedocs.io/>`__ 了解更多用法和最新进展。"
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2024, Qwen Team
# This file is distributed under the same license as the Qwen package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2024.
#
msgid ""
msgstr ""
"Project-Id-Version: Qwen \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-04-28 19:42+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.17.0\n"
#: ../../Qwen/source/deployment/tgi.rst:2 2abcc96f9deb4b9187ac9d88fc69e929
msgid "TGI"
msgstr ""
#: ../../Qwen/source/deployment/tgi.rst:5 2d124d7cb95f47388aa48c662932ef9b
msgid "To be updated for Qwen3."
msgstr "仍需为Qwen3更新。"
#: ../../Qwen/source/deployment/tgi.rst:7 4e5d299c4fdd46d5aba38c9af5765792
msgid "Hugging Face's Text Generation Inference (TGI) is a production-ready framework specifically designed for deploying and serving large language models (LLMs) for text generation tasks. It offers a seamless deployment experience, powered by a robust set of features:"
msgstr "Hugging Face 的 Text Generation Inference (TGI) 是一个专为部署大规模语言模型 (Large Language Models, LLMs) 而设计的生产级框架。TGI提供了流畅的部署体验,并稳定支持如下特性:"
#: ../../Qwen/source/deployment/tgi.rst:9 ecd4fc11a95140959915d062791ceba1
msgid "`Speculative Decoding <Speculative Decoding_>`_: Accelerates generation speeds."
msgstr "`推测解码 (Speculative Decoding) <Speculative Decoding_>`_ :提升生成速度。"
#: ../../Qwen/source/deployment/tgi.rst:10 84590a56416348bf85b3f296cf57e257
msgid "`Tensor Parallelism`_: Enables efficient deployment across multiple GPUs."
msgstr "张量并行 (`Tensor Parallelism`_) :高效多卡部署。"
#: ../../Qwen/source/deployment/tgi.rst:11 a996d6ecd7b94c5cb9752d370f29a9b1
msgid "`Token Streaming`_: Allows for the continuous generation of text."
msgstr "流式生成 (`Token Streaming`_) :支持持续性生成文本。"
#: ../../Qwen/source/deployment/tgi.rst:12 8f591c045ba34f4581bb19652db9f9b3
msgid "Versatile Device Support: Works seamlessly with `AMD`_, `Gaudi`_ and `AWS Inferentia`_."
msgstr "灵活的硬件支持:与 `AMD`_ , `Gaudi`_ 和 `AWS Inferentia`_ 无缝衔接。"
#: ../../Qwen/source/deployment/tgi.rst:21 5e8a98b91fc146e0b581422faa683a18
msgid "Installation"
msgstr "安装"
#: ../../Qwen/source/deployment/tgi.rst:23 684ef25bfb0e460999d6dcccce41b85f
msgid "The easiest way to use TGI is via the TGI docker image. In this guide, we show how to use TGI with docker."
msgstr "通过 TGI docker 镜像使用 TGI 轻而易举。本文将主要介绍 TGI 的 docker 用法。"
#: ../../Qwen/source/deployment/tgi.rst:25 c563fa3eccb04d00a477c1d2e8b15c38
msgid "It's possible to run it locally via Conda or build locally. Please refer to `Installation Guide <https://huggingface.co/docs/text-generation-inference/installation>`_ and `CLI tool <https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/using_cli>`_ for detailed instructions."
msgstr "也可通过 Conda 实机安装或搭建服务。请参考 `Installation Guide <https://huggingface.co/docs/text-generation-inference/installation>`_ 与 `CLI tool <https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/using_cli>`_ 以了解详细说明。"
#: ../../Qwen/source/deployment/tgi.rst:28 b55fc58ff4cb472abca08296409c7837
msgid "Deploy Qwen2.5 with TGI"
msgstr "通过 TGI 部署 Qwen2.5"
#: ../../Qwen/source/deployment/tgi.rst:30 586a8425ec5d413592fd7daf579c7e87
msgid "**Find a Qwen2.5 Model:** Choose a model from `the Qwen2.5 collection <https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e>`_."
msgstr "**选定 Qwen2.5 模型:** 从 `the Qwen2.5 collection <https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e>`_ 中挑选模型。"
#: ../../Qwen/source/deployment/tgi.rst:31 50fcab8da35941eca308786979dbaf38
msgid "**Deployment Command:** Run the following command in your terminal, replacing ``model`` with your chosen Qwen2.5 model ID and ``volume`` with the path to your local data directory:"
msgstr "**部署TGI服务:** 在终端中运行以下命令,注意替换 ``model`` 为选定的 Qwen2.5 模型 ID 、 ``volume`` 为本地的数据路径: "
#: ../../Qwen/source/deployment/tgi.rst:42 2a800533a7d84bdeab1da0976b0cab53
msgid "Using TGI API"
msgstr "使用 TGI API"
#: ../../Qwen/source/deployment/tgi.rst:44 f05d1ec08140452782d0659543fad7d1
msgid "Once deployed, the model will be available on the mapped port (8080)."
msgstr "一旦成功部署,API 将于选定的映射端口 (8080) 提供服务。"
#: ../../Qwen/source/deployment/tgi.rst:46 f265dc1522b049c98ba31fd5d255c50f
msgid "TGI comes with a handy API for streaming response:"
msgstr "TGI 提供了简单直接的 API 支持流式生成:"
#: ../../Qwen/source/deployment/tgi.rst:54 e9cc4c0571b74bd08b2a59347503e653
msgid "It's also available on OpenAI style API:"
msgstr "也可使用 OpenAI 风格的 API 使用 TGI :"
#: ../../Qwen/source/deployment/tgi.rst:73 5dc7e9c74fc04483ba8e5dcdd7052020
msgid "The model field in the JSON is not used by TGI, you can put anything."
msgstr "JSON 中的 model 字段不会被 TGI 识别,您可传入任意值。"
#: ../../Qwen/source/deployment/tgi.rst:75 d60f837152014cda8baebc90d65d1cc0
#, python-format
msgid "Refer to the `TGI Swagger UI <https://huggingface.github.io/text-generation-inference/#/Text%20Generation%20Inference/completions>`_ for a complete API reference."
msgstr "完整 API 文档,请查阅 `TGI Swagger UI <https://huggingface.github.io/text-generation-inference/#/Text%20Generation%20Inference/completions>`_ 。"
#: ../../Qwen/source/deployment/tgi.rst:77 b59564031e5548088aef828f9753e337
msgid "You can also use Python API:"
msgstr "你也可以使用 Python 访问 API :"
#: ../../Qwen/source/deployment/tgi.rst:106 62646cecb024479ebfeca5f3063e7322
msgid "Quantization for Performance"
msgstr "量化"
#: ../../Qwen/source/deployment/tgi.rst:108 4a8d39bf37be4820afb230f9a977b431
msgid "Data-dependent quantization (GPTQ and AWQ)"
msgstr "依赖数据的量化方案( GPTQ 与 AWQ )"
#: ../../Qwen/source/deployment/tgi.rst:110 ef2b18f47e4f4f7ebb017be628cb0be9
msgid "Both GPTQ and AWQ models are data-dependent. The official quantized models can be found from `the Qwen2.5 collection`_ and you can also quantize models with your own dataset to make it perform better on your use case."
msgstr "GPTQ 与 AWQ 均依赖数据进行量化。我们提供了预先量化好的模型,请于 `the Qwen2.5 collection`_ 查找。你也可以使用自己的数据集自行量化,以在你的场景中取得更好效果。"
#: ../../Qwen/source/deployment/tgi.rst:112 53d94278a2e3409abb9980ebc7c96c24
msgid "The following shows the command to start TGI with Qwen2.5-7B-Instruct-GPTQ-Int4:"
msgstr "以下是通过 TGI 部署 Qwen2.5-7B-Instruct-GPTQ-Int4 的指令:"
#: ../../Qwen/source/deployment/tgi.rst:122 68ff8a07d0eb40cfa67d79e01adea070
msgid "If the model is quantized with AWQ, e.g. Qwen/Qwen2.5-7B-Instruct-AWQ, please use ``--quantize awq``."
msgstr "如果模型是 AWQ 量化的,如 Qwen/Qwen2.5-7B-Instruct-AWQ ,请使用 ``--quantize awq`` 。"
#: ../../Qwen/source/deployment/tgi.rst:124 b4c3b82b1f2a43a8a02383fd0afbda5f
msgid "Data-agnostic quantization"
msgstr "不依赖数据的量化方案"
#: ../../Qwen/source/deployment/tgi.rst:126 7a6b89c94b72407482b96790f5bbd272
msgid "EETQ on the other side is not data dependent and can be used with any model. Note that we're passing in the original model (instead of a quantized model) with the ``--quantize eetq`` flag."
msgstr "EETQ 是一种不依赖数据的量化方案,可直接用于任意模型。请注意,我们需要传入原始模型,并使用 ``--quantize eetq`` 标志。"
#: ../../Qwen/source/deployment/tgi.rst:138 763166da65924887b3bba99ea4d2baab
msgid "Multi-Accelerators Deployment"
msgstr "多卡部署"
#: ../../Qwen/source/deployment/tgi.rst:140 ddcfcff947894f168c7945ae9c42a579
msgid "Use the ``--num-shard`` flag to specify the number of accelerators. Please also use ``--shm-size 1g`` to enable shared memory for optimal NCCL performance (`reference <https://github.com/huggingface/text-generation-inference?tab=readme-ov-file#a-note-on-shared-memory-shm>`__):"
msgstr "使用 ``--num-shard`` 指定卡书数量。 请务必传入 ``--shm-size 1g`` 让 NCCL 发挥最好性能 (`说明 <https://github.com/huggingface/text-generation-inference?tab=readme-ov-file#a-note-on-shared-memory-shm>`__) :"
#: ../../Qwen/source/deployment/tgi.rst:151 520c46fb404c4ec9bf89280e4a71f1e8
msgid "Speculative Decoding"
msgstr "推测性解码 (Speculative Decoding)"
#: ../../Qwen/source/deployment/tgi.rst:153 74c6b65f76b74d56ad109af9da11f66e
msgid "Speculative decoding can reduce the time per token by speculating on the next token. Use the ``--speculative-decoding`` flag, setting the value to the number of tokens to speculate on (default: 0 for no speculation):"
msgstr "推测性解码 (Speculative Decoding) 通过预先推测下一 token 来节约每 token 需要的时间。使用 ``--speculative-decoding`` 设定预先推测 token 的数量 (默认为0,表示不预先推测):"
#: ../../Qwen/source/deployment/tgi.rst:164 dee05ee0fb1a4f2da42b250192d943f5
msgid "The overall performance of speculative decoding highly depends on the type of task. It works best for code or highly repetitive text."
msgstr "推测性解码的加速效果依赖于任务类型,对于代码或重复性较高的文本生成任务,提速更明显。"
#: ../../Qwen/source/deployment/tgi.rst:166 731f300bc1174589901dd5feb26e8b2f
msgid "More context on speculative decoding can be found `here <https://huggingface.co/docs/text-generation-inference/conceptual/speculation>`__."
msgstr "更多说明可查阅 `此文档 <https://huggingface.co/docs/text-generation-inference/conceptual/speculation>`__ 。"
#: ../../Qwen/source/deployment/tgi.rst:170 65a7d5553dd145398f9705c1ee6c28f0
msgid "Zero-Code Deployment with HF Inference Endpoints"
msgstr "使用 HF Inference Endpoints 零代码部署"
#: ../../Qwen/source/deployment/tgi.rst:172 721c3a7578f846ae8e21e595923e17e7
msgid "For effortless deployment, leverage Hugging Face Inference Endpoints:"
msgstr "使用 Hugging Face Inference Endpoints 不费吹灰之力:"
#: ../../Qwen/source/deployment/tgi.rst:174 7741607488d94a9f8be2ffcb6a5322fb
msgid "**GUI interface:** `<https://huggingface.co/inference-endpoints/dedicated>`__"
msgstr ""
#: ../../Qwen/source/deployment/tgi.rst:175 02ff4520e66f4a42828483da7d25445f
msgid "**Coding interface:** `<https://huggingface.co/blog/tgi-messages-api>`__"
msgstr ""
#: ../../Qwen/source/deployment/tgi.rst:177 d35f9dd4bc96400cb6c7584012d2df49
msgid "Once deployed, the endpoint can be used as usual."
msgstr "一旦部署成功,服务使用与本地无异。"
#: ../../Qwen/source/deployment/tgi.rst:181 61c1b825bbf24be2aaaeb99de3f0660e
msgid "Common Issues"
msgstr "常见问题"
#: ../../Qwen/source/deployment/tgi.rst:183 b55a2d286fc24dbe92b79ab5c010c7af
msgid "Qwen2.5 supports long context lengths, so carefully choose the values for ``--max-batch-prefill-tokens``, ``--max-total-tokens``, and ``--max-input-tokens`` to avoid potential out-of-memory (OOM) issues. If an OOM occurs, you'll receive an error message upon startup. The following shows an example to modify those parameters:"
msgstr "Qwen2.5 支持长上下文,谨慎设定 ``--max-batch-prefill-tokens`` , ``--max-total-tokens`` 和 ``--max-input-tokens`` 以避免 out-of-memory (OOM) 。如 OOM ,你将在启动 TGI 时收到错误提示。以下为修改这些参数的示例:"
# Copyright (C) 2024, Qwen Team, Alibaba Group.
# This file is distributed under the same license as the Qwen package.
#
msgid ""
msgstr ""
"Project-Id-Version: Qwen \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-04-28 19:42+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.17.0\n"
#: ../../Qwen/source/deployment/vllm.md:1 faa6e2bc47c24c6dab7113f73d67b0c4
msgid "vLLM"
msgstr ""
#: ../../Qwen/source/deployment/vllm.md:3 45682d60b2ee469bac6a473f7aacbe38
msgid "We recommend you trying [vLLM](https://github.com/vllm-project/vllm) for your deployment of Qwen. It is simple to use, and it is fast with state-of-the-art serving throughput, efficient management of attention key value memory with PagedAttention, continuous batching of input requests, optimized CUDA kernels, etc. To learn more about vLLM, please refer to the [paper](https://arxiv.org/abs/2309.06180) and [documentation](https://docs.vllm.ai/)."
msgstr "我们建议您在部署 Qwen 时尝试使用 [vLLM](https://github.com/vllm-project/vllm)。它易于使用,且具有最先进的服务吞吐量、高效的注意力键值内存管理(通过PagedAttention实现)、连续批处理输入请求、优化的CUDA内核等功能。要了解更多关于vLLM的信息,请参阅 [论文](https://arxiv.org/abs/2309.06180) 和 [文档](https://docs.vllm.ai/)。"
#: ../../Qwen/source/deployment/vllm.md:7 b6e6f5a91b9e4b749a6aaeca89358752
msgid "Environment Setup"
msgstr "环境配置"
#: ../../Qwen/source/deployment/vllm.md:9 14f9a1b8015d45388f5681145ddfcb0b
msgid "By default, you can install `vllm` with pip in a clean environment:"
msgstr "默认情况下,你可以通过 pip 在新环境中安装 `vllm` : "
#: ../../Qwen/source/deployment/vllm.md:15 fdda232bfb3c4d3895643b3ba7f78cbd
msgid "Please note that the prebuilt `vllm` has strict dependencies on `torch` and its CUDA versions. Check the note in the official document for installation ([link](https://docs.vllm.ai/en/latest/getting_started/installation.html)) for more help."
msgstr "请留意预构建的`vllm`对`torch`和其CUDA版本有强依赖。请查看[vLLM官方文档](https://docs.vllm.ai/en/latest/getting_started/installation.html)中的注意事项以获取有关安装的帮助。"
#: ../../Qwen/source/deployment/vllm.md:18 a175a37698bf4cfcb0c7bb33509e3775
msgid "API Service"
msgstr "API 服务"
#: ../../Qwen/source/deployment/vllm.md:20 6221b5708f054b839b17d3c21d086657
msgid "It is easy to build an OpenAI-compatible API service with vLLM, which can be deployed as a server that implements OpenAI API protocol. By default, it starts the server at `http://localhost:8000`. You can specify the address with `--host` and `--port` arguments. Run the command as shown below:"
msgstr "借助vLLM,构建一个与OpenAI API兼容的API服务十分简便,该服务可以作为实现OpenAI API协议的服务器进行部署。默认情况下,它将在 `http://localhost:8000` 启动服务器。您可以通过 `--host` 和 `--port` 参数来自定义地址。请按照以下所示运行命令:"
#: ../../Qwen/source/deployment/vllm.md:28 5b3fb351eff5402caf53fb28af098a14
msgid "By default, if the model does not point to a valid local directory, it will download the model files from the HuggingFace Hub. To download model from ModelScope, set the following before running the above command:"
msgstr "默认情况下,如果模型未指向有效的本地目录,它将从 HuggingFace Hub 下载模型文件。要从 ModelScope 下载模型,请在运行上述命令之前设置以下内容:"
#: ../../Qwen/source/deployment/vllm.md:34 e968dc2b83f94d8db88730e49cc2b557
msgid "For distrbiuted inference with tensor parallelism, it is as simple as"
msgstr "对于使用张量并行的分布式推理,操作非常简单:"
#: ../../Qwen/source/deployment/vllm.md:38 a017a4820c164f99b1b818eff1ece7e2
msgid "The above command will use tensor parallelism on 4 GPUs. You should change the number of GPUs according to your demand."
msgstr "上述命令将在 4 块 GPU 上使用张量并行。您应根据需求调整 GPU 的数量。"
#: ../../Qwen/source/deployment/vllm.md:41 cf79ffc98aeb4aaeaa87eb67a27bf931
msgid "Basic Usage"
msgstr "基本用法"
#: ../../Qwen/source/deployment/vllm.md:43 f422098f08af453fba9a04ffba7a65cf
msgid "Then, you can use the [create chat interface](https://platform.openai.com/docs/api-reference/chat/completions/create) to communicate with Qwen:"
msgstr "然后,您可以利用 [create chat interface](https://platform.openai.com/docs/api-reference/chat/completions/create) 来与Qwen进行对话:"
#: ../../Qwen/source/deployment/vllm.md 4fe8fdefc345451692648e733e009f2f
#: a8b7164993794f1398b4fb97662752d5
msgid "curl"
msgstr ""
#: ../../Qwen/source/deployment/vllm.md 5e2fab5952fd4fca952fb4d6bbca2a00
#: c5aff5ea28cd48bc8040b80173a609b8
msgid "Python"
msgstr ""
#: ../../Qwen/source/deployment/vllm.md:63
#: ../../Qwen/source/deployment/vllm.md:127 648478738dd3476c8dcdaa99cd345bfe
#: aa39dbf0acd646afa03c5fe79bb74011
msgid "You can use the API client with the `openai` Python SDK as shown below:"
msgstr "或者您可以如下面所示使用 `openai` Python SDK中的 API 客户端:"
#: ../../Qwen/source/deployment/vllm.md:91 a4dc5343d3214279828ee2a8e8d06106
msgid "`vllm` will use the sampling parameters from the `generation_config.json` in the model files."
msgstr "`vllm` 将使用模型文件中 `generation_config.json` 的采样参数。"
#: ../../Qwen/source/deployment/vllm.md:93 ba2912bb69f64837887bc32f1107b9f0
msgid "While the default sampling parameters would work most of the time for thinking mode, it is recommended to adjust the sampling parameters according to your application, and always pass the sampling parameters to the API."
msgstr "虽然默认的采样参数在大多数情况下适用于思考模式,但建议根据您的应用调整采样参数,并始终将采样参数传递给 API。"
#: ../../Qwen/source/deployment/vllm.md:99 9a47113d7a7b44b89f504244fada649f
msgid "Thinking & Non-Thinking Modes"
msgstr "思考与非思考模式"
#: ../../Qwen/source/deployment/vllm.md:101 20105d15da5a44bc8af9f9d628b54cb3
msgid "Qwen3 models will think before respond. This behaviour could be controled by either the hard switch, which could disable thinking completely, or the soft switch, where the model follows the instruction of the user on whether or not it should think."
msgstr "Qwen3 模型会在回复前进行思考。这种行为可以通过硬开关(完全禁用思考)或软开关(模型遵循用户关于是否应该思考的指令)来控制。"
#: ../../Qwen/source/deployment/vllm.md:104 df2e0e8d7b77407ba31e9efcea1440cf
msgid "The hard switch is availabe in vLLM through the following configuration to the API call. To disable thinking, use"
msgstr "硬开关在 vLLM 中可以通过以下 API 调用配置使用。要禁用思考,请使用"
#: ../../Qwen/source/deployment/vllm.md:156 31b40d3493e1487b8624048e64b07321
msgid "It is recommended to set sampling parameters differently for thinking and non-thinking modes."
msgstr "建议为思考模式和非思考模式分别设置不同的采样参数。"
#: ../../Qwen/source/deployment/vllm.md:159 ca30c135f89d4d0f9297484617c2c291
msgid "Parsing Thinking Content"
msgstr "解析思考内容"
#: ../../Qwen/source/deployment/vllm.md:161 8f8766d09dcc42dab5dfeba44e9495f0
msgid "vLLM supports parsing the thinking content from the model generation into structured messages:"
msgstr "vLLM 支持将模型生成的思考内容解析为结构化消息:"
#: ../../Qwen/source/deployment/vllm.md:166 ce2e2c1c45804d758aaa536b2a134236
msgid "The response message will have a field named `reasoning_content` in addition to `content`, containing the thinking content generated by the model."
msgstr "响应消息除了包含 `content` 字段外,还会有一个名为 `reasoning_content` 的字段,其中包含模型生成的思考内容。"
#: ../../Qwen/source/deployment/vllm.md:169 5740a6011e2e4b7194c8b3ede0ede490
msgid "Please note that this feature is not OpenAI API compatible."
msgstr "请注意,此功能与 OpenAI API 规范不一致。"
#: ../../Qwen/source/deployment/vllm.md:172 4bab7e0f542e42c1a6834649c677a13f
msgid "Parsing Tool Calls"
msgstr "解析工具调用"
#: ../../Qwen/source/deployment/vllm.md:174 0285408a203d41ff9b7e35f216f911f3
msgid "vLLM supports parsing the tool calling content from the model generation into structured messages:"
msgstr "vLLM 支持将模型生成的工具调用内容解析为结构化消息:"
#: ../../Qwen/source/deployment/vllm.md:179 b73b1c7b953c4cf9ab0fc4c7f7cce27f
msgid "For more information, please refer to [our guide on Function Calling](../framework/function_call.md#vllm)."
msgstr "详细信息,请参阅[函数调用的指南](../framework/function_call.md#vllm)。"
#: ../../Qwen/source/deployment/vllm.md:182 768aec6e39424dd1835c56497f3f9c19
msgid "As of vLLM 0.5.4, it is not supported to parse the thinking content and the tool calling from the model generation at the same time."
msgstr "在 vLLM 0.5.4 版本中,尚不支持同时解析模型生成的思考内容和工具调用。"
#: ../../Qwen/source/deployment/vllm.md:185 51dc25116a9346849b9c25436c64e770
msgid "Structured/JSON Output"
msgstr "结构化/JSON输出"
#: ../../Qwen/source/deployment/vllm.md:187 abfa9f6a9eb942d5878241295a2fd7d2
msgid "vLLM supports structured/JSON output. Please refer to [vLLM's documentation](https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html#extra-parameters-for-chat-api) for the `guided_json` parameters. Besides, it is also recommended to instruct the model to generate the specific format in the system message or in your prompt."
msgstr "vLLM 支持结构化/JSON 输出。请参照[vLLM文档](https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html#extra-parameters-for-chat-api)了解 `guided_json` 参数。此外,也建议在系统消息或用户提示中指示模型生成特定格式,避免仅依赖于推理参数配置。"
#: ../../Qwen/source/deployment/vllm.md:192 d84c568411164a8c8259228be8f6433a
msgid "Serving Quantized models"
msgstr "部署量化模型"
#: ../../Qwen/source/deployment/vllm.md:194 d3b9d138d1ec4bf88eec3130a92c8d32
msgid "Qwen3 comes with two types of pre-quantized models, FP8 and AWQ."
msgstr "Qwen3 提供了两种类型的预量化模型:FP8 和 AWQ。"
#: ../../Qwen/source/deployment/vllm.md:196 c16aa9098f9e4faf8b19df4430181d1c
msgid "The command serving those models are the same as the original models except for the name change:"
msgstr "部署这些模型的命令与原始模型相同,只是名称有所更改:"
#: ../../Qwen/source/deployment/vllm.md:206 774a3f0a10914f988aff2609b02ccb4c
msgid "FP8 computation is supported on NVIDIA GPUs with compute capability > 8.9, that is, Ada Lovelace, Hopper, and later GPUs."
msgstr "FP8 计算在计算能力 > 8.9 的 NVIDIA GPU 上受支持,即 Ada Lovelace、Hopper 及更新的 GPU。"
#: ../../Qwen/source/deployment/vllm.md:208 71a5529b10014da598a308aed2ff81cb
msgid "FP8 models will run on compute capability > 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin."
msgstr ""msgstr "FP8 模型将在计算能力 > 8.0(Ampere)的 GPU 上以仅权重 W8A16 的形式运行,利用 FP8 Marlin 技术。"
#: ../../Qwen/source/deployment/vllm.md:212 66310479fe11424d926533edd6d21dd0
msgid "As of vLLM 0.5.4, there are currently compatibility issues with `vllm` with the Qwen3 FP8 checkpoints. For a quick fix, you should make the following changes to the file `vllm/vllm/model_executor/layers/linear.py`:"
msgstr "在 vLLM 0.5.4 版本中,目前 `vllm` 与 Qwen3 FP8 检查点存在兼容性问题。要快速解决此问题,您应对文件 `vllm/vllm/model_executor/layers/linear.py` 进行以下更改:"
#: ../../Qwen/source/deployment/vllm.md:236 e10bf662530d4be884f61474a197df6f
msgid "Context Length"
msgstr "上下文长度"
#: ../../Qwen/source/deployment/vllm.md:238 100e104b60fb418b8c79ed341092efaa
msgid "The context length for Qwen3 models in pretraining is up to 32,768 tokenns. To handle context length substantially exceeding 32,768 tokens, RoPE scaling techniques should be applied. We have validated the performance of [YaRN](https://arxiv.org/abs/2309.00071), a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts."
msgstr "Qwen3 模型在预训练中的上下文长度最长为 32,768 个 token。为了处理显著超过 32,768 个 token 的上下文长度,应应用 RoPE 缩放技术。我们已经验证了 [YaRN](https://arxiv.org/abs/2309.00071) 的性能,这是一种增强模型长度外推的技术,可确保在长文本上的最佳性能。"
#: ../../Qwen/source/deployment/vllm.md:242 3f987cfdb9114eb7be76a18cd0d01a1a
msgid "vLLM supports YaRN, which can be configured as"
msgstr "vLLM 支持 YaRN,可以配置为"
#: ../../Qwen/source/deployment/vllm.md:248 fb9ef4ed4c9640a18574830368379d15
msgid "vLLM implements static YaRN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts.** We advise adding the `rope_scaling` configuration only when processing long contexts is required. It is also recommended to modify the `factor` as needed. For example, if the typical context length for your application is 65,536 tokens, it would be better to set `factor` as 2.0."
msgstr "vLLM 实现了静态 YaRN,这意味着无论输入长度如何,缩放因子都保持不变,**这可能会对较短文本的性能产生影响。** 我们建议仅在需要处理长上下文时添加 `rope_scaling` 配置。还建议根据需要调整 `factor`。例如,如果您的应用程序的典型上下文长度为 65,536 个 token,则最好将 `factor` 设置为 2.0。"
#: ../../Qwen/source/deployment/vllm.md:254 e670bcf6664e490c8b5a7e0cc4ebba41
msgid "The default `max_position_embeddings` in `config.json` is set to 40,960, which used by vLLM, if `--max-model-len` is not specified. This allocation includes reserving 32,768 tokens for outputs and 8,192 tokens for typical prompts, which is sufficient for most scenarios involving short text processing and leave adequate room for model thinking. If the average context length does not exceed 32,768 tokens, we do not recommend enabling YaRN in this scenario, as it may potentially degrade model performance."
msgstr "如果未指定 `--max-model-len`,`config.json` 中的默认 `max_position_embeddings` 被设置为 40,960,vLLM 将使用该值。此分配包括为输出保留 32,768 个 token,为典型提示保留 8,192 个 token,这足以应对大多数涉及短文本处理的场景,并为模型思考留出充足空间。如果平均上下文长度不超过 32,768 个 token,我们不建议在此场景中启用 YaRN,因为这可能会降低模型性能。"
#: ../../Qwen/source/deployment/vllm.md:259 386a1e853cbb4ec38ea7050f21bdd0d8
msgid "Python Library"
msgstr "Python 库使用"
#: ../../Qwen/source/deployment/vllm.md:261 8ec4de6067e24624807613c89745e894
msgid "vLLM can also be directly used as a Python library, which is convinient for offline batch inference but lack some API-only features, such as parsing model generation to structure messages."
msgstr "vLLM 也可以直接用作 Python 库,这对离线批量推理非常方便,但缺少一些仅限 API 的功能,例如将模型生成解析为结构化消息。"
#: ../../Qwen/source/deployment/vllm.md:263 b3722c7e75bd44e5a6b2c3b7f44fd30f
msgid "The following shows the basic usage of vLLM as a library:"
msgstr "以下展示了将 vLLM 用作库的基本用法:"
#: ../../Qwen/source/deployment/vllm.md:300 534845226be74248bd11a9b93fa153a0
msgid "FAQ"
msgstr "常见问题解答"
#: ../../Qwen/source/deployment/vllm.md:302 d1d72b72ccdc459c85ad8e383207931c
msgid "You may encounter OOM issues that are pretty annoying. We recommend two arguments for you to make some fix."
msgstr "您可能会遇到令人烦恼的OOM(内存溢出)问题。我们推荐您尝试两个参数进行修复。"
#: ../../Qwen/source/deployment/vllm.md:305 e5c77a3ba017433cb6b1a5c7e6015863
msgid "The first one is `--max-model-len`. Our provided default `max_position_embedding` is `40960` and thus the maximum length for the serving is also this value, leading to higher requirements of memory. Reducing it to a proper length for yourself often helps with the OOM issue."
msgstr "第一个参数是 `--max-model-len` 。我们提供的默认最大位置嵌入(`max_position_embedding`)为 40960 ,因此服务时的最大长度也是这个值,这会导致更高的内存需求。将此值适当减小通常有助于解决OOM问题。"
#: ../../Qwen/source/deployment/vllm.md:308 dc513e2855454776a4902cc8381b6c72
msgid "Another argument you can pay attention to is `--gpu-memory-utilization`. vLLM will pre-allocate this much GPU memory. By default, it is `0.9`. This is also why you find a vLLM service always takes so much memory. If you are in eager mode (by default it is not), you can level it up to tackle the OOM problem. Otherwise, CUDA Graphs are used, which will use GPU memory not controlled by vLLM, and you should try lowering it. If it doesn't work, you should try `--enforce-eager`, which may slow down infernece, or reduce the `--max-model-len`."
msgstr "另一个您可以关注的参数是 `--gpu-memory-utilization` 。 vLLM将预分配该参数指定比例的显存。默认情况下,该值为 `0.9`。这也是为什么您发现一个vLLM服务总是占用大量内存的原因。如果你使用了eager模式(默认不是),您可以将其调高以应对OOM问题。反之,vLLM会使用CUDA Graphs,而CUDA Graphs会额外占用不受vLLM管理的显存;此时,您应当尝试降低`--gpu-memory-utilization`。如果还是无法解决,可以尝试`--enforce-eager`(这会影响推理效率)或缩小`--max-model-len`。"
#~ msgid "Installation"
#~ msgstr "安装"
#~ msgid "Offline Batched Inference"
#~ msgstr "离线推理"
#~ msgid "Models supported by Qwen2.5 codes are supported by vLLM. The simplest usage of vLLM is offline batched inference as demonstrated below."
#~ msgstr "Qwen2.5代码支持的模型都被vLLM所支持。 vLLM最简单的使用方式是通过以下演示进行离线批量推理。"
#~ msgid "OpenAI-Compatible API Service"
#~ msgstr "OpenAI兼容的API服务"
#~ msgid "You don't need to worry about chat template as it by default uses the chat template provided by the tokenizer."
#~ msgstr "你无需担心chat模板,因为它默认会使用由tokenizer提供的chat模板。"
#~ msgid "The OpenAI-compatible server in `vllm` comes with [a default set of sampling parameters](https://github.com/vllm-project/vllm/blob/v0.5.2/vllm/entrypoints/openai/protocol.py#L130), which are not suitable for Qwen2.5 models and prone to repetition. We advise you to always pass sampling parameters to the API."
#~ msgstr "`vllm` 中的 OpenAI 兼容服务器使用 [一组默认的采样参数](https://github.com/vllm-project/vllm/blob/v0.5.2/vllm/entrypoints/openai/protocol.py#L130)。这组默认参数并不适用于 Qwen2.5 模型,并可能加重重复问题。我们建议您总是为该API传入合适的采样参数。"
#~ msgid "Tool Use"
#~ msgstr "工具使用"
#~ msgid "Multi-GPU Distributed Serving"
#~ msgstr "多卡分布式部署"
#~ msgid "To scale up your serving throughput, distributed serving helps you by leveraging more GPU devices. Besides, for large models like `Qwen2.5-72B-Instruct`, it is impossible to serve it on a single GPU. Here, we demonstrate how to run `Qwen2.5-72B-Instruct` with tensor parallelism just by passing in the argument `tensor_parallel_size`:"
#~ msgstr "要提高模型的处理吞吐量,分布式服务可以通过利用更多的GPU设备来帮助您。特别是对于像 `Qwen2.5-72B-Instruct` 这样的大模型,单个GPU无法支撑其在线服务。在这里,我们通过演示如何仅通过传入参数 `tensor_parallel_size` ,来使用张量并行来运行 `Qwen2.5-72B-Instruct` 模型:"
#~ msgid "Offline"
#~ msgstr "离线推理"
#~ msgid "API"
#~ msgstr ""
#~ msgid "Extended Context Support"
#~ msgstr "上下文支持扩展"
#~ msgid "vLLM supports YARN and it can be enabled by add a `rope_scaling` field to the `config.json` file of the model. For example,"
#~ msgstr "vLLM 支持 YaRN,并且可以通过在模型的 `config.json` 文件中添加一个 `rope_scaling` 字段来启用它。例如,"
#~ msgid "vLLM supports different types of quantized models, including AWQ, GPTQ, SqueezeLLM, etc. Here we show how to deploy AWQ and GPTQ models. The usage is almost the same as above except for an additional argument for quantization. For example, to run an AWQ model. e.g., `Qwen2.5-7B-Instruct-AWQ`:"
#~ msgstr "vLLM 支持多种类型的量化模型,例如 AWQ、GPTQ、SqueezeLLM 等。这里我们将展示如何部署 AWQ 和 GPTQ 模型。使用方法与上述基本相同,只不过需要额外指定一个量化参数。例如,要运行一个 AWQ 模型,例如 `Qwen2.5-7B-Instruct-AWQ` :"
#~ msgid "or GPTQ models like `Qwen2.5-7B-Instruct-GPTQ-Int4`:"
#~ msgstr "或者是GPTQ模型比如 `Qwen2.5-7B-Instruct-GPTQ-Int4` :"
#~ msgid "Additionally, vLLM supports the combination of AWQ or GPTQ models with KV cache quantization, namely FP8 E5M2 KV Cache. For example,"
#~ msgstr "此外,vLLM支持将AWQ或GPTQ模型与KV缓存量化相结合,即FP8 E5M2 KV Cache方案。例如:"
#~ msgid "Troubleshooting"
#~ msgstr "常见问题"
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2024, Qwen Team
# This file is distributed under the same license as the Qwen package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2024.
#
msgid ""
msgstr ""
"Project-Id-Version: Qwen \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-04-28 19:42+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.17.0\n"
#: ../../Qwen/source/framework/Langchain.rst:2 6f9b66430d9c495592b1e275fdfd7c9e
msgid "Langchain"
msgstr ""
#: ../../Qwen/source/framework/Langchain.rst:5 1205af46f88e4d6681003403109385c3
msgid "To be updated for Qwen3."
msgstr "仍需为Qwen3更新。"
#: ../../Qwen/source/framework/Langchain.rst:7 115ee7b1c8404629a8f98175264cc114
msgid "This guide helps you build a question-answering application based on a local knowledge base using ``Qwen2.5-7B-Instruct`` with ``langchain``. The goal is to establish a knowledge base Q&A solution."
msgstr "本教程旨在帮助您利用 ``Qwen2.5-7B-Instruct`` 与 ``langchain`` ,基于本地知识库构建问答应用。目标是建立一个知识库问答解决方案。"
#: ../../Qwen/source/framework/Langchain.rst:12
#: 7257b95612fb423bb9ca73212fd12a02
msgid "Basic Usage"
msgstr "基础用法"
#: ../../Qwen/source/framework/Langchain.rst:14
#: fecf7a682dcc4c15a53da1f7cdf145e5
msgid "The implementation process of this project includes loading files -> reading text -> segmenting text -> vectorizing text -> vectorizing questions -> matching the top k most similar text vectors with the question vectors -> incorporating the matched text as context along with the question into the prompt -> submitting to the Qwen2.5-7B-Instruct to generate an answer. Below is an example:"
msgstr "您可以仅使用您的文档配合 ``langchain`` 来构建一个问答应用。该项目的实现流程包括加载文件 -> 阅读文本 -> 文本分段 -> 文本向量化 -> 问题向量化 -> 将最相似的前k个文本向量与问题向量匹配 -> 将匹配的文本作为上下文连同问题一起纳入提示 -> 提交给Qwen2.5-7B-Instruct生成答案。以下是一个示例:"
#: ../../Qwen/source/framework/Langchain.rst:98
#: 6ad1ebd2ef4a49f9aa66cfdf777e1290
msgid "After loading the Qwen2.5-7B-Instruct model, you should specify the txt file for retrieval."
msgstr "加载Qwen2.5-7B-Instruct模型后,您可以指定需要用于知识库问答的txt文件。"
#: ../../Qwen/source/framework/Langchain.rst:274
#: 00467b1e4e294a26b9f49886633331e0
msgid "Next Step"
msgstr "下一步"
#: ../../Qwen/source/framework/Langchain.rst:276
#: 15ed906687054af78545290ba0746380
msgid "Now you can chat with Qwen2.5 use your own document. Continue to read the documentation and try to figure out more advanced usages of model retrieval!"
msgstr "现在,您可以在您自己的文档上与Qwen2.5进行交流。继续阅读文档,尝试探索模型检索的更多高级用法!"
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2024, Qwen Team
# This file is distributed under the same license as the Qwen package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2024.
#
msgid ""
msgstr ""
"Project-Id-Version: Qwen \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-04-28 19:42+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.17.0\n"
#: ../../Qwen/source/framework/LlamaIndex.rst:2
#: 2e41f8696c20488d8593b670c6361edf
msgid "LlamaIndex"
msgstr "LlamaIndex"
#: ../../Qwen/source/framework/LlamaIndex.rst:5
#: 20b3836fd391457bb00bf75b61e23e0d
msgid "To be updated for Qwen3."
msgstr "仍需为Qwen3更新。"
#: ../../Qwen/source/framework/LlamaIndex.rst:7
#: 86d9e6f0684749aab40a9824cd026fa3
msgid "To connect Qwen2.5 with external data, such as documents, web pages, etc., we offer a tutorial on `LlamaIndex <https://www.llamaindex.ai/>`__. This guide helps you quickly implement retrieval-augmented generation (RAG) using LlamaIndex with Qwen2.5."
msgstr "为了实现 Qwen2.5 与外部数据(例如文档、网页等)的连接,我们提供了 `LlamaIndex <https://www.llamaindex.ai/>`__ 的详细教程。本指南旨在帮助用户利用 LlamaIndex 与 Qwen2.5 快速部署检索增强生成(RAG)技术。"
#: ../../Qwen/source/framework/LlamaIndex.rst:11
#: 71ed222858054687a5b33222bb6ac086
msgid "Preparation"
msgstr "环境准备"
#: ../../Qwen/source/framework/LlamaIndex.rst:13
#: 161d9153d6484dd5a1f1bdb340847814
msgid "To implement RAG, we advise you to install the LlamaIndex-related packages first."
msgstr "为实现检索增强生成(RAG),我们建议您首先安装与 LlamaIndex 相关的软件包。"
#: ../../Qwen/source/framework/LlamaIndex.rst:16
#: a8d6acb1001a42c88185b971ae2de3bf
msgid "The following is a simple code snippet showing how to do this:"
msgstr "以下是一个简单的代码示例:"
#: ../../Qwen/source/framework/LlamaIndex.rst:25
#: e441d3b8fb6d4a13b52e1560ef250b16
msgid "Set Parameters"
msgstr "设置参数"
#: ../../Qwen/source/framework/LlamaIndex.rst:27
#: c2481804c3f34c7f883eed92ffa3111e
msgid "Now we can set up LLM, embedding model, and the related configurations. Qwen2.5-Instruct supports conversations in multiple languages, including English and Chinese. You can use the ``bge-base-en-v1.5`` model to retrieve from English documents, and you can download the ``bge-base-zh-v1.5`` model to retrieve from Chinese documents. You can also choose ``bge-large`` or ``bge-small`` as the embedding model or modify the context window size or text chunk size depending on your computing resources. Qwen2.5 model families support a maximum of 32K context window size (up to 128K for 7B, 14B, 32B, and 72B, requiring extra configuration)"
msgstr "现在,我们可以设置语言模型和向量模型。Qwen2.5-Instruct支持包括英语和中文在内的多种语言对话。您可以使用 ``bge-base-en-v1.5`` 模型来检索英文文档,下载 ``bge-base-zh-v1.5`` 模型以检索中文文档。根据您的计算资源,您还可以选择 ``bge-large`` 或 ``bge-small`` 作为向量模型,或调整上下文窗口大小或文本块大小。Qwen2.5模型系列支持最大32K上下文窗口大小(7B 、14B 、32B 及 72B可扩展支持 128K 上下文,但需要额外配置)"
#: ../../Qwen/source/framework/LlamaIndex.rst:85
#: 74c35d5a03734c289d162dfa3813ada6
msgid "Build Index"
msgstr "构建索引"
#: ../../Qwen/source/framework/LlamaIndex.rst:87
#: c49859d4ea5f49dba1fa2263f3ae284d
msgid "Now we can build index from documents or websites."
msgstr "现在我们可以从文档或网站构建索引。"
#: ../../Qwen/source/framework/LlamaIndex.rst:89
#: b460d000037e4266a4d9f43d38f1f9b0
msgid "The following code snippet demonstrates how to build an index for files (regardless of whether they are in PDF or TXT format) in a local folder named 'document'."
msgstr "以下代码片段展示了如何为本地名为'document'的文件夹中的文件(无论是PDF格式还是TXT格式)构建索引。"
#: ../../Qwen/source/framework/LlamaIndex.rst:102
#: a416d18b227940e29fac1f59851ff8c4
msgid "The following code snippet demonstrates how to build an index for the content in a list of websites."
msgstr "以下代码片段展示了如何为一系列网站的内容构建索引。"
#: ../../Qwen/source/framework/LlamaIndex.rst:118
#: 487cf928d048424fa1b50438f701137c
msgid "To save and load the index, you can use the following code snippet."
msgstr "要保存和加载已构建的索引,您可以使用以下代码示例。"
#: ../../Qwen/source/framework/LlamaIndex.rst:132
#: c68419c4318d46e891f5df9191be6d2d
msgid "RAG"
msgstr "检索增强(RAG)"
#: ../../Qwen/source/framework/LlamaIndex.rst:134
#: 8ad20a8f43fe496084a40f963ba97440
msgid "Now you can perform queries, and Qwen2.5 will answer based on the content of the indexed documents."
msgstr "现在您可以输入查询,Qwen2.5 将基于索引文档的内容提供答案。"
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2024, Qwen Team
# This file is distributed under the same license as the Qwen package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2024.
#
msgid ""
msgstr ""
"Project-Id-Version: Qwen \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-04-28 19:42+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.17.0\n"
#: ../../Qwen/source/framework/function_call.md:6
#: 9beab99bf6ea4ebaa37d53ed4100b34d
msgid "Function Calling"
msgstr "函数调用"
#: ../../Qwen/source/framework/function_call.md:9
#: 68bbe10408334355bc375ced535d2192
msgid "To be updated for Qwen3. Since the support for tool calling in Qwen3 is a superset of that in Qwen2, the examples would still work."
msgstr "即将更新以适配 Qwen3。由于 Qwen3 对工具调用的支持是 Qwen2 的超集,因此这些示例仍然适用。"
#: ../../Qwen/source/framework/function_call.md:13
#: e365d2d9a6d0456f8c7eacd41676a9bb
msgid "Preface"
msgstr "前言"
#: ../../Qwen/source/framework/function_call.md:15
#: 098c449c10fc4d7a9d2b43607428bc4b
msgid "Function calling with large language models is a huge and evolving topic. It is particularly important for AI applications:"
msgstr "使用大型语言模型进行函数调用 (Function Calling) 是一个庞大且不断发展的主题。这对AI应用尤为重要:"
#: ../../Qwen/source/framework/function_call.md:17
#: 1374bf593fd547d2abe3bd539785fd93
msgid "either for AI-native applications that strive to work around the shortcomings of current AI technology,"
msgstr "无论是为了绕过当前AI技术的局限性,而设计的原生AI应用,"
#: ../../Qwen/source/framework/function_call.md:18
#: 49bfba2b48c344aab7704fd297e46075
msgid "or for existing applications that seeks the integration of AI technology to improve performance, user interaction and experience, or efficiency."
msgstr "还是为了提升性能、用户体验或效率,寻求整合AI技术的现有应用。"
#: ../../Qwen/source/framework/function_call.md:20
#: 2d90524d75e84021857692fad0253dfd
msgid "This guide will not delve into those discussions or which role an LLM should play in an application and the related best practice. Those views are reflected in the design of AI application frameworks: from LangChain to LlamaIndex to QwenAgent."
msgstr "本指南不会深入讨论LLM在应用中应扮演的角色及相关的最佳实践。这些观点反映在AI应用框架的设计上:从LangChain到LlamaIndex再到QwenAgent。"
#: ../../Qwen/source/framework/function_call.md:23
#: eb136bb2cdd845d580dca33855b8926c
msgid "Instead, we will talk about how Qwen2.5 can be used to support function calling and how it can be used to achieve your goals, from the inference usage for developing application to the inner workings for hardcore customizations. In this guide,"
msgstr "相反,我们将讨论如何使用Qwen2.5来支持函数调用,以及如何利用它实现你的目标,从开发应用时的推理用途,到硬核定制的内部运作。在这个指南中,"
#: ../../Qwen/source/framework/function_call.md:25
#: f354efacac02496799de32fa0f819a20
msgid "We will first demonstrate how to use function calling with Qwen2.5."
msgstr "我们首先将展示如何使用Qwen2.5进行函数调用。"
#: ../../Qwen/source/framework/function_call.md:26
#: 0f77fb1c6a63449a924575764b204612
msgid "Then, we will introduce the technical details on functional calling with Qwen2.5, which are mainly about the templates."
msgstr "接着,我们将介绍使用Qwen2.5进行函数调用的技术细节,主要涉及模板的使用。"
#: ../../Qwen/source/framework/function_call.md:28
#: 174e8cb35a594a06aac5d8fe3f0f96ba
msgid "Before starting, there is one thing we have not yet introduced, that is ..."
msgstr "在开始之前,还有一件事我们尚未介绍,那就是…"
#: ../../Qwen/source/framework/function_call.md:30
#: 61295ff5189b4887aded9a3dd4c87b3a
msgid "What is function calling?"
msgstr "什么是函数调用?"
#: ../../Qwen/source/framework/function_call.md:33
#: 8256bec4b06b401db4f1d6af9e169a1e
msgid "There is another term \"tool use\" that may be used to refer to the same concept. While some may argue that tools are a generalized form of functions, at present, their difference exists only technically as different I/O types of programming interfaces."
msgstr "这一概念也可能被称为“工具使用” (\"tool use\")。虽然有人认为“工具”是“函数”的泛化形式,但在当前,它们的区别仅在技术层面上,表现为编程接口的不同输入输出类型。"
#: ../../Qwen/source/framework/function_call.md:37
#: 2ac692a7f3fc4c1182ba5ca670e2569b
msgid "Large language models (LLMs) are powerful things. However, sometimes LLMs by themselves are simply not capable enough."
msgstr "大型语言模型(LLMs)确实强大。然而,有时候单靠大型语言模型的能力还是不够的。"
#: ../../Qwen/source/framework/function_call.md:39
#: f41173c244994701a410d58790e8d053
msgid "On the one hand, LLMs have inherent modeling limitations. For one, they do not know things that are not in their training data, which include those happened after their training ended. In addition, they learn things in the way of likelihood, which suggests that they may not be precise enough for tasks with fixed rule sets, e.g., mathematical computation."
msgstr "一方面,大型语言模型存在建模局限性。首先,对于训练数据中没有的信息,包括训练结束后发生的事情,它们并不了解。此外,它们通过概率方式学习,这意味着对于有固定规则集的任务,如数学计算,可能不够精确。"
#: ../../Qwen/source/framework/function_call.md:42
#: 3ab73b4ca14e40bca40e5a657f284f78
msgid "On the other hand, it is not easy to use LLMs as a Plug-and-Play service programmatically with other things. LLMs mostly talk in words that are open to interpretation and thus ambiguous, while other software or applications or systems talk in code and through programming interfaces that are pre-defined and fixed and structured."
msgstr "另一方面,将大型语言模型作为即插即用服务与其它系统进行编程式协作,并非易事。大型语言模型的表达多含主观解释成分,因而产生歧义;而其他软件、应用或系统则通过预定义、固定和结构化的代码及编程接口进行沟通。"
#: ../../Qwen/source/framework/function_call.md:45
#: f65c57ab8f254bff9a7c281260f5e6c7
msgid "To this end, function calling establishes a common protocol that specifies how LLMs should interact with the other things. The procedure is mainly as follows:"
msgstr "为此,函数调用确立了一个通用协议,规定了大型语言模型应与其他实体互动的流程。主要流程如下:"
#: ../../Qwen/source/framework/function_call.md:47
#: f246814de8eb454983996810f4dbe082
msgid "The application provides a set of functions and the instructions of the functions to an LLM."
msgstr "应用程序向大型语言模型提供一组函数及其使用说明。"
#: ../../Qwen/source/framework/function_call.md:48
#: c520064a05924ef2923ebd712a2c8e52
msgid "The LLM choose to or not to, or is forced to use one or many of the functions, in response to user queries."
msgstr "大型语言模型根据用户查询,选择使用或不使用,或被迫使用一个或多个函数。"
#: ../../Qwen/source/framework/function_call.md:49
#: 6f5467688c8c474b89580a7d53f718a1
msgid "If the LLM chooses to use the functions, it states how the functions should be used based on the function instructions."
msgstr "如果大型语言模型选择使用这些函数,它会根据函数说明如何使用。"
#: ../../Qwen/source/framework/function_call.md:50
#: 0054606454134e82b07427591564cac4
msgid "The chosen functions are used as such by the application and the results are obtained, which are then given to the LLM if further interaction is needed."
msgstr "应用程序按照选择使用这些函数,并获取结果。如果需要进一步互动,结果将提供给大型语言模型。"
#: ../../Qwen/source/framework/function_call.md:52
#: 143abc83c56042d09ab8440a8a91b0dd
msgid "They are many ways for LLMs to understand and follow this protocol. As always, the key is prompt engineering or an internalized template known by the model. Qwen2.5 were pre-trained with various types of templates that could support function calling, so that users can directly make use of this procedure."
msgstr "大型语言模型理解并遵循此协议有多种方式。关键在于提示工程 (Prompt Engineering) 或模型内化的模板。Qwen2预先训练了多种支持函数调用的模板,以便用户可以直接利用这一过程。"
#: ../../Qwen/source/framework/function_call.md:57
#: e469caad6ca54deb8a8d34249f3b35cd
msgid "Inference with Function Calling"
msgstr "使用函数调用进行推理"
#: ../../Qwen/source/framework/function_call.md:60
#: ded933602fd14966978b243ad3046976
msgid "Please be aware that the inference usage is subject to change as the frameworks and the Qwen models evolve."
msgstr "请注意,随着框架和Qwen模型的不断演进,推理的使用方式可能会发生变化。"
#: ../../Qwen/source/framework/function_call.md:63
#: ab264772fc0e496693304ab0e1b77f31
msgid "As function calling is essentially implemented using prompt engineering, you could manually construct the model inputs for Qwen2 models. However, frameworks with function calling support can help you with all that laborious work."
msgstr "由于函数调用本质上是通过提示工程实现的,您可以手动构建Qwen2模型的输入。但是,支持函数调用的框架可以帮助您完成所有繁重的工作。"
#: ../../Qwen/source/framework/function_call.md:66
#: 8b53faa722604d40b8e6a16679742736
msgid "In the following, we will introduce the usage (via dedicated function calling chat template) with"
msgstr "接下来,我们将介绍(通过专用的函数调用模板)使用"
#: ../../Qwen/source/framework/function_call.md:67
#: 0375c440ccf04d778f992d6d0a4cbe88
msgid "**Qwen-Agent**,"
msgstr "**Qwen-Agent**,"
#: ../../Qwen/source/framework/function_call.md:68
#: 2ddfd6f2c18048fdbfd5c6ef4e9c15eb
msgid "**Hugging Face transformers**,"
msgstr "**Hugging Face transformers**,"
#: ../../Qwen/source/framework/function_call.md:69
#: f5a16e12451744558b5f0aa1e830a158
msgid "**Ollama**, and"
msgstr "**Ollama**,和"
#: ../../Qwen/source/framework/function_call.md:70
#: 9ab26b9373194451abc76749fefdb6d4
msgid "**vLLM**."
msgstr "**vLLM**。"
#: ../../Qwen/source/framework/function_call.md:72
#: 6157f3d3533b41a0821a42366eab0623
msgid "If you are familiar with the usage of OpenAI API, you could also directly use the OpenAI-compatible API services for Qwen2.5. However, not all of them support function calling for Qwen2.5. Currently, supported solutions include the self-hosted service by [Ollama](https://github.com/ollama/ollama/blob/main/docs/openai.md) or [vLLM](https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html#tool-calling-in-the-chat-completion-api) and the cloud service of [ModelStudio \\[zh\\]](https://help.aliyun.com/zh/model-studio/developer-reference/compatibility-of-openai-with-dashscope#97e2b45391x08)."
msgstr "如果您熟悉OpenAI API的使用,您也可以直接使用适用于Qwen2.5的OpenAI兼容API服务。然而,并非所有服务都支持Qwen2.5的函数调用。目前,支持的解决方案包括由[Ollama](https//github.com/ollama/ollama/blob/main/docs/openai.md)或[vLLM](https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html#tool-calling-in-the-chat-completion-api)提供的自托管服务和[阿里云百炼](https://help.aliyun.com/zh/model-studio/developer-reference/compatibility-of-openai-with-dashscope#97e2b45391x08)的云服务。"
#: ../../Qwen/source/framework/function_call.md:76
#: 425c7b0191fa48a6bfd57ecfd1764041
msgid "If you are familiar with application frameworks, e.g., LangChain, you can also use function calling abilities in Qwen2.5 via ReAct Prompting."
msgstr "如果您熟悉应用框架,例如LangChain,您也可以通过ReAct Prompting在Qwen2.5中使用函数调用功能。"
#: ../../Qwen/source/framework/function_call.md:78
#: 508b40fa17d24550a46415881e7db25b
msgid "The Example Case"
msgstr "案例"
#: ../../Qwen/source/framework/function_call.md:80
#: 0b47cd45a43345a79f6746c6955ca28f
msgid "Let's also use an example to demonstrate the inference usage. We assume **Python 3.11** is used as the programming language."
msgstr "我们同样通过一个示例来展示推理的使用方法。假设我们使用的编程语言是**Python 3.11**。"
#: ../../Qwen/source/framework/function_call.md:83
#: ff88f7b842024062bcff856d0e1e4771
msgid "**Scenario**: Suppose we would like to ask the model about the temperature of a location. Normally, the model would reply that it cannot provide real-time information. But we have two tools that can be used to obtain the current temperature of and the temperature at a given date of a city respectively, and we would like the model to make use of them."
msgstr "**场景**:假设我们要询问模型某个地点的温度。通常,模型会回答无法提供实时信息。但我们有两个工具,可以分别获取城市的当前温度和指定日期的温度,我们希望模型能够利用这些工具。"
#: ../../Qwen/source/framework/function_call.md:87
#: 9e478a1634bb499c88442ed6a8dcddc8
msgid "To set up the example case, you can use the following code:"
msgstr "为了这个示例案例,您可以使用以下代码:"
#: ../../Qwen/source/framework/function_call.md
#: aebb484e1452437eb26717c460a1f7ea
msgid "Preparation Code"
msgstr "准备代码"
#: ../../Qwen/source/framework/function_call.md:194
#: c6066c74ceab4ca4b2a15a5d4137c22a
msgid "In particular, the tools should be described using JSON Schema and the messages should contain as much available information as possible. You can find the explanations of the tools and messages below:"
msgstr "工具应使用JSON Schema进行描述,消息应包含尽可能多的有效信息。您可以在下面找到工具和消息的解释:"
#: ../../Qwen/source/framework/function_call.md
#: 9a22b936b9894fa6be0cc492c64abb63
msgid "Example Tools"
msgstr "示例工具"
#: ../../Qwen/source/framework/function_call.md:199
#: 596b3ce24b164e23a09a375ac55ada45
msgid "The tools should be described using the following JSON:"
msgstr "工具应使用以下JSON进行描述:"
#: ../../Qwen/source/framework/function_call.md:263
#: 3be94377bbd94229b3da3cf040818ab9
msgid "For each **tool**, it is a JSON object with two fields:"
msgstr "对于每个**工具**,它是一个具有两个字段的JSON object:"
#: ../../Qwen/source/framework/function_call.md:264
#: 8f3b20c0c2494d51ae6afe6a75662540
msgid "`type`: a string specifying the type of the tool, currently only `\"function\"` is valid"
msgstr "`type`:string,用于指定工具类型,目前仅`\"function\"`有效"
#: ../../Qwen/source/framework/function_call.md:265
#: 23a80c3dfb31433598154f6d75e5fa67
msgid "`function`: an object detailing the instructions to use the function"
msgstr "`function`:object,详细说明了如何使用该函数"
#: ../../Qwen/source/framework/function_call.md:267
#: 25d6764f41c14cd5a54ea390f5fa746d
msgid "For each **function**, it is a JSON object with three fields:"
msgstr "对于每个**function**,它是一个具有三个字段的JSON object:"
#: ../../Qwen/source/framework/function_call.md:268
#: cd048320507042129336abe75fa962e7
msgid "`name`: a string indicating the name of the function"
msgstr "`name`:string 表示函数名称"
#: ../../Qwen/source/framework/function_call.md:269
#: 20186ca105aa4914a4ed6a9821a80555
msgid "`description`: a string describing what the function is used for"
msgstr "`description`:string 描述函数用途"
#: ../../Qwen/source/framework/function_call.md:270
#: 93a5888532534627afafdb4a2ed8d2be
msgid "`parameters`: [a JSON Schema](https://json-schema.org/learn/getting-started-step-by-step) that specifies the parameters the function accepts. Please refer to the linked documentation for how to compose a JSON Schema. Notable fields include `type`, `required`, and `enum`."
msgstr "`parameters`:[JSON Schema](https://json-schema.org/learn/getting-started-step-by-step),用于指定函数接受的参数。请参阅链接文档以了解如何构建JSON Schema。值得注意的字段包括`type`、`required`和`enum`。"
#: ../../Qwen/source/framework/function_call.md:272
#: 0f8d5fa5bf764e888441b5fa445c98ae
msgid "Most frameworks use the tool format and some may use the function format. Which one to use should be obvious according to the naming."
msgstr "大多数框架使用“工具”格式,有些可能使用“函数”格式。根据命名,应该很明显应该使用哪一个。"
#: ../../Qwen/source/framework/function_call.md
#: f15fa96fb4b245b28fd544a5c4a74958
msgid "Example Messages"
msgstr "示例消息"
#: ../../Qwen/source/framework/function_call.md:279
#: 906e53e3f10542819f8506c780957196
msgid "Our query is `What's the temperature in San Francisco now? How about tomorrow?`. Since the model does not know what the current date is, let alone tomorrow, we should provide the date in the inputs. Here, we decide to supply that information in the system message after the default system message `You are Qwen, created by Alibaba Cloud. You are a helpful assistant.`. You could append the date to user message in your application code."
msgstr "我们的查询是`What's the temperature in San Francisco now? How about tomorrow?`。由于模型不知道当前日期,更不用说明天了,我们应该在输入中提供日期。在这里,我们决定在默认系统消息`You are Qwen, created by Alibaba Cloud. You are a helpful assistant.`之后的系统消息中提供该信息。您可以在应用程序代码中将日期附加到用户消息。"
#: ../../Qwen/source/framework/function_call.md:292
#: ../../Qwen/source/framework/function_call.md:555
#: 16b171b36c9f46fea2b30a3b0491db55 ce3d6dc46c5b420484ad78a89e492b1e
msgid "Qwen-Agent"
msgstr ""
#: ../../Qwen/source/framework/function_call.md:294
#: 6ef16c1b08664d7bb94253e4726d1ad9
msgid "[Qwen-Agent](https://github.com/QwenLM/Qwen-Agent) is actually a Python Agent framework for developing AI applications. Although its intended use cases are higher-level than efficient inference, it does contain the **canonical implementation** of function calling for Qwen2.5. It provides the function calling ability for Qwen2.5 to an OpenAI-compatible API through templates that is transparent to users."
msgstr "[Qwen-Agent](https://github.com/QwenLM/Qwen-Agent) 实际上是一个用于开发AI应用的Python智能体框架。尽管其设计用例比高效推理更高级,但它确实包含了Qwen2.5函数调用的**规范实现**。基于OpenAI兼容API,它可以通过模板为Qwen2.5提供了对用户透明的的函数调用能力。"
#: ../../Qwen/source/framework/function_call.md:299
#: 978b5270e15749269a680fbae4a05ab2
msgid "It's worth noting that since a lot of stuff can be done under the scene with application frameworks, currently the official function calling implementation for Qwen2.5 is very flexible and beyond simple templating, making it hard to adapt it other frameworks that use less capable templating engines."
msgstr "值得注意的是,由于应用框架可以在幕后完成大量工作,目前Qwen2.5官方的函数调用实现非常灵活且超出了简单的模板化,这使得它难以适应那些使用能力较弱的模板引擎的其他框架。"
#: ../../Qwen/source/framework/function_call.md:301
#: 8108cc93576b427f86c033cf6847c59a
msgid "Before starting, let's make sure the latest library is installed:"
msgstr "在开始之前,让我们确保已安装了最新的库:"
#: ../../Qwen/source/framework/function_call.md:306
#: baeca47856fe4f8e9f105b6d1678c648
msgid "For this guide, we are at version v0.0.10."
msgstr "对于本指南,我们处于版本v0.0.10。"
#: ../../Qwen/source/framework/function_call.md:308
#: ../../Qwen/source/framework/function_call.md:454
#: ../../Qwen/source/framework/function_call.md:670
#: ../../Qwen/source/framework/function_call.md:782
#: b9ce56b7e4544aaba3812335006de981 e3b06686bcbd4cd6bc99ee238def55ea
#: ecfd39b570fb4fd0b405abccefe91de2 f111d6e7f89e498798711aad21fe8dd4
msgid "Preparing"
msgstr "准备工作"
#: ../../Qwen/source/framework/function_call.md:310
#: 45fecd8307254305a2018726f8adb3ac
msgid "Qwen-Agent can wrap an OpenAI-compatible API that does not support function calling. You can serve such an API with most inference frameworks or obtain one from cloud providers like DashScope or Together."
msgstr "Qwen-Agent可以封装一个不支持函数调用的OpenAI兼容API。您可以使用大多数推理框架来提供此类API,或者从DashScope或Together等云提供商处获取一个。"
#: ../../Qwen/source/framework/function_call.md:313
#: 615ebcb0c6cb437f9723d5b4119800a5
msgid "Assuming there is an OpenAI-compatible API at `http://localhost:8000/v1`, Qwen-Agent provides a shortcut function `get_chat_model` to obtain a model inference class with function calling support:"
msgstr "假设在`http://localhost:8000/v1`处有一个OpenAI兼容API,Qwen-Agent提供了一个快捷函数`get_chat_model`,用于获取具有函数调用支持的模型推理类:"
#: ../../Qwen/source/framework/function_call.md:325
#: 783f6b93cc164dfaac23aa4ac20e2c2a
msgid "In the above, `model_server` is the `api_base` common used in other OpenAI-compatible API clients. It is advised to provide the `api_key` (but not via plaintext in the code), even if the API server does not check it, in which case, you can set it to anything."
msgstr "在上述代码中,`model_server`是其他OpenAI兼容API客户端常用的`api_base`。建议您提供`api_key`(但不要以明文形式出现在代码中),即使API服务器不检查它,在这种情况下,您可以将其设置为任何值。"
#: ../../Qwen/source/framework/function_call.md:328
#: 0231335d788a4ff08ba062da648afc94
msgid "For model inputs, the common message structure for system, user, and assistant history should be used:"
msgstr "对于模型输入,应使用系统、用户和助手历史记录的通用消息结构:"
#: ../../Qwen/source/framework/function_call.md:338
#: fe2abed2ecda45a88a0a819123624376
msgid "We add the current date to the system message so that the \"tomorrow\" in the user message is anchored. It can also be added to the user message if one desires."
msgstr "我们在系统消息中添加当前日期,以便使用户消息中的\"明天\"有明确的参照点。如果需要,也可以将其添加到用户消息中。"
#: ../../Qwen/source/framework/function_call.md:341
#: 89c4992bdf324623b53ebe1a54191a99
msgid "At the time, Qwen-Agent works with functions instead of tools. This requires a small change to our tool descriptions, that is, extracting the function fields:"
msgstr "目前,Qwen-Agent使用“函数”而非“工具”。这需要对我们工具描述进行一些小的更改,即提取函数字段:"
#: ../../Qwen/source/framework/function_call.md:348
#: ../../Qwen/source/framework/function_call.md:495
#: ../../Qwen/source/framework/function_call.md:684
#: ../../Qwen/source/framework/function_call.md:813
#: 17daa4342c054a8e9b72169a6ebf49a1 67ef3d2ae00b4b4c9153a10047ca2522
#: 75839b0a245e4704834a7d8b12c5d2b9 7dd2c5cc0b404552afb2dbbfe1532cda
msgid "Tool Calls and Tool Results"
msgstr "工具调用和工具结果"
#: ../../Qwen/source/framework/function_call.md:350
#: b271cc28f0ff4422bb5f1984363e4730
msgid "To interact with the model, the `chat` method should be used:"
msgstr "为了与模型交互,应使用`chat`方法:"
#: ../../Qwen/source/framework/function_call.md:362
#: c3ea39a48e3d4890a006d4cedb109064
msgid "In the above code, the `chat` method receives the `messages`, the `functions`, and an `extra_generate_cfg` parameter. You can put sampling parameters, such as `temperature`, and `top_p`, in the `extra_generate_cfg`. Here, we add to it a special control `parallel_function_calls` provided by Qwen-Agent. As its name suggests, it will enable parallel function calls, which means that the model may generate multiple function calls for a single turn as it deems fit."
msgstr "在上述代码中,`chat`方法接收`messages`、`functions`以及一个`extra_generate_cfg`参数。你可以在`extra_generate_cfg`中放入诸如`temperature`和`top_p`等采样参数。这里,我们添加了Qwen-Agent提供的特殊控制`parallel_function_calls`。顾名思义,它将启用并行函数调用,这意味着模型可能为单次请求生成多个函数调用,按照其判断进行。"
#: ../../Qwen/source/framework/function_call.md:367
#: 717eca19455f452daaccc3626dd93ac9
msgid "The `chat` method returns a generator of list, each of which may contain multiple messages. Since we enable `parallel_function_calls`, we should get two messages in the responses:"
msgstr "`chat`方法返回一个列表的生成器,每个列表可能包含多条消息。因为我们启用了`parallel_function_calls`,我们应该在响应中得到两条消息:"
#: ../../Qwen/source/framework/function_call.md:377
#: 67f5fab10f914b37bf0ae28e5fb4a271
msgid "As we can see, Qwen-Agent attempts to parse the model generation in an easier to use structural format. The details related to function calls are placed in the `function_call` field of the messages:"
msgstr "我们可以看到,Qwen-Agent试图以更易于使用的结构化格式解析模型生成。与函数调用相关的详细信息被放置在消息的`function_call`字段中:"
#: ../../Qwen/source/framework/function_call.md:379
#: 29c4b62805bc4e2ab7a914c591e0c9da
msgid "`name`: a string representing the function to call"
msgstr "`name`:代表要调用的函数的字符串"
#: ../../Qwen/source/framework/function_call.md:380
#: 36396d04bf1e412380a4c03c286e263a
msgid "`arguments`: a JSON-formatted string representing the arguments the function should be called with"
msgstr "`arguments`:表示函数应带有的参数的JSON格式字符串"
#: ../../Qwen/source/framework/function_call.md:382
#: 4630f89282b5414ea41d1ef19cd21b6b
msgid "Note that Qwen2.5-7B-Instruct is quite capable:"
msgstr "请注意,Qwen2.5-7B-Instruct相当强大:"
#: ../../Qwen/source/framework/function_call.md:383
#: 74c060396f974d09aa25053dd1a3d401
msgid "It has followed the function instructions to add the state and the country to the location."
msgstr "它遵循函数指令,在位置中添加了州和国家。"
#: ../../Qwen/source/framework/function_call.md:384
#: b0ee4de6a3844f2dbd728c6957b033d2
msgid "It has correctly induced the date of tomorrow and given in the format required by the function."
msgstr "它正确地推断出明天的日期,并以函数要求的格式给出。"
#: ../../Qwen/source/framework/function_call.md:386
#: 44c59fa19c3a40e2a9b06c24245c1801
msgid "Then comes the critical part -- checking and applying the function call:"
msgstr "接下来是关键部分——检查和应用函数调用:"
#: ../../Qwen/source/framework/function_call.md:402
#: fca90539a17c459b960daf68d509970f
msgid "To get tool results:"
msgstr "获取工具结果:"
#: ../../Qwen/source/framework/function_call.md:403
#: 0d88f729b9474ab0a760f004a7594752
msgid "line 1: We should iterate the function calls in the order the model generates them."
msgstr "第1行:我们应该按模型生成它们的顺序迭代函数调用。"
#: ../../Qwen/source/framework/function_call.md:404
#: 6e061ac2d81f46a2b93830206c2841e4
msgid "line 2: We can check if a function call is needed as deemed by the model by checking the `function_call` field of the generated messages."
msgstr "第2行:通过检查生成消息的`function_call`字段,我们可以查看是否需要按模型判断进行函数调用。"
#: ../../Qwen/source/framework/function_call.md:405
#: b400b27066094316b28b0f5742a080a9
msgid "line 3-4: The related details including the name and the arguments of the function can also be found there, which are `name` and `arguments` respectively."
msgstr "第3-4行:相关详情,包括函数名称和参数,也可以在那里找到,分别是`name`和`arguments`。"
#: ../../Qwen/source/framework/function_call.md:406
#: 9d38560976b0456fa38de4b87007cf27
msgid "line 6: With the details, one should call the function and obtain the results. Here, we assume there is a function named [`get_function_by_name`](#prepcode) to help us get the related function by its name."
msgstr "第6行:有了这些细节,应该调用函数并获取结果。这里,我们假设有一个名为[`get_function_by_name`](#prepcode)的函数来帮助我们根据名称获取相关函数。"
#: ../../Qwen/source/framework/function_call.md:408
#: 4c335071c730449faf0eed0159d525c4
msgid "line 8-12: With the result obtained, add the function result to the messages as `content` and with `role` as `\"function\"`."
msgstr "第8-12行:获得结果后,将函数结果作为`content`添加到消息中,并将`role`设置为`\"function\"`。"
#: ../../Qwen/source/framework/function_call.md:410
#: 0b71ee2ed93342018509aaafad0701d4
msgid "Now the messages are"
msgstr "现在消息是"
#: ../../Qwen/source/framework/function_call.md:422
#: ../../Qwen/source/framework/function_call.md:624
#: ../../Qwen/source/framework/function_call.md:750
#: ../../Qwen/source/framework/function_call.md:900
#: 5e6a5cb155d74d94b1adf0278bb896a1 738554523dce4f6394a9aaf5ffd935f6
#: 7b7eaed481b344b2b581aed31cafbb67 c9afee6dee944cb393d302c297b13b27
msgid "Final Response"
msgstr "最终响应"
#: ../../Qwen/source/framework/function_call.md:424
#: 8b6ac2c7a95747e1bbc86cbc445709d0
msgid "Finally, run the model again to get the final model results:"
msgstr "最后,再次运行模型以获取最终的模型结果:"
#: ../../Qwen/source/framework/function_call.md:432
#: 6ce8ec5e4d0049dab95f830e46c6dcea
msgid "The final response should be like"
msgstr "最终响应应如下所示"
#: ../../Qwen/source/framework/function_call.md:438
#: ../../Qwen/source/framework/function_call.md:555
#: 8c8a902a7d3a40e3b8b6304e3cfd60aa d73648ac333743cd8e21e14eae3db734
msgid "Hugging Face transformers"
msgstr ""
#: ../../Qwen/source/framework/function_call.md:440
#: f5acbc281bcb44efa53d6818fa72f5d7
msgid "Since function calling is based on prompt engineering and templates, `transformers` supports it with its tokenizer utilities, in particular, the `tokenizer.apply_chat_template` method, which hides the sophistication of constructing the model inputs, using the Jinja templating engine. However, it means that users should handle the model output part on their own, which includes parsing the generated function call message."
msgstr "由于函数调用基于提示工程和模板,`transformers`通过其tokenizer工具支持这一功能,特别是`tokenizer.apply_chat_template`方法,它利用Jinja模板引擎隐藏了构建模型输入的复杂性。然而,这意味着用户需要自行处理模型输出部分,包括解析生成的函数调用消息。"
#: ../../Qwen/source/framework/function_call.md:443
#: 43387eb2f7c74c0980b05b258aec9491
msgid "The blog piece [_Tool Use, Unified_](https://huggingface.co/blog/unified-tool-use) is very helpful in understanding its design. Be sure to take a look."
msgstr "博客文章[_Tool Use, Unified_](https://huggingface.co/blog/unified-tool-use)对于理解其设计非常有帮助。务必阅读一下。"
#: ../../Qwen/source/framework/function_call.md:446
#: 09e1372474d8426dac3a9241c06eda5c
msgid "Tool use API is available in transformers since v4.42.0. Before starting, let's check that:"
msgstr "自v4.42.0版本起,transformers中提供了工具使用API。在开始之前,让我们确认这一点:"
#: ../../Qwen/source/framework/function_call.md:452
#: 322f241f77b040048a084f8cd76c2e0a
msgid "For this guide, we are at version v4.44.2."
msgstr "对于本指南,我们处于v4.44.2版本。"
#: ../../Qwen/source/framework/function_call.md:456
#: 1b1b289980df4b6a9cb4bf1c29d4272d
msgid "For Qwen2.5, the chat template in `tokenizer_config.json` has already included support for the Hermes-style tool use. We simply need to load the model and the tokenizer:"
msgstr "对于 Qwen2.5,`tokenizer_config.json` 中的聊天模板已经包含了对 Hermes 风格工具调用的支持。我们只需加载模型和分词器:"
#: ../../Qwen/source/framework/function_call.md:472
#: ../../Qwen/source/framework/function_call.md:674
#: ../../Qwen/source/framework/function_call.md:790
#: 6541fc7c9b774e69b2b0b97a4c491459 888996a83df34b91b30b1355ddfc3494
#: ea933584017e47cfb87ceff594f54c9c
msgid "The inputs are the same with those in [the preparation code](#prepcode):"
msgstr "输入与[准备代码](#prepcode)中的相同:"
#: ../../Qwen/source/framework/function_call.md:479
#: d7ab8f1b743b41038810a7d223c8ffc9
msgid "In `transformers`, you can also directly use Python functions as tools with certain constraints[^get_json_schema_note]:"
msgstr "在`transformers`中,您也可以直接将Python函数作为工具使用,但需遵循特定约束[^get_json_schema_note]:"
#: ../../Qwen/source/framework/function_call.md:497
#: a5231c97237249739c5d73db49695b05
msgid "To construct the input sequence, we should use the `apply_chat_template` method and then let the model continue the texts:"
msgstr "为了构造输入序列,我们应该使用`apply_chat_template`方法,然后让模型继续生成文本:"
#: ../../Qwen/source/framework/function_call.md:506
#: cb19105f6e464ad1a3eee0c9fe907bb1
msgid "The output texts should be like"
msgstr "输出文本应如下所示:"
#: ../../Qwen/source/framework/function_call.md:516
#: 0300726829e542d9942f51a2772206ff
msgid "Now we need to do two things:"
msgstr "现在我们需要做两件事:"
#: ../../Qwen/source/framework/function_call.md:517
#: a0c8a86da0524084b636946b4cfeaf87
msgid "Parse the generated tool calls to a message and add them to the messages, so that the model knows which tools are used."
msgstr "解析生成的工具调用为一条消息,并将其添加到消息列表中,以便模型了解所使用的工具。"
#: ../../Qwen/source/framework/function_call.md:518
#: 90ccd935a4554dcbb536a444cd96592d
msgid "Obtain the results of the tools and add them to the messages, so that the model knows the results of the tool calls."
msgstr "获取工具的结果并将其添加到消息列表中,以便模型了解工具调用的结果。"
#: ../../Qwen/source/framework/function_call.md:520
#: 8568f15805ef469790801e79525ba25c
msgid "In `transformers`, the tool calls should be a field of assistant messages. Let's use a simple function called `try_parse_tool_calls` to parse the tool calls:"
msgstr ""
#: ../../Qwen/source/framework/function_call.md:552
#: be6ae865b4a6414681ef3d5ed1957c39
msgid "This function does not cover all possible scenarios and thus is prone to errors. But it should suffice for the purpose of this guide."
msgstr ""
#: ../../Qwen/source/framework/function_call.md:556
#: 9de3bba707eb45f09490e25489b410fd
msgid "The template in the `tokenizer_config.json` assumes that the generated content alongside tool calls is in the same message instead of separate assistant messages, e.g.,"
msgstr "`tokenizer_config.json` 中的模板假设生成的内容和工具调用是在同一消息中,而不是分开的助手消息,例如:"
#: ../../Qwen/source/framework/function_call.md:566
#: 098c29ea125943ebbc162aabb091c773
msgid "instead of"
msgstr "而非"
#: ../../Qwen/source/framework/function_call.md:583
#: e6738454a6cd4f408b5d375eadd851ee
msgid "This is implemented roughly in `try_parse_tool_calls` but keep that in mind if you are writing your own tool call parser."
msgstr "`try_parse_tool_calls` 中大致实现了这一约定,但如果你正在编写自己的工具调用解析器,请留意这一点。"
#: ../../Qwen/source/framework/function_call.md:604
#: b2b6427b91524ed0bb0c333c0aebfe53
msgid "The messages now should be like"
msgstr "现在消息应如下所示:"
#: ../../Qwen/source/framework/function_call.md:618
#: 7eae05a0693f4e10bb0a3d505939b1ba
msgid "The messages are similar to those of Qwen-Agent, but there are some major differences:"
msgstr "这些消息类似于Qwen-Agent的消息,但存在一些主要差异:"
#: ../../Qwen/source/framework/function_call.md:619
#: 0685adf142544e6cb3eb76eec3cd9017
msgid "Tools instead of functions"
msgstr "工具而非函数"
#: ../../Qwen/source/framework/function_call.md:620
#: e5c7371882e441008feb0b17910716ee
msgid "Parallel calls are by default"
msgstr "默认情况下为并行调用"
#: ../../Qwen/source/framework/function_call.md:621
#: e6e4530ff2fb4689ac48203bb796b250
msgid "Multiple tool calls as a list in a single assistant message, instead of multiple messages."
msgstr "多个工具调用以列表形式在一个助手消息中,而不是多个消息"
#: ../../Qwen/source/framework/function_call.md:622
#: d6e37c7dd0fb44a79d29e0306a1ff80a
msgid "The function arguments are parsed into a dict if it is a valid JSON-formatted string."
msgstr "如果函数参数是有效的JSON格式字符串,则将其解析为字典。"
#: ../../Qwen/source/framework/function_call.md:626
#: db64d30026b645d389aa1366d55eb177
msgid "Then it's time for the model to generate the actual response for us based on the tool results. Let's query the model again:"
msgstr "现在是时候根据工具结果,让模型为我们生成实际响应了。再次查询模型:"
#: ../../Qwen/source/framework/function_call.md:636
#: 58924dcb06804e05a7b9308933733104
msgid "The output_text should be like"
msgstr "输出文本应如下所示:"
#: ../../Qwen/source/framework/function_call.md:641
#: 00a88d2136e648f68133ebc6cf0e01b6
msgid "Add the result text as an assistant message and the final messages should be ready for further interaction:"
msgstr "将结果文本作为助手消息添加,最终消息应准备好进行进一步交互:"
#: ../../Qwen/source/framework/function_call.md:555
#: ../../Qwen/source/framework/function_call.md:646
#: 581caffc70b5478d8508e59d56166ad5 f64b7d2b14f64b42b959e5a6e75a3bf4
msgid "Ollama"
msgstr ""
#: ../../Qwen/source/framework/function_call.md:648
#: 5c295d58873d4f949e8a640ab1309f30
msgid "Ollama is a set of tools for serving LLMs locally. It also relies on its template implementation to support function calling. Different from transformers, which is written in Python and uses the Jinja template whose syntax is heavily inspired by Django and Python, Ollama, which is mostly written in Go, uses Go's [text/template](https://pkg.go.dev/text/template) packages. In addition, Ollama implements internally a helper function so that it can automatically parse the generated tool calls in texts to structured messages if the format supported."
msgstr "Ollama是一套用于本地部署LLMs的工具集。它还依赖于其模板实现来支持函数调用。不同于使用Python编写的transformers,采用了受Django和Python语法启发的Jinja模板,主要用Go编写的Ollama则使用了Go的[text/template](https://pkg.go.dev/text/template)包。此外,Ollama内部实现了辅助函数,如果格式被支持的话,它可以自动解析文本中生成的工具调用为结构化的消息。"
#: ../../Qwen/source/framework/function_call.md:653
#: 92e60ea979c74a0aa9c49d3d4175f12a
msgid "You could check the [Tool support](https://ollama.com/blog/tool-support) blog post first."
msgstr "您可以先查阅[Tool support](https://ollama.com/blog/tool-support)的博客文章。"
#: ../../Qwen/source/framework/function_call.md:655
#: 76a1fb72c14f4d9fbf77fe08e79d2c35
msgid "Tool support has been available in Ollama since v0.3.0. You can run the following to check the Ollama version:"
msgstr "自v0.3.0版本以来,Ollama已经提供了工具支持。您可以运行以下命令来检查Ollama的版本:"
#: ../../Qwen/source/framework/function_call.md:660
#: 772f68fcfa2d412a86182d819dde28ad
msgid "If lower than expected, follow [the official instructions](https://ollama.com/download) to install the latest version."
msgstr "如果版本低于预期,请遵循[官方说明](https://ollama.com/download)安装最新版本。"
#: ../../Qwen/source/framework/function_call.md:662
#: 718f28a07fe64d2687e09d472d5cce3a
msgid "In this guide, we will aslo use [ollama-python](https://github.com/ollama/ollama-python), before starting, make sure it is available in your environment:"
msgstr "在本指南中,我们将使用[ollama-python](https://github.com/ollama/ollama-python),在开始之前,请确保您的环境中已安装此库:"
#: ../../Qwen/source/framework/function_call.md:667
#: 34d15300b24e447bb2ee7e24dd0567f7
msgid "For this guide, the `ollama` binary is at v0.3.9 and the `ollama` Python library is at v0.3.2."
msgstr "对于本指南,`ollama`二进制文件的版本为v0.3.9,`ollama` Python库的版本为v0.3.2。"
#: ../../Qwen/source/framework/function_call.md:672
#: e0b6a3f28628471f88bb07298f742b12
msgid "The messages structure used in Ollama is the same with that in `transformers` and the template in [Qwen2.5 Ollama models](https://ollama.com/library/qwen2.5) has supported tool use."
msgstr "Ollama 中使用的消息结构与 `transformers` 中的相同,并且 [Qwen2.5 Ollama 模型](https://ollama.com/library/qwen2.5) 的模板已经支持工具调用。"
#: ../../Qwen/source/framework/function_call.md:681
#: 64eb1d9868c54bc19076c00b0485d371
msgid "Note that you cannot pass Python functions as tools directly and `tools` has to be a `dict`."
msgstr "请注意,您不能直接将Python函数作为工具传递,`tool`的类型必须是`dict`。"
#: ../../Qwen/source/framework/function_call.md:686
#: 3b4c611ec6ce484ca21bb4ed97255d4d
msgid "We can use the `ollama.chat` method to directly query the underlying API:"
msgstr "我们可以使用`ollama.chat`方法直接查询底层API:"
#: ../../Qwen/source/framework/function_call.md:698
#: 94735aa651304d088c97dadacd7c456b
msgid "The main fields in the response could be:"
msgstr "响应中的主要字段可能是:"
#: ../../Qwen/source/framework/function_call.md:713
#: 347166b959a2441c856d78ec1b964233
msgid "Ollama's tool call parser has succeeded in parsing the tool calls. If not, you may refine [the `try_parse_tool_calls` function above](#parse-function). Then, we can obtain the tool results and add them to the messages. The following is basically the same with `transformers`:"
msgstr "Ollama的工具调用解析器成功解析出了工具调用。如果没有成功,您可以改进[上面的`try_parse_tool_calls`函数](#parse-function)。然后,我们可以获取工具结果并将其添加到消息中。以下操作与`transformers`基本相同:"
#: ../../Qwen/source/framework/function_call.md:736
#: ../../Qwen/source/framework/function_call.md:886
#: 3413d71073e543e793ff7a41961402dc e3daa8f3d7a74680b3412b5ad71936fc
msgid "The messages are now like"
msgstr "现在消息如下:"
#: ../../Qwen/source/framework/function_call.md:752
#: bbeba05f6afa4ac8b7aa8116a2968155
msgid "The rest are easy:"
msgstr "剩下的部分很简单:"
#: ../../Qwen/source/framework/function_call.md:763
#: ed728030eac04637b75496c8f9dc8d42
msgid "The final message should be like the following:"
msgstr "最终的消息应该如下所示:"
#: ../../Qwen/source/framework/function_call.md:555
#: ../../Qwen/source/framework/function_call.md:769
#: d1f29f8199d44329a0f14b51930c9bb8 e0704a724b0842c1bdaa29dd46f0b21a
msgid "vLLM"
msgstr ""
#: ../../Qwen/source/framework/function_call.md:771
#: 055e984addc44e04a88e762a01dd54e0
msgid "vLLM is a fast and easy-to-use library for LLM inference and serving. It uses the tokenizer from `transformers` to format the input, so we should have no trouble preparing the input. In addition, vLLm also implements helper functions so that generated tool calls can be parsed automatically if the format is supported."
msgstr "vLLM 是一个快速且易于使用的库,用于大型语言模型的推理和部署。它使用 `transformers` 中的分词器来格式化输入,因此我们在准备输入时应该不会遇到任何问题。此外,vLLM 还实现了辅助函数,以便在支持的情况下自动解析生成的工具调用。"
#: ../../Qwen/source/framework/function_call.md:775
#: 34aeafdc888f4d109f08c6cb46b80d03
msgid "Tool support has been available in `vllm` since v0.6.0. Be sure to install a version that supports tool use. For more information, check the [vLLM documentation](https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html#tool-calling-in-the-chat-completion-api)."
msgstr "工具支持自 v0.6.0 版本起已在 `vllm` 中可用。请确保安装了一个支持工具调用的版本。更多信息,请查阅 [vLLM 文档](https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html#tool-calling-in-the-chat-completion-api)"
#: ../../Qwen/source/framework/function_call.md:779
#: 926aa4074d81422f94dba12bd220cf22
msgid "For this guide, we are at version v0.6.1.post2. We will use the OpenAI-Compatible API by `vllm` with the API client from the `openai` Python library."
msgstr "在本指南中,我们使用的是 v0.6.1.post2 版本。我们将使用 `vllm` 提供的 OpenAI 兼容 API,并通过 `openai` Python 库的 API 客户端来进行操作。"
#: ../../Qwen/source/framework/function_call.md:784
#: 18d4ce8e46714651a01fc8c6b8c30587
msgid "For Qwen2.5, the chat template in tokenizer_config.json has already included support for the Hermes-style tool use. We simply need to start a OpenAI-compatible API with vLLM:"
msgstr "对于 Qwen2.5,`tokenizer_config.json` 中的聊天模板已经包含了对 Hermes 风格工具调用的支持。我们只需要启动一个由 vLLM 提供的 OpenAI 兼容 API 即可:"
#: ../../Qwen/source/framework/function_call.md:797
#: 2c4b4f2aa0414a7baeeb48701898c09c
msgid "Let's also initialize the client:"
msgstr "我们先初始化API客户端:"
#: ../../Qwen/source/framework/function_call.md:815
#: 0aa57070a73c4785a371c2a454ce9360
msgid "We can use the create chat completions endpoint to query the model:"
msgstr "我们可以使用create chat completions endpoint直接查询底层API:"
#: ../../Qwen/source/framework/function_call.md:831
#: ec216e66499d48ee8db45c7fe0a92ebb
msgid "vLLM should be able to parse the tool calls for us, and the main fields in the response (`response.choices[0]`) should be like"
msgstr "vLLM应当可以为我们解析工具调用,回复的主要字段(`response.choices[0]`)应如下所示:"
#: ../../Qwen/source/framework/function_call.md:858
#: 82944ee90fcf4223a220674c83ca0255
msgid "Note that the function arguments are JSON-formatted strings, which Qwen-Agent follows but `transformers` and Ollama differs."
msgstr "请注意这里函数的参数是JSON格式字符串,Qwen-Agent与其一致,但`transformers`和Ollama与之相异。"
#: ../../Qwen/source/framework/function_call.md:860
#: afbeea17f4974f3aa4cd04d1af81f6e1
msgid "As before, chances are that there are corner cases where tool calls are generated but they are malformed and cannot be parsed. For production code, we should try parsing by ourselves."
msgstr "如前所述,有可能存在边界情况,模型生成了工具调用但格式不良也无法被解析。对于生产代码,我们需要尝试自行解析。"
#: ../../Qwen/source/framework/function_call.md:863
#: 6679ef4d0e494546bada423b26f7427c
msgid "Then, we can obtain the tool results and add them to the messages as shown below:"
msgstr "随后,我们可以调用工具并获得结果,然后将它们加入消息中:"
#: ../../Qwen/source/framework/function_call.md:884
#: 2efb4934cb904461aa61117a3df94c1d
msgid "It should be noted that the OpenAI API uses `tool_call_id` to identify the relation between tool results and tool calls."
msgstr "这里需要注意OpenAI API使用`tool_call_id`字段来识别工具结果和工具调用间的联系。"
#: ../../Qwen/source/framework/function_call.md:902
#: bbcb57b29a374c25bf6114fd4ca1e44a
msgid "Let's call the endpoint again to seed the tool results and get response:"
msgstr "让我们再次查询接口,以给模型提供工具结果并获得回复:"
#: ../../Qwen/source/framework/function_call.md:919
#: 7a2c0dfa5cff41df911c16597fd6166d
msgid "The final response (`response.choices[0].message.content`) should be like"
msgstr "最终响应(`response.choices[0].message.content`)应如下所示"
#: ../../Qwen/source/framework/function_call.md:924
#: af3cbc40b1b440868c89b2db695caf78
msgid "Discussions"
msgstr "小结"
#: ../../Qwen/source/framework/function_call.md:926
#: 00b863962f2349c1843e5caef08e7b11
msgid "Now, we have introduced how to conduct inference with function calling using Qwen2 in three different frameworks! Let's make a brief comparison."
msgstr "现在,我们已经介绍了如何使用Qwen2在三种不同的框架中通过函数调用进行推理!让我们做一个简要的比较。"
#: ../../Qwen/source/framework/function_call.md:555
#: d76096fd53be4eedaf0356aefa56711d
msgid "Item"
msgstr "项目"
#: ../../Qwen/source/framework/function_call.md:555
#: c3673767712f4e08888b6be0600889e4
msgid "OpenAI API"
msgstr ""
#: ../../Qwen/source/framework/function_call.md:555
#: 82945e42496f4f41aec5173a035e1b57
msgid "Type"
msgstr "类型"
#: ../../Qwen/source/framework/function_call.md:555
#: 1815e9c221894b3babd8a442fa32bcbd 4b2955158e474544b12f4da04ad815f9
#: 6cab1554299047a1aad0c6e1f86ee5cf a5ebb0d610af4661bee4ffc8d041b819
msgid "HTTP API"
msgstr ""
#: ../../Qwen/source/framework/function_call.md:555
#: 043bdab8051049ac8cf9e76c535ade2b 89650682599d48708f5784a805fa4e0b
msgid "Python Library"
msgstr "Python库"
#: ../../Qwen/source/framework/function_call.md:555
#: f90552caaf474c0fa1df085f82461eef
msgid "Inference Backend"
msgstr "推理后端"
#: ../../Qwen/source/framework/function_call.md:555
#: 6cc61d9e39fe4eeaa2bbacc9dc576fdb e70b9a499e5f454ba898c004872f531b
msgid "-"
msgstr ""
#: ../../Qwen/source/framework/function_call.md:555
#: 59f22256744d416f98d03df8f2278c5f ffcd7f68799c4a499a2c14135aec2b87
msgid "PyTorch"
msgstr ""
#: ../../Qwen/source/framework/function_call.md:555
#: a9e2e27a54c542e0a62d2dfa88852b10
msgid "llama.cpp"
msgstr ""
#: ../../Qwen/source/framework/function_call.md:555
#: 0024df8a279c4bb4a8a15785807595d1
msgid "Templating Backend"
msgstr "模板后端"
#: ../../Qwen/source/framework/function_call.md:555
#: 0a624fffdba54431b5e296c3aacf622d 2d81ae6fb5994f4e88303841395a8f05
msgid "Jinja"
msgstr ""
#: ../../Qwen/source/framework/function_call.md:555
#: 3ec6d091c75d47838ca192daccd85a8b
msgid "Go `text/template`"
msgstr ""
#: ../../Qwen/source/framework/function_call.md:555
#: c538c37491c14245995e39510fc3488a
msgid "Python"
msgstr ""
#: ../../Qwen/source/framework/function_call.md:555
#: bc969123acb444f49b6f6f34fa1a765b
msgid "Tools/Functions"
msgstr ""
#: ../../Qwen/source/framework/function_call.md:555
#: 0099d942538c498789553fad009c1ab0 0285033e57ff48e19d792bbdd164c0be
#: 59560c34831d48db925d9e56e862c152 9de65d6f04584de483b2224e91424c03
msgid "Tools"
msgstr ""
#: ../../Qwen/source/framework/function_call.md:555
#: 32250c9196b64591833201139c23afc9
msgid "Functions"
msgstr ""
#: ../../Qwen/source/framework/function_call.md:555
#: 38fed11b5de64e1dba338d98a9b83acd
msgid "Parallel Calls"
msgstr "并行调用"
#: ../../Qwen/source/framework/function_call.md:555
#: e0ac64fc47db455c9dbc435b9e944146
msgid "Default Yes (Configurable)"
msgstr "默认是(可配置)"
#: ../../Qwen/source/framework/function_call.md:555
#: 316ff3cad523414d87967f17ec0d8ca6 62922963f9ef49b9b759e5b1c88c241f
#: c0513c97d7964c708d6858315d1e64de
msgid "Yes"
msgstr "是"
#: ../../Qwen/source/framework/function_call.md:555
#: 53188b15532042388b3d90cf06546c0e
msgid "Default No (Configurable)"
msgstr "默认否(可配置)"
#: ../../Qwen/source/framework/function_call.md:555
#: 7885d6b81a36481f8cb099f1c0fe9635
msgid "Call Format"
msgstr "调用格式"
#: ../../Qwen/source/framework/function_call.md:555
#: 365e626c700b47e2b943c616796ad4e7 915ff0bab0534d7399db78c8f80177fc
#: ca4dcf05fd06448387b191655a3eb286 d74bbd9ef649402e97a8a92fd1669646
msgid "Single assistant message with `tool_calls`"
msgstr "带有`tool_calls`的单个助手消息"
#: ../../Qwen/source/framework/function_call.md:555
#: 0f3197469a194ebdab97cc71290d76c5
msgid "Multiple assistant messages with `function_call`"
msgstr "带有`function_call`的多个助手消息"
#: ../../Qwen/source/framework/function_call.md:555
#: 8c9ca98e2d504ffcb7a01c44511ff570
msgid "Call Argument Format"
msgstr "调用参数格式"
#: ../../Qwen/source/framework/function_call.md:555
#: 20a03bfe9dc64de1b50fbbc02d704fb4 7e738ab62348417d8853aeaa45c6c91e
#: ccc62ac381c6488ba814d5a11a846dc9
msgid "string"
msgstr ""
#: ../../Qwen/source/framework/function_call.md:555
#: 021b4af5157d4ba4b329dd5b20c01424 f450234ab42e4a4b82d5f1bd2a3bc6b3
msgid "object"
msgstr ""
#: ../../Qwen/source/framework/function_call.md:555
#: 821091c4a16f4b13b76c8c6b3ddb0288
msgid "Call Result Format"
msgstr "调用结果格式"
#: ../../Qwen/source/framework/function_call.md:555
#: a09900cc7d3b403bbbe3539ac5b30dcc ba591284bea645129938adf064ee939a
#: cbba7c99bad7497b85bec11223978cc9 cde37dd4f93546a08987269cbf5889cd
msgid "Multiple tool messages with `content`"
msgstr "带有`content`的多个工具消息"
#: ../../Qwen/source/framework/function_call.md:555
#: c760b606407d46499fcff6b5fb498a16
msgid "Multiple function messages with `content`"
msgstr "带有`content`的多个函数消息"
#: ../../Qwen/source/framework/function_call.md:941
#: f72961ec5f5448849f916c3fa2fab7aa
msgid "There are some details not shown in the above table:"
msgstr "上表中有些特性未被体现:"
#: ../../Qwen/source/framework/function_call.md:942
#: 1874dbb8d1814d7a91e949f90e56f0eb
msgid "OpenAI API comes with Python, Node.js, Go, and .NET SDKs. It also follows the OpenAPI standard."
msgstr "OpenAI API附带了Python、Node.js、Go和.NET SDK。它还遵循OpenAPI标准。"
#: ../../Qwen/source/framework/function_call.md:943
#: 3c70aea4b962498da3822193c6052cb8
msgid "Ollama comes with Python and Node.js SDKs. It has OpenAI-compatible API at a different base url that can be accessed using OpenAI API SDK."
msgstr "Ollama附带了Python和Node.js SDK。它在不同的base URL上具有与OpenAI兼容的API,可以使用OpenAI API SDK访问。"
#: ../../Qwen/source/framework/function_call.md:944
#: 01ff2cb248df45049c9c6dd1b415bba8
msgid "Qwen-Agent as an application framework can call the tools automatically for you, which is introduced in [the Qwen-Agent guide](./qwen_agent)."
msgstr "作为应用程序框架,Qwen-Agent可以自动为您调用工具,这在[Qwen-Agent指南](./qwen_agent)中有所介绍。"
#: ../../Qwen/source/framework/function_call.md:947
#: 515bc99e272149938b4c2add0deee022
msgid "In addition, there are more on the model side of function calling, which means you may need to consider more things in production code:"
msgstr "此外,在函数调用的模型方面还有更多内容,这意味着您可能需要在生产代码中考虑更多的事情:"
#: ../../Qwen/source/framework/function_call.md:948
#: d988e24a141d430980204d8badc9a44b
msgid "**Accuracy of function calling**: When it comes to evaluate the accuracy of function calling, there are two aspects: (a) whether the correct functions (including no functions) are selected and (b) whether the correct function arguments are generated. It is not always the case that Qwen2.5 will be accurate. Function calling can involve knowledge that is deep and domain-specific. Sometimes, it doesn't fully understand the function and select the wrong one by mistake. Sometimes, it can fall into a loop and require calling the same function again and again. Sometimes, it will fabricate required function arguments instead of asking the user for input. To improve the function calling accuracy, it is advised to first try prompt engineering: does a more detailed function description help? can we provide instructions and examples to the model in the system message? If not, finetuning on your own data could also improve performance."
msgstr "**函数调用准确性**:在评估函数调用的准确性时,有两个方面:(a) 是否选择了正确的函数(包括没有函数)以及(b) 是否生成了正确的函数参数。Qwen2.5并不总是准确的。函数调用可能涉及深入且领域特定的知识。有时,它不能完全理解函数并错误地选择了错误的函数。有时,它可能会陷入循环,需要反复调用相同的函数。有时,它会伪造所需的函数参数而不是向用户请求输入。为了提高函数调用的准确性,建议首先尝试提示工程:更详细的函数描述是否有所帮助?我们是否可以在系统消息中为模型提供指导和示例?如果没有,使用自己的数据进行微调也可以提高性能。"
#: ../../Qwen/source/framework/function_call.md:961
#: 4215aa1f6ddd4cf29be2e121fc6313ff
msgid "**Protocol consistency**: Even with the proper function calling template, the protocol may break. The model may generate extra texts to tool calls, e.g., explanations. The generated tool call may be invalid JSON-formatted string but a representation of a Python dict The generated tool call may be valid JSON but not conforms to the provided JSON Schema. For those kinds of issues, while some of them could be addressed with prompt engineering, some are caused by the nature of LLMs and can be hard to resolve in a general manner by LLMs themselves. While we strive to improve Qwen2.5 in this regard, edge cases are unlikely to be eliminated completely."
msgstr "**协议一致性**:即使具备恰当的函数调用模板,协议也可能被破坏。模型可能会在工具调用中生成额外文本,例如解释说明。生成的工具调用可能是无效的JSON格式字符串,但是是Python dict的字符串表示;生成的工具调用可能是有效的JSON,但不符合提供的JSON Schema。对于这类问题,虽然有些可以通过提示工程解决,但有些是由大型语言模型的本质引起的,很难由大模型本身以通用方式解决。尽管我们在这一方面努力改进Qwen2.5,但极端情况不太可能被完全消除。"
#: ../../Qwen/source/framework/function_call.md:970
#: ba9dfcb4b398472a977699081eb2e1af
msgid "Function Calling Templates"
msgstr "函数调用模板"
#: ../../Qwen/source/framework/function_call.md:972
#: bf17665d24494f919113d22f59aca750
msgid "The template design for function calling often includes the following aspects:"
msgstr "函数调用的模板设计通常包括以下方面:"
#: ../../Qwen/source/framework/function_call.md:973
#: 59c47a9fdc2246c08655158e700e457c
msgid "How to describe the functions to the model, so that the model understands what they are and how to use them."
msgstr "如何向模型描述这些函数,以便模型理解它们是什么以及如何使用它们。"
#: ../../Qwen/source/framework/function_call.md:974
#: 6c1d6432a8c24e19bccb999462c604bb
msgid "How to prompt the model, so that it knows that functions can be used and in what format to generate the function calls."
msgstr "如何提示模型,以便它知道可以使用函数,并以何种格式生成函数调用。"
#: ../../Qwen/source/framework/function_call.md:975
#: db33ee8dc25049218020c43a98bd82c3
msgid "How to tell a function call generation from others in generated text, so that we can extract the calls from the generated texts and actually make the calls."
msgstr "如何从生成的文本中区分函数调用与其他内容,以便我们能够从生成的文本中提取调用并实际执行调用。"
#: ../../Qwen/source/framework/function_call.md:976
#: a8365ddddf4140239d1de118b25dcecc
msgid "How to incorporate the function results to the text, so that the model can tell them from its own generation and make connection among the calls and the results."
msgstr "如何将函数结果融入文本中,以便模型能够将其与自己的生成区分开来,并在调用和结果之间建立联系。"
#: ../../Qwen/source/framework/function_call.md:978
#: e9842cf66f10461eacecec888bd6ec9e
msgid "For experienced prompt engineers, it should be possible to make any LLM support function calling, using in-context learning techniques and with representative examples, though with varied accuracy and stability depending on how \"zero-shot\" the task at hand is."
msgstr "对于经验丰富的提示工程师而言,应该有可能利用上下文学习技术和代表性示例,使任何大模型支持函数调用,尽管准确性和稳定性会根据手头任务的“零样本”程度而有所不同。"
#: ../../Qwen/source/framework/function_call.md:980
#: 27d8b9ba72444e5d89f42674af815d5f
msgid "Starting from ReAct Prompting"
msgstr "从ReAct Prompting开始"
#: ../../Qwen/source/framework/function_call.md:982
#: 895d264e474046a6b293266a43788e5a
msgid "For example, ReAct Prompting can be used to implement function calling with an extra element of planning:"
msgstr "例如,可以使用ReAct Prompting实现带有额外规划元素的函数调用:"
#: ../../Qwen/source/framework/function_call.md:983
#: 1ee0b022742b44c4984b3230b9b8d59c
msgid "**Thought**: the overt reasoning path, analyzing the functions and the user query and saying it out \"loud\""
msgstr "**Thought**:显而易见的推理路径,分析函数和用户查询,并大声“说”出来"
#: ../../Qwen/source/framework/function_call.md:984
#: 65c7b8f481e34cd287d757d7717b47d7
msgid "**Action**: the function to use and the arguments with which the function should be called"
msgstr "**Action**:要使用的函数以及调用该函数时应使用的参数"
#: ../../Qwen/source/framework/function_call.md:985
#: db5bbcc3860047b3b5022c48d1ca45f1
msgid "**Observation**: the results of the function"
msgstr "**Observation**:函数的结果"
#: ../../Qwen/source/framework/function_call.md:987
#: e0da151009e94c31a8475e2bb1e24694
msgid "In fact, Qwen2 is verse in the following variant of ReAct Prompting (similar to LangChain ReAct) to make the intermediate texts more structured:"
msgstr "实际上,Qwen2熟练掌握以下变体的ReAct Prompting(类似于LangChain ReAct),以使中间文本更具结构化:"
#: ../../Qwen/source/framework/function_call.md:1017
#: b13ae622a62f40a0bd6b79da2f9cdfe1
msgid "As you can see, there is no apparent user/assistant conversation structure in the template. The model will simply continue the texts. One should write the code to actively detect which step the model is at and in particular to add the observations in the process, until the Final Answer is generated."
msgstr "如您所见,模板中没有明显的用户/助手对话结构。模型将简单地继续文本。应该编写代码来主动检测模型处于哪个步骤,并特别在过程中添加观察结果,直到生成最终答案。"
#: ../../Qwen/source/framework/function_call.md:1021
#: cde0405e4e434d38b652f48e09c215b8
msgid "However, as most programming interfaces accept the message structure, there should be some kind of adapter between the two. [The ReAct Chat Agent](https://github.com/QwenLM/Qwen-Agent/blob/v0.0.10/qwen_agent/agents/react_chat.py) in Qwen-Agent facilitates this kind of conversion."
msgstr "然而,由于大多数编程接口接受“message”结构,两者之间应该有某种适配器。[Qwen-Agent中的ReAct Chat Agent](https://github.com/QwenLM/Qwen-Agent/blob/v0.0.10/qwen_agent/agents/react_chat.py)实现了这种转换。"
#: ../../Qwen/source/framework/function_call.md:1024
#: c477ceb3bba940cdb1ad6697a80000af
msgid "Qwen2 Function Calling Template"
msgstr "Qwen2 函数调用模板"
#: ../../Qwen/source/framework/function_call.md:1026
#: 59e9eb1bca624a3980c17f50de84eec4
msgid "As a step forward, the official Qwen2 function calling template is in the vein of the ReAct Prompting format but focuses more on"
msgstr "作为向前迈进的一步,官方的Qwen2函数调用模板沿袭了ReAct Prompting格式,但更侧重于"
#: ../../Qwen/source/framework/function_call.md:1027
#: 002b6a55039a45b192c9dff521b7c360
msgid "differentiating the keywords like `Question`, `Thought`, `Action`, etc., from generation,"
msgstr "将诸如`Question`、`Thought`、`Action`等关键词与生成区分开来,"
#: ../../Qwen/source/framework/function_call.md:1028
#: 041d040c25754ea0b50f80cf75deebf8
msgid "simplifying the process,"
msgstr "简化这一过程,"
#: ../../Qwen/source/framework/function_call.md:1029
#: 102345af873d49d8b455d3e0bb840ea7
msgid "supporting better multi-turn conversation, and"
msgstr "更好支持多轮对话,以及"
#: ../../Qwen/source/framework/function_call.md:1030
#: 7fae2cd9100a40f4b4928bd8e191c489
msgid "adding controls for specialized usage."
msgstr "为特异性使用添加控制。"
#: ../../Qwen/source/framework/function_call.md:1033
#: dd5768a0d69a4276a79e6ffbf7fba497
msgid "An equivalent example would be"
msgstr "一个等效的例子是"
#: ../../Qwen/source/framework/function_call.md:1065
#: 55d0107bfa5e47ddbbb631e6a8f7a113
msgid "Let's first list the obvious differences:"
msgstr "我们先列出明显的差异:"
#: ../../Qwen/source/framework/function_call.md:1066
#: b7f6a14154d84d6f922d350a07f72abd
msgid "Keywords (`✿FUNCTION✿`, `✿ARGS✿`, etc.) seem rare in ordinary text and more semantically related to function calling, but not special tokens yet."
msgstr "关键字(`✿FUNCTION✿`, `✿ARGS✿`等)在普通文本中似乎很少见,且与函数调用语义相关,但尚未成为特殊token。"
#: ../../Qwen/source/framework/function_call.md:1067
#: 978d3ec5660642948dab4f42e3364745
msgid "Thought is omitted. This could affect accuracy for some use cases."
msgstr "Thought被省略了。这可能会影响某些使用场景的准确性。"
#: ../../Qwen/source/framework/function_call.md:1068
#: f0876e39e84641ada6c6a0f2338e20cc
msgid "Use the system-user-assistant format for multi-turn conversations. Function calling prompting is moved to the system message."
msgstr "对于多轮对话,请采用系统-用户-助手格式。函数调用提示已移至系统消息中。"
#: ../../Qwen/source/framework/function_call.md:1070
#: 921d88ca676942ee859d82fd557b4173
msgid "How about adding controls for specialized usage? The template actually has the following variants:"
msgstr "那对于特异性使用添加的控制呢?实际上,该模板有以下变体:"
#: ../../Qwen/source/framework/function_call.md:1072
#: be16d60d0f9c4e389c256c7af46c5c88
msgid "Language: the above is for non-Chinese language; there is another template in Chinese."
msgstr "语言:上述内容适用于非中文;另有一份中文模板。"
#: ../../Qwen/source/framework/function_call.md:1073
#: 6b4267e7d3854c5db6329d7451d4b085
msgid "Parallel Calls: the above is for non-parallel calls; there is another template for parallel calls."
msgstr "并行调用:上述内容适用于非并行调用;另有一份并行调用的模板。"
#: ../../Qwen/source/framework/function_call.md:1075
#: 9b3596c169504fe08ccb3287addfdcab
msgid "In the canonical implementation in Qwen-Agent, those switches are implemented in Python, according to the configuration and current input."
msgstr "在Qwen-Agent的标准实现中,这些开关是根据配置和当前输入,用Python实现的。"
#: ../../Qwen/source/framework/function_call.md:1077
#: 947ef7d8aa5e4025b483d97410764ed7
msgid "The actual text with _parallel calls_ should be like the following:"
msgstr "带有_并行调用_的实际文本应如下所示:"
#: ../../Qwen/source/framework/function_call.md:1123
#: 935ec61058b44a7eb1fdd5a31df47e6e
msgid "This template is hard to adapt for other frameworks that use less capable templating engines. But it is doable at least partially for Jinja, which is Python-oriented after all. We didn't use it because using the template in `transformers` leads to more changes to the inference usage, which are not very common for beginners."
msgstr "这份模板很难为使用功能较弱模板引擎的其他框架进行适配。但至少部分地,对于Jinja(毕竟它是面向Python的)来说是可行的。我们没有使用它,是因为在`transformers`中使用该模板会导致推理使用方式有更多变化,而这些变化对初学者来说并不常见。"
#: ../../Qwen/source/framework/function_call.md:1127
#: 4e3c19410244459785fa13e410b209cb
msgid "For the interested, you can find the Jinja template and key points on usage below:"
msgstr "对于有兴趣的人,您可以在下方找到Jinja模板及其使用要点:"
#: ../../Qwen/source/framework/function_call.md
#: bf0cdb89d8c84bb180cee54c0f8c9274
msgid "Qwen2 Function Calling Jinja Template"
msgstr "Qwen2 函数调用Jinja模板"
#: ../../Qwen/source/framework/function_call.md:1200
#: d4defd9c03cb4b28aa9018dde74639c5
msgid "To use this template in `transformers`:"
msgstr "要在`transformers`中使用此模板:"
#: ../../Qwen/source/framework/function_call.md:1202
#: e8ca3b34c632423d8d4ae3c7e272b7b4
msgid "Switches can be enabled by passing them to the `apply_chat_template` method, e.g., `tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, parallel_tool_call=True, language=\"zh\", tokenize=False)`. By default, it is for English non-parallel function calling."
msgstr "可以通过将它们传递给`apply_chat_template`方法来启用开关,例如,`tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, parallel_tool_call=True, language=\"zh\", tokenize=False)`。默认情况下,这是用于英语非并行函数调用。"
#: ../../Qwen/source/framework/function_call.md:1204
#: 73a45eb9f800455dbfa23b73c6b8add7
msgid "The tool arguments should be a Python `dict` instead of a JSON-formatted object `str`."
msgstr "工具参数应为Python `dict`,而不是JSON格式的对象`str`。"
#: ../../Qwen/source/framework/function_call.md:1206
#: 7d3bbe9beebb475fb1bf567820a2723f
msgid "Since the generation needs to be stopped at `✿RESULT✿` or else the model will generate fabricated tool results, we should add it to `stop_strings` in `generation_config`:"
msgstr "由于生成需要在遇到`✿RESULT✿`时停止,不然模型会继续生成编造的工具结果,我们需要将这些字符串加到`generation_config`中的`stop_strings`字段:"
#: ../../Qwen/source/framework/function_call.md:1211
#: ca9c67ecfacd49918c85b52101ee2ca3
msgid "As a result of using `stop_strings`, you need to pass the tokenizer to `model.generate` as `model.generate(**inputs, tokenizer=tokenizer, max_new_tokens=512)`."
msgstr "由于使用了`stop_strings`,您需要将tokenizer传递给`model.generate`,即`model.generate(**inputs, tokenizer=tokenizer, max_new_tokens=512)`。"
#: ../../Qwen/source/framework/function_call.md:1213
#: f7dd23c1309c4da8bd95fb1a21ace195
msgid "`response`, i.e., the model generation based on the tool calls and tool results, may contain a leading space. You should not strip it for the model. It is resulted from the tokenization and the template design."
msgstr "基于工具调用和工具结果的模型生成,即`response`,可能包含一个前导空格。作为后续消息输入模型时,不要碰这个空格。这是由tokenization和模板设计导致的。"
#: ../../Qwen/source/framework/function_call.md:1215
#: ddc6078fa85040fd96b4c211b4ef8091
msgid "The `try_parse_tool_calls` function should also be modified accordingly."
msgstr "`try_parse_tool_calls`函数也应进行相应的修改。"
#: ../../Qwen/source/framework/function_call.md:1219
#: a959a209b43144a883a32645a1da9a7b
msgid "Qwen2.5 Function Calling Templates"
msgstr "Qwen2.5 函数调用模板\""
#: ../../Qwen/source/framework/function_call.md:1221
#: df3d4e40a6024963999e703789407c4b
msgid "For `transformers` and Ollama, we have also used templates that are easier to implement with Jinja or Go. They are variants of [the Nous Research's Hermes function calling template](https://github.com/NousResearch/Hermes-Function-Calling#prompt-format-for-function-calling). The Jinja template and the Go template should produce basically the same results. They final text should look like the following:"
msgstr "对于`transformers`和Ollama,我们也使用易于Jinja和Go实现的模板,它们是[Nous Research的Hermes函数调用模板](https://github.com/NousResearch/Hermes-Function-Calling#prompt-format-for-function-calling)的变体。Jinja模板和Go模板应基本产生相同的结果。最终文本应如下所示:"
#: ../../Qwen/source/framework/function_call.md:1266
#: 679f928fd3b64de9ae62ffbd36b8d8de
msgid "While the text may seem different from the previous one, the basic prompting structure is still the same. There are just more structural tags and more JSON-formatted strings."
msgstr "虽然文本可能与官方的有所不同,但基本的提示结构仍然相同。只是有更多结构标签和更多JSON格式的字符串。"
#: ../../Qwen/source/framework/function_call.md:1271
#: de22c8cf18f24afcbfc29fb305c8099a
msgid "There is one thing we haven't talked about: how should functions be described to the LLMs. In short, you could describe them as you would normally describe them in an API documentation, as long as you can effectively parse, validate, and execute the tool calls generated by the models. The format with JSON Schema appears a valid and common choice."
msgstr "有一件事我们尚未提及:如何向大型语言模型描述函数。简而言之,你可以像在API文档中通常描述它们那样来描述它们,只要你能有效地解析、验证并执行由模型生成的工具调用。带有JSON Schema的格式似乎是一个有效且常见的选择。"
#: ../../Qwen/source/framework/function_call.md:1276
#: 0bda938218774642b5d0296ecdd6a5bc
msgid "Finally"
msgstr "最后"
#: ../../Qwen/source/framework/function_call.md:1278
#: f981058838464bdaac62e94f18d148bd
msgid "In whichever way you choose to use function calling with Qwen2.5, keep in mind that the limitation and the perks of prompt engineering applies:"
msgstr "无论你选择哪种方式在Qwen2.5中使用函数调用,请记住提示工程的限制和优势适用:"
#: ../../Qwen/source/framework/function_call.md:1279
#: bdd5e0dee036466ba54c614e8bd254ca
msgid "It is not guaranteed that the model generation will always follow the protocol even with proper prompting or templates. Especially, for the templates that are more complex and relies more on the model itself to think and stay on track than the ones that are simpler and relies on the template and the use of control or special tokens. The latter one, of course, requires some kind of training. In production code, be prepared that if it breaks, countermeasures or rectifications are in place."
msgstr "无法保证模型生成将始终遵循协议,即使有适当的提示或模板。特别是对于那些更复杂且更多依赖于模型本身思考和保持方向的模板,而非那些更简单且依赖于模板以及控制或特殊标记使用的模板。当然,后者需要某种训练。在生产代码中,要准备好如果出现问题,采取补救措施或修正措施。"
#: ../../Qwen/source/framework/function_call.md:1283
#: e032b53f33c042ba9e84a5e16b27edeb
msgid "If in certain scenarios, the generation is not up to expectation, you can refine the template to add more instructions or constraints. While the templates mentioned here are general enough, they may not be the best or the most specific or the most concise for your use cases. The ultimate solution is fine-tuning using your own data."
msgstr "如果在某些场景下,生成结果未达到预期,你可以细化模板以添加更多指令或约束。尽管这里提到的模板足够通用,但对于你的具体使用案例,它们可能不是最佳的、最具体的或最简洁的。最终解决方案是使用你自己的数据进行微调。"
#: ../../Qwen/source/framework/function_call.md:1287
#: 091873c9749b4684891f95af4418d831
msgid "Have fun prompting!"
msgstr "享受提示的乐趣吧!"
#: ../../Qwen/source/framework/function_call.md:485
#: e73e525ef897455dbf61663090503acf
msgid "`transformers` will use `transformers.utils.get_json_schema` to generate the tool descriptions from Python functions. There are some gotchas with `get_json_schema`, and it is advised to check [its doc \\[v4.44.2\\]](https://github.com/huggingface/transformers/blob/v4.44.2/src/transformers/utils/chat_template_utils.py#L183-L288) before relying on it."
msgstr "`transformers`将使用`transformers.utils.get_json_schema`从Python函数生成工具描述。`get_json_schema`存在一些陷阱,在依赖它之前建议查看[其文档\\[v4.44.2\\]](https://github.com/huggingface/transformers/blob/v4.44.2/src/transformers/utils/chat_template_utils.py#L183-L288)。"
#: ../../Qwen/source/framework/function_call.md:488
#: b365673acafc44fdbd6b5335aff712e9
msgid "The function should use Python type hints for parameter types and has a Google-style docstring for function description and parameter descriptions."
msgstr "函数应使用Python类型注释表示参数类型,并具有Google风格的docstring用于函数描述和参数描述。"
#: ../../Qwen/source/framework/function_call.md:489
#: 5088a9323fbc4f0cb076dc1599cdfd7f
msgid "Supported types are limited, since the types needs to be mapped to JSON Schema. In particular, `typing.Literal` is not supported. You can instead add `(choices: ...)` at the end of a parameter description, which will be mapped to a `enum` type in JSON Schema."
msgstr "支持的类型有限,因为这些类型需要映射到JSON Schema。特别是,`typing.Literal`不受支持。你可以在参数描述的末尾添加`(choices: ...)`,这将在JSON Schema中映射为`enum`类型。"
#: ../../Qwen/source/framework/function_call.md:493
#: 799f082def754a39aac54e6201210023
msgid "Please be aware that all the returned results in the examples in the linked docstring are actually the content of the `function` field in the actual returned results."
msgstr "请注意,链接docstring中的所有返回结果示例实际上是实际返回结果中`function`字段的内容。"
#~ msgid "In `transformers`, the tool calls should be a field of assistant messages.[^tool_call_arg_format] Let's use a simple function called `try_parse_tool_calls` to parse the tool calls, which can be found in [the preparation code](#prepcode). This function does not cover all possible scenarios and thus is prone to errors. But it should suffice for the purpose of this guide."
#~ msgstr "在`transformers`中,工具调用应该是助手消息的一个字段[^tool_call_arg_format]。让我们使用一个简单的函数`try_parse_tool_calls`来解析工具调用,该函数可以在[准备代码](#prepcode)中找到。此函数并未涵盖所有可能场景,因此容易出错。但对于本指南的目的而言,它应该足够了。"
#~ msgid "However, note that the model generates arguments in tool calls not as a JSON object but a JSON-formatted string of the JSON object. For `transformers` and `ollama`, as the interfaces require the arguments to be JSON objects or Python dicts, there will be differences between the actual model generation and the template results for tool call arguments."
#~ msgstr "然而,请注意,模型在工具调用中生成的参数不是作为JSON对象,而是该JSON对象的JSON格式字符串。对于`transformers`和`ollama`,由于接口要求参数为JSON对象或Python字典,因此实际模型生成和模板结果之间的工具调用参数格式将存在差异。"