# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
version: 2
build:
  os: ubuntu-22.04
  tools:
    python: "3"
sphinx:
  configuration: docs/source/conf.py
# If using Sphinx, optionally build your docs in additional formats such as PDF
# formats:
#   - pdf
# Optionally declare the Python requirements required to build your docs
python:
  install:
    - requirements: docs/requirements-docs.txt
---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-8B/blob/main/LICENSE
pipeline_tag: text-generation
base_model:
- Qwen/Qwen3-8B-Base
---
# Qwen3-8B
<a href="https://chat.qwen.ai/" target="_blank" style="margin: 2px;">
<img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/>
</a>
## Qwen3 Highlights
Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features:
- **Unique support of seamless switching between thinking mode** (for complex logical reasoning, math, and coding) and **non-thinking mode** (for efficient, general-purpose dialogue) **within a single model**, ensuring optimal performance across various scenarios.
- **Significant enhancement in its reasoning capabilities**, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- **Superior human preference alignment**, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- **Expertise in agent capabilities**, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- **Support of 100+ languages and dialects** with strong capabilities for **multilingual instruction following** and **translation**.
## Model Overview
**Qwen3-8B** has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 8.2B
- Number of Parameters (Non-Embedding): 6.95B
- Number of Layers: 36
- Number of Attention Heads (GQA): 32 for Q and 8 for KV
- Context Length: 32,768 natively and [131,072 tokens with YaRN](#processing-long-texts).
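These figures can be cross-checked against the model configuration; a minimal sketch, assuming the standard `transformers` config attribute names used by Qwen-style models:
```python
from transformers import AutoConfig

# Illustrative only: inspect the hyper-parameters listed above from the hub config.
config = AutoConfig.from_pretrained("Qwen/Qwen3-8B")
print(config.num_hidden_layers)        # expected: 36
print(config.num_attention_heads)      # expected: 32 (query heads)
print(config.num_key_value_heads)      # expected: 8 (KV heads, GQA)
print(config.max_position_embeddings)  # native context window
```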
For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our [blog](https://qwenlm.github.io/blog/qwen3/), [GitHub](https://github.com/QwenLM/Qwen3), and [Documentation](https://qwen.readthedocs.io/en/latest/).
## Quickstart
The code for Qwen3 has been included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.
With `transformers<4.51.0`, you will encounter the following error:
```
KeyError: 'qwen3'
```
The following code snippet illustrates how to use the model to generate content based on given inputs.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Qwen/Qwen3-8B"
# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# conduct text completion
generated_ids = model.generate(
**model_inputs,
max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
# parsing thinking content
try:
# rindex finding 151668 (</think>)
index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
index = 0
thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
print("thinking content:", thinking_content)
print("content:", content)
```
For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.4` to create an OpenAI-compatible API endpoint:
- SGLang:
```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-8B --reasoning-parser qwen3
```
- vLLM:
```shell
vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser deepseek_r1
```
For local use, applications such as llama.cpp, Ollama, LMStudio, and MLX-LM also support Qwen3.
## Switching Between Thinking and Non-Thinking Mode
> [!TIP]
> The `enable_thinking` switch is also available in APIs created by SGLang and vLLM.
> Please refer to our documentation for [SGLang](https://qwen.readthedocs.io/en/latest/deployment/sglang.html#thinking-non-thinking-modes) and [vLLM](https://qwen.readthedocs.io/en/latest/deployment/vllm.html#thinking-non-thinking-modes) users.
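For reference, a minimal sketch of toggling thinking through such an OpenAI-compatible endpoint with the `openai` Python SDK; it assumes the vLLM server started as shown above (default port 8000) and that your framework version accepts `chat_template_kwargs` in `extra_body`, as described in the linked deployment docs:
```python
from openai import OpenAI

# Illustrative only: the endpoint and the chat_template_kwargs field depend on the serving framework/version.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.7,
    top_p=0.8,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},  # assumed switch; check your framework's docs
)
print(response.choices[0].message.content)
```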
### `enable_thinking=True`
By default, Qwen3 has thinking capabilities enabled, similar to QwQ-32B. This means the model will use its reasoning abilities to enhance the quality of generated responses. For example, when explicitly setting `enable_thinking=True` or leaving it as the default value in `tokenizer.apply_chat_template`, the model will engage its thinking mode.
```python
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True # True is the default value for enable_thinking
)
```
In this mode, the model will generate thinking content wrapped in a `<think>...</think>` block, followed by the final response.
> [!NOTE]
> For thinking mode, use `Temperature=0.6`, `TopP=0.95`, `TopK=20`, and `MinP=0` (the default setting in `generation_config.json`). **DO NOT use greedy decoding**, as it can lead to performance degradation and endless repetitions. For more detailed guidance, please refer to the [Best Practices](#best-practices) section.
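As an illustration, these sampling parameters can also be passed explicitly to `generate` instead of relying on `generation_config.json`; a minimal sketch reusing `model` and `model_inputs` from the Quickstart above:
```python
# Thinking-mode sampling: Temperature=0.6, TopP=0.95, TopK=20, MinP=0 (do not use greedy decoding).
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    min_p=0.0,
)
```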
### `enable_thinking=False`
We provide a hard switch to strictly disable the model's thinking behavior, aligning its functionality with the previous Qwen2.5-Instruct models. This mode is particularly useful in scenarios where disabling thinking is essential for enhancing efficiency.
```python
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=False # Setting enable_thinking=False disables thinking mode
)
```
In this mode, the model will not generate any think content and will not include a `<think>...</think>` block.
> [!NOTE]
> For non-thinking mode, we suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0`. For more detailed guidance, please refer to the [Best Practices](#best-practices) section.
### Advanced Usage: Switching Between Thinking and Non-Thinking Modes via User Input
We provide a soft switch mechanism that allows users to dynamically control the model's behavior when `enable_thinking=True`. Specifically, you can add `/think` and `/no_think` to user prompts or system messages to switch the model's thinking mode from turn to turn. The model will follow the most recent instruction in multi-turn conversations.
Here is an example of a multi-turn conversation:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
class QwenChatbot:
def __init__(self, model_name="Qwen/Qwen3-8B"):
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForCausalLM.from_pretrained(model_name)
self.history = []
def generate_response(self, user_input):
messages = self.history + [{"role": "user", "content": user_input}]
text = self.tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
inputs = self.tokenizer(text, return_tensors="pt")
response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
response = self.tokenizer.decode(response_ids, skip_special_tokens=True)
# Update history
self.history.append({"role": "user", "content": user_input})
self.history.append({"role": "assistant", "content": response})
return response
# Example Usage
if __name__ == "__main__":
chatbot = QwenChatbot()
# First input (without /think or /no_think tags, thinking mode is enabled by default)
user_input_1 = "How many r's in strawberries?"
print(f"User: {user_input_1}")
response_1 = chatbot.generate_response(user_input_1)
print(f"Bot: {response_1}")
print("----------------------")
# Second input with /no_think
user_input_2 = "Then, how many r's in blueberries? /no_think"
print(f"User: {user_input_2}")
response_2 = chatbot.generate_response(user_input_2)
print(f"Bot: {response_2}")
print("----------------------")
# Third input with /think
user_input_3 = "Really? /think"
print(f"User: {user_input_3}")
response_3 = chatbot.generate_response(user_input_3)
print(f"Bot: {response_3}")
```
> [!NOTE]
> For API compatibility, when `enable_thinking=True`, regardless of whether the user uses `/think` or `/no_think`, the model will always output a block wrapped in `<think>...</think>`. However, the content inside this block may be empty if thinking is disabled.
> When `enable_thinking=False`, the soft switches have no effect. Regardless of any `/think` or `/no_think` tags input by the user, the model will not generate think content and will not include a `<think>...</think>` block.
## Agentic Use
Qwen3 excels in tool calling capabilities. We recommend using [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent) to make the best use of the agentic abilities of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity.
To define the available tools, you can use an MCP configuration file, use the integrated tools of Qwen-Agent, or integrate other tools yourself.
```python
from qwen_agent.agents import Assistant
# Define LLM
llm_cfg = {
'model': 'Qwen3-8B',
# Use the endpoint provided by Alibaba Model Studio:
# 'model_type': 'qwen_dashscope',
# 'api_key': os.getenv('DASHSCOPE_API_KEY'),
# Use a custom endpoint compatible with OpenAI API:
'model_server': 'http://localhost:8000/v1', # api_base
'api_key': 'EMPTY',
# Other parameters:
# 'generate_cfg': {
# # Add: when the response content is `<think>this is the thought</think>this is the answer`;
# # Do not add: when the response has been separated into reasoning_content and content.
# 'thought_in_content': True,
# },
}
# Define Tools
tools = [
{'mcpServers': { # You can specify the MCP configuration file
'time': {
'command': 'uvx',
'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
},
"fetch": {
"command": "uvx",
"args": ["mcp-server-fetch"]
}
}
},
'code_interpreter', # Built-in tools
]
# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)
# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
pass
print(responses)
```
## Processing Long Texts
Qwen3 natively supports context lengths of up to 32,768 tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques to handle long texts effectively. We have validated the model's performance on context lengths of up to 131,072 tokens using the [YaRN](https://arxiv.org/abs/2309.00071) method.
YaRN is currently supported by several inference frameworks, e.g., `transformers` and `llama.cpp` for local use, `vllm` and `sglang` for deployment. In general, there are two approaches to enabling YaRN for supported frameworks:
- Modifying the model files:
In the `config.json` file, add the `rope_scaling` fields:
```json
{
...,
"rope_scaling": {
"type": "yarn",
"factor": 4.0,
"original_max_position_embeddings": 32768
}
}
```
For `llama.cpp`, you need to regenerate the GGUF file after the modification.
- Passing command line arguments:
For `vllm`, you can use
```shell
vllm serve ... --rope-scaling '{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072
```
For `sglang`, you can use
```shell
python -m sglang.launch_server ... --json-model-override-args '{"rope_scaling":{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
```
For `llama-server` from `llama.cpp`, you can use
```shell
llama-server ... --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
```
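With `transformers`, if you prefer not to edit `config.json`, the same override can usually be supplied at load time, since `from_pretrained` forwards configuration overrides to the model config; a minimal sketch under that assumption:
```python
from transformers import AutoModelForCausalLM

# Illustrative only: apply the YaRN rope_scaling override without modifying config.json.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    torch_dtype="auto",
    device_map="auto",
    max_position_embeddings=131072,
    rope_scaling={
        "type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768,
    },
)
```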
> [!IMPORTANT]
> If you encounter the following warning
> ```
> Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
> ```
> please upgrade `transformers>=4.51.0`.
> [!NOTE]
> All the notable open-source frameworks implement static YaRN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts.**
> We advise adding the `rope_scaling` configuration only when processing long contexts is required.
> It is also recommended to modify the `factor` as needed. For example, if the typical context length for your application is 65,536 tokens, it would be better to set `factor` as 2.0.
> [!NOTE]
> The default `max_position_embeddings` in `config.json` is set to 40,960. This allocation includes reserving 32,768 tokens for outputs and 8,192 tokens for typical prompts, which is sufficient for most scenarios involving short text processing. If the average context length does not exceed 32,768 tokens, we do not recommend enabling YaRN in this scenario, as it may potentially degrade model performance.
> [!TIP]
> The endpoint provided by Alibaba Model Studio supports dynamic YaRN by default and no extra configuration is needed.
## Best Practices
To achieve optimal performance, we recommend the following settings:
1. **Sampling Parameters**:
- For thinking mode (`enable_thinking=True`), use `Temperature=0.6`, `TopP=0.95`, `TopK=20`, and `MinP=0`. **DO NOT use greedy decoding**, as it can lead to performance degradation and endless repetitions.
- For non-thinking mode (`enable_thinking=False`), we suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0`.
- For supported frameworks, you can adjust the `presence_penalty` parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
2. **Adequate Output Length**: We recommend using an output length of 32,768 tokens for most queries. For benchmarking on highly complex problems, such as those found in math and programming competitions, we suggest setting the max output length to 38,912 tokens. This provides the model with sufficient space to generate detailed and comprehensive responses, thereby enhancing its overall performance.
3. **Standardize Output Format**: We recommend using prompts to standardize model outputs when benchmarking.
- **Math Problems**: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
- **Multiple-Choice Questions**: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`."
4. **No Thinking Content in History**: In multi-turn conversations, the historical model output should only include the final output part and does not need to include the thinking content. This is implemented in the provided Jinja2 chat template. However, for frameworks that do not directly use the Jinja2 chat template, it is up to the developers to ensure that this best practice is followed (see the sketch below).
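For frameworks that manage history manually, a minimal sketch of dropping the thinking block before an assistant turn is stored, assuming the raw reply contains a `<think>...</think>` block as described above:
```python
import re

def strip_thinking(reply: str) -> str:
    # Keep only the final answer: remove the first <think>...</think> block, if present.
    return re.sub(r"<think>.*?</think>", "", reply, count=1, flags=re.DOTALL).lstrip("\n")

# Hypothetical usage: raw_reply is the decoded model output for one turn.
raw_reply = "<think>\nsome reasoning\n</think>\n\nFinal answer."
print(strip_thinking(raw_reply))  # -> "Final answer."
```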
### Citation
If you find our work helpful, feel free to cite it.
```
@misc{qwen3,
title = {Qwen3},
url = {https://qwenlm.github.io/blog/qwen3/},
author = {Qwen Team},
month = {April},
year = {2025}
}
```
# Qwen3
Qwen3 is the latest generation of the Qwen series of large language models, offering both dense and mixture-of-experts (MoE) models that switch between thinking and non-thinking modes within a single model, covering dialogue, reasoning, coding, agent, and multilingual scenarios.
## Paper
`None`
## Model Architecture
Qwen3 adopts a standard decoder-only architecture and introduces MoE to improve performance. It is the first "hybrid reasoning model", integrating "fast thinking" and "slow thinking" into a single model.
<div align=center>
<img src="./doc/qwen.png"/>
</div>
## Algorithm Principle
The input is embedded and then passed through attention, FFN, and other layers to extract features. Finally, Softmax converts the unnormalized score vector (logits) produced by the last decoder layer into a probability distribution, in which each element is the probability of generating the corresponding token; the model can therefore produce a distribution and select the most likely token as its prediction.
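As an illustration of this final step, a minimal PyTorch sketch that turns last-layer logits into a probability distribution and picks the most likely next token (toy values, not tied to the real vocabulary size):
```python
import torch
import torch.nn.functional as F

# Toy logits over a 5-token vocabulary, as produced by the last decoder layer.
logits = torch.tensor([2.0, 0.5, -1.0, 3.0, 0.0])
probs = F.softmax(logits, dim=-1)           # normalize into a probability distribution
next_token_id = torch.argmax(probs).item()  # greedy choice of the most likely token
print(probs, next_token_id)
```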
## Environment Setup
```
mv Qwen3_pytorch Qwen3  # strip the framework suffix from the directory name
```
### Docker (Option 1)
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.4.1-ubuntu22.04-dtk25.04-py3.10-fixpy
# Replace <your IMAGE ID> with the ID of the image pulled above; for this image it is e77c15729879
docker run -it --shm-size=64G -v $PWD/Qwen3:/home/Qwen3 -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name qwen3 <your IMAGE ID> bash
cd /home/Qwen3
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple
```
### Dockerfile (Option 2)
```
cd /home/Qwen3/docker
docker build --no-cache -t qwen3:latest .
docker run --shm-size=64G --name qwen3 -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v $PWD/../../Qwen3:/home/Qwen3 -it qwen3 bash
# If installing the environment through the Dockerfile takes too long, comment out its pip install step and install the Python packages after the container starts: pip install -r requirements.txt
```
### Anaconda (Option 3)
1. The special deep learning libraries required for DCU cards in this project can be downloaded from the Guanghe (光合) developer community:
- https://developer.hpccube.com/tool/
```
DTK driver: dtk2504
python: python3.10
torch: 2.4.1
torchvision: 0.19.1
triton: 3.0.0
vllm: 0.6.2
flash-attn: 2.6.1
deepspeed: 0.14.2
apex: 1.4.0
transformers: 4.51.0
```
`Tip: the versions of the DTK driver, python, torch, and the other DCU-related tools above must correspond to one another exactly.`
2. Install the remaining, non-DCU-specific libraries according to requirements.txt:
```
cd /home/Qwen3
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple
```
## Dataset
`None`
## Training
## Inference
Directory structure of the pretrained weights:
```
/home/Qwen3/
└── Qwen/Qwen3-8B
```
### Single-Node Multi-Card
```
# This project uses Qwen3-8B as the example; other Qwen3 models work analogously.
cd /home/Qwen3
python infer_transformers.py
# vllm>=0.8.4 is still being adapted; vLLM-based inference will be released later.
```
For more information, refer to [`README_orgin`](./README_orgin.md) from the original project.
## Results
`Input:`
```
prompt: "Give me a short introduction to large language models."
```
`Output:`
```
<think>
Okay, the user wants a short introduction to large language models. Let me start by defining what they are. I should mention they're AI systems trained on massive text data. Maybe include how they process and generate human-like text. Also, touch on their applications like answering questions, creating content, coding. Need to keep it concise but cover the key points. Oh, and maybe mention their size, like parameters, but not too technical. Avoid jargon. Make sure it's easy to understand. Let me check if I'm missing anything important. Oh, maybe a sentence about their training process? Or just stick to the basics. Alright, structure: definition, training data, capabilities, applications. Keep each part brief. That should work.
</think>
Large language models (LLMs) are advanced artificial intelligence systems trained on vast amounts of text data to understand and generate human-like language. They can process and respond to complex queries, create written content, code, and even engage in conversations. These models, often with billions of parameters, excel at tasks like answering questions, summarizing information, and translating languages, making them versatile tools for various applications, from customer service to research and creative writing.
```
### Accuracy
DCU accuracy is consistent with GPU accuracy. Inference framework: PyTorch.
## Application Scenarios
### Algorithm Category
`Dialogue / Question Answering`
### Key Application Industries
`Manufacturing, media, finance, energy, healthcare, smart home, education`
## Pretrained Weights
ModelScope download: [Qwen/Qwen3-8B](https://www.modelscope.cn/Qwen/Qwen3-8B.git)
## Source Repository and Issue Feedback
- http://developer.sourcefind.cn/codes/modelzoo/Qwen3_pytorch.git
## References
- https://github.com/QwenLM/Qwen3.git
# Qwen3
<p align="center">
<img src="https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/logo_qwen3.png" width="400"/>
</p>
<p align="center">
💜 <a href="https://chat.qwen.ai/"><b>Qwen Chat</b></a>&nbsp&nbsp | &nbsp&nbsp🤗 <a href="https://huggingface.co/Qwen">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/organization/qwen">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp 📑 Paper &nbsp&nbsp | &nbsp&nbsp 📑 <a href="https://qwenlm.github.io/blog/qwen3/">Blog</a> &nbsp&nbsp | &nbsp&nbsp📖 <a href="https://qwen.readthedocs.io/">Documentation</a>
<br>
🖥️ <a href="https://huggingface.co/spaces/Qwen/Qwen3-Demo">Demo</a>&nbsp&nbsp | &nbsp&nbsp💬 <a href="https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png">WeChat (微信)</a>&nbsp&nbsp | &nbsp&nbsp🫨 <a href="https://discord.gg/CV4E9rpNSD">Discord</a>&nbsp&nbsp
</p>
Visit our Hugging Face or ModelScope organization (click links above), search checkpoints with names starting with `Qwen3-` or visit the [Qwen3 collection](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f), and you will find all you need! Enjoy!
To learn more about Qwen3, feel free to read our documentation \[[EN](https://qwen.readthedocs.io/en/latest/)|[ZH](https://qwen.readthedocs.io/zh-cn/latest/)\]. Our documentation consists of the following sections:
- Quickstart: the basic usages and demonstrations;
- Inference: the guidance for the inference with Transformers, including batch inference, streaming, etc.;
- Run Locally: the instructions for running LLM locally on CPU and GPU, with frameworks like llama.cpp and Ollama;
- Deployment: the demonstration of how to deploy Qwen for large-scale inference with frameworks like SGLang, vLLM, TGI, etc.;
- Quantization: the practice of quantizing LLMs with GPTQ, AWQ, as well as the guidance for how to make high-quality quantized GGUF files;
- Training: the instructions for post-training, including SFT and RLHF (TODO) with frameworks like Axolotl, LLaMA-Factory, etc.
- Framework: the usage of Qwen with frameworks for application, e.g., RAG, Agent, etc.
## Introduction
We are excited to announce the release of Qwen3, the latest addition to the Qwen family of large language models.
These models represent our most advanced and intelligent systems to date, building on our experience with QwQ and Qwen2.5.
We are making the weights of Qwen3 available to the public, including both dense and Mixture-of-Experts (MoE) models.
The highlights from Qwen3 include:
- **Dense and Mixture-of-Experts (MoE) models of various sizes**, available in 0.6B, 1.7B, 4B, 8B, 14B, 32B and 30B-A3B, 235B-A22B.
- **Seamless switching between thinking mode** (for complex logical reasoning, math, and coding) and **non-thinking mode** (for efficient, general-purpose chat), ensuring optimal performance across various scenarios.
- **Significant enhancement in reasoning capabilities**, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- **Superior human preference alignment**, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- **Expertise in agent capabilities**, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- **Support of 100+ languages and dialects** with strong capabilities for **multilingual instruction following** and **translation**.
> [!IMPORTANT]
> Qwen3 models adopt a different naming scheme.
>
> The post-trained models do not use the "-Instruct" suffix any more. For example, Qwen3-32B is the newer version of Qwen2.5-32B-Instruct.
>
> The base models now have names ending with "-Base".
## News
- 2025.04.29: We released the Qwen3 series. Check our [blog](https://qwenlm.github.io/blog/qwen3) for more details!
- 2024.09.19: We released the Qwen2.5 series. This time there are 3 extra model sizes: 3B, 14B, and 32B for more possibilities. Check our [blog](https://qwenlm.github.io/blog/qwen2.5) for more!
- 2024.06.06: We released the Qwen2 series. Check our [blog](https://qwenlm.github.io/blog/qwen2/)!
- 2024.03.28: We released the first MoE model of Qwen: Qwen1.5-MoE-A2.7B! Temporarily, only HF transformers and vLLM support the model. We will soon add the support of llama.cpp, mlx-lm, etc. Check our [blog](https://qwenlm.github.io/blog/qwen-moe/) for more information!
- 2024.02.05: We released the Qwen1.5 series.
## Performance
Detailed evaluation results are reported in this <a href="https://qwenlm.github.io/blog/qwen3/"> 📑 blog</a>.
For requirements on GPU memory and the respective throughput, see results [here](https://qwen.readthedocs.io/en/latest/getting_started/speed_benchmark.html).
## Run Qwen3
### 🤗 Transformers
Transformers is a library of pretrained natural language processing models for inference and training.
The latest version of `transformers` is recommended and `transformers>=4.51.0` is required.
The following code snippet illustrates how to use the model to generate content based on given inputs.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Qwen/Qwen3-8B"
# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# conduct text completion
generated_ids = model.generate(
**model_inputs,
max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
# the result will begin with thinking content in <think></think> tags, followed by the actual response
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```
By default, Qwen3 models will think before responding.
This can be controlled by
- `enable_thinking=False`: Passing `enable_thinking=False` to `tokenizer.apply_chat_template` will strictly prevent the model from generating thinking content.
- `/think` and `/no_think` instructions: Use these words in the system or user message to signify whether Qwen3 should think. In multi-turn conversations, the latest instruction is followed.
### ModelScope
We strongly advise users, especially those in mainland China, to use ModelScope.
ModelScope adopts a Python API similar to Transformers.
The CLI tool `modelscope download` can help you solve issues concerning downloading checkpoints.
### llama.cpp
[`llama.cpp`](https://github.com/ggml-org/llama.cpp) enables LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware.
`llama.cpp>=b5092` is required.
To use the CLI, run the following in a terminal:
```shell
./llama-cli -hf Qwen/Qwen3-8B-GGUF:Q8_0 --jinja --color -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -c 40960 -n 32768 --no-context-shift
# CTRL+C to exit
```
To use the API server, run the following in a terminal:
```shell
./llama-server -hf Qwen/Qwen3-8B-GGUF:Q8_0 --jinja --reasoning-format deepseek -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -c 40960 -n 32768 --no-context-shift --port 8080
```
A simple web front end will be at `http://localhost:8080` and an OpenAI-compatible API will be at `http://localhost:8080/v1`.
For additional guides, please refer to [our documentation](https://qwen.readthedocs.io/en/latest/run_locally/llama.cpp.html).
### Ollama
After [installing ollama](https://ollama.com/), you can initiate the ollama service with the following command:
```shell
ollama serve
# You need to keep this service running whenever you are using ollama
```
To pull a model checkpoint and run the model, use the `ollama run` command. You can specify a model size by adding a suffix to `qwen3`, such as `:8b` or `:30b-a3b`:
```shell
ollama run qwen3:8b
# To exit, type "/bye" and press ENTER
```
You can also access the ollama service via its OpenAI-compatible API.
Please note that you need to (1) keep `ollama serve` running while using the API, and (2) execute `ollama run qwen3:8b` before utilizing this API to ensure that the model checkpoint is prepared.
The API is at `http://localhost:11434/v1/` by default.
For additional details, please visit [ollama.ai](https://ollama.com/).
### LMStudio
Qwen3 is already supported by [lmstudio.ai](https://lmstudio.ai/). You can directly use LMStudio with our GGUF files.
### MLX-LM
If you are running on Apple Silicon, [`mlx-lm`](https://github.com/ml-explore/mlx-lm) also supports Qwen3 (`mlx-lm>=0.24.0`).
Look for models ending with MLX on HuggingFace Hub.
<!-- ### OpenVINO
Qwen2.5 has already been supported by [OpenVINO toolkit](https://github.com/openvinotoolkit). You can install and run this [chatbot example](https://github.com/OpenVINO-dev-contest/Qwen2.openvino) with Intel CPU, integrated GPU or discrete GPU. -->
<!-- ### Text generation web UI
You can directly use [`text-generation-webui`](https://github.com/oobabooga/text-generation-webui) for creating a web UI demo. If you use GGUF, remember to install the latest wheel of `llama.cpp` with the support of Qwen2.5. -->
<!-- ### llamafile
Clone [`llamafile`](https://github.com/Mozilla-Ocho/llamafile), run source install, and then create your own llamafile with the GGUF file following the guide [here](https://github.com/Mozilla-Ocho/llamafile?tab=readme-ov-file#creating-llamafiles). You are able to run one line of command, say `./qwen.llamafile`, to create a demo. -->
## Deploy Qwen3
Qwen3 is supported by multiple inference frameworks.
Here we demonstrate the usage of `SGLang` and `vLLM`.
You can also find Qwen3 models from various inference providers, e.g., [Alibaba Cloud Model Studio](https://www.alibabacloud.com/en/product/modelstudio).
### SGLang
[SGLang](https://github.com/sgl-project/sglang) is a fast serving framework for large language models and vision language models.
SGLang can be used to launch a server with an OpenAI-compatible API service.
`sglang>=0.4.6.post1` is required.
It is as easy as
```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-8B --port 30000 --reasoning-parser qwen3
```
An OpenAI-compatible API will be available at `http://localhost:30000/v1`.
### vLLM
[vLLM](https://github.com/vllm-project/vllm) is a high-throughput and memory-efficient inference and serving engine for LLMs.
`vllm>=0.8.4` is required.
```shell
vllm serve Qwen/Qwen3-8B --port 8000 --enable-reasoning --reasoning-parser deepseek_r1
```
An OpenAI-compatible API will be available at `http://localhost:8000/v1`.
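Once either server is running, a minimal sketch of querying it with the `openai` Python SDK (port and model name as assumed above); with a reasoning parser enabled, the thinking content is typically returned in a separate `reasoning_content` field rather than in `content`:
```python
from openai import OpenAI

# Illustrative client for the OpenAI-compatible server started above (vLLM on port 8000).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.6,
    top_p=0.95,
)
message = response.choices[0].message
print(getattr(message, "reasoning_content", None))  # thinking content, if the reasoning parser exposes it
print(message.content)
```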
### MindIE
For deployment on Ascend NPUs, please visit [Modelers](https://modelers.cn/) and search for Qwen3.
<!--
### OpenLLM
[OpenLLM](https://github.com/bentoml/OpenLLM) allows you to easily run Qwen2.5 as OpenAI-compatible APIs. You can start a model server using `openllm serve`. For example:
```bash
openllm serve qwen2.5:7b
```
The server is active at `http://localhost:3000/`, providing OpenAI-compatible APIs. You can create an OpenAI client to call its chat API. For more information, refer to [our documentation](https://qwen.readthedocs.io/en/latest/deployment/openllm.html). -->
## Build with Qwen3
### Tool Use
For tool use capabilities, we recommend taking a look at [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent), which provides a wrapper around these APIs to support tool use or function calling with MCP support.
Tool use with Qwen3 can also be conducted with SGLang, vLLM, Transformers, llama.cpp, Ollama, etc.
Follow guides in our documentation to see how to enable the support.
### Finetuning
We advise you to use training frameworks, including [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl), [unsloth](https://github.com/unslothai/unsloth), [Swift](https://github.com/modelscope/swift), [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory), etc., to finetune your models with SFT, DPO, GRPO, etc.
## License Agreement
All our open-source models are licensed under Apache 2.0.
You can find the license files in the respective Hugging Face repositories.
## Citation
If you find our work helpful, feel free to cite it.
```
@article{qwen2.5,
title = {Qwen2.5 Technical Report},
author = {An Yang and Baosong Yang and Beichen Zhang and Binyuan Hui and Bo Zheng and Bowen Yu and Chengyuan Li and Dayiheng Liu and Fei Huang and Haoran Wei and Huan Lin and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Yang and Jiaxi Yang and Jingren Zhou and Junyang Lin and Kai Dang and Keming Lu and Keqin Bao and Kexin Yang and Le Yu and Mei Li and Mingfeng Xue and Pei Zhang and Qin Zhu and Rui Men and Runji Lin and Tianhao Li and Tingyu Xia and Xingzhang Ren and Xuancheng Ren and Yang Fan and Yang Su and Yichang Zhang and Yu Wan and Yuqiong Liu and Zeyu Cui and Zhenru Zhang and Zihan Qiu},
journal = {arXiv preprint arXiv:2412.15115},
year = {2024}
}
@article{qwen2,
title = {Qwen2 Technical Report},
author = {An Yang and Baosong Yang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Zhou and Chengpeng Li and Chengyuan Li and Dayiheng Liu and Fei Huang and Guanting Dong and Haoran Wei and Huan Lin and Jialong Tang and Jialin Wang and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Ma and Jin Xu and Jingren Zhou and Jinze Bai and Jinzheng He and Junyang Lin and Kai Dang and Keming Lu and Keqin Chen and Kexin Yang and Mei Li and Mingfeng Xue and Na Ni and Pei Zhang and Peng Wang and Ru Peng and Rui Men and Ruize Gao and Runji Lin and Shijie Wang and Shuai Bai and Sinan Tan and Tianhang Zhu and Tianhao Li and Tianyu Liu and Wenbin Ge and Xiaodong Deng and Xiaohuan Zhou and Xingzhang Ren and Xinyu Zhang and Xipin Wei and Xuancheng Ren and Yang Fan and Yang Yao and Yichang Zhang and Yu Wan and Yunfei Chu and Yuqiong Liu and Zeyu Cui and Zhenru Zhang and Zhihao Fan},
journal = {arXiv preprint arXiv:2407.10671},
year = {2024}
}
```
## Contact Us
If you are interested in leaving a message for either our research team or product team, join our [Discord](https://discord.gg/z3GAxXZ9Ce) or [WeChat groups](assets/wechat.png)!
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.4.1-ubuntu22.04-dtk25.04-py3.10-fixpy
ENV DEBIAN_FRONTEND=noninteractive
# RUN yum update && yum install -y git cmake wget build-essential
# RUN source /opt/dtk-dtk25.04/env.sh
# # Install pip dependencies
COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
transformers>=4.51.0
ARG CUDA_VERSION=12.1.0
ARG from=nvidia/cuda:${CUDA_VERSION}-cudnn8-devel-ubuntu20.04
FROM ${from} as base
RUN <<EOF
apt update -y && apt upgrade -y && apt install -y --no-install-recommends \
git \
git-lfs \
python3 \
python3-pip \
python3-dev \
wget \
vim \
&& rm -rf /var/lib/apt/lists/*
EOF
RUN ln -s /usr/bin/python3 /usr/bin/python
RUN git lfs install
FROM base as dev
WORKDIR /
RUN mkdir -p /data/shared/Qwen
WORKDIR /data/shared/Qwen/
FROM dev as bundle_req
RUN pip install --no-cache-dir networkx==3.1
RUN pip3 install --no-cache-dir torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu121
RUN pip3 install --no-cache-dir transformers==4.40.2 accelerate tiktoken einops scipy
FROM bundle_req as bundle_finetune
ARG BUNDLE_FINETUNE=true
RUN <<EOF
if [ "$BUNDLE_FINETUNE" = "true" ]; then
cd /data/shared/Qwen
# Full-finetune / LoRA.
pip3 install --no-cache-dir "deepspeed==0.14.2" "peft==0.11.1"
# Q-LoRA.
apt update -y && DEBIAN_FRONTEND=noninteractive apt install -y --no-install-recommends \
libopenmpi-dev openmpi-bin \
&& rm -rf /var/lib/apt/lists/*
pip3 install --no-cache-dir "optimum==1.20.0" "auto-gptq==0.7.1" "autoawq==0.2.5" mpi4py
fi
EOF
FROM bundle_finetune as bundle_vllm
ARG BUNDLE_VLLM=true
RUN <<EOF
if [ "$BUNDLE_VLLM" = "true" ]; then
cd /data/shared/Qwen
pip3 install --no-cache-dir vllm==0.4.3 "fschat[model_worker,webui]==0.2.36"
fi
EOF
FROM bundle_vllm as bundle_flash_attention
ARG BUNDLE_FLASH_ATTENTION=true
RUN <<EOF
if [ "$BUNDLE_FLASH_ATTENTION" = "true" ]; then
pip3 install --no-cache-dir flash-attn==2.5.8 --no-build-isolation
fi
EOF
FROM bundle_flash_attention as final
COPY ../examples/sft/* ./
COPY ../examples/demo/* ./
EXPOSE 80
#!/usr/bin/env bash
#
# This script will automatically pull the docker image from DockerHub and start a container to run the Qwen-Chat cli-demo.
IMAGE_NAME=qwenllm/qwen:2-cu121
QWEN_CHECKPOINT_PATH=/path/to/Qwen-Instruct
CONTAINER_NAME=qwen2
function usage() {
echo '
Usage: bash docker/docker_cli_demo.sh [-i IMAGE_NAME] -c [/path/to/Qwen-Instruct] [-n CONTAINER_NAME]
'
}
while [[ "$1" != "" ]]; do
case $1 in
-i | --image-name )
shift
IMAGE_NAME=$1
;;
-c | --checkpoint )
shift
QWEN_CHECKPOINT_PATH=$1
;;
-n | --container-name )
shift
CONTAINER_NAME=$1
;;
-h | --help )
usage
exit 0
;;
* )
echo "Unknown argument ${1}"
exit 1
;;
esac
shift
done
if [ ! -e ${QWEN_CHECKPOINT_PATH}/config.json ]; then
echo "Checkpoint config.json file not found in ${QWEN_CHECKPOINT_PATH}, exit."
exit 1
fi
sudo docker pull ${IMAGE_NAME} || {
echo "Pulling image ${IMAGE_NAME} failed, exit."
exit 1
}
sudo docker run --gpus all --rm --name ${CONTAINER_NAME} \
--mount type=bind,source=${QWEN_CHECKPOINT_PATH},target=/data/shared/Qwen/Qwen-Instruct \
-it ${IMAGE_NAME} \
python cli_demo.py -c /data/shared/Qwen/Qwen-Instruct/
#!/usr/bin/env bash
#
# This script will automatically pull the docker image from DockerHub and start a daemon container to run the Qwen-Chat web-demo.
IMAGE_NAME=qwenllm/qwen:2-cu121
QWEN_CHECKPOINT_PATH=/path/to/Qwen-Instruct
PORT=8901
CONTAINER_NAME=qwen2
function usage() {
echo '
Usage: bash docker/docker_web_demo.sh [-i IMAGE_NAME] -c [/path/to/Qwen-Instruct] [-n CONTAINER_NAME] [--port PORT]
'
}
while [[ "$1" != "" ]]; do
case $1 in
-i | --image-name )
shift
IMAGE_NAME=$1
;;
-c | --checkpoint )
shift
QWEN_CHECKPOINT_PATH=$1
;;
-n | --container-name )
shift
CONTAINER_NAME=$1
;;
--port )
shift
PORT=$1
;;
-h | --help )
usage
exit 0
;;
* )
echo "Unknown argument ${1}"
exit 1
;;
esac
shift
done
if [ ! -e ${QWEN_CHECKPOINT_PATH}/config.json ]; then
echo "Checkpoint config.json file not found in ${QWEN_CHECKPOINT_PATH}, exit."
exit 1
fi
sudo docker pull ${IMAGE_NAME} || {
echo "Pulling image ${IMAGE_NAME} failed, exit."
exit 1
}
sudo docker run --gpus all -d --restart always --name ${CONTAINER_NAME} \
-v /var/run/docker.sock:/var/run/docker.sock -p ${PORT}:80 \
--mount type=bind,source=${QWEN_CHECKPOINT_PATH},target=/data/shared/Qwen/Qwen-Instruct \
-it ${IMAGE_NAME} \
python web_demo.py --server-port 80 --server-name 0.0.0.0 -c /data/shared/Qwen/Qwen-Instruct/ && {
echo "Successfully started web demo. Open 'http://localhost:${PORT}' to try!
Run \`docker logs ${CONTAINER_NAME}\` to check demo status.
Run \`docker rm -f ${CONTAINER_NAME}\` to stop and remove the demo."
}
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
# Qwen Documentation
This is the source of the documentation at <https://qwen.readthedocs.io>.
## Quick Start
We use `sphinx` to manage the documentation and use the `furo` theme.
To get started, simply run
```bash
pip install -r requirements-docs.txt
```
Then run `make html` or `sphinx-build -M html source build`, and it will compile the docs and put them under the `build/html` directory.
## Translation
The documentation is available in both English and Simplified Chinese. We use
`sphinx-intl` to work with Sphinx translation flow, following [this article](https://www.sphinx-doc.org/en/master/usage/advanced/intl.html).
You need to install the Python package `sphinx-intl` before starting.
1. After updating the English documentation, run `make gettext`, and the pot files will be placed in the `build/gettext` directory. `make gettext` can be slow if the doc is long.
2. Use the generated pot files to update the po files:
```bash
sphinx-intl update -p build/gettext -l zh_CN -w 0
```
3. Translate the po files at `locales/zh_CN/LC_MESSAGES`. Pay attention to fuzzy matches (messages after `#, fuzzy`). Please be careful not to break reST notation.
4. Build the translated documentation: `make -e SPHINXOPTS="-D language='zh_CN'" html` or `sphinx-build -M html source build -D language=zh_CN`
## Auto Build
```bash
pip install sphinx-autobuild
```
To autobuild the default version:
```bash
sphinx-autobuild source build/html
```
To autobuild the translated version:
```bash
sphinx-autobuild source build/html -D language=zh_CN --watch locales/zh_CN
```
By default, the docs are served at `http://127.0.0.1:8000`.
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2024, Qwen Team
# This file is distributed under the same license as the Qwen package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2024.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: Qwen \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-04-28 19:42+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.17.0\n"
#: ../../Qwen/source/deployment/openllm.rst:2 986ea00cb5af4a0d82f974ed79a82430
msgid "OpenLLM"
msgstr "OpenLLM"
#: ../../Qwen/source/deployment/openllm.rst:5 78be03fbdccb429892b03bf84596411b
msgid "To be updated for Qwen3."
msgstr "仍需为Qwen3更新。"
#: ../../Qwen/source/deployment/openllm.rst:7 a001f11d1c5440188121d20b3baf59db
msgid "OpenLLM allows developers to run Qwen2.5 models of different sizes as OpenAI-compatible APIs with a single command. It features a built-in chat UI, state-of-the-art inference backends, and a simplified workflow for creating enterprise-grade cloud deployment with Qwen2.5. Visit `the OpenLLM repository <https://github.com/bentoml/OpenLLM/>`_ to learn more."
msgstr "OpenLLM 允许开发者通过一个命令运行不同大小的 Qwen2.5 模型,提供 OpenAI 兼容的 API。它具有内置的聊天 UI,先进的推理后端,以及简化的工作流程来使用 Qwen2.5 创建企业级云部署。访问 `OpenLLM 仓库 <https://github.com/bentoml/OpenLLM/>`_ 了解更多信息。"
#: ../../Qwen/source/deployment/openllm.rst:10 229f89c3be65442bbe15905d75a0d13d
msgid "Installation"
msgstr "安装"
#: ../../Qwen/source/deployment/openllm.rst:12 79421f700fbc426cb6ce9841aff67503
msgid "Install OpenLLM using ``pip``."
msgstr "使用 ``pip`` 安装 OpenLLM。"
#: ../../Qwen/source/deployment/openllm.rst:18 69cfd6fe2e274173ad4065be91b71472
msgid "Verify the installation and display the help information:"
msgstr "验证安装并显示帮助信息:"
#: ../../Qwen/source/deployment/openllm.rst:25 503cae99b14c4ef4b322b8ec0bd2d32d
msgid "Quickstart"
msgstr "快速开始"
#: ../../Qwen/source/deployment/openllm.rst:27 0ea788c801404d8780404611c87644b0
msgid "Before you run any Qwen2.5 model, ensure your model repository is up to date by syncing it with OpenLLM's latest official repository."
msgstr "在运行任何 Qwen2.5 模型之前,确保您的模型仓库与 OpenLLM 的最新官方仓库同步。"
#: ../../Qwen/source/deployment/openllm.rst:33 8852ff46ecdb45b2bfc9885bbfaacb02
msgid "List the supported Qwen2.5 models:"
msgstr "列出支持的 Qwen2.5 模型:"
#: ../../Qwen/source/deployment/openllm.rst:39 3e4f6c11396844adb30d4e5812339484
msgid "The results also display the required GPU resources and supported platforms:"
msgstr "结果还会显示所需的 GPU 资源和支持的平台:"
#: ../../Qwen/source/deployment/openllm.rst:57 ac4c0db02f5249d5882940820779db9a
msgid "To start a server with one of the models, use ``openllm serve`` like this:"
msgstr "要使用其中一个模型来启动服务器,请使用 ``openllm serve`` 命令,例如:"
#: ../../Qwen/source/deployment/openllm.rst:63 0a1d3ec35c684e3bb3e971c916aa9be7
msgid "By default, the server starts at ``http://localhost:3000/``."
msgstr "默认情况下,服务器启动在 http://localhost:3000/。"
#: ../../Qwen/source/deployment/openllm.rst:66 2e787de9a62f4342bdf8f88ee0df5379
msgid "Interact with the model server"
msgstr "与模型服务器交互"
#: ../../Qwen/source/deployment/openllm.rst:68 b22802ad9027458bb30ea0da665fea36
msgid "With the model server up and running, you can call its APIs in the following ways:"
msgstr "服务器运行后,可以通过以下方式调用其 API:"
#: ../../Qwen/source/deployment/openllm.rst 76214ea690094930899d6f2eddcc1454
msgid "CURL"
msgstr "CURL"
#: ../../Qwen/source/deployment/openllm.rst:74 42775a3df58f474782d29f2f82707bd9
msgid "Send an HTTP request to its ``/generate`` endpoint via CURL:"
msgstr "通过 CURL 向其 ``/generate`` 端点发送 HTTP 请求:"
#: ../../Qwen/source/deployment/openllm.rst 4f0ff3eee2ab49dda5a72bd611a9d45e
msgid "Python client"
msgstr "Python 客户端"
#: ../../Qwen/source/deployment/openllm.rst:91 ce2e11a46e434798947b1e74ce82a19c
msgid "Call the OpenAI-compatible endpoints with frameworks and tools that support the OpenAI API protocol. Here is an example:"
msgstr "使用支持 OpenAI API 协议的框架和工具来调用。例如:"
#: ../../Qwen/source/deployment/openllm.rst 107921d1a855430ca70c8c163d37c7f2
msgid "Chat UI"
msgstr "聊天 UI"
#: ../../Qwen/source/deployment/openllm.rst:118
#: b92df2759cd54c2b8316e2a160ede656
msgid "OpenLLM provides a chat UI at the ``/chat`` endpoint for the LLM server at http://localhost:3000/chat."
msgstr "OpenLLM 为 LLM 服务器提供的聊天 UI 位于 ``/chat`` 端点,地址为 http://localhost:3000/chat。"
#: ../../Qwen/source/deployment/openllm.rst:123
#: 0d3fa679178f443caf9c87623001be1f
msgid "Model repository"
msgstr "模型仓库"
#: ../../Qwen/source/deployment/openllm.rst:125
#: 54d6a9bdcc064aeb95a23b60d3d575ab
msgid "A model repository in OpenLLM represents a catalog of available LLMs. You can add your own repository to OpenLLM with custom Qwen2.5 variants for your specific needs. See our `documentation to learn details <https://github.com/bentoml/OpenLLM?tab=readme-ov-file#model-repository>`_."
msgstr "OpenLLM 中的模型仓库表示可用的 LLM 目录。您可以为 OpenLLM 添加自定义的 Qwen2.5 模型仓库,以满足您的特定需求。请参阅 `我们的文档 <https://github.com/bentoml/OpenLLM?tab=readme-ov-file#model-repository>`_ 了解详细信息。"
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2024, Qwen Team
# This file is distributed under the same license as the Qwen package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2025.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: Qwen \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-04-28 19:42+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.17.0\n"
#: ../../Qwen/source/deployment/sglang.md:1 4886c9be510e44ba968bba79c7e01e2b
msgid "SGLang"
msgstr ""
#: ../../Qwen/source/deployment/sglang.md:3 fa388b3c599c454bbe22dc7c831723c1
msgid "[SGLang](https://github.com/sgl-project/sglang) is a fast serving framework for large language models and vision language models."
msgstr "[SGLang](https://github.com/sgl-project/sglang) 是一个用于大型语言模型和视觉语言模型的快速推理框架。"
#: ../../Qwen/source/deployment/sglang.md:5 43fe1ab3622b4d619de1ba451ff5b5c4
msgid "To learn more about SGLang, please refer to the [documentation](https://docs.sglang.ai/)."
msgstr "要了解更多关于 SGLang 的信息,请参阅[官方文档](https://docs.sglang.ai/)。"
#: ../../Qwen/source/deployment/sglang.md:7 4e7093847f104f5c91bf12495db0e2df
msgid "Environment Setup"
msgstr "环境配置"
#: ../../Qwen/source/deployment/sglang.md:9 404501b6bb754a01afa398ce270f4ad6
msgid "By default, you can install `sglang` with pip in a clean environment:"
msgstr "默认情况下,你可以通过 pip 在新环境中安装 `sglang` : "
#: ../../Qwen/source/deployment/sglang.md:15 8794cc70acd141eeaef4717a190b11f4
msgid "Please note that `sglang` relies on `flashinfer-python` and has strict dependencies on `torch` and its CUDA versions. Check the note in the official document for installation ([link](https://docs.sglang.ai/start/install.html)) for more help."
msgstr "请留意预构建的 `sglang` 依赖 `flashinfer-python`,并对`torch`和其CUDA版本有强依赖。请查看[官方文档](https://docs.sglang.ai/start/install.html)中的注意事项以获取有关安装的帮助。"
#: ../../Qwen/source/deployment/sglang.md:18 06e04edfe3094363bcbc5b8758c8b16c
msgid "API Service"
msgstr "API 服务"
#: ../../Qwen/source/deployment/sglang.md:20 5969d8121d8a4af99d790844c4b348c5
msgid "It is easy to build an OpenAI-compatible API service with SGLang, which can be deployed as a server that implements OpenAI API protocol. By default, it starts the server at `http://localhost:30000`. You can specify the address with `--host` and `--port` arguments. Run the command as shown below:"
msgstr "借助 SGLang ,构建一个与OpenAI API兼容的API服务十分简便,该服务可以作为实现OpenAI API协议的服务器进行部署。默认情况下,它将在 `http://localhost:30000` 启动服务器。您可以通过 `--host` 和 `--port` 参数来自定义地址。请按照以下所示运行命令:"
#: ../../Qwen/source/deployment/sglang.md:28 32a52bb639634b9b9c196696dc20e2c5
msgid "By default, if the `--model-path` does not point to a valid local directory, it will download the model files from the HuggingFace Hub. To download model from ModelScope, set the following before running the above command:"
msgstr "默认情况下,如果模型未指向有效的本地目录,它将从 HuggingFace Hub 下载模型文件。要从 ModelScope 下载模型,请在运行上述命令之前设置以下内容:"
#: ../../Qwen/source/deployment/sglang.md:34 cd33984af3e045549668c8ad682f7612
msgid "For distrbiuted inference with tensor parallelism, it is as simple as"
msgstr "对于使用张量并行的分布式推理,操作非常简单:"
#: ../../Qwen/source/deployment/sglang.md:38 4db95581ffd046a9b6d532933403d985
msgid "The above command will use tensor parallelism on 4 GPUs. You should change the number of GPUs according to your demand."
msgstr "上述命令将在 4 块 GPU 上使用张量并行。您应根据需求调整 GPU 的数量。"
#: ../../Qwen/source/deployment/sglang.md:41 765cc12e934b4ab6881f0a71693fcc3d
msgid "Basic Usage"
msgstr "基本用法"
#: ../../Qwen/source/deployment/sglang.md:43 51032557dac94cb3b14c3842076192a8
msgid "Then, you can use the [create chat interface](https://platform.openai.com/docs/api-reference/chat/completions/create) to communicate with Qwen:"
msgstr "然后,您可以利用 [create chat interface](https://platform.openai.com/docs/api-reference/chat/completions/create) 来与Qwen进行对话:"
#: ../../Qwen/source/deployment/sglang.md 0708e8d2e6a44e94956e44f3a83bb4d8
#: 3bfc1bfe04ea4b49bfd1d5c6b5af52d7
msgid "curl"
msgstr ""
#: ../../Qwen/source/deployment/sglang.md 3b62fc6e456d44a6ba9cc8f5519fc3c6
#: ab964c5641584f7a9ef4252ecf0428cb
msgid "Python"
msgstr ""
#: ../../Qwen/source/deployment/sglang.md:63
#: ../../Qwen/source/deployment/sglang.md:130 18da82bbe0db4a59aa430b68b91db904
#: a2fccb4d7c164911a35e5ff6f30d98df
msgid "You can use the API client with the `openai` Python SDK as shown below:"
msgstr "或者您可以如下面所示使用 `openai` Python SDK中的 API 客户端:"
#: ../../Qwen/source/deployment/sglang.md:91 2bff40bb9f104cf2b19e6cf8169bf18d
msgid "While the default sampling parameters would work most of the time for thinking mode, it is recommended to adjust the sampling parameters according to your application, and always pass the sampling parameters to the API."
msgstr "虽然默认的采样参数在大多数情况下适用于思考模式,但建议根据您的应用调整采样参数,并始终将采样参数传递给 API。"
#: ../../Qwen/source/deployment/sglang.md:97 e10b0bbcaa7c4e54a59f7a30fa8760ef
msgid "Thinking & Non-Thinking Modes"
msgstr "思考与非思考模式"
#: ../../Qwen/source/deployment/sglang.md:100 ff0a121d43d5494597e8fc3b832f4893
msgid "This feature has not been released. For more information, please see this [pull request](https://github.com/sgl-project/sglang/pull/5551)."
msgstr "此功能尚未发布。更多信息,请参阅此[pull request](https://github.com/sgl-project/sglang/pull/5551)。"
#: ../../Qwen/source/deployment/sglang.md:104 8ba3c8c378ed4df7acb28f04e41bf067
msgid "Qwen3 models will think before respond. This behaviour could be controled by either the hard switch, which could disable thinking completely, or the soft switch, where the model follows the instruction of the user on whether or not it should think."
msgstr "Qwen3 模型会在回复前进行思考。这种行为可以通过硬开关(完全禁用思考)或软开关(模型遵循用户关于是否应该思考的指令)来控制。"
#: ../../Qwen/source/deployment/sglang.md:107 dcc39b3925704aee927b220cbf9b341d
msgid "The hard switch is availabe in SGLang through the following configuration to the API call. To disable thinking, use"
msgstr "硬开关在 vLLM 中可以通过以下 API 调用配置使用。要禁用思考,请使用"
#: ../../Qwen/source/deployment/sglang.md:159 952c90f4f1c84daba9cb66bfeb32725f
msgid "It is recommended to set sampling parameters differently for thinking and non-thinking modes."
msgstr "建议为思考模式和非思考模式分别设置不同的采样参数。"
#: ../../Qwen/source/deployment/sglang.md:162 750cee1281d74246bc7cf47ac9e0d502
msgid "Parsing Thinking Content"
msgstr "解析思考内容"
#: ../../Qwen/source/deployment/sglang.md:164 4f1f6c5d59134ea1bf6a625cd5081c51
msgid "SGLang supports parsing the thinking content from the model generation into structured messages:"
msgstr "SGLang 支持将模型生成的思考内容解析为结构化消息:"
#: ../../Qwen/source/deployment/sglang.md:169 0517d0a9cf694f6caabcbe69e3e1e845
msgid "The response message will have a field named `reasoning_content` in addition to `content`, containing the thinking content generated by the model."
msgstr "响应消息除了包含 `content` 字段外,还会有一个名为 `reasoning_content` 的字段,其中包含模型生成的思考内容。"
#: ../../Qwen/source/deployment/sglang.md:172 0225706aa7fe441c82d34f81b348fd42
msgid "Please note that this feature is not OpenAI API compatible."
msgstr "请注意,此功能与 OpenAI API 规范不一致。"
#: ../../Qwen/source/deployment/sglang.md:175 45a5f606e86543c08eacf7686b5a2def
msgid "Parsing Tool Calls"
msgstr "解析工具调用"
#: ../../Qwen/source/deployment/sglang.md:177 0aa3be18c7a5476cb915d6686c58387d
msgid "SGLang supports parsing the tool calling content from the model generation into structured messages:"
msgstr "SGLang 支持将模型生成的工具调用内容解析为结构化消息:"
#: ../../Qwen/source/deployment/sglang.md:182 dc096c7fb79c4b9ca0dd2c9cdd7ec890
msgid "For more information, please refer to [our guide on Function Calling](../framework/function_call.md)."
msgstr "详细信息,请参阅[函数调用的指南](../framework/function_call.md#vllm)。"
#: ../../Qwen/source/deployment/sglang.md:184 a58bba52efc44663af792d859bd3b410
msgid "Structured/JSON Output"
msgstr "结构化/JSON输出"
#: ../../Qwen/source/deployment/sglang.md:186 518f257e9d6d4080b41f980467573f7f
msgid "SGLang supports structured/JSON output. Please refer to [SGLang's documentation](https://docs.sglang.ai/backend/structured_outputs.html#OpenAI-Compatible-API). Besides, it is also recommended to instruct the model to generate the specific format in the system message or in your prompt."
msgstr "SGLang 支持结构化/JSON 输出。请参阅[SGLan文档](https://docs.sglang.ai/backend/structured_outputs.html#OpenAI-Compatible-API)。此外,还建议在系统消息或您的提示中指示模型生成特定格式。"
#: ../../Qwen/source/deployment/sglang.md:190 3a6d08a831584d6b8392da2650e8bf0b
msgid "Serving Quantized models"
msgstr "部署量化模型"
#: ../../Qwen/source/deployment/sglang.md:192 b2fe212a02b84349940f4c0c30cde88d
msgid "Qwen3 comes with two types of pre-quantized models, FP8 and AWQ."
msgstr "Qwen3 提供了两种类型的预量化模型:FP8 和 AWQ。"
#: ../../Qwen/source/deployment/sglang.md:194 efc85fc46a564483bdb872dbf5d61f3c
msgid "The command serving those models are the same as the original models except for the name change:"
msgstr "部署这些模型的命令与原始模型相同,只是名称有所更改:"
#: ../../Qwen/source/deployment/sglang.md:203 11a6d1bb983d4e60a55f5d579f1eb76b
msgid "Context Length"
msgstr "上下文长度"
#: ../../Qwen/source/deployment/sglang.md:205 de0293719e06477fbde6afc533973b1a
msgid "The context length for Qwen3 models in pretraining is up to 32,768 tokenns. To handle context length substantially exceeding 32,768 tokens, RoPE scaling techniques should be applied. We have validated the performance of [YaRN](https://arxiv.org/abs/2309.00071), a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts."
msgstr "Qwen3 模型在预训练中的上下文长度最长为 32,768 个 token。为了处理显著超过 32,768 个 token 的上下文长度,应应用 RoPE 缩放技术。我们已经验证了 [YaRN](https://arxiv.org/abs/2309.00071) 的性能,这是一种增强模型长度外推的技术,可确保在长文本上的最佳性能。"
#: ../../Qwen/source/deployment/sglang.md:209 0ee16aabbc794331a329e52ab2ca40e7
msgid "SGLang supports YaRN, which can be configured as"
msgstr "SGLang 支持 YaRN,可以配置为"
#: ../../Qwen/source/deployment/sglang.md:215 c3ba0a9b3502462795dfd887912e9357
msgid "SGLang implements static YaRN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts.** We advise adding the `rope_scaling` configuration only when processing long contexts is required. It is also recommended to modify the `factor` as needed. For example, if the typical context length for your application is 65,536 tokens, it would be better to set `factor` as 2.0."
msgstr "SGLang 实现了静态 YaRN,这意味着无论输入长度如何,缩放因子都保持不变,**这可能会对较短文本的性能产生影响。** 我们建议仅在需要处理长上下文时添加 `rope_scaling` 配置。还建议根据需要调整 `factor`。例如,如果您的应用程序的典型上下文长度为 65,536 个 token,则最好将 `factor` 设置为 2.0。"
#: ../../Qwen/source/deployment/sglang.md:221 398c3e38c94e446aa9922dd04dce609c
msgid "The default `max_position_embeddings` in `config.json` is set to 40,960, which is used by SGLang. This allocation includes reserving 32,768 tokens for outputs and 8,192 tokens for typical prompts, which is sufficient for most scenarios involving short text processing and leave adequate room for model thinking. If the average context length does not exceed 32,768 tokens, we do not recommend enabling YaRN in this scenario, as it may potentially degrade model performance."
msgstr "`config.json` 中的默认 `max_position_embeddings` 被设置为 40,960,SGLang 将使用该值。此分配包括为输出保留 32,768 个 token,为典型提示保留 8,192 个 token,这足以应对大多数涉及短文本处理的场景,并为模型思考留出充足空间。如果平均上下文长度不超过 32,768 个 token,我们不建议在此场景中启用 YaRN,因为这可能会降低模型性能。"
# Copyright (C) 2024, Qwen Team, Alibaba Group.
# This file is distributed under the same license as the Qwen package.
#
msgid ""
msgstr ""
"Project-Id-Version: Qwen \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-04-28 19:42+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.17.0\n"
#: ../../Qwen/source/deployment/skypilot.rst:2 795ad4f30e27494d93675f71bb1a5cc4
msgid "SkyPilot"
msgstr ""
#: ../../Qwen/source/deployment/skypilot.rst:5 aad807db94a24d868c9c1b364b47e152
msgid "To be updated for Qwen3."
msgstr "仍需为Qwen3更新。"
#: ../../Qwen/source/deployment/skypilot.rst:8 d6bbf736584f4bbfa9c300d50a2ed669
msgid "What is SkyPilot"
msgstr "SkyPilot 是什么"
#: ../../Qwen/source/deployment/skypilot.rst:10
#: b66facae41bf493880e43044e2915a45
msgid "SkyPilot is a framework for running LLMs, AI, and batch jobs on any cloud, offering maximum cost savings, the highest GPU availability, and managed execution. Its features include:"
msgstr "SkyPilot 是一个可以在任何云上运行 LLM 、 AI 应用以及批量任务的框架,旨在实现最大程度的成本节省、最高的 GPU 可用性以及受管理的执行过程。其特性包括:"
#: ../../Qwen/source/deployment/skypilot.rst:14
#: 621f021163c549d0aadb1c911a3a3ef5
msgid "Get the best GPU availability by utilizing multiple resources pools across multiple regions and clouds."
msgstr "通过跨区域和跨云充分利用多个资源池,以获得最佳的 GPU 可用性。"
#: ../../Qwen/source/deployment/skypilot.rst:16
#: ea1723c3b5be454cad3219836f4386d8
msgid "Pay absolute minimum — SkyPilot picks the cheapest resources across regions and clouds. No managed solution markups."
msgstr "把费用降到最低—— SkyPilot 在各区域和云平台中为您挑选最便宜的资源。无需任何托管解决方案的额外加价。"
#: ../../Qwen/source/deployment/skypilot.rst:18
#: e479693ecf08411ca35d8d0727c8f441
msgid "Scale up to multiple replicas across different locations and accelerators, all served with a single endpoint"
msgstr "将服务扩展到多个副本上,所有副本通过单一 endpoint 对外提供服务"
#: ../../Qwen/source/deployment/skypilot.rst:20
#: 1f9cdd2ae2544d1faa8a4c463ee0e42c
msgid "Everything stays in your cloud account (your VMs & buckets)"
msgstr "所有内容均保存在您的云账户中(包括您的虚拟机和 bucket )"
#: ../../Qwen/source/deployment/skypilot.rst:21
#: 5bb9b617764942d989e5093463a359f0
msgid "Completely private - no one else sees your chat history"
msgstr "完全私密 - 没有其他人能看到您的聊天记录"
#: ../../Qwen/source/deployment/skypilot.rst:24
#: cf0c456ac72f40ac98790c11dc243317
msgid "Install SkyPilot"
msgstr "安装 SkyPilot"
#: ../../Qwen/source/deployment/skypilot.rst:26
#: 78d86c1fa8104b138b01aed640b262fc
msgid "We advise you to follow the `instruction <https://skypilot.readthedocs.io/en/latest/getting-started/installation.html>`__ to install SkyPilot. Here we provide a simple example of using ``pip`` for the installation as shown below."
msgstr "我们建议您按照 `指示 <https://skypilot.readthedocs.io/en/latest/getting-started/installation.html>`__ 安装 SkyPilot 。以下为您提供了一个使用 ``pip`` 进行安装的简单示例:"
#: ../../Qwen/source/deployment/skypilot.rst:38
#: a7c88265bf404f55b85388c81a240199
msgid "After that, you need to verify cloud access with a command like:"
msgstr "随后,您需要用如下命令确认是否能使用云:"
#: ../../Qwen/source/deployment/skypilot.rst:44
#: 72025dfba0144f63a720f6da0dd39bfa
msgid "For more information, check the `official document <https://skypilot.readthedocs.io/en/latest/getting-started/installation.html>`__ and see if you have set up your cloud accounts correctly."
msgstr "若需更多信息,请查阅官方文档,确认您的云账户设置是否正确无误。"
#: ../../Qwen/source/deployment/skypilot.rst:47
#: 61be006061554e5ea40d55497e11e192
msgid "Alternatively, you can also use the official docker image with SkyPilot master branch automatically cloned by running:"
msgstr "或者,您也可以使用官方提供的 docker 镜像,可以自动克隆 SkyPilot 的主分支:"
#: ../../Qwen/source/deployment/skypilot.rst:63
#: 4ae89fb44c6643a3a82fca5cee622af4
msgid "Running Qwen2.5-72B-Instruct with SkyPilot"
msgstr "使用 SkyPilot 运行 Qwen2.5-72B-Instruct "
#: ../../Qwen/source/deployment/skypilot.rst:65
#: 1bc4973c2eb745689ded0af54ba33e0e
msgid "Start serving Qwen2.5-72B-Instruct on a single instance with any available GPU in the list specified in `serve-72b.yaml <https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/serve-72b.yaml>`__ with a vLLM-powered OpenAI-compatible endpoint:"
msgstr "`serve-72b.yaml <https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/serve-72b.yaml>`__ 中列出了支持的 GPU 。您可使用配备这类 GPU 的单个运算实例来部署 Qwen2.5-72B-Instruct 服务。该服务由 vLLM 搭建,并与 OpenAI API 兼容。以下为部署方法:"
#: ../../Qwen/source/deployment/skypilot.rst:74
#: ../../Qwen/source/deployment/skypilot.rst:123
#: ac3692ed16974facbd58b6886cd111af b325de015e7b4bb0a91491d3f7418792
msgid "**Before launching, make sure you have changed Qwen/Qwen2-72B-Instruct to Qwen/Qwen2.5-72B-Instruct in the YAML file.**"
msgstr "**在启动之前,请先将 YAML 文件中的 Qwen/Qwen2-72B-Instruct 修改为 Qwen/Qwen2.5-72B-Instruct。**"
#: ../../Qwen/source/deployment/skypilot.rst:76
#: 6046b3c86fae4a43878fbadbeb33fbd8
msgid "Send a request to the endpoint for completion:"
msgstr "向该 endpoint 发送续写请求:"
#: ../../Qwen/source/deployment/skypilot.rst:90
#: 2ec56c2028a94f568fd2c1a65063d25a
msgid "Send a request for chat completion:"
msgstr "向该 endpoint 发送对话续写请求"
#: ../../Qwen/source/deployment/skypilot.rst:112
#: c8e140ddfd914ff5a460621a7ca1891e
msgid "Scale up the service with SkyPilot Serve"
msgstr "使用 SkyPilot Serve 扩展服务规模"
#: ../../Qwen/source/deployment/skypilot.rst:114
#: 0db304ab396d45adb6017d78cd1ee4a2
msgid "With `SkyPilot Serve <https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html>`__, a serving library built on top of SkyPilot, scaling up the Qwen service is as simple as running:"
msgstr "使用 `SkyPilot Serve <https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html>`__ 扩展 Qwen 的服务规模非常容易,只需运行:"
#: ../../Qwen/source/deployment/skypilot.rst:125
#: 25bbbf9e49be44d3899074ff97202d71
msgid "This will start the service with multiple replicas on the cheapest available locations and accelerators. SkyServe will automatically manage the replicas, monitor their health, autoscale based on load, and restart them when needed."
msgstr "这将启动服务,使用多个副本部署在最经济的可用位置和加速器上。 SkyServe 将自动管理这些副本,监控其健康状况,根据负载进行自动伸缩,并在必要时重启它们。"
#: ../../Qwen/source/deployment/skypilot.rst:130
#: bda628bab7ef41a0918dc4b80a9b3cfe
msgid "A single endpoint will be returned and any request sent to the endpoint will be routed to the ready replicas."
msgstr "将返回一个 endpoint ,所有发送至该endpoint的请求都将被路由至就绪状态的副本。"
#: ../../Qwen/source/deployment/skypilot.rst:133
#: b232dbbdcf674d56bcf9c0331c020864
msgid "To check the status of the service, run:"
msgstr "运行如下命令检查服务的状态:"
#: ../../Qwen/source/deployment/skypilot.rst:139
#: 556b854caf7243fb93f253ebe2dc9033
msgid "After a while, you will see the following output:"
msgstr "很快,您将看到如下输出:"
#: ../../Qwen/source/deployment/skypilot.rst:152
#: 5a6055c5a42c4b2db6693c1095688de8
msgid "As shown, the service is now backed by 2 replicas, one on Azure and one on GCP, and the accelerator type is chosen to be **the cheapest available one** on the clouds. That said, it maximizes the availability of the service while minimizing the cost."
msgstr "如下所示:该服务现由两个副本提供支持,一个位于 Azure 平台,另一个位于 GCP 平台。同时,已为服务选择云服务商提供的 **最经济实惠** 的加速器类型。这样既最大限度地提升了服务的可用性,又尽可能降低了成本。"
#: ../../Qwen/source/deployment/skypilot.rst:157
#: a18533d33dc54a1091ded0b4bba0a1eb
msgid "To access the model, we use a ``curl -L`` command (``-L`` to follow redirect) to send the request to the endpoint:"
msgstr "要访问模型,我们使用带有 ``curl -L`` (用于跟随重定向),将请求发送到 endpoint :"
#: ../../Qwen/source/deployment/skypilot.rst:182
#: 34cd50fd79e24d8895075f7841b025e4
msgid "Accessing Qwen2.5 with Chat GUI"
msgstr "使用 Chat GUI 调用 Qwen2.5"
#: ../../Qwen/source/deployment/skypilot.rst:184
#: ca6994cda1cb469e83ce8c026bb67e42
msgid "It is also possible to access the Qwen2.5 service with GUI by connecting a `FastChat GUI server <https://github.com/lm-sys/FastChat>`__ to the endpoint launched above (see `gui.yaml <https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/gui.yaml>`__)."
msgstr "可以通过 `FastChat <https://github.com/lm-sys/FastChat>`__ 来使用 GUI 调用 Qwen2.5 的服务:"
#: ../../Qwen/source/deployment/skypilot.rst:188
#: 99a63e55ab5c46258c20ab89cdfa39dc
msgid "Start the Chat Web UI:"
msgstr "开启一个 Chat Web UI"
#: ../../Qwen/source/deployment/skypilot.rst:194
#: e61593a092c146f8a06af896d6af17f2
msgid "**Before launching, make sure you have changed Qwen/Qwen1.5-72B-Chat to Qwen/Qwen2.5-72B-Instruct in the YAML file.**"
msgstr "**在启动之前,请先将 YAML 文件中的 Qwen/Qwen1.5-72B-Chat 修改为 Qwen/Qwen2.5-72B-Instruct。**"
#: ../../Qwen/source/deployment/skypilot.rst:196
#: 9631068a8b424aa8af6dc6911daac7a9
msgid "Then, we can access the GUI at the returned gradio link:"
msgstr "随后,我们可以通过返回的 gradio 链接来访问 GUI :"
#: ../../Qwen/source/deployment/skypilot.rst:202
#: 1464a56dcd06404aafbe6d7d2c72212b
msgid "Note that you may get better results by using a different temperature and top_p value."
msgstr "你可以通过使用不同的温度和 top_p 值来尝试取得更好的结果。"
#: ../../Qwen/source/deployment/skypilot.rst:205
#: d257f49d835e4c12b28bc680bb78a9cb
msgid "Summary"
msgstr "总结"
#: ../../Qwen/source/deployment/skypilot.rst:207
#: 06b9684a19774eaba4f69862332c5166
msgid "With SkyPilot, it is easy for you to deploy Qwen2.5 on any cloud. We advise you to read the official doc for more usages and updates. Check `this <https://skypilot.readthedocs.io/>`__ out!"
msgstr "通过 SkyPilot ,你可以轻松地在任何云上部署 Qwen2.5 。我们建议您阅读 `官方文档 <https://skypilot.readthedocs.io/>`__ 了解更多用法和最新进展。"
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2024, Qwen Team
# This file is distributed under the same license as the Qwen package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2024.
#
msgid ""
msgstr ""
"Project-Id-Version: Qwen \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-04-28 19:42+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.17.0\n"
#: ../../Qwen/source/deployment/tgi.rst:2 2abcc96f9deb4b9187ac9d88fc69e929
msgid "TGI"
msgstr ""
#: ../../Qwen/source/deployment/tgi.rst:5 2d124d7cb95f47388aa48c662932ef9b
msgid "To be updated for Qwen3."
msgstr "仍需为Qwen3更新。"
#: ../../Qwen/source/deployment/tgi.rst:7 4e5d299c4fdd46d5aba38c9af5765792
msgid "Hugging Face's Text Generation Inference (TGI) is a production-ready framework specifically designed for deploying and serving large language models (LLMs) for text generation tasks. It offers a seamless deployment experience, powered by a robust set of features:"
msgstr "Hugging Face 的 Text Generation Inference (TGI) 是一个专为部署大规模语言模型 (Large Language Models, LLMs) 而设计的生产级框架。TGI提供了流畅的部署体验,并稳定支持如下特性:"
#: ../../Qwen/source/deployment/tgi.rst:9 ecd4fc11a95140959915d062791ceba1
msgid "`Speculative Decoding <Speculative Decoding_>`_: Accelerates generation speeds."
msgstr "`推测解码 (Speculative Decoding) <Speculative Decoding_>`_ :提升生成速度。"
#: ../../Qwen/source/deployment/tgi.rst:10 84590a56416348bf85b3f296cf57e257
msgid "`Tensor Parallelism`_: Enables efficient deployment across multiple GPUs."
msgstr "张量并行 (`Tensor Parallelism`_) :高效多卡部署。"
#: ../../Qwen/source/deployment/tgi.rst:11 a996d6ecd7b94c5cb9752d370f29a9b1
msgid "`Token Streaming`_: Allows for the continuous generation of text."
msgstr "流式生成 (`Token Streaming`_) :支持持续性生成文本。"
#: ../../Qwen/source/deployment/tgi.rst:12 8f591c045ba34f4581bb19652db9f9b3
msgid "Versatile Device Support: Works seamlessly with `AMD`_, `Gaudi`_ and `AWS Inferentia`_."
msgstr "灵活的硬件支持:与 `AMD`_ , `Gaudi`_ 和 `AWS Inferentia`_ 无缝衔接。"
#: ../../Qwen/source/deployment/tgi.rst:21 5e8a98b91fc146e0b581422faa683a18
msgid "Installation"
msgstr "安装"
#: ../../Qwen/source/deployment/tgi.rst:23 684ef25bfb0e460999d6dcccce41b85f
msgid "The easiest way to use TGI is via the TGI docker image. In this guide, we show how to use TGI with docker."
msgstr "通过 TGI docker 镜像使用 TGI 轻而易举。本文将主要介绍 TGI 的 docker 用法。"
#: ../../Qwen/source/deployment/tgi.rst:25 c563fa3eccb04d00a477c1d2e8b15c38
msgid "It's possible to run it locally via Conda or build locally. Please refer to `Installation Guide <https://huggingface.co/docs/text-generation-inference/installation>`_ and `CLI tool <https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/using_cli>`_ for detailed instructions."
msgstr "也可通过 Conda 实机安装或搭建服务。请参考 `Installation Guide <https://huggingface.co/docs/text-generation-inference/installation>`_ 与 `CLI tool <https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/using_cli>`_ 以了解详细说明。"
#: ../../Qwen/source/deployment/tgi.rst:28 b55fc58ff4cb472abca08296409c7837
msgid "Deploy Qwen2.5 with TGI"
msgstr "通过 TGI 部署 Qwen2.5"
#: ../../Qwen/source/deployment/tgi.rst:30 586a8425ec5d413592fd7daf579c7e87
msgid "**Find a Qwen2.5 Model:** Choose a model from `the Qwen2.5 collection <https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e>`_."
msgstr "**选定 Qwen2.5 模型:** 从 `the Qwen2.5 collection <https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e>`_ 中挑选模型。"
#: ../../Qwen/source/deployment/tgi.rst:31 50fcab8da35941eca308786979dbaf38
msgid "**Deployment Command:** Run the following command in your terminal, replacing ``model`` with your chosen Qwen2.5 model ID and ``volume`` with the path to your local data directory:"
msgstr "**部署TGI服务:** 在终端中运行以下命令,注意替换 ``model`` 为选定的 Qwen2.5 模型 ID 、 ``volume`` 为本地的数据路径: "
#: ../../Qwen/source/deployment/tgi.rst:42 2a800533a7d84bdeab1da0976b0cab53
msgid "Using TGI API"
msgstr "使用 TGI API"
#: ../../Qwen/source/deployment/tgi.rst:44 f05d1ec08140452782d0659543fad7d1
msgid "Once deployed, the model will be available on the mapped port (8080)."
msgstr "一旦成功部署,API 将于选定的映射端口 (8080) 提供服务。"
#: ../../Qwen/source/deployment/tgi.rst:46 f265dc1522b049c98ba31fd5d255c50f
msgid "TGI comes with a handy API for streaming response:"
msgstr "TGI 提供了简单直接的 API 支持流式生成:"
#: ../../Qwen/source/deployment/tgi.rst:54 e9cc4c0571b74bd08b2a59347503e653
msgid "It's also available on OpenAI style API:"
msgstr "也可使用 OpenAI 风格的 API 使用 TGI :"
#: ../../Qwen/source/deployment/tgi.rst:73 5dc7e9c74fc04483ba8e5dcdd7052020
msgid "The model field in the JSON is not used by TGI, you can put anything."
msgstr "JSON 中的 model 字段不会被 TGI 识别,您可传入任意值。"
#: ../../Qwen/source/deployment/tgi.rst:75 d60f837152014cda8baebc90d65d1cc0
#, python-format
msgid "Refer to the `TGI Swagger UI <https://huggingface.github.io/text-generation-inference/#/Text%20Generation%20Inference/completions>`_ for a complete API reference."
msgstr "完整 API 文档,请查阅 `TGI Swagger UI <https://huggingface.github.io/text-generation-inference/#/Text%20Generation%20Inference/completions>`_ 。"
#: ../../Qwen/source/deployment/tgi.rst:77 b59564031e5548088aef828f9753e337
msgid "You can also use Python API:"
msgstr "你也可以使用 Python 访问 API :"
#: ../../Qwen/source/deployment/tgi.rst:106 62646cecb024479ebfeca5f3063e7322
msgid "Quantization for Performance"
msgstr "量化"
#: ../../Qwen/source/deployment/tgi.rst:108 4a8d39bf37be4820afb230f9a977b431
msgid "Data-dependent quantization (GPTQ and AWQ)"
msgstr "依赖数据的量化方案( GPTQ 与 AWQ )"
#: ../../Qwen/source/deployment/tgi.rst:110 ef2b18f47e4f4f7ebb017be628cb0be9
msgid "Both GPTQ and AWQ models are data-dependent. The official quantized models can be found from `the Qwen2.5 collection`_ and you can also quantize models with your own dataset to make it perform better on your use case."
msgstr "GPTQ 与 AWQ 均依赖数据进行量化。我们提供了预先量化好的模型,请于 `the Qwen2.5 collection`_ 查找。你也可以使用自己的数据集自行量化,以在你的场景中取得更好效果。"
#: ../../Qwen/source/deployment/tgi.rst:112 53d94278a2e3409abb9980ebc7c96c24
msgid "The following shows the command to start TGI with Qwen2.5-7B-Instruct-GPTQ-Int4:"
msgstr "以下是通过 TGI 部署 Qwen2.5-7B-Instruct-GPTQ-Int4 的指令:"
#: ../../Qwen/source/deployment/tgi.rst:122 68ff8a07d0eb40cfa67d79e01adea070
msgid "If the model is quantized with AWQ, e.g. Qwen/Qwen2.5-7B-Instruct-AWQ, please use ``--quantize awq``."
msgstr "如果模型是 AWQ 量化的,如 Qwen/Qwen2.5-7B-Instruct-AWQ ,请使用 ``--quantize awq`` 。"
#: ../../Qwen/source/deployment/tgi.rst:124 b4c3b82b1f2a43a8a02383fd0afbda5f
msgid "Data-agnostic quantization"
msgstr "不依赖数据的量化方案"
#: ../../Qwen/source/deployment/tgi.rst:126 7a6b89c94b72407482b96790f5bbd272
msgid "EETQ on the other side is not data dependent and can be used with any model. Note that we're passing in the original model (instead of a quantized model) with the ``--quantize eetq`` flag."
msgstr "EETQ 是一种不依赖数据的量化方案,可直接用于任意模型。请注意,我们需要传入原始模型,并使用 ``--quantize eetq`` 标志。"
#: ../../Qwen/source/deployment/tgi.rst:138 763166da65924887b3bba99ea4d2baab
msgid "Multi-Accelerators Deployment"
msgstr "多卡部署"
#: ../../Qwen/source/deployment/tgi.rst:140 ddcfcff947894f168c7945ae9c42a579
msgid "Use the ``--num-shard`` flag to specify the number of accelerators. Please also use ``--shm-size 1g`` to enable shared memory for optimal NCCL performance (`reference <https://github.com/huggingface/text-generation-inference?tab=readme-ov-file#a-note-on-shared-memory-shm>`__):"
msgstr "使用 ``--num-shard`` 指定卡书数量。 请务必传入 ``--shm-size 1g`` 让 NCCL 发挥最好性能 (`说明 <https://github.com/huggingface/text-generation-inference?tab=readme-ov-file#a-note-on-shared-memory-shm>`__) :"
#: ../../Qwen/source/deployment/tgi.rst:151 520c46fb404c4ec9bf89280e4a71f1e8
msgid "Speculative Decoding"
msgstr "推测性解码 (Speculative Decoding)"
#: ../../Qwen/source/deployment/tgi.rst:153 74c6b65f76b74d56ad109af9da11f66e
msgid "Speculative decoding can reduce the time per token by speculating on the next token. Use the ``--speculative-decoding`` flag, setting the value to the number of tokens to speculate on (default: 0 for no speculation):"
msgstr "推测性解码 (Speculative Decoding) 通过预先推测下一 token 来节约每 token 需要的时间。使用 ``--speculative-decoding`` 设定预先推测 token 的数量 (默认为0,表示不预先推测):"
#: ../../Qwen/source/deployment/tgi.rst:164 dee05ee0fb1a4f2da42b250192d943f5
msgid "The overall performance of speculative decoding highly depends on the type of task. It works best for code or highly repetitive text."
msgstr "推测性解码的加速效果依赖于任务类型,对于代码或重复性较高的文本生成任务,提速更明显。"
#: ../../Qwen/source/deployment/tgi.rst:166 731f300bc1174589901dd5feb26e8b2f
msgid "More context on speculative decoding can be found `here <https://huggingface.co/docs/text-generation-inference/conceptual/speculation>`__."
msgstr "更多说明可查阅 `此文档 <https://huggingface.co/docs/text-generation-inference/conceptual/speculation>`__ 。"
#: ../../Qwen/source/deployment/tgi.rst:170 65a7d5553dd145398f9705c1ee6c28f0
msgid "Zero-Code Deployment with HF Inference Endpoints"
msgstr "使用 HF Inference Endpoints 零代码部署"
#: ../../Qwen/source/deployment/tgi.rst:172 721c3a7578f846ae8e21e595923e17e7
msgid "For effortless deployment, leverage Hugging Face Inference Endpoints:"
msgstr "使用 Hugging Face Inference Endpoints 不费吹灰之力:"
#: ../../Qwen/source/deployment/tgi.rst:174 7741607488d94a9f8be2ffcb6a5322fb
msgid "**GUI interface:** `<https://huggingface.co/inference-endpoints/dedicated>`__"
msgstr ""
#: ../../Qwen/source/deployment/tgi.rst:175 02ff4520e66f4a42828483da7d25445f
msgid "**Coding interface:** `<https://huggingface.co/blog/tgi-messages-api>`__"
msgstr ""
#: ../../Qwen/source/deployment/tgi.rst:177 d35f9dd4bc96400cb6c7584012d2df49
msgid "Once deployed, the endpoint can be used as usual."
msgstr "一旦部署成功,服务使用与本地无异。"
#: ../../Qwen/source/deployment/tgi.rst:181 61c1b825bbf24be2aaaeb99de3f0660e
msgid "Common Issues"
msgstr "常见问题"
#: ../../Qwen/source/deployment/tgi.rst:183 b55a2d286fc24dbe92b79ab5c010c7af
msgid "Qwen2.5 supports long context lengths, so carefully choose the values for ``--max-batch-prefill-tokens``, ``--max-total-tokens``, and ``--max-input-tokens`` to avoid potential out-of-memory (OOM) issues. If an OOM occurs, you'll receive an error message upon startup. The following shows an example to modify those parameters:"
msgstr "Qwen2.5 支持长上下文,谨慎设定 ``--max-batch-prefill-tokens`` , ``--max-total-tokens`` 和 ``--max-input-tokens`` 以避免 out-of-memory (OOM) 。如 OOM ,你将在启动 TGI 时收到错误提示。以下为修改这些参数的示例:"
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2024, Qwen Team
# This file is distributed under the same license as the Qwen package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2024.
#
msgid ""
msgstr ""
"Project-Id-Version: Qwen \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-04-28 19:42+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.17.0\n"
#: ../../Qwen/source/framework/Langchain.rst:2 6f9b66430d9c495592b1e275fdfd7c9e
msgid "Langchain"
msgstr ""
#: ../../Qwen/source/framework/Langchain.rst:5 1205af46f88e4d6681003403109385c3
msgid "To be updated for Qwen3."
msgstr "仍需为Qwen3更新。"
#: ../../Qwen/source/framework/Langchain.rst:7 115ee7b1c8404629a8f98175264cc114
msgid "This guide helps you build a question-answering application based on a local knowledge base using ``Qwen2.5-7B-Instruct`` with ``langchain``. The goal is to establish a knowledge base Q&A solution."
msgstr "本教程旨在帮助您利用 ``Qwen2.5-7B-Instruct`` 与 ``langchain`` ,基于本地知识库构建问答应用。目标是建立一个知识库问答解决方案。"
#: ../../Qwen/source/framework/Langchain.rst:12
#: 7257b95612fb423bb9ca73212fd12a02
msgid "Basic Usage"
msgstr "基础用法"
#: ../../Qwen/source/framework/Langchain.rst:14
#: fecf7a682dcc4c15a53da1f7cdf145e5
msgid "The implementation process of this project includes loading files -> reading text -> segmenting text -> vectorizing text -> vectorizing questions -> matching the top k most similar text vectors with the question vectors -> incorporating the matched text as context along with the question into the prompt -> submitting to the Qwen2.5-7B-Instruct to generate an answer. Below is an example:"
msgstr "您可以仅使用您的文档配合 ``langchain`` 来构建一个问答应用。该项目的实现流程包括加载文件 -> 阅读文本 -> 文本分段 -> 文本向量化 -> 问题向量化 -> 将最相似的前k个文本向量与问题向量匹配 -> 将匹配的文本作为上下文连同问题一起纳入提示 -> 提交给Qwen2.5-7B-Instruct生成答案。以下是一个示例:"
#: ../../Qwen/source/framework/Langchain.rst:98
#: 6ad1ebd2ef4a49f9aa66cfdf777e1290
msgid "After loading the Qwen2.5-7B-Instruct model, you should specify the txt file for retrieval."
msgstr "加载Qwen2.5-7B-Instruct模型后,您可以指定需要用于知识库问答的txt文件。"
#: ../../Qwen/source/framework/Langchain.rst:274
#: 00467b1e4e294a26b9f49886633331e0
msgid "Next Step"
msgstr "下一步"
#: ../../Qwen/source/framework/Langchain.rst:276
#: 15ed906687054af78545290ba0746380
msgid "Now you can chat with Qwen2.5 use your own document. Continue to read the documentation and try to figure out more advanced usages of model retrieval!"
msgstr "现在,您可以在您自己的文档上与Qwen2.5进行交流。继续阅读文档,尝试探索模型检索的更多高级用法!"
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2024, Qwen Team
# This file is distributed under the same license as the Qwen package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2024.
#
msgid ""
msgstr ""
"Project-Id-Version: Qwen \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-04-28 19:42+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.17.0\n"
#: ../../Qwen/source/framework/LlamaIndex.rst:2
#: 2e41f8696c20488d8593b670c6361edf
msgid "LlamaIndex"
msgstr "LlamaIndex"
#: ../../Qwen/source/framework/LlamaIndex.rst:5
#: 20b3836fd391457bb00bf75b61e23e0d
msgid "To be updated for Qwen3."
msgstr "仍需为Qwen3更新。"
#: ../../Qwen/source/framework/LlamaIndex.rst:7
#: 86d9e6f0684749aab40a9824cd026fa3
msgid "To connect Qwen2.5 with external data, such as documents, web pages, etc., we offer a tutorial on `LlamaIndex <https://www.llamaindex.ai/>`__. This guide helps you quickly implement retrieval-augmented generation (RAG) using LlamaIndex with Qwen2.5."
msgstr "为了实现 Qwen2.5 与外部数据(例如文档、网页等)的连接,我们提供了 `LlamaIndex <https://www.llamaindex.ai/>`__ 的详细教程。本指南旨在帮助用户利用 LlamaIndex 与 Qwen2.5 快速部署检索增强生成(RAG)技术。"
#: ../../Qwen/source/framework/LlamaIndex.rst:11
#: 71ed222858054687a5b33222bb6ac086
msgid "Preparation"
msgstr "环境准备"
#: ../../Qwen/source/framework/LlamaIndex.rst:13
#: 161d9153d6484dd5a1f1bdb340847814
msgid "To implement RAG, we advise you to install the LlamaIndex-related packages first."
msgstr "为实现检索增强生成(RAG),我们建议您首先安装与 LlamaIndex 相关的软件包。"
#: ../../Qwen/source/framework/LlamaIndex.rst:16
#: a8d6acb1001a42c88185b971ae2de3bf
msgid "The following is a simple code snippet showing how to do this:"
msgstr "以下是一个简单的代码示例:"
#: ../../Qwen/source/framework/LlamaIndex.rst:25
#: e441d3b8fb6d4a13b52e1560ef250b16
msgid "Set Parameters"
msgstr "设置参数"
#: ../../Qwen/source/framework/LlamaIndex.rst:27
#: c2481804c3f34c7f883eed92ffa3111e
msgid "Now we can set up LLM, embedding model, and the related configurations. Qwen2.5-Instruct supports conversations in multiple languages, including English and Chinese. You can use the ``bge-base-en-v1.5`` model to retrieve from English documents, and you can download the ``bge-base-zh-v1.5`` model to retrieve from Chinese documents. You can also choose ``bge-large`` or ``bge-small`` as the embedding model or modify the context window size or text chunk size depending on your computing resources. Qwen2.5 model families support a maximum of 32K context window size (up to 128K for 7B, 14B, 32B, and 72B, requiring extra configuration)"
msgstr "现在,我们可以设置语言模型和向量模型。Qwen2.5-Instruct支持包括英语和中文在内的多种语言对话。您可以使用 ``bge-base-en-v1.5`` 模型来检索英文文档,下载 ``bge-base-zh-v1.5`` 模型以检索中文文档。根据您的计算资源,您还可以选择 ``bge-large`` 或 ``bge-small`` 作为向量模型,或调整上下文窗口大小或文本块大小。Qwen2.5模型系列支持最大32K上下文窗口大小(7B 、14B 、32B 及 72B可扩展支持 128K 上下文,但需要额外配置)"
#: ../../Qwen/source/framework/LlamaIndex.rst:85
#: 74c35d5a03734c289d162dfa3813ada6
msgid "Build Index"
msgstr "构建索引"
#: ../../Qwen/source/framework/LlamaIndex.rst:87
#: c49859d4ea5f49dba1fa2263f3ae284d
msgid "Now we can build index from documents or websites."
msgstr "现在我们可以从文档或网站构建索引。"
#: ../../Qwen/source/framework/LlamaIndex.rst:89
#: b460d000037e4266a4d9f43d38f1f9b0
msgid "The following code snippet demonstrates how to build an index for files (regardless of whether they are in PDF or TXT format) in a local folder named 'document'."
msgstr "以下代码片段展示了如何为本地名为'document'的文件夹中的文件(无论是PDF格式还是TXT格式)构建索引。"
#: ../../Qwen/source/framework/LlamaIndex.rst:102
#: a416d18b227940e29fac1f59851ff8c4
msgid "The following code snippet demonstrates how to build an index for the content in a list of websites."
msgstr "以下代码片段展示了如何为一系列网站的内容构建索引。"
#: ../../Qwen/source/framework/LlamaIndex.rst:118
#: 487cf928d048424fa1b50438f701137c
msgid "To save and load the index, you can use the following code snippet."
msgstr "要保存和加载已构建的索引,您可以使用以下代码示例。"
#: ../../Qwen/source/framework/LlamaIndex.rst:132
#: c68419c4318d46e891f5df9191be6d2d
msgid "RAG"
msgstr "检索增强(RAG)"
#: ../../Qwen/source/framework/LlamaIndex.rst:134
#: 8ad20a8f43fe496084a40f963ba97440
msgid "Now you can perform queries, and Qwen2.5 will answer based on the content of the indexed documents."
msgstr "现在您可以输入查询,Qwen2.5 将基于索引文档的内容提供答案。"