# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
version: 2
build:
  os: ubuntu-22.04
  tools:
    python: "3"
sphinx:
  configuration: docs/source/conf.py
# If using Sphinx, optionally build your docs in additional formats such as PDF
# formats:
#   - pdf
# Optionally declare the Python requirements required to build your docs
python:
  install:
    - requirements: docs/requirements-docs.txt
---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-8B/blob/main/LICENSE
pipeline_tag: text-generation
base_model:
- Qwen/Qwen3-8B-Base
---
# Qwen3-8B
<a href="https://chat.qwen.ai/" target="_blank" style="margin: 2px;">
<img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/>
</a>
## Qwen3 Highlights
Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features:
- **Unique support of seamless switching between thinking mode** (for complex logical reasoning, math, and coding) and **non-thinking mode** (for efficient, general-purpose dialogue) **within a single model**, ensuring optimal performance across various scenarios.
- **Significant enhancement in its reasoning capabilities**, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- **Superior human preference alignment**, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- **Expertise in agent capabilities**, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- **Support of 100+ languages and dialects** with strong capabilities for **multilingual instruction following** and **translation**.
## Model Overview
**Qwen3-8B** has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 8.2B
- Number of Parameters (Non-Embedding): 6.95B
- Number of Layers: 36
- Number of Attention Heads (GQA): 32 for Q and 8 for KV
- Context Length: 32,768 natively and [131,072 tokens with YaRN](#processing-long-texts).
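These figures can be cross-checked against the model configuration; a minimal sketch, assuming the standard `transformers` config attribute names used by Qwen-style models:
```python
from transformers import AutoConfig

# Illustrative only: inspect the hyper-parameters listed above from the hub config.
config = AutoConfig.from_pretrained("Qwen/Qwen3-8B")
print(config.num_hidden_layers)        # expected: 36
print(config.num_attention_heads)      # expected: 32 (query heads)
print(config.num_key_value_heads)      # expected: 8 (KV heads, GQA)
print(config.max_position_embeddings)  # native context window
```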
For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our [blog](https://qwenlm.github.io/blog/qwen3/), [GitHub](https://github.com/QwenLM/Qwen3), and [Documentation](https://qwen.readthedocs.io/en/latest/).
## Quickstart
The code for Qwen3 has been included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.
With `transformers<4.51.0`, you will encounter the following error:
```
KeyError: 'qwen3'
```
The following code snippet illustrates how to use the model to generate content based on given inputs.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Qwen/Qwen3-8B"
# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# conduct text completion
generated_ids = model.generate(
**model_inputs,
max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
# parsing thinking content
try:
# rindex finding 151668 (</think>)
index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
index = 0
thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
print("thinking content:", thinking_content)
print("content:", content)
```
For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.4` to create an OpenAI-compatible API endpoint:
- SGLang:
```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-8B --reasoning-parser qwen3
```
- vLLM:
```shell
vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser deepseek_r1
```
For local use, applications such as llama.cpp, Ollama, LMStudio, and MLX-LM also support Qwen3.
## Switching Between Thinking and Non-Thinking Mode
> [!TIP]
> The `enable_thinking` switch is also available in APIs created by SGLang and vLLM.
> Please refer to our documentation for [SGLang](https://qwen.readthedocs.io/en/latest/deployment/sglang.html#thinking-non-thinking-modes) and [vLLM](https://qwen.readthedocs.io/en/latest/deployment/vllm.html#thinking-non-thinking-modes) users.
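For reference, a minimal sketch of toggling thinking through such an OpenAI-compatible endpoint with the `openai` Python SDK; it assumes the vLLM server started as shown above (default port 8000) and that your framework version accepts `chat_template_kwargs` in `extra_body`, as described in the linked deployment docs:
```python
from openai import OpenAI

# Illustrative only: the endpoint and the chat_template_kwargs field depend on the serving framework/version.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.7,
    top_p=0.8,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},  # assumed switch; check your framework's docs
)
print(response.choices[0].message.content)
```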
### `enable_thinking=True`
By default, Qwen3 has thinking capabilities enabled, similar to QwQ-32B. This means the model will use its reasoning abilities to enhance the quality of generated responses. For example, when explicitly setting `enable_thinking=True` or leaving it as the default value in `tokenizer.apply_chat_template`, the model will engage its thinking mode.
```python
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True # True is the default value for enable_thinking
)
```
In this mode, the model will generate thinking content wrapped in a `<think>...</think>` block, followed by the final response.
> [!NOTE]
> For thinking mode, use `Temperature=0.6`, `TopP=0.95`, `TopK=20`, and `MinP=0` (the default setting in `generation_config.json`). **DO NOT use greedy decoding**, as it can lead to performance degradation and endless repetitions. For more detailed guidance, please refer to the [Best Practices](#best-practices) section.
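As an illustration, these sampling parameters can also be passed explicitly to `generate` instead of relying on `generation_config.json`; a minimal sketch reusing `model` and `model_inputs` from the Quickstart above:
```python
# Thinking-mode sampling: Temperature=0.6, TopP=0.95, TopK=20, MinP=0 (do not use greedy decoding).
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    min_p=0.0,
)
```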
### `enable_thinking=False`
We provide a hard switch to strictly disable the model's thinking behavior, aligning its functionality with the previous Qwen2.5-Instruct models. This mode is particularly useful in scenarios where disabling thinking is essential for enhancing efficiency.
```python
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=False # Setting enable_thinking=False disables thinking mode
)
```
In this mode, the model will not generate any think content and will not include a `<think>...</think>` block.
> [!NOTE]
> For non-thinking mode, we suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0`. For more detailed guidance, please refer to the [Best Practices](#best-practices) section.
### Advanced Usage: Switching Between Thinking and Non-Thinking Modes via User Input
We provide a soft switch mechanism that allows users to dynamically control the model's behavior when `enable_thinking=True`. Specifically, you can add `/think` and `/no_think` to user prompts or system messages to switch the model's thinking mode from turn to turn. The model will follow the most recent instruction in multi-turn conversations.
Here is an example of a multi-turn conversation:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
class QwenChatbot:
def __init__(self, model_name="Qwen/Qwen3-8B"):
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForCausalLM.from_pretrained(model_name)
self.history = []
def generate_response(self, user_input):
messages = self.history + [{"role": "user", "content": user_input}]
text = self.tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
inputs = self.tokenizer(text, return_tensors="pt")
response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
response = self.tokenizer.decode(response_ids, skip_special_tokens=True)
# Update history
self.history.append({"role": "user", "content": user_input})
self.history.append({"role": "assistant", "content": response})
return response
# Example Usage
if __name__ == "__main__":
chatbot = QwenChatbot()
# First input (without /think or /no_think tags, thinking mode is enabled by default)
user_input_1 = "How many r's in strawberries?"
print(f"User: {user_input_1}")
response_1 = chatbot.generate_response(user_input_1)
print(f"Bot: {response_1}")
print("----------------------")
# Second input with /no_think
user_input_2 = "Then, how many r's in blueberries? /no_think"
print(f"User: {user_input_2}")
response_2 = chatbot.generate_response(user_input_2)
print(f"Bot: {response_2}")
print("----------------------")
# Third input with /think
user_input_3 = "Really? /think"
print(f"User: {user_input_3}")
response_3 = chatbot.generate_response(user_input_3)
print(f"Bot: {response_3}")
```
> [!NOTE]
> For API compatibility, when `enable_thinking=True`, regardless of whether the user uses `/think` or `/no_think`, the model will always output a block wrapped in `<think>...</think>`. However, the content inside this block may be empty if thinking is disabled.
> When `enable_thinking=False`, the soft switches have no effect. Regardless of any `/think` or `/no_think` tags input by the user, the model will not generate think content and will not include a `<think>...</think>` block.
## Agentic Use
Qwen3 excels in tool calling capabilities. We recommend using [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent) to make the best use of the agentic abilities of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity.
To define the available tools, you can use an MCP configuration file, use the integrated tools of Qwen-Agent, or integrate other tools yourself.
```python
from qwen_agent.agents import Assistant
# Define LLM
llm_cfg = {
'model': 'Qwen3-8B',
# Use the endpoint provided by Alibaba Model Studio:
# 'model_type': 'qwen_dashscope',
# 'api_key': os.getenv('DASHSCOPE_API_KEY'),
# Use a custom endpoint compatible with OpenAI API:
'model_server': 'http://localhost:8000/v1', # api_base
'api_key': 'EMPTY',
# Other parameters:
# 'generate_cfg': {
# # Add: when the response content is `<think>this is the thought</think>this is the answer`;
# # Do not add: when the response has been separated into reasoning_content and content.
# 'thought_in_content': True,
# },
}
# Define Tools
tools = [
{'mcpServers': { # You can specify the MCP configuration file
'time': {
'command': 'uvx',
'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
},
"fetch": {
"command": "uvx",
"args": ["mcp-server-fetch"]
}
}
},
'code_interpreter', # Built-in tools
]
# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)
# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
pass
print(responses)
```
## Processing Long Texts
Qwen3 natively supports context lengths of up to 32,768 tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques to handle long texts effectively. We have validated the model's performance on context lengths of up to 131,072 tokens using the [YaRN](https://arxiv.org/abs/2309.00071) method.
YaRN is currently supported by several inference frameworks, e.g., `transformers` and `llama.cpp` for local use, `vllm` and `sglang` for deployment. In general, there are two approaches to enabling YaRN for supported frameworks:
- Modifying the model files:
In the `config.json` file, add the `rope_scaling` fields:
```json
{
...,
"rope_scaling": {
"type": "yarn",
"factor": 4.0,
"original_max_position_embeddings": 32768
}
}
```
For `llama.cpp`, you need to regenerate the GGUF file after the modification.
- Passing command line arguments:
For `vllm`, you can use
```shell
vllm serve ... --rope-scaling '{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072
```
For `sglang`, you can use
```shell
python -m sglang.launch_server ... --json-model-override-args '{"rope_scaling":{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
```
For `llama-server` from `llama.cpp`, you can use
```shell
llama-server ... --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
```
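With `transformers`, if you prefer not to edit `config.json`, the same override can usually be supplied at load time, since `from_pretrained` forwards configuration overrides to the model config; a minimal sketch under that assumption:
```python
from transformers import AutoModelForCausalLM

# Illustrative only: apply the YaRN rope_scaling override without modifying config.json.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    torch_dtype="auto",
    device_map="auto",
    max_position_embeddings=131072,
    rope_scaling={
        "type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768,
    },
)
```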
> [!IMPORTANT]
> If you encounter the following warning
> ```
> Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
> ```
> please upgrade `transformers>=4.51.0`.
> [!NOTE]
> All the notable open-source frameworks implement static YaRN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts.**
> We advise adding the `rope_scaling` configuration only when processing long contexts is required.
> It is also recommended to modify the `factor` as needed. For example, if the typical context length for your application is 65,536 tokens, it would be better to set `factor` as 2.0.
> [!NOTE]
> The default `max_position_embeddings` in `config.json` is set to 40,960. This allocation includes reserving 32,768 tokens for outputs and 8,192 tokens for typical prompts, which is sufficient for most scenarios involving short text processing. If the average context length does not exceed 32,768 tokens, we do not recommend enabling YaRN in this scenario, as it may potentially degrade model performance.
> [!TIP]
> The endpoint provided by Alibaba Model Studio supports dynamic YaRN by default and no extra configuration is needed.
## Best Practices
To achieve optimal performance, we recommend the following settings:
1. **Sampling Parameters**:
- For thinking mode (`enable_thinking=True`), use `Temperature=0.6`, `TopP=0.95`, `TopK=20`, and `MinP=0`. **DO NOT use greedy decoding**, as it can lead to performance degradation and endless repetitions.
- For non-thinking mode (`enable_thinking=False`), we suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0`.
- For supported frameworks, you can adjust the `presence_penalty` parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
2. **Adequate Output Length**: We recommend using an output length of 32,768 tokens for most queries. For benchmarking on highly complex problems, such as those found in math and programming competitions, we suggest setting the max output length to 38,912 tokens. This provides the model with sufficient space to generate detailed and comprehensive responses, thereby enhancing its overall performance.
3. **Standardize Output Format**: We recommend using prompts to standardize model outputs when benchmarking.
- **Math Problems**: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
- **Multiple-Choice Questions**: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`."
4. **No Thinking Content in History**: In multi-turn conversations, the historical model output should only include the final output part and does not need to include the thinking content. This is implemented in the provided Jinja2 chat template. However, for frameworks that do not directly use the Jinja2 chat template, it is up to the developers to ensure that this best practice is followed (see the sketch below).
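For frameworks that manage history manually, a minimal sketch of dropping the thinking block before an assistant turn is stored, assuming the raw reply contains a `<think>...</think>` block as described above:
```python
import re

def strip_thinking(reply: str) -> str:
    # Keep only the final answer: remove the first <think>...</think> block, if present.
    return re.sub(r"<think>.*?</think>", "", reply, count=1, flags=re.DOTALL).lstrip("\n")

# Hypothetical usage: raw_reply is the decoded model output for one turn.
raw_reply = "<think>\nsome reasoning\n</think>\n\nFinal answer."
print(strip_thinking(raw_reply))  # -> "Final answer."
```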
### Citation
If you find our work helpful, feel free to cite it.
```
@misc{qwen3,
title = {Qwen3},
url = {https://qwenlm.github.io/blog/qwen3/},
author = {Qwen Team},
month = {April},
year = {2025}
}
```
# Qwen3
Qwen3 is the latest generation of the Qwen series of large language models, offering both dense and mixture-of-experts (MoE) models that switch between thinking and non-thinking modes within a single model, covering dialogue, reasoning, coding, agent, and multilingual scenarios.
## Paper
`None`
## Model Architecture
Qwen3 adopts a standard decoder-only architecture and introduces MoE to improve performance. It is the first "hybrid reasoning model", integrating "fast thinking" and "slow thinking" into a single model.
<div align=center>
<img src="./doc/qwen.png"/>
</div>
## Algorithm Principle
The input is embedded and then passed through attention, FFN, and other layers to extract features. Finally, Softmax converts the unnormalized score vector (logits) produced by the last decoder layer into a probability distribution, in which each element is the probability of generating the corresponding token; the model can therefore produce a distribution and select the most likely token as its prediction.
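As an illustration of this final step, a minimal PyTorch sketch that turns last-layer logits into a probability distribution and picks the most likely next token (toy values, not tied to the real vocabulary size):
```python
import torch
import torch.nn.functional as F

# Toy logits over a 5-token vocabulary, as produced by the last decoder layer.
logits = torch.tensor([2.0, 0.5, -1.0, 3.0, 0.0])
probs = F.softmax(logits, dim=-1)           # normalize into a probability distribution
next_token_id = torch.argmax(probs).item()  # greedy choice of the most likely token
print(probs, next_token_id)
```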
## Environment Setup
```
mv Qwen3_pytorch Qwen3  # strip the framework suffix from the directory name
```
### Docker (Option 1)
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.4.1-ubuntu22.04-dtk25.04-py3.10-fixpy
# Replace <your IMAGE ID> with the ID of the image pulled above; for this image it is e77c15729879
docker run -it --shm-size=64G -v $PWD/Qwen3:/home/Qwen3 -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name qwen3 <your IMAGE ID> bash
cd /home/Qwen3
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple
```
### Dockerfile (Option 2)
```
cd /home/Qwen3/docker
docker build --no-cache -t qwen3:latest .
docker run --shm-size=64G --name qwen3 -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v $PWD/../../Qwen3:/home/Qwen3 -it qwen3 bash
# If installing the environment through the Dockerfile takes too long, comment out its pip install step and install the Python packages after the container starts: pip install -r requirements.txt
```
### Anaconda (Option 3)
1. The special deep learning libraries required for DCU cards in this project can be downloaded from the Guanghe (光合) developer community:
- https://developer.hpccube.com/tool/
```
DTK driver: dtk2504
python: python3.10
torch: 2.4.1
torchvision: 0.19.1
triton: 3.0.0
vllm: 0.6.2
flash-attn: 2.6.1
deepspeed: 0.14.2
apex: 1.4.0
transformers: 4.51.0
```
`Tip: the versions of the DTK driver, python, torch, and the other DCU-related tools above must correspond to one another exactly.`
2. Install the remaining, non-DCU-specific libraries according to requirements.txt:
```
cd /home/Qwen3
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple
```
## Dataset
`None`
## Training
## Inference
Directory structure of the pretrained weights:
```
/home/Qwen3/
└── Qwen/Qwen3-8B
```
### Single-Node Multi-Card
```
# This project uses Qwen3-8B as the example; other Qwen3 models work analogously.
cd /home/Qwen3
python infer_transformers.py
# vllm>=0.8.4 is still being adapted; vLLM-based inference will be released later.
```
For more information, refer to [`README_orgin`](./README_orgin.md) from the original project.
## Results
`Input:`
```
prompt: "Give me a short introduction to large language models."
```
`Output:`
```
<think>
Okay, the user wants a short introduction to large language models. Let me start by defining what they are. I should mention they're AI systems trained on massive text data. Maybe include how they process and generate human-like text. Also, touch on their applications like answering questions, creating content, coding. Need to keep it concise but cover the key points. Oh, and maybe mention their size, like parameters, but not too technical. Avoid jargon. Make sure it's easy to understand. Let me check if I'm missing anything important. Oh, maybe a sentence about their training process? Or just stick to the basics. Alright, structure: definition, training data, capabilities, applications. Keep each part brief. That should work.
</think>
Large language models (LLMs) are advanced artificial intelligence systems trained on vast amounts of text data to understand and generate human-like language. They can process and respond to complex queries, create written content, code, and even engage in conversations. These models, often with billions of parameters, excel at tasks like answering questions, summarizing information, and translating languages, making them versatile tools for various applications, from customer service to research and creative writing.
```
### Accuracy
DCU accuracy is consistent with GPU accuracy. Inference framework: PyTorch.
## Application Scenarios
### Algorithm Category
`Dialogue / Question Answering`
### Key Application Industries
`Manufacturing, media, finance, energy, healthcare, smart home, education`
## Pretrained Weights
ModelScope download: [Qwen/Qwen3-8B](https://www.modelscope.cn/Qwen/Qwen3-8B.git)
## Source Repository and Issue Feedback
- http://developer.sourcefind.cn/codes/modelzoo/Qwen3_pytorch.git
## References
- https://github.com/QwenLM/Qwen3.git
# Qwen3
<p align="center">
<img src="https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/logo_qwen3.png" width="400"/>
</p>
<p align="center">
💜 <a href="https://chat.qwen.ai/"><b>Qwen Chat</b></a>&nbsp&nbsp | &nbsp&nbsp🤗 <a href="https://huggingface.co/Qwen">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/organization/qwen">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp 📑 Paper &nbsp&nbsp | &nbsp&nbsp 📑 <a href="https://qwenlm.github.io/blog/qwen3/">Blog</a> &nbsp&nbsp | &nbsp&nbsp📖 <a href="https://qwen.readthedocs.io/">Documentation</a>
<br>
🖥️ <a href="https://huggingface.co/spaces/Qwen/Qwen3-Demo">Demo</a>&nbsp&nbsp | &nbsp&nbsp💬 <a href="https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png">WeChat (微信)</a>&nbsp&nbsp | &nbsp&nbsp🫨 <a href="https://discord.gg/CV4E9rpNSD">Discord</a>&nbsp&nbsp
</p>
Visit our Hugging Face or ModelScope organization (click links above), search checkpoints with names starting with `Qwen3-` or visit the [Qwen3 collection](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f), and you will find all you need! Enjoy!
To learn more about Qwen3, feel free to read our documentation \[[EN](https://qwen.readthedocs.io/en/latest/)|[ZH](https://qwen.readthedocs.io/zh-cn/latest/)\]. Our documentation consists of the following sections:
- Quickstart: the basic usages and demonstrations;
- Inference: the guidance for the inference with Transformers, including batch inference, streaming, etc.;
- Run Locally: the instructions for running LLM locally on CPU and GPU, with frameworks like llama.cpp and Ollama;
- Deployment: the demonstration of how to deploy Qwen for large-scale inference with frameworks like SGLang, vLLM, TGI, etc.;
- Quantization: the practice of quantizing LLMs with GPTQ, AWQ, as well as the guidance for how to make high-quality quantized GGUF files;
- Training: the instructions for post-training, including SFT and RLHF (TODO) with frameworks like Axolotl, LLaMA-Factory, etc.
- Framework: the usage of Qwen with frameworks for application, e.g., RAG, Agent, etc.
## Introduction
We are excited to announce the release of Qwen3, the latest addition to the Qwen family of large language models.
These models represent our most advanced and intelligent systems to date, building on our experience with QwQ and Qwen2.5.
We are making the weights of Qwen3 available to the public, including both dense and Mixture-of-Experts (MoE) models.
The highlights from Qwen3 include:
- **Dense and Mixture-of-Experts (MoE) models of various sizes**, available in 0.6B, 1.7B, 4B, 8B, 14B, 32B and 30B-A3B, 235B-A22B.
- **Seamless switching between thinking mode** (for complex logical reasoning, math, and coding) and **non-thinking mode** (for efficient, general-purpose chat), ensuring optimal performance across various scenarios.
- **Significant enhancement in reasoning capabilities**, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- **Superior human preference alignment**, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- **Expertise in agent capabilities**, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- **Support of 100+ languages and dialects** with strong capabilities for **multilingual instruction following** and **translation**.
> [!IMPORTANT]
> Qwen3 models adopt a different naming scheme.
>
> The post-trained models do not use the "-Instruct" suffix any more. For example, Qwen3-32B is the newer version of Qwen2.5-32B-Instruct.
>
> The base models now have names ending with "-Base".
## News
- 2025.04.29: We released the Qwen3 series. Check our [blog](https://qwenlm.github.io/blog/qwen3) for more details!
- 2024.09.19: We released the Qwen2.5 series. This time there are 3 extra model sizes: 3B, 14B, and 32B for more possibilities. Check our [blog](https://qwenlm.github.io/blog/qwen2.5) for more!
- 2024.06.06: We released the Qwen2 series. Check our [blog](https://qwenlm.github.io/blog/qwen2/)!
- 2024.03.28: We released the first MoE model of Qwen: Qwen1.5-MoE-A2.7B! Temporarily, only HF transformers and vLLM support the model. We will soon add the support of llama.cpp, mlx-lm, etc. Check our [blog](https://qwenlm.github.io/blog/qwen-moe/) for more information!
- 2024.02.05: We released the Qwen1.5 series.
## Performance
Detailed evaluation results are reported in this <a href="https://qwenlm.github.io/blog/qwen3/"> 📑 blog</a>.
For requirements on GPU memory and the respective throughput, see results [here](https://qwen.readthedocs.io/en/latest/getting_started/speed_benchmark.html).
## Run Qwen3
### 🤗 Transformers
Transformers is a library of pretrained natural language processing models for inference and training.
The latest version of `transformers` is recommended and `transformers>=4.51.0` is required.
The following code snippet illustrates how to use the model to generate content based on given inputs.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Qwen/Qwen3-8B"
# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# conduct text completion
generated_ids = model.generate(
**model_inputs,
max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
# the result will begin with thinking content in <think></think> tags, followed by the actual response
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```
By default, Qwen3 models will think before responding.
This can be controlled by
- `enable_thinking=False`: Passing `enable_thinking=False` to `tokenizer.apply_chat_template` will strictly prevent the model from generating thinking content.
- `/think` and `/no_think` instructions: Use these words in the system or user message to signify whether Qwen3 should think. In multi-turn conversations, the latest instruction is followed.
### ModelScope
We strongly advise users, especially those in mainland China, to use ModelScope.
ModelScope adopts a Python API similar to Transformers.
The CLI tool `modelscope download` can help you solve issues concerning downloading checkpoints.
### llama.cpp
[`llama.cpp`](https://github.com/ggml-org/llama.cpp) enables LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware.
`llama.cpp>=b5092` is required.
To use the CLI, run the following in a terminal:
```shell
./llama-cli -hf Qwen/Qwen3-8B-GGUF:Q8_0 --jinja --color -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -c 40960 -n 32768 --no-context-shift
# CTRL+C to exit
```
To use the API server, run the following in a terminal:
```shell
./llama-server -hf Qwen/Qwen3-8B-GGUF:Q8_0 --jinja --reasoning-format deepseek -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -c 40960 -n 32768 --no-context-shift --port 8080
```
A simple web front end will be at `http://localhost:8080` and an OpenAI-compatible API will be at `http://localhost:8080/v1`.
For additional guides, please refer to [our documentation](https://qwen.readthedocs.io/en/latest/run_locally/llama.cpp.html).
### Ollama
After [installing ollama](https://ollama.com/), you can initiate the ollama service with the following command:
```shell
ollama serve
# You need to keep this service running whenever you are using ollama
```
To pull a model checkpoint and run the model, use the `ollama run` command. You can specify a model size by adding a suffix to `qwen3`, such as `:8b` or `:30b-a3b`:
```shell
ollama run qwen3:8b
# To exit, type "/bye" and press ENTER
```
You can also access the ollama service via its OpenAI-compatible API.
Please note that you need to (1) keep `ollama serve` running while using the API, and (2) execute `ollama run qwen3:8b` before utilizing this API to ensure that the model checkpoint is prepared.
The API is at `http://localhost:11434/v1/` by default.
For additional details, please visit [ollama.ai](https://ollama.com/).
### LMStudio
Qwen3 is already supported by [lmstudio.ai](https://lmstudio.ai/). You can directly use LMStudio with our GGUF files.
### MLX-LM
If you are running on Apple Silicon, [`mlx-lm`](https://github.com/ml-explore/mlx-lm) also supports Qwen3 (`mlx-lm>=0.24.0`).
Look for models ending with MLX on HuggingFace Hub.
<!-- ### OpenVINO
Qwen2.5 has already been supported by [OpenVINO toolkit](https://github.com/openvinotoolkit). You can install and run this [chatbot example](https://github.com/OpenVINO-dev-contest/Qwen2.openvino) with Intel CPU, integrated GPU or discrete GPU. -->
<!-- ### Text generation web UI
You can directly use [`text-generation-webui`](https://github.com/oobabooga/text-generation-webui) for creating a web UI demo. If you use GGUF, remember to install the latest wheel of `llama.cpp` with the support of Qwen2.5. -->
<!-- ### llamafile
Clone [`llamafile`](https://github.com/Mozilla-Ocho/llamafile), run source install, and then create your own llamafile with the GGUF file following the guide [here](https://github.com/Mozilla-Ocho/llamafile?tab=readme-ov-file#creating-llamafiles). You are able to run one line of command, say `./qwen.llamafile`, to create a demo. -->
## Deploy Qwen3
Qwen3 is supported by multiple inference frameworks.
Here we demonstrate the usage of `SGLang` and `vLLM`.
You can also find Qwen3 models from various inference providers, e.g., [Alibaba Cloud Model Studio](https://www.alibabacloud.com/en/product/modelstudio).
### SGLang
[SGLang](https://github.com/sgl-project/sglang) is a fast serving framework for large language models and vision language models.
SGLang can be used to launch a server with an OpenAI-compatible API service.
`sglang>=0.4.6.post1` is required.
It is as easy as
```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-8B --port 30000 --reasoning-parser qwen3
```
An OpenAI-compatible API will be available at `http://localhost:30000/v1`.
### vLLM
[vLLM](https://github.com/vllm-project/vllm) is a high-throughput and memory-efficient inference and serving engine for LLMs.
`vllm>=0.8.4` is required.
```shell
vllm serve Qwen/Qwen3-8B --port 8000 --enable-reasoning --reasoning-parser deepseek_r1
```
An OpenAI-compatible API will be available at `http://localhost:8000/v1`.
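Once either server is running, a minimal sketch of querying it with the `openai` Python SDK (port and model name as assumed above); with a reasoning parser enabled, the thinking content is typically returned in a separate `reasoning_content` field rather than in `content`:
```python
from openai import OpenAI

# Illustrative client for the OpenAI-compatible server started above (vLLM on port 8000).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.6,
    top_p=0.95,
)
message = response.choices[0].message
print(getattr(message, "reasoning_content", None))  # thinking content, if the reasoning parser exposes it
print(message.content)
```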
### MindIE
For deployment on Ascend NPUs, please visit [Modelers](https://modelers.cn/) and search for Qwen3.
<!--
### OpenLLM
[OpenLLM](https://github.com/bentoml/OpenLLM) allows you to easily run Qwen2.5 as OpenAI-compatible APIs. You can start a model server using `openllm serve`. For example:
```bash
openllm serve qwen2.5:7b
```
The server is active at `http://localhost:3000/`, providing OpenAI-compatible APIs. You can create an OpenAI client to call its chat API. For more information, refer to [our documentation](https://qwen.readthedocs.io/en/latest/deployment/openllm.html). -->
## Build with Qwen3
### Tool Use
For tool use capabilities, we recommend taking a look at [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent), which provides a wrapper around these APIs to support tool use or function calling with MCP support.
Tool use with Qwen3 can also be conducted with SGLang, vLLM, Transformers, llama.cpp, Ollama, etc.
Follow guides in our documentation to see how to enable the support.
### Finetuning
We advise you to use training frameworks, including [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl), [unsloth](https://github.com/unslothai/unsloth), [Swift](https://github.com/modelscope/swift), [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory), etc., to finetune your models with SFT, DPO, GRPO, etc.
## License Agreement
All our open-source models are licensed under Apache 2.0.
You can find the license files in the respective Hugging Face repositories.
## Citation
If you find our work helpful, feel free to cite it.
```
@article{qwen2.5,
title = {Qwen2.5 Technical Report},
author = {An Yang and Baosong Yang and Beichen Zhang and Binyuan Hui and Bo Zheng and Bowen Yu and Chengyuan Li and Dayiheng Liu and Fei Huang and Haoran Wei and Huan Lin and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Yang and Jiaxi Yang and Jingren Zhou and Junyang Lin and Kai Dang and Keming Lu and Keqin Bao and Kexin Yang and Le Yu and Mei Li and Mingfeng Xue and Pei Zhang and Qin Zhu and Rui Men and Runji Lin and Tianhao Li and Tingyu Xia and Xingzhang Ren and Xuancheng Ren and Yang Fan and Yang Su and Yichang Zhang and Yu Wan and Yuqiong Liu and Zeyu Cui and Zhenru Zhang and Zihan Qiu},
journal = {arXiv preprint arXiv:2412.15115},
year = {2024}
}
@article{qwen2,
title = {Qwen2 Technical Report},
author = {An Yang and Baosong Yang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Zhou and Chengpeng Li and Chengyuan Li and Dayiheng Liu and Fei Huang and Guanting Dong and Haoran Wei and Huan Lin and Jialong Tang and Jialin Wang and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Ma and Jin Xu and Jingren Zhou and Jinze Bai and Jinzheng He and Junyang Lin and Kai Dang and Keming Lu and Keqin Chen and Kexin Yang and Mei Li and Mingfeng Xue and Na Ni and Pei Zhang and Peng Wang and Ru Peng and Rui Men and Ruize Gao and Runji Lin and Shijie Wang and Shuai Bai and Sinan Tan and Tianhang Zhu and Tianhao Li and Tianyu Liu and Wenbin Ge and Xiaodong Deng and Xiaohuan Zhou and Xingzhang Ren and Xinyu Zhang and Xipin Wei and Xuancheng Ren and Yang Fan and Yang Yao and Yichang Zhang and Yu Wan and Yunfei Chu and Yuqiong Liu and Zeyu Cui and Zhenru Zhang and Zhihao Fan},
journal = {arXiv preprint arXiv:2407.10671},
year = {2024}
}
```
## Contact Us
If you are interested in leaving a message for either our research team or product team, join our [Discord](https://discord.gg/z3GAxXZ9Ce) or [WeChat groups](assets/wechat.png)!
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.4.1-ubuntu22.04-dtk25.04-py3.10-fixpy
ENV DEBIAN_FRONTEND=noninteractive
# RUN yum update && yum install -y git cmake wget build-essential
# RUN source /opt/dtk-dtk25.04/env.sh
# # Install pip dependencies
COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
transformers>=4.51.0
ARG CUDA_VERSION=12.1.0
ARG from=nvidia/cuda:${CUDA_VERSION}-cudnn8-devel-ubuntu20.04
FROM ${from} as base
RUN <<EOF
apt update -y && apt upgrade -y && apt install -y --no-install-recommends \
git \
git-lfs \
python3 \
python3-pip \
python3-dev \
wget \
vim \
&& rm -rf /var/lib/apt/lists/*
EOF
RUN ln -s /usr/bin/python3 /usr/bin/python
RUN git lfs install
FROM base as dev
WORKDIR /
RUN mkdir -p /data/shared/Qwen
WORKDIR /data/shared/Qwen/
FROM dev as bundle_req
RUN pip install --no-cache-dir networkx==3.1
RUN pip3 install --no-cache-dir torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu121
RUN pip3 install --no-cache-dir transformers==4.40.2 accelerate tiktoken einops scipy
FROM bundle_req as bundle_finetune
ARG BUNDLE_FINETUNE=true
RUN <<EOF
if [ "$BUNDLE_FINETUNE" = "true" ]; then
cd /data/shared/Qwen
# Full-finetune / LoRA.
pip3 install --no-cache-dir "deepspeed==0.14.2" "peft==0.11.1"
# Q-LoRA.
apt update -y && DEBIAN_FRONTEND=noninteractive apt install -y --no-install-recommends \
libopenmpi-dev openmpi-bin \
&& rm -rf /var/lib/apt/lists/*
pip3 install --no-cache-dir "optimum==1.20.0" "auto-gptq==0.7.1" "autoawq==0.2.5" mpi4py
fi
EOF
FROM bundle_finetune as bundle_vllm
ARG BUNDLE_VLLM=true
RUN <<EOF
if [ "$BUNDLE_VLLM" = "true" ]; then
cd /data/shared/Qwen
pip3 install --no-cache-dir vllm==0.4.3 "fschat[model_worker,webui]==0.2.36"
fi
EOF
FROM bundle_vllm as bundle_flash_attention
ARG BUNDLE_FLASH_ATTENTION=true
RUN <<EOF
if [ "$BUNDLE_FLASH_ATTENTION" = "true" ]; then
pip3 install --no-cache-dir flash-attn==2.5.8 --no-build-isolation
fi
EOF
FROM bundle_flash_attention as final
COPY ../examples/sft/* ./
COPY ../examples/demo/* ./
EXPOSE 80
#!/usr/bin/env bash
#
# This script will automatically pull the docker image from DockerHub and start a container to run the Qwen-Chat cli-demo.
IMAGE_NAME=qwenllm/qwen:2-cu121
QWEN_CHECKPOINT_PATH=/path/to/Qwen-Instruct
CONTAINER_NAME=qwen2
function usage() {
echo '
Usage: bash docker/docker_cli_demo.sh [-i IMAGE_NAME] -c [/path/to/Qwen-Instruct] [-n CONTAINER_NAME]
'
}
while [[ "$1" != "" ]]; do
case $1 in
-i | --image-name )
shift
IMAGE_NAME=$1
;;
-c | --checkpoint )
shift
QWEN_CHECKPOINT_PATH=$1
;;
-n | --container-name )
shift
CONTAINER_NAME=$1
;;
-h | --help )
usage
exit 0
;;
* )
echo "Unknown argument ${1}"
exit 1
;;
esac
shift
done
if [ ! -e ${QWEN_CHECKPOINT_PATH}/config.json ]; then
echo "Checkpoint config.json file not found in ${QWEN_CHECKPOINT_PATH}, exit."
exit 1
fi
sudo docker pull ${IMAGE_NAME} || {
echo "Pulling image ${IMAGE_NAME} failed, exit."
exit 1
}
sudo docker run --gpus all --rm --name ${CONTAINER_NAME} \
--mount type=bind,source=${QWEN_CHECKPOINT_PATH},target=/data/shared/Qwen/Qwen-Instruct \
-it ${IMAGE_NAME} \
python cli_demo.py -c /data/shared/Qwen/Qwen-Instruct/
#!/usr/bin/env bash
#
# This script will automatically pull the docker image from DockerHub and start a daemon container to run the Qwen-Chat web-demo.
IMAGE_NAME=qwenllm/qwen:2-cu121
QWEN_CHECKPOINT_PATH=/path/to/Qwen-Instruct
PORT=8901
CONTAINER_NAME=qwen2
function usage() {
echo '
Usage: bash docker/docker_web_demo.sh [-i IMAGE_NAME] -c [/path/to/Qwen-Instruct] [-n CONTAINER_NAME] [--port PORT]
'
}
while [[ "$1" != "" ]]; do
case $1 in
-i | --image-name )
shift
IMAGE_NAME=$1
;;
-c | --checkpoint )
shift
QWEN_CHECKPOINT_PATH=$1
;;
-n | --container-name )
shift
CONTAINER_NAME=$1
;;
--port )
shift
PORT=$1
;;
-h | --help )
usage
exit 0
;;
* )
echo "Unknown argument ${1}"
exit 1
;;
esac
shift
done
if [ ! -e ${QWEN_CHECKPOINT_PATH}/config.json ]; then
echo "Checkpoint config.json file not found in ${QWEN_CHECKPOINT_PATH}, exit."
exit 1
fi
sudo docker pull ${IMAGE_NAME} || {
echo "Pulling image ${IMAGE_NAME} failed, exit."
exit 1
}
sudo docker run --gpus all -d --restart always --name ${CONTAINER_NAME} \
-v /var/run/docker.sock:/var/run/docker.sock -p ${PORT}:80 \
--mount type=bind,source=${QWEN_CHECKPOINT_PATH},target=/data/shared/Qwen/Qwen-Instruct \
-it ${IMAGE_NAME} \
python web_demo.py --server-port 80 --server-name 0.0.0.0 -c /data/shared/Qwen/Qwen-Instruct/ && {
echo "Successfully started web demo. Open 'http://localhost:${PORT}' to try!
Run \`docker logs ${CONTAINER_NAME}\` to check demo status.
Run \`docker rm -f ${CONTAINER_NAME}\` to stop and remove the demo."
}
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
# Qwen Documentation
This is the source of the documentation at <https://qwen.readthedocs.io>.
## Quick Start
We use `sphinx` to manage the documentation and use the `furo` theme.
To get started, simply run
```bash
pip install -r requirements-docs.txt
```
Then run `make html` or `sphinx-build -M html source build`, and it will compile the docs and put them under the `build/html` directory.
## Translation
The documentation is available in both English and Simplified Chinese. We use
`sphinx-intl` to work with Sphinx translation flow, following [this article](https://www.sphinx-doc.org/en/master/usage/advanced/intl.html).
You need to install the Python package `sphinx-intl` before starting.
1. After updating the English documentation, run `make gettext`, and the pot files will be placed in the `build/gettext` directory. `make gettext` can be slow if the doc is long.
2. Use the generated pot files to update the po files:
```bash
sphinx-intl update -p build/gettext -l zh_CN -w 0
```
3. Translate the po files at `locales/zh_CN/LC_MESSAGES`. Pay attention to fuzzy matches (messages after `#, fuzzy`). Please be careful not to break reST notation.
4. Build the translated documentation: `make -e SPHINXOPTS="-D language='zh_CN'" html` or `sphinx-build -M html source build -D language=zh_CN`
## Auto Build
```bash
pip install sphinx-autobuild
```
To autobuild the default version:
```bash
sphinx-autobuild source build/html
```
To autobuild the translated version:
```bash
sphinx-autobuild source build/html -D language=zh_CN --watch locales/zh_CN
```
By default, the docs are served at `http://127.0.0.1:8000`.
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2024, Qwen Team
# This file is distributed under the same license as the Qwen package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2024.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: Qwen \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-04-28 19:42+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.17.0\n"
#: ../../Qwen/source/deployment/openllm.rst:2 986ea00cb5af4a0d82f974ed79a82430
msgid "OpenLLM"
msgstr "OpenLLM"
#: ../../Qwen/source/deployment/openllm.rst:5 78be03fbdccb429892b03bf84596411b
msgid "To be updated for Qwen3."
msgstr "仍需为Qwen3更新。"
#: ../../Qwen/source/deployment/openllm.rst:7 a001f11d1c5440188121d20b3baf59db
msgid "OpenLLM allows developers to run Qwen2.5 models of different sizes as OpenAI-compatible APIs with a single command. It features a built-in chat UI, state-of-the-art inference backends, and a simplified workflow for creating enterprise-grade cloud deployment with Qwen2.5. Visit `the OpenLLM repository <https://github.com/bentoml/OpenLLM/>`_ to learn more."
msgstr "OpenLLM 允许开发者通过一个命令运行不同大小的 Qwen2.5 模型,提供 OpenAI 兼容的 API。它具有内置的聊天 UI,先进的推理后端,以及简化的工作流程来使用 Qwen2.5 创建企业级云部署。访问 `OpenLLM 仓库 <https://github.com/bentoml/OpenLLM/>`_ 了解更多信息。"
#: ../../Qwen/source/deployment/openllm.rst:10 229f89c3be65442bbe15905d75a0d13d
msgid "Installation"
msgstr "安装"
#: ../../Qwen/source/deployment/openllm.rst:12 79421f700fbc426cb6ce9841aff67503
msgid "Install OpenLLM using ``pip``."
msgstr "使用 ``pip`` 安装 OpenLLM。"
#: ../../Qwen/source/deployment/openllm.rst:18 69cfd6fe2e274173ad4065be91b71472
msgid "Verify the installation and display the help information:"
msgstr "验证安装并显示帮助信息:"
#: ../../Qwen/source/deployment/openllm.rst:25 503cae99b14c4ef4b322b8ec0bd2d32d
msgid "Quickstart"
msgstr "快速开始"
#: ../../Qwen/source/deployment/openllm.rst:27 0ea788c801404d8780404611c87644b0
msgid "Before you run any Qwen2.5 model, ensure your model repository is up to date by syncing it with OpenLLM's latest official repository."
msgstr "在运行任何 Qwen2.5 模型之前,确保您的模型仓库与 OpenLLM 的最新官方仓库同步。"
#: ../../Qwen/source/deployment/openllm.rst:33 8852ff46ecdb45b2bfc9885bbfaacb02
msgid "List the supported Qwen2.5 models:"
msgstr "列出支持的 Qwen2.5 模型:"
#: ../../Qwen/source/deployment/openllm.rst:39 3e4f6c11396844adb30d4e5812339484
msgid "The results also display the required GPU resources and supported platforms:"
msgstr "结果还会显示所需的 GPU 资源和支持的平台:"
#: ../../Qwen/source/deployment/openllm.rst:57 ac4c0db02f5249d5882940820779db9a
msgid "To start a server with one of the models, use ``openllm serve`` like this:"
msgstr "要使用其中一个模型来启动服务器,请使用 ``openllm serve`` 命令,例如:"
#: ../../Qwen/source/deployment/openllm.rst:63 0a1d3ec35c684e3bb3e971c916aa9be7
msgid "By default, the server starts at ``http://localhost:3000/``."
msgstr "默认情况下,服务器启动在 http://localhost:3000/。"
#: ../../Qwen/source/deployment/openllm.rst:66 2e787de9a62f4342bdf8f88ee0df5379
msgid "Interact with the model server"
msgstr "与模型服务器交互"
#: ../../Qwen/source/deployment/openllm.rst:68 b22802ad9027458bb30ea0da665fea36
msgid "With the model server up and running, you can call its APIs in the following ways:"
msgstr "服务器运行后,可以通过以下方式调用其 API:"
#: ../../Qwen/source/deployment/openllm.rst 76214ea690094930899d6f2eddcc1454
msgid "CURL"
msgstr "CURL"
#: ../../Qwen/source/deployment/openllm.rst:74 42775a3df58f474782d29f2f82707bd9
msgid "Send an HTTP request to its ``/generate`` endpoint via CURL:"
msgstr "通过 CURL 向其 ``/generate`` 端点发送 HTTP 请求:"
#: ../../Qwen/source/deployment/openllm.rst 4f0ff3eee2ab49dda5a72bd611a9d45e
msgid "Python client"
msgstr "Python 客户端"
#: ../../Qwen/source/deployment/openllm.rst:91 ce2e11a46e434798947b1e74ce82a19c
msgid "Call the OpenAI-compatible endpoints with frameworks and tools that support the OpenAI API protocol. Here is an example:"
msgstr "使用支持 OpenAI API 协议的框架和工具来调用。例如:"
#: ../../Qwen/source/deployment/openllm.rst 107921d1a855430ca70c8c163d37c7f2
msgid "Chat UI"
msgstr "聊天 UI"
#: ../../Qwen/source/deployment/openllm.rst:118
#: b92df2759cd54c2b8316e2a160ede656
msgid "OpenLLM provides a chat UI at the ``/chat`` endpoint for the LLM server at http://localhost:3000/chat."
msgstr "OpenLLM 为 LLM 服务器提供的聊天 UI 位于 ``/chat`` 端点,地址为 http://localhost:3000/chat。"
#: ../../Qwen/source/deployment/openllm.rst:123
#: 0d3fa679178f443caf9c87623001be1f
msgid "Model repository"
msgstr "模型仓库"
#: ../../Qwen/source/deployment/openllm.rst:125
#: 54d6a9bdcc064aeb95a23b60d3d575ab
msgid "A model repository in OpenLLM represents a catalog of available LLMs. You can add your own repository to OpenLLM with custom Qwen2.5 variants for your specific needs. See our `documentation to learn details <https://github.com/bentoml/OpenLLM?tab=readme-ov-file#model-repository>`_."
msgstr "OpenLLM 中的模型仓库表示可用的 LLM 目录。您可以为 OpenLLM 添加自定义的 Qwen2.5 模型仓库,以满足您的特定需求。请参阅 `我们的文档 <https://github.com/bentoml/OpenLLM?tab=readme-ov-file#model-repository>`_ 了解详细信息。"
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2024, Qwen Team
# This file is distributed under the same license as the Qwen package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2025.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: Qwen \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-04-28 19:42+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.17.0\n"
#: ../../Qwen/source/deployment/sglang.md:1 4886c9be510e44ba968bba79c7e01e2b
msgid "SGLang"
msgstr ""
#: ../../Qwen/source/deployment/sglang.md:3 fa388b3c599c454bbe22dc7c831723c1
msgid "[SGLang](https://github.com/sgl-project/sglang) is a fast serving framework for large language models and vision language models."
msgstr "[SGLang](https://github.com/sgl-project/sglang) 是一个用于大型语言模型和视觉语言模型的快速推理框架。"
#: ../../Qwen/source/deployment/sglang.md:5 43fe1ab3622b4d619de1ba451ff5b5c4
msgid "To learn more about SGLang, please refer to the [documentation](https://docs.sglang.ai/)."
msgstr "要了解更多关于 SGLang 的信息,请参阅[官方文档](https://docs.sglang.ai/)。"
#: ../../Qwen/source/deployment/sglang.md:7 4e7093847f104f5c91bf12495db0e2df
msgid "Environment Setup"
msgstr "环境配置"
#: ../../Qwen/source/deployment/sglang.md:9 404501b6bb754a01afa398ce270f4ad6
msgid "By default, you can install `sglang` with pip in a clean environment:"
msgstr "默认情况下,你可以通过 pip 在新环境中安装 `sglang` : "
#: ../../Qwen/source/deployment/sglang.md:15 8794cc70acd141eeaef4717a190b11f4
msgid "Please note that `sglang` relies on `flashinfer-python` and has strict dependencies on `torch` and its CUDA versions. Check the note in the official document for installation ([link](https://docs.sglang.ai/start/install.html)) for more help."
msgstr "请留意预构建的 `sglang` 依赖 `flashinfer-python`,并对`torch`和其CUDA版本有强依赖。请查看[官方文档](https://docs.sglang.ai/start/install.html)中的注意事项以获取有关安装的帮助。"
#: ../../Qwen/source/deployment/sglang.md:18 06e04edfe3094363bcbc5b8758c8b16c
msgid "API Service"
msgstr "API 服务"
#: ../../Qwen/source/deployment/sglang.md:20 5969d8121d8a4af99d790844c4b348c5
msgid "It is easy to build an OpenAI-compatible API service with SGLang, which can be deployed as a server that implements OpenAI API protocol. By default, it starts the server at `http://localhost:30000`. You can specify the address with `--host` and `--port` arguments. Run the command as shown below:"
msgstr "借助 SGLang ,构建一个与OpenAI API兼容的API服务十分简便,该服务可以作为实现OpenAI API协议的服务器进行部署。默认情况下,它将在 `http://localhost:30000` 启动服务器。您可以通过 `--host` 和 `--port` 参数来自定义地址。请按照以下所示运行命令:"
#: ../../Qwen/source/deployment/sglang.md:28 32a52bb639634b9b9c196696dc20e2c5
msgid "By default, if the `--model-path` does not point to a valid local directory, it will download the model files from the HuggingFace Hub. To download model from ModelScope, set the following before running the above command:"
msgstr "默认情况下,如果模型未指向有效的本地目录,它将从 HuggingFace Hub 下载模型文件。要从 ModelScope 下载模型,请在运行上述命令之前设置以下内容:"
#: ../../Qwen/source/deployment/sglang.md:34 cd33984af3e045549668c8ad682f7612
msgid "For distrbiuted inference with tensor parallelism, it is as simple as"
msgstr "对于使用张量并行的分布式推理,操作非常简单:"
#: ../../Qwen/source/deployment/sglang.md:38 4db95581ffd046a9b6d532933403d985
msgid "The above command will use tensor parallelism on 4 GPUs. You should change the number of GPUs according to your demand."
msgstr "上述命令将在 4 块 GPU 上使用张量并行。您应根据需求调整 GPU 的数量。"
#: ../../Qwen/source/deployment/sglang.md:41 765cc12e934b4ab6881f0a71693fcc3d
msgid "Basic Usage"
msgstr "基本用法"
#: ../../Qwen/source/deployment/sglang.md:43 51032557dac94cb3b14c3842076192a8
msgid "Then, you can use the [create chat interface](https://platform.openai.com/docs/api-reference/chat/completions/create) to communicate with Qwen:"
msgstr "然后,您可以利用 [create chat interface](https://platform.openai.com/docs/api-reference/chat/completions/create) 来与Qwen进行对话:"
#: ../../Qwen/source/deployment/sglang.md 0708e8d2e6a44e94956e44f3a83bb4d8
#: 3bfc1bfe04ea4b49bfd1d5c6b5af52d7
msgid "curl"
msgstr ""
#: ../../Qwen/source/deployment/sglang.md 3b62fc6e456d44a6ba9cc8f5519fc3c6
#: ab964c5641584f7a9ef4252ecf0428cb
msgid "Python"
msgstr ""
#: ../../Qwen/source/deployment/sglang.md:63
#: ../../Qwen/source/deployment/sglang.md:130 18da82bbe0db4a59aa430b68b91db904
#: a2fccb4d7c164911a35e5ff6f30d98df
msgid "You can use the API client with the `openai` Python SDK as shown below:"
msgstr "或者您可以如下面所示使用 `openai` Python SDK中的 API 客户端:"
#: ../../Qwen/source/deployment/sglang.md:91 2bff40bb9f104cf2b19e6cf8169bf18d
msgid "While the default sampling parameters would work most of the time for thinking mode, it is recommended to adjust the sampling parameters according to your application, and always pass the sampling parameters to the API."
msgstr "虽然默认的采样参数在大多数情况下适用于思考模式,但建议根据您的应用调整采样参数,并始终将采样参数传递给 API。"
#: ../../Qwen/source/deployment/sglang.md:97 e10b0bbcaa7c4e54a59f7a30fa8760ef
msgid "Thinking & Non-Thinking Modes"
msgstr "思考与非思考模式"
#: ../../Qwen/source/deployment/sglang.md:100 ff0a121d43d5494597e8fc3b832f4893
msgid "This feature has not been released. For more information, please see this [pull request](https://github.com/sgl-project/sglang/pull/5551)."
msgstr "此功能尚未发布。更多信息,请参阅此[pull request](https://github.com/sgl-project/sglang/pull/5551)。"
#: ../../Qwen/source/deployment/sglang.md:104 8ba3c8c378ed4df7acb28f04e41bf067
msgid "Qwen3 models will think before respond. This behaviour could be controled by either the hard switch, which could disable thinking completely, or the soft switch, where the model follows the instruction of the user on whether or not it should think."
msgstr "Qwen3 模型会在回复前进行思考。这种行为可以通过硬开关(完全禁用思考)或软开关(模型遵循用户关于是否应该思考的指令)来控制。"
#: ../../Qwen/source/deployment/sglang.md:107 dcc39b3925704aee927b220cbf9b341d
msgid "The hard switch is availabe in SGLang through the following configuration to the API call. To disable thinking, use"
msgstr "硬开关在 vLLM 中可以通过以下 API 调用配置使用。要禁用思考,请使用"
#: ../../Qwen/source/deployment/sglang.md:159 952c90f4f1c84daba9cb66bfeb32725f
msgid "It is recommended to set sampling parameters differently for thinking and non-thinking modes."
msgstr "建议为思考模式和非思考模式分别设置不同的采样参数。"
#: ../../Qwen/source/deployment/sglang.md:162 750cee1281d74246bc7cf47ac9e0d502
msgid "Parsing Thinking Content"
msgstr "解析思考内容"
#: ../../Qwen/source/deployment/sglang.md:164 4f1f6c5d59134ea1bf6a625cd5081c51
msgid "SGLang supports parsing the thinking content from the model generation into structured messages:"
msgstr "SGLang 支持将模型生成的思考内容解析为结构化消息:"
#: ../../Qwen/source/deployment/sglang.md:169 0517d0a9cf694f6caabcbe69e3e1e845
msgid "The response message will have a field named `reasoning_content` in addition to `content`, containing the thinking content generated by the model."
msgstr "响应消息除了包含 `content` 字段外,还会有一个名为 `reasoning_content` 的字段,其中包含模型生成的思考内容。"
#: ../../Qwen/source/deployment/sglang.md:172 0225706aa7fe441c82d34f81b348fd42
msgid "Please note that this feature is not OpenAI API compatible."
msgstr "请注意,此功能与 OpenAI API 规范不一致。"
#: ../../Qwen/source/deployment/sglang.md:175 45a5f606e86543c08eacf7686b5a2def
msgid "Parsing Tool Calls"
msgstr "解析工具调用"
#: ../../Qwen/source/deployment/sglang.md:177 0aa3be18c7a5476cb915d6686c58387d
msgid "SGLang supports parsing the tool calling content from the model generation into structured messages:"
msgstr "SGLang 支持将模型生成的工具调用内容解析为结构化消息:"
#: ../../Qwen/source/deployment/sglang.md:182 dc096c7fb79c4b9ca0dd2c9cdd7ec890
msgid "For more information, please refer to [our guide on Function Calling](../framework/function_call.md)."
msgstr "详细信息,请参阅[函数调用的指南](../framework/function_call.md#vllm)。"
#: ../../Qwen/source/deployment/sglang.md:184 a58bba52efc44663af792d859bd3b410
msgid "Structured/JSON Output"
msgstr "结构化/JSON输出"
#: ../../Qwen/source/deployment/sglang.md:186 518f257e9d6d4080b41f980467573f7f
msgid "SGLang supports structured/JSON output. Please refer to [SGLang's documentation](https://docs.sglang.ai/backend/structured_outputs.html#OpenAI-Compatible-API). Besides, it is also recommended to instruct the model to generate the specific format in the system message or in your prompt."
msgstr "SGLang 支持结构化/JSON 输出。请参阅[SGLan文档](https://docs.sglang.ai/backend/structured_outputs.html#OpenAI-Compatible-API)。此外,还建议在系统消息或您的提示中指示模型生成特定格式。"
#: ../../Qwen/source/deployment/sglang.md:190 3a6d08a831584d6b8392da2650e8bf0b
msgid "Serving Quantized models"
msgstr "部署量化模型"
#: ../../Qwen/source/deployment/sglang.md:192 b2fe212a02b84349940f4c0c30cde88d
msgid "Qwen3 comes with two types of pre-quantized models, FP8 and AWQ."
msgstr "Qwen3 提供了两种类型的预量化模型:FP8 和 AWQ。"
#: ../../Qwen/source/deployment/sglang.md:194 efc85fc46a564483bdb872dbf5d61f3c
msgid "The command serving those models are the same as the original models except for the name change:"
msgstr "部署这些模型的命令与原始模型相同,只是名称有所更改:"
#: ../../Qwen/source/deployment/sglang.md:203 11a6d1bb983d4e60a55f5d579f1eb76b
msgid "Context Length"
msgstr "上下文长度"
#: ../../Qwen/source/deployment/sglang.md:205 de0293719e06477fbde6afc533973b1a
msgid "The context length for Qwen3 models in pretraining is up to 32,768 tokenns. To handle context length substantially exceeding 32,768 tokens, RoPE scaling techniques should be applied. We have validated the performance of [YaRN](https://arxiv.org/abs/2309.00071), a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts."
msgstr "Qwen3 模型在预训练中的上下文长度最长为 32,768 个 token。为了处理显著超过 32,768 个 token 的上下文长度,应应用 RoPE 缩放技术。我们已经验证了 [YaRN](https://arxiv.org/abs/2309.00071) 的性能,这是一种增强模型长度外推的技术,可确保在长文本上的最佳性能。"
#: ../../Qwen/source/deployment/sglang.md:209 0ee16aabbc794331a329e52ab2ca40e7
msgid "SGLang supports YaRN, which can be configured as"
msgstr "SGLang 支持 YaRN,可以配置为"
#: ../../Qwen/source/deployment/sglang.md:215 c3ba0a9b3502462795dfd887912e9357
msgid "SGLang implements static YaRN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts.** We advise adding the `rope_scaling` configuration only when processing long contexts is required. It is also recommended to modify the `factor` as needed. For example, if the typical context length for your application is 65,536 tokens, it would be better to set `factor` as 2.0."
msgstr "SGLang 实现了静态 YaRN,这意味着无论输入长度如何,缩放因子都保持不变,**这可能会对较短文本的性能产生影响。** 我们建议仅在需要处理长上下文时添加 `rope_scaling` 配置。还建议根据需要调整 `factor`。例如,如果您的应用程序的典型上下文长度为 65,536 个 token,则最好将 `factor` 设置为 2.0。"
#: ../../Qwen/source/deployment/sglang.md:221 398c3e38c94e446aa9922dd04dce609c
msgid "The default `max_position_embeddings` in `config.json` is set to 40,960, which is used by SGLang. This allocation includes reserving 32,768 tokens for outputs and 8,192 tokens for typical prompts, which is sufficient for most scenarios involving short text processing and leave adequate room for model thinking. If the average context length does not exceed 32,768 tokens, we do not recommend enabling YaRN in this scenario, as it may potentially degrade model performance."
msgstr "`config.json` 中的默认 `max_position_embeddings` 被设置为 40,960,SGLang 将使用该值。此分配包括为输出保留 32,768 个 token,为典型提示保留 8,192 个 token,这足以应对大多数涉及短文本处理的场景,并为模型思考留出充足空间。如果平均上下文长度不超过 32,768 个 token,我们不建议在此场景中启用 YaRN,因为这可能会降低模型性能。"
# Copyright (C) 2024, Qwen Team, Alibaba Group.
# This file is distributed under the same license as the Qwen package.
#
msgid ""
msgstr ""
"Project-Id-Version: Qwen \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-04-28 19:42+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.17.0\n"
#: ../../Qwen/source/deployment/skypilot.rst:2 795ad4f30e27494d93675f71bb1a5cc4
msgid "SkyPilot"
msgstr ""
#: ../../Qwen/source/deployment/skypilot.rst:5 aad807db94a24d868c9c1b364b47e152
msgid "To be updated for Qwen3."
msgstr "仍需为Qwen3更新。"
#: ../../Qwen/source/deployment/skypilot.rst:8 d6bbf736584f4bbfa9c300d50a2ed669
msgid "What is SkyPilot"
msgstr "SkyPilot 是什么"
#: ../../Qwen/source/deployment/skypilot.rst:10
#: b66facae41bf493880e43044e2915a45
msgid "SkyPilot is a framework for running LLMs, AI, and batch jobs on any cloud, offering maximum cost savings, the highest GPU availability, and managed execution. Its features include:"
msgstr "SkyPilot 是一个可以在任何云上运行 LLM 、 AI 应用以及批量任务的框架,旨在实现最大程度的成本节省、最高的 GPU 可用性以及受管理的执行过程。其特性包括:"
#: ../../Qwen/source/deployment/skypilot.rst:14
#: 621f021163c549d0aadb1c911a3a3ef5
msgid "Get the best GPU availability by utilizing multiple resources pools across multiple regions and clouds."
msgstr "通过跨区域和跨云充分利用多个资源池,以获得最佳的 GPU 可用性。"
#: ../../Qwen/source/deployment/skypilot.rst:16
#: ea1723c3b5be454cad3219836f4386d8
msgid "Pay absolute minimum — SkyPilot picks the cheapest resources across regions and clouds. No managed solution markups."
msgstr "把费用降到最低—— SkyPilot 在各区域和云平台中为您挑选最便宜的资源。无需任何托管解决方案的额外加价。"
#: ../../Qwen/source/deployment/skypilot.rst:18
#: e479693ecf08411ca35d8d0727c8f441
msgid "Scale up to multiple replicas across different locations and accelerators, all served with a single endpoint"
msgstr "将服务扩展到多个副本上,所有副本通过单一 endpoint 对外提供服务"
#: ../../Qwen/source/deployment/skypilot.rst:20
#: 1f9cdd2ae2544d1faa8a4c463ee0e42c
msgid "Everything stays in your cloud account (your VMs & buckets)"
msgstr "所有内容均保存在您的云账户中(包括您的虚拟机和 bucket )"
#: ../../Qwen/source/deployment/skypilot.rst:21
#: 5bb9b617764942d989e5093463a359f0
msgid "Completely private - no one else sees your chat history"
msgstr "完全私密 - 没有其他人能看到您的聊天记录"
#: ../../Qwen/source/deployment/skypilot.rst:24
#: cf0c456ac72f40ac98790c11dc243317
msgid "Install SkyPilot"
msgstr "安装 SkyPilot"
#: ../../Qwen/source/deployment/skypilot.rst:26
#: 78d86c1fa8104b138b01aed640b262fc
msgid "We advise you to follow the `instruction <https://skypilot.readthedocs.io/en/latest/getting-started/installation.html>`__ to install SkyPilot. Here we provide a simple example of using ``pip`` for the installation as shown below."
msgstr "我们建议您按照 `指示 <https://skypilot.readthedocs.io/en/latest/getting-started/installation.html>`__ 安装 SkyPilot 。以下为您提供了一个使用 ``pip`` 进行安装的简单示例:"
#: ../../Qwen/source/deployment/skypilot.rst:38
#: a7c88265bf404f55b85388c81a240199
msgid "After that, you need to verify cloud access with a command like:"
msgstr "随后,您需要用如下命令确认是否能使用云:"
#: ../../Qwen/source/deployment/skypilot.rst:44
#: 72025dfba0144f63a720f6da0dd39bfa
msgid "For more information, check the `official document <https://skypilot.readthedocs.io/en/latest/getting-started/installation.html>`__ and see if you have set up your cloud accounts correctly."
msgstr "若需更多信息,请查阅官方文档,确认您的云账户设置是否正确无误。"
#: ../../Qwen/source/deployment/skypilot.rst:47
#: 61be006061554e5ea40d55497e11e192
msgid "Alternatively, you can also use the official docker image with SkyPilot master branch automatically cloned by running:"
msgstr "或者,您也可以使用官方提供的 docker 镜像,可以自动克隆 SkyPilot 的主分支:"
#: ../../Qwen/source/deployment/skypilot.rst:63
#: 4ae89fb44c6643a3a82fca5cee622af4
msgid "Running Qwen2.5-72B-Instruct with SkyPilot"
msgstr "使用 SkyPilot 运行 Qwen2.5-72B-Instruct "
#: ../../Qwen/source/deployment/skypilot.rst:65
#: 1bc4973c2eb745689ded0af54ba33e0e
msgid "Start serving Qwen2.5-72B-Instruct on a single instance with any available GPU in the list specified in `serve-72b.yaml <https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/serve-72b.yaml>`__ with a vLLM-powered OpenAI-compatible endpoint:"
msgstr "`serve-72b.yaml <https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/serve-72b.yaml>`__ 中列出了支持的 GPU 。您可使用配备这类 GPU 的单个运算实例来部署 Qwen2.5-72B-Instruct 服务。该服务由 vLLM 搭建,并与 OpenAI API 兼容。以下为部署方法:"
#: ../../Qwen/source/deployment/skypilot.rst:74
#: ../../Qwen/source/deployment/skypilot.rst:123
#: ac3692ed16974facbd58b6886cd111af b325de015e7b4bb0a91491d3f7418792
msgid "**Before launching, make sure you have changed Qwen/Qwen2-72B-Instruct to Qwen/Qwen2.5-72B-Instruct in the YAML file.**"
msgstr "**在启动之前,请先将 YAML 文件中的 Qwen/Qwen2-72B-Instruct 修改为 Qwen/Qwen2.5-72B-Instruct。**"
#: ../../Qwen/source/deployment/skypilot.rst:76
#: 6046b3c86fae4a43878fbadbeb33fbd8
msgid "Send a request to the endpoint for completion:"
msgstr "向该 endpoint 发送续写请求:"
#: ../../Qwen/source/deployment/skypilot.rst:90
#: 2ec56c2028a94f568fd2c1a65063d25a
msgid "Send a request for chat completion:"
msgstr "向该 endpoint 发送对话续写请求"
#: ../../Qwen/source/deployment/skypilot.rst:112
#: c8e140ddfd914ff5a460621a7ca1891e
msgid "Scale up the service with SkyPilot Serve"
msgstr "使用 SkyPilot Serve 扩展服务规模"
#: ../../Qwen/source/deployment/skypilot.rst:114
#: 0db304ab396d45adb6017d78cd1ee4a2
msgid "With `SkyPilot Serve <https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html>`__, a serving library built on top of SkyPilot, scaling up the Qwen service is as simple as running:"
msgstr "使用 `SkyPilot Serve <https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html>`__ 扩展 Qwen 的服务规模非常容易,只需运行:"
#: ../../Qwen/source/deployment/skypilot.rst:125
#: 25bbbf9e49be44d3899074ff97202d71
msgid "This will start the service with multiple replicas on the cheapest available locations and accelerators. SkyServe will automatically manage the replicas, monitor their health, autoscale based on load, and restart them when needed."
msgstr "这将启动服务,使用多个副本部署在最经济的可用位置和加速器上。 SkyServe 将自动管理这些副本,监控其健康状况,根据负载进行自动伸缩,并在必要时重启它们。"
#: ../../Qwen/source/deployment/skypilot.rst:130
#: bda628bab7ef41a0918dc4b80a9b3cfe
msgid "A single endpoint will be returned and any request sent to the endpoint will be routed to the ready replicas."
msgstr "将返回一个 endpoint ,所有发送至该endpoint的请求都将被路由至就绪状态的副本。"
#: ../../Qwen/source/deployment/skypilot.rst:133
#: b232dbbdcf674d56bcf9c0331c020864
msgid "To check the status of the service, run:"
msgstr "运行如下命令检查服务的状态:"
#: ../../Qwen/source/deployment/skypilot.rst:139
#: 556b854caf7243fb93f253ebe2dc9033
msgid "After a while, you will see the following output:"
msgstr "很快,您将看到如下输出:"
#: ../../Qwen/source/deployment/skypilot.rst:152
#: 5a6055c5a42c4b2db6693c1095688de8
msgid "As shown, the service is now backed by 2 replicas, one on Azure and one on GCP, and the accelerator type is chosen to be **the cheapest available one** on the clouds. That said, it maximizes the availability of the service while minimizing the cost."
msgstr "如下所示:该服务现由两个副本提供支持,一个位于 Azure 平台,另一个位于 GCP 平台。同时,已为服务选择云服务商提供的 **最经济实惠** 的加速器类型。这样既最大限度地提升了服务的可用性,又尽可能降低了成本。"
#: ../../Qwen/source/deployment/skypilot.rst:157
#: a18533d33dc54a1091ded0b4bba0a1eb
msgid "To access the model, we use a ``curl -L`` command (``-L`` to follow redirect) to send the request to the endpoint:"
msgstr "要访问模型,我们使用带有 ``curl -L`` (用于跟随重定向),将请求发送到 endpoint :"
#: ../../Qwen/source/deployment/skypilot.rst:182
#: 34cd50fd79e24d8895075f7841b025e4
msgid "Accessing Qwen2.5 with Chat GUI"
msgstr "使用 Chat GUI 调用 Qwen2.5"
#: ../../Qwen/source/deployment/skypilot.rst:184
#: ca6994cda1cb469e83ce8c026bb67e42
msgid "It is also possible to access the Qwen2.5 service with GUI by connecting a `FastChat GUI server <https://github.com/lm-sys/FastChat>`__ to the endpoint launched above (see `gui.yaml <https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/gui.yaml>`__)."
msgstr "可以通过 `FastChat <https://github.com/lm-sys/FastChat>`__ 来使用 GUI 调用 Qwen2.5 的服务:"
#: ../../Qwen/source/deployment/skypilot.rst:188
#: 99a63e55ab5c46258c20ab89cdfa39dc
msgid "Start the Chat Web UI:"
msgstr "开启一个 Chat Web UI"
#: ../../Qwen/source/deployment/skypilot.rst:194
#: e61593a092c146f8a06af896d6af17f2
msgid "**Before launching, make sure you have changed Qwen/Qwen1.5-72B-Chat to Qwen/Qwen2.5-72B-Instruct in the YAML file.**"
msgstr "**在启动之前,请先将 YAML 文件中的 Qwen/Qwen1.5-72B-Chat 修改为 Qwen/Qwen2.5-72B-Instruct。**"
#: ../../Qwen/source/deployment/skypilot.rst:196
#: 9631068a8b424aa8af6dc6911daac7a9
msgid "Then, we can access the GUI at the returned gradio link:"
msgstr "随后,我们可以通过返回的 gradio 链接来访问 GUI :"
#: ../../Qwen/source/deployment/skypilot.rst:202
#: 1464a56dcd06404aafbe6d7d2c72212b
msgid "Note that you may get better results by using a different temperature and top_p value."
msgstr "你可以通过使用不同的温度和 top_p 值来尝试取得更好的结果。"
#: ../../Qwen/source/deployment/skypilot.rst:205
#: d257f49d835e4c12b28bc680bb78a9cb
msgid "Summary"
msgstr "总结"
#: ../../Qwen/source/deployment/skypilot.rst:207
#: 06b9684a19774eaba4f69862332c5166
msgid "With SkyPilot, it is easy for you to deploy Qwen2.5 on any cloud. We advise you to read the official doc for more usages and updates. Check `this <https://skypilot.readthedocs.io/>`__ out!"
msgstr "通过 SkyPilot ,你可以轻松地在任何云上部署 Qwen2.5 。我们建议您阅读 `官方文档 <https://skypilot.readthedocs.io/>`__ 了解更多用法和最新进展。"
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2024, Qwen Team
# This file is distributed under the same license as the Qwen package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2024.
#
msgid ""
msgstr ""
"Project-Id-Version: Qwen \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-04-28 19:42+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.17.0\n"
#: ../../Qwen/source/deployment/tgi.rst:2 2abcc96f9deb4b9187ac9d88fc69e929
msgid "TGI"
msgstr ""
#: ../../Qwen/source/deployment/tgi.rst:5 2d124d7cb95f47388aa48c662932ef9b
msgid "To be updated for Qwen3."
msgstr "仍需为Qwen3更新。"
#: ../../Qwen/source/deployment/tgi.rst:7 4e5d299c4fdd46d5aba38c9af5765792
msgid "Hugging Face's Text Generation Inference (TGI) is a production-ready framework specifically designed for deploying and serving large language models (LLMs) for text generation tasks. It offers a seamless deployment experience, powered by a robust set of features:"
msgstr "Hugging Face 的 Text Generation Inference (TGI) 是一个专为部署大规模语言模型 (Large Language Models, LLMs) 而设计的生产级框架。TGI提供了流畅的部署体验,并稳定支持如下特性:"
#: ../../Qwen/source/deployment/tgi.rst:9 ecd4fc11a95140959915d062791ceba1
msgid "`Speculative Decoding <Speculative Decoding_>`_: Accelerates generation speeds."
msgstr "`推测解码 (Speculative Decoding) <Speculative Decoding_>`_ :提升生成速度。"
#: ../../Qwen/source/deployment/tgi.rst:10 84590a56416348bf85b3f296cf57e257
msgid "`Tensor Parallelism`_: Enables efficient deployment across multiple GPUs."
msgstr "张量并行 (`Tensor Parallelism`_) :高效多卡部署。"
#: ../../Qwen/source/deployment/tgi.rst:11 a996d6ecd7b94c5cb9752d370f29a9b1
msgid "`Token Streaming`_: Allows for the continuous generation of text."
msgstr "流式生成 (`Token Streaming`_) :支持持续性生成文本。"
#: ../../Qwen/source/deployment/tgi.rst:12 8f591c045ba34f4581bb19652db9f9b3
msgid "Versatile Device Support: Works seamlessly with `AMD`_, `Gaudi`_ and `AWS Inferentia`_."
msgstr "灵活的硬件支持:与 `AMD`_ , `Gaudi`_ 和 `AWS Inferentia`_ 无缝衔接。"
#: ../../Qwen/source/deployment/tgi.rst:21 5e8a98b91fc146e0b581422faa683a18
msgid "Installation"
msgstr "安装"
#: ../../Qwen/source/deployment/tgi.rst:23 684ef25bfb0e460999d6dcccce41b85f
msgid "The easiest way to use TGI is via the TGI docker image. In this guide, we show how to use TGI with docker."
msgstr "通过 TGI docker 镜像使用 TGI 轻而易举。本文将主要介绍 TGI 的 docker 用法。"
#: ../../Qwen/source/deployment/tgi.rst:25 c563fa3eccb04d00a477c1d2e8b15c38
msgid "It's possible to run it locally via Conda or build locally. Please refer to `Installation Guide <https://huggingface.co/docs/text-generation-inference/installation>`_ and `CLI tool <https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/using_cli>`_ for detailed instructions."
msgstr "也可通过 Conda 实机安装或搭建服务。请参考 `Installation Guide <https://huggingface.co/docs/text-generation-inference/installation>`_ 与 `CLI tool <https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/using_cli>`_ 以了解详细说明。"
#: ../../Qwen/source/deployment/tgi.rst:28 b55fc58ff4cb472abca08296409c7837
msgid "Deploy Qwen2.5 with TGI"
msgstr "通过 TGI 部署 Qwen2.5"
#: ../../Qwen/source/deployment/tgi.rst:30 586a8425ec5d413592fd7daf579c7e87
msgid "**Find a Qwen2.5 Model:** Choose a model from `the Qwen2.5 collection <https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e>`_."
msgstr "**选定 Qwen2.5 模型:** 从 `the Qwen2.5 collection <https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e>`_ 中挑选模型。"
#: ../../Qwen/source/deployment/tgi.rst:31 50fcab8da35941eca308786979dbaf38
msgid "**Deployment Command:** Run the following command in your terminal, replacing ``model`` with your chosen Qwen2.5 model ID and ``volume`` with the path to your local data directory:"
msgstr "**部署TGI服务:** 在终端中运行以下命令,注意替换 ``model`` 为选定的 Qwen2.5 模型 ID 、 ``volume`` 为本地的数据路径: "
#: ../../Qwen/source/deployment/tgi.rst:42 2a800533a7d84bdeab1da0976b0cab53
msgid "Using TGI API"
msgstr "使用 TGI API"
#: ../../Qwen/source/deployment/tgi.rst:44 f05d1ec08140452782d0659543fad7d1
msgid "Once deployed, the model will be available on the mapped port (8080)."
msgstr "一旦成功部署,API 将于选定的映射端口 (8080) 提供服务。"
#: ../../Qwen/source/deployment/tgi.rst:46 f265dc1522b049c98ba31fd5d255c50f
msgid "TGI comes with a handy API for streaming response:"
msgstr "TGI 提供了简单直接的 API 支持流式生成:"
#: ../../Qwen/source/deployment/tgi.rst:54 e9cc4c0571b74bd08b2a59347503e653
msgid "It's also available on OpenAI style API:"
msgstr "也可使用 OpenAI 风格的 API 使用 TGI :"
#: ../../Qwen/source/deployment/tgi.rst:73 5dc7e9c74fc04483ba8e5dcdd7052020
msgid "The model field in the JSON is not used by TGI, you can put anything."
msgstr "JSON 中的 model 字段不会被 TGI 识别,您可传入任意值。"
#: ../../Qwen/source/deployment/tgi.rst:75 d60f837152014cda8baebc90d65d1cc0
#, python-format
msgid "Refer to the `TGI Swagger UI <https://huggingface.github.io/text-generation-inference/#/Text%20Generation%20Inference/completions>`_ for a complete API reference."
msgstr "完整 API 文档,请查阅 `TGI Swagger UI <https://huggingface.github.io/text-generation-inference/#/Text%20Generation%20Inference/completions>`_ 。"
#: ../../Qwen/source/deployment/tgi.rst:77 b59564031e5548088aef828f9753e337
msgid "You can also use Python API:"
msgstr "你也可以使用 Python 访问 API :"
#: ../../Qwen/source/deployment/tgi.rst:106 62646cecb024479ebfeca5f3063e7322
msgid "Quantization for Performance"
msgstr "量化"
#: ../../Qwen/source/deployment/tgi.rst:108 4a8d39bf37be4820afb230f9a977b431
msgid "Data-dependent quantization (GPTQ and AWQ)"
msgstr "依赖数据的量化方案( GPTQ 与 AWQ )"
#: ../../Qwen/source/deployment/tgi.rst:110 ef2b18f47e4f4f7ebb017be628cb0be9
msgid "Both GPTQ and AWQ models are data-dependent. The official quantized models can be found from `the Qwen2.5 collection`_ and you can also quantize models with your own dataset to make it perform better on your use case."
msgstr "GPTQ 与 AWQ 均依赖数据进行量化。我们提供了预先量化好的模型,请于 `the Qwen2.5 collection`_ 查找。你也可以使用自己的数据集自行量化,以在你的场景中取得更好效果。"
#: ../../Qwen/source/deployment/tgi.rst:112 53d94278a2e3409abb9980ebc7c96c24
msgid "The following shows the command to start TGI with Qwen2.5-7B-Instruct-GPTQ-Int4:"
msgstr "以下是通过 TGI 部署 Qwen2.5-7B-Instruct-GPTQ-Int4 的指令:"
#: ../../Qwen/source/deployment/tgi.rst:122 68ff8a07d0eb40cfa67d79e01adea070
msgid "If the model is quantized with AWQ, e.g. Qwen/Qwen2.5-7B-Instruct-AWQ, please use ``--quantize awq``."
msgstr "如果模型是 AWQ 量化的,如 Qwen/Qwen2.5-7B-Instruct-AWQ ,请使用 ``--quantize awq`` 。"
#: ../../Qwen/source/deployment/tgi.rst:124 b4c3b82b1f2a43a8a02383fd0afbda5f
msgid "Data-agnostic quantization"
msgstr "不依赖数据的量化方案"
#: ../../Qwen/source/deployment/tgi.rst:126 7a6b89c94b72407482b96790f5bbd272
msgid "EETQ on the other side is not data dependent and can be used with any model. Note that we're passing in the original model (instead of a quantized model) with the ``--quantize eetq`` flag."
msgstr "EETQ 是一种不依赖数据的量化方案,可直接用于任意模型。请注意,我们需要传入原始模型,并使用 ``--quantize eetq`` 标志。"
#: ../../Qwen/source/deployment/tgi.rst:138 763166da65924887b3bba99ea4d2baab
msgid "Multi-Accelerators Deployment"
msgstr "多卡部署"
#: ../../Qwen/source/deployment/tgi.rst:140 ddcfcff947894f168c7945ae9c42a579
msgid "Use the ``--num-shard`` flag to specify the number of accelerators. Please also use ``--shm-size 1g`` to enable shared memory for optimal NCCL performance (`reference <https://github.com/huggingface/text-generation-inference?tab=readme-ov-file#a-note-on-shared-memory-shm>`__):"
msgstr "使用 ``--num-shard`` 指定卡书数量。 请务必传入 ``--shm-size 1g`` 让 NCCL 发挥最好性能 (`说明 <https://github.com/huggingface/text-generation-inference?tab=readme-ov-file#a-note-on-shared-memory-shm>`__) :"
#: ../../Qwen/source/deployment/tgi.rst:151 520c46fb404c4ec9bf89280e4a71f1e8
msgid "Speculative Decoding"
msgstr "推测性解码 (Speculative Decoding)"
#: ../../Qwen/source/deployment/tgi.rst:153 74c6b65f76b74d56ad109af9da11f66e
msgid "Speculative decoding can reduce the time per token by speculating on the next token. Use the ``--speculative-decoding`` flag, setting the value to the number of tokens to speculate on (default: 0 for no speculation):"
msgstr "推测性解码 (Speculative Decoding) 通过预先推测下一 token 来节约每 token 需要的时间。使用 ``--speculative-decoding`` 设定预先推测 token 的数量 (默认为0,表示不预先推测):"
#: ../../Qwen/source/deployment/tgi.rst:164 dee05ee0fb1a4f2da42b250192d943f5
msgid "The overall performance of speculative decoding highly depends on the type of task. It works best for code or highly repetitive text."
msgstr "推测性解码的加速效果依赖于任务类型,对于代码或重复性较高的文本生成任务,提速更明显。"
#: ../../Qwen/source/deployment/tgi.rst:166 731f300bc1174589901dd5feb26e8b2f
msgid "More context on speculative decoding can be found `here <https://huggingface.co/docs/text-generation-inference/conceptual/speculation>`__."
msgstr "更多说明可查阅 `此文档 <https://huggingface.co/docs/text-generation-inference/conceptual/speculation>`__ 。"
#: ../../Qwen/source/deployment/tgi.rst:170 65a7d5553dd145398f9705c1ee6c28f0
msgid "Zero-Code Deployment with HF Inference Endpoints"
msgstr "使用 HF Inference Endpoints 零代码部署"
#: ../../Qwen/source/deployment/tgi.rst:172 721c3a7578f846ae8e21e595923e17e7
msgid "For effortless deployment, leverage Hugging Face Inference Endpoints:"
msgstr "使用 Hugging Face Inference Endpoints 不费吹灰之力:"
#: ../../Qwen/source/deployment/tgi.rst:174 7741607488d94a9f8be2ffcb6a5322fb
msgid "**GUI interface:** `<https://huggingface.co/inference-endpoints/dedicated>`__"
msgstr ""
#: ../../Qwen/source/deployment/tgi.rst:175 02ff4520e66f4a42828483da7d25445f
msgid "**Coding interface:** `<https://huggingface.co/blog/tgi-messages-api>`__"
msgstr ""
#: ../../Qwen/source/deployment/tgi.rst:177 d35f9dd4bc96400cb6c7584012d2df49
msgid "Once deployed, the endpoint can be used as usual."
msgstr "一旦部署成功,服务使用与本地无异。"
#: ../../Qwen/source/deployment/tgi.rst:181 61c1b825bbf24be2aaaeb99de3f0660e
msgid "Common Issues"
msgstr "常见问题"
#: ../../Qwen/source/deployment/tgi.rst:183 b55a2d286fc24dbe92b79ab5c010c7af
msgid "Qwen2.5 supports long context lengths, so carefully choose the values for ``--max-batch-prefill-tokens``, ``--max-total-tokens``, and ``--max-input-tokens`` to avoid potential out-of-memory (OOM) issues. If an OOM occurs, you'll receive an error message upon startup. The following shows an example to modify those parameters:"
msgstr "Qwen2.5 支持长上下文,谨慎设定 ``--max-batch-prefill-tokens`` , ``--max-total-tokens`` 和 ``--max-input-tokens`` 以避免 out-of-memory (OOM) 。如 OOM ,你将在启动 TGI 时收到错误提示。以下为修改这些参数的示例:"
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2024, Qwen Team
# This file is distributed under the same license as the Qwen package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2024.
#
msgid ""
msgstr ""
"Project-Id-Version: Qwen \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-04-28 19:42+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.17.0\n"
#: ../../Qwen/source/framework/Langchain.rst:2 6f9b66430d9c495592b1e275fdfd7c9e
msgid "Langchain"
msgstr ""
#: ../../Qwen/source/framework/Langchain.rst:5 1205af46f88e4d6681003403109385c3
msgid "To be updated for Qwen3."
msgstr "仍需为Qwen3更新。"
#: ../../Qwen/source/framework/Langchain.rst:7 115ee7b1c8404629a8f98175264cc114
msgid "This guide helps you build a question-answering application based on a local knowledge base using ``Qwen2.5-7B-Instruct`` with ``langchain``. The goal is to establish a knowledge base Q&A solution."
msgstr "本教程旨在帮助您利用 ``Qwen2.5-7B-Instruct`` 与 ``langchain`` ,基于本地知识库构建问答应用。目标是建立一个知识库问答解决方案。"
#: ../../Qwen/source/framework/Langchain.rst:12
#: 7257b95612fb423bb9ca73212fd12a02
msgid "Basic Usage"
msgstr "基础用法"
#: ../../Qwen/source/framework/Langchain.rst:14
#: fecf7a682dcc4c15a53da1f7cdf145e5
msgid "The implementation process of this project includes loading files -> reading text -> segmenting text -> vectorizing text -> vectorizing questions -> matching the top k most similar text vectors with the question vectors -> incorporating the matched text as context along with the question into the prompt -> submitting to the Qwen2.5-7B-Instruct to generate an answer. Below is an example:"
msgstr "您可以仅使用您的文档配合 ``langchain`` 来构建一个问答应用。该项目的实现流程包括加载文件 -> 阅读文本 -> 文本分段 -> 文本向量化 -> 问题向量化 -> 将最相似的前k个文本向量与问题向量匹配 -> 将匹配的文本作为上下文连同问题一起纳入提示 -> 提交给Qwen2.5-7B-Instruct生成答案。以下是一个示例:"
#: ../../Qwen/source/framework/Langchain.rst:98
#: 6ad1ebd2ef4a49f9aa66cfdf777e1290
msgid "After loading the Qwen2.5-7B-Instruct model, you should specify the txt file for retrieval."
msgstr "加载Qwen2.5-7B-Instruct模型后,您可以指定需要用于知识库问答的txt文件。"
#: ../../Qwen/source/framework/Langchain.rst:274
#: 00467b1e4e294a26b9f49886633331e0
msgid "Next Step"
msgstr "下一步"
#: ../../Qwen/source/framework/Langchain.rst:276
#: 15ed906687054af78545290ba0746380
msgid "Now you can chat with Qwen2.5 use your own document. Continue to read the documentation and try to figure out more advanced usages of model retrieval!"
msgstr "现在,您可以在您自己的文档上与Qwen2.5进行交流。继续阅读文档,尝试探索模型检索的更多高级用法!"
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2024, Qwen Team
# This file is distributed under the same license as the Qwen package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2024.
#
msgid ""
msgstr ""
"Project-Id-Version: Qwen \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-04-28 19:42+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.17.0\n"
#: ../../Qwen/source/framework/LlamaIndex.rst:2
#: 2e41f8696c20488d8593b670c6361edf
msgid "LlamaIndex"
msgstr "LlamaIndex"
#: ../../Qwen/source/framework/LlamaIndex.rst:5
#: 20b3836fd391457bb00bf75b61e23e0d
msgid "To be updated for Qwen3."
msgstr "仍需为Qwen3更新。"
#: ../../Qwen/source/framework/LlamaIndex.rst:7
#: 86d9e6f0684749aab40a9824cd026fa3
msgid "To connect Qwen2.5 with external data, such as documents, web pages, etc., we offer a tutorial on `LlamaIndex <https://www.llamaindex.ai/>`__. This guide helps you quickly implement retrieval-augmented generation (RAG) using LlamaIndex with Qwen2.5."
msgstr "为了实现 Qwen2.5 与外部数据(例如文档、网页等)的连接,我们提供了 `LlamaIndex <https://www.llamaindex.ai/>`__ 的详细教程。本指南旨在帮助用户利用 LlamaIndex 与 Qwen2.5 快速部署检索增强生成(RAG)技术。"
#: ../../Qwen/source/framework/LlamaIndex.rst:11
#: 71ed222858054687a5b33222bb6ac086
msgid "Preparation"
msgstr "环境准备"
#: ../../Qwen/source/framework/LlamaIndex.rst:13
#: 161d9153d6484dd5a1f1bdb340847814
msgid "To implement RAG, we advise you to install the LlamaIndex-related packages first."
msgstr "为实现检索增强生成(RAG),我们建议您首先安装与 LlamaIndex 相关的软件包。"
#: ../../Qwen/source/framework/LlamaIndex.rst:16
#: a8d6acb1001a42c88185b971ae2de3bf
msgid "The following is a simple code snippet showing how to do this:"
msgstr "以下是一个简单的代码示例:"
#: ../../Qwen/source/framework/LlamaIndex.rst:25
#: e441d3b8fb6d4a13b52e1560ef250b16
msgid "Set Parameters"
msgstr "设置参数"
#: ../../Qwen/source/framework/LlamaIndex.rst:27
#: c2481804c3f34c7f883eed92ffa3111e
msgid "Now we can set up LLM, embedding model, and the related configurations. Qwen2.5-Instruct supports conversations in multiple languages, including English and Chinese. You can use the ``bge-base-en-v1.5`` model to retrieve from English documents, and you can download the ``bge-base-zh-v1.5`` model to retrieve from Chinese documents. You can also choose ``bge-large`` or ``bge-small`` as the embedding model or modify the context window size or text chunk size depending on your computing resources. Qwen2.5 model families support a maximum of 32K context window size (up to 128K for 7B, 14B, 32B, and 72B, requiring extra configuration)"
msgstr "现在,我们可以设置语言模型和向量模型。Qwen2.5-Instruct支持包括英语和中文在内的多种语言对话。您可以使用 ``bge-base-en-v1.5`` 模型来检索英文文档,下载 ``bge-base-zh-v1.5`` 模型以检索中文文档。根据您的计算资源,您还可以选择 ``bge-large`` 或 ``bge-small`` 作为向量模型,或调整上下文窗口大小或文本块大小。Qwen2.5模型系列支持最大32K上下文窗口大小(7B 、14B 、32B 及 72B可扩展支持 128K 上下文,但需要额外配置)"
#: ../../Qwen/source/framework/LlamaIndex.rst:85
#: 74c35d5a03734c289d162dfa3813ada6
msgid "Build Index"
msgstr "构建索引"
#: ../../Qwen/source/framework/LlamaIndex.rst:87
#: c49859d4ea5f49dba1fa2263f3ae284d
msgid "Now we can build index from documents or websites."
msgstr "现在我们可以从文档或网站构建索引。"
#: ../../Qwen/source/framework/LlamaIndex.rst:89
#: b460d000037e4266a4d9f43d38f1f9b0
msgid "The following code snippet demonstrates how to build an index for files (regardless of whether they are in PDF or TXT format) in a local folder named 'document'."
msgstr "以下代码片段展示了如何为本地名为'document'的文件夹中的文件(无论是PDF格式还是TXT格式)构建索引。"
#: ../../Qwen/source/framework/LlamaIndex.rst:102
#: a416d18b227940e29fac1f59851ff8c4
msgid "The following code snippet demonstrates how to build an index for the content in a list of websites."
msgstr "以下代码片段展示了如何为一系列网站的内容构建索引。"
#: ../../Qwen/source/framework/LlamaIndex.rst:118
#: 487cf928d048424fa1b50438f701137c
msgid "To save and load the index, you can use the following code snippet."
msgstr "要保存和加载已构建的索引,您可以使用以下代码示例。"
#: ../../Qwen/source/framework/LlamaIndex.rst:132
#: c68419c4318d46e891f5df9191be6d2d
msgid "RAG"
msgstr "检索增强(RAG)"
#: ../../Qwen/source/framework/LlamaIndex.rst:134
#: 8ad20a8f43fe496084a40f963ba97440
msgid "Now you can perform queries, and Qwen2.5 will answer based on the content of the indexed documents."
msgstr "现在您可以输入查询,Qwen2.5 将基于索引文档的内容提供答案。"