# 🚀 MiniMax Model Transformers Deployment Guide
## 📖 Introduction
This guide will help you deploy the MiniMax-M1 model using the [Transformers](https://huggingface.co/docs/transformers/index) library. Transformers is a widely used deep learning library that provides a rich collection of pretrained models and flexible interfaces for working with them.
## 🛠️ Environment Setup
### Installing Transformers
```bash
pip install transformers torch accelerate
```
## 📋 Basic Usage Example
The pretrained model can be used as follows:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
MODEL_PATH = "{MODEL_PATH}"
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
messages = [
{"role": "user", "content": [{"type": "text", "text": "What is your favourite condiment?"}]},
{"role": "assistant", "content": [{"type": "text", "text": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"}]},
{"role": "user", "content": [{"type": "text", "text": "Do you have mayonnaise recipes?"}]}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer(text, return_tensors="pt").to(model.device)
generation_config = GenerationConfig(
max_new_tokens=20,
eos_token_id=tokenizer.eos_token_id,
use_cache=True,
)
generated_ids = model.generate(**model_inputs, generation_config=generation_config)
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
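For interactive use you may prefer to stream tokens as they are generated instead of waiting for `generate` to finish. Below is a minimal sketch using the `TextStreamer` utility from Transformers; it reuses the `model`, `tokenizer`, and `model_inputs` objects defined above, and the parameter choices are illustrative only:
```python
from transformers import TextStreamer

# Print decoded tokens to stdout as they are generated; skip the echoed prompt
# and special tokens for readability.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**model_inputs, max_new_tokens=100, streamer=streamer)
```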
## ⚡ Performance Optimization
### Accelerating with Flash Attention
The snippet above shows inference without any optimization tricks. The model can be substantially accelerated by leveraging [Flash Attention](../perf_train_gpu_one#flash-attention-2), which provides a faster implementation of the attention mechanism used inside the model.
First, make sure to install the latest version of Flash Attention 2:
```bash
pip install -U flash-attn --no-build-isolation
```
Also make sure your hardware is compatible with Flash Attention 2; see the official documentation in the [Flash Attention repository](https://github.com/Dao-AILab/flash-attention) for details. In addition, be sure to load the model in half precision (e.g. `torch.float16`).
To load and run the model with Flash Attention 2, refer to the following snippet:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
MODEL_PATH = "{MODEL_PATH}"
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, trust_remote_code=True, torch_dtype=torch.float16, attn_implementation="flash_attention_2", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
prompt = "My favourite condiment is"
model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
response = tokenizer.batch_decode(generated_ids)[0]
print(response)
```
## 📮 Getting Support
If you encounter any issues while deploying the MiniMax-M1 model:
- Check our official documentation
- Contact our technical support team through official channels
- Submit an Issue on our GitHub repository
We will keep improving the deployment experience on Transformers, and your feedback is welcome!
# 🚀 MiniMax Model Transformers Deployment Guide
[Transformers Deployment Guide (Chinese)](./transformers_deployment_guide_cn.md)
## 📖 Introduction
This guide will help you deploy the MiniMax-M1 model using the [Transformers](https://huggingface.co/docs/transformers/index) library. Transformers is a widely used deep learning library that provides a rich collection of pretrained models and flexible interfaces for working with them.
## 🛠️ Environment Setup
### Installing Transformers
```bash
pip install transformers torch accelerate
```
## 📋 Basic Usage Example
The pretrained model can be used as follows:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
MODEL_PATH = "{MODEL_PATH}"
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
messages = [
{"role": "user", "content": [{"type": "text", "text": "What is your favourite condiment?"}]},
{"role": "assistant", "content": [{"type": "text", "text": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"}]},
{"role": "user", "content": [{"type": "text", "text": "Do you have mayonnaise recipes?"}]}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer(text, return_tensors="pt").to(model.device)
generation_config = GenerationConfig(
max_new_tokens=20,
eos_token_id=tokenizer.eos_token_id,
use_cache=True,
)
generated_ids = model.generate(**model_inputs, generation_config=generation_config)
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
## ⚡ Performance Optimization
### Accelerating with Flash Attention
The example above shows inference without any optimization. The model can be significantly accelerated with [Flash Attention](../perf_train_gpu_one#flash-attention-2), a faster implementation of the attention mechanism used by the model.
First, make sure to install the latest version of Flash Attention 2:
```bash
pip install -U flash-attn --no-build-isolation
```
Your hardware must also be compatible with Flash Attention 2; see the official documentation in the [Flash Attention repository](https://github.com/Dao-AILab/flash-attention) for details. It is also recommended to load the model in half precision (e.g. `torch.float16`).
To load and run the model with Flash Attention 2, use the example below:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
MODEL_PATH = "{MODEL_PATH}"
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, trust_remote_code=True, torch_dtype=torch.float16, attn_implementation="flash_attention_2", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
prompt = "My favourite condiment is"
model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
response = tokenizer.batch_decode(generated_ids)[0]
print(response)
```
## 📮 Support
If you run into any issues while deploying the MiniMax-M1 model:
* Check our official documentation
* Contact our technical support team through official channels
* Open an Issue on our GitHub repository
We are continuously improving the deployment experience on Transformers and greatly value your feedback!
# 🚀 MiniMax Models vLLM Deployment Guide
[vLLM Deployment Guide (Chinese)](./vllm_deployment_guide_cn.md)
## 📖 Introduction
We recommend using [vLLM](https://docs.vllm.ai/en/latest/) to deploy the [MiniMax-M1](https://huggingface.co/MiniMaxAI/MiniMax-M1-40k) model. Based on our testing, vLLM performs excellently when deploying this model, with the following features:
- 🔥 Outstanding service throughput performance
- ⚡ Efficient and intelligent memory management
- 📦 Powerful batch request processing capability
- ⚙️ Deeply optimized underlying performance
The MiniMax-M1 model can run efficiently on a single server equipped with 8 H800 or 8 H20 GPUs. In terms of hardware configuration, a server with 8 H800 GPUs can process context inputs up to 2 million tokens, while a server equipped with 8 H20 GPUs can support ultra-long context processing capabilities of up to 5 million tokens.
## 💾 Obtaining MiniMax Models
### Obtaining the MiniMax-M1 Model
You can download the model from our official HuggingFace repositories: [MiniMax-M1-40k](https://huggingface.co/MiniMaxAI/MiniMax-M1-40k), [MiniMax-M1-80k](https://huggingface.co/MiniMaxAI/MiniMax-M1-80k)
Download command:
```bash
pip install -U huggingface-hub
huggingface-cli download MiniMaxAI/MiniMax-M1-40k
# huggingface-cli download MiniMaxAI/MiniMax-M1-80k
# If you encounter network issues, you can set a proxy
export HF_ENDPOINT=https://hf-mirror.com
```
Or download using git:
```bash
git lfs install
git clone https://huggingface.co/MiniMaxAI/MiniMax-M1-40k
git clone https://huggingface.co/MiniMaxAI/MiniMax-M1-80k
```
⚠️ **Important Note**: Please ensure that [Git LFS](https://git-lfs.github.com/) is installed on your system, which is necessary for completely downloading the model weight files.
## 🛠️ Deployment Options
### Option 1: Deploy Using Docker (Recommended)
To ensure consistency and stability of the deployment environment, we recommend using Docker for deployment.
⚠️ **Version Requirements**:
- MiniMax-M1 model requires vLLM version 0.8.3 or later for full support
- If you are using a Docker image with vLLM version lower than the required version, you will need to:
1. Update to the latest vLLM code
2. Recompile vLLM from source. Follow the compilation instructions in Solution 2 of the Common Issues section
- Special Note: For vLLM versions between 0.8.3 and 0.9.2, you need to modify the model configuration:
1. Open `config.json`
2. Change `config['architectures'] = ["MiniMaxM1ForCausalLM"]` to `config['architectures'] = ["MiniMaxText01ForCausalLM"]`
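If you need the architecture override above, it can be applied with a short script. The following is only a sketch, where `<model storage path>` is a placeholder for your downloaded model directory containing `config.json`:
```python
import json
from pathlib import Path

# Hypothetical placeholder; point this at your local model directory.
cfg_path = Path("<model storage path>") / "config.json"
cfg = json.loads(cfg_path.read_text())

# Only needed for vLLM versions between 0.8.3 and 0.9.2.
cfg["architectures"] = ["MiniMaxText01ForCausalLM"]
cfg_path.write_text(json.dumps(cfg, indent=2, ensure_ascii=False))
```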
1. Get the container image:
```bash
docker pull vllm/vllm-openai:v0.8.3
```
2. Run the container:
```bash
# Set environment variables
IMAGE=vllm/vllm-openai:v0.8.3
MODEL_DIR=<model storage path>
CODE_DIR=<code path>
NAME=MiniMaxImage
# Docker run configuration
DOCKER_RUN_CMD="--network=host --privileged --ipc=host --ulimit memlock=-1 --shm-size=2gb --rm --gpus all --ulimit stack=67108864"
# Start the container
sudo docker run -it \
-v $MODEL_DIR:$MODEL_DIR \
-v $CODE_DIR:$CODE_DIR \
--name $NAME \
$DOCKER_RUN_CMD \
$IMAGE /bin/bash
```
### Option 2: Direct Installation of vLLM
If your environment meets the following requirements:
- CUDA 12.1
- PyTorch 2.1
you can install vLLM directly.
Installation command:
```bash
pip install vllm
```
💡 If you are using other environment configurations, please refer to the [vLLM Installation Guide](https://docs.vllm.ai/en/latest/getting_started/installation.html)
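Before installing, you can quickly confirm that your local PyTorch build matches the CUDA requirement above. A minimal check (the versions reported on your machine may of course differ):
```python
import torch

# Report the installed PyTorch version, the CUDA version it was built against,
# and whether a GPU is visible to this process.
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
```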
## 🚀 Starting the Service
### Launch MiniMax-M1 Service
```bash
export SAFETENSORS_FAST_GPU=1
export VLLM_USE_V1=0
python3 -m vllm.entrypoints.openai.api_server \
--model <model storage path> \
--tensor-parallel-size 8 \
--trust-remote-code \
--quantization experts_int8 \
--max_model_len 4096 \
--dtype bfloat16
```
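Once the server is up, you can verify that it is serving the model by querying the OpenAI-compatible `/v1/models` endpoint. A small sketch, assuming the default port 8000 used above:
```python
import json
import urllib.request

# List the models the local vLLM server is currently serving.
with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
    print(json.dumps(json.load(resp), indent=2))
```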
### API Call Example
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MiniMaxAI/MiniMax-M1",
"messages": [
{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
{"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
]
}'
```
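The same request can also be sent with the official `openai` Python SDK (assumed to be installed via `pip install openai`). This is a sketch; the `model` field must match the name your server actually reports (see the `/v1/models` check above):
```python
from openai import OpenAI

# The vLLM OpenAI-compatible server does not check the API key by default,
# but the client still requires a non-empty value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M1",
    messages=[
        {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
        {"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]},
    ],
)
print(response.choices[0].message.content)
```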
## ❗ Common Issues
### Module Loading Problems
If you encounter the following error:
```
import vllm._C # noqa
ModuleNotFoundError: No module named 'vllm._C'
```
Or
```
MiniMax-M1 model is not currently supported
```
We provide two solutions:
#### Solution 1: Copy Dependency Files
```bash
cd <working directory>
git clone https://github.com/vllm-project/vllm.git
cd vllm
cp /usr/local/lib/python3.12/dist-packages/vllm/*.so vllm
cp -r /usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/* vllm/vllm_flash_attn
```
#### Solution 2: Install from Source
```bash
cd <working directory>
git clone https://github.com/vllm-project/vllm.git
cd vllm/
pip install -e .
```
## 📮 Getting Support
If you encounter any issues while deploying the MiniMax-M1 model:
- Please check our official documentation
- Contact our technical support team through official channels
- Submit an [Issue](https://github.com/MiniMax-AI/MiniMax-M1/issues) on our GitHub repository
We will continuously optimize the deployment experience of this model and welcome your feedback!
# 🚀 MiniMax Models vLLM Deployment Guide
## 📖 Introduction
We recommend using [vLLM](https://docs.vllm.ai/en/latest/) to deploy the [MiniMax-M1](https://huggingface.co/MiniMaxAI/MiniMax-M1-40k) model. Based on our testing, vLLM performs excellently when deploying this model, with the following features:
- 🔥 Outstanding service throughput performance
- ⚡ Efficient and intelligent memory management
- 📦 Powerful batch request processing capability
- ⚙️ Deeply optimized underlying performance
The MiniMax-M1 model can run efficiently on a single server equipped with 8 H800 or 8 H20 GPUs. In terms of hardware configuration, a server with 8 H800 GPUs can handle context inputs of up to 2 million tokens, while a server equipped with 8 H20 GPUs can support ultra-long context processing of up to 5 million tokens.
## 💾 Obtaining MiniMax Models
### Obtaining the MiniMax-M1 Model
You can download the model from our official HuggingFace repositories: [MiniMax-M1-40k](https://huggingface.co/MiniMaxAI/MiniMax-M1-40k), [MiniMax-M1-80k](https://huggingface.co/MiniMaxAI/MiniMax-M1-80k)
Download command:
```bash
pip install -U huggingface-hub
huggingface-cli download MiniMaxAI/MiniMax-M1-40k
# huggingface-cli download MiniMaxAI/MiniMax-M1-80k
# If you encounter network issues, you can set a proxy
export HF_ENDPOINT=https://hf-mirror.com
```
Or download using git:
```bash
git lfs install
git clone https://huggingface.co/MiniMaxAI/MiniMax-M1-40k
git clone https://huggingface.co/MiniMaxAI/MiniMax-M1-80k
```
⚠️ **Important Note**: Please make sure [Git LFS](https://git-lfs.github.com/) is installed on your system; it is required to fully download the model weight files.
## 🛠️ Deployment Options
### Option 1: Deploy Using Docker (Recommended)
To ensure consistency and stability of the deployment environment, we recommend using Docker for deployment.
⚠️ **Version Requirements**
- Base requirement: vLLM version must be ≥ 0.8.3 to ensure full support for the MiniMax-M1 model
- Special note: for vLLM versions between 0.8.3 and 0.9.2, the model configuration file must be modified:
  - Open `config.json`
  - Change `config['architectures'] = ["MiniMaxM1ForCausalLM"]` to `config['architectures'] = ["MiniMaxText01ForCausalLM"]`
1. Get the container image:
```bash
docker pull vllm/vllm-openai:v0.8.3
```
2. Run the container:
```bash
# Set environment variables
IMAGE=vllm/vllm-openai:v0.8.3
MODEL_DIR=<model storage path>
CODE_DIR=<code path>
NAME=MiniMaxImage
# Docker run configuration
DOCKER_RUN_CMD="--network=host --privileged --ipc=host --ulimit memlock=-1 --shm-size=2gb --rm --gpus all --ulimit stack=67108864"
# Start the container
sudo docker run -it \
-v $MODEL_DIR:$MODEL_DIR \
-v $CODE_DIR:$CODE_DIR \
--name $NAME \
$DOCKER_RUN_CMD \
$IMAGE /bin/bash
```
### Option 2: Direct Installation of vLLM
If your environment meets the following requirements:
- CUDA 12.1
- PyTorch 2.1
you can install vLLM directly.
Installation command:
```bash
pip install vllm
```
💡 If you are using a different environment configuration, please refer to the [vLLM Installation Guide](https://docs.vllm.ai/en/latest/getting_started/installation.html)
## 🚀 Starting the Service
### Launch the MiniMax-M1 Service
```bash
export SAFETENSORS_FAST_GPU=1
export VLLM_USE_V1=0
python3 -m vllm.entrypoints.openai.api_server \
--model <model storage path> \
--tensor-parallel-size 8 \
--trust-remote-code \
--quantization experts_int8 \
--max_model_len 4096 \
--dtype bfloat16
```
### API Call Example
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MiniMaxAI/MiniMax-M1",
"messages": [
{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
{"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
]
}'
```
## ❗ Common Issues
### Module Loading Problems
If you encounter the following error:
```
import vllm._C # noqa
ModuleNotFoundError: No module named 'vllm._C'
```
Or
```
MiniMax-M1 model is not currently supported
```
We provide two solutions:
#### Solution 1: Copy Dependency Files
```bash
cd <working directory>
git clone https://github.com/vllm-project/vllm.git
cd vllm
cp /usr/local/lib/python3.12/dist-packages/vllm/*.so vllm
cp -r /usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/* vllm/vllm_flash_attn
```
#### Solution 2: Install from Source
```bash
cd <working directory>
git clone https://github.com/vllm-project/vllm.git
cd vllm/
pip install -e .
```
## 📮 Getting Support
If you encounter any issues while deploying the MiniMax-M1 model:
- Check our official documentation
- Contact our technical support team through official channels
- Submit an [Issue](https://github.com/MiniMax-AI/MiniMax-M1/issues) on our GitHub repository
We will continuously improve the deployment experience for this model and welcome your feedback!
# 🚀 MiniMax Models vLLM Deployment Guide
[vLLM Deployment Guide (Chinese)](./vllm_deployment_guide_cn.md)
## 📖 Introduction
We recommend using [vLLM](https://docs.vllm.ai/en/latest/) to deploy the [MiniMax-M1](https://huggingface.co/MiniMaxAI/MiniMax-M1-40k) model. Based on our testing, vLLM delivers excellent performance when running this model, offering the following advantages:
- 🔥 Outstanding service throughput performance
- ⚡ Efficient and intelligent memory management
- 📦 Robust batch request processing capability
- ⚙️ Deeply optimized low-level performance
The MiniMax-M1 model can run efficiently on a single server equipped with 8 H800 or 8 H20 GPUs. In terms of hardware configuration, a server with 8 H800 GPUs can process context inputs of up to 2 million tokens, while a server equipped with 8 H20 GPUs supports ultra-long contexts of up to 5 million tokens.
## 💾 Obtaining the MiniMax Models
### Downloading the MiniMax-M1 Model
You can download the model directly from our official HuggingFace repositories: [MiniMax-M1-40k](https://huggingface.co/MiniMaxAI/MiniMax-M1-40k) or [MiniMax-M1-80k](https://huggingface.co/MiniMaxAI/MiniMax-M1-80k).
Download command:
```bash
pip install -U huggingface-hub
huggingface-cli download MiniMaxAI/MiniMax-M1-40k
# huggingface-cli download MiniMaxAI/MiniMax-M1-80k
# If you encounter network issues, you can set a proxy
export HF_ENDPOINT=https://hf-mirror.com
```
Or download using git:
```bash
git lfs install
git clone https://huggingface.co/MiniMaxAI/MiniMax-M1-40k
git clone https://huggingface.co/MiniMaxAI/MiniMax-M1-80k
```
⚠️ **Important Note**: Make sure [Git LFS](https://git-lfs.github.com/) is installed on your system; it is required to fully download the model weight files.
## 🛠️ Deployment Options
### Option 1: Deploy Using Docker (Recommended)
To ensure consistency and stability of the deployment environment, we recommend using Docker.
⚠️ **Version Requirements**:
* The MiniMax-M1 model requires vLLM version 0.8.3 or later for full support.
* If you are using a Docker image with a vLLM version lower than required, you will need to:
  1. Update to the latest vLLM release.
  2. Recompile vLLM from source (see the instructions in Solution 2 of the Common Issues section).
* Special note: for vLLM versions between 0.8.3 and 0.9.2, the model configuration must be modified:
  1. Open `config.json`.
  2. Change `config['architectures'] = ["MiniMaxM1ForCausalLM"]` to `config['architectures'] = ["MiniMaxText01ForCausalLM"]`.
1. Get the container image:
```bash
docker pull vllm/vllm-openai:v0.8.3
```
2. Run the container:
```bash
# Set environment variables
IMAGE=vllm/vllm-openai:v0.8.3
MODEL_DIR=<model storage path>
CODE_DIR=<code path>
NAME=MiniMaxImage
# Docker run configuration
DOCKER_RUN_CMD="--network=host --privileged --ipc=host --ulimit memlock=-1 --shm-size=2gb --rm --gpus all --ulimit stack=67108864"
# Start the container
sudo docker run -it \
-v $MODEL_DIR:$MODEL_DIR \
-v $CODE_DIR:$CODE_DIR \
--name $NAME \
$DOCKER_RUN_CMD \
$IMAGE /bin/bash
```
### Option 2: Direct Installation of vLLM
If your environment meets the following requirements:
* CUDA 12.1
* PyTorch 2.1
you can install vLLM directly with:
```bash
pip install vllm
```
💡 If you are using a different environment configuration, refer to the [vLLM Installation Guide](https://docs.vllm.ai/en/latest/getting_started/installation.html).
## 🚀 Starting the Service
### Launching the MiniMax-M1 Service
```bash
export SAFETENSORS_FAST_GPU=1
export VLLM_USE_V1=0
python3 -m vllm.entrypoints.openai.api_server \
--model <model storage path> \
--tensor-parallel-size 8 \
--trust-remote-code \
--quantization experts_int8 \
--max_model_len 4096 \
--dtype bfloat16
```
### API Call Example
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MiniMaxAI/MiniMax-M1",
"messages": [
{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
{"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
]
}'
```
## ❗ Common Issues
### Module Loading Problems
If you encounter the following error:
```
import vllm._C # noqa
ModuleNotFoundError: No module named 'vllm._C'
```
Or
```
MiniMax-M1 model is not currently supported
```
We provide two solutions:
#### Solution 1: Copy Dependency Files
```bash
cd <working directory>
git clone https://github.com/vllm-project/vllm.git
cd vllm
cp /usr/local/lib/python3.12/dist-packages/vllm/*.so vllm
cp -r /usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/* vllm/vllm_flash_attn
```
#### Solution 2: Install from Source
```bash
cd <working directory>
git clone https://github.com/vllm-project/vllm.git
cd vllm/
pip install -e .
```
## 📮 Support
If you run into any issues while deploying the MiniMax-M1 model:
* Check our official documentation
* Contact our technical support team through official channels
* Open an [Issue](https://github.com/MiniMax-AI/MiniMax-M1/issues) on our GitHub repository
We are continuously improving the deployment experience for this model and greatly value your feedback!
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from multiprocessing import freeze_support

if __name__ == '__main__':
    freeze_support()
    # Initialize the tokenizer
    tokenizer = AutoTokenizer.from_pretrained("MiniMaxAI/MiniMax-M1-40k")
    # Default decoding hyperparameters; max_tokens caps the generation length.
    sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512)
    # Input the model name or path. Can be GPTQ or AWQ models.
    llm = LLM(
        model="MiniMaxAI/MiniMax-M1-40k",
        distributed_executor_backend="ray",
        tensor_parallel_size=16,
        max_model_len=4096,
        dtype="bfloat16",
        enforce_eager=True,
        gpu_memory_utilization=0.99,
        trust_remote_code=True,
    )
    # Prepare your prompts
    prompt = "How large is the land area of the United States?"
    messages = [
        {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
        {"role": "user", "content": [{"type": "text", "text": prompt}]}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True  # Switches between thinking and non-thinking modes. Default is True.
    )
    # Generate outputs
    outputs = llm.generate([text], sampling_params)
    # Print the outputs
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Generated text: {generated_text!r}")
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig, QuantoConfig, GenerationConfig
import torch
import argparse

"""
usage:
export SAFETENSORS_FAST_GPU=1
python main.py --quant_type int8 --world_size 8 --model_id <model_path>
"""

def generate_quanto_config(hf_config: AutoConfig, quant_type: str):
    QUANT_TYPE_MAP = {
        "default": None,
        "int8": QuantoConfig(
            weights="int8",
            modules_to_not_convert=[
                "lm_head",
                "embed_tokens",
            ] + [f"model.layers.{i}.coefficient" for i in range(hf_config.num_hidden_layers)]
            + [f"model.layers.{i}.block_sparse_moe.gate" for i in range(hf_config.num_hidden_layers)]
        ),
    }
    return QUANT_TYPE_MAP[quant_type]

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--quant_type", type=str, default="default", choices=["default", "int8"])
    parser.add_argument("--model_id", type=str, required=True)
    parser.add_argument("--world_size", type=int, required=True)
    return parser.parse_args()

def check_params(args, hf_config: AutoConfig):
    if args.quant_type == "int8":
        assert args.world_size >= 8, "int8 weight-only quantization requires at least 8 GPUs"
        assert hf_config.num_hidden_layers % args.world_size == 0, f"num_hidden_layers({hf_config.num_hidden_layers}) must be divisible by world_size({args.world_size})"

@torch.no_grad()
def main():
    args = parse_args()
    print("\n=============== Argument ===============")
    for key in vars(args):
        print(f"{key}: {vars(args)[key]}")
    print("========================================")
    model_id = args.model_id
    hf_config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
    check_params(args, hf_config)
    quantization_config = generate_quanto_config(hf_config, args.quant_type)
    # Pin embeddings and the final norm/lm_head, then spread decoder layers evenly across GPUs.
    device_map = {
        'model.embed_tokens': 'cuda:0',
        'model.norm': f'cuda:{args.world_size - 1}',
        'lm_head': f'cuda:{args.world_size - 1}'
    }
    layers_per_device = hf_config.num_hidden_layers // args.world_size
    for i in range(args.world_size):
        for j in range(layers_per_device):
            device_map[f'model.layers.{i * layers_per_device + j}'] = f'cuda:{i}'
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    message = [
        {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
        {"role": "user", "content": [{"type": "text", "text": "Hello, what is the weather today?"}]}
    ]
    tools = [
        {"name": "get_location", "description": "Get the location of the user.", "parameters": {"type": "object", "properties": {}}},
        {"name": "get_weather", "description": "Get the weather of a city.", "parameters": {"type": "object", "properties": {"city": {"type": "string", "description": "The name of the city"}}}},
        {"name": "get_news", "description": "Get the news.", "parameters": {"type": "object", "properties": {"domain": {"type": "string", "description": "The domain of the news"}}}}
    ]
    text = tokenizer.apply_chat_template(
        message,
        tools,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer(text, return_tensors="pt").to("cuda")
    quantized_model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="bfloat16",
        device_map=device_map,
        quantization_config=quantization_config,
        trust_remote_code=True,
        offload_buffers=True,
    )
    generation_config = GenerationConfig(
        max_new_tokens=20,
        eos_token_id=200020,
        use_cache=True,
    )
    generated_ids = quantized_model.generate(**model_inputs, generation_config=generation_config)
    print(f"generated_ids: {generated_ids}")
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(response)

if __name__ == "__main__":
    main()
# Model code
modelCode=1636
# Model name
modelName=MiniMax-M1_vllm
# Model description
modelDescription=MiniMax M1 offers ultra-long context capability: 1,000,000-token input and 80,000-token output, an open-source model on par with Gemini 2.5 Pro.
# Application scenarios
appScenario=Reasoning, dialogue Q&A, manufacturing, media, finance, energy, healthcare, smart home, education
# Framework type
frameType=vllm
{
"add_prefix_space": false,
"bos_token": "<beginning_of_sentence>",
"clean_up_tokenization_spaces": false,
"eos_token": "<end_of_sentence>",
"model_max_length": 40960000,
"tokenizer_class": "GPT2Tokenizer",
"unk_token": "<end_of_document>",
"chat_template": "{{ '<begin_of_document>' -}}{% set ns = namespace(system_prompt='') -%}{% for message in messages -%}{% if message['role'] == 'system' -%}{% set ns.system_prompt = ns.system_prompt + message['content'][0]['text'] -%}{% endif -%}{%- endfor -%}{% if ns.system_prompt != '' -%}{{ '<beginning_of_sentence>system ai_setting=assistant\n' + ns.system_prompt + '<end_of_sentence>\n' -}}{%- endif -%}{% if tools -%}{{ '<beginning_of_sentence>system tool_setting=tools\nYou are provided with these tools:\n<tools>\n' -}}{% for tool in tools -%}{{ tool | tojson ~ '\n' -}}{%- endfor -%}{{ '</tools>\n\nIf you need to call tools, please respond with <tool_calls></tool_calls> XML tags, and provide tool-name and json-object of arguments, following the format below:\n<tool_calls>\n{''name'': <tool-name-1>, ''arguments'': <args-json-object-1>}\n...\n</tool_calls><end_of_sentence>\n' -}}{%- endif -%}{% for message in messages -%}{% if message['role'] == 'user' -%}{{ '<beginning_of_sentence>user name=user\n' + message['content'][0]['text'] + '<end_of_sentence>\n' -}}{% elif message['role'] == 'assistant' -%}{{ '<beginning_of_sentence>ai name=assistant\n' -}}{% for content in message['content'] | selectattr('type', 'equalto', 'text') -%}{{ content['text'] -}}{%- endfor -%}{{ '<end_of_sentence>\n' -}}{% elif message['role'] == 'tool' -%}{{ '<beginning_of_sentence>tool name=tools\n' }} {%- for content in message['content'] -%}{{- 'tool name: ' + content['name'] + '\n' + 'tool result: ' + content['text'] + '\n\n' -}} {%- endfor -%}{{- '<end_of_sentence>\n' -}}{% endif -%}{%- endfor -%}{% if add_generation_prompt -%}{{ '<beginning_of_sentence>ai name=assistant\n' -}}{%- endif -%}"
}