# 🚀 MiniMax Model Transformers Deployment Guide
## 📖 Introduction
This guide will help you deploy the MiniMax-M1 model using the [Transformers](https://huggingface.co/docs/transformers/index) library. Transformers is a widely used deep learning library that provides a rich collection of pretrained models and flexible interfaces for working with them.
## 🛠️ Environment Setup
### Installing Transformers
```bash
pip install transformers torch accelerate
```
## 📋 Basic Usage Example
The pretrained model can be used as follows:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
MODEL_PATH = "{MODEL_PATH}"
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
messages = [
{"role": "user", "content": [{"type": "text", "text": "What is your favourite condiment?"}]},
{"role": "assistant", "content": [{"type": "text", "text": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"}]},
{"role": "user", "content": [{"type": "text", "text": "Do you have mayonnaise recipes?"}]}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer(text, return_tensors="pt").to(model.device)
generation_config = GenerationConfig(
max_new_tokens=20,
eos_token_id=tokenizer.eos_token_id,
use_cache=True,
)
generated_ids = model.generate(**model_inputs, generation_config=generation_config)
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
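For interactive use you may prefer to stream tokens as they are generated instead of waiting for `generate` to finish. Below is a minimal sketch using the `TextStreamer` utility from Transformers; it reuses the `model`, `tokenizer`, and `model_inputs` objects defined above, and the parameter choices are illustrative only:
```python
from transformers import TextStreamer

# Print decoded tokens to stdout as they are generated; skip the echoed prompt
# and special tokens for readability.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**model_inputs, max_new_tokens=100, streamer=streamer)
```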
## ⚡ Performance Optimization
### Accelerating with Flash Attention
The snippet above shows inference without any optimization tricks. The model can be substantially accelerated by leveraging [Flash Attention](../perf_train_gpu_one#flash-attention-2), which provides a faster implementation of the attention mechanism used inside the model.
First, make sure to install the latest version of Flash Attention 2:
```bash
pip install -U flash-attn --no-build-isolation
```
Also make sure your hardware is compatible with Flash Attention 2; see the official documentation in the [Flash Attention repository](https://github.com/Dao-AILab/flash-attention) for details. In addition, be sure to load the model in half precision (e.g. `torch.float16`).
To load and run the model with Flash Attention 2, refer to the following snippet:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
MODEL_PATH = "{MODEL_PATH}"
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, trust_remote_code=True, torch_dtype=torch.float16, attn_implementation="flash_attention_2", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
prompt = "My favourite condiment is"
model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
response = tokenizer.batch_decode(generated_ids)[0]
print(response)
```
## 📮 Getting Support
If you encounter any issues while deploying the MiniMax-M1 model:
- Check our official documentation
- Contact our technical support team through official channels
- Submit an Issue on our GitHub repository
We will keep improving the deployment experience on Transformers, and your feedback is welcome!
# 🚀 MiniMax Model Transformers Deployment Guide
[Transformers Deployment Guide (Chinese)](./transformers_deployment_guide_cn.md)
## 📖 Introduction
This guide will help you deploy the MiniMax-M1 model using the [Transformers](https://huggingface.co/docs/transformers/index) library. Transformers is a widely used deep learning library that provides a rich collection of pretrained models and flexible interfaces for working with them.
## 🛠️ Environment Setup
### Installing Transformers
```bash
pip install transformers torch accelerate
```
## 📋 Basic Usage Example
The pretrained model can be used as follows:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
MODEL_PATH = "{MODEL_PATH}"
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
messages = [
{"role": "user", "content": [{"type": "text", "text": "What is your favourite condiment?"}]},
{"role": "assistant", "content": [{"type": "text", "text": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"}]},
{"role": "user", "content": [{"type": "text", "text": "Do you have mayonnaise recipes?"}]}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer(text, return_tensors="pt").to(model.device)
generation_config = GenerationConfig(
max_new_tokens=20,
eos_token_id=tokenizer.eos_token_id,
use_cache=True,
)
generated_ids = model.generate(**model_inputs, generation_config=generation_config)
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
## ⚡ Performance Optimization
### Accelerating with Flash Attention
The example above shows inference without any optimization. The model can be significantly accelerated with [Flash Attention](../perf_train_gpu_one#flash-attention-2), a faster implementation of the attention mechanism used by the model.
First, make sure to install the latest version of Flash Attention 2:
```bash
pip install -U flash-attn --no-build-isolation
```
Your hardware must also be compatible with Flash Attention 2; see the official documentation in the [Flash Attention repository](https://github.com/Dao-AILab/flash-attention) for details. It is also recommended to load the model in half precision (e.g. `torch.float16`).
To load and run the model with Flash Attention 2, use the example below:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
MODEL_PATH = "{MODEL_PATH}"
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, trust_remote_code=True, torch_dtype=torch.float16, attn_implementation="flash_attention_2", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
prompt = "My favourite condiment is"
model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
response = tokenizer.batch_decode(generated_ids)[0]
print(response)
```
## 📮 Support
If you run into any issues while deploying the MiniMax-M1 model:
* Check our official documentation
* Contact our technical support team through official channels
* Open an Issue on our GitHub repository
We are continuously improving the deployment experience on Transformers and greatly value your feedback!
# 🚀 MiniMax Models vLLM Deployment Guide
[vLLM Deployment Guide (Chinese)](./vllm_deployment_guide_cn.md)
## 📖 Introduction
We recommend using [vLLM](https://docs.vllm.ai/en/latest/) to deploy the [MiniMax-M1](https://huggingface.co/MiniMaxAI/MiniMax-M1-40k) model. Based on our testing, vLLM performs excellently when deploying this model, with the following features:
- 🔥 Outstanding service throughput performance
- ⚡ Efficient and intelligent memory management
- 📦 Powerful batch request processing capability
- ⚙️ Deeply optimized underlying performance
The MiniMax-M1 model can run efficiently on a single server equipped with 8 H800 or 8 H20 GPUs. In terms of hardware configuration, a server with 8 H800 GPUs can process context inputs up to 2 million tokens, while a server equipped with 8 H20 GPUs can support ultra-long context processing capabilities of up to 5 million tokens.
## 💾 Obtaining MiniMax Models
### Obtaining the MiniMax-M1 Model
You can download the model from our official HuggingFace repositories: [MiniMax-M1-40k](https://huggingface.co/MiniMaxAI/MiniMax-M1-40k), [MiniMax-M1-80k](https://huggingface.co/MiniMaxAI/MiniMax-M1-80k)
Download command:
```bash
pip install -U huggingface-hub
huggingface-cli download MiniMaxAI/MiniMax-M1-40k
# huggingface-cli download MiniMaxAI/MiniMax-M1-80k
# If you encounter network issues, you can set a proxy
export HF_ENDPOINT=https://hf-mirror.com
```
Or download using git:
```bash
git lfs install
git clone https://huggingface.co/MiniMaxAI/MiniMax-M1-40k
git clone https://huggingface.co/MiniMaxAI/MiniMax-M1-80k
```
⚠️ **Important Note**: Please ensure that [Git LFS](https://git-lfs.github.com/) is installed on your system, which is necessary for completely downloading the model weight files.
## 🛠️ Deployment Options
### Option 1: Deploy Using Docker (Recommended)
To ensure consistency and stability of the deployment environment, we recommend using Docker for deployment.
⚠️ **Version Requirements**:
- MiniMax-M1 model requires vLLM version 0.8.3 or later for full support
- If you are using a Docker image with vLLM version lower than the required version, you will need to:
1. Update to the latest vLLM code
2. Recompile vLLM from source. Follow the compilation instructions in Solution 2 of the Common Issues section
- Special Note: For vLLM versions between 0.8.3 and 0.9.2, you need to modify the model configuration:
1. Open `config.json`
2. Change `config['architectures'] = ["MiniMaxM1ForCausalLM"]` to `config['architectures'] = ["MiniMaxText01ForCausalLM"]`
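If you need the architecture override above, it can be applied with a short script. The following is only a sketch, where `<model storage path>` is a placeholder for your downloaded model directory containing `config.json`:
```python
import json
from pathlib import Path

# Hypothetical placeholder; point this at your local model directory.
cfg_path = Path("<model storage path>") / "config.json"
cfg = json.loads(cfg_path.read_text())

# Only needed for vLLM versions between 0.8.3 and 0.9.2.
cfg["architectures"] = ["MiniMaxText01ForCausalLM"]
cfg_path.write_text(json.dumps(cfg, indent=2, ensure_ascii=False))
```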
1. Get the container image:
```bash
docker pull vllm/vllm-openai:v0.8.3
```
2. Run the container:
```bash
# Set environment variables
IMAGE=vllm/vllm-openai:v0.8.3
MODEL_DIR=<model storage path>
CODE_DIR=<code path>
NAME=MiniMaxImage
# Docker run configuration
DOCKER_RUN_CMD="--network=host --privileged --ipc=host --ulimit memlock=-1 --shm-size=2gb --rm --gpus all --ulimit stack=67108864"
# Start the container
sudo docker run -it \
-v $MODEL_DIR:$MODEL_DIR \
-v $CODE_DIR:$CODE_DIR \
--name $NAME \
$DOCKER_RUN_CMD \
$IMAGE /bin/bash
```
### Option 2: Direct Installation of vLLM
If your environment meets the following requirements:
- CUDA 12.1
- PyTorch 2.1
you can install vLLM directly.
Installation command:
```bash
pip install vllm
```
💡 If you are using other environment configurations, please refer to the [vLLM Installation Guide](https://docs.vllm.ai/en/latest/getting_started/installation.html)
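Before installing, you can quickly confirm that your local PyTorch build matches the CUDA requirement above. A minimal check (the versions reported on your machine may of course differ):
```python
import torch

# Report the installed PyTorch version, the CUDA version it was built against,
# and whether a GPU is visible to this process.
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
```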
## 🚀 Starting the Service
### Launch MiniMax-M1 Service
```bash
export SAFETENSORS_FAST_GPU=1
export VLLM_USE_V1=0
python3 -m vllm.entrypoints.openai.api_server \
--model <model storage path> \
--tensor-parallel-size 8 \
--trust-remote-code \
--quantization experts_int8 \
--max_model_len 4096 \
--dtype bfloat16
```
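Once the server is up, you can verify that it is serving the model by querying the OpenAI-compatible `/v1/models` endpoint. A small sketch, assuming the default port 8000 used above:
```python
import json
import urllib.request

# List the models the local vLLM server is currently serving.
with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
    print(json.dumps(json.load(resp), indent=2))
```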
### API Call Example
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MiniMaxAI/MiniMax-M1",
"messages": [
{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
{"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
]
}'
```
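The same request can also be sent with the official `openai` Python SDK (assumed to be installed via `pip install openai`). This is a sketch; the `model` field must match the name your server actually reports (see the `/v1/models` check above):
```python
from openai import OpenAI

# The vLLM OpenAI-compatible server does not check the API key by default,
# but the client still requires a non-empty value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M1",
    messages=[
        {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
        {"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]},
    ],
)
print(response.choices[0].message.content)
```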
## ❗ Common Issues
### Module Loading Problems
If you encounter the following error:
```
import vllm._C # noqa
ModuleNotFoundError: No module named 'vllm._C'
```
Or
```
MiniMax-M1 model is not currently supported
```
We provide two solutions:
#### Solution 1: Copy Dependency Files
```bash
cd <working directory>
git clone https://github.com/vllm-project/vllm.git
cd vllm
cp /usr/local/lib/python3.12/dist-packages/vllm/*.so vllm
cp -r /usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/* vllm/vllm_flash_attn
```
#### Solution 2: Install from Source
```bash
cd <working directory>
git clone https://github.com/vllm-project/vllm.git
cd vllm/
pip install -e .
```
## 📮 Getting Support
If you encounter any issues while deploying the MiniMax-M1 model:
- Please check our official documentation
- Contact our technical support team through official channels
- Submit an [Issue](https://github.com/MiniMax-AI/MiniMax-M1/issues) on our GitHub repository
We will continuously optimize the deployment experience of this model and welcome your feedback!
# 🚀 MiniMax Models vLLM Deployment Guide
## 📖 Introduction
We recommend using [vLLM](https://docs.vllm.ai/en/latest/) to deploy the [MiniMax-M1](https://huggingface.co/MiniMaxAI/MiniMax-M1-40k) model. Based on our testing, vLLM performs excellently when deploying this model, with the following features:
- 🔥 Outstanding service throughput performance
- ⚡ Efficient and intelligent memory management
- 📦 Powerful batch request processing capability
- ⚙️ Deeply optimized underlying performance
The MiniMax-M1 model can run efficiently on a single server equipped with 8 H800 or 8 H20 GPUs. In terms of hardware configuration, a server with 8 H800 GPUs can handle context inputs of up to 2 million tokens, while a server equipped with 8 H20 GPUs can support ultra-long context processing of up to 5 million tokens.
## 💾 Obtaining MiniMax Models
### Obtaining the MiniMax-M1 Model
You can download the model from our official HuggingFace repositories: [MiniMax-M1-40k](https://huggingface.co/MiniMaxAI/MiniMax-M1-40k), [MiniMax-M1-80k](https://huggingface.co/MiniMaxAI/MiniMax-M1-80k)
Download command:
```bash
pip install -U huggingface-hub
huggingface-cli download MiniMaxAI/MiniMax-M1-40k
# huggingface-cli download MiniMaxAI/MiniMax-M1-80k
# If you encounter network issues, you can set a proxy
export HF_ENDPOINT=https://hf-mirror.com
```
Or download using git:
```bash
git lfs install
git clone https://huggingface.co/MiniMaxAI/MiniMax-M1-40k
git clone https://huggingface.co/MiniMaxAI/MiniMax-M1-80k
```
⚠️ **Important Note**: Please make sure [Git LFS](https://git-lfs.github.com/) is installed on your system; it is required to fully download the model weight files.
## 🛠️ Deployment Options
### Option 1: Deploy Using Docker (Recommended)
To ensure consistency and stability of the deployment environment, we recommend using Docker for deployment.
⚠️ **Version Requirements**
- Base requirement: vLLM version must be ≥ 0.8.3 to ensure full support for the MiniMax-M1 model
- Special note: for vLLM versions between 0.8.3 and 0.9.2, the model configuration file must be modified:
  - Open `config.json`
  - Change `config['architectures'] = ["MiniMaxM1ForCausalLM"]` to `config['architectures'] = ["MiniMaxText01ForCausalLM"]`
1. Get the container image:
```bash
docker pull vllm/vllm-openai:v0.8.3
```
2. Run the container:
```bash
# Set environment variables
IMAGE=vllm/vllm-openai:v0.8.3
MODEL_DIR=<model storage path>
CODE_DIR=<code path>
NAME=MiniMaxImage
# Docker run configuration
DOCKER_RUN_CMD="--network=host --privileged --ipc=host --ulimit memlock=-1 --shm-size=2gb --rm --gpus all --ulimit stack=67108864"
# Start the container
sudo docker run -it \
-v $MODEL_DIR:$MODEL_DIR \
-v $CODE_DIR:$CODE_DIR \
--name $NAME \
$DOCKER_RUN_CMD \
$IMAGE /bin/bash
```
### Option 2: Direct Installation of vLLM
If your environment meets the following requirements:
- CUDA 12.1
- PyTorch 2.1
you can install vLLM directly.
Installation command:
```bash
pip install vllm
```
💡 If you are using a different environment configuration, please refer to the [vLLM Installation Guide](https://docs.vllm.ai/en/latest/getting_started/installation.html)
## 🚀 Starting the Service
### Launch the MiniMax-M1 Service
```bash
export SAFETENSORS_FAST_GPU=1
export VLLM_USE_V1=0
python3 -m vllm.entrypoints.openai.api_server \
--model <model storage path> \
--tensor-parallel-size 8 \
--trust-remote-code \
--quantization experts_int8 \
--max_model_len 4096 \
--dtype bfloat16
```
### API Call Example
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MiniMaxAI/MiniMax-M1",
"messages": [
{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
{"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
]
}'
```
## ❗ Common Issues
### Module Loading Problems
If you encounter the following error:
```
import vllm._C # noqa
ModuleNotFoundError: No module named 'vllm._C'
```
Or
```
MiniMax-M1 model is not currently supported
```
We provide two solutions:
#### Solution 1: Copy Dependency Files
```bash
cd <working directory>
git clone https://github.com/vllm-project/vllm.git
cd vllm
cp /usr/local/lib/python3.12/dist-packages/vllm/*.so vllm
cp -r /usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/* vllm/vllm_flash_attn
```
#### Solution 2: Install from Source
```bash
cd <working directory>
git clone https://github.com/vllm-project/vllm.git
cd vllm/
pip install -e .
```
## 📮 Getting Support
If you encounter any issues while deploying the MiniMax-M1 model:
- Check our official documentation
- Contact our technical support team through official channels
- Submit an [Issue](https://github.com/MiniMax-AI/MiniMax-M1/issues) on our GitHub repository
We will continuously improve the deployment experience for this model and welcome your feedback!
# 🚀 MiniMax Models vLLM Deployment Guide
[vLLM Deployment Guide (Chinese)](./vllm_deployment_guide_cn.md)
## 📖 Introduction
We recommend using [vLLM](https://docs.vllm.ai/en/latest/) to deploy the [MiniMax-M1](https://huggingface.co/MiniMaxAI/MiniMax-M1-40k) model. Based on our testing, vLLM delivers excellent performance when running this model, offering the following advantages:
- 🔥 Outstanding service throughput performance
- ⚡ Efficient and intelligent memory management
- 📦 Robust batch request processing capability
- ⚙️ Deeply optimized low-level performance
The MiniMax-M1 model can run efficiently on a single server equipped with 8 H800 or 8 H20 GPUs. In terms of hardware configuration, a server with 8 H800 GPUs can process context inputs of up to 2 million tokens, while a server equipped with 8 H20 GPUs supports ultra-long contexts of up to 5 million tokens.
## 💾 Obtaining the MiniMax Models
### Downloading the MiniMax-M1 Model
You can download the model directly from our official HuggingFace repositories: [MiniMax-M1-40k](https://huggingface.co/MiniMaxAI/MiniMax-M1-40k) or [MiniMax-M1-80k](https://huggingface.co/MiniMaxAI/MiniMax-M1-80k).
Download command:
```bash
pip install -U huggingface-hub
huggingface-cli download MiniMaxAI/MiniMax-M1-40k
# huggingface-cli download MiniMaxAI/MiniMax-M1-80k
# If you encounter network issues, you can set a proxy
export HF_ENDPOINT=https://hf-mirror.com
```
Or download using git:
```bash
git lfs install
git clone https://huggingface.co/MiniMaxAI/MiniMax-M1-40k
git clone https://huggingface.co/MiniMaxAI/MiniMax-M1-80k
```
⚠️ **Important Note**: Make sure [Git LFS](https://git-lfs.github.com/) is installed on your system; it is required to fully download the model weight files.
## 🛠️ Deployment Options
### Option 1: Deploy Using Docker (Recommended)
To ensure consistency and stability of the deployment environment, we recommend using Docker.
⚠️ **Version Requirements**:
* The MiniMax-M1 model requires vLLM version 0.8.3 or later for full support.
* If you are using a Docker image with a vLLM version lower than required, you will need to:
  1. Update to the latest vLLM release.
  2. Recompile vLLM from source (see the instructions in Solution 2 of the Common Issues section).
* Special note: for vLLM versions between 0.8.3 and 0.9.2, the model configuration must be modified:
  1. Open `config.json`.
  2. Change `config['architectures'] = ["MiniMaxM1ForCausalLM"]` to `config['architectures'] = ["MiniMaxText01ForCausalLM"]`.
1. Get the container image:
```bash
docker pull vllm/vllm-openai:v0.8.3
```
2. Run the container:
```bash
# Set environment variables
IMAGE=vllm/vllm-openai:v0.8.3
MODEL_DIR=<model storage path>
CODE_DIR=<code path>
NAME=MiniMaxImage
# Docker run configuration
DOCKER_RUN_CMD="--network=host --privileged --ipc=host --ulimit memlock=-1 --shm-size=2gb --rm --gpus all --ulimit stack=67108864"
# Start the container
sudo docker run -it \
-v $MODEL_DIR:$MODEL_DIR \
-v $CODE_DIR:$CODE_DIR \
--name $NAME \
$DOCKER_RUN_CMD \
$IMAGE /bin/bash
```
### Option 2: Direct Installation of vLLM
If your environment meets the following requirements:
* CUDA 12.1
* PyTorch 2.1
you can install vLLM directly with:
```bash
pip install vllm
```
💡 If you are using a different environment configuration, refer to the [vLLM Installation Guide](https://docs.vllm.ai/en/latest/getting_started/installation.html).
## 🚀 Starting the Service
### Launching the MiniMax-M1 Service
```bash
export SAFETENSORS_FAST_GPU=1
export VLLM_USE_V1=0
python3 -m vllm.entrypoints.openai.api_server \
--model <model storage path> \
--tensor-parallel-size 8 \
--trust-remote-code \
--quantization experts_int8 \
--max_model_len 4096 \
--dtype bfloat16
```
### API Call Example
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MiniMaxAI/MiniMax-M1",
"messages": [
{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
{"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
]
}'
```
## ❗ Common Issues
### Module Loading Problems
If you encounter the following error:
```
import vllm._C # noqa
ModuleNotFoundError: No module named 'vllm._C'
```
Or
```
MiniMax-M1 model is not currently supported
```
We provide two solutions:
#### Solution 1: Copy Dependency Files
```bash
cd <working directory>
git clone https://github.com/vllm-project/vllm.git
cd vllm
cp /usr/local/lib/python3.12/dist-packages/vllm/*.so vllm
cp -r /usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/* vllm/vllm_flash_attn
```
#### Solution 2: Install from Source
```bash
cd <working directory>
git clone https://github.com/vllm-project/vllm.git
cd vllm/
pip install -e .
```
## 📮 Support
If you run into any issues while deploying the MiniMax-M1 model:
* Check our official documentation
* Contact our technical support team through official channels
* Open an [Issue](https://github.com/MiniMax-AI/MiniMax-M1/issues) on our GitHub repository
We are continuously improving the deployment experience for this model and greatly value your feedback!
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from multiprocessing import freeze_support

if __name__ == '__main__':
    freeze_support()
    # Initialize the tokenizer
    tokenizer = AutoTokenizer.from_pretrained("MiniMaxAI/MiniMax-M1-40k")
    # Default decoding hyperparameters; max_tokens caps the generation length.
    sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512)
    # Input the model name or path. Can be GPTQ or AWQ models.
    llm = LLM(
        model="MiniMaxAI/MiniMax-M1-40k",
        distributed_executor_backend="ray",
        tensor_parallel_size=16,
        max_model_len=4096,
        dtype="bfloat16",
        enforce_eager=True,
        gpu_memory_utilization=0.99,
        trust_remote_code=True,
    )
    # Prepare your prompts
    prompt = "How large is the land area of the United States?"
    messages = [
        {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
        {"role": "user", "content": [{"type": "text", "text": prompt}]}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True  # Switches between thinking and non-thinking modes. Default is True.
    )
    # Generate outputs
    outputs = llm.generate([text], sampling_params)
    # Print the outputs
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Generated text: {generated_text!r}")
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig, QuantoConfig, GenerationConfig
import torch
import argparse

"""
usage:
export SAFETENSORS_FAST_GPU=1
python main.py --quant_type int8 --world_size 8 --model_id <model_path>
"""

def generate_quanto_config(hf_config: AutoConfig, quant_type: str):
    QUANT_TYPE_MAP = {
        "default": None,
        "int8": QuantoConfig(
            weights="int8",
            modules_to_not_convert=[
                "lm_head",
                "embed_tokens",
            ] + [f"model.layers.{i}.coefficient" for i in range(hf_config.num_hidden_layers)]
            + [f"model.layers.{i}.block_sparse_moe.gate" for i in range(hf_config.num_hidden_layers)]
        ),
    }
    return QUANT_TYPE_MAP[quant_type]

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--quant_type", type=str, default="default", choices=["default", "int8"])
    parser.add_argument("--model_id", type=str, required=True)
    parser.add_argument("--world_size", type=int, required=True)
    return parser.parse_args()

def check_params(args, hf_config: AutoConfig):
    if args.quant_type == "int8":
        assert args.world_size >= 8, "int8 weight-only quantization requires at least 8 GPUs"
        assert hf_config.num_hidden_layers % args.world_size == 0, f"num_hidden_layers({hf_config.num_hidden_layers}) must be divisible by world_size({args.world_size})"

@torch.no_grad()
def main():
    args = parse_args()
    print("\n=============== Argument ===============")
    for key in vars(args):
        print(f"{key}: {vars(args)[key]}")
    print("========================================")
    model_id = args.model_id
    hf_config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
    check_params(args, hf_config)
    quantization_config = generate_quanto_config(hf_config, args.quant_type)
    # Pin embeddings and the final norm/lm_head, then spread decoder layers evenly across GPUs.
    device_map = {
        'model.embed_tokens': 'cuda:0',
        'model.norm': f'cuda:{args.world_size - 1}',
        'lm_head': f'cuda:{args.world_size - 1}'
    }
    layers_per_device = hf_config.num_hidden_layers // args.world_size
    for i in range(args.world_size):
        for j in range(layers_per_device):
            device_map[f'model.layers.{i * layers_per_device + j}'] = f'cuda:{i}'
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    message = [
        {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
        {"role": "user", "content": [{"type": "text", "text": "Hello, what is the weather today?"}]}
    ]
    tools = [
        {"name": "get_location", "description": "Get the location of the user.", "parameters": {"type": "object", "properties": {}}},
        {"name": "get_weather", "description": "Get the weather of a city.", "parameters": {"type": "object", "properties": {"city": {"type": "string", "description": "The name of the city"}}}},
        {"name": "get_news", "description": "Get the news.", "parameters": {"type": "object", "properties": {"domain": {"type": "string", "description": "The domain of the news"}}}}
    ]
    text = tokenizer.apply_chat_template(
        message,
        tools,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer(text, return_tensors="pt").to("cuda")
    quantized_model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="bfloat16",
        device_map=device_map,
        quantization_config=quantization_config,
        trust_remote_code=True,
        offload_buffers=True,
    )
    generation_config = GenerationConfig(
        max_new_tokens=20,
        eos_token_id=200020,
        use_cache=True,
    )
    generated_ids = quantized_model.generate(**model_inputs, generation_config=generation_config)
    print(f"generated_ids: {generated_ids}")
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(response)

if __name__ == "__main__":
    main()
# Model code
modelCode=1636
# Model name
modelName=MiniMax-M1_vllm
# Model description
modelDescription=MiniMax M1 offers ultra-long context capability: 1,000,000-token input and 80,000-token output, an open-source model on par with Gemini 2.5 Pro.
# Application scenarios
appScenario=Reasoning, dialogue Q&A, manufacturing, media, finance, energy, healthcare, smart home, education
# Framework type
frameType=vllm
{
"add_prefix_space": false,
"bos_token": "<beginning_of_sentence>",
"clean_up_tokenization_spaces": false,
"eos_token": "<end_of_sentence>",
"model_max_length": 40960000,
"tokenizer_class": "GPT2Tokenizer",
"unk_token": "<end_of_document>",
"chat_template": "{{ '<begin_of_document>' -}}{% set ns = namespace(system_prompt='') -%}{% for message in messages -%}{% if message['role'] == 'system' -%}{% set ns.system_prompt = ns.system_prompt + message['content'][0]['text'] -%}{% endif -%}{%- endfor -%}{% if ns.system_prompt != '' -%}{{ '<beginning_of_sentence>system ai_setting=assistant\n' + ns.system_prompt + '<end_of_sentence>\n' -}}{%- endif -%}{% if tools -%}{{ '<beginning_of_sentence>system tool_setting=tools\nYou are provided with these tools:\n<tools>\n' -}}{% for tool in tools -%}{{ tool | tojson ~ '\n' -}}{%- endfor -%}{{ '</tools>\n\nIf you need to call tools, please respond with <tool_calls></tool_calls> XML tags, and provide tool-name and json-object of arguments, following the format below:\n<tool_calls>\n{''name'': <tool-name-1>, ''arguments'': <args-json-object-1>}\n...\n</tool_calls><end_of_sentence>\n' -}}{%- endif -%}{% for message in messages -%}{% if message['role'] == 'user' -%}{{ '<beginning_of_sentence>user name=user\n' + message['content'][0]['text'] + '<end_of_sentence>\n' -}}{% elif message['role'] == 'assistant' -%}{{ '<beginning_of_sentence>ai name=assistant\n' -}}{% for content in message['content'] | selectattr('type', 'equalto', 'text') -%}{{ content['text'] -}}{%- endfor -%}{{ '<end_of_sentence>\n' -}}{% elif message['role'] == 'tool' -%}{{ '<beginning_of_sentence>tool name=tools\n' }} {%- for content in message['content'] -%}{{- 'tool name: ' + content['name'] + '\n' + 'tool result: ' + content['text'] + '\n\n' -}} {%- endfor -%}{{- '<end_of_sentence>\n' -}}{% endif -%}{%- endfor -%}{% if add_generation_prompt -%}{{ '<beginning_of_sentence>ai name=assistant\n' -}}{%- endif -%}"
}