# NVIDIA-Nemotron-3-Super-120B-A12B-BF16_vllm

# NVIDIA-Nemotron-3-Super-120B-A12B-BF16

## Paper

[NVIDIA Nemotron-3 Series Technical Report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf)

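## Model Overview

Nemotron-3-Super-120B-A12B-BF16 is a large language model (LLM) trained by NVIDIA to deliver strong agentic, reasoning, and conversational capabilities. It is optimized for collaborative agents and high-volume workloads such as IT ticket automation. Like other models in the series, it responds to a query or task by first generating a reasoning trace and then giving its final response. The reasoning behavior can be configured through a flag in the chat template.

Architecturally, the model uses a hybrid Latent Mixture-of-Experts (LatentMoE) design that interleaves Mamba-2 layers, MoE layers, and a select number of attention layers. Unlike the Nano model, the Super model adds Multi-Token Prediction (MTP) layers, which improve text generation quality while significantly speeding up generation. To maximize compute efficiency, the model was trained with NVFP4 quantization.

The model has 12B active parameters out of 120B total. It currently supports multiple languages, including English, French, German, Italian, Japanese, Spanish, and Chinese. The model is ready for commercial use.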
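
As a rough illustration of the reasoning flag described above, the sketch below builds an OpenAI-style chat request body that passes a flag through to the chat template. `chat_template_kwargs` is the field vLLM's OpenAI-compatible server uses to forward extra template arguments. The flag name `enable_thinking` and the helper `build_chat_request` are assumptions made for this example, so check the model's chat template for the actual flag name.

```python
from typing import Any, Dict


def build_chat_request(prompt: str, reasoning: bool, model: str = "nemotron") -> Dict[str, Any]:
    """Build an OpenAI-style chat completion body with a reasoning toggle.

    ASSUMPTION: the chat template reads a boolean named `enable_thinking`.
    The real flag name must be taken from the model's chat template.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # vLLM forwards this dict to the chat template when rendering the prompt.
        "chat_template_kwargs": {"enable_thinking": reasoning},
    }
```

Calling `build_chat_request("Summarize this ticket.", reasoning=False)` returns a body that can be sent to `/v1/chat/completions` as JSON.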

## Environment Dependencies

| **Software** | **Version**                               |
| ------------ | ----------------------------------------- |
| DTK          | 26.04                                     |
| python       | 3.10.12                                   |
| transformers | 5.2.0.dev0                                |
| vllm         | 0.15.1+das.opt1.alpha.dtk2604            |
| triton       | 3.3.0+das.opt2.dtk2604.torch291.20260210.g1329924c |
| torch        | 22.9.0+das.opt1.dtk2604.20260206.g275d08c2 |
| numpy        | 1.26.1 |

Currently only the custom image is supported: `harbor.sourcefind.cn:5443/dcu/admin/base/custom:vllm0.15.1-ubuntu22.04-dtk26.04-0130-py3.10-20260220`

- Adjust the `-v` mount paths to match where your model is actually stored.

```bash
docker run -it --shm-size 200g \
                --network=host \
                --name <name> \
                --privileged \
                --device=/dev/kfd \
                --device=/dev/dri \
                --device=/dev/mkfd \
                --group-add video \
                --cap-add=SYS_PTRACE \
                --security-opt seccomp=unconfined \
                -u root \
                -v /opt/hyhal/:/opt/hyhal/:ro \
                harbor.sourcefind.cn:5443/dcu/admin/base/custom:vllm0.15.1-ubuntu22.04-dtk26.04-0130-py3.10-20260220 bash
```
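
Before starting the container, it can help to confirm that the device nodes and the mount passed to `docker run` exist on the host. The following is a minimal sketch: the helper name `missing_paths` is hypothetical, and the paths are simply those that appear in the command above.

```python
import os
from typing import Callable, Iterable, List

# Device nodes and the read-only mount referenced by the docker run command.
REQUIRED_PATHS = ["/dev/kfd", "/dev/dri", "/dev/mkfd", "/opt/hyhal"]


def missing_paths(
    paths: Iterable[str] = REQUIRED_PATHS,
    exists: Callable[[str], bool] = os.path.exists,
) -> List[str]:
    """Return the entries of `paths` that are not present on this host."""
    return [p for p in paths if not exists(p)]
```

An empty list from `missing_paths()` means every path was found. The `exists` parameter is injectable only so the check can be exercised without real hardware.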

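The special deep learning libraries required for DCU cards in this project are provided by the image. The vllm and numpy packages need to be reinstalled with the versions below: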

```bash
pip uninstall vllm
pip uninstall numpy
pip install vllm-0.15.1+das.opt1.alpha.dtk2604-cp310-cp310-linux_x86_64.whl
pip install numpy==1.26.1
```
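
After reinstalling, the installed versions can be checked from Python. This is a small sketch using the standard library's `importlib.metadata`. The expected version prefixes come from the install commands above, and the helper names `installed_version` and `check_versions` are hypothetical.

```python
from importlib import metadata
from typing import Callable, Dict

# Expected version prefixes, taken from the pip install commands above.
EXPECTED = {
    "vllm": "0.15.1",
    "numpy": "1.26.1",
}


def installed_version(package: str) -> str:
    """Return the installed version of `package`, or '' if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return ""


def check_versions(
    expected: Dict[str, str] = EXPECTED,
    lookup: Callable[[str], str] = installed_version,
) -> Dict[str, bool]:
    """Map each package to True if its installed version starts with the expected prefix."""
    return {name: lookup(name).startswith(prefix) for name, prefix in expected.items()}
```

`check_versions()` returns a dict such as `{"vllm": True, "numpy": True}` when both packages match. Prefix matching is used because the DCU builds carry a local version suffix after the `+`.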

## Dataset

None yet.

## Training

None yet.

## Inference

### vllm

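#### Single-node inference (8 cards recommended)

**Note**: For a 120B-parameter BF16 model, single-node inference should use at least 8 K100 AI cards. The `--disable-custom-all-reduce` flag must be added when launching.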
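
The 8-card recommendation can be sanity-checked with simple arithmetic: BF16 stores each parameter in 2 bytes, so the weights alone take roughly 240 GB, or about 30 GB per card when split evenly across 8 cards. The sketch below assumes an even split and ignores the KV cache, Mamba state, activations, and framework overhead, so real usage will be higher. The function names are illustrative only.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (10^9 bytes); BF16 uses 2 bytes per parameter."""
    return num_params * bytes_per_param / 1e9


def per_card_weight_gb(num_params: float, tensor_parallel_size: int, bytes_per_param: int = 2) -> float:
    """Approximate weight memory per card, assuming an even tensor-parallel split."""
    return weight_memory_gb(num_params, bytes_per_param) / tensor_parallel_size
```

For this model, `weight_memory_gb(120e9)` gives 240 GB in total and `per_card_weight_gb(120e9, 8)` gives 30 GB per card, which leaves the remaining memory on each card for caches and activations.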

```bash
# Start the server
export VLLM_USE_NN=0
export VLLM_ENABLE_MOE_FUSED_GATE=0

vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
    --served-model-name nemotron \
    --dtype bfloat16 \
    --trust-remote-code \
    --mamba-ssm-cache-dtype float32 \
    --tensor-parallel-size 8 \
    --host 0.0.0.0 \
    --port 8000 \
    --tool-call-parser qwen3_coder \
    --enable-auto-tool-choice \
    --reasoning-parser super_v3 \
    --reasoning-parser-plugin super_v3_reasoning_parser.py \
    --disable-custom-all-reduce

# Client request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nemotron",
    "messages": [
      {"role": "user", "content": "帮我查下北京天气，顺便把结果翻译成英文。"},
      {"role": "assistant", "tool_calls": [{"id": "chatcmpl-tool-a3ba5e50a56e4f3b", "type": "function", "function": {"name": "get_weather", "arguments": "{\"city\": \"北京\"}"}}]},
      {"role": "tool", "tool_call_id": "chatcmpl-tool-a3ba5e50a56e4f3b", "content": "{\"weather\": \"晴朗\", \"temperature\": \"25度\"}"}
    ],
    "tools": [
      {"type": "function", "function": {"name": "get_weather", "parameters": {"type": "object", "properties": {"city": {"type": "string"}}}}},
      {"type": "function", "function": {"name": "translate", "parameters": {"type": "object", "properties": {"text": {"type": "string"}, "target_lang": {"type": "string"}}}}}
    ]
  }'
```
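
The same request can be sent from Python. The sketch below uses only the standard library and assumes the server from the command above is listening on `localhost:8000`. The helper names `post_chat` and `summarize_reply` are hypothetical. Depending on the vLLM version, the reasoning trace may appear under `reasoning` or `reasoning_content` in the response message, so the code checks both as an assumption rather than relying on one.

```python
import json
import urllib.request
from typing import Any, Dict

BASE_URL = "http://localhost:8000/v1/chat/completions"


def post_chat(body: Dict[str, Any], url: str = BASE_URL, timeout: float = 300.0) -> Dict[str, Any]:
    """POST a chat completion body to the server and return the decoded JSON."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize_reply(response: Dict[str, Any]) -> Dict[str, Any]:
    """Pull the final text, reasoning trace, and tool call names from a response."""
    message = response["choices"][0]["message"]
    # ASSUMPTION: the field name for the reasoning trace depends on the vLLM version.
    reasoning = message.get("reasoning") or message.get("reasoning_content")
    tool_calls = message.get("tool_calls") or []
    return {
        "content": message.get("content"),
        "reasoning": reasoning,
        "tool_names": [call["function"]["name"] for call in tool_calls],
    }
```

With the server running, `summarize_reply(post_chat(body))` returns the answer text, the reasoning trace if one was parsed, and the names of any tools the model asked to call, where `body` is the same JSON object passed to `curl -d`.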


## Results

### Accuracy

Accuracy on DCU is consistent with GPU. Inference framework: vllm.

## Pretrained Weights

| **Model**                       | **Model size** | **DCU model** | **Minimum cards** | **Download**                                                 |
| ------------------------------- | -------------- | ------------- | ----------------- | ------------------------------------------------------------ |
| Nemotron-3-Super-120B-A12B      | 120B           | BW1000        | 8                 | [Hugging Face](https://huggingface.co/nvidia/nemotron-3-super-120b) |

## Source Repository and Issue Feedback

- https://developer.sourcefind.cn/codes/modelzoo/nemotron3_vllm

## References

- https://github.com/vllm-project/vllm
- https://build.nvidia.com/nvidia/nemotron-3-super-120b
