Update README.md

Commit feafde59 authored Mar 20, 2026 by raojy

Showing with 142 additions and 0 deletions

--- a/README.md
+++ b/README.md
 # NVIDIA-Nemotron-3-Super-120B-A12B-BF16_vllm

+# NVIDIA-Nemotron-3-Super-120B-A12B-BF16
+
+## Paper
+
+[NVIDIA Nemotron-3 Series Technical Report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf)
+
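+## Model Overview
+
+Nemotron-3-Super-120B-A12B-BF16 is a large language model (LLM) trained by NVIDIA to deliver strong agentic, reasoning, and conversational capabilities. It is optimized for collaborative agents and high-volume workloads such as IT ticket automation. Like other models in the series, it responds to a query or task by first generating a reasoning trace and then giving the final answer. Reasoning can be configured through a flag in the chat template.
+
+Architecturally, the model uses a hybrid Latent Mixture-of-Experts (LatentMoE) design that interleaves Mamba-2 layers, MoE layers, and a select number of attention layers. Unlike the Nano version, the Super model adds Multi-Token Prediction (MTP) layers, which improve text generation quality while significantly speeding up generation. To maximize compute efficiency, the model was trained with NVFP4 quantization.
+
+The model has 12B active parameters out of 120B total. It currently supports multiple languages, including English, French, German, Italian, Japanese, Spanish, and Chinese. It is ready for commercial use.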
+
+## Environment Dependencies
+
+| **Software** | **Version**                               |
+| ------------ | ----------------------------------------- |
+| DTK          | 26.04                                     |
+| python       | 3.10.12                                   |
+| transformers | 5.2.0.dev0                                |
+| vllm         | 0.15.1+das.opt1.alpha.dtk2604            |
+| triton       | 3.3.0+das.opt2.dtk2604.torch291.20260210.g1329924c |
+| torch        | 22.9.0+das.opt1.dtk2604.20260206.g275d08c2 |
+| numpy        | 1.26.1 |
+
+Currently only the custom image is supported: `harbor.sourcefind.cn:5443/dcu/admin/base/custom:vllm0.15.1-ubuntu22.04-dtk26.04-0130-py3.10-20260220`
+
+- Adjust the `-v` mount paths to match where your model is stored.
+
+```bash
+docker run -it --shm-size 200g \
+                --network=host \
+                --name <name> \
+                --privileged \
+                --device=/dev/kfd \
+                --device=/dev/dri \
+                --device=/dev/mkfd \
+                --group-add video \
+                --cap-add=SYS_PTRACE \
+                --security-opt seccomp=unconfined \
+                -u root \
+                -v /opt/hyhal/:/opt/hyhal/:ro \
+                harbor.sourcefind.cn:5443/dcu/admin/base/custom:vllm0.15.1-ubuntu22.04-dtk26.04-0130-py3.10-20260220 bash
+```
+
+The deep learning libraries built for DCU cards must be installed in place of the stock packages. For this project, replace vllm and numpy:
+
+```bash
+pip uninstall -y vllm
+pip uninstall -y numpy
+pip install vllm-0.15.1+das.opt1.alpha.dtk2604-cp310-cp310-linux_x86_64.whl
+pip install numpy==1.26.1
+```
+
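+## Dataset
+
+Not available yet.
+
+## Training
+
+Not available yet.
+
+## Inference
+
+### vllm
+
+#### Single-node inference (8 cards recommended)
+
+**Note**: For the 120B-parameter BF16 model, single-node inference should use at least 8 K100 AI cards. Add the `--disable-custom-all-reduce` flag when launching.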
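
As a rough check on the card count, the weight footprint can be estimated from the parameter count and the 2 bytes per parameter that BF16 uses. The sketch below is only an estimate under stated assumptions: the helper names `weights_gb` and `per_card_gb` are hypothetical, sizes are in decimal GB, and KV cache, Mamba state, and activation memory are ignored, so real usage will be higher.

```shell
# Rough weight-memory estimate (hypothetical helpers, weights only).
# $1 = parameters in billions, $2 = bytes per parameter (BF16 = 2).
weights_gb() {
  echo $(( $1 * $2 ))
}

# $3 = tensor parallel size; assumes weights are split evenly across cards.
per_card_gb() {
  echo $(( $(weights_gb "$1" "$2") / $3 ))
}

weights_gb 120 2      # total BF16 weights, in GB
per_card_gb 120 2 8   # share per card with --tensor-parallel-size 8
```

For 120B parameters this prints 240 and 30, that is, about 240 GB of weights in total and about 30 GB per card at tensor parallel size 8, before any runtime buffers are counted.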
+
+```bash
+# Start the server
+export VLLM_USE_NN=0
+export VLLM_ENABLE_MOE_FUSED_GATE=0
+
+vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
+    --served-model-name nemotron \
+    --dtype bfloat16 \
+    --trust-remote-code \
+    --mamba-ssm-cache-dtype float32 \
+    --tensor-parallel-size 8 \
+    --host 0.0.0.0 \
+    --port 8000 \
+    --tool-call-parser qwen3_coder \
+    --enable-auto-tool-choice \
+    --reasoning-parser super_v3 \
+    --reasoning-parser-plugin super_v3_reasoning_parser.py \
+    --disable-custom-all-reduce
+
+# Query from a client (run in a separate terminal once the server is up)
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "nemotron",
+    "messages": [
+      {"role": "user", "content": "帮我查下北京天气，顺便把结果翻译成英文。"},
+      {"role": "assistant", "tool_calls": [{"id": "chatcmpl-tool-a3ba5e50a56e4f3b", "type": "function", "function": {"name": "get_weather", "arguments": "{\"city\": \"北京\"}"}}]},
+      {"role": "tool", "tool_call_id": "chatcmpl-tool-a3ba5e50a56e4f3b", "content": "{\"weather\": \"晴朗\", \"temperature\": \"25度\"}"}
+    ],
+    "tools": [
+      {"type": "function", "function": {"name": "get_weather", "parameters": {"type": "object", "properties": {"city": {"type": "string"}}}}},
+      {"type": "function", "function": {"name": "translate", "parameters": {"type": "object", "properties": {"text": {"type": "string"}, "target_lang": {"type": "string"}}}}}
+    ]
+  }'
+```
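
Loading a model of this size can take a while, so it may help to wait for the server before sending requests. The following is a minimal sketch, not part of the original project: the function names `build_payload`, `wait_for_server`, and `chat` are hypothetical, `BASE_URL` and `MODEL` mirror the `--port 8000` and `--served-model-name nemotron` values used above, and it assumes the server exposes the `/health` endpoint that vLLM's OpenAI-compatible server provides.

```shell
# Assumed defaults, taken from the serve command above.
BASE_URL="${BASE_URL:-http://localhost:8000}"
MODEL="${MODEL:-nemotron}"

# Build a single-turn chat payload.
# $1 = prompt (must not contain characters that need JSON escaping),
# $2 = max_tokens (default 128).
build_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "max_tokens": %d}' \
    "$MODEL" "$1" "${2:-128}"
}

# Poll /health every 10 seconds, up to $1 attempts (default 60).
# Returns 0 once the server answers, 1 if it never does.
wait_for_server() {
  attempts="${1:-60}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -sf "$BASE_URL/health" > /dev/null; then
      return 0
    fi
    sleep 10
    i=$((i + 1))
  done
  return 1
}

# Send one prompt to the chat completions endpoint and print the raw JSON reply.
chat() {
  curl -s "$BASE_URL/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(build_payload "$1")"
}
```

A typical use would be `wait_for_server && chat "Hello"`. The payload builder is deliberately simple; prompts with quotes or newlines would need proper JSON encoding.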
+
+
+## Results
+
+### Accuracy
+
+Accuracy on DCU matches that on GPU. Inference framework: vllm.
+
+## Pretrained Weights
+
+| **Model**                       | **Parameters** | **DCU model** | **Minimum cards** | **Download**                                                 |
+| ------------------------------- | -------------- | ------------- | ----------------- | ------------------------------------------------------------ |
+| Nemotron-3-Super-120B-A12B      | 120B           | BW1000        | 8                 | [Hugging Face](https://huggingface.co/nvidia/nemotron-3-super-120b) |
+
+## Source Repository and Issue Feedback
+
+- https://developer.sourcefind.cn/codes/modelzoo/nemotron3_vllm
+
+## References
+
+- https://github.com/vllm-project/vllm
+- https://build.nvidia.com/nvidia/nemotron-3-super-120b
+
+------
+
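+**Suggested actions:**
+
+1. Confirm whether the suffix in the image name needs to be changed to `nemotron3_120b`.
+2. Confirm that, on a single node with 8 cards and TP=8, K100 AI memory is sufficient for the 120B BF16 model (the weights alone typically need about 240 GB; if each K100 AI card has 80 GB, 8 cards are sufficient).
+3. If the model has a special `reasoning-parser`, add it to the `vllm serve` command.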