model: Support nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 (#9301)

Commit 4cd08dc5 (unverified), authored Aug 26, 2025 by Netanel Haber, committed by GitHub Aug 26, 2025
Showing 2 changed files with 7 additions and 0 deletions

docs/supported_models/generative_models.md +1 -0

test/srt/models/test_generation_models.py +6 -0

--- a/docs/supported_models/generative_models.md
+++ b/docs/supported_models/generative_models.md
@@ -52,3 +52,4 @@ in the GitHub search bar.
 | **Granite 3.0, 3.1** (IBM)               | `ibm-granite/granite-3.1-8b-instruct`                          | IBM's open dense foundation models optimized for reasoning, code, and business AI use cases. Integrated with Red Hat and watsonx systems. |
 | **Granite 3.0 MoE** (IBM)               | `ibm-granite/granite-3.0-3b-a800m-instruct`                          | IBM’s Mixture-of-Experts models offering strong performance with cost-efficiency. MoE expert routing designed for enterprise deployment at scale. |
 | **Llama Nemotron Super** (v1, v1.5, NVIDIA) | `nvidia/Llama-3_3-Nemotron-Super-49B-v1`, `nvidia/Llama-3_3-Nemotron-Super-49B-v1_5` | The [NVIDIA Nemotron](https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/) family builds on the strongest open models in the ecosystem by enhancing them with greater accuracy, efficiency, and transparency using NVIDIA open synthetic datasets, advanced techniques, and tools. This enables the creation of practical, right-sized, and high-performing AI agents. |
+| **Llama Nemotron Ultra** (v1, NVIDIA) | `nvidia/Llama-3_1-Nemotron-Ultra-253B-v1` | The [NVIDIA Nemotron](https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/) family builds on the strongest open models in the ecosystem by enhancing them with greater accuracy, efficiency, and transparency using NVIDIA open synthetic datasets, advanced techniques, and tools. This enables the creation of practical, right-sized, and high-performing AI agents. |
--- a/test/srt/models/test_generation_models.py
+++ b/test/srt/models/test_generation_models.py
@@ -83,6 +83,12 @@ ALL_MODELS = [
        trust_remote_code=True,
        skip_long_prompt=True,
    ),
+    ModelCase(
+        "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1",
+        tp_size=8,
+        trust_remote_code=True,
+        skip_long_prompt=True,
+    ),
 ]

 TORCH_DTYPES = [torch.float16]
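
The new test entry requests `tp_size=8`, so it can presumably only run on a machine with at least eight GPUs. The sketch below illustrates that constraint. It is a rough approximation under stated assumptions, not the test suite's actual code: the `ModelCase` dataclass here is a hypothetical stand-in whose fields mirror the keyword arguments visible in the diff (the defaults are assumed), and `runnable_cases` and the `example-org/small-model` entry are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ModelCase:
    # Hypothetical stand-in for the test suite's ModelCase; field names
    # mirror the keyword arguments in the diff, defaults are assumed.
    model_path: str
    tp_size: int = 1
    trust_remote_code: bool = False
    skip_long_prompt: bool = False


def runnable_cases(cases, available_gpus):
    """Keep only the cases whose tensor-parallel size fits on this machine."""
    return [case for case in cases if case.tp_size <= available_gpus]


# The entry added by this commit, reproduced with the same arguments.
ultra = ModelCase(
    "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1",
    tp_size=8,
    trust_remote_code=True,
    skip_long_prompt=True,
)
# Hypothetical placeholder entry that relies on the assumed defaults.
small = ModelCase("example-org/small-model")

if __name__ == "__main__":
    # With two GPUs, only the single-GPU placeholder case is selected.
    for case in runnable_cases([ultra, small], available_gpus=2):
        print(case.model_path)
```

Filtering on `tp_size` up front is one possible way to keep a shared model list usable on smaller machines; whether the real suite filters, skips, or fails in that situation is not shown in this diff.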