OpenDAS / vllm_cscc
Commit 7e63ef82
Authored Jan 21, 2026 by zhuwenwen
Merge tag 'v0.14.0' into v0.14.0-dev
Parents: 8cbcac5d, b17039bc
Changes: 681
Showing 20 changed files with 160 additions and 0 deletions (+160 -0)
tests/evals/gsm8k/configs/moe-refactor-dp-ep/Qwen3-30B-A3B-NvFp4-ModelOpt-fi-cutedsl-deepep-ll.yaml  +8 -0
tests/evals/gsm8k/configs/moe-refactor-dp-ep/Qwen3-30B-A3B-NvFp4-ModelOpt-fi-cutlass.yaml  +8 -0
tests/evals/gsm8k/configs/moe-refactor-dp-ep/config-b200.txt  +10 -0
tests/evals/gsm8k/configs/moe-refactor/Llama-4-Scout-Fp8-CT-vllm-cutlass.yaml  +5 -0
tests/evals/gsm8k/configs/moe-refactor/Llama-4-Scout-Fp8-ModelOpt-fi-cutlass.yaml  +8 -0
tests/evals/gsm8k/configs/moe-refactor/Llama-4-Scout-Fp8-ModelOpt-fi-trtllm.yaml  +8 -0
tests/evals/gsm8k/configs/moe-refactor/Llama-4-Scout-Fp8-ModelOpt-marlin.yaml  +7 -0
tests/evals/gsm8k/configs/moe-refactor/Llama-4-Scout-Fp8-ModelOpt-triton.yaml  +5 -0
tests/evals/gsm8k/configs/moe-refactor/Mixtral-8x7B-Fp8-AutoFp8-fi-cutlass.yaml  +9 -0
tests/evals/gsm8k/configs/moe-refactor/Mixtral-8x7B-Fp8-AutoFp8-triton.yaml  +5 -0
tests/evals/gsm8k/configs/moe-refactor/Qwen3-30B-A3B-Fp8-AutoFp8-deepgemm.yaml  +8 -0
tests/evals/gsm8k/configs/moe-refactor/Qwen3-30B-A3B-Fp8-AutoFp8-fi-cutlass.yaml  +10 -0
tests/evals/gsm8k/configs/moe-refactor/Qwen3-30B-A3B-Fp8-AutoFp8-fi-trtllm.yaml  +10 -0
tests/evals/gsm8k/configs/moe-refactor/Qwen3-30B-A3B-Fp8-AutoFp8-marlin.yaml  +9 -0
tests/evals/gsm8k/configs/moe-refactor/Qwen3-30B-A3B-Fp8-AutoFp8-triton.yaml  +8 -0
tests/evals/gsm8k/configs/moe-refactor/Qwen3-30B-A3B-Fp8-CT-Block-deepgemm.yaml  +8 -0
tests/evals/gsm8k/configs/moe-refactor/Qwen3-30B-A3B-Fp8-CT-Block-fi-cutlass.yaml  +10 -0
tests/evals/gsm8k/configs/moe-refactor/Qwen3-30B-A3B-Fp8-CT-Block-marlin.yaml  +9 -0
tests/evals/gsm8k/configs/moe-refactor/Qwen3-30B-A3B-Fp8-CT-Block-triton.yaml  +8 -0
tests/evals/gsm8k/configs/moe-refactor/Qwen3-30B-A3B-Fp8-CT-Channel-marlin.yaml  +7 -0
Too many changes to show. To preserve performance, only 681 of 681+ files are displayed.
tests/evals/gsm8k/configs/moe-refactor-dp-ep/Qwen3-30B-A3B-NvFp4-ModelOpt-fi-cutedsl-deepep-ll.yaml
0 → 100644
model_name: "nvidia/Qwen3-30B-A3B-NVFP4"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --data-parallel-size 2 --enable-expert-parallel --all2all-backend deepep_low_latency"
env:
  VLLM_USE_FLASHINFER_MOE_FP4: "1"
  VLLM_FLASHINFER_MOE_BACKEND: "masked_gemm"
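Each config in this commit pairs a model with server flags and per-run environment variables. The sketch below shows one way such a config could be turned into a launch command. It is illustrative only: `build_launch` is a hypothetical helper, the dict mirrors the fields of the file above, and the assumption that the server is started with `vllm serve` is not confirmed by this diff, which does not include the test harness.

```python
import os
import shlex

# Hypothetical in-memory form of the config above (field names taken from the YAML).
config = {
    "model_name": "nvidia/Qwen3-30B-A3B-NVFP4",
    "accuracy_threshold": 0.88,
    "num_questions": 1319,
    "num_fewshot": 5,
    "server_args": (
        "--enforce-eager --max-model-len 8192 --data-parallel-size 2 "
        "--enable-expert-parallel --all2all-backend deepep_low_latency"
    ),
    "env": {
        "VLLM_USE_FLASHINFER_MOE_FP4": "1",
        "VLLM_FLASHINFER_MOE_BACKEND": "masked_gemm",
    },
}


def build_launch(cfg, base_env=None):
    """Return (argv, env) for starting a server from one config dict.

    Assumption: the server is started as `vllm serve <model> <args>`;
    the harness in the repository may do this differently.
    """
    # shlex.split tokenizes the flag string the way a POSIX shell would.
    argv = ["vllm", "serve", cfg["model_name"], *shlex.split(cfg.get("server_args", ""))]
    env = dict(os.environ if base_env is None else base_env)
    # Per-config variables override whatever is inherited.
    env.update(cfg.get("env") or {})
    return argv, env


argv, env = build_launch(config, base_env={})
print(" ".join(argv))
print(env)
```

Keeping `env` optional matters here because several configs in this commit, such as the Triton variants for Llama-4-Scout and Mixtral, define no `env` block at all.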
tests/evals/gsm8k/configs/moe-refactor-dp-ep/Qwen3-30B-A3B-NvFp4-ModelOpt-fi-cutlass.yaml
0 → 100644
model_name: "nvidia/Qwen3-30B-A3B-NVFP4"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --data-parallel-size 2 --enable-expert-parallel"
env:
  VLLM_USE_FLASHINFER_MOE_FP4: "1"
  VLLM_FLASHINFER_MOE_BACKEND: "throughput"
tests/evals/gsm8k/configs/moe-refactor-dp-ep/config-b200.txt
0 → 100644
Qwen3-30B-A3B-NvFp4-CT-fi-cutedsl-deepep-ll.yaml
Qwen3-30B-A3B-NvFp4-ModelOpt-fi-cutedsl-deepep-ll.yaml
Qwen3-30B-A3B-NvFp4-CT-fi-cutlass.yaml
Qwen3-30B-A3B-NvFp4-ModelOpt-fi-cutlass.yaml
Qwen3-30B-A3B-Fp8-AutoFp8-deepgemm-deepep-ht.yaml
Qwen3-30B-A3B-Fp8-AutoFp8-deepgemm-deepep-ll.yaml
Qwen3-30B-A3B-Fp8-AutoFp8-deepgemm.yaml
Qwen3-30B-A3B-Fp8-CT-Block-deepgemm-deepep-ht.yaml
Qwen3-30B-A3B-Fp8-CT-Block-deepgemm-deepep-ll.yaml
Qwen3-30B-A3B-Fp8-CT-Block-deepgemm.yaml
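The list file above names one config per line. A minimal sketch of how such a list could be resolved to paths follows. It assumes the names are relative to the directory that holds `config-b200.txt`, and it skips blank lines and `#` comments as a convenience; neither behavior is confirmed by this diff, and `resolve_config_list` is a hypothetical helper.

```python
from pathlib import PurePosixPath

# Directory of the list file, taken from the path shown in the diff.
CONFIG_DIR = PurePosixPath("tests/evals/gsm8k/configs/moe-refactor-dp-ep")


def resolve_config_list(list_text, config_dir=CONFIG_DIR):
    """Map each usable line of a list file to a path under config_dir."""
    paths = []
    for raw in list_text.splitlines():
        name = raw.strip()
        # Assumption: blank lines and '#' comments are ignored.
        if not name or name.startswith("#"):
            continue
        paths.append(config_dir / name)
    return paths


# Two of the ten entries from config-b200.txt, used as sample input.
list_text = """\
Qwen3-30B-A3B-NvFp4-ModelOpt-fi-cutedsl-deepep-ll.yaml
Qwen3-30B-A3B-NvFp4-ModelOpt-fi-cutlass.yaml
"""

paths = resolve_config_list(list_text)
for p in paths:
    print(p)
```

Only two of the ten listed configs appear among the files shown on this page of the diff; the others are presumably among the changes displayed on other pages.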
tests/evals/gsm8k/configs/moe-refactor/Llama-4-Scout-Fp8-CT-vllm-cutlass.yaml
0 → 100644
model_name: "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic"
accuracy_threshold: 0.92
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
tests/evals/gsm8k/configs/moe-refactor/Llama-4-Scout-Fp8-ModelOpt-fi-cutlass.yaml
0 → 100644
model_name: "nvidia/Llama-4-Scout-17B-16E-Instruct-FP8"
accuracy_threshold: 0.92
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_USE_FLASHINFER_MOE_FP8: "1"
  VLLM_FLASHINFER_MOE_BACKEND: "throughput"
tests/evals/gsm8k/configs/moe-refactor/Llama-4-Scout-Fp8-ModelOpt-fi-trtllm.yaml
0 → 100644
model_name: "nvidia/Llama-4-Scout-17B-16E-Instruct-FP8"
accuracy_threshold: 0.92
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_USE_FLASHINFER_MOE_FP8: "1"
  VLLM_FLASHINFER_MOE_BACKEND: "latency"
tests/evals/gsm8k/configs/moe-refactor/Llama-4-Scout-Fp8-ModelOpt-marlin.yaml
0 → 100644
model_name: "nvidia/Llama-4-Scout-17B-16E-Instruct-FP8"
accuracy_threshold: 0.92
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_TEST_FORCE_FP8_MARLIN: "1"
tests/evals/gsm8k/configs/moe-refactor/Llama-4-Scout-Fp8-ModelOpt-triton.yaml
0 → 100644
model_name: "nvidia/Llama-4-Scout-17B-16E-Instruct-FP8"
accuracy_threshold: 0.92
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
tests/evals/gsm8k/configs/moe-refactor/Mixtral-8x7B-Fp8-AutoFp8-fi-cutlass.yaml
0 → 100644
# TODO(rob): enable
# model_name: "amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV"
# accuracy_threshold: 0.62
# num_questions: 1319
# num_fewshot: 5
# server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
# env:
#   VLLM_USE_FLASHINFER_MOE_FP8: "1"
#   VLLM_FLASHINFER_MOE_BACKEND: "throughput"
tests/evals/gsm8k/configs/moe-refactor/Mixtral-8x7B-Fp8-AutoFp8-triton.yaml
0 → 100644
model_name: "amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV"
accuracy_threshold: 0.62
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
tests/evals/gsm8k/configs/moe-refactor/Qwen3-30B-A3B-Fp8-AutoFp8-deepgemm.yaml
0 → 100644
model_name: "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_USE_DEEP_GEMM: "1"
  VLLM_USE_DEEP_GEMM_MOE: "1"
tests/evals/gsm8k/configs/moe-refactor/Qwen3-30B-A3B-Fp8-AutoFp8-fi-cutlass.yaml
0 → 100644
model_name: "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_USE_DEEP_GEMM: "0"
  VLLM_USE_DEEP_GEMM_MOE: "0"
  VLLM_USE_FLASHINFER_MOE_FP8: "1"
  VLLM_FLASHINFER_MOE_BACKEND: "throughput"
tests/evals/gsm8k/configs/moe-refactor/Qwen3-30B-A3B-Fp8-AutoFp8-fi-trtllm.yaml
0 → 100644
model_name: "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_USE_DEEP_GEMM: "0"
  VLLM_USE_DEEP_GEMM_MOE: "0"
  VLLM_USE_FLASHINFER_MOE_FP8: "1"
  VLLM_FLASHINFER_MOE_BACKEND: "latency"
tests/evals/gsm8k/configs/moe-refactor/Qwen3-30B-A3B-Fp8-AutoFp8-marlin.yaml
0 → 100644
model_name: "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_USE_DEEP_GEMM: "0"
  VLLM_USE_DEEP_GEMM_MOE: "0"
  VLLM_TEST_FORCE_FP8_MARLIN: "1"
tests/evals/gsm8k/configs/moe-refactor/Qwen3-30B-A3B-Fp8-AutoFp8-triton.yaml
0 → 100644
model_name: "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_USE_DEEP_GEMM: "0"
  VLLM_USE_DEEP_GEMM_MOE: "0"
tests/evals/gsm8k/configs/moe-refactor/Qwen3-30B-A3B-Fp8-CT-Block-deepgemm.yaml
0 → 100644
model_name: "RedHatAI/Qwen3-30B-A3B-FP8-block"
accuracy_threshold: 0.85
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_USE_DEEP_GEMM: "1"
  VLLM_USE_DEEP_GEMM_MOE: "1"
tests/evals/gsm8k/configs/moe-refactor/Qwen3-30B-A3B-Fp8-CT-Block-fi-cutlass.yaml
0 → 100644
model_name: "RedHatAI/Qwen3-30B-A3B-FP8-block"
accuracy_threshold: 0.85
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_USE_DEEP_GEMM: "0"
  VLLM_USE_DEEP_GEMM_MOE: "0"
  VLLM_USE_FLASHINFER_MOE_FP8: "1"
  VLLM_FLASHINFER_MOE_BACKEND: "throughput"
tests/evals/gsm8k/configs/moe-refactor/Qwen3-30B-A3B-Fp8-CT-Block-marlin.yaml
0 → 100644
model_name: "RedHatAI/Qwen3-30B-A3B-FP8-block"
accuracy_threshold: 0.85
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_USE_DEEP_GEMM: "0"
  VLLM_USE_DEEP_GEMM_MOE: "0"
  VLLM_TEST_FORCE_FP8_MARLIN: "1"
tests/evals/gsm8k/configs/moe-refactor/Qwen3-30B-A3B-Fp8-CT-Block-triton.yaml
0 → 100644
model_name: "RedHatAI/Qwen3-30B-A3B-FP8-block"
accuracy_threshold: 0.85
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_USE_DEEP_GEMM: "0"
  VLLM_USE_DEEP_GEMM_MOE: "0"
tests/evals/gsm8k/configs/moe-refactor/Qwen3-30B-A3B-Fp8-CT-Channel-marlin.yaml
0 → 100644
model_name: "RedHatAI/Qwen3-30B-A3B-FP8-dynamic"
accuracy_threshold: 0.85
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 2"
env:
  VLLM_TEST_FORCE_FP8_MARLIN: "1"
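Every config carries an `accuracy_threshold` alongside `num_questions`. The sketch below works through what a threshold of 0.85 over 1319 questions would mean in absolute terms. It assumes the threshold is a plain lower bound on the fraction of correct answers with no tolerance applied; the actual comparison used by the test harness is not part of this diff, and both helper names are hypothetical.

```python
import math


def meets_threshold(num_correct, num_questions, accuracy_threshold):
    """Return (accuracy, passed), treating the threshold as a lower bound.

    Assumption: the check is `accuracy >= accuracy_threshold` with no
    tolerance; the real harness may compare differently.
    """
    if num_questions <= 0:
        raise ValueError("num_questions must be positive")
    accuracy = num_correct / num_questions
    return accuracy, accuracy >= accuracy_threshold


def min_correct_needed(num_questions, accuracy_threshold):
    """Smallest number of correct answers that satisfies the lower bound."""
    return math.ceil(accuracy_threshold * num_questions)


# Values from the Qwen3-30B-A3B-FP8-dynamic config above.
needed = min_correct_needed(1319, 0.85)
print(needed)
print(meets_threshold(needed, 1319, 0.85))
print(meets_threshold(needed - 1, 1319, 0.85))
```

Under this assumption, a single question can decide the outcome near the boundary, which is one reason to evaluate on the full question set rather than a sample.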