raojy / vllm_017 · Commits · 3b50924c

Commit 3b50924c, authored Mar 27, 2026 by raojy
    raw_vllm
Parent: fbeb8a6f
Pipeline #3455: canceled with stages
Changes: 144 · Pipelines: 1
Showing 20 changed files with 247 additions and 0 deletions (+247 −0). All files are new YAML configs under .buildkite/lm-eval-harness/configs/:

- Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml (+12 −0)
- Meta-Llama-3-8B-Instruct-INT8-compressed-tensors.yaml (+12 −0)
- Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml (+12 −0)
- Meta-Llama-3-8B-Instruct.yaml (+12 −0)
- Meta-Llama-3-8B-QQQ.yaml (+12 −0)
- Meta-Llama-3.2-1B-Instruct-FP8-compressed-tensors.yaml (+11 −0)
- Meta-Llama-3.2-1B-Instruct-INT8-compressed-tensors.yaml (+12 −0)
- Meta-Llama-4-Maverick-17B-128E-Instruct-FP8-MM.yaml (+12 −0)
- Meta-Llama-4-Maverick-17B-128E-Instruct-FP8.yaml (+11 −0)
- Minitron-4B-Base-FP8.yaml (+12 −0)
- Mixtral-8x22B-Instruct-v0.1-FP8-Dynamic.yaml (+12 −0)
- Mixtral-8x7B-Instruct-v0.1-FP8.yaml (+12 −0)
- Mixtral-8x7B-Instruct-v0.1.yaml (+12 −0)
- NVIDIA-Nemotron-3-Nano-30B-A3B-BF16.yaml (+15 −0)
- NVIDIA-Nemotron-3-Nano-30B-A3B-FP8.yaml (+19 −0)
- Qwen1.5-MoE-W4A16-compressed-tensors.yaml (+12 −0)
- Qwen2-1.5B-Instruct-FP8W8.yaml (+12 −0)
- Qwen2-1.5B-Instruct-INT8-compressed-tensors.yaml (+12 −0)
- Qwen2-57B-A14-Instruct.yaml (+12 −0)
- Qwen2.5-1.5B-Instruct.yaml (+11 −0)
.buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml (new file, mode 100644)

# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test -b "auto" -l 250 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.764
  - name: "exact_match,flexible-extract"
    value: 0.764
limit: 250
num_fewshot: 5
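These configs all follow the same schema: a model, a task list, and per-metric reference scores. The sketch below shows how a measured lm-eval score could be validated against the reference values above; the helper name, the default tolerance, and the comparison rule are illustrative assumptions, not the harness's actual logic.

```python
# Hypothetical checker: compare measured lm-eval scores against the
# reference values from a config like the one above.
DEFAULT_RTOL = 0.05  # assumed default; some configs set `rtol` explicitly

# Reference scores from the asym INT8 config above.
expected = {
    "exact_match,strict-match": 0.764,
    "exact_match,flexible-extract": 0.764,
}

def within_tolerance(measured: float, reference: float,
                     rtol: float = DEFAULT_RTOL) -> bool:
    """True if `measured` deviates from `reference` by at most rtol * reference."""
    return abs(measured - reference) <= rtol * reference

# Example: scores produced by a hypothetical evaluation run.
measured = {"exact_match,strict-match": 0.760,
            "exact_match,flexible-extract": 0.772}
passed = all(within_tolerance(measured[m], ref) for m, ref in expected.items())
```

A relative (rather than absolute) tolerance matches how the one config below that sets `rtol: 0.05` reads, and keeps the allowed drift proportional to the score's magnitude.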
.buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct-INT8-compressed-tensors.yaml (new file, mode 100644)

# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test -b "auto" -l 250 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.728
  - name: "exact_match,flexible-extract"
    value: 0.728
limit: 250
num_fewshot: 5
.buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml (new file, mode 100644)

# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-nonuniform-test -b auto -l 1000 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-nonuniform-test"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.758
  - name: "exact_match,flexible-extract"
    value: 0.759
limit: 1000
num_fewshot: 5
.buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct.yaml (new file, mode 100644)

# For hf script, without -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-8B-Instruct -b 32 -l 250 -f 5
model_name: "meta-llama/Meta-Llama-3-8B-Instruct"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.756
  - name: "exact_match,flexible-extract"
    value: 0.752
limit: 250
num_fewshot: 5
.buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-QQQ.yaml (new file, mode 100644)

# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m HandH1998/QQQ-Llama-3-8b-g128 -b 32 -l 1000 -f 5 -t 1
model_name: "HandH1998/QQQ-Llama-3-8b-g128"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.419
  - name: "exact_match,flexible-extract"
    value: 0.416
limit: 1000
num_fewshot: 5
.buildkite/lm-eval-harness/configs/Meta-Llama-3.2-1B-Instruct-FP8-compressed-tensors.yaml (new file, mode 100644)

# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m RedHatAI/Llama-3.2-1B-Instruct-FP8 -b "auto" -l 1319 -f 5 -t 1
model_name: "RedHatAI/Llama-3.2-1B-Instruct-FP8"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.335
  - name: "exact_match,flexible-extract"
    value: 0.323
limit: 1319
num_fewshot: 5
.buildkite/lm-eval-harness/configs/Meta-Llama-3.2-1B-Instruct-INT8-compressed-tensors.yaml (new file, mode 100644)

# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8 -b "auto" -l 1000 -f 5 -t 1
model_name: "neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.356
  - name: "exact_match,flexible-extract"
    value: 0.358
limit: 1000
num_fewshot: 5
.buildkite/lm-eval-harness/configs/Meta-Llama-4-Maverick-17B-128E-Instruct-FP8-MM.yaml (new file, mode 100644)

# For hf script, without -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-chartqa-vllm-vlm-baseline.sh -m meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 -l 100 -t 8
model_name: "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"
backend: "vllm-vlm"
tasks:
- name: "chartqa"
  metrics:
  - name: "relaxed_accuracy,none"
    # TODO(zhewenl): model card is 0.90, but the actual score is 0.80.
    value: 0.80
limit: 100
num_fewshot: 0
.buildkite/lm-eval-harness/configs/Meta-Llama-4-Maverick-17B-128E-Instruct-FP8.yaml (new file, mode 100644)

# For hf script, without -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-mmlupro-vllm-baseline.sh -m meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 -l 250 -t 8 -f 5
model_name: "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"
tasks:
- name: "mmlu_pro"
  metrics:
  - name: "exact_match,custom-extract"
    value: 0.80
limit: 250 # will run on 250 * 14 subjects = 3500 samples
num_fewshot: 5
rtol: 0.05
.buildkite/lm-eval-harness/configs/Minitron-4B-Base-FP8.yaml (new file, mode 100644)

# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m mgoin/Minitron-4B-Base-FP8 -b auto -l 1000 -f 5 -t 1
model_name: "mgoin/Minitron-4B-Base-FP8"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.231
  - name: "exact_match,flexible-extract"
    value: 0.22
limit: 1000
num_fewshot: 5
.buildkite/lm-eval-harness/configs/Mixtral-8x22B-Instruct-v0.1-FP8-Dynamic.yaml (new file, mode 100644)

# For vllm script, with -t option (tensor parallel size).
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8-dynamic -b "auto" -l 250 -f 5 -t 8
model_name: "neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8-dynamic"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.86
  - name: "exact_match,flexible-extract"
    value: 0.86
limit: 250
num_fewshot: 5
.buildkite/lm-eval-harness/configs/Mixtral-8x7B-Instruct-v0.1-FP8.yaml (new file, mode 100644)

# For vllm script, with -t option (tensor parallel size).
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8 -b "auto" -l 250 -f 5 -t 4
model_name: "neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.624
  - name: "exact_match,flexible-extract"
    value: 0.624
limit: 250
num_fewshot: 5
.buildkite/lm-eval-harness/configs/Mixtral-8x7B-Instruct-v0.1.yaml (new file, mode 100644)

# For hf script, without -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1 -b 32 -l 250 -f 5
model_name: "mistralai/Mixtral-8x7B-Instruct-v0.1"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.616
  - name: "exact_match,flexible-extract"
    value: 0.632
limit: 250
num_fewshot: 5
.buildkite/lm-eval-harness/configs/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16.yaml (new file, mode 100644)

model_name: "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.695
  - name: "exact_match,flexible-extract"
    value: 0.447
limit: 1319
num_fewshot: 5
max_model_len: 262144
enforce_eager: false
apply_chat_template: true
fewshot_as_multiturn: true
trust_remote_code: true
\ No newline at end of file
.buildkite/lm-eval-harness/configs/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8.yaml (new file, mode 100644)

model_name: "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.7142
  - name: "exact_match,flexible-extract"
    value: 0.4579
env_vars:
  VLLM_USE_FLASHINFER_MOE_FP8: "1"
  VLLM_FLASHINFER_MOE_BACKEND: "throughput"
limit: 1319
num_fewshot: 5
max_model_len: 262144
kv_cache_dtype: fp8
enforce_eager: false
apply_chat_template: true
fewshot_as_multiturn: true
trust_remote_code: true
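Unlike the other configs, the FP8 Nemotron file carries an `env_vars` mapping, which presumably has to be exported into the process environment before vLLM is launched. A minimal sketch under that assumption; `apply_env_vars` is a hypothetical helper name, and the dict mirrors the parsed YAML rather than reading the file from disk:

```python
import os

# Parsed form of the env_vars block from the config above (in the real
# harness this would presumably come from a YAML loader).
config = {
    "env_vars": {
        "VLLM_USE_FLASHINFER_MOE_FP8": "1",
        "VLLM_FLASHINFER_MOE_BACKEND": "throughput",
    },
}

def apply_env_vars(cfg: dict) -> None:
    """Export any env_vars declared in the config before starting the server."""
    for key, value in (cfg.get("env_vars") or {}).items():
        os.environ[key] = str(value)

apply_env_vars(config)
```

Exporting the variables in the parent process means they are inherited by any vLLM subprocess the runner spawns, which is why they live in the config rather than in the launch script.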
.buildkite/lm-eval-harness/configs/Qwen1.5-MoE-W4A16-compressed-tensors.yaml (new file, mode 100644)

# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen1.5-MoE-A2.7B-Chat-quantized.w4a16 -b auto -l 1319 -f 5 -t 1
model_name: "nm-testing/Qwen1.5-MoE-A2.7B-Chat-quantized.w4a16"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.30
  - name: "exact_match,flexible-extract"
    value: 0.465
limit: 1319
num_fewshot: 5
.buildkite/lm-eval-harness/configs/Qwen2-1.5B-Instruct-FP8W8.yaml (new file, mode 100644)

# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen2-1.5B-Instruct-FP8W8 -b auto -l 1000 -f 5 -t 1
model_name: "nm-testing/Qwen2-1.5B-Instruct-FP8W8"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.578
  - name: "exact_match,flexible-extract"
    value: 0.585
limit: 1000
num_fewshot: 5
.buildkite/lm-eval-harness/configs/Qwen2-1.5B-Instruct-INT8-compressed-tensors.yaml (new file, mode 100644)

# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Qwen2-1.5B-Instruct-quantized.w8a8 -b "auto" -l 1000 -f 5 -t 1
model_name: "neuralmagic/Qwen2-1.5B-Instruct-quantized.w8a8"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.593
  - name: "exact_match,flexible-extract"
    value: 0.588
limit: 1000
num_fewshot: 5
.buildkite/lm-eval-harness/configs/Qwen2-57B-A14-Instruct.yaml (new file, mode 100644)

# For vllm script, with -t option (tensor parallel size).
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m Qwen/Qwen2-57B-A14B-Instruct -b "auto" -l 250 -f 5 -t 4
model_name: "Qwen/Qwen2-57B-A14B-Instruct"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.792
  - name: "exact_match,flexible-extract"
    value: 0.824
limit: 250
num_fewshot: 5
.buildkite/lm-eval-harness/configs/Qwen2.5-1.5B-Instruct.yaml (new file, mode 100644)

# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m Qwen/Qwen2.5-1.5B-Instruct -b auto -l 1319 -f 5 -t 1
model_name: "Qwen/Qwen2.5-1.5B-Instruct"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.54
  - name: "exact_match,flexible-extract"
    value: 0.59
limit: 1319
num_fewshot: 5