eval_log.log 9.46 KB
2026-04-14 11:48:52 - evalscope - INFO: Running with native backend
2026-04-14 11:48:52 - evalscope - INFO: Dump task config to ./evalscope-data/configs/task_config.yaml
2026-04-14 11:48:52 - evalscope - INFO: {
    "model": "text_generation",
    "model_id": "qwen3-8B",
    "model_args": {},
    "model_task": "text_generation",
    "chat_template": null,
    "datasets": [
        "math_500"
    ],
    "dataset_args": {
        "math_500": {
            "name": "math_500",
            "dataset_id": "AI-ModelScope/MATH-500",
            "output_types": [
                "generation"
            ],
            "subset_list": [
                "Level 1",
                "Level 2",
                "Level 3",
                "Level 4",
                "Level 5"
            ],
            "default_subset": "default",
            "few_shot_num": 0,
            "few_shot_random": false,
            "train_split": null,
            "eval_split": "test",
            "prompt_template": "{question}\nPlease reason step by step, and put your final answer within \\boxed{{}}.",
            "few_shot_prompt_template": null,
            "system_prompt": null,
            "query_template": null,
            "pretty_name": "MATH-500",
            "description": "\n## Overview\n\nMATH-500 is a curated subset of 500 problems from the MATH benchmark, designed to evaluate the mathematical reasoning capabilities of language models. It covers five difficulty levels across various mathematical topics including algebra, geometry, number theory, and calculus.\n\n## Task Description\n\n- **Task Type**: Mathematical Problem Solving\n- **Input**: Mathematical problem statement\n- **Output**: Step-by-step solution with final numerical answer\n- **Difficulty Levels**: Level 1 (easiest) to Level 5 (hardest)\n\n## Key Features\n\n- 500 carefully selected problems from the full MATH dataset\n- Five difficulty levels for fine-grained evaluation\n- Problems cover algebra, geometry, number theory, probability, and more\n- Each problem includes a reference solution\n- Designed for efficient yet comprehensive math evaluation\n\n## Evaluation Notes\n\n- Default configuration uses **0-shot** evaluation\n- Answers should be formatted within `\\boxed{}` for proper extraction\n- Numeric equivalence checking for answer comparison\n- Results can be broken down by difficulty level\n- Commonly used for math reasoning benchmarking due to manageable size\n",
            "paper_url": null,
            "data_statistics": null,
            "sample_example": null,
            "tags": [
                "Math",
                "Reasoning"
            ],
            "filters": null,
            "metric_list": [
                {
                    "acc": {
                        "numeric": true
                    }
                }
            ],
            "aggregation": "mean",
            "shuffle": false,
            "shuffle_choices": false,
            "force_redownload": false,
            "review_timeout": null,
            "extra_params": {},
            "sandbox_config": {}
        }
    },
    "dataset_dir": "/root/.cache/modelscope/hub/datasets",
    "dataset_hub": "modelscope",
    "repeats": 1,
    "generation_config": {
        "batch_size": 1
    },
    "eval_type": "mock_llm",
    "eval_backend": "Native",
    "eval_config": null,
    "limit": null,
    "eval_batch_size": 1,
    "use_cache": "./evalscope-data",
    "rerun_review": true,
    "work_dir": "./evalscope-data",
    "no_timestamp": true,
    "enable_progress_tracker": false,
    "ignore_errors": false,
    "debug": false,
    "seed": 42,
    "api_url": null,
    "timeout": null,
    "stream": null,
    "judge_strategy": "auto",
    "judge_worker_num": 1,
    "judge_model_args": {},
    "analysis_report": false,
    "use_sandbox": false,
    "sandbox_type": "docker",
    "sandbox_manager_config": {},
    "evalscope_version": "1.5.2.post1"
}
2026-04-14 11:48:52 - evalscope - INFO: Start loading benchmark dataset: math_500
2026-04-14 11:48:53 - evalscope - INFO: Start evaluating 5 subsets of math_500: ['Level 1', 'Level 2', 'Level 3', 'Level 4', 'Level 5']
2026-04-14 11:48:53 - evalscope - INFO: Reusing predictions from ./evalscope-data/predictions/qwen3-8B/math_500_Level 1.jsonl, got 43 predictions, remaining 43 samples
2026-04-14 11:48:53 - evalscope - WARNING: [Rerun review mode] Skipping 43 samples in subset 'Level 1' due to missing cached predictions. They will NOT be inferred.
2026-04-14 11:48:53 - evalscope - INFO: Reusing predictions from ./evalscope-data/predictions/qwen3-8B/math_500_Level 2.jsonl, got 90 predictions, remaining 90 samples
2026-04-14 11:48:53 - evalscope - WARNING: [Rerun review mode] Skipping 90 samples in subset 'Level 2' due to missing cached predictions. They will NOT be inferred.
2026-04-14 11:48:53 - evalscope - INFO: Reusing predictions from ./evalscope-data/predictions/qwen3-8B/math_500_Level 3.jsonl, got 105 predictions, remaining 105 samples
2026-04-14 11:48:53 - evalscope - WARNING: [Rerun review mode] Skipping 105 samples in subset 'Level 3' due to missing cached predictions. They will NOT be inferred.
2026-04-14 11:48:53 - evalscope - INFO: Reusing predictions from ./evalscope-data/predictions/qwen3-8B/math_500_Level 4.jsonl, got 128 predictions, remaining 128 samples
2026-04-14 11:48:53 - evalscope - WARNING: [Rerun review mode] Skipping 128 samples in subset 'Level 4' due to missing cached predictions. They will NOT be inferred.
2026-04-14 11:48:53 - evalscope - INFO: Reusing predictions from ./evalscope-data/predictions/qwen3-8B/math_500_Level 5.jsonl, got 134 predictions, remaining 134 samples
2026-04-14 11:48:53 - evalscope - WARNING: [Rerun review mode] Skipping 134 samples in subset 'Level 5' due to missing cached predictions. They will NOT be inferred.
2026-04-14 11:48:53 - evalscope - INFO: Unified pool: 500 items to process, 0 already fully cached (500 total across all subsets).
2026-04-14 11:48:57 - evalscope - INFO: Evaluating[math_500] 100%| 500/500 [Elapsed: 00:03 < Remaining: 00:00, 96.00it/s]
2026-04-14 11:48:57 - evalscope - INFO: Unified pool finished for math_500.
2026-04-14 11:48:57 - evalscope - INFO: Aggregating scores for subset: Level 1
2026-04-14 11:48:57 - evalscope - INFO: Aggregating scores for subset: Level 2
2026-04-14 11:48:57 - evalscope - INFO: Aggregating scores for subset: Level 3
2026-04-14 11:48:57 - evalscope - INFO: Aggregating scores for subset: Level 4
2026-04-14 11:48:57 - evalscope - INFO: Aggregating scores for subset: Level 5
2026-04-14 11:48:57 - evalscope - INFO: Generating report...
2026-04-14 11:48:57 - evalscope - INFO: 
math_500 report table:
+----------+-----------+----------+----------+-------+---------+---------+
| Model    | Dataset   | Metric   | Subset   |   Num |   Score | Cat.0   |
+==========+===========+==========+==========+=======+=========+=========+
| qwen3-8B | math_500  | mean_acc | Level 1  |    43 |  0.9535 | default |
+----------+-----------+----------+----------+-------+---------+---------+
| qwen3-8B | math_500  | mean_acc | Level 2  |    90 |  0.9778 | default |
+----------+-----------+----------+----------+-------+---------+---------+
| qwen3-8B | math_500  | mean_acc | Level 3  |   105 |  0.9524 | default |
+----------+-----------+----------+----------+-------+---------+---------+
| qwen3-8B | math_500  | mean_acc | Level 4  |   128 |  0.9453 | default |
+----------+-----------+----------+----------+-------+---------+---------+
| qwen3-8B | math_500  | mean_acc | Level 5  |   134 |  0.8881 | default |
+----------+-----------+----------+----------+-------+---------+---------+
| qwen3-8B | math_500  | mean_acc | OVERALL  |   500 |  0.938  | -       |
+----------+-----------+----------+----------+-------+---------+---------+ 

2026-04-14 11:48:57 - evalscope - INFO: Skipping report analysis (`analysis_report=False`).
2026-04-14 11:48:57 - evalscope - INFO: Dump report to: ./evalscope-data/reports/qwen3-8B/math_500.json 

2026-04-14 11:48:57 - evalscope - INFO: Benchmark math_500 evaluation finished.
2026-04-14 11:48:57 - evalscope - INFO: Running[eval] 100%| 1/1 [Elapsed: 00:04 < Remaining: 00:00,  4.20s/benchmark]
2026-04-14 11:48:57 - evalscope - INFO: Overall report table: 
+----------+-----------+----------+----------+-------+---------+---------+
| Model    | Dataset   | Metric   | Subset   |   Num |   Score | Cat.0   |
+==========+===========+==========+==========+=======+=========+=========+
| qwen3-8B | math_500  | mean_acc | Level 1  |    43 |  0.9535 | default |
+----------+-----------+----------+----------+-------+---------+---------+
| qwen3-8B | math_500  | mean_acc | Level 2  |    90 |  0.9778 | default |
+----------+-----------+----------+----------+-------+---------+---------+
| qwen3-8B | math_500  | mean_acc | Level 3  |   105 |  0.9524 | default |
+----------+-----------+----------+----------+-------+---------+---------+
| qwen3-8B | math_500  | mean_acc | Level 4  |   128 |  0.9453 | default |
+----------+-----------+----------+----------+-------+---------+---------+
| qwen3-8B | math_500  | mean_acc | Level 5  |   134 |  0.8881 | default |
+----------+-----------+----------+----------+-------+---------+---------+
| qwen3-8B | math_500  | mean_acc | OVERALL  |   500 |  0.938  | -       |
+----------+-----------+----------+----------+-------+---------+---------+ 

2026-04-14 11:48:57 - evalscope - INFO: HTML report generated: /data1/sunzhq/tmp/llm-benchmarks/tools/evalscope-data/reports/report.html
2026-04-14 11:48:58 - evalscope - INFO: Finished evaluation for qwen3-8B on ['math_500']
2026-04-14 11:48:58 - evalscope - INFO: Output directory: ./evalscope-data