# mpt-7b

## mpt-7b_arithmetic_5-shot.json
|     Task     |Version|Metric|Value|   |Stderr|
|--------------|------:|------|----:|---|-----:|
|arithmetic_1dc|      0|acc   | 8.10|±  |  0.61|
|arithmetic_2da|      0|acc   |91.80|±  |  0.61|
|arithmetic_2dm|      0|acc   |25.60|±  |  0.98|
|arithmetic_2ds|      0|acc   |78.75|±  |  0.91|
|arithmetic_3da|      0|acc   |29.15|±  |  1.02|
|arithmetic_3ds|      0|acc   |42.80|±  |  1.11|
|arithmetic_4da|      0|acc   | 2.60|±  |  0.36|
|arithmetic_4ds|      0|acc   | 2.60|±  |  0.36|
|arithmetic_5da|      0|acc   | 0.45|±  |  0.15|
|arithmetic_5ds|      0|acc   | 0.20|±  |  0.10|

## mpt-7b_bbh_3-shot.json
|                      Task                      |Version|       Metric        |Value|   |Stderr|
|------------------------------------------------|------:|---------------------|----:|---|-----:|
|bigbench_causal_judgement                       |      0|multiple_choice_grade|56.32|±  |  3.61|
|bigbench_date_understanding                     |      0|multiple_choice_grade|58.27|±  |  2.57|
|bigbench_disambiguation_qa                      |      0|multiple_choice_grade|36.43|±  |  3.00|
|bigbench_dyck_languages                         |      0|multiple_choice_grade|12.30|±  |  1.04|
|bigbench_formal_fallacies_syllogisms_negation   |      0|multiple_choice_grade|49.92|±  |  0.42|
|bigbench_geometric_shapes                       |      0|multiple_choice_grade|20.33|±  |  2.13|
|                                                |       |exact_str_match      |12.26|±  |  1.73|
|bigbench_hyperbaton                             |      0|multiple_choice_grade|49.36|±  |  0.22|
|bigbench_logical_deduction_five_objects         |      0|multiple_choice_grade|24.00|±  |  1.91|
|bigbench_logical_deduction_seven_objects        |      0|multiple_choice_grade|16.57|±  |  1.41|
|bigbench_logical_deduction_three_objects        |      0|multiple_choice_grade|38.67|±  |  2.82|
|bigbench_movie_recommendation                   |      0|multiple_choice_grade|43.80|±  |  2.22|
|bigbench_navigate                               |      0|multiple_choice_grade|48.60|±  |  1.58|
|bigbench_reasoning_about_colored_objects        |      0|multiple_choice_grade|29.85|±  |  1.02|
|bigbench_ruin_names                             |      0|multiple_choice_grade|29.69|±  |  2.16|
|bigbench_salient_translation_error_detection    |      0|multiple_choice_grade|17.94|±  |  1.22|
|bigbench_snarks                                 |      0|multiple_choice_grade|53.04|±  |  3.72|
|bigbench_sports_understanding                   |      0|multiple_choice_grade|49.49|±  |  1.59|
|bigbench_temporal_sequences                     |      0|multiple_choice_grade|29.60|±  |  1.44|
|bigbench_tracking_shuffled_objects_five_objects |      0|multiple_choice_grade|19.44|±  |  1.12|
|bigbench_tracking_shuffled_objects_seven_objects|      0|multiple_choice_grade|13.43|±  |  0.82|
|bigbench_tracking_shuffled_objects_three_objects|      0|multiple_choice_grade|38.67|±  |  2.82|

## mpt-7b_common_sense_reasoning_0-shot.json
|    Task     |Version| Metric |Value|   |Stderr|
|-------------|------:|--------|----:|---|-----:|
|arc_challenge|      0|acc     |40.61|±  |  1.44|
|             |       |acc_norm|41.81|±  |  1.44|
|arc_easy     |      0|acc     |74.87|±  |  0.89|
|             |       |acc_norm|70.29|±  |  0.94|
|boolq        |      1|acc     |73.52|±  |  0.77|
|copa         |      0|acc     |85.00|±  |  3.59|
|hellaswag    |      0|acc     |57.24|±  |  0.49|
|             |       |acc_norm|76.12|±  |  0.43|
|mc_taco      |      0|em      |13.51|   |      |
|             |       |f1      |45.48|   |      |
|openbookqa   |      0|acc     |32.00|±  |  2.09|
|             |       |acc_norm|42.60|±  |  2.21|
|piqa         |      0|acc     |79.16|±  |  0.95|
|             |       |acc_norm|80.41|±  |  0.93|
|prost        |      0|acc     |25.73|±  |  0.32|
|             |       |acc_norm|30.12|±  |  0.34|
|swag         |      0|acc     |56.17|±  |  0.35|
|             |       |acc_norm|75.80|±  |  0.30|
|winogrande   |      0|acc     |68.67|±  |  1.30|
|wsc273       |      0|acc     |85.71|±  |  2.12|

## mpt-7b_glue_0-shot.json
|     Task      |Version|Metric|Value|   |Stderr|
|---------------|------:|------|----:|---|-----:|
|cola           |      0|mcc   |-4.41|±  |  3.12|
|mnli           |      0|acc   |37.83|±  |  0.49|
|mnli_mismatched|      0|acc   |37.49|±  |  0.49|
|mrpc           |      0|acc   |62.99|±  |  2.39|
|               |       |f1    |75.61|±  |  1.93|
|qnli           |      0|acc   |51.35|±  |  0.68|
|qqp            |      0|acc   |50.36|±  |  0.25|
|               |       |f1    |54.14|±  |  0.29|
|rte            |      0|acc   |63.90|±  |  2.89|
|sst            |      0|acc   |76.83|±  |  1.43|
|wnli           |      1|acc   |47.89|±  |  5.97|

## mpt-7b_lambada_0-shot.json
|         Task         |Version|Metric|Value |   |Stderr|
|----------------------|------:|------|-----:|---|-----:|
|lambada_openai        |      0|ppl   |  3.87|±  |  0.08|
|                      |       |acc   | 68.35|±  |  0.65|
|lambada_openai_cloze  |      0|ppl   | 26.56|±  |  0.70|
|                      |       |acc   | 39.65|±  |  0.68|
|lambada_openai_mt_de  |      0|ppl   | 70.12|±  |  4.04|
|                      |       |acc   | 33.77|±  |  0.66|
|lambada_openai_mt_en  |      0|ppl   |  3.87|±  |  0.08|
|                      |       |acc   | 68.35|±  |  0.65|
|lambada_openai_mt_es  |      0|ppl   | 67.23|±  |  3.69|
|                      |       |acc   | 36.95|±  |  0.67|
|lambada_openai_mt_fr  |      0|ppl   | 42.93|±  |  2.37|
|                      |       |acc   | 43.02|±  |  0.69|
|lambada_openai_mt_it  |      0|ppl   | 65.76|±  |  3.87|
|                      |       |acc   | 39.20|±  |  0.68|
|lambada_standard      |      0|ppl   |  4.92|±  |  0.11|
|                      |       |acc   | 61.91|±  |  0.68|
|lambada_standard_cloze|      0|ppl   |109.10|±  |  3.04|
|                      |       |acc   | 16.75|±  |  0.52|

## mpt-7b_superglue_0-shot.json
| Task  |Version|Metric|Value|   |Stderr|
|-------|------:|------|----:|---|-----:|
|boolq  |      1|acc   |73.82|±  |  0.77|
|cb     |      1|acc   |41.07|±  |  6.63|
|       |       |f1    |21.27|   |      |
|copa   |      0|acc   |84.00|±  |  3.68|
|multirc|      1|acc   | 0.84|±  |  0.30|
|record |      0|f1    |90.10|±  |  0.29|
|       |       |em    |89.30|±  |  0.31|
|wic    |      0|acc   |48.43|±  |  1.98|
|wsc    |      0|acc   |63.46|±  |  4.74|
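The tables above are rendered from lm-evaluation-harness JSON result files (the `*.json` names in the headings). As a minimal sketch, assuming the harness's standard layout of a `results` dict (fractional metric values with `<metric>_stderr` companions) and a `versions` dict, one such file can be turned into a markdown table like so; `results_to_markdown` is an illustrative helper, not part of the harness:

```python
import json

def results_to_markdown(results_json: str) -> str:
    """Render an lm-evaluation-harness results JSON string as a markdown table.

    Assumes the standard layout: data["results"] maps task name -> metrics
    (fractions, with "<metric>_stderr" companions) and data["versions"] maps
    task name -> task version. Values are scaled to percentages, matching
    the tables in this README.
    """
    data = json.loads(results_json)
    versions = data.get("versions", {})
    lines = [
        "|Task|Version|Metric|Value|   |Stderr|",
        "|----|------:|------|----:|---|-----:|",
    ]
    for task, metrics in sorted(data["results"].items()):
        first = True  # task name and version only on the first metric row
        for metric, value in metrics.items():
            if metric.endswith("_stderr"):
                continue  # stderr is emitted alongside its metric, not as a row
            stderr = metrics.get(metric + "_stderr")
            lines.append("|{}|{}|{}|{:.2f}|{}|{}|".format(
                task if first else "",
                versions.get(task, "") if first else "",
                metric,
                value * 100,
                "±" if stderr is not None else "",
                "{:.2f}".format(stderr * 100) if stderr is not None else "",
            ))
            first = False
    return "\n".join(lines)
```

Continuation rows (a second metric such as `acc_norm` or `f1`) leave the Task and Version cells blank, which is how the harness's own table writer formats them.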