# llama-7B

## llama-7B_anli_0-shot.json
| Task  |Version|Metric|Value|   |Stderr|
|-------|------:|------|----:|---|-----:|
|anli_r1|      0|acc   |34.80|±  |  1.51|
|anli_r2|      0|acc   |33.70|±  |  1.50|
|anli_r3|      0|acc   |36.58|±  |  1.39|
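
These tables follow the standard lm-evaluation-harness markdown layout (task, version, metric, value, ±, stderr), so they can be re-parsed programmatically. A minimal sketch, using the ANLI rows above copied verbatim, that macro-averages accuracy across the three rounds (the parsing helper `parse_rows` is illustrative, not part of the harness):

```python
# Parse lm-evaluation-harness style markdown result rows and macro-average them.
ANLI_TABLE = """\
|anli_r1|      0|acc   |34.80|±  |  1.51|
|anli_r2|      0|acc   |33.70|±  |  1.50|
|anli_r3|      0|acc   |36.58|±  |  1.39|
"""

def parse_rows(table: str) -> dict:
    """Map task name -> metric value for each data row of a results table."""
    scores = {}
    for line in table.strip().splitlines():
        # Strip the outer pipes, then split on the inner ones.
        cells = [c.strip() for c in line.strip("|").split("|")]
        task, _version, _metric, value = cells[:4]
        scores[task] = float(value)
    return scores

scores = parse_rows(ANLI_TABLE)
avg = sum(scores.values()) / len(scores)
print(f"ANLI macro-average acc: {avg:.2f}")  # → 35.03
```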

## llama-7B_arithmetic_5-shot.json
|     Task     |Version|Metric|Value|   |Stderr|
|--------------|------:|------|----:|---|-----:|
|arithmetic_1dc|      0|acc   |    0|±  |     0|
|arithmetic_2da|      0|acc   |    0|±  |     0|
|arithmetic_2dm|      0|acc   |    0|±  |     0|
|arithmetic_2ds|      0|acc   |    0|±  |     0|
|arithmetic_3da|      0|acc   |    0|±  |     0|
|arithmetic_3ds|      0|acc   |    0|±  |     0|
|arithmetic_4da|      0|acc   |    0|±  |     0|
|arithmetic_4ds|      0|acc   |    0|±  |     0|
|arithmetic_5da|      0|acc   |    0|±  |     0|
|arithmetic_5ds|      0|acc   |    0|±  |     0|

## llama-7B_bbh_3-shot.json
|                      Task                      |Version|       Metric        |Value|   |Stderr|
|------------------------------------------------|------:|---------------------|----:|---|-----:|
|bigbench_causal_judgement                       |      0|multiple_choice_grade|48.42|±  |  3.64|
|bigbench_date_understanding                     |      0|multiple_choice_grade|62.06|±  |  2.53|
|bigbench_disambiguation_qa                      |      0|multiple_choice_grade|35.27|±  |  2.98|
|bigbench_dyck_languages                         |      0|multiple_choice_grade|15.40|±  |  1.14|
|bigbench_formal_fallacies_syllogisms_negation   |      0|multiple_choice_grade|51.35|±  |  0.42|
|bigbench_geometric_shapes                       |      0|multiple_choice_grade|17.83|±  |  2.02|
|                                                |       |exact_str_match      | 0.00|±  |  0.00|
|bigbench_hyperbaton                             |      0|multiple_choice_grade|49.51|±  |  0.22|
|bigbench_logical_deduction_five_objects         |      0|multiple_choice_grade|29.00|±  |  2.03|
|bigbench_logical_deduction_seven_objects        |      0|multiple_choice_grade|24.57|±  |  1.63|
|bigbench_logical_deduction_three_objects        |      0|multiple_choice_grade|39.33|±  |  2.83|
|bigbench_movie_recommendation                   |      0|multiple_choice_grade|40.40|±  |  2.20|
|bigbench_navigate                               |      0|multiple_choice_grade|49.50|±  |  1.58|
|bigbench_reasoning_about_colored_objects        |      0|multiple_choice_grade|34.60|±  |  1.06|
|bigbench_ruin_names                             |      0|multiple_choice_grade|29.91|±  |  2.17|
|bigbench_salient_translation_error_detection    |      0|multiple_choice_grade|16.53|±  |  1.18|
|bigbench_snarks                                 |      0|multiple_choice_grade|50.83|±  |  3.73|
|bigbench_sports_understanding                   |      0|multiple_choice_grade|50.00|±  |  1.59|
|bigbench_temporal_sequences                     |      0|multiple_choice_grade|27.20|±  |  1.41|
|bigbench_tracking_shuffled_objects_five_objects |      0|multiple_choice_grade|18.24|±  |  1.09|
|bigbench_tracking_shuffled_objects_seven_objects|      0|multiple_choice_grade|13.71|±  |  0.82|
|bigbench_tracking_shuffled_objects_three_objects|      0|multiple_choice_grade|39.33|±  |  2.83|

## llama-7B_blimp_0-shot.json
|                          Task                           |Version|Metric|Value|   |Stderr|
|---------------------------------------------------------|------:|------|----:|---|-----:|
|blimp_adjunct_island                                     |      0|acc   | 53.9|±  |  1.58|
|blimp_anaphor_gender_agreement                           |      0|acc   | 44.8|±  |  1.57|
|blimp_anaphor_number_agreement                           |      0|acc   | 65.9|±  |  1.50|
|blimp_animate_subject_passive                            |      0|acc   | 62.6|±  |  1.53|
|blimp_animate_subject_trans                              |      0|acc   | 76.1|±  |  1.35|
|blimp_causative                                          |      0|acc   | 50.8|±  |  1.58|
|blimp_complex_NP_island                                  |      0|acc   | 41.6|±  |  1.56|
|blimp_coordinate_structure_constraint_complex_left_branch|      0|acc   | 68.2|±  |  1.47|
|blimp_coordinate_structure_constraint_object_extraction  |      0|acc   | 62.9|±  |  1.53|
|blimp_determiner_noun_agreement_1                        |      0|acc   | 63.6|±  |  1.52|
|blimp_determiner_noun_agreement_2                        |      0|acc   | 59.8|±  |  1.55|
|blimp_determiner_noun_agreement_irregular_1              |      0|acc   | 57.2|±  |  1.57|
|blimp_determiner_noun_agreement_irregular_2              |      0|acc   | 60.2|±  |  1.55|
|blimp_determiner_noun_agreement_with_adj_2               |      0|acc   | 54.0|±  |  1.58|
|blimp_determiner_noun_agreement_with_adj_irregular_1     |      0|acc   | 56.3|±  |  1.57|
|blimp_determiner_noun_agreement_with_adj_irregular_2     |      0|acc   | 59.1|±  |  1.56|
|blimp_determiner_noun_agreement_with_adjective_1         |      0|acc   | 57.7|±  |  1.56|
|blimp_distractor_agreement_relational_noun               |      0|acc   | 44.1|±  |  1.57|
|blimp_distractor_agreement_relative_clause               |      0|acc   | 31.4|±  |  1.47|
|blimp_drop_argument                                      |      0|acc   | 70.1|±  |  1.45|
|blimp_ellipsis_n_bar_1                                   |      0|acc   | 66.8|±  |  1.49|
|blimp_ellipsis_n_bar_2                                   |      0|acc   | 79.4|±  |  1.28|
|blimp_existential_there_object_raising                   |      0|acc   | 78.8|±  |  1.29|
|blimp_existential_there_quantifiers_1                    |      0|acc   | 68.3|±  |  1.47|
|blimp_existential_there_quantifiers_2                    |      0|acc   | 67.4|±  |  1.48|
|blimp_existential_there_subject_raising                  |      0|acc   | 69.6|±  |  1.46|
|blimp_expletive_it_object_raising                        |      0|acc   | 65.9|±  |  1.50|
|blimp_inchoative                                         |      0|acc   | 42.0|±  |  1.56|
|blimp_intransitive                                       |      0|acc   | 59.2|±  |  1.55|
|blimp_irregular_past_participle_adjectives               |      0|acc   | 42.9|±  |  1.57|
|blimp_irregular_past_participle_verbs                    |      0|acc   | 72.5|±  |  1.41|
|blimp_irregular_plural_subject_verb_agreement_1          |      0|acc   | 65.3|±  |  1.51|
|blimp_irregular_plural_subject_verb_agreement_2          |      0|acc   | 70.0|±  |  1.45|
|blimp_left_branch_island_echo_question                   |      0|acc   | 83.5|±  |  1.17|
|blimp_left_branch_island_simple_question                 |      0|acc   | 74.0|±  |  1.39|
|blimp_matrix_question_npi_licensor_present               |      0|acc   | 11.7|±  |  1.02|
|blimp_npi_present_1                                      |      0|acc   | 53.4|±  |  1.58|
|blimp_npi_present_2                                      |      0|acc   | 53.0|±  |  1.58|
|blimp_only_npi_licensor_present                          |      0|acc   | 81.4|±  |  1.23|
|blimp_only_npi_scope                                     |      0|acc   | 26.6|±  |  1.40|
|blimp_passive_1                                          |      0|acc   | 70.2|±  |  1.45|
|blimp_passive_2                                          |      0|acc   | 70.3|±  |  1.45|
|blimp_principle_A_c_command                              |      0|acc   | 39.0|±  |  1.54|
|blimp_principle_A_case_1                                 |      0|acc   | 98.5|±  |  0.38|
|blimp_principle_A_case_2                                 |      0|acc   | 55.4|±  |  1.57|
|blimp_principle_A_domain_1                               |      0|acc   | 96.2|±  |  0.60|
|blimp_principle_A_domain_2                               |      0|acc   | 64.6|±  |  1.51|
|blimp_principle_A_domain_3                               |      0|acc   | 50.1|±  |  1.58|
|blimp_principle_A_reconstruction                         |      0|acc   | 67.3|±  |  1.48|
|blimp_regular_plural_subject_verb_agreement_1            |      0|acc   | 64.5|±  |  1.51|
|blimp_regular_plural_subject_verb_agreement_2            |      0|acc   | 70.5|±  |  1.44|
|blimp_sentential_negation_npi_licensor_present           |      0|acc   | 94.0|±  |  0.75|
|blimp_sentential_negation_npi_scope                      |      0|acc   | 58.8|±  |  1.56|
|blimp_sentential_subject_island                          |      0|acc   | 60.6|±  |  1.55|
|blimp_superlative_quantifiers_1                          |      0|acc   | 61.2|±  |  1.54|
|blimp_superlative_quantifiers_2                          |      0|acc   | 56.1|±  |  1.57|
|blimp_tough_vs_raising_1                                 |      0|acc   | 29.8|±  |  1.45|
|blimp_tough_vs_raising_2                                 |      0|acc   | 76.8|±  |  1.34|
|blimp_transitive                                         |      0|acc   | 69.8|±  |  1.45|
|blimp_wh_island                                          |      0|acc   | 27.5|±  |  1.41|
|blimp_wh_questions_object_gap                            |      0|acc   | 67.0|±  |  1.49|
|blimp_wh_questions_subject_gap                           |      0|acc   | 72.0|±  |  1.42|
|blimp_wh_questions_subject_gap_long_distance             |      0|acc   | 74.6|±  |  1.38|
|blimp_wh_vs_that_no_gap                                  |      0|acc   | 84.8|±  |  1.14|
|blimp_wh_vs_that_no_gap_long_distance                    |      0|acc   | 81.2|±  |  1.24|
|blimp_wh_vs_that_with_gap                                |      0|acc   | 23.9|±  |  1.35|
|blimp_wh_vs_that_with_gap_long_distance                  |      0|acc   | 22.7|±  |  1.33|

## llama-7B_common_sense_reasoning_0-shot.json
|    Task     |Version| Metric |Value|   |Stderr|
|-------------|------:|--------|----:|---|-----:|
|arc_challenge|      0|acc     |38.23|±  |  1.42|
|             |       |acc_norm|41.38|±  |  1.44|
|arc_easy     |      0|acc     |67.38|±  |  0.96|
|             |       |acc_norm|52.48|±  |  1.02|
|boolq        |      1|acc     |73.06|±  |  0.78|
|copa         |      0|acc     |84.00|±  |  3.68|
|hellaswag    |      0|acc     |56.39|±  |  0.49|
|             |       |acc_norm|72.98|±  |  0.44|
|mc_taco      |      0|em      |11.26|   |      |
|             |       |f1      |48.27|   |      |
|openbookqa   |      0|acc     |28.20|±  |  2.01|
|             |       |acc_norm|42.40|±  |  2.21|
|piqa         |      0|acc     |78.18|±  |  0.96|
|             |       |acc_norm|77.42|±  |  0.98|
|prost        |      0|acc     |25.69|±  |  0.32|
|             |       |acc_norm|28.03|±  |  0.33|
|swag         |      0|acc     |55.47|±  |  0.35|
|             |       |acc_norm|66.87|±  |  0.33|
|winogrande   |      0|acc     |66.93|±  |  1.32|
|wsc273       |      0|acc     |80.95|±  |  2.38|

## llama-7B_glue_0-shot.json
|     Task      |Version|Metric|Value|   |Stderr|
|---------------|------:|------|----:|---|-----:|
|cola           |      0|mcc   | 0.00|±  |  0.00|
|mnli           |      0|acc   |34.40|±  |  0.48|
|mnli_mismatched|      0|acc   |35.72|±  |  0.48|
|mrpc           |      0|acc   |68.38|±  |  2.30|
|               |       |f1    |81.22|±  |  1.62|
|qnli           |      0|acc   |49.57|±  |  0.68|
|qqp            |      0|acc   |36.84|±  |  0.24|
|               |       |f1    |53.81|±  |  0.26|
|rte            |      0|acc   |53.07|±  |  3.00|
|sst            |      0|acc   |52.98|±  |  1.69|
|wnli           |      1|acc   |56.34|±  |  5.93|

## llama-7B_gsm8k_8-shot.json
|Task |Version|Metric|Value|   |Stderr|
|-----|------:|------|----:|---|-----:|
|gsm8k|      0|acc   | 8.04|±  |  0.75|

## llama-7B_human_alignment_0-shot.json
|                 Task                  |Version|       Metric        | Value |   |Stderr|
|---------------------------------------|------:|---------------------|------:|---|-----:|
|crows_pairs_english_age                |      0|likelihood_difference| 594.23|±  | 79.03|
|                                       |       |pct_stereotype       |  51.65|±  |  5.27|
|crows_pairs_english_autre              |      0|likelihood_difference|1101.14|±  |589.08|
|                                       |       |pct_stereotype       |  45.45|±  | 15.75|
|crows_pairs_english_disability         |      0|likelihood_difference| 966.97|±  |113.86|
|                                       |       |pct_stereotype       |  66.15|±  |  5.91|
|crows_pairs_english_gender             |      0|likelihood_difference| 791.74|±  | 55.02|
|                                       |       |pct_stereotype       |  53.12|±  |  2.79|
|crows_pairs_english_nationality        |      0|likelihood_difference| 676.26|±  | 58.69|
|                                       |       |pct_stereotype       |  53.70|±  |  3.40|
|crows_pairs_english_physical_appearance|      0|likelihood_difference| 451.26|±  | 69.32|
|                                       |       |pct_stereotype       |  50.00|±  |  5.93|
|crows_pairs_english_race_color         |      0|likelihood_difference| 624.65|±  | 32.39|
|                                       |       |pct_stereotype       |  46.65|±  |  2.22|
|crows_pairs_english_religion           |      0|likelihood_difference| 721.96|±  | 75.92|
|                                       |       |pct_stereotype       |  66.67|±  |  4.49|
|crows_pairs_english_sexual_orientation |      0|likelihood_difference| 830.48|±  | 84.28|
|                                       |       |pct_stereotype       |  62.37|±  |  5.05|
|crows_pairs_english_socioeconomic      |      0|likelihood_difference| 640.16|±  | 54.20|
|                                       |       |pct_stereotype       |  56.84|±  |  3.60|
|crows_pairs_french_age                 |      0|likelihood_difference|1193.96|±  |153.77|
|                                       |       |pct_stereotype       |  35.56|±  |  5.07|
|crows_pairs_french_autre               |      0|likelihood_difference| 751.20|±  |209.58|
|                                       |       |pct_stereotype       |  61.54|±  | 14.04|
|crows_pairs_french_disability          |      0|likelihood_difference|1014.77|±  |139.07|
|                                       |       |pct_stereotype       |  42.42|±  |  6.13|
|crows_pairs_french_gender              |      0|likelihood_difference|1179.90|±  | 87.14|
|                                       |       |pct_stereotype       |  52.02|±  |  2.79|
|crows_pairs_french_nationality         |      0|likelihood_difference|1041.65|±  | 90.66|
|                                       |       |pct_stereotype       |  40.71|±  |  3.09|
|crows_pairs_french_physical_appearance |      0|likelihood_difference| 704.51|±  | 94.84|
|                                       |       |pct_stereotype       |  55.56|±  |  5.90|
|crows_pairs_french_race_color          |      0|likelihood_difference|1204.89|±  | 73.32|
|                                       |       |pct_stereotype       |  43.26|±  |  2.31|
|crows_pairs_french_religion            |      0|likelihood_difference| 958.53|±  | 87.50|
|                                       |       |pct_stereotype       |  43.48|±  |  4.64|
|crows_pairs_french_sexual_orientation  |      0|likelihood_difference| 760.58|±  | 79.39|
|                                       |       |pct_stereotype       |  67.03|±  |  4.96|
|crows_pairs_french_socioeconomic       |      0|likelihood_difference| 980.84|±  |101.51|
|                                       |       |pct_stereotype       |  52.04|±  |  3.58|
|ethics_cm                              |      0|acc                  |  56.91|±  |  0.79|
|ethics_deontology                      |      0|acc                  |  50.58|±  |  0.83|
|                                       |       |em                   |   0.22|   |      |
|ethics_justice                         |      0|acc                  |  49.96|±  |  0.96|
|                                       |       |em                   |   0.15|   |      |
|ethics_utilitarianism                  |      0|acc                  |  49.81|±  |  0.72|
|ethics_utilitarianism_original         |      0|acc                  |  95.86|±  |  0.29|
|ethics_virtue                          |      0|acc                  |  20.98|±  |  0.58|
|                                       |       |em                   |   0.00|   |      |
|toxigen                                |      0|acc                  |  43.09|±  |  1.62|
|                                       |       |acc_norm             |  43.19|±  |  1.62|

## llama-7B_lambada_0-shot.json
|         Task         |Version|Metric|  Value   |   | Stderr  |
|----------------------|------:|------|---------:|---|--------:|
|lambada_openai        |      0|ppl   |2817465.09|±  |138319.09|
|                      |       |acc   |      0.00|±  |     0.00|
|lambada_openai_cloze  |      0|ppl   | 255777.71|±  | 11345.77|
|                      |       |acc   |      0.04|±  |     0.03|
|lambada_openai_mt_de  |      0|ppl   |1805613.68|±  | 97892.79|
|                      |       |acc   |      0.00|±  |     0.00|
|lambada_openai_mt_en  |      0|ppl   |2817465.09|±  |138319.09|
|                      |       |acc   |      0.00|±  |     0.00|
|lambada_openai_mt_es  |      0|ppl   |3818890.45|±  |197999.05|
|                      |       |acc   |      0.00|±  |     0.00|
|lambada_openai_mt_fr  |      0|ppl   |2111186.12|±  |111724.43|
|                      |       |acc   |      0.00|±  |     0.00|
|lambada_openai_mt_it  |      0|ppl   |3653680.57|±  |197082.99|
|                      |       |acc   |      0.00|±  |     0.00|
|lambada_standard      |      0|ppl   |2460346.86|±  | 81216.57|
|                      |       |acc   |      0.00|±  |     0.00|
|lambada_standard_cloze|      0|ppl   |6710057.24|±  |169833.91|
|                      |       |acc   |      0.00|±  |     0.00|

## llama-7B_mathematical_reasoning_0-shot.json
|          Task           |Version| Metric |Value|   |Stderr|
|-------------------------|------:|--------|----:|---|-----:|
|drop                     |      1|em      | 4.27|±  |  0.21|
|                         |       |f1      |12.16|±  |  0.25|
|gsm8k                    |      0|acc     | 0.00|±  |  0.00|
|math_algebra             |      1|acc     | 1.68|±  |  0.37|
|math_asdiv               |      0|acc     | 0.00|±  |  0.00|
|math_counting_and_prob   |      1|acc     | 1.69|±  |  0.59|
|math_geometry            |      1|acc     | 0.84|±  |  0.42|
|math_intermediate_algebra|      1|acc     | 0.66|±  |  0.27|
|math_num_theory          |      1|acc     | 0.74|±  |  0.37|
|math_prealgebra          |      1|acc     | 1.26|±  |  0.38|
|math_precalc             |      1|acc     | 0.37|±  |  0.26|
|mathqa                   |      0|acc     |26.77|±  |  0.81|
|                         |       |acc_norm|27.87|±  |  0.82|

## llama-7B_mathematical_reasoning_few_shot_5-shot.json
|          Task           |Version| Metric |Value|   |Stderr|
|-------------------------|------:|--------|----:|---|-----:|
|drop                     |      1|em      | 1.24|±  |  0.11|
|                         |       |f1      | 2.10|±  |  0.13|
|gsm8k                    |      0|acc     | 0.00|±  |  0.00|
|math_algebra             |      1|acc     | 0.00|±  |  0.00|
|math_counting_and_prob   |      1|acc     | 0.00|±  |  0.00|
|math_geometry            |      1|acc     | 0.00|±  |  0.00|
|math_intermediate_algebra|      1|acc     | 0.00|±  |  0.00|
|math_num_theory          |      1|acc     | 0.00|±  |  0.00|
|math_prealgebra          |      1|acc     | 0.11|±  |  0.11|
|math_precalc             |      1|acc     | 0.00|±  |  0.00|
|mathqa                   |      0|acc     |28.21|±  |  0.82|
|                         |       |acc_norm|28.78|±  |  0.83|

## llama-7B_mmlu_5-shot.json
|                      Task                       |Version| Metric |Value|   |Stderr|
|-------------------------------------------------|------:|--------|----:|---|-----:|
|hendrycksTest-abstract_algebra                   |      0|acc     |23.00|±  |  4.23|
|                                                 |       |acc_norm|26.00|±  |  4.41|
|hendrycksTest-anatomy                            |      0|acc     |38.52|±  |  4.20|
|                                                 |       |acc_norm|28.15|±  |  3.89|
|hendrycksTest-astronomy                          |      0|acc     |45.39|±  |  4.05|
|                                                 |       |acc_norm|46.05|±  |  4.06|
|hendrycksTest-business_ethics                    |      0|acc     |53.00|±  |  5.02|
|                                                 |       |acc_norm|46.00|±  |  5.01|
|hendrycksTest-clinical_knowledge                 |      0|acc     |38.87|±  |  3.00|
|                                                 |       |acc_norm|38.11|±  |  2.99|
|hendrycksTest-college_biology                    |      0|acc     |31.94|±  |  3.90|
|                                                 |       |acc_norm|29.17|±  |  3.80|
|hendrycksTest-college_chemistry                  |      0|acc     |33.00|±  |  4.73|
|                                                 |       |acc_norm|30.00|±  |  4.61|
|hendrycksTest-college_computer_science           |      0|acc     |33.00|±  |  4.73|
|                                                 |       |acc_norm|28.00|±  |  4.51|
|hendrycksTest-college_mathematics                |      0|acc     |32.00|±  |  4.69|
|                                                 |       |acc_norm|32.00|±  |  4.69|
|hendrycksTest-college_medicine                   |      0|acc     |37.57|±  |  3.69|
|                                                 |       |acc_norm|30.64|±  |  3.51|
|hendrycksTest-college_physics                    |      0|acc     |23.53|±  |  4.22|
|                                                 |       |acc_norm|32.35|±  |  4.66|
|hendrycksTest-computer_security                  |      0|acc     |37.00|±  |  4.85|
|                                                 |       |acc_norm|44.00|±  |  4.99|
|hendrycksTest-conceptual_physics                 |      0|acc     |32.77|±  |  3.07|
|                                                 |       |acc_norm|21.70|±  |  2.69|
|hendrycksTest-econometrics                       |      0|acc     |28.95|±  |  4.27|
|                                                 |       |acc_norm|26.32|±  |  4.14|
|hendrycksTest-electrical_engineering             |      0|acc     |35.86|±  |  4.00|
|                                                 |       |acc_norm|32.41|±  |  3.90|
|hendrycksTest-elementary_mathematics             |      0|acc     |32.01|±  |  2.40|
|                                                 |       |acc_norm|29.10|±  |  2.34|
|hendrycksTest-formal_logic                       |      0|acc     |30.95|±  |  4.13|
|                                                 |       |acc_norm|34.92|±  |  4.26|
|hendrycksTest-global_facts                       |      0|acc     |32.00|±  |  4.69|
|                                                 |       |acc_norm|29.00|±  |  4.56|
|hendrycksTest-high_school_biology                |      0|acc     |35.81|±  |  2.73|
|                                                 |       |acc_norm|35.81|±  |  2.73|
|hendrycksTest-high_school_chemistry              |      0|acc     |25.12|±  |  3.05|
|                                                 |       |acc_norm|29.56|±  |  3.21|
|hendrycksTest-high_school_computer_science       |      0|acc     |41.00|±  |  4.94|
|                                                 |       |acc_norm|34.00|±  |  4.76|
|hendrycksTest-high_school_european_history       |      0|acc     |40.61|±  |  3.83|
|                                                 |       |acc_norm|36.97|±  |  3.77|
|hendrycksTest-high_school_geography              |      0|acc     |42.93|±  |  3.53|
|                                                 |       |acc_norm|36.36|±  |  3.43|
|hendrycksTest-high_school_government_and_politics|      0|acc     |48.19|±  |  3.61|
|                                                 |       |acc_norm|37.31|±  |  3.49|
|hendrycksTest-high_school_macroeconomics         |      0|acc     |31.79|±  |  2.36|
|                                                 |       |acc_norm|30.26|±  |  2.33|
|hendrycksTest-high_school_mathematics            |      0|acc     |22.59|±  |  2.55|
|                                                 |       |acc_norm|30.74|±  |  2.81|
|hendrycksTest-high_school_microeconomics         |      0|acc     |38.66|±  |  3.16|
|                                                 |       |acc_norm|36.55|±  |  3.13|
|hendrycksTest-high_school_physics                |      0|acc     |20.53|±  |  3.30|
|                                                 |       |acc_norm|27.15|±  |  3.63|
|hendrycksTest-high_school_psychology             |      0|acc     |46.61|±  |  2.14|
|                                                 |       |acc_norm|30.83|±  |  1.98|
|hendrycksTest-high_school_statistics             |      0|acc     |34.26|±  |  3.24|
|                                                 |       |acc_norm|34.26|±  |  3.24|
|hendrycksTest-high_school_us_history             |      0|acc     |42.65|±  |  3.47|
|                                                 |       |acc_norm|31.37|±  |  3.26|
|hendrycksTest-high_school_world_history          |      0|acc     |39.24|±  |  3.18|
|                                                 |       |acc_norm|33.76|±  |  3.08|
|hendrycksTest-human_aging                        |      0|acc     |37.22|±  |  3.24|
|                                                 |       |acc_norm|25.11|±  |  2.91|
|hendrycksTest-human_sexuality                    |      0|acc     |51.15|±  |  4.38|
|                                                 |       |acc_norm|36.64|±  |  4.23|
|hendrycksTest-international_law                  |      0|acc     |38.84|±  |  4.45|
|                                                 |       |acc_norm|57.85|±  |  4.51|
|hendrycksTest-jurisprudence                      |      0|acc     |43.52|±  |  4.79|
|                                                 |       |acc_norm|50.00|±  |  4.83|
|hendrycksTest-logical_fallacies                  |      0|acc     |38.04|±  |  3.81|
|                                                 |       |acc_norm|34.97|±  |  3.75|
|hendrycksTest-machine_learning                   |      0|acc     |30.36|±  |  4.36|
|                                                 |       |acc_norm|26.79|±  |  4.20|
|hendrycksTest-management                         |      0|acc     |48.54|±  |  4.95|
|                                                 |       |acc_norm|36.89|±  |  4.78|
|hendrycksTest-marketing                          |      0|acc     |61.11|±  |  3.19|
|                                                 |       |acc_norm|50.43|±  |  3.28|
|hendrycksTest-medical_genetics                   |      0|acc     |44.00|±  |  4.99|
|                                                 |       |acc_norm|40.00|±  |  4.92|
|hendrycksTest-miscellaneous                      |      0|acc     |58.37|±  |  1.76|
|                                                 |       |acc_norm|38.95|±  |  1.74|
|hendrycksTest-moral_disputes                     |      0|acc     |36.42|±  |  2.59|
|                                                 |       |acc_norm|33.24|±  |  2.54|
|hendrycksTest-moral_scenarios                    |      0|acc     |27.60|±  |  1.50|
|                                                 |       |acc_norm|27.26|±  |  1.49|
|hendrycksTest-nutrition                          |      0|acc     |39.54|±  |  2.80|
|                                                 |       |acc_norm|43.79|±  |  2.84|
|hendrycksTest-philosophy                         |      0|acc     |40.19|±  |  2.78|
|                                                 |       |acc_norm|35.37|±  |  2.72|
|hendrycksTest-prehistory                         |      0|acc     |40.12|±  |  2.73|
|                                                 |       |acc_norm|27.78|±  |  2.49|
|hendrycksTest-professional_accounting            |      0|acc     |30.14|±  |  2.74|
|                                                 |       |acc_norm|29.43|±  |  2.72|
|hendrycksTest-professional_law                   |      0|acc     |29.66|±  |  1.17|
|                                                 |       |acc_norm|28.55|±  |  1.15|
|hendrycksTest-professional_medicine              |      0|acc     |33.82|±  |  2.87|
|                                                 |       |acc_norm|27.94|±  |  2.73|
|hendrycksTest-professional_psychology            |      0|acc     |38.40|±  |  1.97|
|                                                 |       |acc_norm|29.90|±  |  1.85|
|hendrycksTest-public_relations                   |      0|acc     |39.09|±  |  4.67|
|                                                 |       |acc_norm|22.73|±  |  4.01|
|hendrycksTest-security_studies                   |      0|acc     |40.82|±  |  3.15|
|                                                 |       |acc_norm|31.02|±  |  2.96|
|hendrycksTest-sociology                          |      0|acc     |47.76|±  |  3.53|
|                                                 |       |acc_norm|42.79|±  |  3.50|
|hendrycksTest-us_foreign_policy                  |      0|acc     |56.00|±  |  4.99|
|                                                 |       |acc_norm|45.00|±  |  5.00|
|hendrycksTest-virology                           |      0|acc     |39.76|±  |  3.81|
|                                                 |       |acc_norm|28.92|±  |  3.53|
|hendrycksTest-world_religions                    |      0|acc     |62.57|±  |  3.71|
|                                                 |       |acc_norm|51.46|±  |  3.83|

## llama-7B_pawsx_0-shot.json
|  Task  |Version|Metric|Value|   |Stderr|
|--------|------:|------|----:|---|-----:|
|pawsx_de|      0|acc   |54.65|±  |  1.11|
|pawsx_en|      0|acc   |61.85|±  |  1.09|
|pawsx_es|      0|acc   |56.10|±  |  1.11|
|pawsx_fr|      0|acc   |52.95|±  |  1.12|
|pawsx_ja|      0|acc   |56.70|±  |  1.11|
|pawsx_ko|      0|acc   |49.70|±  |  1.12|
|pawsx_zh|      0|acc   |49.10|±  |  1.12|
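
PAWS-X is a binary paraphrase-identification task, so 50% accuracy is the chance baseline. A quick sketch (values copied from the table above) showing which languages the model scores above chance on:

```python
# PAWS-X accuracies from the table above; the task is binary, so 50.0 is chance.
PAWSX_ACC = {
    "de": 54.65, "en": 61.85, "es": 56.10, "fr": 52.95,
    "ja": 56.70, "ko": 49.70, "zh": 49.10,
}
CHANCE = 50.0

above_chance = sorted(lang for lang, acc in PAWSX_ACC.items() if acc > CHANCE)
print(above_chance)  # → ['de', 'en', 'es', 'fr', 'ja']
```

Korean and Chinese sit just below the chance baseline here, so those numbers should not be read as evidence of any paraphrase ability in those languages.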

## llama-7B_question_answering_0-shot.json
|    Task     |Version|   Metric   |Value|   |Stderr|
|-------------|------:|------------|----:|---|-----:|
|headqa_en    |      0|acc         |32.42|±  |  0.89|
|             |       |acc_norm    |35.92|±  |  0.92|
|headqa_es    |      0|acc         |28.26|±  |  0.86|
|             |       |acc_norm    |32.42|±  |  0.89|
|logiqa       |      0|acc         |21.81|±  |  1.62|
|             |       |acc_norm    |30.26|±  |  1.80|
|squad2       |      1|exact       | 9.42|   |      |
|             |       |f1          |19.45|   |      |
|             |       |HasAns_exact|18.49|   |      |
|             |       |HasAns_f1   |38.58|   |      |
|             |       |NoAns_exact | 0.37|   |      |
|             |       |NoAns_f1    | 0.37|   |      |
|             |       |best_exact  |50.07|   |      |
|             |       |best_f1     |50.08|   |      |
|triviaqa     |      1|acc         | 0.00|±  |  0.00|
|truthfulqa_mc|      1|mc1         |21.05|±  |  1.43|
|             |       |mc2         |34.14|±  |  1.31|
|webqs        |      0|acc         | 0.00|±  |  0.00|

## llama-7B_reading_comprehension_0-shot.json
|Task|Version|Metric|Value|   |Stderr|
|----|------:|------|----:|---|-----:|
|coqa|      1|f1    |75.21|±  |  1.53|
|    |       |em    |62.67|±  |  1.88|
|drop|      1|em    | 3.59|±  |  0.19|
|    |       |f1    |11.35|±  |  0.23|
|race|      1|acc   |39.90|±  |  1.52|

## llama-7B_unscramble_0-shot.json
|      Task      |Version|Metric|Value|   |Stderr|
|----------------|------:|------|----:|---|-----:|
|anagrams1       |      0|acc   |    0|±  |     0|
|anagrams2       |      0|acc   |    0|±  |     0|
|cycle_letters   |      0|acc   |    0|±  |     0|
|random_insertion|      0|acc   |    0|±  |     0|
|reversed_words  |      0|acc   |    0|±  |     0|

## llama-7B_xcopa_0-shot.json
|  Task  |Version|Metric|Value|   |Stderr|
|--------|------:|------|----:|---|-----:|
|xcopa_et|      0|acc   | 48.8|±  |  2.24|
|xcopa_ht|      0|acc   | 51.0|±  |  2.24|
|xcopa_id|      0|acc   | 54.6|±  |  2.23|
|xcopa_it|      0|acc   | 62.0|±  |  2.17|
|xcopa_qu|      0|acc   | 51.4|±  |  2.24|
|xcopa_sw|      0|acc   | 50.8|±  |  2.24|
|xcopa_ta|      0|acc   | 55.2|±  |  2.23|
|xcopa_th|      0|acc   | 55.8|±  |  2.22|
|xcopa_tr|      0|acc   | 55.6|±  |  2.22|
|xcopa_vi|      0|acc   | 51.6|±  |  2.24|
|xcopa_zh|      0|acc   | 56.2|±  |  2.22|

## llama-7B_xnli_0-shot.json
| Task  |Version|Metric|Value|   |Stderr|
|-------|------:|------|----:|---|-----:|
|xnli_ar|      0|acc   |33.57|±  |  0.67|
|xnli_bg|      0|acc   |36.99|±  |  0.68|
|xnli_de|      0|acc   |44.77|±  |  0.70|
|xnli_el|      0|acc   |34.93|±  |  0.67|
|xnli_en|      0|acc   |51.06|±  |  0.71|
|xnli_es|      0|acc   |40.62|±  |  0.69|
|xnli_fr|      0|acc   |43.75|±  |  0.70|
|xnli_hi|      0|acc   |36.11|±  |  0.68|
|xnli_ru|      0|acc   |39.36|±  |  0.69|
|xnli_sw|      0|acc   |33.71|±  |  0.67|
|xnli_th|      0|acc   |34.51|±  |  0.67|
|xnli_tr|      0|acc   |35.59|±  |  0.68|
|xnli_ur|      0|acc   |33.39|±  |  0.67|
|xnli_vi|      0|acc   |35.59|±  |  0.68|
|xnli_zh|      0|acc   |36.23|±  |  0.68|

## llama-7B_xstory_cloze_0-shot.json
|     Task      |Version|Metric|Value|   |Stderr|
|---------------|------:|------|----:|---|-----:|
|xstory_cloze_ar|      0|acc   |48.31|±  |  1.29|
|xstory_cloze_en|      0|acc   |74.78|±  |  1.12|
|xstory_cloze_es|      0|acc   |65.12|±  |  1.23|
|xstory_cloze_eu|      0|acc   |50.10|±  |  1.29|
|xstory_cloze_hi|      0|acc   |52.68|±  |  1.28|
|xstory_cloze_id|      0|acc   |52.08|±  |  1.29|
|xstory_cloze_my|      0|acc   |48.71|±  |  1.29|
|xstory_cloze_ru|      0|acc   |61.35|±  |  1.25|
|xstory_cloze_sw|      0|acc   |50.36|±  |  1.29|
|xstory_cloze_te|      0|acc   |52.88|±  |  1.28|
|xstory_cloze_zh|      0|acc   |54.33|±  |  1.28|

## llama-7B_xwinograd_0-shot.json
|    Task    |Version|Metric|Value|   |Stderr|
|------------|------:|------|----:|---|-----:|
|xwinograd_en|      0|acc   |84.95|±  |  0.74|
|xwinograd_fr|      0|acc   |72.29|±  |  4.94|
|xwinograd_jp|      0|acc   |58.92|±  |  1.59|
|xwinograd_pt|      0|acc   |70.72|±  |  2.81|
|xwinograd_ru|      0|acc   |64.44|±  |  2.70|
|xwinograd_zh|      0|acc   |63.69|±  |  2.14|
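For downstream analysis (averaging across languages, plotting, etc.), the result tables above can be parsed programmatically. Below is a minimal sketch; the function name `parse_results_table` is illustrative, not part of any harness API. It handles the continuation rows where the Task cell is blank because a task reports several metrics (e.g. `acc` and `acc_norm`), and it leaves the Stderr cell out for simplicity since some tasks (e.g. `squad2`) report none.

```python
def parse_results_table(md: str) -> list[dict]:
    """Parse a markdown results table (as in this README) into row dicts.

    Blank Task cells inherit the last seen task, since a task spanning
    multiple metric rows only names itself on its first row.
    """
    rows = []
    last_task = ""
    for line in md.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip the header row and the |---|---:| separator row.
        if cells[0] == "Task" or cells[0].startswith("-"):
            continue
        task = cells[0] or last_task
        last_task = task
        rows.append({
            "task": task,
            "version": cells[1],
            "metric": cells[2],
            "value": float(cells[3]),
        })
    return rows
```

For example, feeding it the `xwinograd` table yields one dict per language, from which a macro-average accuracy is a one-liner: `sum(r["value"] for r in rows) / len(rows)`.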