# llama-13B

## llama-13B_arithmetic_5-shot.json
|     Task     |Version|Metric|Value|   |Stderr|
|--------------|------:|------|----:|---|-----:|
|arithmetic_1dc|      0|acc   |    0|±  |     0|
|arithmetic_2da|      0|acc   |    0|±  |     0|
|arithmetic_2dm|      0|acc   |    0|±  |     0|
|arithmetic_2ds|      0|acc   |    0|±  |     0|
|arithmetic_3da|      0|acc   |    0|±  |     0|
|arithmetic_3ds|      0|acc   |    0|±  |     0|
|arithmetic_4da|      0|acc   |    0|±  |     0|
|arithmetic_4ds|      0|acc   |    0|±  |     0|
|arithmetic_5da|      0|acc   |    0|±  |     0|
|arithmetic_5ds|      0|acc   |    0|±  |     0|

## llama-13B_bbh_3-shot.json
|                      Task                      |Version|       Metric        |Value|   |Stderr|
|------------------------------------------------|------:|---------------------|----:|---|-----:|
|bigbench_causal_judgement                       |      0|multiple_choice_grade|49.47|±  |  3.64|
|bigbench_date_understanding                     |      0|multiple_choice_grade|63.96|±  |  2.50|
|bigbench_disambiguation_qa                      |      0|multiple_choice_grade|45.74|±  |  3.11|
|bigbench_dyck_languages                         |      0|multiple_choice_grade|20.10|±  |  1.27|
|bigbench_formal_fallacies_syllogisms_negation   |      0|multiple_choice_grade|51.13|±  |  0.42|
|bigbench_geometric_shapes                       |      0|multiple_choice_grade|23.12|±  |  2.23|
|                                                |       |exact_str_match      | 0.00|±  |  0.00|
|bigbench_hyperbaton                             |      0|multiple_choice_grade|50.38|±  |  0.22|
|bigbench_logical_deduction_five_objects         |      0|multiple_choice_grade|30.00|±  |  2.05|
|bigbench_logical_deduction_seven_objects        |      0|multiple_choice_grade|22.29|±  |  1.57|
|bigbench_logical_deduction_three_objects        |      0|multiple_choice_grade|41.67|±  |  2.85|
|bigbench_movie_recommendation                   |      0|multiple_choice_grade|43.60|±  |  2.22|
|bigbench_navigate                               |      0|multiple_choice_grade|51.70|±  |  1.58|
|bigbench_reasoning_about_colored_objects        |      0|multiple_choice_grade|37.05|±  |  1.08|
|bigbench_ruin_names                             |      0|multiple_choice_grade|34.60|±  |  2.25|
|bigbench_salient_translation_error_detection    |      0|multiple_choice_grade|19.34|±  |  1.25|
|bigbench_snarks                                 |      0|multiple_choice_grade|46.96|±  |  3.72|
|bigbench_sports_understanding                   |      0|multiple_choice_grade|58.11|±  |  1.57|
|bigbench_temporal_sequences                     |      0|multiple_choice_grade|28.00|±  |  1.42|
|bigbench_tracking_shuffled_objects_five_objects |      0|multiple_choice_grade|21.44|±  |  1.16|
|bigbench_tracking_shuffled_objects_seven_objects|      0|multiple_choice_grade|14.46|±  |  0.84|
|bigbench_tracking_shuffled_objects_three_objects|      0|multiple_choice_grade|41.67|±  |  2.85|

## llama-13B_blimp_0-shot.json
|                          Task                           |Version|Metric|Value|   |Stderr|
|---------------------------------------------------------|------:|------|----:|---|-----:|
|blimp_adjunct_island                                     |      0|acc   | 33.8|±  |  1.50|
|blimp_anaphor_gender_agreement                           |      0|acc   | 57.6|±  |  1.56|
|blimp_anaphor_number_agreement                           |      0|acc   | 56.5|±  |  1.57|
|blimp_animate_subject_passive                            |      0|acc   | 65.1|±  |  1.51|
|blimp_animate_subject_trans                              |      0|acc   | 61.6|±  |  1.54|
|blimp_causative                                          |      0|acc   | 35.9|±  |  1.52|
|blimp_complex_NP_island                                  |      0|acc   | 30.3|±  |  1.45|
|blimp_coordinate_structure_constraint_complex_left_branch|      0|acc   | 34.5|±  |  1.50|
|blimp_coordinate_structure_constraint_object_extraction  |      0|acc   | 27.9|±  |  1.42|
|blimp_determiner_noun_agreement_1                        |      0|acc   | 34.1|±  |  1.50|
|blimp_determiner_noun_agreement_2                        |      0|acc   | 36.1|±  |  1.52|
|blimp_determiner_noun_agreement_irregular_1              |      0|acc   | 35.6|±  |  1.51|
|blimp_determiner_noun_agreement_irregular_2              |      0|acc   | 36.9|±  |  1.53|
|blimp_determiner_noun_agreement_with_adj_2               |      0|acc   | 39.2|±  |  1.54|
|blimp_determiner_noun_agreement_with_adj_irregular_1     |      0|acc   | 34.2|±  |  1.50|
|blimp_determiner_noun_agreement_with_adj_irregular_2     |      0|acc   | 39.3|±  |  1.55|
|blimp_determiner_noun_agreement_with_adjective_1         |      0|acc   | 39.1|±  |  1.54|
|blimp_distractor_agreement_relational_noun               |      0|acc   | 51.4|±  |  1.58|
|blimp_distractor_agreement_relative_clause               |      0|acc   | 42.3|±  |  1.56|
|blimp_drop_argument                                      |      0|acc   | 70.5|±  |  1.44|
|blimp_ellipsis_n_bar_1                                   |      0|acc   | 62.4|±  |  1.53|
|blimp_ellipsis_n_bar_2                                   |      0|acc   | 26.4|±  |  1.39|
|blimp_existential_there_object_raising                   |      0|acc   | 69.0|±  |  1.46|
|blimp_existential_there_quantifiers_1                    |      0|acc   | 30.8|±  |  1.46|
|blimp_existential_there_quantifiers_2                    |      0|acc   | 78.8|±  |  1.29|
|blimp_existential_there_subject_raising                  |      0|acc   | 70.1|±  |  1.45|
|blimp_expletive_it_object_raising                        |      0|acc   | 61.9|±  |  1.54|
|blimp_inchoative                                         |      0|acc   | 47.4|±  |  1.58|
|blimp_intransitive                                       |      0|acc   | 64.3|±  |  1.52|
|blimp_irregular_past_participle_adjectives               |      0|acc   | 63.6|±  |  1.52|
|blimp_irregular_past_participle_verbs                    |      0|acc   | 31.4|±  |  1.47|
|blimp_irregular_plural_subject_verb_agreement_1          |      0|acc   | 51.8|±  |  1.58|
|blimp_irregular_plural_subject_verb_agreement_2          |      0|acc   | 50.4|±  |  1.58|
|blimp_left_branch_island_echo_question                   |      0|acc   | 49.0|±  |  1.58|
|blimp_left_branch_island_simple_question                 |      0|acc   | 41.1|±  |  1.56|
|blimp_matrix_question_npi_licensor_present               |      0|acc   | 54.8|±  |  1.57|
|blimp_npi_present_1                                      |      0|acc   | 30.4|±  |  1.46|
|blimp_npi_present_2                                      |      0|acc   | 39.0|±  |  1.54|
|blimp_only_npi_licensor_present                          |      0|acc   | 73.1|±  |  1.40|
|blimp_only_npi_scope                                     |      0|acc   | 27.8|±  |  1.42|
|blimp_passive_1                                          |      0|acc   | 52.9|±  |  1.58|
|blimp_passive_2                                          |      0|acc   | 52.6|±  |  1.58|
|blimp_principle_A_c_command                              |      0|acc   | 32.6|±  |  1.48|
|blimp_principle_A_case_1                                 |      0|acc   |  2.8|±  |  0.52|
|blimp_principle_A_case_2                                 |      0|acc   | 44.3|±  |  1.57|
|blimp_principle_A_domain_1                               |      0|acc   | 32.4|±  |  1.48|
|blimp_principle_A_domain_2                               |      0|acc   | 74.0|±  |  1.39|
|blimp_principle_A_domain_3                               |      0|acc   | 56.3|±  |  1.57|
|blimp_principle_A_reconstruction                         |      0|acc   | 79.2|±  |  1.28|
|blimp_regular_plural_subject_verb_agreement_1            |      0|acc   | 56.0|±  |  1.57|
|blimp_regular_plural_subject_verb_agreement_2            |      0|acc   | 45.6|±  |  1.58|
|blimp_sentential_negation_npi_licensor_present           |      0|acc   | 39.2|±  |  1.54|
|blimp_sentential_negation_npi_scope                      |      0|acc   | 63.8|±  |  1.52|
|blimp_sentential_subject_island                          |      0|acc   | 62.1|±  |  1.53|
|blimp_superlative_quantifiers_1                          |      0|acc   | 52.2|±  |  1.58|
|blimp_superlative_quantifiers_2                          |      0|acc   | 71.4|±  |  1.43|
|blimp_tough_vs_raising_1                                 |      0|acc   | 36.1|±  |  1.52|
|blimp_tough_vs_raising_2                                 |      0|acc   | 64.2|±  |  1.52|
|blimp_transitive                                         |      0|acc   | 47.3|±  |  1.58|
|blimp_wh_island                                          |      0|acc   | 50.6|±  |  1.58|
|blimp_wh_questions_object_gap                            |      0|acc   | 45.5|±  |  1.58|
|blimp_wh_questions_subject_gap                           |      0|acc   | 36.9|±  |  1.53|
|blimp_wh_questions_subject_gap_long_distance             |      0|acc   | 40.8|±  |  1.55|
|blimp_wh_vs_that_no_gap                                  |      0|acc   | 19.6|±  |  1.26|
|blimp_wh_vs_that_no_gap_long_distance                    |      0|acc   | 30.1|±  |  1.45|
|blimp_wh_vs_that_with_gap                                |      0|acc   | 84.7|±  |  1.14|
|blimp_wh_vs_that_with_gap_long_distance                  |      0|acc   | 69.2|±  |  1.46|

## llama-13B_common_sense_reasoning_0-shot.json
|    Task     |Version| Metric |Value|   |Stderr|
|-------------|------:|--------|----:|---|-----:|
|arc_challenge|      0|acc     |43.94|±  |  1.45|
|             |       |acc_norm|44.62|±  |  1.45|
|arc_easy     |      0|acc     |74.58|±  |  0.89|
|             |       |acc_norm|59.89|±  |  1.01|
|boolq        |      1|acc     |68.50|±  |  0.81|
|copa         |      0|acc     |90.00|±  |  3.02|
|hellaswag    |      0|acc     |59.10|±  |  0.49|
|             |       |acc_norm|76.24|±  |  0.42|
|mc_taco      |      0|em      |10.96|   |      |
|             |       |f1      |47.53|   |      |
|openbookqa   |      0|acc     |30.60|±  |  2.06|
|             |       |acc_norm|42.20|±  |  2.21|
|piqa         |      0|acc     |78.84|±  |  0.95|
|             |       |acc_norm|79.11|±  |  0.95|
|prost        |      0|acc     |26.89|±  |  0.32|
|             |       |acc_norm|30.52|±  |  0.34|
|swag         |      0|acc     |56.73|±  |  0.35|
|             |       |acc_norm|69.35|±  |  0.33|
|winogrande   |      0|acc     |70.17|±  |  1.29|
|wsc273       |      0|acc     |86.08|±  |  2.10|

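The tables in this README follow the pipe-delimited markdown layout emitted by lm-evaluation-harness (`Task|Version|Metric|Value|±|Stderr`). As a minimal sketch of working with these results, the snippet below parses rows in that format and averages the `acc` values; the three sample rows are copied from the common-sense table above, and the helper name `mean_acc` is illustrative, not part of any harness API.

```python
# Average the Value column of "acc" rows in an lm-evaluation-harness
# style markdown table. Sample rows taken from the table above.
table = """\
|arc_challenge|0|acc|43.94|± |1.45|
|arc_easy     |0|acc|74.58|± |0.89|
|boolq        |1|acc|68.50|± |0.81|
"""

def mean_acc(md_table: str) -> float:
    """Return the mean of the Value column for rows whose Metric is 'acc'."""
    vals = []
    for line in md_table.strip().splitlines():
        # Strip the outer pipes, then split into cells and trim padding.
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) >= 4 and cells[2] == "acc":
            vals.append(float(cells[3]))
    return sum(vals) / len(vals)

print(round(mean_acc(table), 2))  # → 62.34
```

The same function works on any table in this file, since sub-metric continuation rows (e.g. `acc_norm`, `f1`) carry an empty Task cell but keep the Metric cell, so filtering on `cells[2]` is sufficient.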
## llama-13B_glue_0-shot.json
|     Task      |Version|Metric|Value|   |Stderr|
|---------------|------:|------|----:|---|-----:|
|cola           |      0|mcc   | 0.00|±  |  0.00|
|mnli           |      0|acc   |43.56|±  |  0.50|
|mnli_mismatched|      0|acc   |45.35|±  |  0.50|
|mrpc           |      0|acc   |68.63|±  |  2.30|
|               |       |f1    |81.34|±  |  1.62|
|qnli           |      0|acc   |49.95|±  |  0.68|
|qqp            |      0|acc   |36.79|±  |  0.24|
|               |       |f1    |53.66|±  |  0.26|
|rte            |      0|acc   |65.34|±  |  2.86|
|sst            |      0|acc   |65.37|±  |  1.61|
|wnli           |      1|acc   |46.48|±  |  5.96|

## llama-13B_gsm8k_8-shot.json
|Task |Version|Metric|Value|   |Stderr|
|-----|------:|------|----:|---|-----:|
|gsm8k|      0|acc   |13.57|±  |  0.94|

## llama-13B_human_alignment_0-shot.json
|                 Task                  |Version|       Metric        | Value |   |Stderr|
|---------------------------------------|------:|---------------------|------:|---|-----:|
|crows_pairs_english_age                |      0|likelihood_difference| 771.02|±  | 93.66|
|                                       |       |pct_stereotype       |  56.04|±  |  5.23|
|crows_pairs_english_autre              |      0|likelihood_difference|1142.61|±  |435.33|
|                                       |       |pct_stereotype       |  36.36|±  | 15.21|
|crows_pairs_english_disability         |      0|likelihood_difference|1297.88|±  |182.88|
|                                       |       |pct_stereotype       |  35.38|±  |  5.98|
|crows_pairs_english_gender             |      0|likelihood_difference| 867.58|±  | 65.49|
|                                       |       |pct_stereotype       |  58.44|±  |  2.76|
|crows_pairs_english_nationality        |      0|likelihood_difference|1184.87|±  | 83.43|
|                                       |       |pct_stereotype       |  38.43|±  |  3.32|
|crows_pairs_english_physical_appearance|      0|likelihood_difference| 752.95|±  | 87.93|
|                                       |       |pct_stereotype       |  47.22|±  |  5.92|
|crows_pairs_english_race_color         |      0|likelihood_difference| 985.84|±  | 50.57|
|                                       |       |pct_stereotype       |  50.20|±  |  2.22|
|crows_pairs_english_religion           |      0|likelihood_difference|1181.25|±  |117.52|
|                                       |       |pct_stereotype       |  49.55|±  |  4.77|
|crows_pairs_english_sexual_orientation |      0|likelihood_difference|1072.24|±  |115.61|
|                                       |       |pct_stereotype       |  54.84|±  |  5.19|
|crows_pairs_english_socioeconomic      |      0|likelihood_difference|1122.24|±  | 78.07|
|                                       |       |pct_stereotype       |  50.53|±  |  3.64|
|crows_pairs_french_age                 |      0|likelihood_difference|1310.14|±  |112.01|
|                                       |       |pct_stereotype       |  38.89|±  |  5.17|
|crows_pairs_french_autre               |      0|likelihood_difference| 994.23|±  |314.84|
|                                       |       |pct_stereotype       |  53.85|±  | 14.39|
|crows_pairs_french_disability          |      0|likelihood_difference|1732.39|±  |182.40|
|                                       |       |pct_stereotype       |  40.91|±  |  6.10|
|crows_pairs_french_gender              |      0|likelihood_difference|1079.15|±  | 67.67|
|                                       |       |pct_stereotype       |  51.40|±  |  2.79|
|crows_pairs_french_nationality         |      0|likelihood_difference|1633.10|±  | 92.24|
|                                       |       |pct_stereotype       |  31.62|±  |  2.93|
|crows_pairs_french_physical_appearance |      0|likelihood_difference|1257.99|±  |157.39|
|                                       |       |pct_stereotype       |  52.78|±  |  5.92|
|crows_pairs_french_race_color          |      0|likelihood_difference|1192.74|±  | 50.28|
|                                       |       |pct_stereotype       |  35.00|±  |  2.23|
|crows_pairs_french_religion            |      0|likelihood_difference|1119.24|±  |108.66|
|                                       |       |pct_stereotype       |  59.13|±  |  4.60|
|crows_pairs_french_sexual_orientation  |      0|likelihood_difference|1755.49|±  |118.03|
|                                       |       |pct_stereotype       |  78.02|±  |  4.36|
|crows_pairs_french_socioeconomic       |      0|likelihood_difference|1279.15|±  | 93.70|
|                                       |       |pct_stereotype       |  35.71|±  |  3.43|
|ethics_cm                              |      0|acc                  |  51.74|±  |  0.80|
|ethics_deontology                      |      0|acc                  |  50.33|±  |  0.83|
|                                       |       |em                   |   0.11|   |      |
|ethics_justice                         |      0|acc                  |  49.93|±  |  0.96|
|                                       |       |em                   |   0.15|   |      |
|ethics_utilitarianism                  |      0|acc                  |  52.45|±  |  0.72|
|ethics_utilitarianism_original         |      0|acc                  |  98.07|±  |  0.20|
|ethics_virtue                          |      0|acc                  |  20.32|±  |  0.57|
|                                       |       |em                   |   0.00|   |      |
|toxigen                                |      0|acc                  |  42.66|±  |  1.61|
|                                       |       |acc_norm             |  43.19|±  |  1.62|

## llama-13B_lambada_0-shot.json
|         Task         |Version|Metric|  Value   |   | Stderr  |
|----------------------|------:|------|---------:|---|--------:|
|lambada_openai        |      0|ppl   |1279051.05|±  | 60995.63|
|                      |       |acc   |      0.00|±  |     0.00|
|lambada_openai_cloze  |      0|ppl   | 204515.39|±  |  9705.34|
|                      |       |acc   |      0.02|±  |     0.02|
|lambada_openai_mt_de  |      0|ppl   |1310285.44|±  | 71395.91|
|                      |       |acc   |      0.00|±  |     0.00|
|lambada_openai_mt_en  |      0|ppl   |1279051.05|±  | 60995.63|
|                      |       |acc   |      0.00|±  |     0.00|
|lambada_openai_mt_es  |      0|ppl   |1980241.77|±  |101614.20|
|                      |       |acc   |      0.00|±  |     0.00|
|lambada_openai_mt_fr  |      0|ppl   |2461448.49|±  |128013.99|
|                      |       |acc   |      0.00|±  |     0.00|
|lambada_openai_mt_it  |      0|ppl   |4091504.35|±  |218020.97|
|                      |       |acc   |      0.00|±  |     0.00|
|lambada_standard      |      0|ppl   |1409048.00|±  | 47832.88|
|                      |       |acc   |      0.00|±  |     0.00|
|lambada_standard_cloze|      0|ppl   |4235345.03|±  |132892.57|
|                      |       |acc   |      0.00|±  |     0.00|

## llama-13B_mathematical_reasoning_0-shot.json
|          Task           |Version| Metric |Value|   |Stderr|
|-------------------------|------:|--------|----:|---|-----:|
|drop                     |      1|em      | 3.88|±  |  0.20|
|                         |       |f1      |13.99|±  |  0.25|
|gsm8k                    |      0|acc     | 0.00|±  |  0.00|
|math_algebra             |      1|acc     | 1.85|±  |  0.39|
|math_asdiv               |      0|acc     | 0.00|±  |  0.00|
|math_counting_and_prob   |      1|acc     | 1.48|±  |  0.55|
|math_geometry            |      1|acc     | 1.25|±  |  0.51|
|math_intermediate_algebra|      1|acc     | 1.22|±  |  0.37|
|math_num_theory          |      1|acc     | 1.48|±  |  0.52|
|math_prealgebra          |      1|acc     | 2.87|±  |  0.57|
|math_precalc             |      1|acc     | 1.10|±  |  0.45|
|mathqa                   |      0|acc     |28.44|±  |  0.83|
|                         |       |acc_norm|28.68|±  |  0.83|

## llama-13B_mathematical_reasoning_few_shot_5-shot.json
|          Task           |Version| Metric |Value|   |Stderr|
|-------------------------|------:|--------|----:|---|-----:|
|drop                     |      1|em      | 1.71|±  |  0.13|
|                         |       |f1      | 2.45|±  |  0.14|
|gsm8k                    |      0|acc     | 0.00|±  |  0.00|
|math_algebra             |      1|acc     | 0.00|±  |  0.00|
|math_counting_and_prob   |      1|acc     | 0.21|±  |  0.21|
|math_geometry            |      1|acc     | 0.00|±  |  0.00|
|math_intermediate_algebra|      1|acc     | 0.00|±  |  0.00|
|math_num_theory          |      1|acc     | 0.19|±  |  0.19|
|math_prealgebra          |      1|acc     | 0.11|±  |  0.11|
|math_precalc             |      1|acc     | 0.00|±  |  0.00|
|mathqa                   |      0|acc     |29.98|±  |  0.84|
|                         |       |acc_norm|30.35|±  |  0.84|

## llama-13B_mmlu_5-shot.json
|                      Task                       |Version| Metric |Value|   |Stderr|
|-------------------------------------------------|------:|--------|----:|---|-----:|
|hendrycksTest-abstract_algebra                   |      0|acc     |32.00|±  |  4.69|
|                                                 |       |acc_norm|30.00|±  |  4.61|
|hendrycksTest-anatomy                            |      0|acc     |42.96|±  |  4.28|
|                                                 |       |acc_norm|29.63|±  |  3.94|
|hendrycksTest-astronomy                          |      0|acc     |48.03|±  |  4.07|
|                                                 |       |acc_norm|48.03|±  |  4.07|
|hendrycksTest-business_ethics                    |      0|acc     |53.00|±  |  5.02|
|                                                 |       |acc_norm|44.00|±  |  4.99|
|hendrycksTest-clinical_knowledge                 |      0|acc     |46.04|±  |  3.07|
|                                                 |       |acc_norm|38.49|±  |  2.99|
|hendrycksTest-college_biology                    |      0|acc     |45.83|±  |  4.17|
|                                                 |       |acc_norm|32.64|±  |  3.92|
|hendrycksTest-college_chemistry                  |      0|acc     |31.00|±  |  4.65|
|                                                 |       |acc_norm|30.00|±  |  4.61|
|hendrycksTest-college_computer_science           |      0|acc     |33.00|±  |  4.73|
|                                                 |       |acc_norm|28.00|±  |  4.51|
|hendrycksTest-college_mathematics                |      0|acc     |29.00|±  |  4.56|
|                                                 |       |acc_norm|34.00|±  |  4.76|
|hendrycksTest-college_medicine                   |      0|acc     |42.77|±  |  3.77|
|                                                 |       |acc_norm|30.06|±  |  3.50|
|hendrycksTest-college_physics                    |      0|acc     |28.43|±  |  4.49|
|                                                 |       |acc_norm|35.29|±  |  4.76|
|hendrycksTest-computer_security                  |      0|acc     |57.00|±  |  4.98|
|                                                 |       |acc_norm|44.00|±  |  4.99|
|hendrycksTest-conceptual_physics                 |      0|acc     |42.13|±  |  3.23|
|                                                 |       |acc_norm|24.26|±  |  2.80|
|hendrycksTest-econometrics                       |      0|acc     |27.19|±  |  4.19|
|                                                 |       |acc_norm|26.32|±  |  4.14|
|hendrycksTest-electrical_engineering             |      0|acc     |41.38|±  |  4.10|
|                                                 |       |acc_norm|34.48|±  |  3.96|
|hendrycksTest-elementary_mathematics             |      0|acc     |36.77|±  |  2.48|
|                                                 |       |acc_norm|32.80|±  |  2.42|
|hendrycksTest-formal_logic                       |      0|acc     |32.54|±  |  4.19|
|                                                 |       |acc_norm|34.13|±  |  4.24|
|hendrycksTest-global_facts                       |      0|acc     |34.00|±  |  4.76|
|                                                 |       |acc_norm|29.00|±  |  4.56|
|hendrycksTest-high_school_biology                |      0|acc     |49.68|±  |  2.84|
|                                                 |       |acc_norm|36.13|±  |  2.73|
|hendrycksTest-high_school_chemistry              |      0|acc     |31.03|±  |  3.26|
|                                                 |       |acc_norm|32.02|±  |  3.28|
|hendrycksTest-high_school_computer_science       |      0|acc     |49.00|±  |  5.02|
|                                                 |       |acc_norm|41.00|±  |  4.94|
|hendrycksTest-high_school_european_history       |      0|acc     |52.73|±  |  3.90|
|                                                 |       |acc_norm|49.70|±  |  3.90|
|hendrycksTest-high_school_geography              |      0|acc     |57.58|±  |  3.52|
|                                                 |       |acc_norm|42.42|±  |  3.52|
|hendrycksTest-high_school_government_and_politics|      0|acc     |58.55|±  |  3.56|
|                                                 |       |acc_norm|38.86|±  |  3.52|
|hendrycksTest-high_school_macroeconomics         |      0|acc     |37.69|±  |  2.46|
|                                                 |       |acc_norm|31.79|±  |  2.36|
|hendrycksTest-high_school_mathematics            |      0|acc     |26.67|±  |  2.70|
|                                                 |       |acc_norm|31.85|±  |  2.84|
|hendrycksTest-high_school_microeconomics         |      0|acc     |42.02|±  |  3.21|
|                                                 |       |acc_norm|40.76|±  |  3.19|
|hendrycksTest-high_school_physics                |      0|acc     |27.15|±  |  3.63|
|                                                 |       |acc_norm|25.17|±  |  3.54|
|hendrycksTest-high_school_psychology             |      0|acc     |60.73|±  |  2.09|
|                                                 |       |acc_norm|36.88|±  |  2.07|
|hendrycksTest-high_school_statistics             |      0|acc     |38.43|±  |  3.32|
|                                                 |       |acc_norm|37.50|±  |  3.30|
|hendrycksTest-high_school_us_history             |      0|acc     |52.45|±  |  3.51|
|                                                 |       |acc_norm|37.25|±  |  3.39|
|hendrycksTest-high_school_world_history          |      0|acc     |49.79|±  |  3.25|
|                                                 |       |acc_norm|42.62|±  |  3.22|
|hendrycksTest-human_aging                        |      0|acc     |57.40|±  |  3.32|
|                                                 |       |acc_norm|33.63|±  |  3.17|
|hendrycksTest-human_sexuality                    |      0|acc     |54.96|±  |  4.36|
|                                                 |       |acc_norm|39.69|±  |  4.29|
|hendrycksTest-international_law                  |      0|acc     |56.20|±  |  4.53|
|                                                 |       |acc_norm|60.33|±  |  4.47|
|hendrycksTest-jurisprudence                      |      0|acc     |48.15|±  |  4.83|
|                                                 |       |acc_norm|50.00|±  |  4.83|
|hendrycksTest-logical_fallacies                  |      0|acc     |45.40|±  |  3.91|
|                                                 |       |acc_norm|36.81|±  |  3.79|
|hendrycksTest-machine_learning                   |      0|acc     |28.57|±  |  4.29|
|                                                 |       |acc_norm|29.46|±  |  4.33|
|hendrycksTest-management                         |      0|acc     |64.08|±  |  4.75|
|                                                 |       |acc_norm|41.75|±  |  4.88|
|hendrycksTest-marketing                          |      0|acc     |72.65|±  |  2.92|
|                                                 |       |acc_norm|61.54|±  |  3.19|
|hendrycksTest-medical_genetics                   |      0|acc     |49.00|±  |  5.02|
|                                                 |       |acc_norm|48.00|±  |  5.02|
|hendrycksTest-miscellaneous                      |      0|acc     |69.60|±  |  1.64|
|                                                 |       |acc_norm|48.53|±  |  1.79|
|hendrycksTest-moral_disputes                     |      0|acc     |44.80|±  |  2.68|
|                                                 |       |acc_norm|38.15|±  |  2.62|
|hendrycksTest-moral_scenarios                    |      0|acc     |28.27|±  |  1.51|
|                                                 |       |acc_norm|27.26|±  |  1.49|
|hendrycksTest-nutrition                          |      0|acc     |45.10|±  |  2.85|
|                                                 |       |acc_norm|46.73|±  |  2.86|
|hendrycksTest-philosophy                         |      0|acc     |45.98|±  |  2.83|
|                                                 |       |acc_norm|38.59|±  |  2.76|
|hendrycksTest-prehistory                         |      0|acc     |49.69|±  |  2.78|
|                                                 |       |acc_norm|34.57|±  |  2.65|
|hendrycksTest-professional_accounting            |      0|acc     |29.79|±  |  2.73|
|                                                 |       |acc_norm|28.01|±  |  2.68|
|hendrycksTest-professional_law                   |      0|acc     |30.38|±  |  1.17|
|                                                 |       |acc_norm|30.90|±  |  1.18|
|hendrycksTest-professional_medicine              |      0|acc     |39.34|±  |  2.97|
|                                                 |       |acc_norm|33.09|±  |  2.86|
|hendrycksTest-professional_psychology            |      0|acc     |42.32|±  |  2.00|
|                                                 |       |acc_norm|33.01|±  |  1.90|
|hendrycksTest-public_relations                   |      0|acc     |54.55|±  |  4.77|
|                                                 |       |acc_norm|29.09|±  |  4.35|
|hendrycksTest-security_studies                   |      0|acc     |45.71|±  |  3.19|
|                                                 |       |acc_norm|37.55|±  |  3.10|
|hendrycksTest-sociology                          |      0|acc     |58.21|±  |  3.49|
|                                                 |       |acc_norm|45.77|±  |  3.52|
|hendrycksTest-us_foreign_policy                  |      0|acc     |68.00|±  |  4.69|
|                                                 |       |acc_norm|52.00|±  |  5.02|
|hendrycksTest-virology                           |      0|acc     |40.96|±  |  3.83|
|                                                 |       |acc_norm|30.12|±  |  3.57|
|hendrycksTest-world_religions                    |      0|acc     |74.27|±  |  3.35|
|                                                 |       |acc_norm|64.91|±  |  3.66|

## llama-13B_pawsx_0-shot.json
|  Task  |Version|Metric|Value|   |Stderr|
|--------|------:|------|----:|---|-----:|
|pawsx_de|      0|acc   |52.95|±  |  1.12|
|pawsx_en|      0|acc   |53.70|±  |  1.12|
|pawsx_es|      0|acc   |52.10|±  |  1.12|
|pawsx_fr|      0|acc   |54.50|±  |  1.11|
|pawsx_ja|      0|acc   |45.00|±  |  1.11|
|pawsx_ko|      0|acc   |47.05|±  |  1.12|
|pawsx_zh|      0|acc   |45.20|±  |  1.11|

## llama-13B_question_answering_0-shot.json
|    Task     |Version|   Metric   |Value|   |Stderr|
|-------------|------:|------------|----:|---|-----:|
|headqa_en    |      0|acc         |34.43|±  |  0.91|
|             |       |acc_norm    |38.58|±  |  0.93|
|headqa_es    |      0|acc         |30.56|±  |  0.88|
|             |       |acc_norm    |35.16|±  |  0.91|
|logiqa       |      0|acc         |26.42|±  |  1.73|
|             |       |acc_norm    |32.10|±  |  1.83|
|squad2       |      1|exact       |16.44|   |      |
|             |       |f1          |24.06|   |      |
|             |       |HasAns_exact|21.09|   |      |
|             |       |HasAns_f1   |36.35|   |      |
|             |       |NoAns_exact |11.81|   |      |
|             |       |NoAns_f1    |11.81|   |      |
|             |       |best_exact  |50.07|   |      |
|             |       |best_f1     |50.07|   |      |
|triviaqa     |      1|acc         | 0.00|±  |  0.00|
|truthfulqa_mc|      1|mc1         |25.83|±  |  1.53|
|             |       |mc2         |39.88|±  |  1.37|
|webqs        |      0|acc         | 0.00|±  |  0.00|

## llama-13B_reading_comprehension_0-shot.json
|Task|Version|Metric|Value|   |Stderr|
|----|------:|------|----:|---|-----:|
|coqa|      1|f1    |77.04|±  |  1.42|
|    |       |em    |63.70|±  |  1.85|
|drop|      1|em    | 3.59|±  |  0.19|
|    |       |f1    |13.38|±  |  0.24|
|race|      1|acc   |39.33|±  |  1.51|

## llama-13B_superglue_0-shot.json
| Task  |Version|Metric|Value|   |Stderr|
|-------|------:|------|----:|---|-----:|
|boolq  |      1|acc   |68.44|±  |  0.81|
|cb     |      1|acc   |48.21|±  |  6.74|
|       |       |f1    |38.82|   |      |
|copa   |      0|acc   |90.00|±  |  3.02|
|multirc|      1|acc   | 1.57|±  |  0.40|
|record |      0|f1    |92.32|±  |  0.26|
|       |       |em    |91.54|±  |  0.28|
|wic    |      0|acc   |49.84|±  |  1.98|
|wsc    |      0|acc   |35.58|±  |  4.72|

## llama-13B_xcopa_0-shot.json
|  Task  |Version|Metric|Value|   |Stderr|
|--------|------:|------|----:|---|-----:|
|xcopa_et|      0|acc   | 48.2|±  |  2.24|
|xcopa_ht|      0|acc   | 52.8|±  |  2.23|
|xcopa_id|      0|acc   | 57.8|±  |  2.21|
|xcopa_it|      0|acc   | 67.2|±  |  2.10|
|xcopa_qu|      0|acc   | 50.2|±  |  2.24|
|xcopa_sw|      0|acc   | 51.2|±  |  2.24|
|xcopa_ta|      0|acc   | 54.4|±  |  2.23|
|xcopa_th|      0|acc   | 54.6|±  |  2.23|
|xcopa_tr|      0|acc   | 53.0|±  |  2.23|
|xcopa_vi|      0|acc   | 53.8|±  |  2.23|
|xcopa_zh|      0|acc   | 58.4|±  |  2.21|

## llama-13B_xnli_0-shot.json
| Task  |Version|Metric|Value|   |Stderr|
|-------|------:|------|----:|---|-----:|
|xnli_ar|      0|acc   |34.07|±  |  0.67|
|xnli_bg|      0|acc   |34.21|±  |  0.67|
|xnli_de|      0|acc   |35.25|±  |  0.68|
|xnli_el|      0|acc   |34.69|±  |  0.67|
|xnli_en|      0|acc   |35.63|±  |  0.68|
|xnli_es|      0|acc   |33.49|±  |  0.67|
|xnli_fr|      0|acc   |33.49|±  |  0.67|
|xnli_hi|      0|acc   |35.59|±  |  0.68|
|xnli_ru|      0|acc   |33.79|±  |  0.67|
|xnli_sw|      0|acc   |33.15|±  |  0.67|
|xnli_th|      0|acc   |34.83|±  |  0.67|
|xnli_tr|      0|acc   |33.99|±  |  0.67|
|xnli_ur|      0|acc   |34.21|±  |  0.67|
|xnli_vi|      0|acc   |34.21|±  |  0.67|
|xnli_zh|      0|acc   |34.47|±  |  0.67|

## llama-13B_xstory_cloze_0-shot.json
|     Task      |Version|Metric|Value|   |Stderr|
|---------------|------:|------|----:|---|-----:|
|xstory_cloze_ar|      0|acc   |49.70|±  |  1.29|
|xstory_cloze_en|      0|acc   |77.30|±  |  1.08|
|xstory_cloze_es|      0|acc   |69.42|±  |  1.19|
|xstory_cloze_eu|      0|acc   |50.69|±  |  1.29|
|xstory_cloze_hi|      0|acc   |52.35|±  |  1.29|
|xstory_cloze_id|      0|acc   |55.26|±  |  1.28|
|xstory_cloze_my|      0|acc   |47.78|±  |  1.29|
|xstory_cloze_ru|      0|acc   |63.40|±  |  1.24|
|xstory_cloze_sw|      0|acc   |49.90|±  |  1.29|
|xstory_cloze_te|      0|acc   |53.34|±  |  1.28|
|xstory_cloze_zh|      0|acc   |56.45|±  |  1.28|

## llama-13B_xwinograd_0-shot.json
|    Task    |Version|Metric|Value|   |Stderr|
|------------|------:|------|----:|---|-----:|
|xwinograd_en|      0|acc   |86.75|±  |  0.70|
|xwinograd_fr|      0|acc   |68.67|±  |  5.12|
|xwinograd_jp|      0|acc   |59.85|±  |  1.58|
|xwinograd_pt|      0|acc   |71.48|±  |  2.79|
|xwinograd_ru|      0|acc   |70.79|±  |  2.57|
|xwinograd_zh|      0|acc   |70.04|±  |  2.04|
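The tables in this README can be regenerated from the per-run JSON files named in the section headings. A hedged sketch, assuming the JSON follows the typical lm-evaluation-harness layout with top-level `results` and `versions` keys and fractional metric values (here illustrated with two of the xwinograd rows above):

```python
# Sketch: render an lm-evaluation-harness style results dict into the
# pipe-table format used throughout this README. The "results"/"versions"
# layout and fractional (0-1) metric values are assumptions about the
# harness's JSON output, not guaranteed by it.
results = {
    "results": {
        "xwinograd_en": {"acc": 0.8675, "acc_stderr": 0.0070},
        "xwinograd_fr": {"acc": 0.6867, "acc_stderr": 0.0512},
    },
    "versions": {"xwinograd_en": 0, "xwinograd_fr": 0},
}

def to_table(data):
    rows = ["|Task|Version|Metric|Value|   |Stderr|",
            "|----|------:|------|----:|---|-----:|"]
    for task, metrics in sorted(data["results"].items()):
        version = data["versions"].get(task, "")
        for name, value in metrics.items():
            if name.endswith("_stderr"):
                continue  # stderr is emitted alongside its metric, not as a row
            stderr = metrics.get(f"{name}_stderr")
            err = f"{100 * stderr:.2f}" if stderr is not None else ""
            rows.append(f"|{task}|{version}|{name}|{100 * value:.2f}|±  |{err}|")
    return "\n".join(rows)

print(to_table(results))
```

Column widths are not padded here; the hand-aligned tables above additionally pad each cell to the widest entry in its column.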