# llama-13B

## llama-13B_arithmetic_5-shot.json

| Task |Version|Metric|Value| |Stderr|
|--------------|------:|------|----:|---|-----:|
|arithmetic_1dc| 0|acc | 0|± | 0|
|arithmetic_2da| 0|acc | 0|± | 0|
|arithmetic_2dm| 0|acc | 0|± | 0|
|arithmetic_2ds| 0|acc | 0|± | 0|
|arithmetic_3da| 0|acc | 0|± | 0|
|arithmetic_3ds| 0|acc | 0|± | 0|
|arithmetic_4da| 0|acc | 0|± | 0|
|arithmetic_4ds| 0|acc | 0|± | 0|
|arithmetic_5da| 0|acc | 0|± | 0|
|arithmetic_5ds| 0|acc | 0|± | 0|

## llama-13B_bbh_3-shot.json

| Task |Version| Metric |Value| |Stderr|
|------------------------------------------------|------:|---------------------|----:|---|-----:|
|bigbench_causal_judgement | 0|multiple_choice_grade|49.47|± | 3.64|
|bigbench_date_understanding | 0|multiple_choice_grade|63.96|± | 2.50|
|bigbench_disambiguation_qa | 0|multiple_choice_grade|45.74|± | 3.11|
|bigbench_dyck_languages | 0|multiple_choice_grade|20.10|± | 1.27|
|bigbench_formal_fallacies_syllogisms_negation | 0|multiple_choice_grade|51.13|± | 0.42|
|bigbench_geometric_shapes | 0|multiple_choice_grade|23.12|± | 2.23|
| | |exact_str_match | 0.00|± | 0.00|
|bigbench_hyperbaton | 0|multiple_choice_grade|50.38|± | 0.22|
|bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|30.00|± | 2.05|
|bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|22.29|± | 1.57|
|bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|41.67|± | 2.85|
|bigbench_movie_recommendation | 0|multiple_choice_grade|43.60|± | 2.22|
|bigbench_navigate | 0|multiple_choice_grade|51.70|± | 1.58|
|bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|37.05|± | 1.08|
|bigbench_ruin_names | 0|multiple_choice_grade|34.60|± | 2.25|
|bigbench_salient_translation_error_detection | 0|multiple_choice_grade|19.34|± | 1.25|
|bigbench_snarks | 0|multiple_choice_grade|46.96|± | 3.72|
|bigbench_sports_understanding | 0|multiple_choice_grade|58.11|± | 1.57|
|bigbench_temporal_sequences | 0|multiple_choice_grade|28.00|± | 1.42|
|bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|21.44|± | 1.16|
|bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|14.46|± | 0.84|
|bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|41.67|± | 2.85|

## llama-13B_blimp_0-shot.json

| Task |Version|Metric|Value| |Stderr|
|---------------------------------------------------------|------:|------|----:|---|-----:|
|blimp_adjunct_island | 0|acc | 33.8|± | 1.50|
|blimp_anaphor_gender_agreement | 0|acc | 57.6|± | 1.56|
|blimp_anaphor_number_agreement | 0|acc | 56.5|± | 1.57|
|blimp_animate_subject_passive | 0|acc | 65.1|± | 1.51|
|blimp_animate_subject_trans | 0|acc | 61.6|± | 1.54|
|blimp_causative | 0|acc | 35.9|± | 1.52|
|blimp_complex_NP_island | 0|acc | 30.3|± | 1.45|
|blimp_coordinate_structure_constraint_complex_left_branch| 0|acc | 34.5|± | 1.50|
|blimp_coordinate_structure_constraint_object_extraction | 0|acc | 27.9|± | 1.42|
|blimp_determiner_noun_agreement_1 | 0|acc | 34.1|± | 1.50|
|blimp_determiner_noun_agreement_2 | 0|acc | 36.1|± | 1.52|
|blimp_determiner_noun_agreement_irregular_1 | 0|acc | 35.6|± | 1.51|
|blimp_determiner_noun_agreement_irregular_2 | 0|acc | 36.9|± | 1.53|
|blimp_determiner_noun_agreement_with_adj_2 | 0|acc | 39.2|± | 1.54|
|blimp_determiner_noun_agreement_with_adj_irregular_1 | 0|acc | 34.2|± | 1.50|
|blimp_determiner_noun_agreement_with_adj_irregular_2 | 0|acc | 39.3|± | 1.55|
|blimp_determiner_noun_agreement_with_adjective_1 | 0|acc | 39.1|± | 1.54|
|blimp_distractor_agreement_relational_noun | 0|acc | 51.4|± | 1.58|
|blimp_distractor_agreement_relative_clause | 0|acc | 42.3|± | 1.56|
|blimp_drop_argument | 0|acc | 70.5|± | 1.44|
|blimp_ellipsis_n_bar_1 | 0|acc | 62.4|± | 1.53|
|blimp_ellipsis_n_bar_2 | 0|acc | 26.4|± | 1.39|
|blimp_existential_there_object_raising | 0|acc | 69.0|± | 1.46|
|blimp_existential_there_quantifiers_1 | 0|acc | 30.8|± | 1.46|
|blimp_existential_there_quantifiers_2 | 0|acc | 78.8|± | 1.29|
|blimp_existential_there_subject_raising | 0|acc | 70.1|± | 1.45|
|blimp_expletive_it_object_raising | 0|acc | 61.9|± | 1.54|
|blimp_inchoative | 0|acc | 47.4|± | 1.58|
|blimp_intransitive | 0|acc | 64.3|± | 1.52|
|blimp_irregular_past_participle_adjectives | 0|acc | 63.6|± | 1.52|
|blimp_irregular_past_participle_verbs | 0|acc | 31.4|± | 1.47|
|blimp_irregular_plural_subject_verb_agreement_1 | 0|acc | 51.8|± | 1.58|
|blimp_irregular_plural_subject_verb_agreement_2 | 0|acc | 50.4|± | 1.58|
|blimp_left_branch_island_echo_question | 0|acc | 49.0|± | 1.58|
|blimp_left_branch_island_simple_question | 0|acc | 41.1|± | 1.56|
|blimp_matrix_question_npi_licensor_present | 0|acc | 54.8|± | 1.57|
|blimp_npi_present_1 | 0|acc | 30.4|± | 1.46|
|blimp_npi_present_2 | 0|acc | 39.0|± | 1.54|
|blimp_only_npi_licensor_present | 0|acc | 73.1|± | 1.40|
|blimp_only_npi_scope | 0|acc | 27.8|± | 1.42|
|blimp_passive_1 | 0|acc | 52.9|± | 1.58|
|blimp_passive_2 | 0|acc | 52.6|± | 1.58|
|blimp_principle_A_c_command | 0|acc | 32.6|± | 1.48|
|blimp_principle_A_case_1 | 0|acc | 2.8|± | 0.52|
|blimp_principle_A_case_2 | 0|acc | 44.3|± | 1.57|
|blimp_principle_A_domain_1 | 0|acc | 32.4|± | 1.48|
|blimp_principle_A_domain_2 | 0|acc | 74.0|± | 1.39|
|blimp_principle_A_domain_3 | 0|acc | 56.3|± | 1.57|
|blimp_principle_A_reconstruction | 0|acc | 79.2|± | 1.28|
|blimp_regular_plural_subject_verb_agreement_1 | 0|acc | 56.0|± | 1.57|
|blimp_regular_plural_subject_verb_agreement_2 | 0|acc | 45.6|± | 1.58|
|blimp_sentential_negation_npi_licensor_present | 0|acc | 39.2|± | 1.54|
|blimp_sentential_negation_npi_scope | 0|acc | 63.8|± | 1.52|
|blimp_sentential_subject_island | 0|acc | 62.1|± | 1.53|
|blimp_superlative_quantifiers_1 | 0|acc | 52.2|± | 1.58|
|blimp_superlative_quantifiers_2 | 0|acc | 71.4|± | 1.43|
|blimp_tough_vs_raising_1 | 0|acc | 36.1|± | 1.52|
|blimp_tough_vs_raising_2 | 0|acc | 64.2|± | 1.52|
|blimp_transitive | 0|acc | 47.3|± | 1.58|
|blimp_wh_island | 0|acc | 50.6|± | 1.58|
|blimp_wh_questions_object_gap | 0|acc | 45.5|± | 1.58|
|blimp_wh_questions_subject_gap | 0|acc | 36.9|± | 1.53|
|blimp_wh_questions_subject_gap_long_distance | 0|acc | 40.8|± | 1.55|
|blimp_wh_vs_that_no_gap | 0|acc | 19.6|± | 1.26|
|blimp_wh_vs_that_no_gap_long_distance | 0|acc | 30.1|± | 1.45|
|blimp_wh_vs_that_with_gap | 0|acc | 84.7|± | 1.14|
|blimp_wh_vs_that_with_gap_long_distance | 0|acc | 69.2|± | 1.46|

## llama-13B_common_sense_reasoning_0-shot.json

| Task |Version| Metric |Value| |Stderr|
|-------------|------:|--------|----:|---|-----:|
|arc_challenge| 0|acc |43.94|± | 1.45|
| | |acc_norm|44.62|± | 1.45|
|arc_easy | 0|acc |74.58|± | 0.89|
| | |acc_norm|59.89|± | 1.01|
|boolq | 1|acc |68.50|± | 0.81|
|copa | 0|acc |90.00|± | 3.02|
|hellaswag | 0|acc |59.10|± | 0.49|
| | |acc_norm|76.24|± | 0.42|
|mc_taco | 0|em |10.96| | |
| | |f1 |47.53| | |
|openbookqa | 0|acc |30.60|± | 2.06|
| | |acc_norm|42.20|± | 2.21|
|piqa | 0|acc |78.84|± | 0.95|
| | |acc_norm|79.11|± | 0.95|
|prost | 0|acc |26.89|± | 0.32|
| | |acc_norm|30.52|± | 0.34|
|swag | 0|acc |56.73|± | 0.35|
| | |acc_norm|69.35|± | 0.33|
|winogrande | 0|acc |70.17|± | 1.29|
|wsc273 | 0|acc |86.08|± | 2.10|

## llama-13B_glue_0-shot.json

| Task |Version|Metric|Value| |Stderr|
|---------------|------:|------|----:|---|-----:|
|cola | 0|mcc | 0.00|± | 0.00|
|mnli | 0|acc |43.56|± | 0.50|
|mnli_mismatched| 0|acc |45.35|± | 0.50|
|mrpc | 0|acc |68.63|± | 2.30|
| | |f1 |81.34|± | 1.62|
|qnli | 0|acc |49.95|± | 0.68|
|qqp | 0|acc |36.79|± | 0.24|
| | |f1 |53.66|± | 0.26|
|rte | 0|acc |65.34|± | 2.86|
|sst | 0|acc |65.37|± | 1.61|
|wnli | 1|acc |46.48|± | 5.96|

## llama-13B_gsm8k_8-shot.json

|Task |Version|Metric|Value| |Stderr|
|-----|------:|------|----:|---|-----:|
|gsm8k| 0|acc |13.57|± | 0.94|

## llama-13B_human_alignment_0-shot.json

| Task |Version| Metric | Value | |Stderr|
|---------------------------------------|------:|---------------------|------:|---|-----:|
|crows_pairs_english_age | 0|likelihood_difference| 771.02|± | 93.66|
| | |pct_stereotype | 56.04|± | 5.23|
|crows_pairs_english_autre | 0|likelihood_difference|1142.61|± |435.33|
| | |pct_stereotype | 36.36|± | 15.21|
|crows_pairs_english_disability | 0|likelihood_difference|1297.88|± |182.88|
| | |pct_stereotype | 35.38|± | 5.98|
|crows_pairs_english_gender | 0|likelihood_difference| 867.58|± | 65.49|
| | |pct_stereotype | 58.44|± | 2.76|
|crows_pairs_english_nationality | 0|likelihood_difference|1184.87|± | 83.43|
| | |pct_stereotype | 38.43|± | 3.32|
|crows_pairs_english_physical_appearance| 0|likelihood_difference| 752.95|± | 87.93|
| | |pct_stereotype | 47.22|± | 5.92|
|crows_pairs_english_race_color | 0|likelihood_difference| 985.84|± | 50.57|
| | |pct_stereotype | 50.20|± | 2.22|
|crows_pairs_english_religion | 0|likelihood_difference|1181.25|± |117.52|
| | |pct_stereotype | 49.55|± | 4.77|
|crows_pairs_english_sexual_orientation | 0|likelihood_difference|1072.24|± |115.61|
| | |pct_stereotype | 54.84|± | 5.19|
|crows_pairs_english_socioeconomic | 0|likelihood_difference|1122.24|± | 78.07|
| | |pct_stereotype | 50.53|± | 3.64|
|crows_pairs_french_age | 0|likelihood_difference|1310.14|± |112.01|
| | |pct_stereotype | 38.89|± | 5.17|
|crows_pairs_french_autre | 0|likelihood_difference| 994.23|± |314.84|
| | |pct_stereotype | 53.85|± | 14.39|
|crows_pairs_french_disability | 0|likelihood_difference|1732.39|± |182.40|
| | |pct_stereotype | 40.91|± | 6.10|
|crows_pairs_french_gender | 0|likelihood_difference|1079.15|± | 67.67|
| | |pct_stereotype | 51.40|± | 2.79|
|crows_pairs_french_nationality | 0|likelihood_difference|1633.10|± | 92.24|
| | |pct_stereotype | 31.62|± | 2.93|
|crows_pairs_french_physical_appearance | 0|likelihood_difference|1257.99|± |157.39|
| | |pct_stereotype | 52.78|± | 5.92|
|crows_pairs_french_race_color | 0|likelihood_difference|1192.74|± | 50.28|
| | |pct_stereotype | 35.00|± | 2.23|
|crows_pairs_french_religion | 0|likelihood_difference|1119.24|± |108.66|
| | |pct_stereotype | 59.13|± | 4.60|
|crows_pairs_french_sexual_orientation | 0|likelihood_difference|1755.49|± |118.03|
| | |pct_stereotype | 78.02|± | 4.36|
|crows_pairs_french_socioeconomic | 0|likelihood_difference|1279.15|± | 93.70|
| | |pct_stereotype | 35.71|± | 3.43|
|ethics_cm | 0|acc | 51.74|± | 0.80|
|ethics_deontology | 0|acc | 50.33|± | 0.83|
| | |em | 0.11| | |
|ethics_justice | 0|acc | 49.93|± | 0.96|
| | |em | 0.15| | |
|ethics_utilitarianism | 0|acc | 52.45|± | 0.72|
|ethics_utilitarianism_original | 0|acc | 98.07|± | 0.20|
|ethics_virtue | 0|acc | 20.32|± | 0.57|
| | |em | 0.00| | |
|toxigen | 0|acc | 42.66|± | 1.61|
| | |acc_norm | 43.19|± | 1.62|

## llama-13B_lambada_0-shot.json

| Task |Version|Metric| Value | | Stderr |
|----------------------|------:|------|---------:|---|--------:|
|lambada_openai | 0|ppl |1279051.05|± | 60995.63|
| | |acc | 0.00|± | 0.00|
|lambada_openai_cloze | 0|ppl | 204515.39|± | 9705.34|
| | |acc | 0.02|± | 0.02|
|lambada_openai_mt_de | 0|ppl |1310285.44|± | 71395.91|
| | |acc | 0.00|± | 0.00|
|lambada_openai_mt_en | 0|ppl |1279051.05|± | 60995.63|
| | |acc | 0.00|± | 0.00|
|lambada_openai_mt_es | 0|ppl |1980241.77|± |101614.20|
| | |acc | 0.00|± | 0.00|
|lambada_openai_mt_fr | 0|ppl |2461448.49|± |128013.99|
| | |acc | 0.00|± | 0.00|
|lambada_openai_mt_it | 0|ppl |4091504.35|± |218020.97|
| | |acc | 0.00|± | 0.00|
|lambada_standard | 0|ppl |1409048.00|± | 47832.88|
| | |acc | 0.00|± | 0.00|
|lambada_standard_cloze| 0|ppl |4235345.03|± |132892.57|
| | |acc | 0.00|± | 0.00|

## llama-13B_mathematical_reasoning_0-shot.json

| Task |Version| Metric |Value| |Stderr|
|-------------------------|------:|--------|----:|---|-----:|
|drop | 1|em | 3.88|± | 0.20|
| | |f1 |13.99|± | 0.25|
|gsm8k | 0|acc | 0.00|± | 0.00|
|math_algebra | 1|acc | 1.85|± | 0.39|
|math_asdiv | 0|acc | 0.00|± | 0.00|
|math_counting_and_prob | 1|acc | 1.48|± | 0.55|
|math_geometry | 1|acc | 1.25|± | 0.51|
|math_intermediate_algebra| 1|acc | 1.22|± | 0.37|
|math_num_theory | 1|acc | 1.48|± | 0.52|
|math_prealgebra | 1|acc | 2.87|± | 0.57|
|math_precalc | 1|acc | 1.10|± | 0.45|
|mathqa | 0|acc |28.44|± | 0.83|
| | |acc_norm|28.68|± | 0.83|

## llama-13B_mathematical_reasoning_few_shot_5-shot.json

| Task |Version| Metric |Value| |Stderr|
|-------------------------|------:|--------|----:|---|-----:|
|drop | 1|em | 1.71|± | 0.13|
| | |f1 | 2.45|± | 0.14|
|gsm8k | 0|acc | 0.00|± | 0.00|
|math_algebra | 1|acc | 0.00|± | 0.00|
|math_counting_and_prob | 1|acc | 0.21|± | 0.21|
|math_geometry | 1|acc | 0.00|± | 0.00|
|math_intermediate_algebra| 1|acc | 0.00|± | 0.00|
|math_num_theory | 1|acc | 0.19|± | 0.19|
|math_prealgebra | 1|acc | 0.11|± | 0.11|
|math_precalc | 1|acc | 0.00|± | 0.00|
|mathqa | 0|acc |29.98|± | 0.84|
| | |acc_norm|30.35|± | 0.84|

## llama-13B_mmlu_5-shot.json

| Task |Version| Metric |Value| |Stderr|
|-------------------------------------------------|------:|--------|----:|---|-----:|
|hendrycksTest-abstract_algebra | 0|acc |32.00|± | 4.69|
| | |acc_norm|30.00|± | 4.61|
|hendrycksTest-anatomy | 0|acc |42.96|± | 4.28|
| | |acc_norm|29.63|± | 3.94|
|hendrycksTest-astronomy | 0|acc |48.03|± | 4.07|
| | |acc_norm|48.03|± | 4.07|
|hendrycksTest-business_ethics | 0|acc |53.00|± | 5.02|
| | |acc_norm|44.00|± | 4.99|
|hendrycksTest-clinical_knowledge | 0|acc |46.04|± | 3.07|
| | |acc_norm|38.49|± | 2.99|
|hendrycksTest-college_biology | 0|acc |45.83|± | 4.17|
| | |acc_norm|32.64|± | 3.92|
|hendrycksTest-college_chemistry | 0|acc |31.00|± | 4.65|
| | |acc_norm|30.00|± | 4.61|
|hendrycksTest-college_computer_science | 0|acc |33.00|± | 4.73|
| | |acc_norm|28.00|± | 4.51|
|hendrycksTest-college_mathematics | 0|acc |29.00|± | 4.56|
| | |acc_norm|34.00|± | 4.76|
|hendrycksTest-college_medicine | 0|acc |42.77|± | 3.77|
| | |acc_norm|30.06|± | 3.50|
|hendrycksTest-college_physics | 0|acc |28.43|± | 4.49|
| | |acc_norm|35.29|± | 4.76|
|hendrycksTest-computer_security | 0|acc |57.00|± | 4.98|
| | |acc_norm|44.00|± | 4.99|
|hendrycksTest-conceptual_physics | 0|acc |42.13|± | 3.23|
| | |acc_norm|24.26|± | 2.80|
|hendrycksTest-econometrics | 0|acc |27.19|± | 4.19|
| | |acc_norm|26.32|± | 4.14|
|hendrycksTest-electrical_engineering | 0|acc |41.38|± | 4.10|
| | |acc_norm|34.48|± | 3.96|
|hendrycksTest-elementary_mathematics | 0|acc |36.77|± | 2.48|
| | |acc_norm|32.80|± | 2.42|
|hendrycksTest-formal_logic | 0|acc |32.54|± | 4.19|
| | |acc_norm|34.13|± | 4.24|
|hendrycksTest-global_facts | 0|acc |34.00|± | 4.76|
| | |acc_norm|29.00|± | 4.56|
|hendrycksTest-high_school_biology | 0|acc |49.68|± | 2.84|
| | |acc_norm|36.13|± | 2.73|
|hendrycksTest-high_school_chemistry | 0|acc |31.03|± | 3.26|
| | |acc_norm|32.02|± | 3.28|
|hendrycksTest-high_school_computer_science | 0|acc |49.00|± | 5.02|
| | |acc_norm|41.00|± | 4.94|
|hendrycksTest-high_school_european_history | 0|acc |52.73|± | 3.90|
| | |acc_norm|49.70|± | 3.90|
|hendrycksTest-high_school_geography | 0|acc |57.58|± | 3.52|
| | |acc_norm|42.42|± | 3.52|
|hendrycksTest-high_school_government_and_politics| 0|acc |58.55|± | 3.56|
| | |acc_norm|38.86|± | 3.52|
|hendrycksTest-high_school_macroeconomics | 0|acc |37.69|± | 2.46|
| | |acc_norm|31.79|± | 2.36|
|hendrycksTest-high_school_mathematics | 0|acc |26.67|± | 2.70|
| | |acc_norm|31.85|± | 2.84|
|hendrycksTest-high_school_microeconomics | 0|acc |42.02|± | 3.21|
| | |acc_norm|40.76|± | 3.19|
|hendrycksTest-high_school_physics | 0|acc |27.15|± | 3.63|
| | |acc_norm|25.17|± | 3.54|
|hendrycksTest-high_school_psychology | 0|acc |60.73|± | 2.09|
| | |acc_norm|36.88|± | 2.07|
|hendrycksTest-high_school_statistics | 0|acc |38.43|± | 3.32|
| | |acc_norm|37.50|± | 3.30|
|hendrycksTest-high_school_us_history | 0|acc |52.45|± | 3.51|
| | |acc_norm|37.25|± | 3.39|
|hendrycksTest-high_school_world_history | 0|acc |49.79|± | 3.25|
| | |acc_norm|42.62|± | 3.22|
|hendrycksTest-human_aging | 0|acc |57.40|± | 3.32|
| | |acc_norm|33.63|± | 3.17|
|hendrycksTest-human_sexuality | 0|acc |54.96|± | 4.36|
| | |acc_norm|39.69|± | 4.29|
|hendrycksTest-international_law | 0|acc |56.20|± | 4.53|
| | |acc_norm|60.33|± | 4.47|
|hendrycksTest-jurisprudence | 0|acc |48.15|± | 4.83|
| | |acc_norm|50.00|± | 4.83|
|hendrycksTest-logical_fallacies | 0|acc |45.40|± | 3.91|
| | |acc_norm|36.81|± | 3.79|
|hendrycksTest-machine_learning | 0|acc |28.57|± | 4.29|
| | |acc_norm|29.46|± | 4.33|
|hendrycksTest-management | 0|acc |64.08|± | 4.75|
| | |acc_norm|41.75|± | 4.88|
|hendrycksTest-marketing | 0|acc |72.65|± | 2.92|
| | |acc_norm|61.54|± | 3.19|
|hendrycksTest-medical_genetics | 0|acc |49.00|± | 5.02|
| | |acc_norm|48.00|± | 5.02|
|hendrycksTest-miscellaneous | 0|acc |69.60|± | 1.64|
| | |acc_norm|48.53|± | 1.79|
|hendrycksTest-moral_disputes | 0|acc |44.80|± | 2.68|
| | |acc_norm|38.15|± | 2.62|
|hendrycksTest-moral_scenarios | 0|acc |28.27|± | 1.51|
| | |acc_norm|27.26|± | 1.49|
|hendrycksTest-nutrition | 0|acc |45.10|± | 2.85|
| | |acc_norm|46.73|± | 2.86|
|hendrycksTest-philosophy | 0|acc |45.98|± | 2.83|
| | |acc_norm|38.59|± | 2.76|
|hendrycksTest-prehistory | 0|acc |49.69|± | 2.78|
| | |acc_norm|34.57|± | 2.65|
|hendrycksTest-professional_accounting | 0|acc |29.79|± | 2.73|
| | |acc_norm|28.01|± | 2.68|
|hendrycksTest-professional_law | 0|acc |30.38|± | 1.17|
| | |acc_norm|30.90|± | 1.18|
|hendrycksTest-professional_medicine | 0|acc |39.34|± | 2.97|
| | |acc_norm|33.09|± | 2.86|
|hendrycksTest-professional_psychology | 0|acc |42.32|± | 2.00|
| | |acc_norm|33.01|± | 1.90|
|hendrycksTest-public_relations | 0|acc |54.55|± | 4.77|
| | |acc_norm|29.09|± | 4.35|
|hendrycksTest-security_studies | 0|acc |45.71|± | 3.19|
| | |acc_norm|37.55|± | 3.10|
|hendrycksTest-sociology | 0|acc |58.21|± | 3.49|
| | |acc_norm|45.77|± | 3.52|
|hendrycksTest-us_foreign_policy | 0|acc |68.00|± | 4.69|
| | |acc_norm|52.00|± | 5.02|
|hendrycksTest-virology | 0|acc |40.96|± | 3.83|
| | |acc_norm|30.12|± | 3.57|
|hendrycksTest-world_religions | 0|acc |74.27|± | 3.35|
| | |acc_norm|64.91|± | 3.66|

## llama-13B_pawsx_0-shot.json

| Task |Version|Metric|Value| |Stderr|
|--------|------:|------|----:|---|-----:|
|pawsx_de| 0|acc |52.95|± | 1.12|
|pawsx_en| 0|acc |53.70|± | 1.12|
|pawsx_es| 0|acc |52.10|± | 1.12|
|pawsx_fr| 0|acc |54.50|± | 1.11|
|pawsx_ja| 0|acc |45.00|± | 1.11|
|pawsx_ko| 0|acc |47.05|± | 1.12|
|pawsx_zh| 0|acc |45.20|± | 1.11|

## llama-13B_question_answering_0-shot.json

| Task |Version| Metric |Value| |Stderr|
|-------------|------:|------------|----:|---|-----:|
|headqa_en | 0|acc |34.43|± | 0.91|
| | |acc_norm |38.58|± | 0.93|
|headqa_es | 0|acc |30.56|± | 0.88|
| | |acc_norm |35.16|± | 0.91|
|logiqa | 0|acc |26.42|± | 1.73|
| | |acc_norm |32.10|± | 1.83|
|squad2 | 1|exact |16.44| | |
| | |f1 |24.06| | |
| | |HasAns_exact|21.09| | |
| | |HasAns_f1 |36.35| | |
| | |NoAns_exact |11.81| | |
| | |NoAns_f1 |11.81| | |
| | |best_exact |50.07| | |
| | |best_f1 |50.07| | |
|triviaqa | 1|acc | 0.00|± | 0.00|
|truthfulqa_mc| 1|mc1 |25.83|± | 1.53|
| | |mc2 |39.88|± | 1.37|
|webqs | 0|acc | 0.00|± | 0.00|

## llama-13B_reading_comprehension_0-shot.json

|Task|Version|Metric|Value| |Stderr|
|----|------:|------|----:|---|-----:|
|coqa| 1|f1 |77.04|± | 1.42|
| | |em |63.70|± | 1.85|
|drop| 1|em | 3.59|± | 0.19|
| | |f1 |13.38|± | 0.24|
|race| 1|acc |39.33|± | 1.51|

## llama-13B_superglue_0-shot.json

| Task |Version|Metric|Value| |Stderr|
|-------|------:|------|----:|---|-----:|
|boolq | 1|acc |68.44|± | 0.81|
|cb | 1|acc |48.21|± | 6.74|
| | |f1 |38.82| | |
|copa | 0|acc |90.00|± | 3.02|
|multirc| 1|acc | 1.57|± | 0.40|
|record | 0|f1 |92.32|± | 0.26|
| | |em |91.54|± | 0.28|
|wic | 0|acc |49.84|± | 1.98|
|wsc | 0|acc |35.58|± | 4.72|

## llama-13B_xcopa_0-shot.json

| Task |Version|Metric|Value| |Stderr|
|--------|------:|------|----:|---|-----:|
|xcopa_et| 0|acc | 48.2|± | 2.24|
|xcopa_ht| 0|acc | 52.8|± | 2.23|
|xcopa_id| 0|acc | 57.8|± | 2.21|
|xcopa_it| 0|acc | 67.2|± | 2.10|
|xcopa_qu| 0|acc | 50.2|± | 2.24|
|xcopa_sw| 0|acc | 51.2|± | 2.24|
|xcopa_ta| 0|acc | 54.4|± | 2.23|
|xcopa_th| 0|acc | 54.6|± | 2.23|
|xcopa_tr| 0|acc | 53.0|± | 2.23|
|xcopa_vi| 0|acc | 53.8|± | 2.23|
|xcopa_zh| 0|acc | 58.4|± | 2.21|

## llama-13B_xnli_0-shot.json

| Task |Version|Metric|Value| |Stderr|
|-------|------:|------|----:|---|-----:|
|xnli_ar| 0|acc |34.07|± | 0.67|
|xnli_bg| 0|acc |34.21|± | 0.67|
|xnli_de| 0|acc |35.25|± | 0.68|
|xnli_el| 0|acc |34.69|± | 0.67|
|xnli_en| 0|acc |35.63|± | 0.68|
|xnli_es| 0|acc |33.49|± | 0.67|
|xnli_fr| 0|acc |33.49|± | 0.67|
|xnli_hi| 0|acc |35.59|± | 0.68|
|xnli_ru| 0|acc |33.79|± | 0.67|
|xnli_sw| 0|acc |33.15|± | 0.67|
|xnli_th| 0|acc |34.83|± | 0.67|
|xnli_tr| 0|acc |33.99|± | 0.67|
|xnli_ur| 0|acc |34.21|± | 0.67|
|xnli_vi| 0|acc |34.21|± | 0.67|
|xnli_zh| 0|acc |34.47|± | 0.67|

## llama-13B_xstory_cloze_0-shot.json

| Task |Version|Metric|Value| |Stderr|
|---------------|------:|------|----:|---|-----:|
|xstory_cloze_ar| 0|acc |49.70|± | 1.29|
|xstory_cloze_en| 0|acc |77.30|± | 1.08|
|xstory_cloze_es| 0|acc |69.42|± | 1.19|
|xstory_cloze_eu| 0|acc |50.69|± | 1.29|
|xstory_cloze_hi| 0|acc |52.35|± | 1.29|
|xstory_cloze_id| 0|acc |55.26|± | 1.28|
|xstory_cloze_my| 0|acc |47.78|± | 1.29|
|xstory_cloze_ru| 0|acc |63.40|± | 1.24|
|xstory_cloze_sw| 0|acc |49.90|± | 1.29|
|xstory_cloze_te| 0|acc |53.34|± | 1.28|
|xstory_cloze_zh| 0|acc |56.45|± | 1.28|

## llama-13B_xwinograd_0-shot.json

| Task |Version|Metric|Value| |Stderr|
|------------|------:|------|----:|---|-----:|
|xwinograd_en| 0|acc |86.75|± | 0.70|
|xwinograd_fr| 0|acc |68.67|± | 5.12|
|xwinograd_jp| 0|acc |59.85|± | 1.58|
|xwinograd_pt| 0|acc |71.48|± | 2.79|
|xwinograd_ru| 0|acc |70.79|± | 2.57|
|xwinograd_zh| 0|acc |70.04|± | 2.04|