Commit 5552c8dc authored by thefazzer

Merge remote-tracking branch 'origin/master' into fazz/refactor-task-coqa

parents c0862026 826d90e2
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: Python application

on:
  push:
    branches: [ master ]
  pull_request:
    branches: [ master ]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: Cache
      uses: actions/cache@v2.1.3
      with:
        # A list of files, directories, and wildcard patterns to cache and restore
        path: |
          data
          ~/.cache
        # An explicit key for restoring and saving the cache
        key: evaldata-cache
    - name: Set up Python 3.9
      uses: actions/setup-python@v2
      with:
        python-version: 3.9
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install flake8 pytest pytest-cov
        pip install -e .
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
    - name: Lint with flake8
      run: |
        # stop the build if there are Python syntax errors or undefined names
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
    - name: Test with pytest
      run: |
        pytest --cov=lm_eval/ tests/
    - name: Upload to codecov
      run: |
        bash <(curl -s https://codecov.io/bash)
# Evaluation Harness for Large Language Models
![](https://github.com/EleutherAI/lm-evaluation-harness/workflows/Python%20application/badge.svg)
[![codecov](https://codecov.io/gh/EleutherAI/lm-evaluation-harness/branch/master/graph/badge.svg?token=JSG3O2427J)](https://codecov.io/gh/EleutherAI/lm-evaluation-harness)
## Overview
The goal of this project is to build a set of tools for evaluating LMs on typical NLU tasks, based on the evaluation of GPT-3 described in https://arxiv.org/pdf/2005.14165.pdf. Following the initial description, this repo should support 3 functions:
......
...@@ -20,4 +20,4 @@ class DummyLM(LM):
    def greedy_until(self, requests):
        # TODO: implement
        pass
...@@ -43,4 +43,4 @@ class GPT2LM(LM):
    def greedy_until(self, requests):
        # TODO: implement
        pass
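Both backends leave `greedy_until` as a TODO. In the GPT-3 evaluation setup this method is expected to decode greedily from a context until a stop sequence is produced. A toy sketch of that contract, with a stand-in `next_token` function in place of a real model (all names here are hypothetical, not the repo's API):

```python
def greedy_until(prompt, next_token, stop, max_tokens=20):
    """Greedily extend `prompt` one token at a time until `stop` appears
    in the generated suffix or `max_tokens` tokens have been produced."""
    out = prompt
    for _ in range(max_tokens):
        tok = next_token(out)  # stand-in for an argmax over model logits
        out += tok
        if stop in out[len(prompt):]:
            break
    # Return only the generated continuation, truncated at the stop sequence.
    return out[len(prompt):].split(stop)[0]

# Toy deterministic "model": emits a fixed continuation one character at a time.
continuation = iter("hello.\nmore")
result = greedy_until("Q: hi\nA:", lambda s: next(continuation), stop="\n")
assert result == "hello."
```

A real implementation would additionally batch requests and detokenize properly; this only illustrates the stopping condition.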
...@@ -46,7 +46,7 @@ TASK_REGISTRY = {
    "lambada": lambada.LAMBADA,
    "piqa": piqa.PiQA,
-    "triviaqa": triviaqa.TriviaQA,
+    # "triviaqa": triviaqa.TriviaQA,
    # "arc_easy": arc.ARCEasy, # not implemented yet
    # "arc_challenge": arc.ARCChallenge, # not implemented yet
    # "quac": quac.QuAC, # not implemented yet
......
...@@ -70,4 +70,4 @@ class ARCEasy(HFTask):

class ARCChallenge(ARCEasy):
    DATASET_PATH = "ai2_arc"
    DATASET_NAME = "ARC-Challenge"
...@@ -104,4 +104,4 @@ class DROP(Dataset):
        whether a higher value of the submetric is better
        """
        # TODO: implement evaluation.
        raise NotImplementedError('Evaluation not implemented')
...@@ -67,4 +67,4 @@ class LAMBADA(Dataset):
        return {
            'perplexity': False,
            'accuracy': True
        }
from . common import HFTask
from itertools import islice
+import random

class NaturalQs(HFTask):
+    # TODO: naturalqs has a *really* large train set that huggingface just
+    # automatically downloads even if you dont use it. we should try and only
+    # download the val set and not even bother with the train set.
    DATASET_PATH = "natural_questions"
    DATASET_NAME = None
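The TODO added above amounts to making split downloads lazy: only fetch a split the first time a task actually asks for it, so an eval run never pays for the unused train set. A minimal sketch of that pattern (the `_download_split`/`get_split` names and the list-of-docs return type are illustrative, not the harness's API):

```python
from functools import lru_cache

downloads = []  # records which splits were actually fetched

def _download_split(name):
    # Stand-in for the real (expensive) dataset download.
    downloads.append(name)
    return [f"{name}-doc"]

@lru_cache(maxsize=None)
def get_split(name):
    # First request triggers the download; later requests hit the cache.
    # Splits that are never requested are never downloaded.
    return _download_split(name)

get_split("validation")
assert downloads == ["validation"]  # the train split was never fetched
```

Newer `datasets` releases also offer streaming loads, which would address the same problem at the library level.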
...@@ -87,4 +92,4 @@ class NaturalQs(HFTask):
        whether a higher value of the submetric is better
        """
        # TODO: implement evaluation.
        raise NotImplementedError('Evaluation not implemented')
...@@ -95,4 +95,4 @@ class OpenBookQA(HFTask):
        whether a higher value of the submetric is better
        """
        # TODO: implement evaluation.
        raise NotImplementedError('Evaluation not implemented')
...@@ -74,4 +74,4 @@ class PiQA(Dataset):
    def higher_is_better(self):
        return {
            'acc': True
        }
...@@ -103,4 +103,4 @@ class QuAC(Dataset):
        whether a higher value of the submetric is better
        """
        # TODO: implement evaluation.
        raise NotImplementedError('Evaluation not implemented')
...@@ -23,7 +23,8 @@ class RACE(HFTask):
        return True

    def _collate_data(self, set):
-        if set in self.cache: return self.cache[set]
+        if set in self.cache:
+            return self.cache[set]
        # One big issue with HF's implementation of this dataset: it makes a
        # separate document for each question; meanwhile, in the GPT3 paper it
        # is shown that one document is made per passage.
......
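The comment in `_collate_data` describes the fix being applied: regroup HF's one-record-per-question layout into one document per passage, matching the GPT-3 paper. A minimal sketch of that regrouping (the `article`/`question`/`options`/`answer` field names are assumed from the HF `race` schema):

```python
from collections import defaultdict

def collate_by_passage(records):
    """Group per-question records into one document per passage.
    Assumes each record repeats its passage under "article" and carries
    its own "question"/"options"/"answer" fields."""
    docs = defaultdict(list)
    for r in records:
        docs[r["article"]].append(
            {"question": r["question"], "options": r["options"], "answer": r["answer"]}
        )
    # dicts preserve insertion order (Python 3.7+), so passage order is kept.
    return [{"article": a, "problems": qs} for a, qs in docs.items()]

records = [
    {"article": "P1", "question": "q1", "options": ["a", "b"], "answer": "A"},
    {"article": "P1", "question": "q2", "options": ["c", "d"], "answer": "B"},
    {"article": "P2", "question": "q3", "options": ["e", "f"], "answer": "A"},
]
docs = collate_by_passage(records)
assert len(docs) == 2
assert [p["question"] for p in docs[0]["problems"]] == ["q1", "q2"]
```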
...@@ -83,4 +83,4 @@ class SQuAD(HFTask):
        whether a higher value of the submetric is better
        """
        # TODO: implement evaluation.
        raise NotImplementedError('Evaluation not implemented')
...@@ -89,4 +89,4 @@ class StoryCloze(Dataset):
        whether a higher value of the submetric is better
        """
        # TODO: implement evaluation.
        raise NotImplementedError('Evaluation not implemented')
...@@ -21,7 +21,7 @@ class TriviaQA(Dataset):
        return True

    def has_test_docs(self):
-        return True
+        return False

    def training_docs(self):
        return json.load(open('data/triviaqa/triviaqa-unfiltered/unfiltered-web-train.json'))['Data']
...@@ -74,4 +74,4 @@ class TriviaQA(Dataset):
    def higher_is_better(self):
        return {
            "acc": True
        }
...@@ -70,4 +70,4 @@ class WebQs(HFTask):
        whether a higher value of the submetric is better
        """
        # TODO: implement evaluation.
        raise NotImplementedError('Evaluation not implemented')
...@@ -14,9 +14,11 @@ class WikiText103(NLP_TASK):
    def doc_to_text(self, doc):
        # TODO: implement
+        pass

    def doc_to_target(self, doc):
        # TODO: implement
+        pass

    def construct_requests(self, doc, ctx):
        """ Uses RequestFactory to construct Requests and returns an iterable of
...@@ -74,9 +76,11 @@ class WikiText2(NLP_TASK):
    def doc_to_text(self, doc):
        # TODO: implement
+        pass

    def doc_to_target(self, doc):
        # TODO: implement
+        pass

    def construct_requests(self, doc, ctx):
        """ Uses RequestFactory to construct Requests and returns an iterable of
...@@ -121,4 +125,4 @@ class WikiText2(NLP_TASK):
        whether a higher value of the submetric is better
        """
        # TODO: implement evaluation.
        raise NotImplementedError('Evaluation not implemented')
...@@ -90,4 +90,4 @@ class Winogrande(HFTask):
        whether a higher value of the submetric is better
        """
        # TODO: implement evaluation.
        raise NotImplementedError('Evaluation not implemented')
...@@ -28,4 +28,4 @@ def simple_parse_args_string(args_string):

def join_iters(iters):
    for iter in iters:
        yield from iter
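`join_iters` lazily flattens one level of nesting, yielding every element of every inner iterable in order. A self-contained sketch of the same helper with a usage example (it behaves like the stdlib's `itertools.chain.from_iterable`):

```python
from itertools import chain

def join_iters(iters):
    # Lazily flatten one level of nesting: yield each element of each
    # inner iterable, in order, without materializing anything.
    for it in iters:
        yield from it

assert list(join_iters([[1, 2], (3,), range(4, 6)])) == [1, 2, 3, 4, 5]
# Stdlib equivalent:
assert list(chain.from_iterable([[1, 2], (3,), range(4, 6)])) == [1, 2, 3, 4, 5]
```

Being a generator, it also works on infinite or streaming inputs, which suits the harness's per-document request pipelines.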