gaoqiong / lm-evaluation-harness · Commit 5552c8dc
Authored Feb 02, 2021 by thefazzer

Merge remote-tracking branch 'origin/master' into fazz/refactor-task-coqa

Parents: c0862026, 826d90e2
Changes: 23 · Showing 20 changed files with 81 additions and 19 deletions (+81 −19)
.github/workflows/python-app.yml   +49 −0
README.md                          +3  −0
lm_eval/models/dummy.py            +1  −1
lm_eval/models/gpt2.py             +1  −1
lm_eval/tasks/__init__.py          +1  −1
lm_eval/tasks/arc.py               +1  −1
lm_eval/tasks/drop.py              +1  −1
lm_eval/tasks/lambada.py           +1  −1
lm_eval/tasks/naturalqs.py         +6  −1
lm_eval/tasks/openbookqa.py        +1  −1
lm_eval/tasks/piqa.py              +1  −1
lm_eval/tasks/quac.py              +1  −1
lm_eval/tasks/race.py              +2  −1
lm_eval/tasks/squad.py             +1  −1
lm_eval/tasks/storycloze.py        +1  −1
lm_eval/tasks/triviaqa.py          +2  −2
lm_eval/tasks/webqs.py             +1  −1
lm_eval/tasks/wikitext.py          +5  −1
lm_eval/tasks/winogrande.py        +1  −1
lm_eval/utils.py                   +1  −1
.github/workflows/python-app.yml (new file, mode 100644)

# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: Python application

on:
  push:
    branches: [ master ]
  pull_request:
    branches: [ master ]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: Cache
      uses: actions/cache@v2.1.3
      with:
        # A list of files, directories, and wildcard patterns to cache and restore
        path: |
          data
          ~/.cache
        # An explicit key for restoring and saving the cache
        key: evaldata-cache
    - name: Set up Python 3.9
      uses: actions/setup-python@v2
      with:
        python-version: 3.9
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install flake8 pytest pytest-cov
        pip install -e .
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
    - name: Lint with flake8
      run: |
        # stop the build if there are Python syntax errors or undefined names
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
    - name: Test with pytest
      run: |
        pytest --cov=lm_eval/ tests/
    - name: Upload to codecov
      run: |
        bash <(curl -s https://codecov.io/bash)
\ No newline at end of file
README.md

# Evaluation Harness for Large Language Models

[codecov coverage badge](https://codecov.io/gh/EleutherAI/lm-evaluation-harness)

## Overview
The goal of this project is to build a set of tools for evaluating LMs on typical NLU tasks, based on evaluation of GPT-3 as described in https://arxiv.org/pdf/2005.14165.pdf. Following the initial description, this repo should support 3 functions:
...
lm_eval/models/dummy.py

@@ -20,4 +20,4 @@ class DummyLM(LM):
    def greedy_until(self, requests):
        # TODO: implement
-       pass
\ No newline at end of file
+       pass
lm_eval/models/gpt2.py

@@ -43,4 +43,4 @@ class GPT2LM(LM):
    def greedy_until(self, requests):
        # TODO: implement
-       pass
\ No newline at end of file
+       pass
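Note that greedy_until is still a TODO stub in this commit for both DummyLM and GPT2LM. Purely as an illustration of one plausible shape, here is a sketch that assumes requests is an iterable of (context, until) pairs and that the class holds a HuggingFace tokenizer and model as self.tokenizer and self.gpt2 — all of these names are assumptions, not the repository's eventual implementation.

# Illustrative sketch only -- not the implementation in this commit.
# Assumes `requests` yields (context, until) pairs and that GPT2LM exposes
# `self.tokenizer` and `self.gpt2` from HuggingFace transformers.
def greedy_until(self, requests):
    results = []
    for context, until in requests:
        input_ids = self.tokenizer.encode(context, return_tensors="pt")
        # Greedy decoding: no sampling, bounded number of new tokens.
        output = self.gpt2.generate(
            input_ids,
            max_length=input_ids.shape[1] + 256,
            do_sample=False,
        )
        continuation = self.tokenizer.decode(output[0][input_ids.shape[1]:])
        # Cut at the first occurrence of the stop sequence, if any.
        if until in continuation:
            continuation = continuation.split(until)[0]
        results.append(continuation)
    return results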
lm_eval/tasks/__init__.py

@@ -46,7 +46,7 @@ TASK_REGISTRY = {
    "lambada": lambada.LAMBADA,
    "piqa": piqa.PiQA,
-   # "triviaqa": triviaqa.TriviaQA,
+   "triviaqa": triviaqa.TriviaQA,
    # "arc_easy": arc.ARCEasy, # not implemented yet
    # "arc_challenge": arc.ARCChallenge, # not implemented yet
    # "quac": quac.QuAC, # not implemented yet
...
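Uncommenting the registry entry above is what makes the task addressable by name. For orientation, a minimal sketch of how such an entry is typically consumed; the direct dictionary lookup mirrors the hunk, while the no-argument constructor is an assumption since the constructor is not shown in this diff.

from lm_eval import tasks

# TASK_REGISTRY maps task names to task classes, as shown in the hunk above.
task_class = tasks.TASK_REGISTRY["triviaqa"]
triviaqa_task = task_class()  # assumes a no-argument constructor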
lm_eval/tasks/arc.py

@@ -70,4 +70,4 @@ class ARCEasy(HFTask):

class ARCChallenge(ARCEasy):
    DATASET_PATH = "ai2_arc"
-   DATASET_NAME = "ARC-Challenge"
\ No newline at end of file
+   DATASET_NAME = "ARC-Challenge"
lm_eval/tasks/drop.py

@@ -104,4 +104,4 @@ class DROP(Dataset):
        whether a higher value of the submetric is better
        """
        # TODO: implement evaluation.
-       raise NotImplementedError('Evaluation not implemented')
\ No newline at end of file
+       raise NotImplementedError('Evaluation not implemented')
lm_eval/tasks/lambada.py

@@ -67,4 +67,4 @@ class LAMBADA(Dataset):
        return {
            'perplexity': False,
            'accuracy': True
-       }
\ No newline at end of file
+       }
lm_eval/tasks/naturalqs.py

from .common import HFTask
from itertools import islice
import random


class NaturalQs(HFTask):
    # TODO: naturalqs has a *really* large train set that huggingface just
    # automatically downloads even if you dont use it. we should try and only
    # download the val set and not even bother with the train set.
    DATASET_PATH = "natural_questions"
    DATASET_NAME = None
...

@@ -87,4 +92,4 @@ class NaturalQs(HFTask):
        whether a higher value of the submetric is better
        """
        # TODO: implement evaluation.
-       raise NotImplementedError('Evaluation not implemented')
\ No newline at end of file
+       raise NotImplementedError('Evaluation not implemented')
lm_eval/tasks/openbookqa.py

@@ -95,4 +95,4 @@ class OpenBookQA(HFTask):
        whether a higher value of the submetric is better
        """
        # TODO: implement evaluation.
-       raise NotImplementedError('Evaluation not implemented')
\ No newline at end of file
+       raise NotImplementedError('Evaluation not implemented')
lm_eval/tasks/piqa.py

@@ -74,4 +74,4 @@ class PiQA(Dataset):
    def higher_is_better(self):
        return {
            'acc': True
-       }
\ No newline at end of file
+       }
lm_eval/tasks/quac.py

@@ -103,4 +103,4 @@ class QuAC(Dataset):
        whether a higher value of the submetric is better
        """
        # TODO: implement evaluation.
-       raise NotImplementedError('Evaluation not implemented')
\ No newline at end of file
+       raise NotImplementedError('Evaluation not implemented')
lm_eval/tasks/race.py

@@ -23,7 +23,8 @@ class RACE(HFTask):
        return True

    def _collate_data(self, set):
        if set in self.cache:
            return self.cache[set]
        # One big issue with HF's implementation of this dataset: it makes a
        # separate document for each question; meanwhile, in the GPT3 paper it
        # is shown that one document is made per passage.
...
lm_eval/tasks/squad.py

@@ -83,4 +83,4 @@ class SQuAD(HFTask):
        whether a higher value of the submetric is better
        """
        # TODO: implement evaluation.
-       raise NotImplementedError('Evaluation not implemented')
\ No newline at end of file
+       raise NotImplementedError('Evaluation not implemented')
lm_eval/tasks/storycloze.py

@@ -89,4 +89,4 @@ class StoryCloze(Dataset):
        whether a higher value of the submetric is better
        """
        # TODO: implement evaluation.
-       raise NotImplementedError('Evaluation not implemented')
\ No newline at end of file
+       raise NotImplementedError('Evaluation not implemented')
lm_eval/tasks/triviaqa.py

@@ -21,7 +21,7 @@ class TriviaQA(Dataset):
        return True

    def has_test_docs(self):
-       return True
+       return False

    def training_docs(self):
        return json.load(open('data/triviaqa/triviaqa-unfiltered/unfiltered-web-train.json'))['Data']
...

@@ -74,4 +74,4 @@ class TriviaQA(Dataset):
    def higher_is_better(self):
        return {
            "acc": True
-       }
\ No newline at end of file
+       }
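The first hunk flips has_test_docs from True to False, which matters because callers are expected to consult these flags before requesting documents. A minimal caller-side sketch of that pattern; has_validation_docs and validation_docs are assumed counterparts suggested by the surrounding context, and evaluate_docs is a hypothetical placeholder.

# Hypothetical caller-side sketch; `evaluate_docs` is not a function in this repo.
task = TriviaQA()
if task.has_test_docs():
    docs = task.test_docs()
elif task.has_validation_docs():
    docs = task.validation_docs()
else:
    docs = task.training_docs()
evaluate_docs(docs)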
lm_eval/tasks/webqs.py

@@ -70,4 +70,4 @@ class WebQs(HFTask):
        whether a higher value of the submetric is better
        """
        # TODO: implement evaluation.
-       raise NotImplementedError('Evaluation not implemented')
\ No newline at end of file
+       raise NotImplementedError('Evaluation not implemented')
lm_eval/tasks/wikitext.py

@@ -14,9 +14,11 @@ class WikiText103(NLP_TASK):
    def doc_to_text(self, doc):
        # TODO: implement
        pass

    def doc_to_target(self, doc):
        # TODO: implement
        pass

    def construct_requests(self, doc, ctx):
        """ Uses RequestFactory to construct Requests and returns an iterable of
...

@@ -74,9 +76,11 @@ class WikiText2(NLP_TASK):
    def doc_to_text(self, doc):
        # TODO: implement
        pass

    def doc_to_target(self, doc):
        # TODO: implement
        pass

    def construct_requests(self, doc, ctx):
        """ Uses RequestFactory to construct Requests and returns an iterable of
...

@@ -121,4 +125,4 @@ class WikiText2(NLP_TASK):
        whether a higher value of the submetric is better
        """
        # TODO: implement evaluation.
-       raise NotImplementedError('Evaluation not implemented')
\ No newline at end of file
+       raise NotImplementedError('Evaluation not implemented')
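doc_to_text and doc_to_target remain TODO stubs in both WikiText classes. For a perplexity-style corpus such as WikiText, one common convention (an assumption for illustration, not this repository's eventual choice) is an empty context with the full passage as the target whose log-likelihood is measured.

# Illustrative convention only; the eventual implementation may differ.
def doc_to_text(self, doc):
    # Perplexity scoring: no prompt precedes the passage.
    return ""

def doc_to_target(self, doc):
    # Score the whole passage; assumes the HF wikitext schema with a "text" field.
    return doc["text"]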
lm_eval/tasks/winogrande.py

@@ -90,4 +90,4 @@ class Winogrande(HFTask):
        whether a higher value of the submetric is better
        """
        # TODO: implement evaluation.
-       raise NotImplementedError('Evaluation not implemented')
\ No newline at end of file
+       raise NotImplementedError('Evaluation not implemented')
lm_eval/utils.py

@@ -28,4 +28,4 @@ def simple_parse_args_string(args_string):

def join_iters(iters):
    for iter in iters:
-       yield from iter
\ No newline at end of file
+       yield from iter
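join_iters, shown in full above, simply chains a sequence of iterables into one lazy stream; a quick usage example:

from lm_eval.utils import join_iters

# Flattens one level of nesting, lazily.
chunks = [[1, 2], [3], [4, 5]]
assert list(join_iters(chunks)) == [1, 2, 3, 4, 5]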