Commit f1ec7b06 authored by Jason Phang

Update docs

parent a85ad214
@@ -9,6 +9,83 @@ The goal of this project is to build a set of tools for evaluating LMs on typica
The raw Google doc can be found here: https://docs.google.com/document/d/177dwJpH8GHebISXYZSn4NL98sXdCtQMH82b7O5F7jmw/edit?usp=sharing
## Usage
### Evaluate a task
To evaluate a model (e.g. GPT-2) on NLU tasks (e.g. RTE, the Winograd Schema Challenge), you can run the following command:
```bash
python main.py \
    --model gpt2 \
    --model_args device=cuda:0 \
    --tasks rte,wsc \
    --provide_description \
    --num_fewshot 2
```
If you have access to an OpenAI API key, you can also evaluate GPT-3 on various tasks with the following command:
```bash
export OPENAI_API_SECRET_KEY=YOUR_KEY_HERE
python main.py \
    --model gpt3 \
    --tasks rte,wsc \
    --provide_description \
    --num_fewshot 2
```
To inspect what the LM inputs look like, you can run the following command:
```bash
python write_out.py \
    --tasks all_tasks \
    --provide_description \
    --num_fewshot 5 \
    --num_examples 10 \
    --output_base_path /path/to/output/folder
```
This will write out one text file for each task.
### Code Structure
There are two major components of the library:
1. LMs (language models), e.g. GPT-2, GPT-3
2. Tasks, e.g. MNLI, RTE, SQuAD (coming soon)
Both LMs (`lm_eval.models`) and Tasks (`lm_eval.tasks`) are kept in a registry data structure for easy CLI instantiation.
**If you want to extend either models or tasks, simply add a new LM or Task subclass and decorate it with the registry decorator**.
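For illustration, the registry pattern might look like the following minimal sketch. All names here (`TASK_REGISTRY`, `register_task`, the stub `Task` class) are assumptions made for the example, not the library's confirmed API:
```python
# Minimal sketch of the registry pattern described above.
# NOTE: TASK_REGISTRY, register_task, and Task are illustrative
# assumptions, not the actual lm_eval API.
TASK_REGISTRY = {}

def register_task(name):
    """Decorator mapping a CLI task name to its Task subclass."""
    def decorator(cls):
        TASK_REGISTRY[name] = cls
        return cls
    return decorator

class Task:
    """Stand-in for the Task interface."""
    def evaluate(self, docs, lm, provide_description, num_fewshot):
        raise NotImplementedError

@register_task("my_task")
class MyTask(Task):
    def evaluate(self, docs, lm, provide_description, num_fewshot):
        ...  # score the LM on each document and aggregate metrics

# The CLI can then look up tasks by name, e.g. for --tasks my_task:
task = TASK_REGISTRY["my_task"]()
```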
**GLUE**
- [X] CoLA
- [X] MNLI
- [X] MRPC
- [X] RTE
- [X] QNLI
- [X] QQP
- [X] STS-B
- [X] SST-2
- [X] WNLI
**SuperGLUE**
- [X] BoolQ
- [X] CommitmentBank
- [X] COPA
- [ ] MultiRC
- [ ] ReCoRD
- [X] RTE (See: GLUE)
- [X] WiC
- [X] WSC
**QA Tasks**
- [ ] CoQA
- [ ] DROP
## Description
### 1. LM Evaluation
Given an LM, we want to evaluate it on a wide range of NLU tasks. We should at least cover the set of tasks in the GPT-3 paper, plus any other relevant tasks/benchmarks. We will follow the GPT-3 format of (a) zero-shot, (b) one-shot, and (c) few-shot evaluation, as sketched below.
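To make the prompt format concrete, here is a minimal sketch of how a zero-/one-/few-shot input might be assembled (the function and field names are hypothetical, not taken from this codebase):
```python
# Sketch of GPT-3-style prompt assembly; all names are hypothetical.
def build_prompt(description, solved_examples, query, num_fewshot):
    """num_fewshot=0 gives zero-shot, 1 one-shot, and k>1 few-shot."""
    parts = []
    if description:  # included when --provide_description is set
        parts.append(description)
    for ex in solved_examples[:num_fewshot]:
        parts.append(f"{ex['input']}\n{ex['target']}")
    parts.append(query)  # the unsolved example the LM must complete
    return "\n\n".join(parts)

# Example: a two-shot prompt for an RTE-style task
prompt = build_prompt(
    "Determine whether the hypothesis follows from the premise.",
    [{"input": "Premise: ... Hypothesis: ...", "target": "entailment"},
     {"input": "Premise: ... Hypothesis: ...", "target": "not_entailment"}],
    "Premise: ... Hypothesis: ...",
    num_fewshot=2,
)
print(prompt)
```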
@@ -49,8 +126,5 @@ This part is the easiest. I guess we just write out some text files containing t
## Summary (need to convert from Google Docs at some point):
https://docs.google.com/document/d/177dwJpH8GHebISXYZSn4NL98sXdCtQMH82b7O5F7jmw/edit?usp=sharing
## Current Datasets:
- [ ] CoQA
- [ ] DROP
## Current Tasks:
@@ -19,12 +19,12 @@ class GPT2LM(LM):
         return cls(device=args.get("device", "cpu"))
 
     def generate(self, context, max_gen_length, truncate=True):
-        context = torch.tensor([self.tokenizer.encode(context.strip())], dtype=torch.long).to(self.device)
+        context_tensor = torch.tensor([self.tokenizer.encode(context.strip())], dtype=torch.long).to(self.device)
         res = self.gpt2.generate(
-            context,
+            context_tensor,
             eos_token_id=self.tokenizer.eos_token_id,
             do_sample=False,
-            max_length=max_gen_length,
+            max_length=self.num_tokens(context) + max_gen_length,
         )
         # chop off the prompt and the final eos token
......
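Note on the `gpt2.generate` change above: in Hugging Face Transformers, `max_length` caps the total sequence length including the prompt, so passing `max_gen_length` alone would leave less and less room for generation as prompts grow; adding the prompt's token count makes `max_gen_length` the budget for newly generated tokens. Renaming the encoded tensor to `context_tensor` also keeps the original `context` string intact for the `self.num_tokens(context)` call.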
@@ -40,7 +40,7 @@ def main():
            num_fewshot=args.num_fewshot,
        )
        results[task_name] = result
    dumped = json.dumps(results, indent=2)
    print(dumped)
    if args.output_path:
......