# Guide: adding a new task

Here we provide a step-by-step guide for adding a new task to the `bigcode-evaluation-harness` to evaluate code generation language models. The process is similar to adding tasks in [lm_evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), which inspired this repository, so this document is based on their [task_guide](https://github.com/EleutherAI/lm-evaluation-harness/edit/master/docs/task_guide.md). The `Task` class is the backbone of all tasks in this framework.

## Setup

If you haven't already, fork the main repo, clone it, create a branch with the name of your task, and install the project requirements in your environment:

```sh
# After forking...
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
git checkout -b <task-name>
pip install -r requirements.txt
```

## Creating Your Task File

From the `bigcode-evaluation-harness` project root, copy over the `new_task.py` template to `bigcode_eval/tasks`.

```sh
cp template/new_task.py bigcode_eval/tasks/<task-name>.py
```

## Task Heading

Open the file you've just created and add a multiline docstring on the first line with the following contents:

```python
"""
<Paper title>
<Paper PDF URL>

<Short description of task>

Homepage: <URL to task's homepage>
"""
```

## Data Handling

### Downloading your Data

All data downloading and management is handled through the HuggingFace (**HF**) [`datasets`](https://github.com/huggingface/datasets) API. So, if your dataset isn't already on the hub (see [catalog](https://huggingface.co/datasets)), please consider adding it to make it accessible to a wider user base by following this [new dataset guide](https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md).

Now that you have your HF dataset, you need to assign its path and name to your `Task` in the following fields:

```python
class TaskName(...):
    DATASET_PATH = "..."
    DATASET_NAME = "..."
```

where `DATASET_PATH` is the name of the dataset as listed on the HF Hub and `DATASET_NAME` is the name of the sub-task (configuration) of the benchmark. If your dataset does not contain any sub-tasks/subsets, just set `DATASET_NAME = None`.
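
For instance, the existing HumanEval task points to the `openai_humaneval` dataset, which has no sub-task:

```python
class HumanEval(Task):
    DATASET_PATH = "openai_humaneval"
    DATASET_NAME = None
```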

Next, you need to load the evaluation split of the dataset in the `get_dataset` method. For example:

```python
def get_dataset(self):
    return self.dataset["test"]
```

You might also need to redefine some arguments of the class, like `stop_words`, which defines the stop words used as stopping criteria during code generation, and `requires_execution`, which indicates whether the task requires code execution or not.

```python
    def __init__(self):
        super().__init__(
            stop_words=["\n"],
            requires_execution=True,
        )
```

### Processing Documents

Then you need to format your document into a single query prompt __without the answer__, to be sent to the language model, in the `get_prompt` method.

It takes a single example `doc` of type `dict` with `str` keys and values.

```python
def get_prompt(self, doc):
    return ""
```
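
As an illustration, assuming each `doc` stores a natural-language description under a hypothetical `"text"` key and a function signature under `"signature"` (adapt the keys to your dataset's schema), the prompt could be built as:

```python
def get_prompt(self, doc):
    # "text" and "signature" are placeholder column names; use your dataset's.
    description = doc["text"].strip()
    signature = doc["signature"].strip()
    # Signature followed by the description as a docstring, for code completion.
    return f'{signature}\n    """{description}"""\n'
```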

If the prompt involves few-shot examples, you first need to save them in a JSON file `<task_name>_few_shot_prompts.json` in `bigcode_eval/tasks/few_shot_examples` and then load them in the `fewshot_examples` method like this:

```python
def fewshot_examples(self):
    with open("bigcode_eval/tasks/few_shot_examples/<task_name>_few_shot_prompts.json", "r") as file:
        examples = json.load(file)
    return examples
```
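
How these examples are stitched into the final prompt is up to the task. A minimal sketch, assuming the JSON file stores one pre-formatted block of examples under a hypothetical `"prompt"` key and the dataset a `"text"` column:

```python
def get_prompt(self, doc):
    # Prepend the few-shot block to the query built from the current doc.
    # The "prompt" and "text" keys are placeholders for this sketch.
    examples = self.fewshot_examples()
    return examples["prompt"] + "\n" + doc["text"].strip()
```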

The prompt will be sent to the language model, and the generation will be evaluated against ground-truth solutions or unit tests. You need to load these from the `doc` in the `get_target` method.

```python
def get_target(self, doc):
    return ""
```
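
For execution-based tasks the target is typically the unit tests; for match-based tasks it is the reference solution. A minimal sketch, assuming the tests live under a hypothetical `"test"` column:

```python
def get_target(self, doc):
    # Unit tests for execution-based evaluation ("test" is a placeholder key);
    # return the ground-truth solution instead for match-based evaluation.
    return doc["test"]
```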

### Postprocessing & Evaluation

The solutions generated by the language model often require postprocessing to remove unnecessary text and get executable code. This is done in the `postprocess_generation` method. It takes as input the model generation `generation` and `idx`, the index in the dataset of the document the generation belongs to (the index is not needed in most cases).

```python
def postprocess_generation(self, generation, idx):
    return ""
```
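
A common pattern, shown here as an illustrative sketch rather than a required implementation, is to keep the prompt as a prefix and truncate the completion at the first stop word:

```python
def postprocess_generation(self, generation, idx):
    # Strip the prompt, cut the completion at the first stop word,
    # and return prompt + cleaned completion as executable code.
    prompt = self.get_prompt(self.get_dataset()[idx])
    completion = generation[len(prompt):]
    for stop_word in self.stop_words:
        completion = completion.split(stop_word)[0]
    return prompt + completion
```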

The evaluation happens in the `process_results` method. It takes as arguments the list of generations for all selected problems of the benchmark in `generations` and their references in `references`, and returns a dictionary of metrics and their values.

```python
def process_results(self, generations, references):
    return {}
```

You need to load your metric and run it. Check the Hugging Face `evaluate` [library](https://huggingface.co/docs/evaluate/index) for the available metrics; for example, [code_eval](https://huggingface.co/spaces/evaluate-metric/code_eval) for pass@k, [BLEU](https://huggingface.co/spaces/evaluate-metric/bleu) for BLEU score, and [apps_metric](https://huggingface.co/spaces/codeparrot/apps_metric) are already implemented. If you cannot find your desired metric, you can either add it to the `evaluate` library or implement it in the `bigcode_eval/tasks/custom_metrics` folder and import it from there.
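
For an execution-based task, a minimal sketch of `process_results` using the `code_eval` metric could look like the following (executing model-generated code requires the `HF_ALLOW_CODE_EVAL=1` environment variable, which the harness sets when code execution is allowed):

```python
import evaluate


def process_results(self, generations, references):
    # pass@k via the `code_eval` metric from the evaluate library:
    # `generations` is a list of candidate lists, `references` a list of tests.
    code_metric = evaluate.load("code_eval")
    pass_at_k, _ = code_metric.compute(
        references=references,
        predictions=generations,
        k=[1],
    )
    return pass_at_k
```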


### Registering Your Task

Now's a good time to register your task to expose it for usage. All you need to do is import your task module in `bigcode_eval/tasks/__init__.py` and provide an entry in the `TASK_REGISTRY` dictionary, with the key being the name of your benchmark task (in the form it will be referred to on the command line) and the value being the task class. See how it's done for other tasks in the [file](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/bigcode_eval/tasks/__init__.py).
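
For illustration only (the module and class names below are placeholders), an entry could look like this trimmed sketch of `__init__.py`:

```python
# bigcode_eval/tasks/__init__.py (excerpt, names are illustrative)
from . import my_task  # the module you added under bigcode_eval/tasks/

TASK_REGISTRY = {
    # ... existing tasks ...
    "my-task": my_task.MyTask,  # key used as the task name on the command line
}
```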

## Task submission

### Running Unit Tests

To run the entire test suite, use:

```sh
pytest
```

## Fine-tuning
Few-shot tasks are easier to conduct, but if you need to add a fine-tuning script for your task, you can create a folder for it in the `finetuning` folder and use training and evaluation scripts similar to those of the other tasks.

## Code formatting
You can format your changes and perform standard `black` checks:
```sh
black bigcode_eval/tasks/<task-name>.py
```
## Task documentation
Please document your task in the [docs](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/README.md) with the execution parameters advised in the literature, like it's done for the other benchmarks.

## Pull request
Please specify in your pull request whether you followed the original paper's approach to build the prompts or whether some changes were introduced (especially if you built few-shot examples). Ideally, evaluate some public models, compare the scores to the published results, and check that they match.

If there are no published results for your task, make sure the evaluation works properly by testing some samples with a good code generation model such as InCoder-1B. During the experiments you have the option to save `generation.json` and `references.json`; take a look to check that the generations are properly cleaned and, for match-based evaluations for example, that they are somewhat close to the references.
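
For reference, a small-scale test run could look like the command below (the flag names are assumed from the current `main.py`; check `python main.py --help` in your checkout, as options may differ):

```sh
accelerate launch main.py \
  --model facebook/incoder-1B \
  --tasks <task-name> \
  --n_samples 10 \
  --limit 20 \
  --allow_code_execution \
  --save_generations
```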

Now push your work and make a pull request! Thanks for the contribution 🚀.