Here we document the tasks available in this benchmark. Code generation models, just like natural language models, can
be evaluated using match-based metrics such as the BLEU score. However, these metrics fail to capture the syntactic and
semantic features of code. A more appropriate way to evaluate these models is functional correctness, where a solution
is considered correct if it passes a set of unit tests; a popular metric for this is `pass@k`.
In this evaluation harness, we include tasks with unit tests, but also some tasks with BLEU evaluation, due to the scarcity and evaluation cost of the first type.
Before diving into the tasks, here are some instructions that apply to all benchmarks:
* Adapt `max_length_generation` based on your model's context size and the task; by default it is 512. This value is enough for tasks like HumanEval and MBPP, but some tasks such as APPS require a larger value because the prompts are long; you can use the model's full context size.
* Adapt `batch_size` based on your device memory and `n_samples`; by default it is 1. It should be smaller than `n_samples`, but for multiple generations per problem, a larger batch size is better since it makes generation faster.
* `allow_code_execution` allows the execution of model-generated (untrusted) code on your machine; please read the displayed warning carefully before enabling it (it is off by default).
* You can adapt the text generation parameters by changing the `do_sample`, `top_p` and `temperature` arguments.
* Some models, such as [InCoder](https://huggingface.co/facebook/incoder-6B), might require adding a prefix before the prompt to give a hint about the language. For example, to add the InCoder prefix indicating Python, set the `prefix` argument to `"<| file ext=.py |>\n"` (see the example command after this list).
* Generations are saved with `save_generations`, which should be passed during execution, so you can inspect the post-processed model generations used for the evaluation. You also have the option of saving the references with `save_references`, which can be useful for tasks that use the BLEU score and actual solutions as references.
* For experimenting, you can evaluate on a limited number of tasks instead of the whole test set with the `limit` argument; try using a number that is proportional to your number of devices.
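For illustration, here is a sketch that combines several of these arguments for InCoder on HumanEval; the sampling values, sample counts and limit are placeholders to adapt to your setup, and the prefix uses the bash `$'...'` quoting shown later for instruction tokens:
```bash
accelerate launch main.py \
--model facebook/incoder-6B \
--tasks humaneval \
--prefix $'<| file ext=.py |>\n' \
--max_length_generation 512 \
--do_sample True \
--top_p 0.95 \
--temperature 0.2 \
--n_samples 20 \
--batch_size 10 \
--limit 20 \
--save_generations \
--save_references \
--allow_code_execution
```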
## Code generation benchmarks with unit tests
### HumanEval
[HumanEval](https://huggingface.co/datasets/openai_humaneval): 164 handwritten Python programming problems with a function signature, docstring, body, and several unit tests.
* Prompts & generation: in a zero-shot setting, we use the function signatures as prompts and generate code until some stop words are reached. By default, top-p sampling is used with $p=0.95$ (the same holds for the other tasks unless stated otherwise); this is set with the `do_sample` and `top_p` arguments.
We follow Chen et al.'s approach for pass@k estimation, where $n=200 > k$ solutions are generated per problem to estimate the success rate (`n_samples=200`); the estimator is given after this list.
* Evaluation: we evaluate pass@1, pass@10 and pass@100 for a given temperature.
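For reference, pass@k is computed with the unbiased estimator from Chen et al.: if $n$ solutions are generated per problem and $c$ of them pass the unit tests, then

$$\text{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right].$$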
Below are the commands to run the evaluation with these settings:
```python
accelerate launch main.py \
--model <MODEL_NAME> \
--max_length_generation <MAX_LENGTH> \
--tasks humaneval \
--temperature 0.2 \
--n_samples 200 \
--batch_size 10 \
--allow_code_execution
```
If you want to evaluate only on the first $n$ samples instead of all the test dataset, set `limit` argument to $n$.
### HumanEval+
[HumanEval+](https://huggingface.co/datasets/evalplus/humanevalplus): HumanEval with additional unit tests (80x of the original HumanEval) for each of the 164 problems.
The generation and evaluation follow the same approach as [HumanEval](#humaneval). One only needs to change the task name to `humanevalplus` to run the evaluation on HumanEval+:
```python
accelerate launch main.py \
--model <MODEL_NAME> \
--max_length_generation <MAX_LENGTH> \
--tasks humanevalplus \
--temperature 0.2 \
--n_samples 200 \
--batch_size 10 \
--allow_code_execution
```
### HumanEvalPack
[HumanEvalPack](https://huggingface.co/datasets/bigcode/humanevalpack) extends HumanEval to **3** scenarios across **6** languages via human annotations. There are different prompting options depending on the model that can be specified with the `--prompt` flag:
- `continue`: This prompt is the same as for HumanEval and only works for HumanEvalSynthesize.
- `instruct`: For this prompt, an intuitive instruction is given to the model telling it what to do.
- `octocoder`, `wizardcoder`, `instructcodet5p`, etc.: These are custom prompt formats for individual models to align with how they were finetuned.
The three scenarios are listed below. The selectable languages are: `python`, `js`, `java`, `go`, `cpp` & `rust`.
- HumanEvalFix: In this task, models are provided with a solution containing a subtle bug and several unit tests. The task is to fix the function. There is a variant of this task where the function docstring is provided instead of the unit tests, which can be selected via `humanevalfixdocs`.
```
accelerate launch main.py \
--model <MODEL_NAME> \
--max_length_generation <MAX_LENGTH> \
--prompt <PROMPT> \
--tasks humanevalfixtests-python \
--temperature 0.2 \
--n_samples 20 \
--batch_size 10 \
--allow_code_execution
```
- HumanEvalExplain: In this task models need to explain a HumanEval solution (without docstring) and subsequently regenerate the solution given only the model's own explanation. Thus, it requires two runs. The first one generates the descriptions, the second loads the descriptions, generates the solution & is scored.
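The two runs might look like the following sketch; the task names (`humanevalexplaindescribe-python`, `humanevalexplainsynthesize-python`) and the use of `--load_data_path` to feed the saved descriptions back in are assumptions based on the flags documented in this harness, so double-check them against the task registry:
```
# Run 1: generate explanations of the reference solutions
accelerate launch main.py \
--model <MODEL_NAME> \
--max_length_generation <MAX_LENGTH> \
--prompt <PROMPT> \
--tasks humanevalexplaindescribe-python \
--temperature 0.2 \
--n_samples 20 \
--batch_size 10 \
--generation_only \
--save_generations \
--save_generations_path explanations.json

# Run 2: regenerate solutions from the explanations and score them
accelerate launch main.py \
--model <MODEL_NAME> \
--max_length_generation <MAX_LENGTH> \
--prompt <PROMPT> \
--tasks humanevalexplainsynthesize-python \
--load_data_path explanations.json \
--temperature 0.2 \
--n_samples 20 \
--batch_size 10 \
--allow_code_execution
```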
- HumanEvalSynthesize: This is like HumanEval but with human translations for JavaScript, Java, Go, C++ and Rust. It is based on [HumanEval-X](https://arxiv.org/abs/2303.17568), however, with additional fixes and improvements documented [here](https://github.com/bigcode-project/octopack/tree/main/evaluation/create/humaneval-x#modifications-muennighoff).
```
accelerate launch main.py \
--model <MODEL_NAME> \
--max_length_generation <MAX_LENGTH> \
--prompt <PROMPT> \
--tasks humanevalsynthesize-python \
--temperature 0.2 \
--n_samples 20 \
--batch_size 10 \
--allow_code_execution \
--save_generations
```
There is also a version to run the OpenAI API on HumanEvalPack at `bigcode_eval/tasks/humanevalpack_openai.py`. It requires the `openai` package that can be installed via `pip install openai`. You will need to set the environment variables `OPENAI_ORGANIZATION` and `OPENAI_API_KEY`. Then you may want to modify the global variables defined in the script, such as `LANGUAGE`. Finally, you can run it with `python bigcode_eval/tasks/humanevalpack_openai.py`.
### InstructHumanEval
[InstructHumanEval](https://huggingface.co/datasets/codeparrot/instructhumaneval): 164 handwritten Python programming problems described by an instruction (derived from the HumanEval docstring), a function signature and several unit tests.
This evaluation suite is similar to HumanEval but is dedicated to instruction-tuned models. Each prompt is built as an instruction followed by a context, separated by delimiter tokens (those used in the instruction-tuning of the model). Here we focus on 3 such tokens:
- <user_token>: this token represents the role of the person who uses/prompts the model to solve a given task. It can be `Question:`, `USER`, etc.
- <end_token>: this token designates the end of the user turn (the end of their request). It can be `<|end|>` or `</s>`, or even as simple as `\n`, ` `, or `\n\n`.
- <assistant_token>: similar to <user_token>, this represents the LLM. Common templates include `Assistant:`, `Response:`, `Answer:`, `<|Assistant|>`, etc.
Our evaluation supports two scenarios:
- *Code completion* (`tasks = instruct-humaneval`)
Here the model is prompted with the instruction followed by the `<context>`.
The model is expected to complete a function signature. Make sure to add a `\n` at the end of your `<assistant_token>` to trigger a newline before the function declaration.
- *Docstring to code* (`tasks = instruct-humaneval-nocontext`)
Here the model is prompted with the instruction only.
The model is expected to solve the problem formulated as an instruction. There is no additional guidance from `<context>` (which contains imports, auxiliary functions and the function signature), which increases the difficulty of the task.
To run the evaluation, the main change compared to HumanEval is the `instruction_tokens` argument, which takes the 3 tokens mentioned above separated by a comma `,`.
For [StarChat-Beta](https://huggingface.co/HuggingFaceH4/starchat-beta), for example, we used the tokens `<|user|>\n`, `<|end|>\n` and `<|assistant|>\n`. You might need to escape the `|` and `\` characters in bash: `--instruction_tokens \<\|user\|\>$'\n',\<\|end\|\>$'\n',\<\|assistant\|\>$'\n'`
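As a sketch, the code-completion setting could then be run as follows (using the StarChat-Beta tokens above; swap in your own model's delimiters):
```bash
accelerate launch main.py \
--model HuggingFaceH4/starchat-beta \
--max_length_generation 512 \
--tasks instruct-humaneval \
--instruction_tokens \<\|user\|\>$'\n',\<\|end\|\>$'\n',\<\|assistant\|\>$'\n' \
--temperature 0.2 \
--n_samples 200 \
--batch_size 10 \
--allow_code_execution
```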
### MBPP
[MBPP](https://huggingface.co/datasets/mbpp): consists of around 1,000 crowd-sourced Python programming problems,
designed to be solvable by entry-level programmers. Each problem consists of a task description in English, a code solution and 3 automated test cases. We evaluate on the test-set samples from index 11 to 511.
* Prompts and generation: we use a few-shot setting with an InCoder-style prompt: we feed the prompt to the model as a docstring and include only one test, to help the model catch the function name, which is required by the unit tests.
To use this setting (which is the default), set `prompt_type_mbpp` to `incoder`. We also give the option to include a code solution in the prompt; just set `include_solution_mbpp` to `True`.
We do single generations per problem (pass@1), i.e. the model gets one chance to solve each problem, but we still follow Chen et al.'s approach for pass@k estimation similarly to HumanEval: we generate $n=15 > k$ solutions per problem ($k=1$ here) to estimate the success rate (`n_samples=15`).
* Evaluation: we evaluate the pass@1.
Below are the commands to run the evaluation with these settings:
```python
accelerate launch main.py \
--model <MODEL_NAME> \
--max_length_generation <MAX_LENGTH> \
--tasks mbpp \
--temperature 0.1 \
--n_samples 15 \
--batch_size 10 \
--allow_code_execution
```
Low temperatures generally work better for small $k$ in pass@k.
### MBPP+
[MBPP+](https://huggingface.co/datasets/evalplus/mbppplus): MBPP with additional unit tests (35x the original MBPP) for each problem (see the note below on the number of tasks).
The generation and evaluation follow the same approach as [MBPP](#mbpp). One only needs to change the task name to `mbppplus` to run the evaluation on MBPP+:
> [!Note]
> Note MBPP+ only includes **399** tasks which are a subset of the original MBPP dataset (~1000 tasks).
> The subset is selected from the sanitized MBPP (a subset of ~427 manually examined tasks by the original MBPP authors)
> and EvalPlus further removes low-quality and ill-formed ones for benchmark quality control to get MBPP+.
```bash
accelerate launch main.py \
--model <MODEL_NAME> \
--max_length_generation <MAX_LENGTH> \
--tasks mbppplus \
--temperature 0.1 \
--n_samples 15 \
--batch_size 10 \
--allow_code_execution
```
By setting `MBBPPLUS_USE_MBPP_TESTS=1` when running MBPP+, one can run the 399 MBPP+ tasks (a subset of the 500 MBPP evaluation tasks) with the original MBPP base tests:
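For example (a sketch assuming the environment variable is simply set in front of the same command as above):
```bash
MBBPPLUS_USE_MBPP_TESTS=1 accelerate launch main.py \
--model <MODEL_NAME> \
--max_length_generation <MAX_LENGTH> \
--tasks mbppplus \
--temperature 0.1 \
--n_samples 15 \
--batch_size 10 \
--allow_code_execution
```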
### DS-1000
[DS-1000](https://ds1000-code-gen.github.io/): a code generation benchmark with 1000 data science questions spanning seven Python libraries that (1) reflects diverse, realistic, and practical use cases, (2) has a reliable metric, and (3) defends against memorization by perturbing questions.
The task can be specified as `--tasks ds1000-$SUBSET-$MODE`, where subset can include `all` libraries or any of the following subsets: `numpy`, `scipy`, `pandas`, `tensorflow`, `pytorch`, `sklearn`, `matplotlib`. Supported generation modes are `completion` (purely autoregressive) or `insertion` via fill-in-middle [FIM] (this mode now only supports InCoder and BigCode Models).
- Prompts & Generation: prompts include partial code with one or more missing lines. The form of such prompts varies between `completion` and `insertion` modes (`[insert]` token used to reflect FIM region). Default generation args are reflected below.
- Evaluation: generations are evaluated via execution of unit tests. As in the original manuscript, pass@1 is evaluated over each of the `num_samples` generations and the mean pass rate is returned as the metric. Default evaluation args are presented below.
Below is the command to run evaluation on the full benchmark in insertion mode with the arguments that correspond to the original manuscript.
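A sketch of such a command for the full benchmark in insertion mode (the generation parameters here are illustrative placeholders rather than the exact settings from the DS-1000 paper):
```bash
accelerate launch main.py \
--model <MODEL_NAME> \
--max_length_generation 1024 \
--tasks ds1000-all-insertion \
--do_sample True \
--temperature 0.2 \
--top_p 0.95 \
--n_samples 40 \
--batch_size 10 \
--allow_code_execution
```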
### MultiPL-E
[MultiPL-E](https://huggingface.co/datasets/nuprl/MultiPL-E): a benchmark for evaluating large language models for code generation that supports 18 programming languages. It takes the OpenAI HumanEval Python benchmark and uses little compilers to translate the problems to other languages. We use a similar implementation to [the original repository](https://github.com/nuprl/MultiPL-E/tree/main), and the evaluation parameters are similar to HumanEval's. For this benchmark, we strongly recommend using the provided Dockerfile to build the MultiPL-E container with all required dependencies, both for convenience and for extra safety, especially when evaluating on languages like `bash`.
Tasks are named `multiple-<LANG>` where `<LANG>` is the language name, e.g. `multiple-py` for Python. To build the Docker image, run:
```bash
$ sudo make DOCKERFILE=Dockerfile-multiple all
```
This creates an image called `evaluation-harness-multiple`.
Suppose you generated text with the `bigcode/santacoder` model and saved it in `generations_py.json` with:
```bash
accelerate launch main.py \
--model bigcode/santacoder \
--tasks multiple-py \
--max_length_generation 650 \
--temperature 0.8 \
--do_sample True \
--n_samples 200 \
--batch_size 200 \
--trust_remote_code \
--generation_only \
--save_generations \
--save_generations_path generations_py.json
```
To evaluate `generations_py.json` (or another file, mounted into the container with `-v`), run the container built from the `evaluation-harness-multiple` image, specify `n_samples`, and allow code execution with `--allow_code_execution` (also pass the number of problems via `--limit` if it was used during generation):
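A sketch of such an invocation (the mount path and container entry point are assumptions; adapt them to your setup):
```bash
sudo docker run -v $(pwd)/generations_py.json:/app/generations_py.json:ro \
  -it evaluation-harness-multiple python3 main.py \
  --model bigcode/santacoder \
  --tasks multiple-py \
  --load_generations_path /app/generations_py.json \
  --temperature 0.8 \
  --n_samples 200 \
  --allow_code_execution
```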
Execution time may vary depending on the programming languages.
### APPS
[APPS](https://huggingface.co/datasets/codeparrot/apps): a challenging benchmark for code generation with 10,000 Python problems,
5,000 for training and 5,000 for evaluation. It has three difficulty levels: introductory, interview and competition.
Most papers finetune models on the training split before the evaluation, since the problems are challenging and the problem descriptions are long.
However, Chen et al. evaluated Codex-12B in a one-shot setting; although the details of the prompt format aren't given, we propose two evaluation modes:
with fine-tuning and in a one-shot setting.
* Prompts & generation
**1- Fine-tuning:** we provide the code to fine-tune an autoregressive model on this dataset in
[`finetuning/APPS`](https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/finetuning/APPS). To evaluate a fine-tuned model,
we use a prompt format similar to the original paper of Hendrycks et al. There are two types of calls depending on whether a function name is provided for the sample or not.
```python
prompt = "\nQUESTION:\n"
prompt += sample["question"]
if starter_code:
    prompt += starter_code
if fn_name:
    call_format = "\nUse Call-Based format"
    prompt += call_format
else:
    call_format = "\nUse Standard Input format"
    prompt += call_format
prompt += "\nANSWER:\n"
```
Sometimes the prompts are long and exceed the model's context length, so they get truncated. In this case we truncate the prompt below the context length so that the closing text, e.g. "\nUse Call-Based format\nANSWER:\n", is still included: the problem description always sits at the beginning of the prompt, followed by examples that aren't always relevant, while the closing text is important for the model to know that it has to generate executable Python code rather than natural text.
To use this setting (which is the default), set the `setup_apps` argument to `finetuning`. To select a difficulty level, use the `level_apps` argument; by default it is `all`.
**2- Few-shot:** for non-finetuned models, we provide one example in the prompt for each call type (Standard Input and Call-Based format). We prepend these examples, with an instruction, to the prompt above:
```
one_shot_prompt = (
    "Implement answers to the following questions:\nQUESTION:\n"
    + examples["problem_type1"]
    + "\nUse Standard Input format\nANSWER:\n"
    + examples["solution_type1"]
    + "\nQUESTION:\n"
    + examples["problem_type2"]
    + "\nUse Call-Based format\nANSWER:\n"
    + examples["solution_type2"]
    + "\n"
    + prompt
)
```
* Evaluation: we have two types of evaluation for this benchmark:
  * the original Hendrycks et al. evaluation, where we do single generations (`n_samples=1`) and compute the average accuracy over the number of tests that pass for each problem, as well as the strict accuracy, where a problem is solved only if all its tests pass, averaged over all problems. These metrics are fast to compute given that we do single generations and they capture incremental improvement, especially for small models. However, strict accuracy is often very low and average accuracy may not be very representative, as the number of tests is not consistent across problems. Recent papers evaluate this benchmark using pass@k.
  * we compute pass@1, pass@10 and pass@100 and generate 200 solutions per problem (`n_samples=200`). Note that this takes a lot of time since there are 5,000 evaluation samples, and there are no Python stop words for the generation, so small models that struggle to answer may generate until `max_length` or the EOS token.
In case of single generations (`n_samples=1`), the first type of metric is used; when multiple generations are made, the pass@k metric is used.
Below are the commands to run the evaluation with these settings for introductory level for example:
```python
# to compute average/strict accuracies: use n_samples 1
# to compute pass@k: use n_samples != 1 (200)
accelerate launch main.py \
--model <MODEL_NAME> \
--max_length_generation <MAX_LENGTH> \
--tasks apps-introductory \
--n_samples 1 \
--temperature 0.1 \
--batch_size 1 \
--allow_code_execution
```
We expect a model [finetuned](https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/finetuning/APPS) on the train split of APPS.
TODO: add few-shot setup for APPS.
### Recode
[Recode](https://github.com/amazon-science/recode/tree/main) proposes a set of code and natural language transformations to evaluate the robustness of code-generation models. The perturbations can be applied to any code-generation benchmark. Specifically, they release perturbed versions of HumanEval and MBPP.
For now, we support the perturbed version of the HumanEval benchmark.
The task is specified with `--tasks perturbed-humaneval-{category}-num_seeds_{num_seeds}` where `category` can be one of `format`, `func_name`, `natgen`, `nlaugmenter`, and the number of seeds per perturbation is from `1` to `10`. The author's recommendation is to run with 5 seeds, with greedy generation.
```python
accelerate launch main.py \
--model <MODEL_NAME> \
--max_length_generation 1024 \
--tasks <TASK> \
--batch_size 1 \
--do_sample False \
--n_samples 1 \
--allow_code_execution
```
## Code generation benchmarks without unit tests
For these tasks, we do single generations and compare the generated code against reference solutions using the BLEU score. For the following tasks, we use a two-shot setting where we include 2 inputs and their solutions in the prompt, all preceded by an instruction such as: `"Answer the following instructions in a one line SQL query:\n"`. The solutions consist of one line, so we stop the generation when a new line is generated. Three languages are covered: Python, SQL and Java.
- [CoNaLa](https://huggingface.co/datasets/neulab/conala) for Python code generation; it has 500 tasks in the test set.
- [Spider](https://huggingface.co/datasets/spider) for SQL code generation; it has 1,034 tasks in the test set.
- [Concode](https://huggingface.co/datasets/code_x_glue_tc_text_to_code) for Java code generation; it has 2,000 tasks in the test set.
We only do single generations (`n_samples=1`) and use the same generation settings as before.
Below are the commands to run the evaluation:
```python
accelerate launch main.py \
--model <MODEL_NAME> \
--max_length_generation <MAX_LENGTH> \
--tasks <TASK> \
--n_samples 1 \
--temperature 0.1 \
--batch_size 1
```
If you ever get index out-of-range errors try using a number of problems `limit` that is proportional to the number of devices you are using.
### SantaCoder-FIM
[SantaCoder-FIM](https://huggingface.co/datasets/bigcode/santacoder-fim-task): 4,792 tasks for FIM insertion described in [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988). The tasks are similar to other tasks without unit tests, with two key differences:
1. Instead of the BLEU score, exact match is used to score the generations.
2. A zero-shot setting is used instead of the 2-shot setting.
SantaCoder-FIM includes 2 tasks:
- `StarCoderFIM`: which uses the default FIM tokens `"<fim_prefix>", "<fim_middle>", "<fim_suffix>"`, and
- `SantaCoderFIM`: which uses SantaCoder FIM tokens `"<fim-prefix>", "<fim-middle>", "<fim-suffix>"`
So depending on the FIM tokens used to train the model, you will need to select the appropriate task for evaluation.
We only do single generations (`n_samples=1`) and use the same generation settings as before.
Below are the commands to run the evaluation:
```python
accelerate launch main.py \
--model <MODEL_NAME> \
--max_length_generation <MAX_LENGTH> \
--tasks <TASK> \
--n_samples 1 \
--temperature 0.2 \
--batch_size 1
```
If you ever get index out-of-range errors try using a number of problems `limit` that is proportional to the number of devices you are using.
## Documentation generation task
Code-to-text task from [CodeXGLUE](https://huggingface.co/datasets/code_x_glue_ct_code_to_text): a benchmark for generating English documentation from code in 6 programming languages: Python, Go, Ruby, Java, JavaScript and PHP.
For Python, we evaluate in a zero-shot setting. We have two options:
* in the first one, we give as a prompt the function signature, which we extract by splitting at the beginning of the docstring. This task is `codexglue_code_to_text-python-left`.
* in the second one, we include the full function body (without the docstring) and add this sentence at the end of the prompt: `'\n"""The goal of this function is to:\n'`. This task is `codexglue_code_to_text-python`.
We retrieve the reference solutions from the docstring tokens, similarly to InCoder's approach, since the target docstrings in the dataset include extra context such as argument definitions. We only keep one line of the model generation.
For the other languages (task `codexglue_code_to_text-<language>`), the docstring is not included in the code, so we currently don't extract signatures; we use the full function body followed by a comment in that language saying `\n=begin The goal of this function is to:\n` for Ruby and `\n/* The goal of this function is to:\n` for the rest. This task is still not well tested, please report any bugs you might find.
For this task, we advise using greedy generation. For evaluation, we compute the BLEU score.
Below are the commands to run the evaluation:
```python
accelerate launch main.py \
--model <MODEL_NAME> \
--max_length_generation <MAX_LENGTH> \
--tasks codexglue_code_to_text-python-left \
--n_samples 1 \
--batch_size 1
```
## Downstream classification tasks
These are classification tasks for Java and C. We provide the code to finetune models on these benchmarks and to evaluate them in the [`finetuning`](https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/finetuning) folder.
## Natural language reasoning tasks
These are reasoning tasks involving mathematical, symbolic and procedural reasoning, where the task descriptions/questions are in natural language.
#### PAL - Program-aided Language Models
In PAL, Large Language Models solve reasoning problems by generating reasoning chains with code. PAL datasets that are currently supported:
* [GSM8K](https://huggingface.co/datasets/gsm8k) - Grade School Math 8K
* [GSM-HARD](https://huggingface.co/datasets/reasoning-machines/gsm-hard) - Created by replacing the numbers in the questions of GSM8K with larger numbers
The model is prompted with few-shot examples of questions and reasoning steps as code. It then generates reasoning steps for a new question as Python code, which is executed to get the model's predicted answer.
PAL uses two types of few-shot evaluation -
- `greedy` - samples one generation by greedy decoding and evaluates against reference answers
- `majority_voting` - samples k (k=40 in paper) generations and takes majority voted answer to evaluate against the reference.
The complete prompt with 8-shot examples (as used in [PAL](https://github.com/reasoning-machines/pal)) takes up ~1500 tokens, hence `max_length_generation` should be greater than that; the recommended value is `2048`.
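A sketch of a greedy run with these settings (`<PAL_TASK>` is a placeholder for the PAL task name registered in the harness, e.g. the GSM8K greedy variant; check the task registry for the exact names):
```bash
accelerate launch main.py \
--model <MODEL_NAME> \
--max_length_generation 2048 \
--tasks <PAL_TASK> \
--do_sample False \
--n_samples 1 \
--batch_size 1 \
--allow_code_execution
```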
## How to add a new benchmark
We welcome contributions to add new code benchmarks to this evaluation harness. You can find a step-by-step guide in [`guide.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/guide.md).
Here we provide a step-by-step guide for adding a new task to the `bigcode-evaluation-harness` to evaluate code generation language models. The process is similar to adding tasks in [lm_evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), which inspired this repository, so this document is based on their [task_guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_guide.md). The `Task` class is the backbone of all tasks in this framework.
## Setup
If you haven't already, fork the main repo, clone it, create a branch with the name of your task, and install the project requirements in your environment:
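A typical sequence (a sketch assuming the standard GitHub fork-and-clone workflow and an editable pip install; replace `<your-username>` and `<task-name>` with your own values):
```bash
git clone https://github.com/<your-username>/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
git checkout -b <task-name>
pip install -e .
```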
Open the task file you've created (e.g. `bigcode_eval/tasks/<task-name>.py`) and add a multiline docstring on the first line with the following contents:
```python
"""
<Paper title>
<Paper PDF URL>
<Short description of task>
Homepage: <URL to task's homepage>
"""
```
## Data Handling
### Downloading your Data
All data downloading and management is handled through the HuggingFace (**HF**) [`datasets`](https://github.com/huggingface/datasets) API. So, if your dataset isn't already on the hub (see [catalog](https://huggingface.co/datasets)), please consider adding it to make it accessible to a wider user base by following this [new dataset guide](https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md).
Now that you have your HF dataset, you need to assign its path and name to your `Task` in the following fields:
```python
class TaskName(...):
    DATASET_PATH = "..."
    DATASET_NAME = "..."
```
where `DATASET_PATH` is the name of the dataset as listed by HF in the `datasets` Hub and `DATASET_NAME` is the name of sub-task of the benchmark. If your task does not contain any data instances/subsets, just set `DATASET_NAME = None`.
Next you need to load the evaluation split of the dataset in the `get_dataset` method. For example:
```python
def get_dataset(self):
    return self.dataset["test"]
```
You might need to redefine some arguments of the class, like `stop_words`, which defines the stop words used as a stopping criterion during code generation, and `requires_execution`, which defines whether the task requires code execution or not.
```python
def __init__(self):
    super().__init__(
        stop_words=["\n"],
        requires_execution=True,
    )
```
### Processing Documents
Then you need to format each document into a single query prompt, __without the answer__, to be sent to the language model, in the `get_prompt` method.
It takes a single `doc` example of type `dict` with `str` key-value members.
```python
def get_prompt(self, doc):
    return ""
```
If the prompt involves few-shot examples, you first need to save them in a JSON file `<task_name>_few_shot_prompts.json` in `bigcode_eval/tasks/few_shot_example`, and then load them in the `fewshot_examples` method like this:
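A minimal sketch of such a method (`<task_name>` is whatever name you used above; `json` comes from the standard library and should be imported at the top of your task module):
```python
import json  # at the top of your task module


def fewshot_examples(self):
    """Load the few-shot examples saved for this task."""
    with open(
        "bigcode_eval/tasks/few_shot_example/<task_name>_few_shot_prompts.json"
    ) as file:
        examples = json.load(file)
    return examples
```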
The prompt will be sent to the language model, and the generation will be evaluated against ground-truth solutions or unit tests. You need to load these from the `doc` in the `get_target` method.
```python
def get_target(self, doc):
    return ""
```
### Postprocessing & Evaluation
The solutions generated by the language model often require postprocessing to remove unnecessary text and obtain executable code. This is done in the `postprocess_generation` function. It takes as input the model generation `generation` and the index `idx` of the document to which the generation belongs in the dataset (not needed in most cases).
```python
def postprocess_generation(self, generation, idx):
    return ""
```
The evaluation happens in the `process_results` function. This function takes as arguments the list of generations for all selected problems in the benchmark (`generations`) and their references (`references`), and returns a dictionary of metrics and their values.
```python
def process_results(self, generations, references):
    return {}
```
You need to load your metric and run it. Check the Hugging Face `evaluate` [library](https://huggingface.co/docs/evaluate/index) for the available metrics. For example, [code_eval](https://huggingface.co/spaces/evaluate-metric/code_eval) for pass@k, [BLEU](https://huggingface.co/spaces/evaluate-metric/bleu) for BLEU score and [apps_metric](https://huggingface.co/spaces/codeparrot/apps_metric) are already used. If you cannot find your desired metric, you can either add it to the `evaluate` library or implement it in the `bigcode_eval/tasks/custom_metrics` folder and import it from there.
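As a sketch for a BLEU-scored task (the `"bleu"` metric name is the one hosted on the `evaluate` hub; swap in `code_eval` or another metric as your task requires):
```python
import evaluate


def process_results(self, generations, references):
    """Compute BLEU between the first generation per problem and its reference."""
    bleu = evaluate.load("bleu")
    predictions = [gen[0] for gen in generations]  # one generation per problem
    results = bleu.compute(predictions=predictions, references=references)
    return {"bleu": results["bleu"]}
```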
### Registering Your Task
Now's a good time to register your task to expose it for usage. All you'll need to do is import your task module in `bigcode_eval/tasks/__init__.py` and provide an entry in the `TASK_REGISTRY` dictionary with the key as the name of your benchmark task (in the form it'll be referred to in the command line) and the value as the task class. See how it's done for other tasks in the [file](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/bigcode_eval/tasks/__init__.py).
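A sketch of such an entry (the module and class names here are illustrative, not real tasks in the harness):
```python
# in bigcode_eval/tasks/__init__.py -- "my_new_task" / "MyNewTask" are illustrative names
from . import my_new_task

TASK_REGISTRY = {
    # ... existing tasks ...
    "mynewtask": my_new_task.MyNewTask,
}
```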
## Task submission
### Running Unit Tests
To run the entire test suite, use:
```sh
pytest
```
## Fine-tuning
Few-shot tasks are easier to set up, but if you need to add a finetuning script for your task, you can create a folder for it in the `finetuning` folder and use a training and evaluation script similar to those of the other tasks.
## Code formatting
You can format your changes and run the standard `black` checks with:
```sh
black bigcode_eval/tasks/<task-name>.py
```
## Task documentation
Please document your task in the [docs](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/README.md) with the execution parameters advised in the literature, as is done for the other benchmarks.
## Pull request
Please specify in your pull request whether you followed the original paper's approach to build the prompts or whether some changes were introduced (especially if you built few-shot examples). Ideally, evaluate some public models and compare the scores to the published results to see if they match.
If there are no published results for your task, make sure the evaluation works properly by testing some samples with a good code generation model such as InCoder-1B. During the experiments you have the option to save `generations.json` and `references.json`; take a look to see if the generations are properly cleaned and are somewhat close to the references, for match-based evaluations for example.
Now push your work and make a pull request! Thanks for the contribution 🚀.
In this folder we show how to train an autoregressive language model on the APPS dataset, since a common way to evaluate on this benchmark is to finetune the model on its training split first.
We use Hugging Face [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) which supports distributed training on multiple GPUs.
## Setup
First login to Weights & Biases
```
wandb login
```
You can finetune a model, `gpt_345_python_any_license` for example, by running:
```python
# we use a global batch size of 256, here = 8 (GPUs) * 2 (batch_size_per_device) * 16 (gradient_accumulation)
python apps_train.py \
--model_ckpt BigCode/gpt_345_python_any_license \
--num_epochs 10 \
--batch_size 2 \
--gradient_accumulation_steps 16 \
--learning_rate 5e-5 \
--eval_freq 250 \
--fp16
```
The fine-tuning takes 11h on 4 A100 GPUs.
## Acknowledgments
This script is adapted from [APPS repository](https://github.com/hendrycks/apps).
In this folder we show how to train an autoregressive model on the [Code-to-text](https://huggingface.co/datasets/code_x_glue_ct_code_to_text) dataset, for generating natural language comments from code. We use the Hugging Face [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer), which supports distributed training on multiple GPUs.
## Setup
First login to Weights & Biases and to Hugging Face hub if you want to push your model to the hub:
```
wandb login
huggingface-cli login
```
During training, we use the code as input to the model and the docstring as the label. To fine-tune a model on the Python dataset, for example, you can use the training script provided in this folder.
In this folder we show how to train an autoregressive model on the [CodeClone](https://huggingface.co/datasets/code_x_glue_cc_clone_detection_big_clone_bench) dataset, for the binary classification problem of code equivalence prediction. We use the Hugging Face [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer), which supports distributed training on multiple GPUs.
## Setup
First login to Weights & Biases and to Hugging Face hub if you want to push your model to the hub:
```
wandb login
huggingface-cli login
```
To fine-tune a model on this dataset you can use the following command:
```python
python train_complexity_predictor.py \
--model_ckpt microsoft/unixcoder-base-nine \
--num_epochs 30 \
--batch_size 8 \
--num_warmup_steps 10 \
--learning_rate 5e-4 \
--push_to_hub True
```
This will fine-tune your model, push it to the hub and print the evaluation accuracy on the test set.
In this folder we show how to train an autoregressive model on the CodeComplex dataset, for algorithmic complexity prediction of Java programs. We use the Hugging Face [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer), which supports distributed training on multiple GPUs.
## Setup
First login to Weights & Biases and to Hugging Face hub if you want to push your model to the hub:
```
wandb login
huggingface-cli login
```
To fine-tune a model on this dataset, `microsoft/unixcoder-base-nine` for example, you can use the following command:
In this folder we show how to train an autoregressive model on the [CodeDefect](https://huggingface.co/datasets/code_x_glue_cc_defect_detection) dataset, for the problem of predicting whether a code sample is insecure. We use the Hugging Face [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer), which supports distributed training on multiple GPUs.
## Setup
First login to Weights & Biases and to Hugging Face hub if you want to push your model to the hub:
```
wandb login
huggingface-cli login
```
To fine-tune a model on this dataset you can use the following command:
```python
python train.py \
--model_ckpt microsoft/unixcoder-base-nine \
--num_epochs 30 \
--batch_size 8 \
--num_warmup_steps 10 \
--learning_rate 5e-4 \
--push_to_hub True
```
This will fine-tune your model, push it to the hub and print the evaluation accuracy on the test set.
In this folder we show how to fine-tune an autoregressive language model on the following evaluation and downstream tasks, with support for 7 programming languages:
* [APPS](https://huggingface.co/datasets/codeparrot/apps): Python benchmark to evaluate code generation. It is similar to HumanEval and MBPP, but it is more challenging and has more evaluation problems.
* [CodeComplex](https://huggingface.co/datasets/codeparrot/codecomplex): **Java** benchmark with a classification problem to predict the algorithmic complexity of Java programs among 7 labels.
* [CodeClone](https://huggingface.co/datasets/code_x_glue_cc_clone_detection_big_clone_bench): **Java** benchmark from the [CodeXGLUE](https://github.com/microsoft/CodeXGLUE) dataset, with a binary classification problem of predicting the semantic equivalence of two programs. [WIP]
* [CodeDefect](https://huggingface.co/datasets/code_x_glue_cc_defect_detection): **C** benchmark from [CodeXGLUE](https://github.com/microsoft/CodeXGLUE), with a binary classification problem of predicting whether a code sample is insecure and may attack software systems. [WIP]
* [Code-to-text](https://huggingface.co/datasets/code_x_glue_ct_code_to_text): dataset from [CodeXGLUE](https://github.com/microsoft/CodeXGLUE) for generating natural language comments from code in **Python, Go, Java, JavaScript, PHP and Ruby**. This task can also be done in a zero-shot setting without any need for fine-tuning. [WIP]
We use Hugging Face [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) API for all tasks, which supports distributed training on multiple GPUs.
The evaluation score on the test set is shown at the end of the fine-tuning. For implementation details, please refer to the README inside each folder.
This is a guide to submit and reproduce the numbers in the [Multilingual Code Evaluation LeaderBoard](https://huggingface.co/spaces/bigcode/multilingual-code-evals).
The LeaderBoard is a demo for evaluating and comparing the performance of language models on code generation tasks.
The LeaderBoard is open for submissions of results produced by the community. If you have a model that you want to submit results for, please follow the instructions below.
## Running the evaluation
We report pass@1 for the [HumanEval](https://huggingface.co/datasets/openai_humaneval) Python benchmark and some languages from the [MultiPL-E](https://huggingface.co/datasets/nuprl/MultiPL-E) benchmark. We use the same template and parameters for all models.
### 1-Setup
Follow the setup instructions in the evaluation harness [README](https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main#setup).
Create two folders `generations_$model` and `metrics_$model` where you will save the generated code and the metrics respectively for your model `$model`.
```bash
cd bigcode-evaluation-harness
mkdir generations_$model
mkdir metrics_$model
```
To run the evaluation, we first generate the code solutions for the target tasks on GPUs, then execute the code in a Docker container (only CPUs are needed).
### 2- Generation
Below are the instructions for generating the code solutions sequentially or in parallel with slurm. You might need to reduce the batch size for some models or change the precision based on your device.
```bash
# after activating env and setting up accelerate...
langs=(py js java cpp swift php d jl lua r rkt rs)
for lang in "${langs[@]}"; do
    task=multiple-$lang
    # ... run generation and evaluation for $task here (see the MultiPL-E commands above),
    # saving generations to generations_$model and the metric json to metrics_$model ...
    echo "Task $task done, metric saved at $metrics_path/$metric_suffix"
done
```
## Submission of results to the LeaderBoard
If you followed the steps above, you now have a folder `metrics_$model` with `json` files, each containing the result of one task. To submit the results to the LeaderBoard, you need to create a json summarizing these metrics using `group_jsons.py` and submit it [here](https://huggingface.co/spaces/bigcode/multilingual-code-evals). Follow the instructions in the `Submit here` section.
For credibility, we invite you to add the generations and json metrics to your submission.
Now you're ready to submit your results by opening a PR on the leaderboard; go to the `Submit results :rocket:` section for more details.
## Notes
Some models might require extra arguments, like [CodeGeeX2-6b](https://huggingface.co/THUDM/codegeex2-6b), which requires providing the language tag as a prefix and running generation under torch 2.0, and [replit-v1-3b](https://huggingface.co/replit/replit-code-v1-3b), which also requires extra arguments. You can add the prefix via the `prefix` argument.
For the throughput and peak memory measurements, we point you to [optimum-benchmark](https://github.com/huggingface/optimum-benchmark) (check out commit `49f0924e2bb041cf17d78dd0848d8e2cad31632d` [here](https://github.com/huggingface/optimum-benchmark/commit/49f0924e2bb041cf17d78dd0848d8e2cad31632d)).
You can follow the instructions in that repository and copy our config yaml to reproduce the measurements.
# Simple wildcard support (linux filename patterns)
def __contains__(self, values):
    for value in values.split(","):
        if len(fnmatch.filter(self.choices, value)) == 0:
            return False
    return True

def __iter__(self):
    for choice in self.choices:
        yield choice

def parse_args():
    parser = HfArgumentParser(EvalArguments)
parser.add_argument(
"--model",
default="codeparrot/codeparrot-small",
help="Model to evaluate, provide a repo name in Hugging Face hub or a local path",
)
parser.add_argument(
"--modeltype",
default="causal",
help="AutoModel to use, it can be causal or seq2seq",
)
parser.add_argument(
"--peft_model",
type=str,
default=None,
help="Adapter to the PEFT base model. Can be utilized for loading PEFT adapters such as a LoRA trained model. The --model parameter needs to be the base model.",
)
parser.add_argument(
"--revision",
default=None,
help="Model revision to use",
)
parser.add_argument(
"--use_auth_token",
action="store_true",
help="Use the token generated when running `huggingface-cli login` (necessary for private model).",
)
parser.add_argument(
"--trust_remote_code",
action="store_true",
help="Use a model with custom code, this requires executing code by the author of the model.",
)
parser.add_argument(
"--tasks",
default=None,
choices=MultiChoice(ALL_TASKS),
help=f"Evaluation tasks from {ALL_TASKS}",
)
parser.add_argument(
"--instruction_tokens",
default=None,
help="A series of instruction tokens used for instruction-tuning benchamrks separated by comma e.g. <user_message>,<end_user_message>,<assistant_message>",
)
parser.add_argument(
"--batch_size",
type=int,
default=1,
help="Batch size for evaluation on each worker, can be larger for HumanEval",
)
parser.add_argument(
"--max_length_generation",
type=int,
default=512,
help="Maximum length of generated sequence (prompt+generation)",
)
parser.add_argument(
"--precision",
type=str,
default="fp32",
help="Model precision, from: fp32, fp16 or bf16",
)
parser.add_argument(
"--load_in_8bit",
action="store_true",
help="Load model in 8bit",
)
parser.add_argument(
"--load_in_4bit",
action="store_true",
help="Load model in 4bit",
)
parser.add_argument(
"--left_padding",
action="store_true",
help="Force left padding, needed for models like chatglm3-6b",
)
parser.add_argument(
"--limit",
type=int,
default=None,
help="Number of samples to solve and evaluate from the benchmark",
)
parser.add_argument(
"--limit_start",
type=int,
default=0,
help="Optional offset to start from when limiting the number of samples",
)
parser.add_argument(
"--save_every_k_tasks",
type=int,
default=-1,
help="Optional saving after every k tasks",
)
parser.add_argument(
"--postprocess",
action="store_false",
help="Postprocess model outputs before execution, always on except during generation tests",
)
parser.add_argument(
"--allow_code_execution",
action="store_true",
help="Allow code evaluation to execute external/untrusted Python code on your machine",
)
parser.add_argument(
"--generation_only",
action="store_true",
help="Do code generation but no evaluation",
)
parser.add_argument(
"--load_generations_path",
type=str,
default=None,
help="Path of file with previously generated solutions, if provided generation is skipped and only evaluation is done",
)
parser.add_argument(
"--load_data_path",
type=str,
default=None,
help="Path of additional data to load for the tasks",
)
parser.add_argument(
"--metric_output_path",
type=str,
default="evaluation_results.json",
help="Path to save the results",
)
parser.add_argument(
"--save_generations",
action="store_true",
help="Whether to save code generations",
)
parser.add_argument(
"--load_generations_intermediate_paths",
type=str,
nargs="*",
help="List of paths for saving the intermediate code generations",
)
parser.add_argument(
"--save_generations_path",
type=str,
default="generations.json",
help="Path for saving the code generations",
)
parser.add_argument(
"--save_references",
action="store_true",
help="Whether to save reference solutions/tests",
)
parser.add_argument(
"--save_references_path",
type=str,
default="references.json",
help="Path for saving the references solutions/tests",
)
parser.add_argument(
"--prompt",
type=str,
default="prompt",
help="Prompt type to use for generation in HumanEvalPack tasks",
)
parser.add_argument(
"--max_memory_per_gpu",
type=str,
default=None,
help="Max memroy to allocate per gpu, you can also use 'auto'",
)
parser.add_argument(
"--check_references",
action="store_true",
help="Don't run generation but benchmark groundtruth (useful for debugging)",
)
return parser.parse_args()

def pattern_match(patterns, source_list):
    """Returns a list containing all values of the source_list that
    match at least one of the patterns"""
    task_names = set()
    for pattern in patterns:
        for matching in fnmatch.filter(source_list, pattern):
            task_names.add(matching)
    return list(task_names)

def get_gpus_max_memory(max_memory, num_gpus):
    max_memory = {i: max_memory for i in range(num_gpus)}
    print("Loading model via these GPUs & max memories: ", max_memory)