Commit 5add46aa authored by hepj

Add Megatron project

parent deb8370c
Pipeline #2199 failed
@@ -4,3 +4,4 @@ LM-Evaluation-Harness*
 Bigcode-Evaluation-Harness*
 **/__pycache__
 .vscode
+Megatron-LM-240405/tests/functional_tests/test_results/jet/dgx_h100/gpt3_345m_mcore-pyt_merge-request_bf16_nodes-1_gpus-8_bs-32_steps-50_tp-1_pp-1_args--recompute-granularity-full-recompute-method-uniform-recompute-num-layers-1-_mcore-true_te-false.json
\ No newline at end of file
# How to contribute to BigCode?
Everyone is welcome to contribute, and we value everybody's contribution. Code
is not the only way to help the community: answering questions, helping
others, reaching out, and improving the documentation are all immensely valuable
to the community.
Whichever way you choose to contribute, please be mindful to respect our
[code of conduct](https://bigcode-project.org/docs/about/code_of_conduct/).
## You can contribute in so many ways!
There are 4 ways you can contribute to this repository:
* Fixing outstanding issues with the existing code;
* Implementing new models;
* Contributing to the examples or to the documentation;
* Submitting issues related to bugs or desired new features.
*All are equally valuable to the community.*
## License
Note that all contributions are licensed under Apache 2.0 by default. The
Technical Steering Committee (TSC) may approve the use of an alternative
license or licenses for inbound or outbound contributions on an exception basis.
To request an exception, please describe the contribution, the alternative
license, and the justification for using an alternative license for the
described contribution. License exceptions must be approved by the TSC.
Contributed files should contain license information indicating the open
source license or licenses pertaining to the file.
## Submitting a new issue or feature request
Do your best to follow these guidelines when submitting an issue or a feature
request. It will make it easier for us to come back to you quickly and with good
feedback.
### Did you find a bug?
First, we would really appreciate it if you could **make sure the bug was not
already reported** (use the search bar on Github under Issues).
Did not find it? :( So we can act quickly on it, please follow these steps:
* Include your **OS type and version**, and the versions of **Python**, **PyTorch** and
  **TensorFlow** when applicable;
* Include a short, self-contained code snippet that allows us to reproduce the bug in
  less than 30s;
* Provide the *full* traceback if an exception is raised.
### Do you want a new feature?
A world-class feature request addresses the following points:
1. Motivation first:
* Is it related to a problem/frustration with the current features? If so, please explain
why. Providing a code snippet that demonstrates the problem is best.
* Is it related to something you would need for a project? We'd love to hear
about it!
* Is it something you worked on and think could benefit the community?
Awesome! Tell us what problem it solved for you.
2. Write a *full paragraph* describing the feature;
3. Provide a **code snippet** that demonstrates its future use;
4. In case this is related to a paper, please attach a link;
5. Attach any additional information (drawings, screenshots, etc.) you think may help.
If your issue is well written, we're already 80% of the way there by the time you
post it.
## Start contributing! (Pull Requests)
Before writing code, we strongly advise you to search through the existing PRs or
issues to make sure that nobody is already working on the same thing. If you are
unsure, it is always a good idea to open an issue to get some feedback.
You will need basic `git` proficiency to be able to contribute to
BigCode. `git` is not the easiest tool to use but it has the greatest
manual. Type `git --help` in a shell and enjoy. If you prefer books, [Pro
Git](https://git-scm.com/book/en/v2) is a very good reference.
Follow these steps to start contributing:
1. Fork the repository by
clicking on the 'Fork' button on the repository's page. This creates a copy of the code
under your GitHub user account.
2. Clone your fork to your local disk, and add the base repository as a remote:
```bash
$ git clone git@github.com:<your Github handle>/<Repo name>.git
$ cd <Repo name>
$ git remote add upstream https://github.com/bigcode-project/<Repo name>.git
```
3. Create a new branch to hold your development changes:
```bash
$ git checkout -b a-descriptive-name-for-my-changes
```
**Do not** work on the `main` branch.
4. Set up a development environment by running the following command in a virtual environment:
```bash
$ pip install -r requirements.txt
```
5. Develop the features on your branch.
Once you're happy with your changes, add changed files using `git add` and
make a commit with `git commit` to record your changes locally:
```bash
$ git add modified_file.py
$ git commit
```
Please write [good commit
messages](https://chris.beams.io/posts/git-commit/).
It is a good idea to sync your copy of the code with the original
repository regularly. This way you can quickly account for changes:
```bash
$ git fetch upstream
$ git rebase upstream/main
```
Push the changes to your account using:
```bash
$ git push -u origin a-descriptive-name-for-my-changes
```
6. Once you are satisfied (**and the checklist below is happy too**), go to the
webpage of your fork on GitHub. Click on 'Pull request' to send your changes
to the project maintainers for review.
7. It's OK if maintainers ask you for changes. It happens to core contributors
   too! So that everyone can see the changes in the pull request, work in your local
   branch and push the changes to your fork. They will automatically appear in
   the pull request.
### Checklist
1. The title of your pull request should be a summary of its contribution;
2. If your pull request addresses an issue, please mention the issue number in
the pull request description to make sure they are linked (and people
consulting the issue know you are working on it);
3. To indicate a work in progress please prefix the title with `[WIP]`. These
are useful to avoid duplicated work, and to differentiate it from PRs ready
to be merged;
4. Make sure existing tests pass;
5. All public methods must have informative docstrings.
### Style guide
For documentation strings, BigCode follows the [google style](https://google.github.io/styleguide/pyguide.html).
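For instance, a function documented in this style might look like the following sketch (illustrative only, not code taken from the repository):
```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimates pass@k for a single problem.

    Args:
        n: Total number of generated samples.
        c: Number of samples that passed the unit tests.
        k: The k in pass@k.

    Returns:
        The unbiased pass@k estimate, a float in [0, 1].
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```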
**This guide was heavily inspired by the awesome [scikit-learn guide to contributing](https://github.com/scikit-learn/scikit-learn/blob/main/CONTRIBUTING.md).**
### Develop on Windows
On Windows, you need to configure git to convert Windows `CRLF` line endings to Linux `LF` line endings:
`git config core.autocrlf input`
One way to run the `make` command on Windows is to use MSYS2:
1. [Download MSYS2](https://www.msys2.org/); we assume it is installed in `C:\msys64`
2. Open the command line C:\msys64\msys2.exe (it should be available from the start menu)
3. Run in the shell: `pacman -Syu` and install make with `pacman -S make`
4. Add `C:\msys64\usr\bin` to your PATH environment variable.
You can now use `make` from any terminal (PowerShell, cmd.exe, etc.) 🎉
### Syncing forked main with upstream `main`
To avoid pinging the upstream repository, which adds reference notes to each upstream PR and sends unnecessary notifications to the developers involved in these PRs,
please follow these steps when syncing the main branch of a forked repository:
1. When possible, avoid syncing with the upstream using a branch and PR on the forked repository. Instead, merge directly into the forked main.
2. If a PR is absolutely necessary, use the following steps after checking out your branch:
```
$ git checkout -b your-branch-for-syncing
$ git pull --squash --no-commit upstream main
$ git commit -m '<your message without GitHub references>'
$ git push --set-upstream origin your-branch-for-syncing
```
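# Base execution image for the evaluation harness (Python-only code execution).
# Note: this appears to correspond to the `evaluation-harness` image described
# in the README's "Docker containers" section.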
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y python3 python3-pip
COPY . /app
WORKDIR /app
RUN test -f /app/generations.json && rm /app/generations.json || true
RUN pip3 install .
CMD ["python3", "main.py"]
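# Execution image with the extra language toolchains required by MultiPL-E.
# Note: this appears to correspond to the `evaluation-harness-multiple` image
# built from Dockerfile-multiple (see the README's Docker section).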
FROM ubuntu:22.04
RUN apt-get update -yqq && apt-get install -yqq curl build-essential python3-pip python3-tqdm
RUN apt-get install racket -yqq
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=Etc/UTC
RUN apt-get install -yqq \
default-jdk-headless \
golang-go \
php-cli \
ruby \
lua5.3 \
r-base \
rustc \
scala
RUN apt-get install -yqq libtest-deep-perl
RUN apt-get install -yqq wget
# JS/TS
RUN curl -fsSL https://deb.nodesource.com/setup_current.x | bash -
RUN apt-get install -y nodejs
RUN npm install -g typescript
# Dlang
RUN wget https://netcologne.dl.sourceforge.net/project/d-apt/files/d-apt.list -O /etc/apt/sources.list.d/d-apt.list
RUN apt-get update --allow-insecure-repositories
RUN apt-get -y --allow-unauthenticated install --reinstall d-apt-keyring
RUN apt-get update && apt-get install -yqq dmd-compiler dub
# C#
RUN apt-get install -yqq gnupg ca-certificates
RUN apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys 3FA7E0328081BFF6A14DA29AA6A19B38D3D831EF
RUN echo "deb https://download.mono-project.com/repo/ubuntu stable-focal main" | tee /etc/apt/sources.list.d/mono-official-stable.list
RUN apt-get update
RUN apt-get install -yqq mono-devel
# Post-processing
# Julia
RUN curl https://julialang-s3.julialang.org/bin/linux/x64/1.8/julia-1.8.2-linux-x86_64.tar.gz | tar xz
ENV PATH="/julia-1.8.2/bin:${PATH}"
# Swift
RUN curl https://download.swift.org/swift-5.7-release/ubuntu2204/swift-5.7-RELEASE/swift-5.7-RELEASE-ubuntu22.04.tar.gz | tar xz
ENV PATH="/swift-5.7-RELEASE-ubuntu22.04/usr/bin:${PATH}"
# Javatuples
RUN mkdir /usr/multiple && wget https://repo.mavenlibs.com/maven/org/javatuples/javatuples/1.2/javatuples-1.2.jar -O /usr/multiple/javatuples-1.2.jar
# Luaunit
RUN apt-get update -yqq && apt-get install -yqq lua-unit
# Standard requirements
COPY . /app
WORKDIR /app
RUN test -f /app/generations.json && rm /app/generations.json || true
RUN pip3 install .
CMD ["python3", "main.py"]
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "{}"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright {yyyy} {name of copyright owner}
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
<h1 align="center">Code Generation LM Evaluation Harness</h1>
<h4 align="center">
<p>
<a href="#features">Tasks</a> |
<a href="#setup">Usage</a> |
<a href="#implementing-new-tasks">Contribution</a> |
<a href="#documentation">Documentation</a> |
<a href="https://huggingface.co/bigcode">BigCode</a>
<p>
</h4>
<h3 align="center">
<img style="float: middle; padding: 10px 10px 10px 10px;" width="50" height="50" src="https://user-images.githubusercontent.com/44069155/191557209-6219acb8-a766-448c-9bd6-284d22b1e398.png" />
</h3>
## Features
This is a framework for the evaluation of code generation models. This work is inspired by [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) for evaluating language models in general. We welcome contributions to fix issues, enhance features and add new benchmarks. You can find contribution guides in [`docs/guide.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/guide.md) and [`CONTRIBUTING.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/CONTRIBUTING.md) and more documentation in [`docs/README.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/README.md).
Below are the features and tasks of this framework:
- Features:
  - Any autoregressive model available on the [Hugging Face hub](https://huggingface.co/) can be used, but we recommend using code generation models trained specifically on code, such as [SantaCoder](https://huggingface.co/bigcode/santacoder), [InCoder](https://huggingface.co/facebook/incoder-6B) and [CodeGen](https://huggingface.co/Salesforce/codegen-16B-mono).
  - We provide multi-GPU text generation with `accelerate` and Dockerfiles for running the evaluation inside Docker containers for security and reproducibility.
- Tasks:
- 7 code generation **Python** tasks (with unit tests): [HumanEval](https://huggingface.co/datasets/openai_humaneval), [HumanEval+](https://huggingface.co/datasets/evalplus/humanevalplus), [InstructHumanEval](https://huggingface.co/datasets/codeparrot/instructhumaneval), [APPS](https://huggingface.co/datasets/codeparrot/apps), [MBPP](https://huggingface.co/datasets/mbpp), [MBPP+](https://huggingface.co/datasets/evalplus/mbppplus), and [DS-1000](https://github.com/HKUNLP/DS-1000/) for both completion (left-to-right) and insertion (FIM) mode.
- [HumanEvalPack](https://huggingface.co/datasets/bigcode/humanevalpack) extends HumanEval to **3** scenarios across **6** languages via human translations and was released with [OctoPack](https://arxiv.org/abs/2308.07124).
- [MultiPL-E](https://github.com/nuprl/MultiPL-E) evaluation suite (HumanEval translated into **18** programming languages).
- [Recode](https://github.com/amazon-science/recode/tree/main) applied to the HumanEval benchmark. It evaluates the robustness of code-generation models.
  - [Pal](https://github.com/reasoning-machines/pal) Program-aided Language Models evaluation for grade school math problems: [GSM8K](https://huggingface.co/datasets/gsm8k) and [GSM-HARD](https://huggingface.co/datasets/reasoning-machines/gsm-hard). These problems are solved by generating reasoning chains of text and code.
- Code to text task from [CodeXGLUE](https://huggingface.co/datasets/code_x_glue_ct_code_to_text) (zero-shot & fine-tuning) for 6 languages: **Python, Go, Ruby, Java, JavaScript and PHP.** Documentation translation task from [CodeXGLUE](https://huggingface.co/datasets/code_x_glue_tt_text_to_text).
- [CoNaLa](https://huggingface.co/datasets/neulab/conala) for **Python** code generation (2-shot setting and evaluation with BLEU score).
- [Concode](https://huggingface.co/datasets/code_x_glue_tc_text_to_code) for **Java** code generation (2-shot setting and evaluation with BLEU score).
- 3 multilingual downstream classification tasks: [Java Complexity prediction](https://huggingface.co/datasets/codeparrot/codecomplex), [Java code equivalence prediction](https://huggingface.co/datasets/code_x_glue_cc_clone_detection_big_clone_bench), [C code defect prediction](https://huggingface.co/datasets/code_x_glue_cc_defect_detection).
- [SantaCoder-FIM](https://huggingface.co/datasets/bigcode/santacoder-fim-task) for evaluating FIM on **Python** code using Exact Match. Further details are described in [SantaCoder](https://arxiv.org/abs/2301.03988). Includes two tasks:
- `StarCoderFIM`: which uses the default FIM tokens `"<fim_prefix>", "<fim_middle>", "<fim_suffix>"`, and
- `SantaCoderFIM`: which uses SantaCoder FIM tokens `"<fim-prefix>", "<fim-middle>", "<fim-suffix>"`
More details about each task can be found in the documentation in [`docs/README.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/README.md).
## Setup
```bash
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
```
Install [`torch`](https://pytorch.org/get-started/locally/) based on your device type, and install the other packages using:
```
pip install -e .
```
To run the `DS-1000` benchmark, additional constraints must be resolved.
```
# python version must be 3.7.10
pip install -e ".[ds1000]" # installs all additional dependencies except PyTorch
# torch==1.12.1 required. Download version with relevant GPU support etc., e.g.,
pip install torch==1.12.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
# to suppress any tensorflow optimization warnings,
# precede call to "accelerate launch" with "TF_CPP_MIN_LOG_LEVEL=3"
# on some systems, tensorflow will attempt to allocate all GPU memory
# to its process at import which will raise a CUDA out-of-memory error
# setting "export TF_FORCE_GPU_ALLOW_GROWTH=true" resolves this
```
Also make sure you have `git-lfs` installed and are logged in to the Hub
```
huggingface-cli login
```
We use [`accelerate`](https://huggingface.co/docs/accelerate/index) to generate code/text in parallel when multiple GPUs are present (multi-GPU mode). You can configure it using:
```bash
accelerate config
```
This evaluation harness can also be used in an evaluation-only mode with a multi-CPU setting. For large models, we recommend specifying the precision of the model using the `--precision` flag instead of accelerate config, so that only one copy of the model is kept in memory. You can also load models in 8-bit with the flag `--load_in_8bit` or in 4-bit with `--load_in_4bit`, provided `bitsandbytes` is installed together with the required transformers and accelerate versions.
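For instance, a run that keeps a single copy of the model in memory might look like the following sketch (illustrative; it assumes `bf16` is among the accepted `--precision` values on your install, and `--load_in_8bit` could be used instead if `bitsandbytes` is available):
```bash
accelerate launch main.py \
  --model <MODEL_NAME> \
  --tasks <TASK_NAME> \
  --precision bf16 \
  --allow_code_execution
```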
The evaluation part (execution of the solutions) for [MultiPL-E](https://github.com/nuprl/MultiPL-E) requires extra dependencies for some programming languages; we provide a Dockerfile with all dependencies, see the [Docker](#docker-containers) section for more details.
## Usage
You can use this evaluation harness to generate text solutions to code benchmarks with your model, to evaluate (and execute) the solutions or to do both. While it is better to use GPUs for the generation, the evaluation only requires CPUs. So it might be beneficial to separate these two steps. By default both generation and evaluation are performed.
For more details on how to evaluate on the tasks, please refer to the documentation in [`docs/README.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/README.md).
### Generation and evaluation
Below is an example to generate and evaluate on a task.
```bash
accelerate launch main.py \
--model <MODEL_NAME> \
--tasks <TASK_NAME> \
--limit <NUMBER_PROBLEMS> \
--max_length_generation <MAX_LENGTH> \
--temperature <TEMPERATURE> \
--do_sample True \
--n_samples 100 \
--batch_size 10 \
--precision <PRECISION> \
--allow_code_execution \
--save_generations
```
* `limit` represents the number of problems to solve; if it is not provided, all problems in the benchmark are selected.
* `allow_code_execution` is required for executing the generated code: it is off by default; read the displayed warning before passing this flag to enable execution.
* Some models with custom code on the HF hub like [SantaCoder](https://huggingface.co/bigcode/santacoder) require passing `--trust_remote_code`; for private models, add `--use_auth_token`.
* `save_generations` saves the post-processed generations in a json file at `save_generations_path` (by default `generations.json`). You can also save references by passing `--save_references`.
* `max_length_generation` is the maximum token length of generation, including the input token length. The default is 512, but for some tasks like GSM8K and GSM-Hard, the complete prompt with 8-shot examples (as used in [PAL](https://github.com/reasoning-machines/pal)) takes up `~1500` tokens, so the value should be greater than that; the recommended `max_length_generation` for these tasks is `2048`.
* Some tasks don't require code execution, such as
`codexglue_code_to_text-<LANGUAGE>`/`codexglue_code_to_text-python-left`/`conala`/`concode`, which use BLEU evaluation. In addition, we generate one candidate solution for each problem in these tasks, so use `n_samples=1` and `batch_size=1` (note that `batch_size` should always be less than or equal to `n_samples`); see the example command after this list.
* For APPS tasks, you can use `n_samples=1` for strict and average accuracies (from the original APPS paper) and `n_samples>1` for pass@k.
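For example, a BLEU-evaluated task without code execution could be run with a command along these lines (illustrative; the model name is a placeholder):
```bash
accelerate launch main.py \
  --model <MODEL_NAME> \
  --tasks codexglue_code_to_text-python-left \
  --n_samples 1 \
  --batch_size 1 \
  --save_generations
```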
### Generation only
If you want to generate solutions without executing and evaluating the code, pass `--generation_only` in addition to the instructions above. This will save the solutions in a json file at `save_generations_path` in the working directory.
This can be useful if you don't want to execute code on the machine you're using for generation, for security or efficiency reasons. For instance, you can run the generation on multiple GPUs, then switch to a multi-worker CPU machine or a Docker container for the execution.
### Evaluation only
If you already have the generations in a json file from this evaluation harness and want to evaluate them, specify the path of the generations via the `load_generations_path` argument. You may need to reconfigure `accelerate` to use multiple CPUs.
Below is an example; be mindful to specify arguments appropriate to the task you are evaluating on, and note that the `model` value here only serves to document the experiment. Also add `--n_samples` to specify the number of samples to evaluate per problem (usually the same value used during generation).
```bash
accelerate launch main.py --tasks mbpp --allow_code_execution --load_generations_path generations.json --model incoder-temperature-08
```
## Docker containers
For safety, we provide Dockerfiles to run the execution inside a Docker container. To do so, first run the generation on your machine and save it in `generations.json`, for example by adding the flag `--generation_only` to the command. Then use the Docker image that we provide:
```bash
$ docker pull ghcr.io/bigcode-project/evaluation-harness
$ docker tag ghcr.io/bigcode-project/evaluation-harness evaluation-harness
```
If you want to evaluate on MultiPL-E, we have a different Dockerfile since it requires more dependencies, use:
```bash
$ docker pull ghcr.io/bigcode-project/evaluation-harness-multiple
$ docker tag ghcr.io/bigcode-project/evaluation-harness-multiple evaluation-harness-multiple
```
### Building Docker images
If you modify the evaluation harness, you may want to rebuild the docker images.
Here's how to build a docker image for the evaluation harness:
```bash
$ sudo make DOCKERFILE=Dockerfile all
```
This creates an image called `evaluation-harness` and runs a test on it. To skip the test, remove `all` from the command.
For MultiPL-E:
```bash
$ sudo make DOCKERFILE=Dockerfile-multiple all
```
This creates an image called `evaluation-harness-multiple`.
### Evaluating inside a container
Suppose you generated text with the `bigcode/santacoder` model and saved it in `generations_py.json` with:
```bash
accelerate launch main.py \
--model bigcode/santacoder \
--tasks multiple-py \
--max_length_generation 650 \
--temperature 0.8 \
--do_sample True \
--n_samples 200 \
--batch_size 200 \
--trust_remote_code \
--generation_only \
--save_generations \
--save_generations_path generations_py.json
```
To run the container (here built from the image `evaluation-harness-multiple`) and evaluate `generations_py.json` (or another file, mounted with `-v`), specify `n_samples`, allow code execution with `--allow_code_execution`, and add the number of problems with `--limit` if it was used during generation:
```bash
$ sudo docker run -v $(pwd)/generations_py.json:/app/generations_py.json:ro -it evaluation-harness-multiple python3 main.py \
--model bigcode/santacoder \
--tasks multiple-py \
--load_generations_path /app/generations_py.json \
--allow_code_execution \
--temperature 0.8 \
--n_samples 200
```
## Implementing new tasks
To implement a new task in this evaluation harness, see the guide in [`docs/guide`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/guide.md). There are also contribution guidelines in [`CONTRIBUTING.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/CONTRIBUTING.md).
## Documentation
We provide documentation for the existing benchmarks and how to run the evaluation in [`docs/README.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/README.md).
## Remarks
* Currently, we use data-parallel evaluation across multiple GPUs using `accelerate`; this assumes that the model fits on a single GPU.
## Acknowledgements
We thank EleutherAI for their work on the [lm-evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness) from which this repository is inspired.
## Cite as
```
@misc{bigcode-evaluation-harness,
author = {Ben Allal, Loubna and
Muennighoff, Niklas and
Kumar Umapathi, Logesh and
Lipkin, Ben and
von Werra, Leandro},
title = {A framework for the evaluation of code generation models},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/bigcode-project/bigcode-evaluation-harness}},
year = 2022,
}
```
from dataclasses import dataclass, field
from typing import Optional
@dataclass
class EvalArguments:
"""
Configuration for running the evaluation.
"""
prefix: Optional[str] = field(
default="",
metadata={
"help": "Prefix to add to the prompt. For example InCoder needs prefix='<| file ext=.py |>\n'"
},
)
do_sample: Optional[bool] = field(
default=True,
metadata={"help": "Sample from the language model's output distribution."},
)
temperature: Optional[float] = field(
default=0.2, metadata={"help": "Sampling temperature used for generation."}
)
top_k: Optional[int] = field(
default=0, metadata={"help": "Top-k parameter used for generation."}
)
top_p: Optional[float] = field(
default=0.95, metadata={"help": "Top-p parameter used for nucleus sampling."}
)
n_samples: Optional[int] = field(
default=1,
metadata={"help": "Number of completions to generate for each sample."},
)
eos: Optional[str] = field(
default="<|endoftext|>", metadata={"help": "end of sentence token."}
)
seed: Optional[int] = field(
default=0, metadata={"help": "Random seed used for evaluation."}
)
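if __name__ == "__main__":
    # Illustrative sketch (not part of the original file): EvalArguments is a
    # plain dataclass, so it can be parsed from the command line with
    # transformers.HfArgumentParser, as typical Hugging Face entry points do.
    from transformers import HfArgumentParser

    parser = HfArgumentParser(EvalArguments)
    (eval_args,) = parser.parse_args_into_dataclasses()
    print(f"temperature={eval_args.temperature}, n_samples={eval_args.n_samples}")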
from abc import ABC, abstractmethod
from warnings import warn
from datasets import load_dataset
class Task(ABC):
"""A task represents an entire benchmark including its dataset, problems,
answers, generation settings and evaluation methods.
"""
# The name of the `Task` benchmark as denoted in the HuggingFace datasets Hub
DATASET_PATH: str = None
# The name of a subset within `DATASET_PATH`.
DATASET_NAME: str = None
def __init__(self, stop_words=None, requires_execution=True):
"""
:param stop_words: list
list of stop words if the generation uses a stopping criteria during generation
:param requires_execution: bool
            whether the task requires code execution during evaluation or not
"""
self.stop_words = stop_words
self.requires_execution = requires_execution
try:
dataset_kwargs = {}
if "humaneval" in self.DATASET_PATH:
dataset_kwargs['data_files'] = {
'test': "/workspace/openai_humaneval/0.0.0/7dce6050a7d6d172f3cc5c32aa97f52fa1a2e544/openai_humaneval-test.arrow"
}
elif "mbpp" in self.DATASET_PATH:
dataset_kwargs['data_files'] = {
'train': "/workspace/mbpp/full/0.0.0/4bb6404fdc6cacfda99d4ac4205087b89d32030c/mbpp-train.arrow",
'test': "/workspace/mbpp/full/0.0.0/4bb6404fdc6cacfda99d4ac4205087b89d32030c/mbpp-test.arrow",
'validation': "/workspace/mbpp/full/0.0.0/4bb6404fdc6cacfda99d4ac4205087b89d32030c/mbpp-validation.arrow"
}
self.dataset = load_dataset("arrow", **dataset_kwargs if dataset_kwargs is not None else {})
except Exception as e:
warn(
f"Loading the dataset failed with {str(e)}. This task will use a locally downloaded dataset, not from the HF hub. \
This is expected behavior for the DS-1000 benchmark but not for other benchmarks!"
)
@abstractmethod
def get_dataset(self):
"""Returns dataset for the task or an iterable of any object, that get_prompt can handle"""
return []
def fewshot_examples(self):
"""Loads and returns the few-shot examples for the task if they exist."""
pass
@abstractmethod
def get_prompt(self, doc):
"""Builds the prompt for the LM to generate from.
:param doc: dict[str: str]
sample from the test dataset
"""
pass
@abstractmethod
def get_reference(self, doc):
"""Builds the reference solution for the doc.
:param doc: dict[str: str]
sample from the test dataset
"""
pass
@abstractmethod
def postprocess_generation(self, generation, idx):
"""Defines the postprocessing for a LM generation.
:param generation: str
code generation from LM
:param idx: int
index of doc in the dataset to which the generation belongs
"""
pass
@abstractmethod
def process_results(self, generations, references):
"""Takes the list of LM generations and evaluates them against ground truth references,
returning the metric for the generations as in {"metric_name": result}.
:param generations: list(list(str))
list of lists containing generations
:param references: list(str)
            list of str containing references
:return: dict[str: float]
"""
pass
@staticmethod
def _stop_at_stop_token(decoded_string, stop_tokens):
"""
Produces the prefix of decoded_string that ends at the first occurrence of
a stop_token.
WARNING: the decoded_string *must not* include the prompt, which may have stop tokens
itself.
"""
min_stop_index = len(decoded_string)
for stop_token in stop_tokens:
stop_index = decoded_string.find(stop_token)
if stop_index != -1 and stop_index < min_stop_index:
min_stop_index = stop_index
return decoded_string[:min_stop_index]
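# Illustrative sketch (not a task from this repository): a minimal Task subclass
# showing which abstract methods a new benchmark must implement. The dataset id
# and field names below are placeholders.
class ExampleExactMatchTask(Task):
    DATASET_PATH = "some-org/some-dataset"  # placeholder dataset id
    DATASET_NAME = None

    def __init__(self):
        super().__init__(stop_words=["\n\n"], requires_execution=False)

    def get_dataset(self):
        return self.dataset["test"]

    def get_prompt(self, doc):
        return doc["prompt"]

    def get_reference(self, doc):
        return doc["reference"]

    def postprocess_generation(self, generation, idx):
        return self._stop_at_stop_token(generation, self.stop_words)

    def process_results(self, generations, references):
        # exact-match rate between each first generation and its reference
        correct = sum(
            bool(gens) and gens[0].strip() == ref.strip()
            for gens, ref in zip(generations, references)
        )
        return {"exact_match": correct / max(len(references), 1)}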
import inspect
import json
import os
import warnings
from typing import List
from bigcode_eval import tasks
from bigcode_eval.generation import parallel_generations
_WARNING = """
################################################################################
!!!WARNING!!!
################################################################################
The "code_eval"/"apps_metric" you are about to use executes untrusted
model-generated code in Python.
Although it is highly unlikely that model-generated code will do something
overtly malicious in response to this test suite, model-generated code may act
destructively due to a lack of model capability or alignment.
Users are strongly encouraged to sandbox this evaluation suite so that it
does not perform destructive actions on their host or network. For more
information on how OpenAI sandboxes its code, see the paper "Evaluating Large
Language Models Trained on Code" (https://arxiv.org/abs/2107.03374).
Once you have read this disclaimer and taken appropriate precautions, set the argument
"allow_code_execution" to True.
################################################################################\
"""
class Evaluator:
def __init__(self, accelerator, model, tokenizer, args):
self.accelerator = accelerator
self.model = model
self.tokenizer = tokenizer
self.args = args
# setup arguments
self.metric_output_path = args.metric_output_path
# code evaluation permission
self.allow_code_execution = args.allow_code_execution
def generate_text(self, task_name, intermediate_generations=None):
task = tasks.get_task(task_name, self.args)
dataset = task.get_dataset()
# if args.limit is None, use all samples
# if args.limit is used, make sure args.limit_start + args.limit <= len(dataset)
n_tasks = min(self.args.limit, len(dataset) - self.args.limit_start) if self.args.limit else len(dataset)
# when args.limit is None
# adjust n_tasks by args.limit_start to prevent out of bounds issues
if not self.args.limit:
n_tasks -= self.args.limit_start
references = [task.get_reference(dataset[i]) for i in range(self.args.limit_start, self.args.limit_start+n_tasks)]
if self.args.check_references:
if "get_solution" in inspect.signature(task.get_reference).parameters:
solutions = [[task.get_reference(dataset[i], get_solution=True)] for i in range(self.args.limit_start, self.args.limit_start+n_tasks)]
else:
solutions = [[ref] for ref in references]
return solutions, references
curr_generations = [] # list[list[str | None] | None]
if intermediate_generations:
curr_generations = [gen for gen in intermediate_generations if gen]
n_tasks -= len(curr_generations)
intermediate_save_generations_path = f"{os.path.splitext(self.args.save_generations_path)[0]}_{task_name}_intermediate.json"
curr_sample_idx = len(curr_generations)
generations = parallel_generations(
task,
dataset,
self.accelerator,
self.model,
self.tokenizer,
n_tasks=n_tasks,
args=self.args,
            curr_sample_idx=curr_sample_idx,  # curr_sample_idx will be added to limit_start to fix indexing
save_every_k_tasks=self.args.save_every_k_tasks,
intermediate_generations=curr_generations,
intermediate_save_generations_path=intermediate_save_generations_path,
)
if len(generations[0]) > self.args.n_samples:
generations = [l[: self.args.n_samples] for l in generations]
warnings.warn(
                f"Number of tasks wasn't proportional to the number of devices; we removed extra predictions to only keep n_samples={self.args.n_samples}"
)
return generations, references
def evaluate(self, task_name, intermediate_generations=None):
task = tasks.get_task(task_name, self.args)
if task.requires_execution and not self.allow_code_execution:
raise ValueError(_WARNING)
generations, references = self.generate_text(task_name, intermediate_generations=intermediate_generations)
if self.accelerator.is_main_process:
if not self.args.load_generations_path:
save_generations_path = f"{os.path.splitext(self.args.save_generations_path)[0]}_{task_name}.json"
self.save_json_files(generations, references, save_generations_path, f"references_{task_name}.json")
# make sure tokenizer plays nice with multiprocessing
os.environ["TOKENIZERS_PARALLELISM"] = "false"
if self.allow_code_execution and task.requires_execution:
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
print("Evaluating generations...")
results = task.process_results(generations, references)
return results
def save_json_files(
self,
generations: List[str],
references: List[str],
save_generations_path: str,
save_references_path: str,
) -> None:
if self.args.save_generations:
with open(save_generations_path, "w") as fp:
json.dump(generations, fp)
print(f"generations were saved at {save_generations_path}")
if self.args.save_references:
with open(save_references_path, "w") as fp:
json.dump(references, fp)
print(f"references were saved at {save_references_path}")
import json
from math import ceil
from typing import List, Optional
from accelerate.utils import set_seed
from torch.utils.data.dataloader import DataLoader
from transformers import StoppingCriteria, StoppingCriteriaList
from bigcode_eval.utils import TokenizedDataset, complete_code
class EndOfFunctionCriteria(StoppingCriteria):
"""Custom `StoppingCriteria` which checks if all generated functions in the batch are completed."""
def __init__(self, start_length, eof_strings, tokenizer, check_fn=None):
self.start_length = start_length
self.eof_strings = eof_strings
self.tokenizer = tokenizer
if check_fn is None:
check_fn = lambda decoded_generation: any(
[stop_string in decoded_generation for stop_string in self.eof_strings]
)
self.check_fn = check_fn
def __call__(self, input_ids, scores, **kwargs):
"""Returns true if all generated sequences contain any of the end-of-function strings."""
decoded_generations = self.tokenizer.batch_decode(input_ids[:, self.start_length :])
return all([self.check_fn(decoded_generation) for decoded_generation in decoded_generations])
class TooLongFunctionCriteria(StoppingCriteria):
"""Custom `StoppingCriteria` which checks if the generated function is too long by a certain multiplier based on input length."""
def __init__(self, input_length, multiplier):
self.input_length = input_length
self.multiplier = multiplier
def __call__(self, input_ids, scores, **kwargs):
"""Returns true if generated sequence is too long."""
return input_ids.shape[1] > int(self.input_length * self.multiplier)
def parallel_generations(
task,
dataset,
accelerator,
model,
tokenizer,
n_tasks,
args,
curr_sample_idx: int = 0,
save_every_k_tasks: int = -1,
intermediate_generations: Optional[List[Optional[List[Optional[str]]]]] = None,
intermediate_save_generations_path: Optional[str] = None,
):
if args.load_generations_path:
# load generated code
with open(args.load_generations_path) as fp:
generations = json.load(fp)
if accelerator.is_main_process:
print(
f"generations loaded, {n_tasks} selected from {len(generations)} with {len(generations[0])} candidates"
)
return generations[:n_tasks]
set_seed(args.seed, device_specific=True)
# Setup generation settings
gen_kwargs = {
"do_sample": args.do_sample,
"temperature": args.temperature,
"top_p": args.top_p,
"top_k": args.top_k,
"max_length": args.max_length_generation,
}
stopping_criteria = []
# The input_length / start_length set to 0 for now will be adjusted later
# Check if the task has a custom check_fn method for the stopping criteria
if task.stop_words and tokenizer.eos_token:
task.stop_words.append(tokenizer.eos_token)
if hasattr(task, "check_fn"):
stopping_criteria.append(
EndOfFunctionCriteria(0, task.stop_words, tokenizer, task.check_fn)
)
elif task.stop_words:
stopping_criteria.append(
EndOfFunctionCriteria(0, task.stop_words, tokenizer)
)
if hasattr(task, "max_length_multiplier") and task.max_length_multiplier:
stopping_criteria.append(
TooLongFunctionCriteria(0, task.max_length_multiplier)
)
if stopping_criteria:
gen_kwargs["stopping_criteria"] = StoppingCriteriaList(stopping_criteria)
if args.instruction_tokens:
instruction_tokens = args.instruction_tokens.split(",")
if len(instruction_tokens) != 3:
raise ValueError(
"Instruction tokens should contain exactly 3 tokens separated by a comma. If a token is empty, represent it as ''"
)
for token in instruction_tokens:
if token.strip() != "":
task.stop_words.append(token)
else:
instruction_tokens = None
if accelerator.is_main_process:
print(f"number of problems for this task is {n_tasks}")
n_copies = ceil(args.n_samples / args.batch_size)
ds_tokenized = TokenizedDataset(
task,
dataset,
tokenizer,
num_devices=accelerator.state.num_processes,
max_length=args.max_length_generation,
limit_start=args.limit_start + curr_sample_idx,
n_tasks=n_tasks,
n_copies=n_copies,
prefix=args.prefix,
has_encoder=args.modeltype == "seq2seq",
instruction_tokens=instruction_tokens,
)
    # note: args.batch_size is actually used as num_return_sequences, not the dataloader batch size
ds_loader = DataLoader(ds_tokenized, batch_size=1)
is_loaded_in_8bit = getattr(model, "is_loaded_in_8bit", False)
is_loaded_in_4bit = getattr(model, "is_loaded_in_4bit", False)
if args.max_memory_per_gpu is not None:
# The model is already sharded across multiple GPUs
ds_loader = accelerator.prepare(ds_loader)
elif not is_loaded_in_8bit and not is_loaded_in_4bit:
# we only wrap data loader to avoid extra memory occupation
model = model.to(accelerator.device)
ds_loader = accelerator.prepare(ds_loader)
else:
# model.to() is not supported for 8bit and 4bit models
model, ds_loader = accelerator.prepare(model, ds_loader)
generations = complete_code(
task,
accelerator,
model,
tokenizer,
ds_loader,
n_tasks=n_tasks,
limit_start=args.limit_start + curr_sample_idx,
batch_size=args.batch_size,
prefix=args.prefix,
instruction_tokens=instruction_tokens,
postprocess=args.postprocess,
is_wrapped=is_loaded_in_8bit or is_loaded_in_4bit,
save_every_k_tasks=save_every_k_tasks,
intermediate_generations=intermediate_generations,
intermediate_save_generations_path=intermediate_save_generations_path,
**gen_kwargs,
)
return generations
import inspect
from pprint import pprint
from . import (apps, codexglue_code_to_text, codexglue_text_to_text, conala,
concode, ds1000, gsm, humaneval, humanevalplus, humanevalpack,
instruct_humaneval, instruct_wizard_humaneval, mbpp, mbppplus,
multiple, parity, python_bugs, quixbugs, recode, santacoder_fim)
TASK_REGISTRY = {
**apps.create_all_tasks(),
**codexglue_code_to_text.create_all_tasks(),
**codexglue_text_to_text.create_all_tasks(),
**multiple.create_all_tasks(),
"codexglue_code_to_text-python-left": codexglue_code_to_text.LeftCodeToText,
"conala": conala.Conala,
"concode": concode.Concode,
**ds1000.create_all_tasks(),
**humaneval.create_all_tasks(),
**humanevalplus.create_all_tasks(),
**humanevalpack.create_all_tasks(),
"mbpp": mbpp.MBPP,
"mbppplus": mbppplus.MBPPPlus,
"parity": parity.Parity,
"python_bugs": python_bugs.PythonBugs,
"quixbugs": quixbugs.QuixBugs,
"instruct_wizard_humaneval": instruct_wizard_humaneval.HumanEvalWizardCoder,
**gsm.create_all_tasks(),
**instruct_humaneval.create_all_tasks(),
**recode.create_all_tasks(),
**santacoder_fim.create_all_tasks(),
}
ALL_TASKS = sorted(list(TASK_REGISTRY))
def get_task(task_name, args=None):
try:
kwargs = {}
if "prompt" in inspect.signature(TASK_REGISTRY[task_name]).parameters:
kwargs["prompt"] = args.prompt
if "load_data_path" in inspect.signature(TASK_REGISTRY[task_name]).parameters:
kwargs["load_data_path"] = args.load_data_path
return TASK_REGISTRY[task_name](**kwargs)
except KeyError:
print("Available tasks:")
pprint(TASK_REGISTRY)
raise KeyError(f"Missing task {task_name}")
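if __name__ == "__main__":
    # Illustrative sketch (assumption: the "mbpp" task takes no extra constructor
    # arguments, so no args namespace is needed). Run with
    # `python -m bigcode_eval.tasks` from the repository root.
    example_task = get_task("mbpp")
    print(type(example_task).__name__, "requires_execution:", example_task.requires_execution)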
"""Measuring Coding Challenge Competence With APPS
https://arxiv.org/abs/2105.09938
APPS is a benchmark for code generation with 10,000 problems across three difficulty levels: introductory, interview and competition.
It can be used to evaluate the ability of language models to generate code from natural language specifications.
Homepage: https://github.com/hendrycks/apps
"""
import json
from evaluate import load
from bigcode_eval.base import Task
_CITATION = """
@article{hendrycksapps2021,
title={Measuring Coding Challenge Competence With APPS},
author={Dan Hendrycks and Steven Basart and Saurav Kadavath and Mantas Mazeika and Akul Arora and Ethan Guo and Collin Burns and Samir Puranik and Horace He and Dawn Song and Jacob Steinhardt},
journal={NeurIPS},
year={2021}
}
"""
LEVELS = ["introductory", "interview", "competition"]
def create_all_tasks():
"""Creates a dictionary of tasks from a list of levels
:return: {task_name: task}
        e.g. {apps-interview: Task, apps-competition: Task}
"""
return {f"apps-{level}": create_task(level) for level in LEVELS}
def create_task(level):
class APPS(GeneralAPPS):
def __init__(self, **kwargs):
super().__init__(level, **kwargs)
return APPS
class GeneralAPPS(Task):
"""A task represents an entire benchmark including its dataset, problems,
answers, generation settings and evaluation methods.
"""
DATASET_PATH = "codeparrot/apps"
DATASET_NAME = None
def __init__(self, level, k_list=[1, 10, 100]):
self.DATASET_NAME = level
super().__init__(
stop_words=["\nQUESTION", "\n---", "\nANSWER"],
requires_execution=True,
)
self.k_list = k_list
def get_dataset(self):
"""Returns dataset for the task or an iterable of any object, that get_prompt can handle"""
return self.dataset["test"]
def get_prompt(self, doc):
"""Generate prompts for APPS
Finetuning setup: prompt=question with some starter code and function name if they exist.
We also specify the type of the prompt, i.e. whether it is call-based or standard input-based.
"""
starter_code = None if len(doc["starter_code"]) == 0 else doc["starter_code"]
try:
            input_output = json.loads(doc["input_output"])
            fn_name = (
                None if not input_output.get("fn_name") else input_output["fn_name"]
            )
except ValueError:
fn_name = None
prompt = "\nQUESTION:\n"
prompt += doc["question"]
if starter_code:
prompt += starter_code
if not fn_name:
call_format = "\nUse Standard Input format"
prompt += call_format
else:
call_format = "\nUse Call-Based format"
prompt += call_format
prompt += "\nANSWER:\n"
return prompt
def get_reference(self, doc):
"""Builds the reference solution for the doc (sample from the test dataset)."""
return None
def postprocess_generation(self, generation, idx):
"""Defines the postprocessing for a LM generation.
:param generation: str
code generation from LM
:param idx: int
index of doc in the dataset to which the generation belongs
(not used for APPS)
"""
try:
generation = generation.split("\nANSWER:", 1)[1]
except IndexError:
# happens when prompts were very long and got truncated
pass
return generation
def process_results(self, generations, references):
"""Takes the list of LM generations and evaluates them against ground truth references,
returning the metric for the generations.
:param generations: list(list(str))
list of lists containing generations
:param references: list(str)
            list of str containing references (not needed for APPS Task)
"""
        code_metric = load("codeparrot/apps_metric")
        results = code_metric.compute(
            predictions=generations, k_list=self.k_list, level=self.DATASET_NAME
        )
return results
"""CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
https://arxiv.org/abs/2102.04664
Code to text task from CodeXGlue (documentation generation):
* for all subsets ("python", "java", "javascript", "ruby", "php", "go") where the whole function body (without docstring) is given as a prompt
* for Python subset where only function signature is used as a prompt (this setting can give better results).
"""
import os
import re
import typing
from bigcode_eval.base import Task
_CITATION = """
@article{husain2019codesearchnet,
title={Codesearchnet challenge: Evaluating the state of semantic code search},
author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
journal={arXiv preprint arXiv:1909.09436},
year={2019}
}
"""
LANGUAGES = ["python", "java", "javascript", "ruby", "php", "go"]
TRIPLE_QUOTE = '"""'
SINGLE_TRIPLE_QUOTE = "'''"
SPACES4 = " " * 4
SUFFIX_PROMPT = {
"python": '\n""" The goal of this function is to:\n',
"ruby": "\n=begin The goal of this function is to:\n",
"other": "\n/* The goal of this function is to:\n",
}
def create_all_tasks():
"""Creates a dictionary of tasks from a list of languages
:return: {task_name: task}
e.g. {codexglue_code_to_text-python: Task, codexglue_code_to_text-java: Task}
"""
return {
f"codexglue_code_to_text-{language}": create_task(language)
for language in LANGUAGES
}
def create_task(language):
class CodeToText(GeneralCodeToText):
def __init__(self, **kwargs):
super().__init__(language, **kwargs)
return CodeToText
def compute_codexglue_code_to_text_bleu(
gold_and_predicted_items: typing.List[typing.Tuple[str, str]]
):
"""
Compute BLEU scores using codexglue_code_to_text_bleu.computeMaps (codexglue_summarization_evaluator)
This uses a specific BLEU tokenization and preprocessing necessary for this task by
the original authors of the dataset.
Taken from: https://github.com/dpfried/lm-evaluation-harness/blob/5d9a6aaaaa929bcad95bb73d85e78fe75eb64b4e/lm_eval/tasks/codexglue_summarization.py#L102
"""
from bigcode_eval.tasks.custom_metrics import codexglue_code_to_text_bleu
predicted_map = {}
gold_map = {}
for ix, (gold_str, predicted_str) in enumerate(gold_and_predicted_items):
gold, *rest = gold_str.strip().split("\t")
if len(rest) > 0:
print(f"warning: gold instance {ix} contains a tab; ignoring text after")
gold_map[ix] = [codexglue_code_to_text_bleu.splitPuncts(gold.strip().lower())]
pred, *rest = predicted_str.strip().split("\t")
if len(rest) > 0:
            print(f"warning: predicted instance {ix} contains a tab; ignoring text after")
predicted_map[ix] = [
codexglue_code_to_text_bleu.splitPuncts(pred.strip().lower())
]
return codexglue_code_to_text_bleu.bleuFromMaps(gold_map, predicted_map)[0]
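# Illustrative usage note: each item passed to compute_codexglue_code_to_text_bleu
# is a (gold, predicted) docstring pair, and the return value is the BLEU score
# computed with the CodeXGLUE tokenization and preprocessing, e.g.:
#   compute_codexglue_code_to_text_bleu(
#       [("Returns the sum of two numbers.", "Return the sum of two numbers.")]
#   )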
class GeneralCodeToText(Task):
"""Code to text task from CodeXGlue for all subsets where the whole
function body (without docstring) is given as a prompt
"""
DATASET_PATH = "code_x_glue_ct_code_to_text"
DATASET_NAME = None
def __init__(self, language):
self.DATASET_NAME = language
stop_words = ["'''", '"""'] if language == "python" else ["\n"]
super().__init__(
stop_words=stop_words,
requires_execution=False,
)
def get_dataset(self):
"""Returns dataset for the task or an iterable of any object, that get_prompt can handle"""
return self.dataset["test"]
@staticmethod
def standardize_docstring_prompt(prefix):
"""Strips any existing docstring delimiters from the prompt prefix
and adds our own delimiter (triple quote) and whitespace.
Note an edge case being handled here:
- codexglue docstring text sometimes contains the docstring delimiters, inconsistently
source: InCoder evaluation code https://github.com/dpfried/lm-evaluation-harness/
"""
for delim in [TRIPLE_QUOTE, SINGLE_TRIPLE_QUOTE]:
if delim in prefix:
prefix = prefix[: prefix.index(delim)]
break
single_single_quote_with_trailing_spaces = re.compile(r'[^\'"][\']\s*$')
if single_single_quote_with_trailing_spaces.search(prefix):
prefix = prefix[
: single_single_quote_with_trailing_spaces.search(prefix).start()
]
single_double_quote_with_trailing_spaces = re.compile(r'[^\'"]["]\s*$')
if single_double_quote_with_trailing_spaces.search(prefix):
prefix = prefix[
: single_double_quote_with_trailing_spaces.search(prefix).start()
]
prefix += TRIPLE_QUOTE
return prefix
def get_prompt(self, doc):
"""Generate prompts for Code to text benchmark (documentation generation)
Prompt = full function body (without the docstring) + '\n[Delimiter] The goal of this function is to:\n'
where delimiter is \""" for python, =begin for ruby and /* for the rest (see SUFFIX_PROMPT).
:param doc: dict[str: str])
"""
code = doc["code"]
if self.DATASET_NAME == "python":
# python code includes the docstring
text = doc["docstring"]
prompt_prefix = code[: code.index(text)]
prompt_prefix = self.standardize_docstring_prompt(prompt_prefix)
prompt_suffix = code[code.index(text) + len(text) :]
prompt_suffix = prompt_suffix.replace(TRIPLE_QUOTE, "")
prompt_suffix = prompt_suffix.replace(SINGLE_TRIPLE_QUOTE, "")
prompt_prefix = prompt_prefix.strip().removesuffix(TRIPLE_QUOTE)
prompt_prefix = prompt_prefix.strip().removesuffix(SINGLE_TRIPLE_QUOTE)
prompt = prompt_prefix + prompt_suffix + SUFFIX_PROMPT["python"]
return prompt
elif self.DATASET_NAME == "ruby":
return code + SUFFIX_PROMPT["ruby"]
else:
return code + SUFFIX_PROMPT["other"]
def get_reference(self, doc):
"""Builds the reference solution for the doc (sample from the test dataset).
:param doc: dict[str: str]
"""
from mosestokenizer import MosesDetokenizer
# deactivate tokenizer parallelism when calling MosesDetokenizer TODO: do it for all refs once
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# docstring_tokens are preprocessed and don't have extra context like variable defs
docstring = " ".join(doc["docstring_tokens"]).replace("\n", "")
# some docstrings started with r""" before tokenization but r was kept
if docstring[0] == "r":
docstring = docstring[1:]
with MosesDetokenizer("en") as detokenize:
docstring = detokenize(docstring.strip().split())
return docstring
def postprocess_generation(self, generation, idx):
"""Defines the postprocessing for a LM generation.
:param generation: str
code generation from LM
:param idx: int
index of doc in the dataset to which the generation belongs
(not used for this Task)
"""
delimiters = {language: SUFFIX_PROMPT["other"] for language in LANGUAGES}
delimiters.update(SUFFIX_PROMPT)
output = generation.split(delimiters[self.DATASET_NAME])[1].strip()
output = output.split("\n")[0]
return output
def process_results(self, generations, references):
"""Takes the list of LM generations and evaluates them against ground truth references,
returning the metric for the generations.
:param generations: list(list(str))
list of lists containing generations
:param references: list(str)
list of str containing references
"""
bleu_score = compute_codexglue_code_to_text_bleu(
(ref, gen[0]) for ref, gen in zip(references, generations)
)
return {"blue": bleu_score}
class LeftCodeToText(GeneralCodeToText):
"""Code to text task from CodeXGlue for Python subset in a left only setting:
only the function signature is given as prompt similarly to Fried et al. (InCoder)
TODO: implement function signature extraction for other languages in the dataset
"""
def __init__(self):
super().__init__("python")
@staticmethod
def standardize_docstring_prompt(prefix):
"""Strips any existing docstring delimiters from the prompt prefix and
and adds our own delimiter (triple quote) and whitespace.
Note an edge cases being handled here:
- codexglue docstring text sometimes contains the docstring delimiters, inconsistently
source: InCoder evaluation code https://github.com/dpfried/lm-evaluation-harness/
"""
for delim in [TRIPLE_QUOTE, SINGLE_TRIPLE_QUOTE]:
if delim in prefix:
prefix = prefix[: prefix.index(delim)]
break
single_single_quote_with_trailing_spaces = re.compile(r'[^\'"][\']\s*$')
if single_single_quote_with_trailing_spaces.search(prefix):
prefix = prefix[
: single_single_quote_with_trailing_spaces.search(prefix).start()
]
single_double_quote_with_trailing_spaces = re.compile(r'[^\'"]["]\s*$')
if single_double_quote_with_trailing_spaces.search(prefix):
prefix = prefix[
: single_double_quote_with_trailing_spaces.search(prefix).start()
]
prefix += TRIPLE_QUOTE
return prefix
def get_prompt(self, doc):
"""Generate prompts for Code to text benchmark (documentation generation)
Prompt = function signature.
:param doc: dict[str: str]
"""
code = doc["code"]
# python code includes the docstring
text = doc["docstring"]
prompt_prefix = code[: code.index(text)]
prompt_prefix = self.standardize_docstring_prompt(prompt_prefix)
return prompt_prefix
def postprocess_generation(self, generation, idx):
"""Defines the postprocessing for a LM generation.
:param generation: str
code generation from LM
:param idx: int
index of doc in the dataset to which the generation belongs
(not used for this Task)
"""
output = generation.strip().split("\n")[0].strip()
for delimiter in [TRIPLE_QUOTE, SINGLE_TRIPLE_QUOTE]:
if delimiter in generation:
generation = generation[generation.index(delimiter) + 3 :]
output = generation.strip().split("\n")[0].strip()
output = output.split(delimiter, 1)[0]
return output
"""
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
https://arxiv.org/abs/2102.04664
Text to text task from CodeXGlue (documentation translation)
"""
import json
import os
import re
from evaluate import load
from bigcode_eval.base import Task
_CITATION = """
@article{CodeXGLUE,
title={CodeXGLUE: A Benchmark Dataset and Open Challenge for Code Intelligence},
year={2020},}
"""
SOURCE_LANG = {
"da_en": "danish",
"zh_en": "chinese",
"no_en": "norwegian",
"lv_en": "latvian",
}
def create_all_tasks():
"""Creates a dictionary of tasks from a list of languages
:return: {task_name: task}
e.g. {codexglue_text_to_text-da_en: Task, codexglue_text_to_text-zh_en: Task}
"""
return {
f"codexglue_text_to_text-{translation_task}": create_task(translation_task)
for translation_task in SOURCE_LANG
}
def create_task(translation_task):
class CodexglueTextToTextTask(CodexglueTextToText):
def __init__(self, **kwargs):
super().__init__(translation_task, **kwargs)
return CodexglueTextToTextTask
class CodexglueTextToText(Task):
DATASET_PATH = "code_x_glue_tt_text_to_text"
DATASET_NAME = None
def __init__(self, translation_task, max_order=4, smooth=True):
self.DATASET_NAME = translation_task
stop_words = ["\n"]
requires_execution = False
super().__init__(stop_words, requires_execution)
self.max_order = max_order
self.smooth = smooth
def get_dataset(self):
"""Returns dataset for the task or an iterable of any object, that get_prompt can handle"""
return self.dataset["test"]
def fewshot_examples(self):
"""Loads and returns the few-shot examples for the task if they exist."""
with open(
"bigcode_eval/tasks/few_shot_examples/codexglue_text_to_text_few_shot_prompts.json",
"r",
) as file:
examples = json.load(file)
return examples
@staticmethod
def two_shot_prompt(entry, text, examples, language):
"""Two shot prompt format as source & target language documentation"""
prompt = f"\n{language.title()}:\n{examples['source1']}\
\nEnglish:\n{examples['target1']}\
\n{language.title()}:\n{examples['source2']}\
\nEnglish:\n{examples['target2']}\
\n{language.title()}:\n{text}\
\nEnglish:\n"
return entry + prompt
def get_prompt(self, doc):
"""Builds the prompt for the LM to generate from."""
language = SOURCE_LANG[self.DATASET_NAME]
text = doc["source"]
entry = f"Translate the following documentation from {language.title()} to English:\n"
examples = self.fewshot_examples()
examples = examples[language]
prompt = self.two_shot_prompt(entry, text, examples, language)
return prompt
def get_reference(self, doc):
"""Builds the reference solution for the doc (sample from the test dataset)."""
return doc["target"].strip()
def postprocess_generation(self, generation, idx):
"""Defines the postprocessing for a LM generation.
:param generation: str
code generation from LM
:param idx: int
index of doc in the dataset to which the generation belongs
(not used for this task)
"""
output = generation.split("\nEnglish:\n", 3)[-1].strip()
return output
def process_results(self, generations, references):
"""Takes the list of LM generations and evaluates them against ground truth references,
returning the metric for the generations.
:param generations: list(list(str))
list of lists containing generations
:param references: list(str)
list of str containing references
"""
bleu = load("bleu")
gens = [gen[0] for gen in generations]
results = bleu.compute(
references=references, predictions=gens, max_order=self.max_order, smooth=self.smooth
)
return results
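# Illustrative sketch (not part of the original file): rendering the two-shot
# translation prompt via the staticmethod with hypothetical few-shot examples,
# without instantiating the task (which would load the dataset).
def _example_text_to_text_prompt():
    examples = {
        "source1": "Fiktiv dokumentation et.",      # hypothetical placeholders
        "target1": "Fictional documentation one.",
        "source2": "Fiktiv dokumentation to.",
        "target2": "Fictional documentation two.",
    }
    entry = "Translate the following documentation from Danish to English:\n"
    return CodexglueTextToText.two_shot_prompt(
        entry, "En docstring der skal oversaettes.", examples, "danish"
    )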
"""Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow
https://arxiv.org/pdf/1805.08949.pdf
Python Code generation with CoNaLa. It is a benchmark of code and natural language pairs, for the evaluation of code generation tasks.
The dataset was crawled from Stack Overflow, automatically filtered, then curated by annotators,
split into 2,379 training and 500 test examples.
Homepage: https://conala-corpus.github.io/
Here we use two-shot evaluation (the original paper evaluates finetuned models)
"""
import json
from evaluate import load
from bigcode_eval.base import Task
_CITATION = """
@inproceedings{yin2018learning,
title={Learning to mine aligned code and natural language pairs from stack overflow},
author={Yin, Pengcheng and Deng, Bowen and Chen, Edgar and Vasilescu, Bogdan and Neubig, Graham},
booktitle={2018 IEEE/ACM 15th international conference on mining software repositories (MSR)},
pages={476--486},
year={2018},
organization={IEEE}
}
"""
class Conala(Task):
"""A task represents an entire benchmark including its dataset, problems,
answers, generation settings and evaluation methods.
"""
DATASET_PATH = "neulab/conala"
def __init__(self, max_order=4, smooth=True):
super().__init__(
stop_words=["\n"],
requires_execution=False,
)
self.max_order = max_order
self.smooth = smooth
def get_dataset(self):
"""Returns dataset for the task or an iterable of any object, that get_prompt can handle"""
return self.dataset["test"]
def fewshot_examples(self):
"""Loads and returns the few-shot examples for the task if they exist."""
with open(
"bigcode_eval/tasks/few_shot_examples/conala_few_shot_prompts.json", "r"
) as file:
examples = json.load(file)
return examples
@staticmethod
def two_shot_prompt(entry, text, examples):
"""Two shot prompt format as instructions & solutions"""
prompt = f"\nInstruction:\n{examples['instruction1']}\
\nSolution:\n{examples['solution1']}\
\nInstruction:\n{examples['instruction2']}\
\nSolution:\n{examples['solution2']}\
\nInstruction:\n{text}\
\nSolution:\n"
assert (
prompt.count("Solution:\n") == 3
), "Splitting operation in postprocess_generation is invalid"
return entry + prompt
def get_prompt(self, doc):
"""Builds the prompt for the LM to generate from."""
examples = self.fewshot_examples()
text_column = "rewritten_intent" if doc["rewritten_intent"] else "intent"
text = doc[text_column].strip()
entry = "Answer the following instructions in one line of Python code:\n"
prompt = self.two_shot_prompt(entry, text, examples)
return prompt
def get_reference(self, doc):
"""Builds the reference solution for the doc (sample from the test dataset)."""
return doc["snippet"]
def postprocess_generation(self, generation, idx):
"""Defines the postprocessing for a LM generation.
:param generation: str
code generation from LM
:param idx: int
index of doc in the dataset to which the generation belongs
(not used for this task)
"""
output = generation.split("Solution:\n", 3)[-1].strip()
return output
def process_results(self, generations, references):
"""Takes the list of LM generations and evaluates them against ground truth references,
returning the metric for the generations.
:param generations: list(list(str))
list of lists containing generations
:param references: list(str)
list of str containing references
"""
bleu = load("bleu")
gens = [gen[0] for gen in generations]
results = bleu.compute(
references=references, predictions=gens, max_order=self.max_order, smooth=self.smooth
)
return results
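# Illustrative sketch (not part of the original file): the two-shot prompt ends
# with "Solution:\n", which postprocess_generation relies on when splitting.
# The few-shot examples below are hypothetical placeholders.
def _example_conala_prompt():
    examples = {
        "instruction1": "reverse a list `l`",
        "solution1": "l[::-1]",
        "instruction2": "get the maximum of a list `l`",
        "solution2": "max(l)",
    }
    entry = "Answer the following instructions in one line of Python code:\n"
    return Conala.two_shot_prompt(entry, "sort a list `l` in place", examples)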
"""Mapping Language to Code in Programmatic Context (Concode)
https://arxiv.org/abs/1808.09588
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
https://arxiv.org/abs/2102.04664
Java code generation in CodeXGLUE text-to-code dataset (built from Concode dataset)
Available at https://huggingface.co/datasets/code_x_glue_tc_text_to_code
2000 samples are available in the test set.
Here we use two-shot evaluation (the original paper evaluates finetuned models)
"""
import json
from evaluate import load
from bigcode_eval.base import Task
_CITATION = """
@article{iyer2018mapping,
title={Mapping language to code in programmatic context},
author={Iyer, Srinivasan and Konstas, Ioannis and Cheung, Alvin and Zettlemoyer, Luke},
journal={arXiv preprint arXiv:1808.09588},
year={2018}
}
"""
class Concode(Task):
"""A task represents an entire benchmark including its dataset, problems,
answers, generation settings and evaluation methods.
"""
DATASET_PATH = "code_x_glue_tc_text_to_code"
def __init__(self, max_order=4, smooth=True):
super().__init__(
stop_words=["\n"],
requires_execution=False,
)
self.max_order = max_order
self.smooth = smooth
def get_dataset(self):
"""Returns dataset for the task or an iterable of any object, that get_prompt can handle"""
# test split of the dataset doesn't have targets
return self.dataset["validation"]
def fewshot_examples(self):
"""Loads and returns the few-shot examples for the task if they exist."""
with open(
"bigcode_eval/tasks/few_shot_examples/concode_few_shot_prompts.json", "r"
) as file:
examples = json.load(file)
return examples
@staticmethod
def two_shot_prompt(entry, text, examples):
"""Two shot prompt format as instructions & solutions"""
prompt = f"\nInstruction:\n{examples['instruction1']}\
\nSolution:\n{examples['solution1']}\
\nInstruction:\n{examples['instruction2']}\
\nSolution:\n{examples['solution2']}\
\nInstruction:\n{text}\
\nSolution:\n"
assert (
prompt.count("Solution:\n") == 3
), "Splitting operation in postprocess_generation is invalid"
return entry + prompt
def get_prompt(self, doc):
"""Builds the prompt for the LM to generate from."""
examples = self.fewshot_examples()
text = doc["nl"].split("concode_field_sep")[0].strip()
if text.endswith("."):
text = text[:-1].strip()
entry = "Answer the following instructions in a one line of Java code:\n"
prompt = self.two_shot_prompt(entry, text, examples)
return prompt
def get_reference(self, doc):
"""Builds the reference solution for the doc (sample from the test dataset)."""
return doc["code"]
def postprocess_generation(self, generation, idx):
"""Defines the postprocessing for a LM generation.
:param generation: str
code generation from LM
:param idx: int
index of doc in the dataset to which the generation belongs
(not used for this task)
"""
output = generation.split("Solution:\n", 3)[-1].strip()
return output
def process_results(self, generations, references):
"""Takes the list of LM generations and evaluates them against ground truth references,
returning the metric for the generations.
:param generations: list(list(str))
list of lists containing generations
:param references: list(str)
list of str containing references
"""
bleu = load("bleu")
gens = [gen[0] for gen in generations]
results = bleu.compute(
references=references, predictions=gens, max_order=self.max_order, smooth=self.smooth
)
return results
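# Illustrative sketch (not part of the original file): how get_prompt trims the
# natural-language field; "concode_field_sep" separates the description from
# the class-context fields in the dataset. The string below is hypothetical.
def _example_concode_nl_trimming():
    nl = "Adds two integers . concode_field_sep int a concode_field_sep int b"
    text = nl.split("concode_field_sep")[0].strip()
    if text.endswith("."):
        text = text[:-1].strip()
    return text  # "Adds two integers"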
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""The CodeEval metric estimates the pass@k metric for code synthesis.
This is an evaluation harness for the HumanEval problem solving dataset
described in the paper "Evaluating Large Language Models Trained on Code"
(https://arxiv.org/abs/2107.03374)."""
import itertools
import os
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor, as_completed
import numpy as np
from .execute import check_correctness
_CITATION = """\
@misc{chen2021evaluating,
title={Evaluating Large Language Models Trained on Code},
author={Mark Chen and Jerry Tworek and Heewoo Jun and Qiming Yuan \
and Henrique Ponde de Oliveira Pinto and Jared Kaplan and Harri Edwards \
and Yuri Burda and Nicholas Joseph and Greg Brockman and Alex Ray \
and Raul Puri and Gretchen Krueger and Michael Petrov and Heidy Khlaaf \
and Girish Sastry and Pamela Mishkin and Brooke Chan and Scott Gray \
and Nick Ryder and Mikhail Pavlov and Alethea Power and Lukasz Kaiser \
and Mohammad Bavarian and Clemens Winter and Philippe Tillet \
and Felipe Petroski Such and Dave Cummings and Matthias Plappert \
and Fotios Chantzis and Elizabeth Barnes and Ariel Herbert-Voss \
and William Hebgen Guss and Alex Nichol and Alex Paino and Nikolas Tezak \
and Jie Tang and Igor Babuschkin and Suchir Balaji and Shantanu Jain \
and William Saunders and Christopher Hesse and Andrew N. Carr \
and Jan Leike and Josh Achiam and Vedant Misra and Evan Morikawa \
and Alec Radford and Matthew Knight and Miles Brundage and Mira Murati \
and Katie Mayer and Peter Welinder and Bob McGrew and Dario Amodei \
and Sam McCandlish and Ilya Sutskever and Wojciech Zaremba},
year={2021},
eprint={2107.03374},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
"""
_DESCRIPTION = """\
This metric implements the evaluation harness for the HumanEval problem solving dataset
described in the paper "Evaluating Large Language Models Trained on Code"
(https://arxiv.org/abs/2107.03374).
"""
_KWARGS_DESCRIPTION = """
Calculates how good predictions are given some references, using certain scores
Args:
predictions: list of candidates to evaluate. Each candidates should be a list
of strings with several code candidates to solve the problem.
references: a list with a test for each prediction. Each test should evaluate the
correctness of a code candidate.
k: number of code candidates to consider in the evaluation (Default: [1, 10, 100])
num_workers: number of workers used to evaluate the candidate programs (Default: 4).
timeout: maximum time in seconds allowed for each candidate program to run (Default: 3.0).
Returns:
pass_at_k: dict with pass rates for each k
results: dict with granular results of each unittest
Examples:
>>> test_cases = ["assert add(2,3)==5"]
>>> candidates = [["def add(a,b): return a*b", "def add(a, b): return a+b"]]
>>> pass_at_k, results = compute_code_eval(references=test_cases, predictions=candidates, k=[1, 2])
>>> print(pass_at_k)
{'pass@1': 0.5, 'pass@2': 1.0}
"""
_WARNING = """
################################################################################
!!!WARNING!!!
################################################################################
The "code_eval" metric executes untrusted model-generated code in Python.
Although it is highly unlikely that model-generated code will do something
overtly malicious in response to this test suite, model-generated code may act
destructively due to a lack of model capability or alignment.
Users are strongly encouraged to sandbox this evaluation suite so that it
does not perform destructive actions on their host or network. For more
information on how OpenAI sandboxes its code, see the paper "Evaluating Large
Language Models Trained on Code" (https://arxiv.org/abs/2107.03374).
Once you have read this disclaimer and taken appropriate precautions,
set the environment variable HF_ALLOW_CODE_EVAL="1". Within Python you can do this
with:
>>> import os
>>> os.environ["HF_ALLOW_CODE_EVAL"] = "1"
################################################################################\
"""
_LICENSE = """The MIT License
Copyright (c) OpenAI (https://openai.com)
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE."""
def compute_code_eval(predictions, references, k=[1, 10, 100], num_workers=4, timeout=3.0):
"""Returns the scores"""
if os.getenv("HF_ALLOW_CODE_EVAL", 0) != "1":
raise ValueError(_WARNING)
if os.name == "nt":
raise NotImplementedError("This metric is currently not supported on Windows.")
with ThreadPoolExecutor(max_workers=num_workers) as executor:
futures = []
completion_id = Counter()
n_samples = 0
results = defaultdict(list)
for task_id, (candidates, test_case) in enumerate(zip(predictions, references)):
for candidate in candidates:
test_program = candidate + "\n" + test_case
args = (test_program, timeout, task_id, completion_id[task_id])
future = executor.submit(check_correctness, *args)
futures.append(future)
completion_id[task_id] += 1
n_samples += 1
for future in as_completed(futures):
result = future.result()
results[result["task_id"]].append((result["completion_id"], result))
total, correct = [], []
for result in results.values():
result.sort()
passed = [r[1]["passed"] for r in result]
total.append(len(passed))
correct.append(sum(passed))
total = np.array(total)
correct = np.array(correct)
ks = k
if not isinstance(ks, (list, tuple)):
ks = [ks]
pass_at_k = {f"pass@{k}": estimate_pass_at_k(total, correct, k).mean() for k in ks if (total >= k).all()}
return pass_at_k, results
def estimate_pass_at_k(num_samples, num_correct, k):
"""Estimates pass@k of each problem and returns them in an array."""
def estimator(n: int, c: int, k: int) -> float:
"""Calculates 1 - comb(n - c, k) / comb(n, k)."""
if n - c < k:
return 1.0
return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
if isinstance(num_samples, int):
num_samples_it = itertools.repeat(num_samples, len(num_correct))
else:
assert len(num_samples) == len(num_correct)
num_samples_it = iter(num_samples)
return np.array([estimator(int(n), int(c), k) for n, c in zip(num_samples_it, num_correct)])
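# Worked example (illustrative only, not part of the original file): with n=2
# samples per problem and c=1 passing, the unbiased pass@1 estimate is
# 1 - C(1, 1) / C(2, 1) = 0.5, matching the docstring example above.
def _example_pass_at_k():
    total = np.array([2])    # two candidates generated for the single problem
    correct = np.array([1])  # one of them passed its tests
    return estimate_pass_at_k(total, correct, k=1)  # array([0.5])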
#!/usr/bin/python
"""From https://github.com/microsoft/CodeXGLUE/tree/main/Code-Text/code-to-text (evaluator/evaluator.py)
Call with:
python codexglue_bleu_evaluator.py ref_file < hyp_file
"""
"""
This script was adapted from the original version by hieuhoang1972 which is part of MOSES.
"""
# $Id: bleu.py 1307 2007-03-14 22:22:36Z hieuhoang1972 $
"""Provides:
cook_refs(refs, n=4): Transform a list of reference sentences as strings into a form usable by cook_test().
cook_test(test, refs, n=4): Transform a test sentence as a string (together with the cooked reference sentences) into a form usable by score_cooked().
score_cooked(alltest, n=4): Score a list of cooked test sentences.
score_set(s, testid, refids, n=4): Interface with dataset.py; calculate BLEU score of testid against refids.
The reason for breaking the BLEU computation into three phases cook_refs(), cook_test(), and score_cooked() is to allow the caller to calculate BLEU scores for multiple test sets as efficiently as possible.
"""
import math
import os
import re
import subprocess
import sys
import xml.sax.saxutils
# Added to bypass NIST-style pre-processing of hyp and ref files -- wade
nonorm = 0
preserve_case = False
eff_ref_len = "shortest"
normalize1 = [
("<skipped>", ""), # strip "skipped" tags
(r"-\n", ""), # strip end-of-line hyphenation and join lines
(r"\n", " "), # join lines
# (r'(\d)\s+(?=\d)', r'\1'), # join digits
]
normalize1 = [(re.compile(pattern), replace) for (pattern, replace) in normalize1]
normalize2 = [
(
r"([\{-\~\[-\` -\&\(-\+\:-\@\/])",
r" \1 ",
), # tokenize punctuation. apostrophe is missing
(
r"([^0-9])([\.,])",
r"\1 \2 ",
), # tokenize period and comma unless preceded by a digit
(
r"([\.,])([^0-9])",
r" \1 \2",
), # tokenize period and comma unless followed by a digit
(r"([0-9])(-)", r"\1 \2 "), # tokenize dash when preceded by a digit
]
normalize2 = [(re.compile(pattern), replace) for (pattern, replace) in normalize2]
def normalize(s):
"""Normalize and tokenize text. This is lifted from NIST mteval-v11a.pl."""
# Added to bypass NIST-style pre-processing of hyp and ref files -- wade
if nonorm:
return s.split()
if type(s) is not str:
s = " ".join(s)
# language-independent part:
for (pattern, replace) in normalize1:
s = re.sub(pattern, replace, s)
s = xml.sax.saxutils.unescape(s, {"&quot;": '"'})
# language-dependent part (assuming Western languages):
s = " %s " % s
if not preserve_case:
s = s.lower() # this might not be identical to the original
for (pattern, replace) in normalize2:
s = re.sub(pattern, replace, s)
return s.split()
def count_ngrams(words, n=4):
counts = {}
for k in range(1, n + 1):
for i in range(len(words) - k + 1):
ngram = tuple(words[i : i + k])
counts[ngram] = counts.get(ngram, 0) + 1
return counts
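# Small illustrative check (not part of the original script): unigram and
# bigram counts for a three-token sentence.
def _example_count_ngrams():
    return count_ngrams(["the", "cat", "the"], n=2)
    # {('the',): 2, ('cat',): 1, ('the', 'cat'): 1, ('cat', 'the'): 1}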
def cook_refs(refs, n=4):
"""Takes a list of reference sentences for a single segment
and returns an object that encapsulates everything that BLEU
needs to know about them."""
refs = [normalize(ref) for ref in refs]
maxcounts = {}
for ref in refs:
counts = count_ngrams(ref, n)
for (ngram, count) in counts.items():
maxcounts[ngram] = max(maxcounts.get(ngram, 0), count)
return ([len(ref) for ref in refs], maxcounts)
def cook_test(test, item, n=4):
"""Takes a test sentence and returns an object that
encapsulates everything that BLEU needs to know about it."""
(reflens, refmaxcounts) = item
test = normalize(test)
result = {}
result["testlen"] = len(test)
# Calculate effective reference sentence length.
if eff_ref_len == "shortest":
result["reflen"] = min(reflens)
elif eff_ref_len == "average":
result["reflen"] = float(sum(reflens)) / len(reflens)
elif eff_ref_len == "closest":
min_diff = None
for reflen in reflens:
if min_diff is None or abs(reflen - len(test)) < min_diff:
min_diff = abs(reflen - len(test))
result["reflen"] = reflen
result["guess"] = [max(len(test) - k + 1, 0) for k in range(1, n + 1)]
result["correct"] = [0] * n
counts = count_ngrams(test, n)
for (ngram, count) in counts.items():
result["correct"][len(ngram) - 1] += min(refmaxcounts.get(ngram, 0), count)
return result
def score_cooked(allcomps, n=4, ground=0, smooth=1):
totalcomps = {"testlen": 0, "reflen": 0, "guess": [0] * n, "correct": [0] * n}
for comps in allcomps:
for key in ["testlen", "reflen"]:
totalcomps[key] += comps[key]
for key in ["guess", "correct"]:
for k in range(n):
totalcomps[key][k] += comps[key][k]
logbleu = 0.0
all_bleus = []
for k in range(n):
correct = totalcomps["correct"][k]
guess = totalcomps["guess"][k]
addsmooth = 0
if smooth == 1 and k > 0:
addsmooth = 1
logbleu += math.log(correct + addsmooth + sys.float_info.min) - math.log(
guess + addsmooth + sys.float_info.min
)
if guess == 0:
all_bleus.append(-10000000)
else:
all_bleus.append(math.log(correct + sys.float_info.min) - math.log(guess))
logbleu /= float(n)
all_bleus.insert(0, logbleu)
brevPenalty = min(
0, 1 - float(totalcomps["reflen"] + 1) / (totalcomps["testlen"] + 1)
)
for i in range(len(all_bleus)):
if i == 0:
all_bleus[i] += brevPenalty
all_bleus[i] = math.exp(all_bleus[i])
return all_bleus
def bleu(refs, candidate, ground=0, smooth=1):
refs = cook_refs(refs)
test = cook_test(candidate, refs)
return score_cooked([test], ground=ground, smooth=smooth)
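# Illustrative check (not part of the original script): an exact match yields
# an overall score of 1.0 at index 0; indices 1..4 hold per-n-gram values.
def _example_sentence_bleu():
    refs = ["returns the sum of two numbers"]
    scores = bleu(refs, "returns the sum of two numbers")
    return scores[0]  # 1.0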
def splitPuncts(line):
return " ".join(re.findall(r"[\w]+|[^\s\w]", line))
def computeMaps(predictions, goldfile):
predictionMap = {}
goldMap = {}
gf = open(goldfile, "r")
for row in predictions:
cols = row.strip().split("\t")
if len(cols) == 1:
(rid, pred) = (cols[0], "")
else:
(rid, pred) = (cols[0], cols[1])
predictionMap[rid] = [splitPuncts(pred.strip().lower())]
for i, row in enumerate(gf):
if len(row.split("\t")) != 2:
print(row)
print(i)
(rid, pred) = row.split("\t")
if rid in predictionMap: # Only insert if the id exists for the method
if rid not in goldMap:
goldMap[rid] = []
goldMap[rid].append(splitPuncts(pred.strip().lower()))
sys.stderr.write("Total: " + str(len(goldMap)) + "\n")
return (goldMap, predictionMap)
# m1 is the reference map
# m2 is the prediction map
def bleuFromMaps(m1, m2):
score = [0] * 5
num = 0.0
for key in m1:
if key in m2:
bl = bleu(m1[key], m2[key][0])
score = [score[i] + bl[i] for i in range(0, len(bl))]
num += 1
return [s * 100.0 / num for s in score]
if __name__ == "__main__":
reference_file = sys.argv[1]
predictions = []
for row in sys.stdin:
predictions.append(row)
(goldMap, predictionMap) = computeMaps(predictions, reference_file)
print(bleuFromMaps(goldMap, predictionMap)[0])