Commit 5add46aa authored by hepj

Add Megatron project

parent deb8370c
Pipeline #2199 failed
@@ -4,3 +4,4 @@ LM-Evaluation-Harness*
 Bigcode-Evaluation-Harness*
 **/__pycache__
 .vscode
+Megatron-LM-240405/tests/functional_tests/test_results/jet/dgx_h100/gpt3_345m_mcore-pyt_merge-request_bf16_nodes-1_gpus-8_bs-32_steps-50_tp-1_pp-1_args--recompute-granularity-full-recompute-method-uniform-recompute-num-layers-1-_mcore-true_te-false.json
\ No newline at end of file
# How to contribute to BigCode?
Everyone is welcome to contribute, and we value everybody's contribution. Code
is not the only way to help the community: answering questions, helping
others, reaching out, and improving the documentation are all immensely valuable
to the community.
Whichever way you choose to contribute, please be mindful to respect our
[code of conduct](https://bigcode-project.org/docs/about/code_of_conduct/).
## You can contribute in so many ways!
There are 4 ways you can contribute to this repository:
* Fixing outstanding issues with the existing code;
* Implementing new models;
* Contributing to the examples or to the documentation;
* Submitting issues related to bugs or desired new features.
*All are equally valuable to the community.*
## License
Note that all contributions are licensed under Apache 2.0 by default. The
Technical Steering Committee (TSC) may approve the use of an alternative
license or licenses for inbound or outbound contributions on an exception basis.
To request an exception, please describe the contribution, the alternative
license, and the justification for using an alternative license for the
described contribution. License exceptions must be approved by the TSC.
Contributed files should contain license information indicating the open
source license or licenses pertaining to the file.
## Submitting a new issue or feature request
Do your best to follow these guidelines when submitting an issue or a feature
request. It will make it easier for us to come back to you quickly and with good
feedback.
### Did you find a bug?
First, we would really appreciate it if you could **make sure the bug was not
already reported** (use the search bar on Github under Issues).
Did not find it? :( So we can act quickly on it, please follow these steps:
* Include your **OS type and version**, and the versions of **Python**, **PyTorch** and
  **TensorFlow** when applicable;
* Include a short, self-contained code snippet that allows us to reproduce the bug in
  less than 30s;
* Provide the *full* traceback if an exception is raised.
### Do you want a new feature?
A world-class feature request addresses the following points:
1. Motivation first:
* Is it related to a problem/frustration with the current features? If so, please explain
why. Providing a code snippet that demonstrates the problem is best.
* Is it related to something you would need for a project? We'd love to hear
about it!
* Is it something you worked on and think could benefit the community?
Awesome! Tell us what problem it solved for you.
2. Write a *full paragraph* describing the feature;
3. Provide a **code snippet** that demonstrates its future use;
4. In case this is related to a paper, please attach a link;
5. Attach any additional information (drawings, screenshots, etc.) you think may help.
If your issue is well written, we're already 80% of the way there by the time you
post it.
## Start contributing! (Pull Requests)
Before writing code, we strongly advise you to search through the existing PRs or
issues to make sure that nobody is already working on the same thing. If you are
unsure, it is always a good idea to open an issue to get some feedback.
You will need basic `git` proficiency to be able to contribute to
BigCode. `git` is not the easiest tool to use but it has the greatest
manual. Type `git --help` in a shell and enjoy. If you prefer books, [Pro
Git](https://git-scm.com/book/en/v2) is a very good reference.
Follow these steps to start contributing:
1. Fork the repository by
clicking on the 'Fork' button on the repository's page. This creates a copy of the code
under your GitHub user account.
2. Clone your fork to your local disk, and add the base repository as a remote:
```bash
$ git clone git@github.com:<your Github handle>/<Repo name>.git
$ cd <Repo name>
$ git remote add upstream https://github.com/bigcode-project/<Repo name>.git
```
3. Create a new branch to hold your development changes:
```bash
$ git checkout -b a-descriptive-name-for-my-changes
```
**Do not** work on the `main` branch.
4. Set up a development environment by running the following command in a virtual environment:
```bash
$ pip install -r requirements.txt
```
5. Develop the features on your branch.
Once you're happy with your changes, add changed files using `git add` and
make a commit with `git commit` to record your changes locally:
```bash
$ git add modified_file.py
$ git commit
```
Please write [good commit
messages](https://chris.beams.io/posts/git-commit/).
It is a good idea to sync your copy of the code with the original
repository regularly. This way you can quickly account for changes:
```bash
$ git fetch upstream
$ git rebase upstream/main
```
Push the changes to your account using:
```bash
$ git push -u origin a-descriptive-name-for-my-changes
```
6. Once you are satisfied (**and the checklist below is happy too**), go to the
webpage of your fork on GitHub. Click on 'Pull request' to send your changes
to the project maintainers for review.
7. It's OK if maintainers ask you for changes. It happens to core contributors
   too! So that everyone can see the changes in the pull request, work in your local
   branch and push the changes to your fork. They will automatically appear in
   the pull request.
### Checklist
1. The title of your pull request should be a summary of its contribution;
2. If your pull request addresses an issue, please mention the issue number in
the pull request description to make sure they are linked (and people
consulting the issue know you are working on it);
3. To indicate a work in progress please prefix the title with `[WIP]`. These
are useful to avoid duplicated work, and to differentiate it from PRs ready
to be merged;
4. Make sure existing tests pass;
5. All public methods must have informative docstrings.
### Style guide
For documentation strings, BigCode follows the [google style](https://google.github.io/styleguide/pyguide.html).
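For instance, a function documented in this style might look like the following sketch (illustrative only, not code taken from the repository):
```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimates pass@k for a single problem.

    Args:
        n: Total number of generated samples.
        c: Number of samples that passed the unit tests.
        k: The k in pass@k.

    Returns:
        The unbiased pass@k estimate, a float in [0, 1].
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```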
**This guide was heavily inspired by the awesome [scikit-learn guide to contributing](https://github.com/scikit-learn/scikit-learn/blob/main/CONTRIBUTING.md).**
### Develop on Windows
On Windows, you need to configure git to convert Windows `CRLF` line endings to Linux `LF` line endings:
`git config core.autocrlf input`
One way to run the `make` command on Windows is to use MSYS2:
1. [Download MSYS2](https://www.msys2.org/); we assume it is installed in `C:\msys64`
2. Open the command line C:\msys64\msys2.exe (it should be available from the start menu)
3. Run in the shell: `pacman -Syu` and install make with `pacman -S make`
4. Add `C:\msys64\usr\bin` to your PATH environment variable.
You can now use `make` from any terminal (PowerShell, cmd.exe, etc.) 🎉
### Syncing forked main with upstream `main`
To avoid pinging the upstream repository, which adds reference notes to each upstream PR and sends unnecessary notifications to the developers involved in these PRs,
please follow these steps when syncing the main branch of a forked repository:
1. When possible, avoid syncing with the upstream using a branch and PR on the forked repository. Instead, merge directly into the forked main.
2. If a PR is absolutely necessary, use the following steps after checking out your branch:
```
$ git checkout -b your-branch-for-syncing
$ git pull --squash --no-commit upstream main
$ git commit -m '<your message without GitHub references>'
$ git push --set-upstream origin your-branch-for-syncing
```
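# Base execution image for the evaluation harness (Python-only code execution).
# Note: this appears to correspond to the `evaluation-harness` image described
# in the README's "Docker containers" section.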
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y python3 python3-pip
COPY . /app
WORKDIR /app
RUN test -f /app/generations.json && rm /app/generations.json || true
RUN pip3 install .
CMD ["python3", "main.py"]
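# Execution image with the extra language toolchains required by MultiPL-E.
# Note: this appears to correspond to the `evaluation-harness-multiple` image
# built from Dockerfile-multiple (see the README's Docker section).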
FROM ubuntu:22.04
RUN apt-get update -yqq && apt-get install -yqq curl build-essential python3-pip python3-tqdm
RUN apt-get install racket -yqq
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=Etc/UTC
RUN apt-get install -yqq \
default-jdk-headless \
golang-go \
php-cli \
ruby \
lua5.3 \
r-base \
rustc \
scala
RUN apt-get install -yqq libtest-deep-perl
RUN apt-get install -yqq wget
# JS/TS
RUN curl -fsSL https://deb.nodesource.com/setup_current.x | bash -
RUN apt-get install -y nodejs
RUN npm install -g typescript
# Dlang
RUN wget https://netcologne.dl.sourceforge.net/project/d-apt/files/d-apt.list -O /etc/apt/sources.list.d/d-apt.list
RUN apt-get update --allow-insecure-repositories
RUN apt-get -y --allow-unauthenticated install --reinstall d-apt-keyring
RUN apt-get update && apt-get install -yqq dmd-compiler dub
# C#
RUN apt-get install -yqq gnupg ca-certificates
RUN apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys 3FA7E0328081BFF6A14DA29AA6A19B38D3D831EF
RUN echo "deb https://download.mono-project.com/repo/ubuntu stable-focal main" | tee /etc/apt/sources.list.d/mono-official-stable.list
RUN apt-get update
RUN apt-get install -yqq mono-devel
# Post-processing
# Julia
RUN curl https://julialang-s3.julialang.org/bin/linux/x64/1.8/julia-1.8.2-linux-x86_64.tar.gz | tar xz
ENV PATH="/julia-1.8.2/bin:${PATH}"
# Swift
RUN curl https://download.swift.org/swift-5.7-release/ubuntu2204/swift-5.7-RELEASE/swift-5.7-RELEASE-ubuntu22.04.tar.gz | tar xz
ENV PATH="/swift-5.7-RELEASE-ubuntu22.04/usr/bin:${PATH}"
# Javatuples
RUN mkdir /usr/multiple && wget https://repo.mavenlibs.com/maven/org/javatuples/javatuples/1.2/javatuples-1.2.jar -O /usr/multiple/javatuples-1.2.jar
# Luaunit
RUN apt-get update -yqq && apt-get install -yqq lua-unit
# Standard requirements
COPY . /app
WORKDIR /app
RUN test -f /app/generations.json && rm /app/generations.json || true
RUN pip3 install .
CMD ["python3", "main.py"]
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "{}"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright {yyyy} {name of copyright owner}
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
<h1 align="center">Code Generation LM Evaluation Harness</h1>
<h4 align="center">
<p>
<a href="#features">Tasks</a> |
<a href="#setup">Usage</a> |
<a href="#implementing-new-tasks">Contribution</a> |
<a href="#documentation">Documentation</a> |
<a href="https://huggingface.co/bigcode">BigCode</a>
<p>
</h4>
<h3 align="center">
<img style="float: middle; padding: 10px 10px 10px 10px;" width="50" height="50" src="https://user-images.githubusercontent.com/44069155/191557209-6219acb8-a766-448c-9bd6-284d22b1e398.png" />
</h3>
## Features
This is a framework for the evaluation of code generation models. This work is inspired by [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) for evaluating language models in general. We welcome contributions to fix issues, enhance features and add new benchmarks. You can find contribution guides in [`docs/guide.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/guide.md) and [`CONTRIBUTING.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/CONTRIBUTING.md) and more documentation in [`docs/README.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/README.md).
Below are the features and tasks of this framework:
- Features:
  - Any autoregressive model available on the [Hugging Face hub](https://huggingface.co/) can be used, but we recommend using code generation models trained specifically on code, such as [SantaCoder](https://huggingface.co/bigcode/santacoder), [InCoder](https://huggingface.co/facebook/incoder-6B) and [CodeGen](https://huggingface.co/Salesforce/codegen-16B-mono).
  - We provide multi-GPU text generation with `accelerate` and Dockerfiles for running the evaluation inside Docker containers for security and reproducibility.
- Tasks:
- 7 code generation **Python** tasks (with unit tests): [HumanEval](https://huggingface.co/datasets/openai_humaneval), [HumanEval+](https://huggingface.co/datasets/evalplus/humanevalplus), [InstructHumanEval](https://huggingface.co/datasets/codeparrot/instructhumaneval), [APPS](https://huggingface.co/datasets/codeparrot/apps), [MBPP](https://huggingface.co/datasets/mbpp), [MBPP+](https://huggingface.co/datasets/evalplus/mbppplus), and [DS-1000](https://github.com/HKUNLP/DS-1000/) for both completion (left-to-right) and insertion (FIM) mode.
- [HumanEvalPack](https://huggingface.co/datasets/bigcode/humanevalpack) extends HumanEval to **3** scenarios across **6** languages via human translations and was released with [OctoPack](https://arxiv.org/abs/2308.07124).
- [MultiPL-E](https://github.com/nuprl/MultiPL-E) evaluation suite (HumanEval translated into **18** programming languages).
- [Recode](https://github.com/amazon-science/recode/tree/main) applied to the HumanEval benchmark. It evaluates the robustness of code-generation models.
  - [Pal](https://github.com/reasoning-machines/pal) Program-aided Language Models evaluation for grade school math problems: [GSM8K](https://huggingface.co/datasets/gsm8k) and [GSM-HARD](https://huggingface.co/datasets/reasoning-machines/gsm-hard). These problems are solved by generating reasoning chains of text and code.
- Code to text task from [CodeXGLUE](https://huggingface.co/datasets/code_x_glue_ct_code_to_text) (zero-shot & fine-tuning) for 6 languages: **Python, Go, Ruby, Java, JavaScript and PHP.** Documentation translation task from [CodeXGLUE](https://huggingface.co/datasets/code_x_glue_tt_text_to_text).
- [CoNaLa](https://huggingface.co/datasets/neulab/conala) for **Python** code generation (2-shot setting and evaluation with BLEU score).
- [Concode](https://huggingface.co/datasets/code_x_glue_tc_text_to_code) for **Java** code generation (2-shot setting and evaluation with BLEU score).
- 3 multilingual downstream classification tasks: [Java Complexity prediction](https://huggingface.co/datasets/codeparrot/codecomplex), [Java code equivalence prediction](https://huggingface.co/datasets/code_x_glue_cc_clone_detection_big_clone_bench), [C code defect prediction](https://huggingface.co/datasets/code_x_glue_cc_defect_detection).
- [SantaCoder-FIM](https://huggingface.co/datasets/bigcode/santacoder-fim-task) for evaluating FIM on **Python** code using Exact Match. Further details are described in [SantaCoder](https://arxiv.org/abs/2301.03988). Includes two tasks:
- `StarCoderFIM`: which uses the default FIM tokens `"<fim_prefix>", "<fim_middle>", "<fim_suffix>"`, and
- `SantaCoderFIM`: which uses SantaCoder FIM tokens `"<fim-prefix>", "<fim-middle>", "<fim-suffix>"`
More details about each task can be found in the documentation in [`docs/README.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/README.md).
## Setup
```bash
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
```
Install [`torch`](https://pytorch.org/get-started/locally/) based on your device type, and install the other packages using:
```
pip install -e .
```
To run the `DS-1000` benchmark, additional constraints must be resolved.
```
# python version must be 3.7.10
pip install -e ".[ds1000]" # installs all additional dependencies except PyTorch
# torch==1.12.1 required. Download version with relevant GPU support etc., e.g.,
pip install torch==1.12.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
# to suppress any tensorflow optimization warnings,
# precede call to "accelerate launch" with "TF_CPP_MIN_LOG_LEVEL=3"
# on some systems, tensorflow will attempt to allocate all GPU memory
# to its process at import which will raise a CUDA out-of-memory error
# setting "export TF_FORCE_GPU_ALLOW_GROWTH=true" resolves this
```
Also make sure you have `git-lfs` installed and are logged in to the Hub
```
huggingface-cli login
```
We use [`accelerate`](https://huggingface.co/docs/accelerate/index) to generate code/text in parallel when multiple GPUs are present (multi-GPU mode). You can configure it using:
```bash
accelerate config
```
This evaluation harness can also be used in an evaluation-only mode with a multi-CPU setting. For large models, we recommend specifying the precision of the model using the `--precision` flag instead of accelerate config, so that only one copy of the model is kept in memory. You can also load models in 8-bit with the flag `--load_in_8bit` or in 4-bit with `--load_in_4bit`, provided `bitsandbytes` is installed together with the required transformers and accelerate versions.
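For instance, a run that keeps a single copy of the model in memory might look like the following sketch (illustrative; it assumes `bf16` is among the accepted `--precision` values on your install, and `--load_in_8bit` could be used instead if `bitsandbytes` is available):
```bash
accelerate launch main.py \
  --model <MODEL_NAME> \
  --tasks <TASK_NAME> \
  --precision bf16 \
  --allow_code_execution
```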
The evaluation part (execution of the solutions) for [MultiPL-E](https://github.com/nuprl/MultiPL-E) requires extra dependencies for some programming languages; we provide a Dockerfile with all dependencies, see the [Docker](#docker-containers) section for more details.
## Usage
You can use this evaluation harness to generate text solutions to code benchmarks with your model, to evaluate (and execute) the solutions or to do both. While it is better to use GPUs for the generation, the evaluation only requires CPUs. So it might be beneficial to separate these two steps. By default both generation and evaluation are performed.
For more details on how to evaluate on the tasks, please refer to the documentation in [`docs/README.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/README.md).
### Generation and evaluation
Below is an example to generate and evaluate on a task.
```bash
accelerate launch main.py \
--model <MODEL_NAME> \
--tasks <TASK_NAME> \
--limit <NUMBER_PROBLEMS> \
--max_length_generation <MAX_LENGTH> \
--temperature <TEMPERATURE> \
--do_sample True \
--n_samples 100 \
--batch_size 10 \
--precision <PRECISION> \
--allow_code_execution \
--save_generations
```
* `limit` represents the number of problems to solve; if it is not provided, all problems in the benchmark are selected.
* `allow_code_execution` is required for executing the generated code: it is off by default; read the displayed warning before passing this flag to enable execution.
* Some models with custom code on the HF hub like [SantaCoder](https://huggingface.co/bigcode/santacoder) require passing `--trust_remote_code`; for private models, add `--use_auth_token`.
* `save_generations` saves the post-processed generations in a json file at `save_generations_path` (by default `generations.json`). You can also save references by passing `--save_references`.
* `max_length_generation` is the maximum token length of generation, including the input token length. The default is 512, but for some tasks like GSM8K and GSM-Hard, the complete prompt with 8-shot examples (as used in [PAL](https://github.com/reasoning-machines/pal)) takes up `~1500` tokens, so the value should be greater than that; the recommended `max_length_generation` for these tasks is `2048`.
* Some tasks don't require code execution, such as
`codexglue_code_to_text-<LANGUAGE>`/`codexglue_code_to_text-python-left`/`conala`/`concode`, which use BLEU evaluation. In addition, we generate one candidate solution for each problem in these tasks, so use `n_samples=1` and `batch_size=1` (note that `batch_size` should always be less than or equal to `n_samples`); see the example command after this list.
* For APPS tasks, you can use `n_samples=1` for strict and average accuracies (from the original APPS paper) and `n_samples>1` for pass@k.
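For example, a BLEU-evaluated task without code execution could be run with a command along these lines (illustrative; the model name is a placeholder):
```bash
accelerate launch main.py \
  --model <MODEL_NAME> \
  --tasks codexglue_code_to_text-python-left \
  --n_samples 1 \
  --batch_size 1 \
  --save_generations
```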
### Generation only
If you want to generate solutions without executing and evaluating the code, pass `--generation_only` in addition to the instructions above. This will save the solutions in a json file at `save_generations_path` in the working directory.
This can be useful if you don't want to execute code on the machine you're using for generation, for security or efficiency reasons. For instance, you can run the generation on multiple GPUs, then switch to a multi-worker CPU machine or a Docker container for the execution.
### Evaluation only
If you already have the generations in a json file from this evaluation harness and want to evaluate them, specify the path of the generations via the `load_generations_path` argument. You may need to reconfigure `accelerate` to use multiple CPUs.
Below is an example; be mindful to specify arguments appropriate to the task you are evaluating on, and note that the `model` value here only serves to document the experiment. Also add `--n_samples` to specify the number of samples to evaluate per problem (usually the same value used during generation).
```bash
accelerate launch main.py --tasks mbpp --allow_code_execution --load_generations_path generations.json --model incoder-temperature-08
```
## Docker containers
For safety, we provide Dockerfiles to run the execution inside a Docker container. To do so, first run the generation on your machine and save it in `generations.json`, for example by adding the flag `--generation_only` to the command. Then use the Docker image that we provide:
```bash
$ docker pull ghcr.io/bigcode-project/evaluation-harness
$ docker tag ghcr.io/bigcode-project/evaluation-harness evaluation-harness
```
If you want to evaluate on MultiPL-E, we have a different Dockerfile since it requires more dependencies, use:
```bash
$ docker pull ghcr.io/bigcode-project/evaluation-harness-multiple
$ docker tag ghcr.io/bigcode-project/evaluation-harness-multiple evaluation-harness-multiple
```
### Building Docker images
If you modify the evaluation harness, you may want to rebuild the docker images.
Here's how to build a docker image for the evaluation harness:
```bash
$ sudo make DOCKERFILE=Dockerfile all
```
This creates an image called `evaluation-harness` and runs a test on it. To skip the test, remove `all` from the command.
For MultiPL-E:
```bash
$ sudo make DOCKERFILE=Dockerfile-multiple all
```
This creates an image called `evaluation-harness-multiple`.
### Evaluating inside a container
Suppose you generated text with the `bigcode/santacoder` model and saved it in `generations_py.json` with:
```bash
accelerate launch main.py \
--model bigcode/santacoder \
--tasks multiple-py \
--max_length_generation 650 \
--temperature 0.8 \
--do_sample True \
--n_samples 200 \
--batch_size 200 \
--trust_remote_code \
--generation_only \
--save_generations \
--save_generations_path generations_py.json
```
To run the container (here built from the image `evaluation-harness-multiple`) and evaluate `generations_py.json` (or another file, mounted with `-v`), specify `n_samples`, allow code execution with `--allow_code_execution`, and add the number of problems with `--limit` if it was used during generation:
```bash
$ sudo docker run -v $(pwd)/generations_py.json:/app/generations_py.json:ro -it evaluation-harness-multiple python3 main.py \
--model bigcode/santacoder \
--tasks multiple-py \
--load_generations_path /app/generations_py.json \
--allow_code_execution \
--temperature 0.8 \
--n_samples 200
```
## Implementing new tasks
To implement a new task in this evaluation harness, see the guide in [`docs/guide`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/guide.md). There are also contribution guidelines in [`CONTRIBUTING.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/CONTRIBUTING.md).
## Documentation
We provide documentation for the existing benchmarks and how to run the evaluation in [`docs/README.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/README.md).
## Remarks
* Currently, we use data-parallel evaluation across multiple GPUs using `accelerate`; this assumes that the model fits on a single GPU.
## Acknowledgements
We thank EleutherAI for their work on the [lm-evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness) from which this repository is inspired.
## Cite as
```
@misc{bigcode-evaluation-harness,
author = {Ben Allal, Loubna and
Muennighoff, Niklas and
Kumar Umapathi, Logesh and
Lipkin, Ben and
von Werra, Leandro},
title = {A framework for the evaluation of code generation models},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/bigcode-project/bigcode-evaluation-harness}},
year = 2022,
}
```
from dataclasses import dataclass, field
from typing import Optional
@dataclass
class EvalArguments:
"""
Configuration for running the evaluation.
"""
prefix: Optional[str] = field(
default="",
metadata={
"help": "Prefix to add to the prompt. For example InCoder needs prefix='<| file ext=.py |>\n'"
},
)
do_sample: Optional[bool] = field(
default=True,
metadata={"help": "Sample from the language model's output distribution."},
)
temperature: Optional[float] = field(
default=0.2, metadata={"help": "Sampling temperature used for generation."}
)
top_k: Optional[int] = field(
default=0, metadata={"help": "Top-k parameter used for generation."}
)
top_p: Optional[float] = field(
default=0.95, metadata={"help": "Top-p parameter used for nucleus sampling."}
)
n_samples: Optional[int] = field(
default=1,
metadata={"help": "Number of completions to generate for each sample."},
)
eos: Optional[str] = field(
default="<|endoftext|>", metadata={"help": "end of sentence token."}
)
seed: Optional[int] = field(
default=0, metadata={"help": "Random seed used for evaluation."}
)
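if __name__ == "__main__":
    # Illustrative sketch (not part of the original file): EvalArguments is a
    # plain dataclass, so it can be parsed from the command line with
    # transformers.HfArgumentParser, as typical Hugging Face entry points do.
    from transformers import HfArgumentParser

    parser = HfArgumentParser(EvalArguments)
    (eval_args,) = parser.parse_args_into_dataclasses()
    print(f"temperature={eval_args.temperature}, n_samples={eval_args.n_samples}")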
from abc import ABC, abstractmethod
from warnings import warn
from datasets import load_dataset
class Task(ABC):
"""A task represents an entire benchmark including its dataset, problems,
answers, generation settings and evaluation methods.
"""
# The name of the `Task` benchmark as denoted in the HuggingFace datasets Hub
DATASET_PATH: str = None
# The name of a subset within `DATASET_PATH`.
DATASET_NAME: str = None
def __init__(self, stop_words=None, requires_execution=True):
"""
:param stop_words: list
list of stop words if the generation uses a stopping criteria during generation
:param requires_execution: bool
            whether the task requires code execution during evaluation or not
"""
self.stop_words = stop_words
self.requires_execution = requires_execution
try:
dataset_kwargs = {}
if "humaneval" in self.DATASET_PATH:
dataset_kwargs['data_files'] = {
'test': "/workspace/openai_humaneval/0.0.0/7dce6050a7d6d172f3cc5c32aa97f52fa1a2e544/openai_humaneval-test.arrow"
}
elif "mbpp" in self.DATASET_PATH:
dataset_kwargs['data_files'] = {
'train': "/workspace/mbpp/full/0.0.0/4bb6404fdc6cacfda99d4ac4205087b89d32030c/mbpp-train.arrow",
'test': "/workspace/mbpp/full/0.0.0/4bb6404fdc6cacfda99d4ac4205087b89d32030c/mbpp-test.arrow",
'validation': "/workspace/mbpp/full/0.0.0/4bb6404fdc6cacfda99d4ac4205087b89d32030c/mbpp-validation.arrow"
}
self.dataset = load_dataset("arrow", **dataset_kwargs if dataset_kwargs is not None else {})
except Exception as e:
warn(
f"Loading the dataset failed with {str(e)}. This task will use a locally downloaded dataset, not from the HF hub. \
This is expected behavior for the DS-1000 benchmark but not for other benchmarks!"
)
@abstractmethod
def get_dataset(self):
"""Returns dataset for the task or an iterable of any object, that get_prompt can handle"""
return []
def fewshot_examples(self):
"""Loads and returns the few-shot examples for the task if they exist."""
pass
@abstractmethod
def get_prompt(self, doc):
"""Builds the prompt for the LM to generate from.
:param doc: dict[str: str]
sample from the test dataset
"""
pass
@abstractmethod
def get_reference(self, doc):
"""Builds the reference solution for the doc.
:param doc: dict[str: str]
sample from the test dataset
"""
pass
@abstractmethod
def postprocess_generation(self, generation, idx):
"""Defines the postprocessing for a LM generation.
:param generation: str
code generation from LM
:param idx: int
index of doc in the dataset to which the generation belongs
"""
pass
@abstractmethod
def process_results(self, generations, references):
"""Takes the list of LM generations and evaluates them against ground truth references,
returning the metric for the generations as in {"metric_name": result}.
:param generations: list(list(str))
list of lists containing generations
:param references: list(str)
            list of str containing references
:return: dict[str: float]
"""
pass
@staticmethod
def _stop_at_stop_token(decoded_string, stop_tokens):
"""
Produces the prefix of decoded_string that ends at the first occurrence of
a stop_token.
WARNING: the decoded_string *must not* include the prompt, which may have stop tokens
itself.
"""
min_stop_index = len(decoded_string)
for stop_token in stop_tokens:
stop_index = decoded_string.find(stop_token)
if stop_index != -1 and stop_index < min_stop_index:
min_stop_index = stop_index
return decoded_string[:min_stop_index]
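# Illustrative sketch (not a task from this repository): a minimal Task subclass
# showing which abstract methods a new benchmark must implement. The dataset id
# and field names below are placeholders.
class ExampleExactMatchTask(Task):
    DATASET_PATH = "some-org/some-dataset"  # placeholder dataset id
    DATASET_NAME = None

    def __init__(self):
        super().__init__(stop_words=["\n\n"], requires_execution=False)

    def get_dataset(self):
        return self.dataset["test"]

    def get_prompt(self, doc):
        return doc["prompt"]

    def get_reference(self, doc):
        return doc["reference"]

    def postprocess_generation(self, generation, idx):
        return self._stop_at_stop_token(generation, self.stop_words)

    def process_results(self, generations, references):
        # exact-match rate between each first generation and its reference
        correct = sum(
            bool(gens) and gens[0].strip() == ref.strip()
            for gens, ref in zip(generations, references)
        )
        return {"exact_match": correct / max(len(references), 1)}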
import inspect
import json
import os
import warnings
from typing import List
from bigcode_eval import tasks
from bigcode_eval.generation import parallel_generations
_WARNING = """
################################################################################
!!!WARNING!!!
################################################################################
The "code_eval"/"apps_metric" you are about to use executes untrusted
model-generated code in Python.
Although it is highly unlikely that model-generated code will do something
overtly malicious in response to this test suite, model-generated code may act
destructively due to a lack of model capability or alignment.
Users are strongly encouraged to sandbox this evaluation suite so that it
does not perform destructive actions on their host or network. For more
information on how OpenAI sandboxes its code, see the paper "Evaluating Large
Language Models Trained on Code" (https://arxiv.org/abs/2107.03374).
Once you have read this disclaimer and taken appropriate precautions, set the argument
"allow_code_execution" to True.
################################################################################\
"""
class Evaluator:
def __init__(self, accelerator, model, tokenizer, args):
self.accelerator = accelerator
self.model = model
self.tokenizer = tokenizer
self.args = args
# setup arguments
self.metric_output_path = args.metric_output_path
# code evaluation permission
self.allow_code_execution = args.allow_code_execution
def generate_text(self, task_name, intermediate_generations=None):
task = tasks.get_task(task_name, self.args)
dataset = task.get_dataset()
# if args.limit is None, use all samples
# if args.limit is used, make sure args.limit_start + args.limit <= len(dataset)
n_tasks = min(self.args.limit, len(dataset) - self.args.limit_start) if self.args.limit else len(dataset)
# when args.limit is None
# adjust n_tasks by args.limit_start to prevent out of bounds issues
if not self.args.limit:
n_tasks -= self.args.limit_start
references = [task.get_reference(dataset[i]) for i in range(self.args.limit_start, self.args.limit_start+n_tasks)]
if self.args.check_references:
if "get_solution" in inspect.signature(task.get_reference).parameters:
solutions = [[task.get_reference(dataset[i], get_solution=True)] for i in range(self.args.limit_start, self.args.limit_start+n_tasks)]
else:
solutions = [[ref] for ref in references]
return solutions, references
curr_generations = [] # list[list[str | None] | None]
if intermediate_generations:
curr_generations = [gen for gen in intermediate_generations if gen]
n_tasks -= len(curr_generations)
intermediate_save_generations_path = f"{os.path.splitext(self.args.save_generations_path)[0]}_{task_name}_intermediate.json"
curr_sample_idx = len(curr_generations)
generations = parallel_generations(
task,
dataset,
self.accelerator,
self.model,
self.tokenizer,
n_tasks=n_tasks,
args=self.args,
            curr_sample_idx=curr_sample_idx,  # curr_sample_idx will be added to limit_start to fix indexing
save_every_k_tasks=self.args.save_every_k_tasks,
intermediate_generations=curr_generations,
intermediate_save_generations_path=intermediate_save_generations_path,
)
if len(generations[0]) > self.args.n_samples:
generations = [l[: self.args.n_samples] for l in generations]
warnings.warn(
                f"Number of tasks wasn't proportional to the number of devices; we removed extra predictions to only keep n_samples={self.args.n_samples}"
)
return generations, references
def evaluate(self, task_name, intermediate_generations=None):
task = tasks.get_task(task_name, self.args)
if task.requires_execution and not self.allow_code_execution:
raise ValueError(_WARNING)
generations, references = self.generate_text(task_name, intermediate_generations=intermediate_generations)
if self.accelerator.is_main_process:
if not self.args.load_generations_path:
save_generations_path = f"{os.path.splitext(self.args.save_generations_path)[0]}_{task_name}.json"
self.save_json_files(generations, references, save_generations_path, f"references_{task_name}.json")
# make sure tokenizer plays nice with multiprocessing
os.environ["TOKENIZERS_PARALLELISM"] = "false"
if self.allow_code_execution and task.requires_execution:
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
print("Evaluating generations...")
results = task.process_results(generations, references)
return results
def save_json_files(
self,
generations: List[str],
references: List[str],
save_generations_path: str,
save_references_path: str,
) -> None:
if self.args.save_generations:
with open(save_generations_path, "w") as fp:
json.dump(generations, fp)
print(f"generations were saved at {save_generations_path}")
if self.args.save_references:
with open(save_references_path, "w") as fp:
json.dump(references, fp)
print(f"references were saved at {save_references_path}")
import json
from math import ceil
from typing import List, Optional
from accelerate.utils import set_seed
from torch.utils.data.dataloader import DataLoader
from transformers import StoppingCriteria, StoppingCriteriaList
from bigcode_eval.utils import TokenizedDataset, complete_code
class EndOfFunctionCriteria(StoppingCriteria):
"""Custom `StoppingCriteria` which checks if all generated functions in the batch are completed."""
def __init__(self, start_length, eof_strings, tokenizer, check_fn=None):
self.start_length = start_length
self.eof_strings = eof_strings
self.tokenizer = tokenizer
if check_fn is None:
check_fn = lambda decoded_generation: any(
[stop_string in decoded_generation for stop_string in self.eof_strings]
)
self.check_fn = check_fn
def __call__(self, input_ids, scores, **kwargs):
"""Returns true if all generated sequences contain any of the end-of-function strings."""
decoded_generations = self.tokenizer.batch_decode(input_ids[:, self.start_length :])
return all([self.check_fn(decoded_generation) for decoded_generation in decoded_generations])
class TooLongFunctionCriteria(StoppingCriteria):
"""Custom `StoppingCriteria` which checks if the generated function is too long by a certain multiplier based on input length."""
def __init__(self, input_length, multiplier):
self.input_length = input_length
self.multiplier = multiplier
def __call__(self, input_ids, scores, **kwargs):
"""Returns true if generated sequence is too long."""
return input_ids.shape[1] > int(self.input_length * self.multiplier)
def parallel_generations(
task,
dataset,
accelerator,
model,
tokenizer,
n_tasks,
args,
curr_sample_idx: int = 0,
save_every_k_tasks: int = -1,
intermediate_generations: Optional[List[Optional[List[Optional[str]]]]] = None,
intermediate_save_generations_path: Optional[str] = None,
):
if args.load_generations_path:
# load generated code
with open(args.load_generations_path) as fp:
generations = json.load(fp)
if accelerator.is_main_process:
print(
f"generations loaded, {n_tasks} selected from {len(generations)} with {len(generations[0])} candidates"
)
return generations[:n_tasks]
set_seed(args.seed, device_specific=True)
# Setup generation settings
gen_kwargs = {
"do_sample": args.do_sample,
"temperature": args.temperature,
"top_p": args.top_p,
"top_k": args.top_k,
"max_length": args.max_length_generation,
}
stopping_criteria = []
# The input_length / start_length set to 0 for now will be adjusted later
# Check if the task has a custom check_fn method for the stopping criteria
if task.stop_words and tokenizer.eos_token:
task.stop_words.append(tokenizer.eos_token)
if hasattr(task, "check_fn"):
stopping_criteria.append(
EndOfFunctionCriteria(0, task.stop_words, tokenizer, task.check_fn)
)
elif task.stop_words:
stopping_criteria.append(
EndOfFunctionCriteria(0, task.stop_words, tokenizer)
)
if hasattr(task, "max_length_multiplier") and task.max_length_multiplier:
stopping_criteria.append(
TooLongFunctionCriteria(0, task.max_length_multiplier)
)
if stopping_criteria:
gen_kwargs["stopping_criteria"] = StoppingCriteriaList(stopping_criteria)
if args.instruction_tokens:
instruction_tokens = args.instruction_tokens.split(",")
if len(instruction_tokens) != 3:
raise ValueError(
"Instruction tokens should contain exactly 3 tokens separated by a comma. If a token is empty, represent it as ''"
)
for token in instruction_tokens:
if token.strip() != "":
task.stop_words.append(token)
else:
instruction_tokens = None
if accelerator.is_main_process:
print(f"number of problems for this task is {n_tasks}")
n_copies = ceil(args.n_samples / args.batch_size)
ds_tokenized = TokenizedDataset(
task,
dataset,
tokenizer,
num_devices=accelerator.state.num_processes,
max_length=args.max_length_generation,
limit_start=args.limit_start + curr_sample_idx,
n_tasks=n_tasks,
n_copies=n_copies,
prefix=args.prefix,
has_encoder=args.modeltype == "seq2seq",
instruction_tokens=instruction_tokens,
)
    # note: args.batch_size is actually used as num_return_sequences, not the dataloader batch size
ds_loader = DataLoader(ds_tokenized, batch_size=1)
is_loaded_in_8bit = getattr(model, "is_loaded_in_8bit", False)
is_loaded_in_4bit = getattr(model, "is_loaded_in_4bit", False)
if args.max_memory_per_gpu is not None:
# The model is already sharded across multiple GPUs
ds_loader = accelerator.prepare(ds_loader)
elif not is_loaded_in_8bit and not is_loaded_in_4bit:
# we only wrap data loader to avoid extra memory occupation
model = model.to(accelerator.device)
ds_loader = accelerator.prepare(ds_loader)
else:
# model.to() is not supported for 8bit and 4bit models
model, ds_loader = accelerator.prepare(model, ds_loader)
generations = complete_code(
task,
accelerator,
model,
tokenizer,
ds_loader,
n_tasks=n_tasks,
limit_start=args.limit_start + curr_sample_idx,
batch_size=args.batch_size,
prefix=args.prefix,
instruction_tokens=instruction_tokens,
postprocess=args.postprocess,
is_wrapped=is_loaded_in_8bit or is_loaded_in_4bit,
save_every_k_tasks=save_every_k_tasks,
intermediate_generations=intermediate_generations,
intermediate_save_generations_path=intermediate_save_generations_path,
**gen_kwargs,
)
return generations
import inspect
from pprint import pprint
from . import (apps, codexglue_code_to_text, codexglue_text_to_text, conala,
concode, ds1000, gsm, humaneval, humanevalplus, humanevalpack,
instruct_humaneval, instruct_wizard_humaneval, mbpp, mbppplus,
multiple, parity, python_bugs, quixbugs, recode, santacoder_fim)
TASK_REGISTRY = {
**apps.create_all_tasks(),
**codexglue_code_to_text.create_all_tasks(),
**codexglue_text_to_text.create_all_tasks(),
**multiple.create_all_tasks(),
"codexglue_code_to_text-python-left": codexglue_code_to_text.LeftCodeToText,
"conala": conala.Conala,
"concode": concode.Concode,
**ds1000.create_all_tasks(),
**humaneval.create_all_tasks(),
**humanevalplus.create_all_tasks(),
**humanevalpack.create_all_tasks(),
"mbpp": mbpp.MBPP,
"mbppplus": mbppplus.MBPPPlus,
"parity": parity.Parity,
"python_bugs": python_bugs.PythonBugs,
"quixbugs": quixbugs.QuixBugs,
"instruct_wizard_humaneval": instruct_wizard_humaneval.HumanEvalWizardCoder,
**gsm.create_all_tasks(),
**instruct_humaneval.create_all_tasks(),
**recode.create_all_tasks(),
**santacoder_fim.create_all_tasks(),
}
ALL_TASKS = sorted(list(TASK_REGISTRY))
def get_task(task_name, args=None):
try:
kwargs = {}
if "prompt" in inspect.signature(TASK_REGISTRY[task_name]).parameters:
kwargs["prompt"] = args.prompt
if "load_data_path" in inspect.signature(TASK_REGISTRY[task_name]).parameters:
kwargs["load_data_path"] = args.load_data_path
return TASK_REGISTRY[task_name](**kwargs)
except KeyError:
print("Available tasks:")
pprint(TASK_REGISTRY)
raise KeyError(f"Missing task {task_name}")
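if __name__ == "__main__":
    # Illustrative sketch (assumption: the "mbpp" task takes no extra constructor
    # arguments, so no args namespace is needed). Run with
    # `python -m bigcode_eval.tasks` from the repository root.
    example_task = get_task("mbpp")
    print(type(example_task).__name__, "requires_execution:", example_task.requires_execution)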
"""Measuring Coding Challenge Competence With APPS
https://arxiv.org/abs/2105.09938
APPS is a benchmark for code generation with 10,000 problems across three difficulty levels: introductory, interview and competition.
It can be used to evaluate the ability of language models to generate code from natural language specifications.
Homepage: https://github.com/hendrycks/apps
"""
import json
from evaluate import load
from bigcode_eval.base import Task
_CITATION = """
@article{hendrycksapps2021,
title={Measuring Coding Challenge Competence With APPS},
author={Dan Hendrycks and Steven Basart and Saurav Kadavath and Mantas Mazeika and Akul Arora and Ethan Guo and Collin Burns and Samir Puranik and Horace He and Dawn Song and Jacob Steinhardt},
journal={NeurIPS},
year={2021}
}
"""
LEVELS = ["introductory", "interview", "competition"]
def create_all_tasks():
"""Creates a dictionary of tasks from a list of levels
:return: {task_name: task}
        e.g. {apps-interview: Task, apps-competition: Task}
"""
return {f"apps-{level}": create_task(level) for level in LEVELS}
def create_task(level):
class APPS(GeneralAPPS):
def __init__(self, **kwargs):
super().__init__(level, **kwargs)
return APPS
class GeneralAPPS(Task):
"""A task represents an entire benchmark including its dataset, problems,
answers, generation settings and evaluation methods.
"""
DATASET_PATH = "codeparrot/apps"
DATASET_NAME = None
def __init__(self, level, k_list=[1, 10, 100]):
self.DATASET_NAME = level
super().__init__(
stop_words=["\nQUESTION", "\n---", "\nANSWER"],
requires_execution=True,
)
self.k_list = k_list
def get_dataset(self):
"""Returns dataset for the task or an iterable of any object, that get_prompt can handle"""
return self.dataset["test"]
def get_prompt(self, doc):
"""Generate prompts for APPS
Finetuning setup: prompt=question with some starter code and function name if they exist.
We also specify the type of the prompt, i.e. whether it is call-based or standard input-based.
"""
starter_code = None if len(doc["starter_code"]) == 0 else doc["starter_code"]
try:
            input_output = json.loads(doc["input_output"])
            fn_name = (
                None if not input_output.get("fn_name") else input_output["fn_name"]
            )
except ValueError:
fn_name = None
prompt = "\nQUESTION:\n"
prompt += doc["question"]
if starter_code:
prompt += starter_code
if not fn_name:
call_format = "\nUse Standard Input format"
prompt += call_format
else:
call_format = "\nUse Call-Based format"
prompt += call_format
prompt += "\nANSWER:\n"
return prompt
def get_reference(self, doc):
"""Builds the reference solution for the doc (sample from the test dataset)."""
return None
def postprocess_generation(self, generation, idx):
"""Defines the postprocessing for a LM generation.
:param generation: str
code generation from LM
:param idx: int
index of doc in the dataset to which the generation belongs
(not used for APPS)
"""
try:
generation = generation.split("\nANSWER:", 1)[1]
except IndexError:
# happens when prompts were very long and got truncated
pass
return generation
def process_results(self, generations, references):
"""Takes the list of LM generations and evaluates them against ground truth references,
returning the metric for the generations.
:param generations: list(list(str))
list of lists containing generations
:param references: list(str)
            list of str containing references (not needed for APPS Task)
"""
        code_metric = load("codeparrot/apps_metric")
        results = code_metric.compute(
            predictions=generations, k_list=self.k_list, level=self.DATASET_NAME
        )
return results
"""CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
https://arxiv.org/abs/2102.04664
Code to text task from CodeXGlue (documentation generation):
* for all subsets ("python", "java", "javascript", "ruby", "php", "go") where the whole function body (without docstring) is given as a prompt
* for Python subset where only function signature is used as a prompt (this setting can give better results).
"""
import os
import re
import typing
from bigcode_eval.base import Task
_CITATION = """
@article{husain2019codesearchnet,
title={Codesearchnet challenge: Evaluating the state of semantic code search},
author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
journal={arXiv preprint arXiv:1909.09436},
year={2019}
}
"""
LANGUAGES = ["python", "java", "javascript", "ruby", "php", "go"]
TRIPLE_QUOTE = '"""'
SINGLE_TRIPLE_QUOTE = "'''"
SPACES4 = " " * 4
SUFFIX_PROMPT = {
"python": '\n""" The goal of this function is to:\n',
"ruby": "\n=begin The goal of this function is to:\n",
"other": "\n/* The goal of this function is to:\n",
}
def create_all_tasks():
"""Creates a dictionary of tasks from a list of languages
:return: {task_name: task}
e.g. {codexglue_code_to_text-python: Task, codexglue_code_to_text-java: Task}
"""
return {
f"codexglue_code_to_text-{language}": create_task(language)
for language in LANGUAGES
}
def create_task(language):
class CodeToText(GeneralCodeToText):
def __init__(self, **kwargs):
super().__init__(language, **kwargs)
return CodeToText
def compute_codexglue_code_to_text_bleu(
gold_and_predicted_items: typing.List[typing.Tuple[str, str]]
):
"""
Compute BLEU scores using codexglue_code_to_text_bleu.computeMaps (codexglue_summarization_evaluator)
This uses a specific BLEU tokenization and preprocessing necessary for this task by
the original authors of the dataset.
Taken from: https://github.com/dpfried/lm-evaluation-harness/blob/5d9a6aaaaa929bcad95bb73d85e78fe75eb64b4e/lm_eval/tasks/codexglue_summarization.py#L102
"""
from bigcode_eval.tasks.custom_metrics import codexglue_code_to_text_bleu
predicted_map = {}
gold_map = {}
for ix, (gold_str, predicted_str) in enumerate(gold_and_predicted_items):
gold, *rest = gold_str.strip().split("\t")
if len(rest) > 0:
print(f"warning: gold instance {ix} contains a tab; ignoring text after")
gold_map[ix] = [codexglue_code_to_text_bleu.splitPuncts(gold.strip().lower())]
pred, *rest = predicted_str.strip().split("\t")
if len(rest) > 0:
            print(f"warning: predicted instance {ix} contains a tab; ignoring text after")
predicted_map[ix] = [
codexglue_code_to_text_bleu.splitPuncts(pred.strip().lower())
]
return codexglue_code_to_text_bleu.bleuFromMaps(gold_map, predicted_map)[0]
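# Illustrative usage note: each item passed to compute_codexglue_code_to_text_bleu
# is a (gold, predicted) docstring pair, and the return value is the BLEU score
# computed with the CodeXGLUE tokenization and preprocessing, e.g.:
#   compute_codexglue_code_to_text_bleu(
#       [("Returns the sum of two numbers.", "Return the sum of two numbers.")]
#   )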
class GeneralCodeToText(Task):
"""Code to text task from CodeXGlue for all subsets where the whole
function body (without docstring) is given as a prompt
"""
DATASET_PATH = "code_x_glue_ct_code_to_text"
DATASET_NAME = None
def __init__(self, language):
self.DATASET_NAME = language
stop_words = ["'''", '"""'] if language == "python" else ["\n"]
super().__init__(
stop_words=stop_words,
requires_execution=False,
)
def get_dataset(self):
"""Returns dataset for the task or an iterable of any object, that get_prompt can handle"""
return self.dataset["test"]
@staticmethod
def standardize_docstring_prompt(prefix):
"""Strips any existing docstring delimiters from the prompt prefix
and adds our own delimiter (triple quote) and whitespace.
Note an edge case being handled here:
- codexglue docstring text sometimes contains the docstring delimiters, inconsistently
source: InCoder evaluation code https://github.com/dpfried/lm-evaluation-harness/
"""
for delim in [TRIPLE_QUOTE, SINGLE_TRIPLE_QUOTE]:
if delim in prefix:
prefix = prefix[: prefix.index(delim)]
break
single_single_quote_with_trailing_spaces = re.compile(r'[^\'"][\']\s*$')
if single_single_quote_with_trailing_spaces.search(prefix):
prefix = prefix[
: single_single_quote_with_trailing_spaces.search(prefix).start()
]
single_double_quote_with_trailing_spaces = re.compile(r'[^\'"]["]\s*$')
if single_double_quote_with_trailing_spaces.search(prefix):
prefix = prefix[
: single_double_quote_with_trailing_spaces.search(prefix).start()
]
prefix += TRIPLE_QUOTE
return prefix
def get_prompt(self, doc):
"""Generate prompts for Code to text benchmark (documentation generation)
Prompt = full function body (without the docstring) + '\n[Delimiter] The goal of this function is to:\n'
where delimiter is \""" for python, =begin for ruby and /* for the rest (see SUFFIX_PROMPT).
:param doc: dict[str: str])
"""
code = doc["code"]
if self.DATASET_NAME == "python":
# python code includes the docstring
text = doc["docstring"]
prompt_prefix = code[: code.index(text)]
prompt_prefix = self.standardize_docstring_prompt(prompt_prefix)
prompt_suffix = code[code.index(text) + len(text) :]
prompt_suffix = prompt_suffix.replace(TRIPLE_QUOTE, "")
prompt_suffix = prompt_suffix.replace(SINGLE_TRIPLE_QUOTE, "")
prompt_prefix = prompt_prefix.strip().removesuffix(TRIPLE_QUOTE)
prompt_prefix = prompt_prefix.strip().removesuffix(SINGLE_TRIPLE_QUOTE)
prompt = prompt_prefix + prompt_suffix + SUFFIX_PROMPT["python"]
return prompt
elif self.DATASET_NAME == "ruby":
return code + SUFFIX_PROMPT["ruby"]
else:
return code + SUFFIX_PROMPT["other"]
def get_reference(self, doc):
"""Builds the reference solution for the doc (sample from the test dataset).
:param doc: dict[str: str]
"""
from mosestokenizer import MosesDetokenizer
# deactivate tokenizer parallelism when calling MosesDetokenizer TODO: do it for all refs once
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# docstring_tokens are preprocessed and don't have extra context like variable defs
docstring = " ".join(doc["docstring_tokens"]).replace("\n", "")
# some docstrings started with r""" before tokenization but r was kept
if docstring[0] == "r":
docstring = docstring[1:]
with MosesDetokenizer("en") as detokenize:
docstring = detokenize(docstring.strip().split())
return docstring
def postprocess_generation(self, generation, idx):
"""Defines the postprocessing for a LM generation.
:param generation: str
code generation from LM
:param idx: int
index of doc in the dataset to which the generation belongs
(not used for this Task)
"""
delimiters = {language: SUFFIX_PROMPT["other"] for language in LANGUAGES}
delimiters.update(SUFFIX_PROMPT)
output = generation.split(delimiters[self.DATASET_NAME])[1].strip()
output = output.split("\n")[0]
return output
def process_results(self, generations, references):
"""Takes the list of LM generations and evaluates them against ground truth references,
returning the metric for the generations.
:param generations: list(list(str))
list of lists containing generations
:param references: list(str)
list of str containing references
"""
bleu_score = compute_codexglue_code_to_text_bleu(
(ref, gen[0]) for ref, gen in zip(references, generations)
)
return {"blue": bleu_score}
class LeftCodeToText(GeneralCodeToText):
"""Code to text task from CodeXGlue for Python subset in a left only setting:
only the function signature is given as prompt similarly to Fried et al. (InCoder)
TODO: implement function signature extraction for other languages in the dataset
"""
def __init__(self):
super().__init__("python")
@staticmethod
def standardize_docstring_prompt(prefix):
"""Strips any existing docstring delimiters from the prompt prefix and
and adds our own delimiter (triple quote) and whitespace.
Note an edge cases being handled here:
- codexglue docstring text sometimes contains the docstring delimiters, inconsistently
source: InCoder evaluation code https://github.com/dpfried/lm-evaluation-harness/
"""
for delim in [TRIPLE_QUOTE, SINGLE_TRIPLE_QUOTE]:
if delim in prefix:
prefix = prefix[: prefix.index(delim)]
break
single_single_quote_with_trailing_spaces = re.compile(r'[^\'"][\']\s*$')
if single_single_quote_with_trailing_spaces.search(prefix):
prefix = prefix[
: single_single_quote_with_trailing_spaces.search(prefix).start()
]
single_double_quote_with_trailing_spaces = re.compile(r'[^\'"]["]\s*$')
if single_double_quote_with_trailing_spaces.search(prefix):
prefix = prefix[
: single_double_quote_with_trailing_spaces.search(prefix).start()
]
prefix += TRIPLE_QUOTE
return prefix
def get_prompt(self, doc):
"""Generate prompts for Code to text benchmark (documentation generation)
Prompt = function signature.
:param doc: dict[str: str]
"""
code = doc["code"]
# python code includes the docstring
text = doc["docstring"]
prompt_prefix = code[: code.index(text)]
prompt_prefix = self.standardize_docstring_prompt(prompt_prefix)
return prompt_prefix
def postprocess_generation(self, generation, idx):
"""Defines the postprocessing for a LM generation.
:param generation: str
code generation from LM
:param idx: int
index of doc in the dataset to which the generation belongs
(not used for this Task)
"""
output = generation.strip().split("\n")[0].strip()
for delimiter in [TRIPLE_QUOTE, SINGLE_TRIPLE_QUOTE]:
if delimiter in generation:
generation = generation[generation.index(delimiter) + 3 :]
output = generation.strip().split("\n")[0].strip()
output = output.split(delimiter, 1)[0]
return output
"""
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
https://arxiv.org/abs/2102.04664
Text to text task from CodeXGlue (documentation translation)
"""
import json
import os
import re
from evaluate import load
from bigcode_eval.base import Task
_CITATION = """
@article{CodeXGLUE,
title={CodeXGLUE: A Benchmark Dataset and Open Challenge for Code Intelligence},
year={2020},}
"""
SOURCE_LANG = {
"da_en": "danish",
"zh_en": "chinese",
"no_en": "norwegian",
"lv_en": "latvian",
}
def create_all_tasks():
"""Creates a dictionary of tasks from a list of languages
:return: {task_name: task}
e.g. {codexglue_text_to_text-da_en: Task, codexglue_text_to_text-zh_en: Task}
"""
return {
f"codexglue_text_to_text-{translation_task}": create_task(translation_task)
for translation_task in SOURCE_LANG
}
def create_task(translation_task):
class CodexglueTextToTextTask(CodexglueTextToText):
def __init__(self, **kwargs):
super().__init__(translation_task, **kwargs)
return CodexglueTextToTextTask
class CodexglueTextToText(Task):
DATASET_PATH = "code_x_glue_tt_text_to_text"
DATASET_NAME = None
def __init__(self, translation_task, max_order=4, smooth=True):
self.DATASET_NAME = translation_task
stop_words = ["\n"]
requires_execution = False
super().__init__(stop_words, requires_execution)
self.max_order = max_order
self.smooth = smooth
def get_dataset(self):
"""Returns dataset for the task or an iterable of any object, that get_prompt can handle"""
return self.dataset["test"]
def fewshot_examples(self):
"""Loads and returns the few-shot examples for the task if they exist."""
with open(
"bigcode_eval/tasks/few_shot_examples/codexglue_text_to_text_few_shot_prompts.json",
"r",
) as file:
examples = json.load(file)
return examples
@staticmethod
def two_shot_prompt(entry, text, examples, language):
"""Two shot prompt format as source & target language documentation"""
prompt = f"\n{language.title()}:\n{examples['source1']}\
\nEnglish:\n{examples['target1']}\
\n{language.title()}:\n{examples['source2']}\
\nEnglish:\n{examples['target2']}\
\n{language.title()}:\n{text}\
\nEnglish:\n"
return entry + prompt
def get_prompt(self, doc):
"""Builds the prompt for the LM to generate from."""
language = SOURCE_LANG[self.DATASET_NAME]
text = doc["source"]
entry = f"Translate the following documentation from {language.title()} to English:\n"
examples = self.fewshot_examples()
examples = examples[language]
prompt = self.two_shot_prompt(entry, text, examples, language)
return prompt
def get_reference(self, doc):
"""Builds the reference solution for the doc (sample from the test dataset)."""
return doc["target"].strip()
def postprocess_generation(self, generation, idx):
"""Defines the postprocessing for a LM generation.
:param generation: str
code generation from LM
:param idx: int
index of doc in the dataset to which the generation belongs
(not used for this task)
"""
output = generation.split("\nEnglish:\n", 3)[-1].strip()
return output
def process_results(self, generations, references):
"""Takes the list of LM generations and evaluates them against ground truth references,
returning the metric for the generations.
:param generations: list(list(str))
list of lists containing generations
:param references: list(str)
list of str containing references
"""
bleu = load("bleu")
gens = [gen[0] for gen in generations]
results = bleu.compute(
references=references, predictions=gens, max_order=self.max_order, smooth=self.smooth
)
return results
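# Illustrative sketch (not part of the original file): rendering the two-shot
# translation prompt via the staticmethod with hypothetical few-shot examples,
# without instantiating the task (which would load the dataset).
def _example_text_to_text_prompt():
    examples = {
        "source1": "Fiktiv dokumentation et.",      # hypothetical placeholders
        "target1": "Fictional documentation one.",
        "source2": "Fiktiv dokumentation to.",
        "target2": "Fictional documentation two.",
    }
    entry = "Translate the following documentation from Danish to English:\n"
    return CodexglueTextToText.two_shot_prompt(
        entry, "En docstring der skal oversaettes.", examples, "danish"
    )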
"""Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow
https://arxiv.org/pdf/1805.08949.pdf
Python Code generation with CoNaLa. It is a benchmark of code and natural language pairs, for the evaluation of code generation tasks.
The dataset was crawled from Stack Overflow, automatically filtered, then curated by annotators,
split into 2,379 training and 500 test examples.
Homepage: https://conala-corpus.github.io/
Here we use two-shot evaluation (the original paper evaluates finetuned models)
"""
import json
from evaluate import load
from bigcode_eval.base import Task
_CITATION = """
@inproceedings{yin2018learning,
title={Learning to mine aligned code and natural language pairs from stack overflow},
author={Yin, Pengcheng and Deng, Bowen and Chen, Edgar and Vasilescu, Bogdan and Neubig, Graham},
booktitle={2018 IEEE/ACM 15th international conference on mining software repositories (MSR)},
pages={476--486},
year={2018},
organization={IEEE}
}
"""
class Conala(Task):
"""A task represents an entire benchmark including its dataset, problems,
answers, generation settings and evaluation methods.
"""
DATASET_PATH = "neulab/conala"
def __init__(self, max_order=4, smooth=True):
super().__init__(
stop_words=["\n"],
requires_execution=False,
)
self.max_order = max_order
self.smooth = smooth
def get_dataset(self):
"""Returns dataset for the task or an iterable of any object, that get_prompt can handle"""
return self.dataset["test"]
def fewshot_examples(self):
"""Loads and returns the few-shot examples for the task if they exist."""
with open(
"bigcode_eval/tasks/few_shot_examples/conala_few_shot_prompts.json", "r"
) as file:
examples = json.load(file)
return examples
@staticmethod
def two_shot_prompt(entry, text, examples):
"""Two shot prompt format as instructions & solutions"""
prompt = f"\nInstruction:\n{examples['instruction1']}\
\nSolution:\n{examples['solution1']}\
\nInstruction:\n{examples['instruction2']}\
\nSolution:\n{examples['solution2']}\
\nInstruction:\n{text}\
\nSolution:\n"
assert (
prompt.count("Solution:\n") == 3
), "Splitting operation in postprocess_generation is invalid"
return entry + prompt
def get_prompt(self, doc):
"""Builds the prompt for the LM to generate from."""
examples = self.fewshot_examples()
text_column = "rewritten_intent" if doc["rewritten_intent"] else "intent"
text = doc[text_column].strip()
entry = "Answer the following instructions in one line of Python code:\n"
prompt = self.two_shot_prompt(entry, text, examples)
return prompt
def get_reference(self, doc):
"""Builds the reference solution for the doc (sample from the test dataset)."""
return doc["snippet"]
def postprocess_generation(self, generation, idx):
"""Defines the postprocessing for a LM generation.
:param generation: str
code generation from LM
:param idx: int
index of doc in the dataset to which the generation belongs
(not used for this task)
"""
output = generation.split("Solution:\n", 3)[-1].strip()
return output
def process_results(self, generations, references):
"""Takes the list of LM generations and evaluates them against ground truth references,
returning the metric for the generations.
:param generations: list(list(str))
list of lists containing generations
:param references: list(str)
list of str containing references
"""
bleu = load("bleu")
gens = [gen[0] for gen in generations]
results = bleu.compute(
references=references, predictions=gens, max_order=self.max_order, smooth=self.smooth
)
return results
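# Illustrative sketch (not part of the original file): the two-shot prompt ends
# with "Solution:\n", which postprocess_generation relies on when splitting.
# The few-shot examples below are hypothetical placeholders.
def _example_conala_prompt():
    examples = {
        "instruction1": "reverse a list `l`",
        "solution1": "l[::-1]",
        "instruction2": "get the maximum of a list `l`",
        "solution2": "max(l)",
    }
    entry = "Answer the following instructions in one line of Python code:\n"
    return Conala.two_shot_prompt(entry, "sort a list `l` in place", examples)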
"""Mapping Language to Code in Programmatic Context (Concode)
https://arxiv.org/abs/1808.09588
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
https://arxiv.org/abs/2102.04664
Java code generation in CodeXGLUE text-to-code dataset (built from Concode dataset)
Available at https://huggingface.co/datasets/code_x_glue_tc_text_to_code
2000 samples are available in the test set.
Here we use two-shot evaluation (the original paper evaluates finetuned models)
"""
import json
from evaluate import load
from bigcode_eval.base import Task
_CITATION = """
@article{iyer2018mapping,
title={Mapping language to code in programmatic context},
author={Iyer, Srinivasan and Konstas, Ioannis and Cheung, Alvin and Zettlemoyer, Luke},
journal={arXiv preprint arXiv:1808.09588},
year={2018}
}
"""
class Concode(Task):
"""A task represents an entire benchmark including its dataset, problems,
answers, generation settings and evaluation methods.
"""
DATASET_PATH = "code_x_glue_tc_text_to_code"
def __init__(self, max_order=4, smooth=True):
super().__init__(
stop_words=["\n"],
requires_execution=False,
)
self.max_order = max_order
self.smooth = smooth
def get_dataset(self):
"""Returns dataset for the task or an iterable of any object, that get_prompt can handle"""
# test split of the dataset doesn't have targets
return self.dataset["validation"]
def fewshot_examples(self):
"""Loads and returns the few-shot examples for the task if they exist."""
with open(
"bigcode_eval/tasks/few_shot_examples/concode_few_shot_prompts.json", "r"
) as file:
examples = json.load(file)
return examples
@staticmethod
def two_shot_prompt(entry, text, examples):
"""Two shot prompt format as instructions & solutions"""
prompt = f"\nInstruction:\n{examples['instruction1']}\
\nSolution:\n{examples['solution1']}\
\nInstruction:\n{examples['instruction2']}\
\nSolution:\n{examples['solution2']}\
\nInstruction:\n{text}\
\nSolution:\n"
assert (
prompt.count("Solution:\n") == 3
), "Splitting operation in postprocess_generation is invalid"
return entry + prompt
def get_prompt(self, doc):
"""Builds the prompt for the LM to generate from."""
examples = self.fewshot_examples()
text = doc["nl"].split("concode_field_sep")[0].strip()
if text.endswith("."):
text = text[:-1].strip()
entry = "Answer the following instructions in a one line of Java code:\n"
prompt = self.two_shot_prompt(entry, text, examples)
return prompt
def get_reference(self, doc):
"""Builds the reference solution for the doc (sample from the test dataset)."""
return doc["code"]
def postprocess_generation(self, generation, idx):
"""Defines the postprocessing for a LM generation.
:param generation: str
code generation from LM
:param idx: int
index of doc in the dataset to which the generation belongs
(not used for this task)
"""
output = generation.split("Solution:\n", 3)[-1].strip()
return output
def process_results(self, generations, references):
"""Takes the list of LM generations and evaluates them against ground truth references,
returning the metric for the generations.
:param generations: list(list(str))
list of lists containing generations
:param references: list(str)
list of str containing references
"""
bleu = load("bleu")
gens = [gen[0] for gen in generations]
results = bleu.compute(
references=references, predictions=gens, max_order=self.max_order, smooth=self.smooth
)
return results
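# Illustrative sketch (not part of the original file): how get_prompt trims the
# natural-language field; "concode_field_sep" separates the description from
# the class-context fields in the dataset. The string below is hypothetical.
def _example_concode_nl_trimming():
    nl = "Adds two integers . concode_field_sep int a concode_field_sep int b"
    text = nl.split("concode_field_sep")[0].strip()
    if text.endswith("."):
        text = text[:-1].strip()
    return text  # "Adds two integers"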
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""The CodeEval metric estimates the pass@k metric for code synthesis.
This is an evaluation harness for the HumanEval problem solving dataset
described in the paper "Evaluating Large Language Models Trained on Code"
(https://arxiv.org/abs/2107.03374)."""
import itertools
import os
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor, as_completed
import numpy as np
from .execute import check_correctness
_CITATION = """\
@misc{chen2021evaluating,
title={Evaluating Large Language Models Trained on Code},
author={Mark Chen and Jerry Tworek and Heewoo Jun and Qiming Yuan \
and Henrique Ponde de Oliveira Pinto and Jared Kaplan and Harri Edwards \
and Yuri Burda and Nicholas Joseph and Greg Brockman and Alex Ray \
and Raul Puri and Gretchen Krueger and Michael Petrov and Heidy Khlaaf \
and Girish Sastry and Pamela Mishkin and Brooke Chan and Scott Gray \
and Nick Ryder and Mikhail Pavlov and Alethea Power and Lukasz Kaiser \
and Mohammad Bavarian and Clemens Winter and Philippe Tillet \
and Felipe Petroski Such and Dave Cummings and Matthias Plappert \
and Fotios Chantzis and Elizabeth Barnes and Ariel Herbert-Voss \
and William Hebgen Guss and Alex Nichol and Alex Paino and Nikolas Tezak \
and Jie Tang and Igor Babuschkin and Suchir Balaji and Shantanu Jain \
and William Saunders and Christopher Hesse and Andrew N. Carr \
and Jan Leike and Josh Achiam and Vedant Misra and Evan Morikawa \
and Alec Radford and Matthew Knight and Miles Brundage and Mira Murati \
and Katie Mayer and Peter Welinder and Bob McGrew and Dario Amodei \
and Sam McCandlish and Ilya Sutskever and Wojciech Zaremba},
year={2021},
eprint={2107.03374},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
"""
_DESCRIPTION = """\
This metric implements the evaluation harness for the HumanEval problem solving dataset
described in the paper "Evaluating Large Language Models Trained on Code"
(https://arxiv.org/abs/2107.03374).
"""
_KWARGS_DESCRIPTION = """
Calculates how good predictions are given some references, using certain scores
Args:
predictions: list of candidates to evaluate. Each candidates should be a list
of strings with several code candidates to solve the problem.
references: a list with a test for each prediction. Each test should evaluate the
correctness of a code candidate.
k: number of code candidates to consider in the evaluation (Default: [1, 10, 100])
num_workers: number of workers used to evaluate the candidate programs (Default: 4).
timeout: maximum time in seconds allowed for each candidate program to run (Default: 3.0).
Returns:
pass_at_k: dict with pass rates for each k
results: dict with granular results of each unittest
Examples:
>>> test_cases = ["assert add(2,3)==5"]
>>> candidates = [["def add(a,b): return a*b", "def add(a, b): return a+b"]]
>>> pass_at_k, results = compute_code_eval(references=test_cases, predictions=candidates, k=[1, 2])
>>> print(pass_at_k)
{'pass@1': 0.5, 'pass@2': 1.0}
"""
_WARNING = """
################################################################################
!!!WARNING!!!
################################################################################
The "code_eval" metric executes untrusted model-generated code in Python.
Although it is highly unlikely that model-generated code will do something
overtly malicious in response to this test suite, model-generated code may act
destructively due to a lack of model capability or alignment.
Users are strongly encouraged to sandbox this evaluation suite so that it
does not perform destructive actions on their host or network. For more
information on how OpenAI sandboxes its code, see the paper "Evaluating Large
Language Models Trained on Code" (https://arxiv.org/abs/2107.03374).
Once you have read this disclaimer and taken appropriate precautions,
set the environment variable HF_ALLOW_CODE_EVAL="1". Within Python you can do this
with:
>>> import os
>>> os.environ["HF_ALLOW_CODE_EVAL"] = "1"
################################################################################\
"""
_LICENSE = """The MIT License
Copyright (c) OpenAI (https://openai.com)
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE."""
def compute_code_eval(predictions, references, k=[1, 10, 100], num_workers=4, timeout=3.0):
"""Returns the scores"""
if os.getenv("HF_ALLOW_CODE_EVAL", 0) != "1":
raise ValueError(_WARNING)
if os.name == "nt":
raise NotImplementedError("This metric is currently not supported on Windows.")
with ThreadPoolExecutor(max_workers=num_workers) as executor:
futures = []
completion_id = Counter()
n_samples = 0
results = defaultdict(list)
for task_id, (candidates, test_case) in enumerate(zip(predictions, references)):
for candidate in candidates:
test_program = candidate + "\n" + test_case
args = (test_program, timeout, task_id, completion_id[task_id])
future = executor.submit(check_correctness, *args)
futures.append(future)
completion_id[task_id] += 1
n_samples += 1
for future in as_completed(futures):
result = future.result()
results[result["task_id"]].append((result["completion_id"], result))
total, correct = [], []
for result in results.values():
result.sort()
passed = [r[1]["passed"] for r in result]
total.append(len(passed))
correct.append(sum(passed))
total = np.array(total)
correct = np.array(correct)
ks = k
if not isinstance(ks, (list, tuple)):
ks = [ks]
pass_at_k = {f"pass@{k}": estimate_pass_at_k(total, correct, k).mean() for k in ks if (total >= k).all()}
return pass_at_k, results
def estimate_pass_at_k(num_samples, num_correct, k):
"""Estimates pass@k of each problem and returns them in an array."""
def estimator(n: int, c: int, k: int) -> float:
"""Calculates 1 - comb(n - c, k) / comb(n, k)."""
if n - c < k:
return 1.0
return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
if isinstance(num_samples, int):
num_samples_it = itertools.repeat(num_samples, len(num_correct))
else:
assert len(num_samples) == len(num_correct)
num_samples_it = iter(num_samples)
return np.array([estimator(int(n), int(c), k) for n, c in zip(num_samples_it, num_correct)])
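# Worked example (illustrative only, not part of the original file): with n=2
# samples per problem and c=1 passing, the unbiased pass@1 estimate is
# 1 - C(1, 1) / C(2, 1) = 0.5, matching the docstring example above.
def _example_pass_at_k():
    total = np.array([2])    # two candidates generated for the single problem
    correct = np.array([1])  # one of them passed its tests
    return estimate_pass_at_k(total, correct, k=1)  # array([0.5])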
#!/usr/bin/python
"""From https://github.com/microsoft/CodeXGLUE/tree/main/Code-Text/code-to-text (evaluator/evaluator.py)
Call with:
python codexglue_bleu_evaluator.py ref_file < hyp_file
"""
"""
This script was adapted from the original version by hieuhoang1972 which is part of MOSES.
"""
# $Id: bleu.py 1307 2007-03-14 22:22:36Z hieuhoang1972 $
"""Provides:
cook_refs(refs, n=4): Transform a list of reference sentences as strings into a form usable by cook_test().
cook_test(test, refs, n=4): Transform a test sentence as a string (together with the cooked reference sentences) into a form usable by score_cooked().
score_cooked(alltest, n=4): Score a list of cooked test sentences.
score_set(s, testid, refids, n=4): Interface with dataset.py; calculate BLEU score of testid against refids.
The reason for breaking the BLEU computation into three phases cook_refs(), cook_test(), and score_cooked() is to allow the caller to calculate BLEU scores for multiple test sets as efficiently as possible.
"""
import math
import os
import re
import subprocess
import sys
import xml.sax.saxutils
# Added to bypass NIST-style pre-processing of hyp and ref files -- wade
nonorm = 0
preserve_case = False
eff_ref_len = "shortest"
normalize1 = [
("<skipped>", ""), # strip "skipped" tags
(r"-\n", ""), # strip end-of-line hyphenation and join lines
(r"\n", " "), # join lines
# (r'(\d)\s+(?=\d)', r'\1'), # join digits
]
normalize1 = [(re.compile(pattern), replace) for (pattern, replace) in normalize1]
normalize2 = [
(
r"([\{-\~\[-\` -\&\(-\+\:-\@\/])",
r" \1 ",
), # tokenize punctuation. apostrophe is missing
(
r"([^0-9])([\.,])",
r"\1 \2 ",
), # tokenize period and comma unless preceded by a digit
(
r"([\.,])([^0-9])",
r" \1 \2",
), # tokenize period and comma unless followed by a digit
(r"([0-9])(-)", r"\1 \2 "), # tokenize dash when preceded by a digit
]
normalize2 = [(re.compile(pattern), replace) for (pattern, replace) in normalize2]
def normalize(s):
"""Normalize and tokenize text. This is lifted from NIST mteval-v11a.pl."""
# Added to bypass NIST-style pre-processing of hyp and ref files -- wade
if nonorm:
return s.split()
if type(s) is not str:
s = " ".join(s)
# language-independent part:
for (pattern, replace) in normalize1:
s = re.sub(pattern, replace, s)
s = xml.sax.saxutils.unescape(s, {"&quot;": '"'})
# language-dependent part (assuming Western languages):
s = " %s " % s
if not preserve_case:
s = s.lower() # this might not be identical to the original
for (pattern, replace) in normalize2:
s = re.sub(pattern, replace, s)
return s.split()
def count_ngrams(words, n=4):
counts = {}
for k in range(1, n + 1):
for i in range(len(words) - k + 1):
ngram = tuple(words[i : i + k])
counts[ngram] = counts.get(ngram, 0) + 1
return counts
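# Small illustrative check (not part of the original script): unigram and
# bigram counts for a three-token sentence.
def _example_count_ngrams():
    return count_ngrams(["the", "cat", "the"], n=2)
    # {('the',): 2, ('cat',): 1, ('the', 'cat'): 1, ('cat', 'the'): 1}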
def cook_refs(refs, n=4):
"""Takes a list of reference sentences for a single segment
and returns an object that encapsulates everything that BLEU
needs to know about them."""
refs = [normalize(ref) for ref in refs]
maxcounts = {}
for ref in refs:
counts = count_ngrams(ref, n)
for (ngram, count) in counts.items():
maxcounts[ngram] = max(maxcounts.get(ngram, 0), count)
return ([len(ref) for ref in refs], maxcounts)
def cook_test(test, item, n=4):
"""Takes a test sentence and returns an object that
encapsulates everything that BLEU needs to know about it."""
(reflens, refmaxcounts) = item
test = normalize(test)
result = {}
result["testlen"] = len(test)
# Calculate effective reference sentence length.
if eff_ref_len == "shortest":
result["reflen"] = min(reflens)
elif eff_ref_len == "average":
result["reflen"] = float(sum(reflens)) / len(reflens)
elif eff_ref_len == "closest":
min_diff = None
for reflen in reflens:
if min_diff is None or abs(reflen - len(test)) < min_diff:
min_diff = abs(reflen - len(test))
result["reflen"] = reflen
result["guess"] = [max(len(test) - k + 1, 0) for k in range(1, n + 1)]
result["correct"] = [0] * n
counts = count_ngrams(test, n)
for (ngram, count) in counts.items():
result["correct"][len(ngram) - 1] += min(refmaxcounts.get(ngram, 0), count)
return result
def score_cooked(allcomps, n=4, ground=0, smooth=1):
totalcomps = {"testlen": 0, "reflen": 0, "guess": [0] * n, "correct": [0] * n}
for comps in allcomps:
for key in ["testlen", "reflen"]:
totalcomps[key] += comps[key]
for key in ["guess", "correct"]:
for k in range(n):
totalcomps[key][k] += comps[key][k]
logbleu = 0.0
all_bleus = []
for k in range(n):
correct = totalcomps["correct"][k]
guess = totalcomps["guess"][k]
addsmooth = 0
if smooth == 1 and k > 0:
addsmooth = 1
logbleu += math.log(correct + addsmooth + sys.float_info.min) - math.log(
guess + addsmooth + sys.float_info.min
)
if guess == 0:
all_bleus.append(-10000000)
else:
all_bleus.append(math.log(correct + sys.float_info.min) - math.log(guess))
logbleu /= float(n)
all_bleus.insert(0, logbleu)
brevPenalty = min(
0, 1 - float(totalcomps["reflen"] + 1) / (totalcomps["testlen"] + 1)
)
for i in range(len(all_bleus)):
if i == 0:
all_bleus[i] += brevPenalty
all_bleus[i] = math.exp(all_bleus[i])
return all_bleus
def bleu(refs, candidate, ground=0, smooth=1):
refs = cook_refs(refs)
test = cook_test(candidate, refs)
return score_cooked([test], ground=ground, smooth=smooth)
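# Illustrative check (not part of the original script): an exact match yields
# an overall score of 1.0 at index 0; indices 1..4 hold per-n-gram values.
def _example_sentence_bleu():
    refs = ["returns the sum of two numbers"]
    scores = bleu(refs, "returns the sum of two numbers")
    return scores[0]  # 1.0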
def splitPuncts(line):
return " ".join(re.findall(r"[\w]+|[^\s\w]", line))
def computeMaps(predictions, goldfile):
predictionMap = {}
goldMap = {}
gf = open(goldfile, "r")
for row in predictions:
cols = row.strip().split("\t")
if len(cols) == 1:
(rid, pred) = (cols[0], "")
else:
(rid, pred) = (cols[0], cols[1])
predictionMap[rid] = [splitPuncts(pred.strip().lower())]
for i, row in enumerate(gf):
if len(row.split("\t")) != 2:
print(row)
print(i)
(rid, pred) = row.split("\t")
if rid in predictionMap: # Only insert if the id exists for the method
if rid not in goldMap:
goldMap[rid] = []
goldMap[rid].append(splitPuncts(pred.strip().lower()))
sys.stderr.write("Total: " + str(len(goldMap)) + "\n")
return (goldMap, predictionMap)
# m1 is the reference map
# m2 is the prediction map
def bleuFromMaps(m1, m2):
score = [0] * 5
num = 0.0
for key in m1:
if key in m2:
bl = bleu(m1[key], m2[key][0])
score = [score[i] + bl[i] for i in range(0, len(bl))]
num += 1
return [s * 100.0 / num for s in score]
if __name__ == "__main__":
reference_file = sys.argv[1]
predictions = []
for row in sys.stdin:
predictions.append(row)
(goldMap, predictionMap) = computeMaps(predictions, reference_file)
print(bleuFromMaps(goldMap, predictionMap)[0])