first commit

Commit f314e457 authored May 24, 2024 by dengjb
20 changed files
--- a/.gitignore
+++ b/.gitignore
+experiments/*
+results/*
+tb_logger/*
+wandb/*
+tmp/*
+modify_model.py
+hat/version.py
+
+*.DS_Store
+
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+pip-wheel-metadata/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+.python-version
+
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+#Pipfile.lock
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
--- a/Evaluation/DS-1000/README.md
+++ b/Evaluation/DS-1000/README.md
+## 1. Introduction
+
+We provide a test script to evaluate the performance of the **deepseek-coder** model on code completion benchmarks. We use the widely used [**DS-1000**](https://github.com/xlang-ai/DS-1000) benchmark.
+
+## 2. Evaluation
+
+We directly use the scripts provided by the DS-1000 repository to evaluate the performance of the models. You can refer to [**DS-1000**](https://github.com/xlang-ai/DS-1000) to find more details about the evaluation.
+
+
+## 3. Experimental Results
+
+We report experimental results here for the completion mode of DS-1000. We set the maximum length to **2048** and use the **greedy search strategy**. To ensure a fair comparison, we apply identical hyperparameters to all open-source models under evaluation.
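The decoding settings above can be sketched as keyword arguments for a Hugging Face `generate` call. This is a minimal illustration, not the repository's evaluation code: the helper name `greedy_generation_kwargs` is hypothetical, and mapping "maximum length" to `max_length` (prompt plus generated tokens) is an assumption.

```python
def greedy_generation_kwargs(max_length: int = 2048) -> dict:
    """Keyword arguments for deterministic (greedy) decoding.

    Meant to be unpacked into a Hugging Face `model.generate(...)` call.
    The helper name is hypothetical and not part of this repository.
    """
    return {
        "do_sample": False,        # no sampling: always take the most likely token
        "num_beams": 1,            # a single beam, i.e. plain greedy search
        "max_length": max_length,  # assumed cap on prompt + generated tokens
    }


kwargs = greedy_generation_kwargs()
print(kwargs)
```

With a loaded model and tokenized prompt, the dictionary would be used as `model.generate(**inputs, **kwargs)`.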
+
+| Model                  | Size | Matplotlib | Numpy | Pandas | Pytorch | Scipy | Scikit-Learn | Tensorflow | Avg   |
+|------------------------|------|------------|-------|--------|---------|-------|-------------|------------|-------|
+| Codex-001              | -    | 41.8%      | 26.6% | 9.4%   | 9.7%    | 15.0% | 18.5%        | 17.2%      | 20.2% |
+| Codex-002              | -    | **57.0%**      | 43.1% | **26.5%**  | **41.8%**   | 31.8% | **44.8%**        | 39.3%      | 39.2% |
+| CodeShell              | 7B   | 34.1%      | 21.8% | 10.7%  | 11.8%   | 17.0% | 20.0%        | 15.6%      | 18.8% |
+| CodeGeeX2              | 6B   | 38.7%      | 26.8% | 14.4%  | 11.8%   | 19.8% | 27.0%        | 17.8%      | 22.9% |
+| StarCoder              | 16B  | 47.7%      | 31.4% | 12.7%  | 25.0%   | 22.6% | 35.7%        | 22.2%      | 27.2% |
+| CodeLLama-Base         | 7B   | 41.9%      | 24.6% | 14.8%  | 16.2%   | 18.9% | 17.4%        | 17.8%      | 22.1% |
+| CodeLLama-Base         | 13B  | 46.5%      | 28.6% | 18.2%  | 19.1%   | 18.9% | 27.8%        | 33.3%      | 26.8% |
+| CodeLLama-Base         | 34B  | 50.3%      | 42.7% | 23.0%  | 25.0%   | 28.3% | 33.9%        | 40.0%      | 34.3% |
+| | | | |  |  |  |  |  |  | |
+| DeepSeek-Coder-Base    | 1.3B   | 32.3%      | 21.4% | 9.3%   | 8.8%    | 8.5%  | 16.5%        | 8.9%       | 16.2% |
+| DeepSeek-Coder-Base    | 5.7B   | 51.1%      | 31.8% | 19.9%  | 14.7%   | 17.0% | 29.6%        | 15.6%      | 27.7% |
+| DeepSeek-Coder-Base    | 6.7B   | 48.4%      | 35.5% | 20.6%  | 19.1%   | 22.6% | 38.3%        | 24.4%      | 30.5% |
+| DeepSeek-Coder-Base    | 33B  | 56.1%      | **49.6%** | 25.8%  | 36.8%   | **36.8%** | 40.0%        | **46.7%**      | **40.2%** |
+
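The Avg column in the table above is consistent with a mean weighted by the number of problems per library rather than a plain mean of the seven columns. The sketch below checks this for two rows. The per-library problem counts are an assumption taken from the published DS-1000 statistics, not from this README, and `weighted_average` is a hypothetical helper.

```python
# Per-library problem counts in DS-1000 (assumed from the benchmark's
# published statistics; they sum to 1000).
PROBLEM_COUNTS = {
    "Matplotlib": 155,
    "Numpy": 220,
    "Pandas": 291,
    "Pytorch": 68,
    "Scipy": 106,
    "Scikit-Learn": 115,
    "Tensorflow": 45,
}


def weighted_average(scores: dict) -> float:
    """Average per-library scores (in percent), weighted by problem count."""
    total = sum(PROBLEM_COUNTS.values())
    return sum(scores[lib] * PROBLEM_COUNTS[lib] for lib in PROBLEM_COUNTS) / total


# Rows copied from the table above.
deepseek_33b = {
    "Matplotlib": 56.1, "Numpy": 49.6, "Pandas": 25.8, "Pytorch": 36.8,
    "Scipy": 36.8, "Scikit-Learn": 40.0, "Tensorflow": 46.7,
}
codex_002 = {
    "Matplotlib": 57.0, "Numpy": 43.1, "Pandas": 26.5, "Pytorch": 41.8,
    "Scipy": 31.8, "Scikit-Learn": 44.8, "Tensorflow": 39.3,
}

print(round(weighted_average(deepseek_33b), 1))  # 40.2, matching the Avg column
print(round(weighted_average(codex_002), 1))     # 39.2, matching the Avg column
```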
--- a/Evaluation/HumanEval/README.md
+++ b/Evaluation/HumanEval/README.md
+## 1. Introduction
+
+We provide a test script to evaluate the performance of the **deepseek-coder** model on code generation benchmarks. We use two widely used benchmarks: **[HumanEval-Python](https://huggingface.co/datasets/openai_humaneval)** and **[HumanEval-Multilingual](https://huggingface.co/datasets/nuprl/MultiPL-E)**.
+
+
+
+## 2. Setup
+
+```
+pip install accelerate
+pip install attrdict
+pip install transformers
+pip install torch
+```
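After installation, a quick check that the packages are importable can save a failed multi-GPU launch. The sketch below uses only the standard library. `missing_modules` is a hypothetical helper, and the module list assumes each package is imported under the name shown (PyTorch is imported as `torch`).

```python
import importlib.util


def missing_modules(module_names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Import names assumed for the packages listed in the setup step.
REQUIRED = ["accelerate", "attrdict", "transformers", "torch"]

missing = missing_modules(REQUIRED)
print("missing:", missing if missing else "none")
```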
+
+
+## 3. Evaluation
+
+We've created a sample script, **eval.sh**, that demonstrates how to test the **DeepSeek-Coder-1.3b-Base** model on the HumanEval dataset leveraging **8** GPUs. If your use case involves a different model or dataset, simply adjust the script to fit your needs.
+
+The paths to compilers and interpreters for the various programming languages may differ on your machine. Please update the corresponding paths in the **humaneval/execution.py** file accordingly.
+
+```bash
+MODEL_NAME_OR_PATH="deepseek-ai/deepseek-coder-1.3b-base"
+DATASET_ROOT="data/"
+LANGUAGE="python"
+python -m accelerate.commands.launch --config_file test_config.yaml eval_pal.py --logdir ${MODEL_NAME_OR_PATH} --language ${LANGUAGE} --dataroot ${DATASET_ROOT} 
+```
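Regarding the note above on per-language execution paths: one way to find the paths to enter in **humaneval/execution.py** is to ask the system where each toolchain lives. The sketch below is only an illustration. The language-to-executable mapping is hypothetical and may not match what `execution.py` actually invokes.

```python
import shutil

# Hypothetical mapping from benchmark language to the executable that runs
# or compiles it; adjust to whatever humaneval/execution.py expects.
CANDIDATE_EXECUTABLES = {
    "python": "python3",
    "cpp": "g++",
    "java": "javac",
    "php": "php",
    "js": "node",
    "bash": "bash",
}


def locate_executables(candidates):
    """Map each language to the absolute path of its executable, or None."""
    return {lang: shutil.which(exe) for lang, exe in candidates.items()}


for lang, path in locate_executables(CANDIDATE_EXECUTABLES).items():
    print(f"{lang}: {path or 'not found on PATH'}")
```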
+
+To evaluate the instruction-based model, please follow the script below:
+```bash
+LANG="python"
+OUTPUT_DIR="output"
+MODEL="deepseek-coder-33b-instruct"
+
+CUDA_VISIBLE_DEVICES=0,1 python eval_instruct.py \
+    --model "deepseek-ai/$MODEL" \
+    --output_path "$OUTPUT_DIR/${LANG}.$MODEL.jsonl" \
+    --language "$LANG" \
+    --temp_dir "$OUTPUT_DIR"
+```
+
+## 4. Experimental Results
+
+We report experimental results here for 8 mainstream programming languages: **Python**, **C++**, **Java**, **PHP**, **TypeScript**, **C#**, **Bash**, and **JavaScript**. For all open-source models, we use this repository to measure performance on the HumanEval dataset. We set the maximum input length to **4096** and the maximum output length to **500**, and use the **greedy search strategy**.
+
+
+#### (1) Multilingual Base Models
+
+| Model             | Size | Python | C++   | Java | PHP  | TS   | C#   | Bash | JS   | Avg  |
+|-------------------|------|--------|-------|------|------|------|------|------|------|------|
+| code-cushman-001  | 12B  | 33.5%  | 31.9% | 30.6%| 28.9%| 31.3%| 22.1%| 11.7%| -    | -    |
+| CodeShell         | 7B   | 35.4%  | 32.9% | 34.2%| 31.7%| 30.2%| 38.0%| 7.0% | 33.5%| 30.4%|
+| CodeGeeX2         | 6B   | 36.0%  | 29.2% | 25.9%| 23.6%| 20.8%| 29.7%| 6.3% | 24.8%| 24.5%|
+| StarCoderBase     | 16B  | 31.7%  | 31.1% | 28.5%| 25.4%| 34.0%| 34.8%| 8.9% | 29.8%| 28.0%|
+| CodeLLama         | 7B   | 31.7%  | 29.8% | 34.2%| 23.6%| 36.5%| 36.7%| 12.0%| 29.2%| 29.2%|
+| CodeLLama         | 13B  | 36.0%  | 37.9% | 38.0%| 34.2%| 45.2%| 43.0%| 16.5%| 32.3%| 35.4%|
+| CodeLLama         | 34B  | 48.2%  | 44.7% | 44.9%| 41.0%| 42.1%| 48.7%| 15.8%| 42.2%| 41.0%|
+| | | | |  |  |  |  |  |  | |
+| DeepSeek-Coder-Base| 1.3B   | 34.8%  | 31.1% | 32.3%| 24.2%| 28.9%| 36.7%| 10.1%| 28.6%| 28.3%|
+| DeepSeek-Coder-Base| 5.7B   | 48.7%  | 45.3% | 41.1%| 39.7%| 44.7%| 41.1%| 27.8%| 42.2%| 41.3%|
+| DeepSeek-Coder-Base| 6.7B   | 49.4%  | 50.3% | 43.0%| 38.5%| 49.7%| 50.0%| 28.5%| 48.4%| 44.7%|
+| DeepSeek-Coder-Base| 33B    | **56.1%**  | **58.4%** | **51.9%**| **44.1%**| **52.8%**| **51.3%**| **32.3%**| **55.3%**| **50.3%**|
+
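Unlike DS-1000, the Avg column in the HumanEval table above is consistent with a plain unweighted mean over the eight languages. The sketch below checks two rows copied from the table; `mean_score` is a hypothetical helper.

```python
# Column order follows the table: Python, C++, Java, PHP, TS, C#, Bash, JS.
deepseek_base_33b = [56.1, 58.4, 51.9, 44.1, 52.8, 51.3, 32.3, 55.3]
deepseek_base_6_7b = [49.4, 50.3, 43.0, 38.5, 49.7, 50.0, 28.5, 48.4]


def mean_score(row):
    """Unweighted mean of per-language scores (in percent)."""
    return sum(row) / len(row)


print(round(mean_score(deepseek_base_33b), 1))   # 50.3, matching the Avg column
print(round(mean_score(deepseek_base_6_7b), 1))  # 44.7, matching the Avg column
```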
+#### (2) Instruction-Tuned Models
+| Model               | Size | Python | C++   | Java | PHP  | TS   | C#   | Bash | JS   | Avg  |
+|---------------------|------|--------|-------|------|------|------|------|------|------|------|
+| GPT-3.5-Turbo         | -    | 76.2%  | 63.4% | 69.2%| 60.9%| 69.1%| 70.8%| 42.4%| 67.1%| 64.9%|
+| GPT-4               | -    | **84.1%**  | **76.4%** | **81.6%**| **77.2%**| **77.4%**| **79.1%**| **58.2%**| **78.0%**| **76.5%**|
+| | | | |  |  |  |  |  |  | |
+| DeepSeek-Coder-Instruct | 1.3B  | 65.2%  | 45.3% | 51.9% | 45.3%| 59.7%| 55.1%| 12.7%| 52.2%| 48.4%|
+| DeepSeek-Coder-Instruct | 6.7B  | 78.9%  | 63.4% | 68.4% | 68.9%| 67.2%| 72.8%| 36.7%| 72.7%| 66.1%|
+| DeepSeek-Coder-Instruct | 33B | **79.3%**  | **68.9%** | **73.4%** | **72.7%**| **67.9%**| **74.1%**| **43.0%**| **73.9%**| **69.2%**|
+
--- a/Evaluation/HumanEval/data/humaneval-cpp
+++ b/Evaluation/HumanEval/data/humaneval-cpp
--- a/Evaluation/HumanEval/data/humaneval-cpp.jsonl
+++ b/Evaluation/HumanEval/data/humaneval-cpp.jsonl
--- a/Evaluation/HumanEval/data/humaneval-cs
+++ b/Evaluation/HumanEval/data/humaneval-cs
--- a/Evaluation/HumanEval/data/humaneval-cs-bu.jsonl
+++ b/Evaluation/HumanEval/data/humaneval-cs-bu.jsonl
--- a/Evaluation/HumanEval/data/humaneval-cs.jsonl
+++ b/Evaluation/HumanEval/data/humaneval-cs.jsonl
--- a/Evaluation/HumanEval/data/humaneval-d.jsonl
+++ b/Evaluation/HumanEval/data/humaneval-d.jsonl
--- a/Evaluation/HumanEval/data/humaneval-go.jsonl
+++ b/Evaluation/HumanEval/data/humaneval-go.jsonl
--- a/Evaluation/HumanEval/data/humaneval-java
+++ b/Evaluation/HumanEval/data/humaneval-java
--- a/Evaluation/HumanEval/data/humaneval-java.jsonl
+++ b/Evaluation/HumanEval/data/humaneval-java.jsonl
--- a/Evaluation/HumanEval/data/humaneval-jl.jsonl
+++ b/Evaluation/HumanEval/data/humaneval-jl.jsonl
--- a/Evaluation/HumanEval/data/humaneval-js.jsonl
+++ b/Evaluation/HumanEval/data/humaneval-js.jsonl
--- a/Evaluation/HumanEval/data/humaneval-lua.jsonl
+++ b/Evaluation/HumanEval/data/humaneval-lua.jsonl
--- a/Evaluation/HumanEval/data/humaneval-php
+++ b/Evaluation/HumanEval/data/humaneval-php
--- a/Evaluation/HumanEval/data/humaneval-php.jsonl
+++ b/Evaluation/HumanEval/data/humaneval-php.jsonl
--- a/Evaluation/HumanEval/data/humaneval-pl.jsonl
+++ b/Evaluation/HumanEval/data/humaneval-pl.jsonl
--- a/Evaluation/HumanEval/data/humaneval-python.jsonl
+++ b/Evaluation/HumanEval/data/humaneval-python.jsonl
--- a/Evaluation/HumanEval/data/humaneval-r.jsonl
+++ b/Evaluation/HumanEval/data/humaneval-r.jsonl