- 10 Jun, 2022 1 commit
Loubna Ben Allal authored
- use CodeParrot scores of v1.1
- change evaluation command to use accelerate
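Switching evaluation to accelerate means the validation-loss loop runs under an `Accelerator` and is started with `accelerate launch`. A minimal sketch of such a loop (the tiny model and random data are placeholders, not the project's actual evaluation script):

```python
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader
from transformers import GPT2Config, GPT2LMHeadModel

accelerator = Accelerator()
model = GPT2LMHeadModel(GPT2Config(n_layer=2, n_head=2, n_embd=64))  # tiny stand-in model
data = torch.randint(0, 50257, (32, 16))                             # fake token ids
eval_dataloader = DataLoader(data, batch_size=8)
model, eval_dataloader = accelerator.prepare(model, eval_dataloader)

model.eval()
losses = []
for batch in eval_dataloader:
    with torch.no_grad():
        outputs = model(batch, labels=batch)
    # Gather the per-batch loss from every process before averaging.
    losses.append(accelerator.gather(outputs.loss.repeat(batch.shape[0])))
loss = torch.mean(torch.cat(losses))
accelerator.print(f"loss: {loss:.4f}, perplexity: {torch.exp(loss):.2f}")
```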
- 23 May, 2022 1 commit
Loubna Ben Allal authored
* average loss over batches and accumulated steps for tracking
* fix layernorm weight decay
* use AdamW from PyTorch instead of Transformers
* add shuffling of sequences inside the batches
* add shuffling of sequences inside the batches
* add logging dir and reformat code
* fix lr tracking
* remove Mistral scaling
* keep Mistral scaling
* reformat code
* fix error
* fix error
* use shuffling function from PyTorch
* remove argument for shuffling batch sequences as it isn't optional
* update package versions and install accelerate from source
* remove unused package
* Update loss average over accumulated steps
  Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
* Update loss average over accumulated steps
  Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
* use one shuffle buffer argument
* compute avg_loss in one line

Co-authored-by: Loubna ben allal <loubnabenallal@gmail.com>
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
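Two of these changes lend themselves to a short illustration: keeping LayerNorm parameters and biases out of weight decay, and tracking the loss averaged over gradient-accumulation steps. A minimal sketch, assuming a standard PyTorch loop (`get_grouped_params` and the toy model are illustrative, not the training script's exact code):

```python
import torch
from torch import nn
from torch.optim import AdamW  # AdamW from PyTorch rather than transformers

model = nn.Sequential(nn.Linear(16, 16), nn.LayerNorm(16), nn.Linear(16, 4))

def get_grouped_params(model, weight_decay=0.1):
    """Keep LayerNorm weights and all biases out of weight decay."""
    decay, no_decay = [], []
    for module in model.modules():
        for name, param in module.named_parameters(recurse=False):
            if isinstance(module, nn.LayerNorm) or name == "bias":
                no_decay.append(param)
            else:
                decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

optimizer = AdamW(get_grouped_params(model), lr=5e-4)

# Average the loss over accumulated steps for tracking.
gradient_accumulation_steps = 4
loss_tracking = 0.0
for step in range(1, 9):
    x = torch.randn(8, 16)
    loss = model(x).pow(2).mean()  # stand-in for the LM loss
    loss_tracking += loss.item() / gradient_accumulation_steps  # avg_loss in one line
    (loss / gradient_accumulation_steps).backward()
    if step % gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        print(f"avg loss over last {gradient_accumulation_steps} micro-batches: {loss_tracking:.4f}")
        loss_tracking = 0.0
```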
- 16 May, 2022 2 commits
Loubna Ben Allal authored
* add pretokenization arguments
* add pretokenization script
* add support for pretokenized data
* reformat code
* fix run command for training
* fix model call from config
* remove a package
* add comments on pretokenization in the readme
* remove explicit parallelization
  Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
* update readme
  Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
* update readme - remove username
  Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
* update readme - remove username
  Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
* keep data parallelization
* reformat code
* reformat code
* update readme
* reformat code
* Update examples/research_projects/codeparrot/README.md
  Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Loubna ben allal <loubnabenallal@gmail.com>
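Pretokenization here means tokenizing the corpus once up front so training streams token ids instead of raw text. A minimal sketch of the idea, assuming a `datasets`-style corpus with a `content` column (the checkpoint, dataset, and repo names are placeholders):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot")     # placeholder
ds = load_dataset("codeparrot/codeparrot-clean-train", split="train")  # placeholder

def tokenize(examples):
    # Tokenize a batch of source files; training later reads input_ids directly.
    return {"input_ids": tokenizer(examples["content"])["input_ids"]}

tokenized = ds.map(tokenize, batched=True, remove_columns=ds.column_names)
tokenized.push_to_hub("my-user/codeparrot-pretokenized")  # or tokenized.save_to_disk(...)
```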
Loubna Ben Allal authored
* add new preprocessing arguments
* add new filters
* add new filters to readme
* fix config and test count, update function names and docstrings
* reformat code
* update readme
* Update readme
* rename config_test filter
  Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
* rename few_assignments filter
  Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
* rename tokenizer in arguments
  Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
* rename functions and add limit_line argument for config_test filter
* update threshold for config_test filter
  Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

Co-authored-by: Loubna ben allal <loubnabenallal@gmail.com>
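The `config_test` and `few_assignments` filters named above are heuristics over raw source text, roughly along these lines (the keyword lists and thresholds are assumptions, not the script's exact values):

```python
def is_config_or_test(example, limit_line=10, keywords=("test", "config")):
    """Heuristic: flag files whose first lines look like config or test code."""
    lines = example["content"].lower().splitlines()[:limit_line]
    return any(kw in line for line in lines for kw in keywords)

def has_few_assignments(example, minimum=4):
    """Heuristic: flag files with almost no assignments (e.g. data dumps)."""
    return example["content"].count(" = ") < minimum

# Drop any example that trips a filter:
# ds = ds.filter(lambda ex: not (is_config_or_test(ex) or has_few_assignments(ex)))
```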
- 12 May, 2022 1 commit
Sylvain Gugger authored
* Black preview
* Fixup too!
* Fix check copies
* Use the same version as the CI
* Bump black
- 04 May, 2022 1 commit
Thomas Wang authored
- 21 Apr, 2022 1 commit
Loubna Ben Allal authored
* add tflops logging and fix grad accumulation
* add accelerate tracking and checkpointing
* scale loss of last batch correctly
* fix typo
* compress loss computation
  Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
* add resume from checkpoint argument
* add load_state accelerate from checkpoint, register lr scheduler and add tflops function
* reformat code
* reformat code
* add condition on path for resume checkpoint
* combine if conditions
  Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
* add source for tflops formula

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
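TFLOPs estimates for decoder-only transformers typically follow the Megatron-LM formula; a sketch under that assumption (the commit cites its own source, and the constants here may differ from the script's):

```python
def get_tflops(batch_size, seq_length, n_layer, n_embd, vocab_size,
               elapsed_time, gradient_checkpointing=False):
    """Megatron-LM style throughput estimate, in teraFLOPs per second."""
    # forward+backward costs ~3x a forward pass; activation recomputation
    # (gradient checkpointing) adds roughly one more forward pass.
    checkpoint_factor = 4 if gradient_checkpointing else 3
    flops_per_iteration = (
        24 * checkpoint_factor * batch_size * seq_length * n_layer * n_embd**2
        * (1.0
           + seq_length / (6.0 * n_embd)              # attention term
           + vocab_size / (16.0 * n_layer * n_embd))  # logits term
    )
    return flops_per_iteration / (elapsed_time * 1e12)

# e.g. a GPT-2 XL-sized config, one iteration per minute:
print(get_tflops(batch_size=512, seq_length=1024, n_layer=48,
                 n_embd=1600, vocab_size=50257, elapsed_time=60.0))
```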
- 11 Apr, 2022 1 commit
Jia LI authored
* add simple multi-GPU completion
* add human_eval_multi_gpu
* use copy strategy to distribute across GPUs, to avoid padding
* add docstring
* update code style
* use task id to arrange output
* truncate input to avoid zero pad
* Stop the copy mechanism
* update style
* restore copies to scale better in distributed mode
* update style
* replace human eval
* Apply suggestions from code review:
  1. Tokenize all input at the same time
  2. Use attention_mask to get the input length
  3. Other small fixes
  Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
* correct typo and update docstring
* update code style
* remove num sample division constraint
* remove max len calculation
* use accelerator.gather once to speed up
* use accelerate set_seed; update accelerate version
* correct gather bug

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
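The `accelerator.gather` and `set_seed` points translate to a pattern like the following (a minimal sketch, not the human-eval script itself; shapes and names are illustrative):

```python
import torch
from accelerate import Accelerator
from accelerate.utils import set_seed

accelerator = Accelerator()
set_seed(42)  # identical sampling seed on every process

# Each process holds the completions for its own shard of tasks
# (here just a tensor tagged with the process index)...
local_outputs = torch.full((2, 8), accelerator.process_index,
                           device=accelerator.device)

# ...and a single gather at the end collects everything, which is
# cheaper than gathering inside the generation loop.
all_outputs = accelerator.gather(local_outputs)
if accelerator.is_main_process:
    print(all_outputs.shape)  # (2 * num_processes, 8)
```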
- 24 Mar, 2022 1 commit
Nathan Cooper authored
* Update readme with how to train offline and fix BPE command
* Update examples/research_projects/codeparrot/README.md
  Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
* Update examples/research_projects/codeparrot/README.md
  Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
* Update examples/research_projects/codeparrot/README.md
  Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
* Update examples/research_projects/codeparrot/README.md
  Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
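Offline training with the Hugging Face stack is normally switched on through two environment variables; a minimal sketch (the README's exact instructions may differ):

```python
# These must be set before the libraries are imported
# (or exported in the shell before launching the script).
import os

os.environ["HF_DATASETS_OFFLINE"] = "1"   # datasets: read only from the local cache
os.environ["TRANSFORMERS_OFFLINE"] = "1"  # transformers: read only from the local cache

from transformers import AutoTokenizer

# Resolves from the cache now; the model must have been downloaded beforehand.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
```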
- 12 Jan, 2022 1 commit
Leandro von Werra authored
- 23 Dec, 2021 1 commit
Leandro von Werra authored
- 13 Dec, 2021 1 commit
Nathan Cooper authored
* Add some nicety flags for better controlling evaluation
* Fix dependency issue with outdated requirement
* Add additional flag to example to ensure eval is done
* Wrap code into main function for accelerate launcher to find
* Fix valid batch size flag in readme
* Add note to install git-lfs when initializing/training the model
* Update examples/research_projects/codeparrot/scripts/arguments.py
  Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
* Update examples/research_projects/codeparrot/README.md
  Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
* Revert "Wrap code into main function for accelerate launcher to find"
  This reverts commit ff11df1c810d4df198d04b827538eb4572147ba3.
* Fix formatting issue
* Move git-lfs instructions to installation section
* Add a quick check before code generation for code evaluation
* Fix styling issue
* Update examples/research_projects/codeparrot/scripts/human_eval.py
  Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
* Make iterable dataset use passed-in tokenizer rather than globally defined one

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: ncoop57 <nac33@students.uwf.edu>
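The last bullet is plain dependency injection: the iterable dataset takes the tokenizer as a constructor argument instead of reading a global. A minimal sketch in the spirit of the project's constant-length dataset (the field names and `content` column are assumptions):

```python
import torch
from torch.utils.data import IterableDataset

class ConstantLengthDataset(IterableDataset):
    """Streams fixed-length token sequences from raw text examples."""

    def __init__(self, tokenizer, dataset, seq_length=1024):
        self.tokenizer = tokenizer  # passed in, not pulled from module scope
        self.dataset = dataset
        self.seq_length = seq_length

    def __iter__(self):
        buffer = []
        for example in self.dataset:
            buffer.extend(self.tokenizer(example["content"])["input_ids"])
            while len(buffer) >= self.seq_length:
                yield torch.tensor(buffer[: self.seq_length])
                buffer = buffer[self.seq_length :]

# usage: ConstantLengthDataset(tokenizer, raw_dataset) yields LongTensors of seq_length
```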
- 02 Dec, 2021 1 commit
Leandro von Werra authored
* add readme skeleton
* update readme
* add initialization script
* add deduplication script
* add codeparrot training script
* add code generation evaluation
* add validation loss script
* add requirements
* update readme
* tweak readme
* make style
* add highlights to readme
* add CLIs to scripts
* add tokenizer training script
* add docstring to constant length dataset
* fix defaults in arguments
* update readme with cli
* move image to hub
* tweaks of readme
* fix cli commands
* add author
* explain env variables
* fix formatting
* Update examples/research_projects/codeparrot/README.md
  Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
* Apply suggestions from code review
  Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
* replace generic with gpt2 tokenizer

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
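Two of these pieces, the tokenizer training script and the switch from a generic to the GPT-2 tokenizer, combine naturally into one sketch: retrain a BPE tokenizer on the code corpus starting from GPT-2 (the dataset name, batch sizes, and vocab size below are assumptions):

```python
from itertools import islice

from datasets import load_dataset
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("gpt2")  # start from the GPT-2 BPE tokenizer
ds = load_dataset("codeparrot/codeparrot-clean-train",  # placeholder dataset
                  split="train", streaming=True)

def batch_iterator(batch_size=1000, n_batches=100):
    # Yield batches of raw source files for tokenizer training.
    it = iter(ds)
    for _ in range(n_batches):
        batch = [ex["content"] for ex in islice(it, batch_size)]
        if not batch:
            return
        yield batch

new_tokenizer = base.train_new_from_iterator(batch_iterator(), vocab_size=32768)
new_tokenizer.save_pretrained("codeparrot-tokenizer")
```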