"tests/quantization/vscode:/vscode.git/clone" did not exist on "55db70c63de2c07b6ffe36f24c0e7df8f967e935"
- 10 Jul, 2024 1 commit
yukionfire authored
-
- 16 Feb, 2024 1 commit
Lysandre Debut authored
* Script & manual edits
* Update
-
- 02 Feb, 2024 1 commit
Klaus Hipp authored
* Fix typos and grammar mistakes in docs and examples
* Fix typos in docstrings and comments
* Fix spelling of `tokenizer` in model tests
* Remove erroneous spaces in decorators
* Remove extra spaces in Markdown link texts
-
- 16 Nov, 2023 1 commit
Lucain authored
* Set usedforsecurity=False in hashlib methods (FIPS compliance)
* trigger ci
* tokenizers version
* deps
* bump hfh version
* let's try this
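The usedforsecurity flag referenced in this commit is a standard-library option (Python 3.9+) that marks a hash call as non-cryptographic so it is permitted in FIPS mode. A minimal sketch of the pattern, assuming the digest is only used as a fingerprint:

```python
import hashlib

def file_fingerprint(data: bytes) -> str:
    # MD5 here is a cache key, not a security primitive, so flag it as such
    # to keep the call legal on FIPS-enabled builds (requires Python 3.9+).
    return hashlib.md5(data, usedforsecurity=False).hexdigest()

print(file_fingerprint(b"hello world"))
```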
-
- 27 Jun, 2023 1 commit
Xiaoli Wang authored
* Fix TypeError: Object of type int64 is not JSON serializable
* Convert numpy.float64 and numpy.int64 to float and int for json serialization
* Black reformatted examples/pytorch/token-classification/run_ner_no_trainer.py
* make style
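The serialization fix in this commit works around the fact that the json module cannot encode NumPy scalar types. A sketch of one common approach (not necessarily the exact change): cast NumPy scalars to native Python types in a default hook.

```python
import json
import numpy as np

def to_native(obj):
    # json.dumps calls this hook for any object it cannot serialize itself.
    if isinstance(obj, np.integer):
        return int(obj)
    if isinstance(obj, np.floating):
        return float(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

metrics = {"epoch": np.int64(3), "eval_f1": np.float64(0.87)}
print(json.dumps(metrics, default=to_native))  # {"epoch": 3, "eval_f1": 0.87}
```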
-
- 06 Jun, 2023 1 commit
Zachary Mueller authored
Act on deprecation
-
- 22 Feb, 2023 1 commit
Aaron Gokaslan authored
-
- 06 Feb, 2023 1 commit
Sylvain Gugger authored
* Result of black 23.1
* Update target to Python 3.7
* Switch flake8 to ruff
* Configure isort
* Apply isort with line limit
* Put the right black version
* adapt black in check copies
* Fix copies
-
- 18 Aug, 2022 1 commit
Loubna Ben Allal authored
* add examples subfolder
* mention examples in codeparrot readme
* use Trainer optimizer and scheduler type and add output_dir as argument
* add example of text-to-python and python-to-text models
* mention the downstream examples in the readme
* fix typo
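The "Trainer optimizer and scheduler type" item suggests the downstream examples select both by name through TrainingArguments and expose output_dir as a CLI argument. A hedged sketch of what such a configuration can look like; the concrete values below are illustrative, not the example's defaults:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./codeparrot-text-to-python",  # exposed as a command-line argument
    optim="adamw_torch",                       # optimizer chosen by name
    lr_scheduler_type="cosine",                # LR schedule chosen by name
    learning_rate=5e-5,
    num_train_epochs=3,
)
```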
-
- 28 Jul, 2022 1 commit
Loubna Ben Allal authored
* ignore whitespaces for hash
* reformat code
* Update README.md
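"Ignore whitespaces for hash" points at a standard exact-deduplication trick: hash a whitespace-stripped version of each file so that formatting-only differences still collide. A minimal sketch of that idea:

```python
import hashlib
import re

def content_hash(code: str) -> str:
    # Drop all whitespace so reformatted copies of the same code hash identically.
    normalized = re.sub(r"\s+", "", code)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

assert content_hash("def f(x):\n    return x") == content_hash("def f(x):  return x")
```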
-
- 27 Jul, 2022 1 commit
Loubna Ben Allal authored
* add info about megatron training
* upload models and datasets from CodeParrot organization
* Update examples/research_projects/codeparrot/README.md
* fix typo and add comment about codeparrot vs megatron
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
-
- 21 Jun, 2022 1 commit
Jia LI authored
* deduplication draft
* update style and style test
* dummy test main
* rename modules
* rename functions
* return extremes in deduplicate_clusters
* cast str for gzip
* update docstring
* time processing
* use dataset map to compute minhash
* fill value for short token
* remove da map method
* use shared object to multiprocess
* use f-string and minor fix
* use module parameters
* change ds_dedup to ds_filter
* save ds_dedup
* mv test to script tests
* make jaccard threshold a parameter of deduplicate_dataset
* add doc strings
* add doc string for DuplicationIndex
* save files into data dir
* update readme
* Update examples/research_projects/codeparrot/README.md
* make near deduplication optional
* move near deduplication in README
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Loubna Ben Allal <44069155+loubnabnl@users.noreply.github.com>
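The near deduplication added here is MinHash/Jaccard based: each file gets a MinHash signature, locality-sensitive hashing proposes candidate clusters, and clusters above a Jaccard threshold are filtered out. A rough sketch of the technique using the datasketch library; the real script wraps similar primitives in its own DuplicationIndex and deduplicate_clusters helpers, so treat this as an illustration rather than the script's code:

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 256
JACCARD_THRESHOLD = 0.85  # exposed as a parameter of deduplicate_dataset in the script

def minhash_of(code: str) -> MinHash:
    mh = MinHash(num_perm=NUM_PERM)
    for token in code.split():
        mh.update(token.encode("utf-8"))
    return mh

def near_duplicate_keys(files: dict) -> set:
    """Return keys of files that are likely near-duplicates of an earlier file."""
    lsh = MinHashLSH(threshold=JACCARD_THRESHOLD, num_perm=NUM_PERM)
    duplicates = set()
    for key, code in files.items():
        mh = minhash_of(code)
        if lsh.query(mh):        # an earlier file is a likely near-duplicate
            duplicates.add(key)
        else:
            lsh.insert(key, mh)
    return duplicates
```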
-
- 23 May, 2022 1 commit
Loubna Ben Allal authored
* average loss over batches and accumulated steps for tracking
* fix layernorm weight decay
* use AdamW from PyTorch instead of Transformers
* add shuffling of sequences inside the batches
* add logging dir and reformat code
* fix lr tracking
* remove Mistral scaling
* keep Mistral scaling
* reformat code
* fix error
* use shuffling function from PyTorch
* remove argument for shuffling batch sequences as it isn't optional
* update package versions and install accelerate from source
* remove unused package
* Update loss average over accumulated steps
* use one shuffle buffer argument
* compute avg_loss in one line
Co-authored-by: Loubna ben allal <loubnabenallal@gmail.com>
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
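"Fix layernorm weight decay" and "use AdamW from PyTorch instead of Transformers" correspond to the usual pattern of building two parameter groups, so biases and LayerNorm weights are excluded from weight decay, then feeding them to torch.optim.AdamW. A sketch of that pattern; the no-decay name patterns below assume a GPT-2-style module layout:

```python
from torch.optim import AdamW

def get_grouped_params(model, weight_decay=0.1,
                       no_decay=("bias", "ln_1.weight", "ln_2.weight", "ln_f.weight")):
    """Split parameters so biases and LayerNorm weights get no weight decay."""
    decay, skip = [], []
    for name, param in model.named_parameters():
        (skip if any(pattern in name for pattern in no_decay) else decay).append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": skip, "weight_decay": 0.0},
    ]

# optimizer = AdamW(get_grouped_params(model), lr=5e-4)  # model: any GPT-2-style nn.Module
```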
-
- 16 May, 2022 2 commits
Loubna Ben Allal authored
* add pretokenization arguments
* add pretokenization script
* add support for pretokenized data
* reformat code
* fix run command for training
* fix model call from config
* remove a package
* add comments on pretokenization in the readme
* remove explicit parallelization
* update readme and remove username
* keep data parallelization
* Update examples/research_projects/codeparrot/README.md
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Loubna ben allal <loubnabenallal@gmail.com>
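Pretokenization here means tokenizing the corpus once with datasets.map and saving the token ids, so training runs skip the repeated tokenization cost; num_proc keeps the CPU-level data parallelism the commit mentions. A hedged sketch of the idea, with placeholder data files and column names:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")                     # placeholder tokenizer
raw = load_dataset("json", data_files="train.jsonl", split="train")  # placeholder corpus

def tokenize(batch):
    # Keep only the token ids; training later streams the pretokenized examples.
    return {"input_ids": tokenizer(batch["content"], truncation=False)["input_ids"]}

tokenized = raw.map(tokenize, batched=True, num_proc=8, remove_columns=raw.column_names)
tokenized.save_to_disk("tokenized-train")
```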
-
Loubna Ben Allal authored
* add new preprocessing arguments
* add new filters
* add new filters to readme
* fix config and test count, update function names and docstrings
* reformat code
* update readme
* rename config_test filter
* rename few_assignments filter
* rename tokenizer in arguments
* rename functions and add limit_line argument for config_test filter
* update threshold for config_test filter
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Loubna ben allal <loubnabenallal@gmail.com>
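The filters added in this commit drop files that are unlikely to help training, such as configuration or test files and files with almost no assignments. A rough sketch of what such filters can look like; the function names, keywords, and thresholds below are illustrative and not the script's exact ones:

```python
def looks_like_config_or_test(example, limit_lines=100, threshold=0.05):
    """Flag files whose first lines are dominated by config/test keywords."""
    lines = example["content"].lower().splitlines()[:limit_lines]
    keywords = ("config", "test", "unittest", "pytest")
    hits = sum(any(k in line for k in keywords) for line in lines)
    return hits / max(len(lines), 1) > threshold

def has_few_assignments(example, minimum=4):
    """Flag files with almost no '=' assignments (often data dumps, not code)."""
    return example["content"].count("=") < minimum

# ds = ds.filter(lambda ex: not looks_like_config_or_test(ex) and not has_few_assignments(ex))
```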
-
- 12 May, 2022 1 commit
Sylvain Gugger authored
* Black preview
* Fixup too!
* Fix check copies
* Use the same version as the CI
* Bump black
-
- 04 May, 2022 1 commit
Thomas Wang authored
-
- 21 Apr, 2022 1 commit
Loubna Ben Allal authored
* add tflops logging and fix grad accumulation
* add accelerate tracking and checkpointing
* scale loss of last batch correctly
* fix typo
* compress loss computation
* add resume-from-checkpoint argument
* add accelerate load_state from checkpoint, register lr scheduler and add tflops function
* reformat code
* add condition on path for resume checkpoint
* combine if conditions
* add source for tflops formula
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
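The checkpointing items map onto Accelerate's state API: the LR scheduler is registered for checkpointing, the full training state is saved at intervals, and a run resumes through load_state when a checkpoint path is passed in. A sketch assuming a standard Accelerate training loop; paths and the interval are illustrative:

```python
from accelerate import Accelerator

accelerator = Accelerator()

def setup_checkpointing(lr_scheduler, resume_from=None):
    # Registered objects are captured by save_state and restored by load_state.
    accelerator.register_for_checkpointing(lr_scheduler)
    if resume_from is not None:          # e.g. "./checkpoints/step_5000" from a CLI flag
        accelerator.load_state(resume_from)

def maybe_save_checkpoint(step, save_interval=1000):
    if step > 0 and step % save_interval == 0:
        accelerator.save_state(f"./checkpoints/step_{step}")
```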
-
- 11 Apr, 2022 1 commit
Jia LI authored
* add simple multi-GPU completion
* add human_eval_multi_gpu
* use copy strategy to distribute across GPUs, to avoid padding
* add docstring
* update code style
* use task id to arrange output
* truncate input to avoid zero pad
* stop the copy mechanism
* restore copies to scale better in distributed mode
* replace human eval
* Apply suggestions from code review: tokenize all input at the same time, use attention_mask to get the input length, other small fixes
* correct typo and update docstring
* remove num sample division constraint
* remove max len calculation
* use accelerator.gather once to speed up
* use accelerate set_seed; update accelerate version
* correct gather bug
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
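The multi-GPU HumanEval change spreads generation across processes, reads prompt lengths from the attention mask instead of relying on padding, and gathers results once to limit communication. A sketch of the gathering side, assuming an Accelerate setup with illustrative names:

```python
import torch
from accelerate import Accelerator
from accelerate.utils import set_seed

accelerator = Accelerator()
set_seed(42)  # same seed on every process

def gather_generations(generated_tokens: torch.Tensor, task_ids: torch.Tensor):
    # Sequences may differ in length across processes, so pad first,
    # then gather once instead of after every batch to reduce communication.
    generated_tokens = accelerator.pad_across_processes(generated_tokens, dim=1, pad_index=0)
    gathered_tokens = accelerator.gather(generated_tokens)
    gathered_ids = accelerator.gather(task_ids)
    # Task ids let the main process re-associate each completion with its prompt.
    return gathered_tokens, gathered_ids
```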
-
- 12 Jan, 2022 1 commit
Leandro von Werra authored
-
- 23 Dec, 2021 1 commit
Leandro von Werra authored
-
- 13 Dec, 2021 1 commit
Nathan Cooper authored
* Add some nicety flags for better control of evaluation
* Fix dependency issue with outdated requirement
* Add additional flag to example to ensure eval is done
* Wrap code into main function for accelerate launcher to find
* Fix valid batch size flag in readme
* Add note to install git-lfs when initializing/training the model
* Update examples/research_projects/codeparrot/scripts/arguments.py
* Update examples/research_projects/codeparrot/README.md
* Revert "Wrap code into main function for accelerate launcher to find" (reverts commit ff11df1c810d4df198d04b827538eb4572147ba3)
* Fix formatting issue
* Move git-lfs instructions to installation section
* Add a quick check before code generation for code evaluation
* Fix styling issue
* Update examples/research_projects/codeparrot/scripts/human_eval.py
* Make iterable dataset use the passed-in tokenizer rather than a globally defined one
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: ncoop57 <nac33@students.uwf.edu>
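The last item, making the iterable dataset use a passed-in tokenizer, is a dependency-injection fix: the dataset receives the tokenizer as a constructor argument instead of reading a module-level global. A minimal sketch of that shape; the constant length dataset mentioned in the 02 Dec, 2021 entry additionally packs token streams into fixed-length sequences, which this sketch omits:

```python
import torch
from torch.utils.data import IterableDataset

class TokenizedCodeDataset(IterableDataset):
    """Simplified, hypothetical stand-in for an iterable dataset over code samples."""

    def __init__(self, tokenizer, texts, max_length=1024):
        self.tokenizer = tokenizer  # injected, not pulled from a global
        self.texts = texts
        self.max_length = max_length

    def __iter__(self):
        for text in self.texts:
            ids = self.tokenizer(text, truncation=True, max_length=self.max_length)["input_ids"]
            yield torch.tensor(ids)
```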
-
- 02 Dec, 2021 1 commit
Leandro von Werra authored
* add readme skeleton
* add initialization script
* add deduplication script
* add codeparrot training script
* add code generation evaluation
* add validation loss script
* add requirements
* update and tweak readme
* make style
* add highlights to readme
* add CLIs to scripts
* add tokenizer training script
* add docstring to constant length dataset
* fix defaults in arguments
* update readme with CLI
* move image to hub
* fix CLI commands
* add author
* explain env variables
* fix formatting
* Update examples/research_projects/codeparrot/README.md
* Apply suggestions from code review
* replace generic with gpt2 tokenizer
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
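The tokenizer training script and the "replace generic with gpt2 tokenizer" item amount to retraining a GPT-2-style BPE tokenizer on the code corpus, which transformers supports via train_new_from_iterator. A hedged sketch; the corpus file, column name, and vocabulary size are placeholders:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

base_tokenizer = AutoTokenizer.from_pretrained("gpt2")
dataset = load_dataset("json", data_files="code.jsonl", split="train")  # placeholder corpus

def batch_iterator(batch_size=1000):
    for start in range(0, len(dataset), batch_size):
        yield dataset[start : start + batch_size]["content"]

# Learn a new BPE vocabulary on code while keeping the GPT-2 tokenization pipeline.
new_tokenizer = base_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=32768)
new_tokenizer.save_pretrained("codeparrot-tokenizer")
```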
-