    Terminator strings for generate() (#28932) · 0d84901c
    Matt authored
    
    
    * stash commit (will discard all of this)
    
    * stash commit
    
    * First commit - needs a lot of testing!
    
    * Add a test
    
    * Fix imports and make the tests actually test something
    
    * Tests pass!
    
    * Rearrange test
    
    * Add comments (but it's still a bit confusing)
    
    * Stop storing the tokenizer
    
    * Comment fixup
    
    * Fix for input_ids with a single sequence
    
    * Update tests to test single sequences
    
    * make fixup
    
    * Fix incorrect use of isin()
    
    * Expand tests to catch more cases
    
    * Expand tests to catch more cases
    
    * make fixup
    
    * Fix length calculation and update tests
    
    * Handle Ġ as a space replacement too
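For context: markers such as Ġ (GPT-2-style BPE) and ▁ (SentencePiece) stand in for a leading space inside token strings, so matching raw token strings against stop strings requires normalizing them first. A minimal sketch of that kind of cleanup (the marker set and helper name here are illustrative assumptions, not the PR's actual code):

```python
# Sketch: normalize tokenizer-specific space markers back to plain spaces
# before comparing token strings with stop strings.
# SPACE_MARKERS and clean_token_string are illustrative names.

SPACE_MARKERS = ("Ġ", "▁")  # GPT-2-style BPE and SentencePiece markers

def clean_token_string(token: str) -> str:
    """Replace a leading space marker with a real space."""
    for marker in SPACE_MARKERS:
        if token.startswith(marker):
            return " " + token[len(marker):]
    return token

print(clean_token_string("Ġworld"))   # -> " world"
print(clean_token_string("▁hello"))   # -> " hello"
print(clean_token_string("plain"))    # -> "plain"
```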
    
    * Update src/transformers/generation/stopping_criteria.py
    Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
    
    * Add optimizations from Joao's suggestion
    
    * Remove TODO
    
    * Update src/transformers/generation/stopping_criteria.py
    Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
    
    * Update tests/generation/test_stopping_criteria.py
    Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
    
    * make fixup
    
    * Rename some variables and remove some debugging clauses for clarity
    
    * Add tests for the sub-methods
    
    * Clarify one test slightly
    
    * Add stop_strings to GenerationConfig
    
    * generate() supports a stop_strings arg, asks for the tokenizer if not provided
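A toy, framework-free sketch of the user-visible behaviour this commit describes (the dict-based "tokenizer" and function name are stand-ins for the demo, not the Hugging Face API): generation halts as soon as the decoded text ends with one of the stop strings.

```python
# Toy sketch of stop-string termination: decode and compare each step.
# The real criterion works on token IDs with vectorized tensor ops, but the
# observable contract is the same: stop once a stop string is produced.

def generate_with_stop_strings(token_stream, decode, stop_strings):
    """Consume tokens until the decoded text ends with a stop string."""
    generated = []
    for token_id in token_stream:
        generated.append(token_id)
        text = decode(generated)
        if any(text.endswith(s) for s in stop_strings):
            break
    return generated

# Stand-in vocabulary and decoder (assumptions for the demo):
vocab = {0: "Hello", 1: " world", 2: "<END>", 3: " extra"}
decode = lambda ids: "".join(vocab[i] for i in ids)

out = generate_with_stop_strings([0, 1, 2, 3], decode, ["<END>"])
print(decode(out))  # -> "Hello world<END>" (stops before " extra")
```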
    
    * make fixup
    
    * Cleanup code and rename variables for clarity
    
    * Update tokenizer error
    
    * Update tokenizer passing, handle generation on GPU
    
    * Slightly more explanation cleanup
    
    * More comment cleanup
    
    * Factor out the token cleanup so it's more obvious what we're doing, and we can change it later
    
    * Careful with that cleanup!
    
    * Cleanup + optimizations to _get_matching_positions
    
    * More minor performance tweaks
    
    * Implement caching and eliminate some expensive ops (startup time: 200ms -> 9ms)
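The startup-time win presumably comes from doing the expensive per-stop-string preprocessing once and reusing it. A generic sketch of that pattern with functools.lru_cache (the preprocessing body is a placeholder, not the PR's actual computation):

```python
# Sketch of the caching pattern: hashable inputs, memoized preprocessing.
# preprocess_stop_string is an illustrative placeholder function.
from functools import lru_cache

@lru_cache(maxsize=None)
def preprocess_stop_string(stop_string: str, vocab_items: tuple) -> tuple:
    """Expensive-once work: find vocab strings that occur in the stop string."""
    return tuple(tok for tok in vocab_items if tok in stop_string)

vocab = ("an", "answer", "<", "END", ">", "xyz")
first = preprocess_stop_string("<END>", vocab)
again = preprocess_stop_string("<END>", vocab)  # served from the cache
print(first)  # -> ('<', 'END', '>')
```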
    
    * Remove the pin_memory call
    
    * Parallelize across all stop strings!
    
    * Quick fix for tensor devices
    
    * Update embeddings test for the new format
    
    * Fix test imports
    
    * Manual patching for BERT-like tokenizers
    
    * Return a bool vector instead of a single True/False
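Returning one bool per sequence matters for batched generation, where each row can finish independently. A schematic of that contract with plain Python lists in place of tensors (names are illustrative):

```python
# Sketch: per-sequence stop flags rather than a single True/False for
# the whole batch. stop_string_hit is an illustrative name.

def stop_string_hit(decoded_batch, stop_strings):
    """One bool per sequence: has this row produced a stop string?"""
    return [any(text.endswith(s) for s in stop_strings)
            for text in decoded_batch]

batch = ["The answer is 42<END>", "Still thinking about"]
flags = stop_string_hit(batch, ["<END>"])
print(flags)  # -> [True, False]
```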
    
    * Better comment
    
    * Better comment
    
    * Add tests from @zucchini-nlp
    
    * Amy's list creation nit
    
    * tok_list -> token_list
    
    * Push a big expanded docstring (should we put it somewhere else?)
    
    * Expand docstrings
    
    * Docstring fixups
    
    * Rebase
    
    * make fixup
    
    * Make a properly general method for figuring out token strings
    
    * Fix naming throughout the functions
    
    * Move cache, refactor, fix tests
    
    * Add comment
    
    * Remove finished TODO
    
    * Remove finished TODO
    
    * make fixup
    
    * Update src/transformers/generation/stopping_criteria.py
    Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
    
    * Update and shorten docstring
    
    * Update tests to be shorter/clearer and test specific cases
    
    ---------
    Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
    Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>