• SPLIT PR: add user defined symbols and control symbols (#31305) · 1e79eade
    Ita Zaporozhets authored
    * PR SPLIT: moving original changes for adding user defined symbols
    
    * adding gemma test and generalizing gemma converter
    
    * ruff
    
    * update common test
    
    * update serialization test
    
    * update deberta v2 tests: the rust version adds '.' as a user-added token, so a space is not added
    
    * removing commented lines
    
    * applying feedback - use only added_tokens to add, and check piece.type instead of trainer_spec for user_defined_symbols
    
    * add comment referencing sentencepiece
test_tokenization_common.py 220 KB