"...lm-evaluation-harness.git" did not exist on "14a29adec81df0d1395b67cc48b3a516d912437e"
    Adding Llama FastTokenizer support. (#22264) · 1670be4b
    Nicolas Patry authored
    * Adding Llama FastTokenizer support.
    
    - Requires the `tokenizers` version from https://github.com/huggingface/tokenizers/pull/1183.
    - Only supports byte_fallback for Llama; raises otherwise (safety net).
    - Most of the remaining open questions concern special tokens.
    
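    For a quick end-to-end sanity check once this branch and the matching `tokenizers` release are installed, the fast tokenizer should be picked up directly by `AutoTokenizer`. This is a minimal sketch, not part of the PR itself: `huggingface/llama-7b` is the same placeholder checkpoint used in the test script below, and the printed class name assumes the new class is exposed as `LlamaTokenizerFast`.

    ```python
    from transformers import AutoTokenizer

    # Load the Rust-backed fast tokenizer added by this PR.
    # "huggingface/llama-7b" is a placeholder checkpoint name.
    fast_tok = AutoTokenizer.from_pretrained("huggingface/llama-7b", use_fast=True)

    print(type(fast_tok).__name__)          # expected: LlamaTokenizerFast
    print(fast_tok("This is a test")["input_ids"])
    ```
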
    How to test:
    
    ```python
    from transformers.convert_slow_tokenizer import convert_slow_tokenizer
    from transformers import AutoTokenizer
    from tokenizers import Tokenizer

    # Reference slow (sentencepiece-based) tokenizer.
    tokenizer = AutoTokenizer.from_pretrained("huggingface/llama-7b")

    # Flip to True to reload a previously converted tokenizer from disk
    # instead of converting again.
    if False:
        new_tokenizer = Tokenizer.from_file("tok.json")
    else:
        new_tokenizer = convert_slow_tokenizer(tokenizer)
        new_tokenizer.save("tok.json")

    strings = [
        "This is a test",
        "生活的真谛是",
        "生活的真谛是[MASK]。",
        # XXX: This one is problematic because of special tokens
        # "<s> Something something",
    ]

    for string in strings:
        # The slow tokenizer is the reference: the converted fast tokenizer
        # must produce the same ids and the same (stripped) decoded text.
        encoded = tokenizer(string)["input_ids"]
        encoded2 = new_tokenizer.encode(string).ids

        assert encoded == encoded2, f"{encoded} != {encoded2}"

        decoded = tokenizer.decode(encoded)
        decoded2 = new_tokenizer.decode(encoded2)

        assert decoded.strip() == decoded2, f"{repr(decoded)} != {repr(decoded2)}"
    ```
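
    The converted object above is a raw `tokenizers.Tokenizer`, not a `transformers` tokenizer. As a follow-up sketch (not part of the original test script), the `tok.json` saved by the script can be wrapped in `PreTrainedTokenizerFast` to get the usual `transformers` interface back:

    ```python
    from transformers import PreTrainedTokenizerFast

    # Wrap the converted tokenizer saved as tok.json by the script above so it
    # exposes the standard transformers API (__call__, padding, truncation, ...).
    wrapped = PreTrainedTokenizerFast(tokenizer_file="tok.json")

    print(wrapped("This is a test")["input_ids"])
    ```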
    
    Squashed intermediate commits:

    - The converter + some test script.
    - The test script.
    - Tmp save.
    - Adding Fast tokenizer + tests.
    - Adding the tokenization tests.
    - Correct combination.
    - Small fix.
    - Fixing tests.
    - Fixing with latest update.
    - Rebased.
    - Fix copies + normalized added tokens + copies.
    - Adding doc.
    - TMP.
    - Doc + split files.
    - Doc.
    - Versions + try import.
    - Fix Camembert + warnings -> Error.
    - Fix by ArthurZucker.
    - Not a decorator.
    
    * Fixing comments.
    
    * Adding more to docstring.
    
    * Doc rewriting.