    This will reduce the "Already borrowed" error: (#12550) · cc12e1db
    Nicolas Patry authored
    * This will reduce the "Already borrowed" error:
    
    Original issue: https://github.com/huggingface/tokenizers/issues/537
    
    
    
    The original issue is caused by transformers repeatedly calling
    mutating functions on the Rust tokenizers.
    Rust must guarantee that at most one agent holds a mutable reference
    to a piece of memory at any given time (for many reasons that don't need
    explaining here). Usually, the Rust compiler can verify this property
    at compile time.
    
    Unfortunately, Python cannot provide that guarantee, so PyO3, the
    bridge between Rust and Python used by `tokenizers`, trades the
    compile-time guarantee for a runtime check: if multiple agents try
    to hold mutable borrows at the same time, the runtime fails with
    "Already borrowed".
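
The runtime check described above can be pictured with a pure-Python analogy. This is only a sketch: `BorrowCell` and `_MutBorrow` are hypothetical names invented for illustration, and the code does not reproduce how PyO3 is actually implemented. It only shows the idea of refusing a second mutable borrow while the first is still alive.

```python
class BorrowCell:
    """Toy stand-in for a runtime borrow check (hypothetical, not PyO3 code)."""

    def __init__(self, value):
        self.value = value
        self._mutably_borrowed = False

    def borrow_mut(self):
        # Returns a context manager that represents one mutable borrow.
        return _MutBorrow(self)


class _MutBorrow:
    def __init__(self, cell):
        self._cell = cell

    def __enter__(self):
        # The check happens at run time, not at compile time.
        if self._cell._mutably_borrowed:
            raise RuntimeError("Already borrowed")
        self._cell._mutably_borrowed = True
        return self._cell

    def __exit__(self, exc_type, exc, tb):
        # Release the borrow when the block ends.
        self._cell._mutably_borrowed = False
        return False


cell = BorrowCell({"max_length": 512})
message = None
with cell.borrow_mut() as outer:
    outer.value["max_length"] = 128
    try:
        # A second mutable borrow while the first is active is rejected.
        with cell.borrow_mut():
            pass
    except RuntimeError as err:
        message = str(err)
```

Once the outer block ends, the borrow is released and a new mutable borrow succeeds again.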
    
    The fix proposed here in transformers is simply to reduce the number
    of calls that actually need mutable borrows. Fewer such calls means
    a lower risk of running into the "Already borrowed" error.
    The caveat is that we now add a call to read the current configuration of
    the `_tokenizer`, so in the worst case we make 2 calls instead of 1, and in
    the best case we make 1 call plus a Python dict comparison (which should
    be negligible).
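
The read-then-compare idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from this commit: `FakeBackend` is a hypothetical stand-in for the Rust-backed `_tokenizer`, `set_truncation` is a hypothetical helper, and the `mutable_calls` counter exists only to make the saving visible.

```python
class FakeBackend:
    """Hypothetical stand-in for the Rust-backed `_tokenizer`."""

    def __init__(self):
        self._truncation = None
        self.mutable_calls = 0  # counts calls that would need a mutable borrow

    @property
    def truncation(self):
        # Read-only access: returns a copy of the current configuration.
        return None if self._truncation is None else dict(self._truncation)

    def enable_truncation(self, **params):
        self.mutable_calls += 1
        self._truncation = dict(params)

    def no_truncation(self):
        self.mutable_calls += 1
        self._truncation = None


def set_truncation(backend, target):
    """Only mutate the backend when the configuration actually changes."""
    current = backend.truncation  # 1 read-only call
    if current == target:         # plain Python comparison
        return
    if target is None:
        backend.no_truncation()
    else:
        backend.enable_truncation(**target)


backend = FakeBackend()
config = {"max_length": 128, "stride": 0, "strategy": "longest_first"}
for _ in range(3):
    set_truncation(backend, config)
calls_after_repeats = backend.mutable_calls

set_truncation(backend, None)
set_truncation(backend, None)
calls_after_disable = backend.mutable_calls
```

In this sketch, three identical requests trigger a single mutating call, and disabling truncation twice triggers only one more.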
    
    * Adding a test.
    
    * trivial error :(.
    
    * Update tests/test_tokenization_fast.py
    Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
    
    * Adding reference to original issues in the tests.
    
    * Update the tests with fast tokenizer.
    Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
test_tokenization_fast.py 4.93 KB