    fix LayoutLMv3TokenizerFast subword label after 'Ġ' token (#21695) · 4e441e52
    Thibault Douzon authored
    LayoutLMv3TokenizerFast can produce an empty 'Ġ' token with `offset_mapping = (0, 0)`.
    The next token is then wrongly assumed to also be the beginning of a word, so it is
    not assigned `pad_token_label` as it should be.
    Modify the test to use text that produces a 'Ġ' token.
    Remove the copy check from LayoutLMv2TokenizerFast for `_batch_encode_plus`.
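
The labeling rule described above can be illustrated with a minimal sketch. This is not the code from the commit: the function name `align_labels`, the flag `previous_token_empty`, the default of `-100` for `pad_token_label`, and the sample offsets are all assumptions made for illustration, and special tokens are left out. It assumes offsets are relative to each word, so a start offset of 0 normally marks the first subword of a word.

```python
from typing import List, Tuple

PAD_TOKEN_LABEL = -100  # assumed value; mirrors the usual "ignore" label index


def align_labels(
    offsets: List[Tuple[int, int]],
    word_labels: List[int],
    pad_token_label: int = PAD_TOKEN_LABEL,
) -> List[int]:
    """Give each word's label to its first token and pad all other tokens.

    A token whose start offset is 0 is treated as the first token of a word,
    unless the previous token was empty (offset (0, 0)). In that case the
    empty token already consumed the word's label and this token is a subword.
    """
    labels: List[int] = []
    word_index = 0
    previous_token_empty = False  # hypothetical flag name
    for start, end in offsets:
        if start == 0 and not previous_token_empty:
            labels.append(word_labels[word_index])
            word_index += 1
        else:
            labels.append(pad_token_label)
        # Remember whether this token was empty for the next iteration.
        previous_token_empty = (start, end) == (0, 0)
    return labels


# Made-up offsets: word 0 is a single token; word 1 is an empty token
# followed by two subwords, the first of which also starts at offset 0.
example_offsets = [(0, 5), (0, 0), (0, 3), (3, 7)]
example_word_labels = [1, 2]
print(align_labels(example_offsets, example_word_labels))
```

Without the `previous_token_empty` check, the third token in this example would be treated as the start of a new word instead of receiving the padding label.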
    
    solves issue: #19978
test_tokenization_layoutlmv3.py 123 KB