src/transformers/pipelines/token_classification.py · 042f420364fe14d0b6731028e6e7884062abeccd · chenpangpang / transformers

Update pipeline word heuristic to work with whitespace in token offsets (#18402) · 042f4203

David authored Aug 02, 2022

* Update pipeline word heuristic to work with whitespace in token offsets

This change checks for whitespace in the input string at either the
character preceding the token or in the first character of the token.
This works with tokenizers that return offsets excluding whitespace
between words or with offsets including whitespace.

fixes #18111

starting

* Use smaller model, ensure expected tokenization

* Re-run CI (please squash)

042f4203

token_classification.py 20.4 KB

Replace token_classification.py