src/transformers/pipelines/token_classification.py · b3f95dceca0ca5c10e25b16aaa4e5344f6443511 · chenpangpang / transformers

Better heuristic for token-classification pipeline. (#12611) · a3bd7637

Nicolas Patry authored Jul 26, 2021

* Better heuristic for token-classification pipeline.

Relooking at the problem makes thing actually much simpler,
when we look at ids from a tokenizer, we have no way in **general**
to recover if some substring is part of a word or not.

However, within the pipeline, with offsets we still have access to the
original string, so we can simply look if previous character (if it
exists) of a token, is actually a space. This will obviously be wrong
for tokenizers that contain spaces within tokens, tokenizers where
offsets include spaces too (Don't think there are a lot).

This heuristic hopefully is fully bc and still can handle non-word based
tokenizers.

* Updating test with real values.

* We still need the older "correct" heuristic to prevent fusing
punctuation.

* Adding a real warning when important.

a3bd7637

token_classification.py 19.8 KB

Replace token_classification.py