• Nicolas Patry's avatar
    Better heuristic for token-classification pipeline. (#12611) · a3bd7637
    Nicolas Patry authored
    * Better heuristic for token-classification pipeline.
    
    Relooking at the problem makes thing actually much simpler,
    when we look at ids from a tokenizer, we have no way in **general**
    to recover if some substring is part of a word or not.
    
    However, within the pipeline, with offsets we still have access to the
    original string, so we can simply look if previous character (if it
    exists) of a token, is actually a space. This will obviously be wrong
    for tokenizers that contain spaces within tokens, tokenizers where
    offsets include spaces too (Don't think there are a lot).
    
    This heuristic hopefully is fully bc and still can handle non-word based
    tokenizers.
    
    * Updating test with real values.
    
    * We still need the older "correct" heuristic to prevent fusing
    punctuation.
    
    * Adding a real warning when important.
    a3bd7637
token_classification.py 19.8 KB