• David's avatar
    Update pipeline word heuristic to work with whitespace in token offsets (#18402) · 042f4203
    David authored
    * Update pipeline word heuristic to work with whitespace in token offsets
    
    This change checks for whitespace in the input string at either the
    character preceding the token or in the first character of the token.
    This works with tokenizers that return offsets excluding whitespace
    between words or with offsets including whitespace.
    
    fixes #18111
    
    starting
    
    * Use smaller model, ensure expected tokenization
    
    * Re-run CI (please squash)
    042f4203
token_classification.py 20.4 KB