    NerPipeline (TokenClassification) now outputs offsets of words (#8781) · d8fc26e9
    Nicolas Patry authored
    * NerPipeline (TokenClassification) now outputs offsets of words
    
    - The pipeline's output currently lacks offsets, which forces users to
    pattern match the "word" against their input; this is not always
    feasible. For instance, if a sentence contains the same word twice,
    there is no way to tell which occurrence an entity refers to.
    - This PR fixes that by adding 2 new keys to this pipeline's output,
    "start" and "end", which correspond to the string offsets of the
    word. That means that we should always have the invariant:
    
    ```python
    input[entity["start"]: entity["end"]] == entity["word"]
    # "entity_group" (or "entity" if not grouped) holds the label, not the text
    ```
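
A minimal sketch of the repeated-word problem and how the offsets resolve it. The entity dicts below are hand-written for illustration, not real pipeline output; only the "start" and "end" keys come from this PR, the labels are assumed, and `locate_by_search` is a hypothetical helper standing in for user-side pattern matching.

```python
text = "Paris loves Paris"

# Hypothetical entities, written by hand (assumption: the model tagged the
# first "Paris" as a person and the second as a location).
entities = [
    {"word": "Paris", "entity": "I-PER", "start": 0, "end": 5},
    {"word": "Paris", "entity": "I-LOC", "start": 12, "end": 17},
]


def locate_by_search(text, word):
    """Pattern matching without offsets: always finds the first occurrence."""
    return text.find(word)


# Without offsets, both entities resolve to the same position.
searched = [locate_by_search(text, e["word"]) for e in entities]

# With offsets, each entity maps to its own span of the input.
spans = [(e["start"], e["end"]) for e in entities]
recovered = [text[start:end] for start, end in spans]

print(searched)   # positions found by searching for the word
print(spans)      # positions given by "start" / "end"
print(recovered)  # text sliced back out of the input
```

Searching for the word returns the same position for both entities, while the offsets keep the two occurrences distinct and slice back to the original text.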
    
    * Fixing doc style
test_pipelines_ner.py 14.4 KB