Nicolas Patry authored
* NerPipeline (TokenClassification) now outputs offsets of words

- The offsets are currently missing, which forces the user to pattern
  match the "word" against their input. That is not always feasible:
  for instance, if a sentence contains the same word twice, there
  is no way to know which is which.
- This PR proposes to fix that by adding 2 new keys to this
  pipeline's outputs, "start" and "end", which correspond to the string
  offsets of the word. That means that we should always have the
  invariant:

```python
input[entity["start"]: entity["end"]] == entity["entity_group"]
                                    # or entity["entity"] if not grouped
```

* Fixing doc style

d8fc26e9
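
As a rough illustration of why the "start" and "end" keys help with repeated words, here is a minimal sketch that slices the input by offset instead of pattern matching. The `entity_spans` helper, the sample sentence, and the hand-written entity dicts are all hypothetical stand-ins for real pipeline output, and the sketch assumes each entity also carries a "word" key holding its surface text.

```python
def entity_spans(text, entities):
    """Return (label, surface_text, start, end) for each entity, using offsets only."""
    spans = []
    for entity in entities:
        start, end = entity["start"], entity["end"]
        # Grouped output is assumed to use "entity_group", ungrouped output "entity".
        label = entity.get("entity_group", entity.get("entity"))
        spans.append((label, text[start:end], start, end))
    return spans


text = "Paris Hilton visited Paris."

# Hand-written stand-in for pipeline output; labels and offsets are illustrative.
entities = [
    {"entity_group": "PER", "word": "Paris Hilton", "start": 0, "end": 12},
    {"entity_group": "LOC", "word": "Paris", "start": 21, "end": 26},
]

spans = entity_spans(text, entities)
for label, surface, start, end in spans:
    print(f"{label}: {surface!r} at [{start}, {end})")
```

Both occurrences of "Paris" are told apart by their offsets alone, which a search for the string "Paris" in the input could not do.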