run_ner.py / bert-base-multilingual-cased can output empty tokens (#2991)
* Use tokenizer.num_added_tokens to count the number of added special tokens instead of hardcoded numbers.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* run_ner.py: do not add a label to labels_ids if word_tokens is empty.

This can happen when using bert-base-multilingual-cased with an input containing a lone space. In that case the tokenizer returns an empty word_tokens list, which makes labels_ids inconsistent with the tokens vector: labels_ids ends up with one more entry than there are tokens.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
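
The empty word_tokens guard described above can be sketched roughly as follows. This is a minimal illustration, not the actual run_ner.py code: `align_labels`, `toy_tokenize`, and `PAD_LABEL_ID` are hypothetical names, and the toy tokenizer only imitates the reported behavior (no tokens for a whitespace-only word) so the example runs without downloading a model.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: non-first sub-tokens get a padding label id that the loss ignores.
PAD_LABEL_ID = -100


def align_labels(
    words: List[str],
    labels: List[str],
    tokenize: Callable[[str], List[str]],
    label_map: Dict[str, int],
) -> Tuple[List[str], List[int]]:
    """Build parallel token and label-id lists, skipping words that yield no tokens."""
    tokens: List[str] = []
    labels_ids: List[int] = []
    for word, label in zip(words, labels):
        word_tokens = tokenize(word)
        # Guard: a word that tokenizes to nothing must not contribute a label,
        # otherwise labels_ids grows by one entry while tokens does not.
        if not word_tokens:
            continue
        tokens.extend(word_tokens)
        # The real label goes on the first sub-token, padding ids on the rest.
        labels_ids.extend([label_map[label]] + [PAD_LABEL_ID] * (len(word_tokens) - 1))
    return tokens, labels_ids


def toy_tokenize(word: str) -> List[str]:
    # Hypothetical stand-in for a real tokenizer: whitespace-only input gives [].
    return word.split()


label_map = {"O": 0, "B-LOC": 1}
tokens, labels_ids = align_labels(
    ["Paris", " ", "France"], ["B-LOC", "O", "B-LOC"], toy_tokenize, label_map
)
assert len(tokens) == len(labels_ids)
```

Without the `if not word_tokens` check, the lone space would add one label and zero tokens, which is the mismatch this commit addresses.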