• fix: Text splitting in the BasicTokenizer (#22280) · 5739726f
    Connor Henderson authored
    * fix: Apostrophe splitting in the BasicTokenizer for CLIPTokenizer
    
    * account for apostrophe at start of new word
    
    * remove _run_split_on_punc, use re.findall instead
    
    * remove debugging, make style and quality
    
    * use pattern and punc splitting, repo-consistency will fail
    
    * remove commented out debugging
    
    * adds bool args to BasicTokenizer, remove pattern
    
    * do_split_on_punc default True
    
    * clean stray comments and line breaks
    
    * rebase, repo-consistency
    
    * update to just do punctuation split
    
    * add unicode normalizing back
    
    * remove redundant line
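
The commit message above describes a boolean switch for punctuation splitting (`do_split_on_punc`, default True) plus Unicode normalization. As a rough illustration only, the sketch below shows what such a switch might do to apostrophes. It is not the code from this commit: the helper names `is_punctuation`, `split_on_punc`, and `basic_tokenize` are hypothetical, and the choice of NFC normalization is an assumption.

```python
import unicodedata
from typing import List


def is_punctuation(ch: str) -> bool:
    """Treat ASCII symbols and Unicode 'P*' categories as punctuation (assumed rule)."""
    cp = ord(ch)
    if 33 <= cp <= 47 or 58 <= cp <= 64 or 91 <= cp <= 96 or 123 <= cp <= 126:
        return True
    return unicodedata.category(ch).startswith("P")


def split_on_punc(word: str) -> List[str]:
    """Split one whitespace-free word so every punctuation character is its own piece."""
    pieces: List[str] = []
    start_new = True
    for ch in word:
        if is_punctuation(ch):
            pieces.append(ch)
            start_new = True
        else:
            if start_new:
                pieces.append("")
                start_new = False
            pieces[-1] += ch
    return pieces


def basic_tokenize(text: str, do_split_on_punc: bool = True) -> List[str]:
    """Hypothetical stand-in for a basic tokenizer with a punctuation-splitting flag."""
    text = unicodedata.normalize("NFC", text)  # assumption: NFC normalization
    tokens: List[str] = []
    for word in text.split():
        if do_split_on_punc:
            tokens.extend(split_on_punc(word))
        else:
            tokens.append(word)  # leave "don't" intact for a later tokenizer stage
    return tokens


print(basic_tokenize("don't stop!"))
# ['don', "'", 't', 'stop', '!']
print(basic_tokenize("don't stop!", do_split_on_punc=False))
# ["don't", 'stop!']
```

With the flag disabled, contractions reach the downstream tokenizer unsplit, which appears to be the behaviour the CLIPTokenizer fix was after.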
test_tokenization_bert.py 13.9 KB