Unverified Commit cf3cf304 authored by Paul O'Leary McCann, committed by GitHub

Replace mecab-python3 with fugashi for Japanese tokenization (#6086)



* Replace mecab-python3 with fugashi

This replaces mecab-python3 with fugashi for Japanese tokenization. I am
the maintainer of both projects.

Both projects are MeCab wrappers, so the underlying C++ code is the
same. fugashi is the newer wrapper and doesn't use SWIG, which makes it
easier to use for basic access to the MeCab API.

This change ensures that a version of ipadic installed via pip is used,
which should make versioning and tracking down issues easier.

fugashi has wheels for Windows, macOS, and Linux, which will help with
the issues people have hit installing old versions of mecab-python3 on
Windows. Because fugashi doesn't use SWIG, it also doesn't require a C++
runtime to be installed on Windows, unlike mecab-python3.
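
As a rough, standalone sketch (not part of this change), basic tokenization
with fugashi and the pip-installed ipadic looks something like this; the
example sentence is arbitrary:

    import fugashi
    import ipadic

    # ipadic.MECAB_ARGS points MeCab at the pip-installed IPAdic dictionary,
    # so the dictionary version is pinned by pip rather than by whatever
    # happens to be installed system-wide.
    tagger = fugashi.GenericTagger(ipadic.MECAB_ARGS)

    # Calling the tagger yields word nodes; .surface is the token text.
    print([word.surface for word in tagger("日本語のテキストを解析します")])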

In adding this change I removed some code dealing with `cursor`,
`token_start`, and `token_end` variables. These variables didn't seem to
be used for anything; it's unclear to me why they were there.

I ran the tests and they passed, though I couldn't figure out how to run
the slow tests (`--runslow` gave an error) and didn't try testing with
TensorFlow.

* Style fix

* Remove unused variable

Forgot to delete this...

* Adapt doc with install instructions

* Fix typo
Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
parent f250beb8
@@ -56,7 +56,7 @@ jobs:
         RUN_CUSTOM_TOKENIZERS: yes
     steps:
     - checkout
-    - run: sudo pip install .[mecab,testing]
+    - run: sudo pip install .[ja,testing]
     - run: python -m pytest -s ./tests/test_tokenization_bert_japanese.py | tee output.txt
     - store_artifacts:
         path: ~/transformers/output.txt
...
@@ -74,14 +74,16 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
 | | | (see `details on dbmdz repository <https://github.com/dbmdz/german-bert>`__). |
 | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 | | ``cl-tohoku/bert-base-japanese`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
-| | | | Trained on Japanese text. Text is tokenized with MeCab and WordPiece. |
-| | | | `MeCab <https://taku910.github.io/mecab/>`__ is required for tokenization. |
+| | | | Trained on Japanese text. Text is tokenized with MeCab and WordPiece and this requires some extra dependencies, |
+| | | | `fugashi <https://github.com/polm/fugashi>`__ which is a wrapper around `MeCab <https://taku910.github.io/mecab/>`__. |
+| | | | Use ``pip install transformers["ja"]`` (or ``pip install -e .["ja"]`` if you install from source) to install them. |
 | | | | |
 | | | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__). |
 | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 | | ``cl-tohoku/bert-base-japanese-whole-word-masking`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
-| | | | Trained on Japanese text using Whole-Word-Masking. Text is tokenized with MeCab and WordPiece. |
-| | | | `MeCab <https://taku910.github.io/mecab/>`__ is required for tokenization. |
+| | | | Trained on Japanese text. Text is tokenized with MeCab and WordPiece and this requires some extra dependencies, |
+| | | | `fugashi <https://github.com/polm/fugashi>`__ which is a wrapper around `MeCab <https://taku910.github.io/mecab/>`__. |
+| | | | Use ``pip install transformers["ja"]`` (or ``pip install -e .["ja"]`` if you install from source) to install them. |
 | | | | |
 | | | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__). |
 | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
...
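
For context (not part of the diff), once the extra dependencies are installed
the checkpoints above load through the usual transformers entry points; a
minimal sketch, assuming a standard environment:

    from transformers import AutoTokenizer

    # Assumes fugashi and ipadic are present, e.g. via `pip install transformers["ja"]`.
    tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
    print(tokenizer.tokenize("日本語のテキストです"))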
@@ -10,10 +10,10 @@ known_third_party =
     faiss
     fastprogress
     fire
+    fugashi
     git
     h5py
     matplotlib
-    MeCab
     nlp
     nltk
     numpy
...
@@ -65,7 +65,7 @@ if stale_egg_info.exists():
 extras = {}
-extras["mecab"] = ["mecab-python3<1"]
+extras["ja"] = ["fugashi>=1.0", "ipadic>=1.0,<2.0"]
 extras["sklearn"] = ["scikit-learn"]
 # keras2onnx and onnxconverter-common version is specific through a commit until 1.7.0 lands on pypi
@@ -97,7 +97,7 @@ extras["quality"] = [
     "isort @ git+git://github.com/timothycrosley/isort.git@e63ae06ec7d70b06df9e528357650281a3d3ec22#egg=isort",
     "flake8",
 ]
-extras["dev"] = extras["testing"] + extras["quality"] + ["mecab-python3<1", "scikit-learn", "tensorflow", "torch"]
+extras["dev"] = extras["testing"] + extras["quality"] + extras["ja"] + ["scikit-learn", "tensorflow", "torch"]
 setup(
     name="transformers",
...
@@ -185,9 +185,14 @@ class MecabTokenizer:
         self.never_split = never_split if never_split is not None else []
         self.normalize_text = normalize_text

-        import MeCab
+        import fugashi
+        import ipadic

-        self.mecab = MeCab.Tagger(mecab_option) if mecab_option is not None else MeCab.Tagger()
+        # Use ipadic by default (later options can override it)
+        mecab_option = mecab_option or ""
+        mecab_option = ipadic.MECAB_ARGS + " " + mecab_option
+        self.mecab = fugashi.GenericTagger(mecab_option)

     def tokenize(self, text, never_split=None, **kwargs):
         """Tokenizes a piece of text."""
@@ -197,21 +202,13 @@ class MecabTokenizer:
         never_split = self.never_split + (never_split if never_split is not None else [])
         tokens = []

-        mecab_output = self.mecab.parse(text)
-
-        cursor = 0
-        for line in mecab_output.split("\n"):
-            if line == "EOS":
-                break
-
-            token, _ = line.split("\t")
-            token_start = text.index(token, cursor)
-            token_end = token_start + len(token)
+        for word in self.mecab(text):
+            token = word.surface
             if self.do_lower_case and token not in never_split:
                 token = token.lower()
             tokens.append(token)
-            cursor = token_end

         return tokens
...
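
Stripped of the class plumbing, the new tokenization path amounts to roughly
the following sketch (the helper function name is made up for illustration;
the logic mirrors the diff above):

    import fugashi
    import ipadic

    # A tagger using the pip-installed ipadic dictionary; extra MeCab options
    # could be appended to this argument string, as the updated MecabTokenizer does.
    tagger = fugashi.GenericTagger(ipadic.MECAB_ARGS)

    def mecab_tokenize(text, do_lower_case=False, never_split=()):
        tokens = []
        for word in tagger(text):
            token = word.surface
            if do_lower_case and token not in never_split:
                token = token.lower()
            tokens.append(token)
        return tokens

    print(mecab_tokenize("日本語のテキストを解析します"))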