Unverified Commit 9519f0cd authored by Jeroen Steggink, committed by GitHub

Wrong model used in example; should be the character model instead of the subword model (#12676)



* Wrong model is used; should be character instead of subword

In the original Google repo for CANINE there was a mixup in the model names in the README.md, which was fixed two weeks ago. Since this Transformers model was added before that fix, it probably resulted in the wrong model being used in this example.

s = subword, c = character

* canine.rst style fix

* Update docs/source/model_doc/canine.rst
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Styling canine.rst

* Added links to model cards.

* Fixed links to model cards.
Co-authored-by: Jeroen Steggink <978411+jsteggink@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
parent 5803a2a7
@@ -48,6 +48,12 @@ Tips:
   (which has a predefined Unicode code point). For token classification tasks however, the downsampled sequence of
   tokens needs to be upsampled again to match the length of the original character sequence (which is 2048). The
   details for this can be found in the paper.
+- Models:
+
+    - `google/canine-c <https://huggingface.co/google/canine-c>`__: Pre-trained with autoregressive character loss,
+      12-layer, 768-hidden, 12-heads, 121M parameters (size ~500 MB).
+    - `google/canine-s <https://huggingface.co/google/canine-s>`__: Pre-trained with subword loss, 12-layer,
+      768-hidden, 12-heads, 121M parameters (size ~500 MB).
 
 This model was contributed by `nielsr <https://huggingface.co/nielsr>`__. The original code can be found `here
 <https://github.com/google-research/language/tree/master/language/canine>`__.
@@ -63,7 +69,7 @@ CANINE works on raw characters, so it can be used without a tokenizer:
     from transformers import CanineModel
     import torch
 
-    model = CanineModel.from_pretrained('google/canine-s') # model pre-trained with autoregressive character loss
+    model = CanineModel.from_pretrained('google/canine-c') # model pre-trained with autoregressive character loss
 
     text = "hello world"
     # use Python's built-in ord() function to turn each character into its unicode code point id
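For reference, here is a sketch of how the corrected raw-character snippet runs end to end. The `input_ids` construction follows the docs' own comment; the forward pass and the two output attributes at the end are standard `CanineModel` outputs, not lines shown in this diff:

```python
from transformers import CanineModel
import torch

# the character-loss checkpoint, as corrected by this commit
model = CanineModel.from_pretrained('google/canine-c')

text = "hello world"
# use Python's built-in ord() to turn each character into its unicode code point id,
# wrapped in an extra list to add a batch dimension
input_ids = torch.tensor([[ord(char) for char in text]])

outputs = model(input_ids)  # forward pass
sequence_output = outputs.last_hidden_state  # per-character hidden states, shape (1, 11, 768)
pooled_output = outputs.pooler_output  # pooled representation, shape (1, 768)
```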
@@ -81,8 +87,8 @@ sequences to the same length):
     from transformers import CanineTokenizer, CanineModel
 
-    model = CanineModel.from_pretrained('google/canine-s')
-    tokenizer = CanineTokenizer.from_pretrained('google/canine-s')
+    model = CanineModel.from_pretrained('google/canine-c')
+    tokenizer = CanineTokenizer.from_pretrained('google/canine-c')
 
     inputs = ["Life is like a box of chocolates.", "You never know what you gonna get."]
     encoding = tokenizer(inputs, padding="longest", truncation=True, return_tensors="pt")
...
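Likewise, a sketch of the corrected batched snippet in full; the final forward pass is an assumption following the usual `transformers` pattern of unpacking the tokenizer's encoding into the model, since the diff is truncated before that point:

```python
from transformers import CanineTokenizer, CanineModel

model = CanineModel.from_pretrained('google/canine-c')
tokenizer = CanineTokenizer.from_pretrained('google/canine-c')

inputs = ["Life is like a box of chocolates.", "You never know what you gonna get."]
# pad both sequences to the same length and return PyTorch tensors
encoding = tokenizer(inputs, padding="longest", truncation=True, return_tensors="pt")

outputs = model(**encoding)  # batched forward pass
sequence_output = outputs.last_hidden_state  # shape (2, padded_length, 768)
```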