chenpangpang / transformers · Commit cbad305c

[docs] The use of `do_lower_case` in scripts is on its way to deprecation (#3738)

Unverified commit, authored Apr 10, 2020 by Julien Chaumond and committed via GitHub on Apr 10, 2020.
Parent: b169ac9c
Showing 4 changed files, with 4 additions and 20 deletions (+4 −20):
- README.md (+0 −3)
- docs/source/serialization.rst (+4 −4)
- examples/README.md (+0 −10)
- valohai.yaml (+0 −3)
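The substance of the change is easier to see in code than in the hunks below: the tokenizer that ships with an uncased checkpoint already lowercases by itself, so the example scripts no longer need an explicit `--do_lower_case` flag. A minimal sketch of that behaviour, not part of the commit, assuming the transformers API of this era (April 2020):

```python
# Sketch only: an uncased checkpoint's tokenizer lowercases on its own,
# a cased one preserves case, with no CLI flag needed either way.
from transformers import BertTokenizer

uncased = BertTokenizer.from_pretrained("bert-base-uncased")
print(uncased.tokenize("John Smith"))  # ['john', 'smith']

cased = BertTokenizer.from_pretrained("bert-base-cased")
print(cased.tokenize("John Smith"))    # ['John', 'Smith']
```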
README.md
```diff
@@ -337,7 +337,6 @@ python ./examples/run_glue.py \
   --task_name $TASK_NAME \
   --do_train \
   --do_eval \
-  --do_lower_case \
   --data_dir $GLUE_DIR/$TASK_NAME \
   --max_seq_length 128 \
   --per_gpu_eval_batch_size=8 \
```
```diff
@@ -391,7 +390,6 @@ python -m torch.distributed.launch --nproc_per_node 8 ./examples/run_glue.py \
   --task_name MRPC \
   --do_train \
   --do_eval \
-  --do_lower_case \
   --data_dir $GLUE_DIR/MRPC/ \
   --max_seq_length 128 \
   --per_gpu_eval_batch_size=8 \
```
```diff
@@ -424,7 +422,6 @@ python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
   --model_name_or_path bert-large-uncased-whole-word-masking \
   --do_train \
   --do_eval \
-  --do_lower_case \
   --train_file $SQUAD_DIR/train-v1.1.json \
   --predict_file $SQUAD_DIR/dev-v1.1.json \
   --learning_rate 3e-5 \
```
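Hunks like these recur throughout the repository; the common thread is that the flag is dropped because the tokenizer's own setting already covers it. If you want to check what a given BERT checkpoint will do, a hedged sketch (the `basic_tokenizer.do_lower_case` attribute is BERT-specific; other model families differ):

```python
# Sketch: inspect a loaded BERT tokenizer's casing behaviour directly
# instead of relying on a CLI flag.
from transformers import BertTokenizer

for name in ("bert-base-uncased", "bert-base-cased"):
    tok = BertTokenizer.from_pretrained(name)
    print(name, "->", tok.basic_tokenizer.do_lower_case)
# bert-base-uncased -> True
# bert-base-cased -> False
```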
docs/source/serialization.rst
```diff
@@ -58,14 +58,14 @@ where
 ``Uncased`` means that the text has been lowercased before WordPiece tokenization, e.g., ``John Smith`` becomes ``john smith``.
 The Uncased model also strips out any accent markers. ``Cased`` means that the true case and accent markers are preserved.
 Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity
 Recognition or Part-of-Speech tagging). For information about the Multilingual and Chinese model, see the `Multilingual README
 <https://github.com/google-research/bert/blob/master/multilingual.md>`__ or the original TensorFlow repository.

-When using an ``uncased model``\ , make sure to pass ``--do_lower_case`` to the example training scripts (or pass ``do_lower_case=True`` to FullTokenizer if you're using your own script and loading the tokenizer your-self.).
+When using an ``uncased model``\ , make sure your tokenizer has ``do_lower_case=True`` (either in its configuration, or passed as an additional parameter).

 Examples:

 .. code-block:: python

    # BERT
-   tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True, do_basic_tokenize=True)
+   tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize=True)
    model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

    # OpenAI GPT
```
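The new wording above offers two routes: lowercasing configured on the checkpoint, or passed explicitly. A sketch of the explicit route, assuming (correctly, for this era of the library) that `from_pretrained` forwards extra keyword arguments to the tokenizer constructor:

```python
# Sketch of "passed as an additional parameter": force lowercasing even
# though bert-base-cased does not configure it. The exact subtokens depend
# on the cased vocabulary, but input is lowercased before WordPiece.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased", do_lower_case=True)
print(tokenizer.tokenize("John Smith"))
```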
```diff
@@ -140,13 +140,13 @@ Here is the recommended way of saving the model, configuration and vocabulary to
    torch.save(model_to_save.state_dict(), output_model_file)
    model_to_save.config.to_json_file(output_config_file)
-   tokenizer.save_vocabulary(output_dir)
+   tokenizer.save_pretrained(output_dir)

    # Step 2: Re-load the saved model and vocabulary

    # Example for a Bert model
    model = BertForQuestionAnswering.from_pretrained(output_dir)
-   tokenizer = BertTokenizer.from_pretrained(output_dir, do_lower_case=args.do_lower_case)  # Add specific options if needed
+   tokenizer = BertTokenizer.from_pretrained(output_dir)  # Add specific options if needed
    # Example for a GPT model
    model = OpenAIGPTDoubleHeadsModel.from_pretrained(output_dir)
    tokenizer = OpenAIGPTTokenizer.from_pretrained(output_dir)
```
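The switch from `save_vocabulary` to `save_pretrained` in this hunk is what makes the simpler `from_pretrained(output_dir)` call sufficient: `save_pretrained` also writes the tokenizer's configuration, including options such as `do_lower_case`, so reloading restores them. A hedged roundtrip sketch (`./my_model_dir` is a placeholder path, not from the commit):

```python
# Sketch: save_pretrained writes the vocab plus tokenizer config, so the
# do_lower_case setting survives a save/reload without being re-passed.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.save_pretrained("./my_model_dir")      # hypothetical output dir
reloaded = BertTokenizer.from_pretrained("./my_model_dir")
print(reloaded.basic_tokenizer.do_lower_case)    # True, restored from config
```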
examples/README.md
```diff
@@ -168,7 +168,6 @@ python run_glue.py \
   --task_name $TASK_NAME \
   --do_train \
   --do_eval \
-  --do_lower_case \
   --data_dir $GLUE_DIR/$TASK_NAME \
   --max_seq_length 128 \
   --per_gpu_train_batch_size 32 \
```
```diff
@@ -209,7 +208,6 @@ python run_glue.py \
   --task_name MRPC \
   --do_train \
   --do_eval \
-  --do_lower_case \
   --data_dir $GLUE_DIR/MRPC/ \
   --max_seq_length 128 \
   --per_gpu_train_batch_size 32 \
```
```diff
@@ -236,7 +234,6 @@ python run_glue.py \
   --task_name MRPC \
   --do_train \
   --do_eval \
-  --do_lower_case \
   --data_dir $GLUE_DIR/MRPC/ \
   --max_seq_length 128 \
   --per_gpu_train_batch_size 32 \
```
```diff
@@ -261,7 +258,6 @@ python -m torch.distributed.launch \
   --task_name MRPC \
   --do_train \
   --do_eval \
-  --do_lower_case \
   --data_dir $GLUE_DIR/MRPC/ \
   --max_seq_length 128 \
   --per_gpu_train_batch_size 8 \
```
```diff
@@ -295,7 +291,6 @@ python -m torch.distributed.launch \
   --task_name mnli \
   --do_train \
   --do_eval \
-  --do_lower_case \
   --data_dir $GLUE_DIR/MNLI/ \
   --max_seq_length 128 \
   --per_gpu_train_batch_size 8 \
```
```diff
@@ -336,7 +331,6 @@ python ./examples/run_multiple_choice.py \
   --model_name_or_path roberta-base \
   --do_train \
   --do_eval \
-  --do_lower_case \
   --data_dir $SWAG_DIR \
   --learning_rate 5e-5 \
   --num_train_epochs 3 \
```
```diff
@@ -382,7 +376,6 @@ python run_squad.py \
   --model_name_or_path bert-base-uncased \
   --do_train \
   --do_eval \
-  --do_lower_case \
   --train_file $SQUAD_DIR/train-v1.1.json \
   --predict_file $SQUAD_DIR/dev-v1.1.json \
   --per_gpu_train_batch_size 12 \
```
```diff
@@ -411,7 +404,6 @@ python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
   --model_name_or_path bert-large-uncased-whole-word-masking \
   --do_train \
   --do_eval \
-  --do_lower_case \
   --train_file $SQUAD_DIR/train-v1.1.json \
   --predict_file $SQUAD_DIR/dev-v1.1.json \
   --learning_rate 3e-5 \
```
```diff
@@ -447,7 +439,6 @@ python run_squad.py \
   --model_name_or_path xlnet-large-cased \
   --do_train \
   --do_eval \
-  --do_lower_case \
   --train_file $SQUAD_DIR/train-v1.1.json \
   --predict_file $SQUAD_DIR/dev-v1.1.json \
   --learning_rate 3e-5 \
```
```diff
@@ -597,7 +588,6 @@ python examples/hans/test_hans.py \
   --task_name hans \
   --model_type $MODEL_TYPE \
   --do_eval \
-  --do_lower_case \
   --data_dir $HANS_DIR \
   --model_name_or_path $MODEL_PATH \
   --max_seq_length 128 \
```
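The commands in this file span checkpoints with different casing rules (bert-*-uncased, roberta-base, xlnet-large-cased), which is the strongest argument for dropping a shared flag: each tokenizer applies its own convention. A sketch, assuming `AutoTokenizer` as available in the library at this time, to make that visible:

```python
# Sketch: each checkpoint resolves its own tokenizer class and casing
# rule from the checkpoint itself; no shared --do_lower_case flag needed.
from transformers import AutoTokenizer

for name in ("bert-base-uncased", "roberta-base", "xlnet-large-cased"):
    tok = AutoTokenizer.from_pretrained(name)
    print(name, type(tok).__name__, tok.tokenize("John Smith"))
```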
valohai.yaml
@@ -89,6 +89,3 @@
description
:
Run evaluation during training at each logging step.
type
:
flag
default
:
true
-
name
:
do_lower_case
description
:
Set this flag if you are using an uncased model.
type
:
flag