Unverified Commit cbad305c authored by Julien Chaumond, committed by GitHub

[docs] The use of `do_lower_case` in scripts is on its way to deprecation (#3738)

parent b169ac9c
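The flag becomes redundant because, for the uncased checkpoints, ``from_pretrained`` already configures the tokenizer with ``do_lower_case=True``, so the example scripts lowercase exactly as before without the extra option. A minimal sketch of that behaviour, assuming the ``transformers`` library and the public ``bert-base-uncased`` / ``bert-base-cased`` checkpoints:

.. code-block:: python

    from transformers import BertTokenizer

    # The uncased checkpoint's tokenizer configuration sets do_lower_case=True,
    # so lower-casing happens without any command-line flag.
    uncased = BertTokenizer.from_pretrained("bert-base-uncased")
    print(uncased.tokenize("John Smith"))  # ['john', 'smith']

    # The cased checkpoint keeps the original casing of the input.
    cased = BertTokenizer.from_pretrained("bert-base-cased")
    print(cased.tokenize("John Smith"))  # tokens keep their capitalisation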
@@ -337,7 +337,6 @@ python ./examples/run_glue.py \
    --task_name $TASK_NAME \
    --do_train \
    --do_eval \
-   --do_lower_case \
    --data_dir $GLUE_DIR/$TASK_NAME \
    --max_seq_length 128 \
    --per_gpu_eval_batch_size=8 \
@@ -391,7 +390,6 @@ python -m torch.distributed.launch --nproc_per_node 8 ./examples/run_glue.py \
    --task_name MRPC \
    --do_train \
    --do_eval \
-   --do_lower_case \
    --data_dir $GLUE_DIR/MRPC/ \
    --max_seq_length 128 \
    --per_gpu_eval_batch_size=8 \
@@ -424,7 +422,6 @@ python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
    --model_name_or_path bert-large-uncased-whole-word-masking \
    --do_train \
    --do_eval \
-   --do_lower_case \
    --train_file $SQUAD_DIR/train-v1.1.json \
    --predict_file $SQUAD_DIR/dev-v1.1.json \
    --learning_rate 3e-5 \
...
@@ -58,14 +58,14 @@ where
    ``Uncased`` means that the text has been lowercased before WordPiece tokenization, e.g., ``John Smith`` becomes ``john smith``. The Uncased model also strips out any accent markers. ``Cased`` means that the true case and accent markers are preserved. Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging). For information about the Multilingual and Chinese model, see the `Multilingual README <https://github.com/google-research/bert/blob/master/multilingual.md>`__ or the original TensorFlow repository.
-   When using an ``uncased model``\ , make sure to pass ``--do_lower_case`` to the example training scripts (or pass ``do_lower_case=True`` to FullTokenizer if you're using your own script and loading the tokenizer your-self.).
+   When using an ``uncased model``\ , make sure your tokenizer has ``do_lower_case=True`` (either in its configuration, or passed as an additional parameter).
    Examples:
    .. code-block:: python
    # BERT
-   tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True, do_basic_tokenize=True)
+   tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize=True)
    model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
    # OpenAI GPT
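A short sketch of the two routes mentioned in the new sentence above, assuming the ``transformers`` ``BertTokenizer`` (keyword arguments to ``from_pretrained`` are forwarded to the tokenizer's constructor):

.. code-block:: python

    from transformers import BertTokenizer

    # Route 1: rely on the pretrained checkpoint's configuration, which already
    # records do_lower_case=True for the uncased models.
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    # Route 2: pass the option explicitly, e.g. when loading a bare local vocab
    # file that carries no configuration of its own.
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)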
@@ -140,13 +140,13 @@ Here is the recommended way of saving the model, configuration and vocabulary to
    torch.save(model_to_save.state_dict(), output_model_file)
    model_to_save.config.to_json_file(output_config_file)
-   tokenizer.save_vocabulary(output_dir)
+   tokenizer.save_pretrained(output_dir)
    # Step 2: Re-load the saved model and vocabulary
    # Example for a Bert model
    model = BertForQuestionAnswering.from_pretrained(output_dir)
-   tokenizer = BertTokenizer.from_pretrained(output_dir, do_lower_case=args.do_lower_case) # Add specific options if needed
+   tokenizer = BertTokenizer.from_pretrained(output_dir) # Add specific options if needed
    # Example for a GPT model
    model = OpenAIGPTDoubleHeadsModel.from_pretrained(output_dir)
    tokenizer = OpenAIGPTTokenizer.from_pretrained(output_dir)
...
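Because ``save_pretrained`` writes the tokenizer's configuration (including its lower-casing setting) next to the vocabulary, the reload no longer needs ``do_lower_case`` to be passed back in. A rough round-trip sketch, assuming the ``transformers`` ``BertTokenizer`` and an illustrative ``output_dir`` path:

.. code-block:: python

    import os
    from transformers import BertTokenizer

    output_dir = "./saved_tokenizer"  # hypothetical path, for illustration only
    os.makedirs(output_dir, exist_ok=True)

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    tokenizer.save_pretrained(output_dir)  # vocabulary plus tokenizer configuration

    # The reloaded tokenizer picks do_lower_case back up from the saved files.
    reloaded = BertTokenizer.from_pretrained(output_dir)
    assert reloaded.tokenize("John Smith") == tokenizer.tokenize("John Smith")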
@@ -168,7 +168,6 @@ python run_glue.py \
    --task_name $TASK_NAME \
    --do_train \
    --do_eval \
-   --do_lower_case \
    --data_dir $GLUE_DIR/$TASK_NAME \
    --max_seq_length 128 \
    --per_gpu_train_batch_size 32 \
@@ -209,7 +208,6 @@ python run_glue.py \
    --task_name MRPC \
    --do_train \
    --do_eval \
-   --do_lower_case \
    --data_dir $GLUE_DIR/MRPC/ \
    --max_seq_length 128 \
    --per_gpu_train_batch_size 32 \
@@ -236,7 +234,6 @@ python run_glue.py \
    --task_name MRPC \
    --do_train \
    --do_eval \
-   --do_lower_case \
    --data_dir $GLUE_DIR/MRPC/ \
    --max_seq_length 128 \
    --per_gpu_train_batch_size 32 \
@@ -261,7 +258,6 @@ python -m torch.distributed.launch \
    --task_name MRPC \
    --do_train \
    --do_eval \
-   --do_lower_case \
    --data_dir $GLUE_DIR/MRPC/ \
    --max_seq_length 128 \
    --per_gpu_train_batch_size 8 \
@@ -295,7 +291,6 @@ python -m torch.distributed.launch \
    --task_name mnli \
    --do_train \
    --do_eval \
-   --do_lower_case \
    --data_dir $GLUE_DIR/MNLI/ \
    --max_seq_length 128 \
    --per_gpu_train_batch_size 8 \
@@ -336,7 +331,6 @@ python ./examples/run_multiple_choice.py \
    --model_name_or_path roberta-base \
    --do_train \
    --do_eval \
-   --do_lower_case \
    --data_dir $SWAG_DIR \
    --learning_rate 5e-5 \
    --num_train_epochs 3 \
@@ -382,7 +376,6 @@ python run_squad.py \
    --model_name_or_path bert-base-uncased \
    --do_train \
    --do_eval \
-   --do_lower_case \
    --train_file $SQUAD_DIR/train-v1.1.json \
    --predict_file $SQUAD_DIR/dev-v1.1.json \
    --per_gpu_train_batch_size 12 \
@@ -411,7 +404,6 @@ python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
    --model_name_or_path bert-large-uncased-whole-word-masking \
    --do_train \
    --do_eval \
-   --do_lower_case \
    --train_file $SQUAD_DIR/train-v1.1.json \
    --predict_file $SQUAD_DIR/dev-v1.1.json \
    --learning_rate 3e-5 \
@@ -447,7 +439,6 @@ python run_squad.py \
    --model_name_or_path xlnet-large-cased \
    --do_train \
    --do_eval \
-   --do_lower_case \
    --train_file $SQUAD_DIR/train-v1.1.json \
    --predict_file $SQUAD_DIR/dev-v1.1.json \
    --learning_rate 3e-5 \
@@ -597,7 +588,6 @@ python examples/hans/test_hans.py \
    --task_name hans \
    --model_type $MODEL_TYPE \
    --do_eval \
-   --do_lower_case \
    --data_dir $HANS_DIR \
    --model_name_or_path $MODEL_PATH \
    --max_seq_length 128 \
...
@@ -89,6 +89,3 @@
    description: Run evaluation during training at each logging step.
    type: flag
    default: true
-   - name: do_lower_case
-     description: Set this flag if you are using an uncased model.
-     type: flag