Update run_mlm.py (#12344)

Previously the script could not be used for validation only, because the line extension = data_args.train_file.split(".")[-1] assumed the extension must be taken from the training file. That line ran whether the user asked for training or for evaluation, so an evaluation-only run, where no training file is provided, failed with an error. I changed it to take the extension from the training file when one is given and from the validation file when one is given. This way the script can be used for training and for validation separately.

9490d668 · Taha ValizadehAslani · GitHub · c7faf2cc · 9490d668
Unverified Commit 9490d668 authored Jun 28, 2021 by Taha ValizadehAslani Committed by GitHub Jun 28, 2021

Showing with 2 additions and 1 deletion

examples/pytorch/language-modeling/run_mlm.py +2 -1

--- a/examples/pytorch/language-modeling/run_mlm.py
+++ b/examples/pytorch/language-modeling/run_mlm.py
@@ -278,9 +278,10 @@ def main():
        data_files = {}
        if data_args.train_file is not None:
            data_files["train"] = data_args.train_file
+            extension = data_args.train_file.split(".")[-1]
        if data_args.validation_file is not None:
            data_files["validation"] = data_args.validation_file
-        extension = data_args.train_file.split(".")[-1]
+            extension = data_args.validation_file.split(".")[-1]
        if extension == "txt":
            extension = "text"
        raw_datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir)
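
The patched branch can be exercised on its own, without the rest of the script. The sketch below is only an illustration: resolve_data_files is a hypothetical helper that does not exist in run_mlm.py, and the explicit ValueError for the case where neither file is given is an assumption added so the function is self-contained. It mirrors the logic in the diff, including the detail that the validation file's extension wins when both files are given.

```python
from typing import Dict, Optional, Tuple


def resolve_data_files(
    train_file: Optional[str] = None,
    validation_file: Optional[str] = None,
) -> Tuple[Dict[str, str], str]:
    """Hypothetical helper mirroring the patched branch of main()."""
    data_files: Dict[str, str] = {}
    extension: Optional[str] = None

    if train_file is not None:
        data_files["train"] = train_file
        extension = train_file.split(".")[-1]
    if validation_file is not None:
        data_files["validation"] = validation_file
        # As in the patch, this assignment runs last, so the validation
        # file's extension is used when both files are given.
        extension = validation_file.split(".")[-1]

    if extension is None:
        # Assumption for this sketch: fail loudly when no file is given.
        raise ValueError("Need a training file or a validation file.")
    if extension == "txt":
        # Same mapping as in the script: "txt" files use the "text" loader.
        extension = "text"
    return data_files, extension


if __name__ == "__main__":
    # Evaluation-only run: no training file is needed any more.
    print(resolve_data_files(validation_file="dev.txt"))
```

With the pre-patch line, the evaluation-only call would have attempted None.split(".") and raised an AttributeError.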