Use text_column_name variable instead of "text" (#12132)

* Use text_column_name variable instead of "text"

`text_column_name` was already defined above where I made the changes and it was also used below where I made changes.

This is a very minor change. If a dataset does not use "text" as the column name, then the `tokenize_function` will now use whatever column is assigned to `text_column_name`. `text_column_name` is just the first column name if "text" is not a column name. It makes the function a little more robust, though I would assume that 90%+ of datasets use "text" anyway.

* black formatting

* make style

Co-authored-by: Nicholas Broad <nicholas@nmbroad.com>

Commit cd7961b6, authored Jun 14, 2021 by Nicholas Broad, committed by GitHub Jun 14, 2021
Showing with 8 additions and 4 deletions

examples/pytorch/language-modeling/run_mlm.py +4 -2

examples/pytorch/language-modeling/run_mlm_no_trainer.py +4 -2

--- a/examples/pytorch/language-modeling/run_mlm.py
+++ b/examples/pytorch/language-modeling/run_mlm.py
@@ -345,9 +345,11 @@ def main():
        def tokenize_function(examples):
            # Remove empty lines
-            examples["text"] = [line for line in examples["text"] if len(line) > 0 and not line.isspace()]
+            examples[text_column_name] = [
+                line for line in examples[text_column_name] if len(line) > 0 and not line.isspace()
+            ]
            return tokenizer(
-                examples["text"],
+                examples[text_column_name],
                padding=padding,
                truncation=True,
                max_length=max_seq_length,

--- a/examples/pytorch/language-modeling/run_mlm_no_trainer.py
+++ b/examples/pytorch/language-modeling/run_mlm_no_trainer.py
@@ -327,9 +327,11 @@ def main():
        def tokenize_function(examples):
            # Remove empty lines
-            examples["text"] = [line for line in examples["text"] if len(line) > 0 and not line.isspace()]
+            examples[text_column_name] = [
+                line for line in examples[text_column_name] if len(line) > 0 and not line.isspace()
+            ]
            return tokenizer(
-                examples["text"],
+                examples[text_column_name],
                padding=padding,
                truncation=True,
                max_length=max_seq_length,
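
To illustrate the effect of the change, here is a minimal standalone sketch of the patched `tokenize_function` running on a batch whose text column is not named "text". It is not the script's actual code: `pick_text_column`, `make_tokenize_function`, and `whitespace_tokenizer` are hypothetical helpers introduced only for this example, and the stub tokenizer stands in for the real one so the snippet runs without any dependencies. The fallback rule follows the commit message ("text" if present, otherwise the first column).

```python
def pick_text_column(column_names):
    # Hypothetical helper mirroring the fallback described in the commit
    # message: prefer "text", otherwise use the first column name.
    return "text" if "text" in column_names else column_names[0]


def whitespace_tokenizer(lines):
    # Hypothetical stand-in for a real tokenizer; splits each line on whitespace.
    return {"input_ids": [line.split() for line in lines]}


def make_tokenize_function(text_column_name, tokenizer):
    # Hypothetical factory; in the scripts, tokenize_function is a closure
    # inside main() that picks up text_column_name from the enclosing scope.
    def tokenize_function(examples):
        # Remove empty lines
        examples[text_column_name] = [
            line for line in examples[text_column_name] if len(line) > 0 and not line.isspace()
        ]
        return tokenizer(examples[text_column_name])

    return tokenize_function


# A batch whose text lives in a column called "sentence", not "text".
batch = {"sentence": ["hello world", "", "   ", "foo"]}
text_column_name = pick_text_column(list(batch.keys()))
tokenize_function = make_tokenize_function(text_column_name, whitespace_tokenizer)
result = tokenize_function(batch)
print(text_column_name, result)
```

With the hard-coded `examples["text"]` from before this commit, the same batch would raise a `KeyError`, since it has no "text" key.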