"git@developer.sourcefind.cn:chenpangpang/transformers.git" did not exist on "be3fd8a262fb1bfdbe2aaf1b00ab78e243632cba"
Unverified Commit 8fd8c8e4 authored by ranchlai's avatar ranchlai Committed by GitHub
Browse files

Add multi-label text classification support to pytorch example (#24770)

* Add text classification example

* set the problem type and finetuning task

* ruff reformated

* fix bug for unseting label_to_id for regression

* update README.md

* fixed finetuning task

* update comment

* check if label exists in feature before removing

* add useful logging
parent 7381987f
...@@ -81,6 +81,55 @@ python run_glue.py \ ...@@ -81,6 +81,55 @@ python run_glue.py \
> If your model classification head dimensions do not fit the number of labels in the dataset, you can specify `--ignore_mismatched_sizes` to adapt it. > If your model classification head dimensions do not fit the number of labels in the dataset, you can specify `--ignore_mismatched_sizes` to adapt it.
## Text classification
As an alternative, we can use the script [`run_classification.py`](./run_classification.py) to fine-tune models on a single/multi-label classification task.
The following example fine-tunes BERT on the `en` subset of [`amazon_reviews_multi`](https://huggingface.co/datasets/amazon_reviews_multi) dataset.
We can specify the metric, the label column and aso choose which text columns to use jointly for classification.
```bash
dataset="amazon_reviews_multi"
subset="en"
python run_classification.py \
--model_name_or_path bert-base-uncased \
--dataset_name ${dataset} \
--dataset_config_name ${subset} \
--shuffle_train_dataset \
--metric_name accuracy \
--text_column_name "review_title,review_body,product_category" \
--text_column_delimiter "\n" \
--label_column_name stars \
--do_train \
--do_eval \
--max_seq_length 512 \
--per_device_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 1 \
--output_dir /tmp/${dataset}_${subset}/
```
Training for 1 epoch results in acc of around 0.5958 for review_body only and 0.659 for title+body+category.
The following is a multi-label classification example. It fine-tunes BERT on the `reuters21578` dataset hosted on our [hub](https://huggingface.co/datasets/reuters21578):
```bash
dataset="reuters21578"
subset="ModApte"
python run_classification.py \
--model_name_or_path bert-base-uncased \
--dataset_name ${dataset} \
--dataset_config_name ${subset} \
--shuffle_train_dataset \
--remove_splits "unused" \
--metric_name f1 \
--text_column_name text \
--label_column_name topics \
--do_train \
--do_eval \
--max_seq_length 512 \
--per_device_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 15 \
--output_dir /tmp/${dataset}_${subset}/
```
It results in a Micro F1 score of around 0.82 without any text and label filtering. Note that you have to explictly remove the "unused" split from the dataset, since it is not used for classification.
### Mixed precision training ### Mixed precision training
......
This diff is collapsed.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment