"vscode:/vscode.git/clone" did not exist on "5f1d18d49e01f1349e323e3e291dd5b3642a5ce3"
README.rst 15.2 KB
Newer Older
Rayyyyy's avatar
Rayyyyy committed
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217

Hyperparameter Optimization
===========================

The :class:`~sentence_transformers.trainer.SentenceTransformerTrainer` supports hyperparameter optimization using ``transformers``, which in turn supports four hyperparameter search backends: `optuna <https://optuna.org/>`_, `sigopt <https://sigopt.org/>`_, `raytune <https://docs.ray.io/en/latest/tune/index.html>`_, and `wandb <https://wandb.ai/site/sweeps>`_. You should install your backend of choice before using it::

    pip install optuna  # or sigopt, wandb, ray[tune]

On this page, we'll show you how to use the hyperparameter optimization feature with the ``optuna`` backend. The other backends work similarly, but you should refer to their respective documentation or the `transformers HPO documentation <https://huggingface.co/docs/transformers/en/hpo_train>`_ for more information.

HPO Components
--------------

The hyperparameter optimization process consists of the following components:

.. raw:: html

    <div class="components">
        <a href="#hyperparameter-search-space" class="box">
            <div class="header">Hyperparameter Search Space</div>
            Specify ranges for hyperparameter values.
        </a>
        <a href="#model-initialization" class="box">
            <div class="header">Model Initialization</div>
            Initialize a SentenceTransformer model for a trial.
        </a>
        <a href="#loss-initialization" class="box">
            <div class="header">Loss Initialization</div>
            Initialize a loss function given a model.
        </a>
        <a href="#compute-objective" class="box">
            <div class="header">Compute Objective</div>
            Determine the value to be minimized or maximized.
        </a>
    </div>
    <br>

Hyperparameter Search Space
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The hyperparameter search space is defined by a function that returns a dictionary of hyperparameters and their respective search spaces. Here's an example of an ``optuna`` search space function that defines the hyperparameters for a ``SentenceTransformer`` model::

    def hpo_search_space(trial):
        return {
            "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 2),
            "per_device_train_batch_size": trial.suggest_int("per_device_train_batch_size", 32, 128),
            "warmup_ratio": trial.suggest_float("warmup_ratio", 0, 0.3),
            "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        }
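
Beyond integer and float ranges, ``optuna`` also supports categorical suggestions, which can be handy for options such as the learning rate scheduler. An illustrative sketch (``lr_scheduler_type`` is a standard training argument; the specific choices here are just examples)::

    def hpo_search_space(trial):
        return {
            "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
            "warmup_ratio": trial.suggest_float("warmup_ratio", 0, 0.3),
            "lr_scheduler_type": trial.suggest_categorical("lr_scheduler_type", ["linear", "cosine"]),
        }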

Model Initialization
~~~~~~~~~~~~~~~~~~~~

The model initialization function takes the hyperparameters of the current trial as input and returns a fresh ``SentenceTransformer`` model. Generally, this function is quite simple. Here's an example of a model initialization function::

    def hpo_model_init(trial):
        return SentenceTransformer("distilbert-base-uncased")
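
If you also want to tune model-level choices, the trial object passed to ``hpo_model_init`` can be used for that as well. A minimal sketch, assuming the ``optuna`` backend (which passes its trial to ``model_init``); note that the trial may be ``None`` when the trainer first instantiates a model, and the candidate models below are purely illustrative::

    def hpo_model_init(trial):
        if trial is None:
            # Initial instantiation before the search starts
            return SentenceTransformer("distilbert-base-uncased")
        # Let the trial also pick the base model (illustrative candidates)
        base_model = trial.suggest_categorical(
            "base_model", ["distilbert-base-uncased", "microsoft/mpnet-base"]
        )
        return SentenceTransformer(base_model)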

Loss Initialization
~~~~~~~~~~~~~~~~~~~

The loss initialization function takes the model initialized for the current trial and returns a loss function. Here's an example of a loss initialization function::

    def hpo_loss_init(model):
        return losses.CosineSimilarityLoss(model)
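
The loss should match your training data format. For instance, with (anchor, positive, negative) triplets such as the AllNLI data used in the full example below, a ranking loss is the better fit::

    def hpo_loss_init(model):
        # In-batch negatives ranking loss for triplet-style training data
        return losses.MultipleNegativesRankingLoss(model)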

Compute Objective
~~~~~~~~~~~~~~~~~

The compute objective function takes the evaluation ``metrics`` dictionary and returns the float value to be minimized or maximized. Here's an example of a compute objective function::

    def hpo_compute_objective(metrics):
        return metrics["eval_sts-dev_spearman_cosine"]

.. note::

    The dictionary keys of ``metrics`` are all prefixed with ``eval_``. Additionally, metrics produced by an evaluator also include the evaluator's ``name``. So, to optimize on ``spearman_cosine`` from an :class:`~sentence_transformers.evaluation.EmbeddingSimilarityEvaluator` that was initialized with ``name="sts-dev"``, you would use the key ``eval_sts-dev_spearman_cosine`` in your ``hpo_compute_objective``.

    Another common option is to use ``eval_loss``.
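
For example, if you would rather minimize the evaluation loss than maximize an evaluator metric, the objective is just as simple; remember to pass ``direction="minimize"`` to :meth:`~sentence_transformers.trainer.SentenceTransformerTrainer.hyperparameter_search` in that case::

    def hpo_compute_objective(metrics):
        # Return the evaluation loss; use direction="minimize" in hyperparameter_search
        return metrics["eval_loss"]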

Putting It All Together
------------------------

You can perform HPO on any regular training loop, the only difference being that you don't call :meth:`SentenceTransformerTrainer.train <sentence_transformers.trainer.SentenceTransformerTrainer.train>`, but :meth:`SentenceTransformerTrainer.hyperparameter_search <sentence_transformers.trainer.SentenceTransformerTrainer.hyperparameter_search>` instead. Here's an example of how to put it all together:

.. sidebar:: Documentation

    #. `sentence-transformers/all-nli <https://huggingface.co/datasets/sentence-transformers/all-nli>`_
    #. :class:`~sentence_transformers.evaluation.EmbeddingSimilarityEvaluator`
    #. `Hyperparameter Search Space <#hyperparameter-search-space>`_
    #. `Model Initialization <#model-initialization>`_
    #. `Loss Initialization <#loss-initialization>`_
    #. `Compute Objective <#compute-objective>`_
    #. :class:`~sentence_transformers.training_args.SentenceTransformerTrainingArguments`
    #. :class:`~sentence_transformers.trainer.SentenceTransformerTrainer`
    #. :meth:`~sentence_transformers.trainer.SentenceTransformerTrainer.hyperparameter_search`

::

    from sentence_transformers import losses
    from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, SentenceTransformerTrainingArguments
    from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction
    from sentence_transformers.training_args import BatchSamplers
    from datasets import load_dataset

    # 1. Load the AllNLI dataset: https://huggingface.co/datasets/sentence-transformers/all-nli, only 10k train and 1k dev
    train_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split="train[:10000]")
    eval_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split="dev[:1000]")

    # 2. Create an evaluator to perform useful HPO
    stsb_eval_dataset = load_dataset("sentence-transformers/stsb", split="validation")
    dev_evaluator = EmbeddingSimilarityEvaluator(
        sentences1=stsb_eval_dataset["sentence1"],
        sentences2=stsb_eval_dataset["sentence2"],
        scores=stsb_eval_dataset["score"],
        main_similarity=SimilarityFunction.COSINE,
        name="sts-dev",
    )

    # 3. Define the Hyperparameter Search Space
    def hpo_search_space(trial):
        return {
            "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 2),
            "per_device_train_batch_size": trial.suggest_int("per_device_train_batch_size", 32, 128),
            "warmup_ratio": trial.suggest_float("warmup_ratio", 0, 0.3),
            "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        }

    # 4. Define the Model Initialization
    def hpo_model_init(trial):
        return SentenceTransformer("distilbert-base-uncased")

    # 5. Define the Loss Initialization
    def hpo_loss_init(model):
        return losses.MultipleNegativesRankingLoss(model)

    # 6. Define the Objective Function
    def hpo_compute_objective(metrics):
        """
        Valid keys are: 'eval_loss', 'eval_sts-dev_pearson_cosine', 'eval_sts-dev_spearman_cosine',
        'eval_sts-dev_pearson_manhattan', 'eval_sts-dev_spearman_manhattan', 'eval_sts-dev_pearson_euclidean',
        'eval_sts-dev_spearman_euclidean', 'eval_sts-dev_pearson_dot', 'eval_sts-dev_spearman_dot',
        'eval_sts-dev_pearson_max', 'eval_sts-dev_spearman_max', 'eval_runtime', 'eval_samples_per_second',
        'eval_steps_per_second', 'epoch'

        due to the evaluator that we're using.
        """
        return metrics["eval_sts-dev_spearman_cosine"]

    # 7. Define the training arguments
    args = SentenceTransformerTrainingArguments(
        # Required parameter:
        output_dir="checkpoints",
        # Optional training parameters:
        # max_steps=10000, # We might want to limit the number of steps for HPO
        fp16=True,  # Set to False if you get an error that your GPU can't run on FP16
        bf16=False,  # Set to True if you have a GPU that supports BF16
        batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
        # Optional tracking/debugging parameters:
        eval_strategy="no", # We don't need to evaluate/save during HPO
        save_strategy="no",
        logging_steps=10,
        run_name="hpo",  # Will be used in W&B if `wandb` is installed
    )

    # 8. Create the trainer with model_init rather than model
    trainer = SentenceTransformerTrainer(
        model=None,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        evaluator=dev_evaluator,
        model_init=hpo_model_init,
        loss=hpo_loss_init,
    )

    # 9. Perform the HPO
    best_trial = trainer.hyperparameter_search(
        hp_space=hpo_search_space,
        compute_objective=hpo_compute_objective,
        n_trials=20,
        direction="maximize",
        backend="optuna",
    )
    print(best_trial)

::

    [I 2024-05-17 15:10:47,844] Trial 0 finished with value: 0.7889856589698055 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 123, 'warmup_ratio': 0.07380948785410107, 'learning_rate': 2.686331417509812e-06}. Best is trial 0 with value: 0.7889856589698055.
    [I 2024-05-17 15:12:13,283] Trial 1 finished with value: 0.7927780672090986 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 69, 'warmup_ratio': 0.2927897848007451, 'learning_rate': 5.885372118095137e-06}. Best is trial 1 with value: 0.7927780672090986.
    [I 2024-05-17 15:12:43,896] Trial 2 finished with value: 0.7684829743509601 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 114, 'warmup_ratio': 0.0739429232666916, 'learning_rate': 7.344415188959276e-05}. Best is trial 1 with value: 0.7927780672090986.
    [I 2024-05-17 15:14:49,730] Trial 3 finished with value: 0.7873032743147989 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 43, 'warmup_ratio': 0.15184370143796674, 'learning_rate': 9.703232080395476e-06}. Best is trial 1 with value: 0.7927780672090986.
    [I 2024-05-17 15:15:39,597] Trial 4 finished with value: 0.7759251781929949 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 127, 'warmup_ratio': 0.263946220093495, 'learning_rate': 1.231454337152625e-06}. Best is trial 1 with value: 0.7927780672090986.
    [I 2024-05-17 15:17:02,191] Trial 5 finished with value: 0.7964580509886684 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 34, 'warmup_ratio': 0.2276865359631089, 'learning_rate': 7.889007438884571e-06}. Best is trial 5 with value: 0.7964580509886684.
    [I 2024-05-17 15:18:55,559] Trial 6 finished with value: 0.7901878917859169 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 48, 'warmup_ratio': 0.23228838664572948, 'learning_rate': 2.883013292682523e-06}. Best is trial 5 with value: 0.7964580509886684.
    [I 2024-05-17 15:20:27,027] Trial 7 finished with value: 0.7935671067660925 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 62, 'warmup_ratio': 0.22061123927198237, 'learning_rate': 2.95413457610349e-06}. Best is trial 5 with value: 0.7964580509886684.
    [I 2024-05-17 15:22:23,147] Trial 8 finished with value: 0.7848123114933252 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 45, 'warmup_ratio': 0.23071701022961139, 'learning_rate': 9.793681667449783e-06}. Best is trial 5 with value: 0.7964580509886684.
    [I 2024-05-17 15:22:52,826] Trial 9 finished with value: 0.7909708416168918 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 121, 'warmup_ratio': 0.22440506724181647, 'learning_rate': 4.0744671365843346e-05}. Best is trial 5 with value: 0.7964580509886684.
    [I 2024-05-17 15:23:30,395] Trial 10 finished with value: 0.7928991732385567 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 89, 'warmup_ratio': 0.14607293301068847, 'learning_rate': 2.5557492055039498e-05}. Best is trial 5 with value: 0.7964580509886684.
    [I 2024-05-17 15:24:18,024] Trial 11 finished with value: 0.7991870087507459 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 66, 'warmup_ratio': 0.16886154348739527, 'learning_rate': 3.705926066938032e-06}. Best is trial 11 with value: 0.7991870087507459.
    [I 2024-05-17 15:25:44,198] Trial 12 finished with value: 0.7923304174306207 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 33, 'warmup_ratio': 0.15953772535423974, 'learning_rate': 1.8076298025704224e-05}. Best is trial 11 with value: 0.7991870087507459.
    [I 2024-05-17 15:26:20,739] Trial 13 finished with value: 0.8020260244040395 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 90, 'warmup_ratio': 0.18105202625281253, 'learning_rate': 5.513908793512551e-06}. Best is trial 13 with value: 0.8020260244040395.
    [I 2024-05-17 15:26:57,783] Trial 14 finished with value: 0.7571110256860063 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 95, 'warmup_ratio': 0.00122391151793258, 'learning_rate': 1.0432486633629492e-06}. Best is trial 13 with value: 0.8020260244040395.
    [I 2024-05-17 15:27:32,581] Trial 15 finished with value: 0.8009013936824717 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 101, 'warmup_ratio': 0.1761274711346081, 'learning_rate': 4.5918293464430035e-06}. Best is trial 13 with value: 0.8020260244040395.
    [I 2024-05-17 15:28:05,850] Trial 16 finished with value: 0.8017668050806169 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 103, 'warmup_ratio': 0.10766501647726355, 'learning_rate': 5.0309795522333e-06}. Best is trial 13 with value: 0.8020260244040395.
    [I 2024-05-17 15:28:37,393] Trial 17 finished with value: 0.7769412380909586 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 108, 'warmup_ratio': 0.1036610178950246, 'learning_rate': 1.7747598626081271e-06}. Best is trial 13 with value: 0.8020260244040395.
    [I 2024-05-17 15:29:19,340] Trial 18 finished with value: 0.8011921300048339 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 80, 'warmup_ratio': 0.117014165550441, 'learning_rate': 1.238558867958792e-05}. Best is trial 13 with value: 0.8020260244040395.
    [I 2024-05-17 15:29:59,508] Trial 19 finished with value: 0.8027501854704168 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 84, 'warmup_ratio': 0.014601112207929548, 'learning_rate': 5.627813947769514e-06}. Best is trial 19 with value: 0.8027501854704168.

    BestRun(run_id='19', objective=0.8027501854704168, hyperparameters={'num_train_epochs': 1, 'per_device_train_batch_size': 84, 'warmup_ratio': 0.014601112207929548, 'learning_rate': 5.627813947769514e-06}, run_summary=None)

As you can see, the strongest hyperparameters reached **0.802** Spearman correlation on the STS (dev) benchmark. For context, training with the default training arguments (``per_device_train_batch_size=8``, ``learning_rate=5e-5``) results in **0.736**, and hyperparameters chosen based on experience (``per_device_train_batch_size=64``, ``learning_rate=2e-5``) result in **0.783** Spearman correlation. Consequently, HPO proved quite effective here in improving the model performance.
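
Once the search has finished, you would typically re-train a final model with the winning hyperparameters. A minimal sketch, reusing the variables from the script above (the ``final_*`` names are illustrative, not part of the API)::

    # Re-train with the best hyperparameters found by the search
    final_model = SentenceTransformer("distilbert-base-uncased")
    final_args = SentenceTransformerTrainingArguments(
        output_dir="final_model",
        fp16=True,
        batch_sampler=BatchSamplers.NO_DUPLICATES,
        **best_trial.hyperparameters,  # num_train_epochs, per_device_train_batch_size, warmup_ratio, learning_rate
    )
    final_trainer = SentenceTransformerTrainer(
        model=final_model,
        args=final_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        evaluator=dev_evaluator,
        loss=losses.MultipleNegativesRankingLoss(final_model),
    )
    final_trainer.train()
    final_model.save_pretrained("final_model/best")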

Example Scripts
---------------

- `hpo_nli.py <hpo_nli.py>`_ - An example script that performs hyperparameter optimization on the AllNLI dataset.