Hyperparameter Optimization =========================== The :class:`~sentence_transformers.trainer.SentenceTransformerTrainer` supports hyperparameter optimization using ``transformers``, which in turn supports four hyperparameter search backends: `optuna `_, `sigopt `_, `raytune `_, and `wandb `_. You should install your backend of choice before using it:: pip install optuna/sigopt/wandb/ray[tune] On this page, we'll show you how to use the hyperparameter optimization feature with the `optuna` backend. The other backends are similar to use, but you should refer to their respective documentation or the `transformers HPO documentation `_ for more information. HPO Components -------------- The hyperparameter optimization process consists of the following components: .. raw:: html
Hyperparameter Search Space ~~~~~~~~~~~~~~~~~~~~~~~~~~~ The hyperparameter search space is defined by a function that returns a dictionary of hyperparameters and their respective search spaces. Here's an example using ``optuna`` of a search space function that defines the hyperparameters for a `SentenceTransformer` model:: def hpo_search_space(trial): return { "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 2), "per_device_train_batch_size": trial.suggest_int("per_device_train_batch_size", 32, 128), "warmup_ratio": trial.suggest_float("warmup_ratio", 0, 0.3), "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True), } Model Initialization ~~~~~~~~~~~~~~~~~~~~ The model initialization function is a function that takes the hyperparameters of the current "trial" as input and returns a `SentenceTransformer` model. Generally, this function is quite simple. Here's an example of a model initialization function:: def hpo_model_init(trial): return SentenceTransformer("distilbert-base-uncased") Loss Initialization ~~~~~~~~~~~~~~~~~~~ The loss initialization function is a function that takes the model initialized for the current trial and returns a loss function. Here's an example of a loss initialization function:: def hpo_loss_init(model): return losses.CosineSimilarityLoss(model) Compute Objective ~~~~~~~~~~~~~~~~~ The compute objective function is a function that takes the evaluation ``metrics`` and returns the float value to be minimized or maximized. Here's an example of a compute objective function:: def hpo_compute_objective(metrics): return metrics["eval_sts-dev_spearman_cosine"] .. note: The dictionary keys of ``metrics`` are all prepended with ``eval_``. Additionally, if you're interested in maximizing the performance of an evaluator, note that the ``name`` of the evaluator is also prepended with a ``-``. So, to optimize on ``spearman_cosine`` from :class:`~sentence_transformers.evaluation.EmbeddingSimilarityEvaluator` which was initialized with ``name="stsb_dev"``, then you would use the key ``eval_sts-dev_spearman_cosine`` in your ``hpo_compute_objective``. Another common option is to use ``eval_loss``. Putting It All Together ------------------------ You can perform HPO on any regular training loop, the only difference being that you don't call :meth:`SentenceTransformerTrainer.train `, but :meth:`SentenceTransformerTrainer.hyperparameter_search ` instead. Here's an example of how to put it all together: .. sidebar:: Documentation #. `sentence-transformers/all-nli `_ #. :class:`~sentence_transformers.evaluation.EmbeddingSimilarityEvaluator` #. `Hyperparameter Search Space <#hyperparameter-search-space>`_ #. `Model Initialization <#model-initialization>`_ #. `Loss Initialization <#loss-initialization>`_ #. `Compute Objective <#compute-objective>`_ #. :class:`~sentence_transformers.training_args.SentenceTransformerTrainingArguments` #. :class:`~sentence_transformers.trainer.SentenceTransformerTrainer` #. :meth:`~sentence_transformers.trainer.SentenceTransformerTrainer.hyperparameter_search` :: from sentence_transformers import losses from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, SentenceTransformerTrainingArguments from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction from sentence_transformers.training_args import BatchSamplers from datasets import load_dataset # 1. Load the AllNLI dataset: https://huggingface.co/datasets/sentence-transformers/all-nli, only 10k train and 1k dev train_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split="train[:10000]") eval_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split="dev[:1000]") # 2. Create an evaluator to perform useful HPO stsb_eval_dataset = load_dataset("sentence-transformers/stsb", split="validation") dev_evaluator = EmbeddingSimilarityEvaluator( sentences1=stsb_eval_dataset["sentence1"], sentences2=stsb_eval_dataset["sentence2"], scores=stsb_eval_dataset["score"], main_similarity=SimilarityFunction.COSINE, name="sts-dev", ) # 3. Define the Hyperparameter Search Space def hpo_search_space(trial): return { "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 2), "per_device_train_batch_size": trial.suggest_int("per_device_train_batch_size", 32, 128), "warmup_ratio": trial.suggest_float("warmup_ratio", 0, 0.3), "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True), } # 4. Define the Model Initialization def hpo_model_init(trial): return SentenceTransformer("distilbert-base-uncased") # 5. Define the Loss Initialization def hpo_loss_init(model): return losses.MultipleNegativesRankingLoss(model) # 6. Define the Objective Function def hpo_compute_objective(metrics): """ Valid keys are: 'eval_loss', 'eval_sts-dev_pearson_cosine', 'eval_sts-dev_spearman_cosine', 'eval_sts-dev_pearson_manhattan', 'eval_sts-dev_spearman_manhattan', 'eval_sts-dev_pearson_euclidean', 'eval_sts-dev_spearman_euclidean', 'eval_sts-dev_pearson_dot', 'eval_sts-dev_spearman_dot', 'eval_sts-dev_pearson_max', 'eval_sts-dev_spearman_max', 'eval_runtime', 'eval_samples_per_second', 'eval_steps_per_second', 'epoch' due to the evaluator that we're using. """ return metrics["eval_sts-dev_spearman_cosine"] # 7. Define the training arguments args = SentenceTransformerTrainingArguments( # Required parameter: output_dir="checkpoints", # Optional training parameters: # max_steps=10000, # We might want to limit the number of steps for HPO fp16=True, # Set to False if you get an error that your GPU can't run on FP16 bf16=False, # Set to True if you have a GPU that supports BF16 batch_sampler=BatchSamplers.NO_DUPLICATES, # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch # Optional tracking/debugging parameters: eval_strategy="no", # We don't need to evaluate/save during HPO save_strategy="no", logging_steps=10, run_name="hpo", # Will be used in W&B if `wandb` is installed ) # 8. Create the trainer with model_init rather than model trainer = SentenceTransformerTrainer( model=None, args=args, train_dataset=train_dataset, eval_dataset=eval_dataset, evaluator=dev_evaluator, model_init=hpo_model_init, loss=hpo_loss_init, ) # 9. Perform the HPO best_trial = trainer.hyperparameter_search( hp_space=hpo_search_space, compute_objective=hpo_compute_objective, n_trials=20, direction="maximize", backend="optuna", ) print(best_trial) :: [I 2024-05-17 15:10:47,844] Trial 0 finished with value: 0.7889856589698055 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 123, 'warmup_ratio': 0.07380948785410107, 'learning_rate': 2.686331417509812e-06}. Best is trial 0 with value: 0.7889856589698055. [I 2024-05-17 15:12:13,283] Trial 1 finished with value: 0.7927780672090986 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 69, 'warmup_ratio': 0.2927897848007451, 'learning_rate': 5.885372118095137e-06}. Best is trial 1 with value: 0.7927780672090986. [I 2024-05-17 15:12:43,896] Trial 2 finished with value: 0.7684829743509601 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 114, 'warmup_ratio': 0.0739429232666916, 'learning_rate': 7.344415188959276e-05}. Best is trial 1 with value: 0.7927780672090986. [I 2024-05-17 15:14:49,730] Trial 3 finished with value: 0.7873032743147989 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 43, 'warmup_ratio': 0.15184370143796674, 'learning_rate': 9.703232080395476e-06}. Best is trial 1 with value: 0.7927780672090986. [I 2024-05-17 15:15:39,597] Trial 4 finished with value: 0.7759251781929949 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 127, 'warmup_ratio': 0.263946220093495, 'learning_rate': 1.231454337152625e-06}. Best is trial 1 with value: 0.7927780672090986. [I 2024-05-17 15:17:02,191] Trial 5 finished with value: 0.7964580509886684 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 34, 'warmup_ratio': 0.2276865359631089, 'learning_rate': 7.889007438884571e-06}. Best is trial 5 with value: 0.7964580509886684. [I 2024-05-17 15:18:55,559] Trial 6 finished with value: 0.7901878917859169 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 48, 'warmup_ratio': 0.23228838664572948, 'learning_rate': 2.883013292682523e-06}. Best is trial 5 with value: 0.7964580509886684. [I 2024-05-17 15:20:27,027] Trial 7 finished with value: 0.7935671067660925 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 62, 'warmup_ratio': 0.22061123927198237, 'learning_rate': 2.95413457610349e-06}. Best is trial 5 with value: 0.7964580509886684. [I 2024-05-17 15:22:23,147] Trial 8 finished with value: 0.7848123114933252 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 45, 'warmup_ratio': 0.23071701022961139, 'learning_rate': 9.793681667449783e-06}. Best is trial 5 with value: 0.7964580509886684. [I 2024-05-17 15:22:52,826] Trial 9 finished with value: 0.7909708416168918 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 121, 'warmup_ratio': 0.22440506724181647, 'learning_rate': 4.0744671365843346e-05}. Best is trial 5 with value: 0.7964580509886684. [I 2024-05-17 15:23:30,395] Trial 10 finished with value: 0.7928991732385567 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 89, 'warmup_ratio': 0.14607293301068847, 'learning_rate': 2.5557492055039498e-05}. Best is trial 5 with value: 0.7964580509886684. [I 2024-05-17 15:24:18,024] Trial 11 finished with value: 0.7991870087507459 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 66, 'warmup_ratio': 0.16886154348739527, 'learning_rate': 3.705926066938032e-06}. Best is trial 11 with value: 0.7991870087507459. [I 2024-05-17 15:25:44,198] Trial 12 finished with value: 0.7923304174306207 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 33, 'warmup_ratio': 0.15953772535423974, 'learning_rate': 1.8076298025704224e-05}. Best is trial 11 with value: 0.7991870087507459. [I 2024-05-17 15:26:20,739] Trial 13 finished with value: 0.8020260244040395 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 90, 'warmup_ratio': 0.18105202625281253, 'learning_rate': 5.513908793512551e-06}. Best is trial 13 with value: 0.8020260244040395. [I 2024-05-17 15:26:57,783] Trial 14 finished with value: 0.7571110256860063 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 95, 'warmup_ratio': 0.00122391151793258, 'learning_rate': 1.0432486633629492e-06}. Best is trial 13 with value: 0.8020260244040395. [I 2024-05-17 15:27:32,581] Trial 15 finished with value: 0.8009013936824717 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 101, 'warmup_ratio': 0.1761274711346081, 'learning_rate': 4.5918293464430035e-06}. Best is trial 13 with value: 0.8020260244040395. [I 2024-05-17 15:28:05,850] Trial 16 finished with value: 0.8017668050806169 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 103, 'warmup_ratio': 0.10766501647726355, 'learning_rate': 5.0309795522333e-06}. Best is trial 13 with value: 0.8020260244040395. [I 2024-05-17 15:28:37,393] Trial 17 finished with value: 0.7769412380909586 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 108, 'warmup_ratio': 0.1036610178950246, 'learning_rate': 1.7747598626081271e-06}. Best is trial 13 with value: 0.8020260244040395. [I 2024-05-17 15:29:19,340] Trial 18 finished with value: 0.8011921300048339 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 80, 'warmup_ratio': 0.117014165550441, 'learning_rate': 1.238558867958792e-05}. Best is trial 13 with value: 0.8020260244040395. [I 2024-05-17 15:29:59,508] Trial 19 finished with value: 0.8027501854704168 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 84, 'warmup_ratio': 0.014601112207929548, 'learning_rate': 5.627813947769514e-06}. Best is trial 19 with value: 0.8027501854704168. BestRun(run_id='19', objective=0.8027501854704168, hyperparameters={'num_train_epochs': 1, 'per_device_train_batch_size': 84, 'warmup_ratio': 0.014601112207929548, 'learning_rate': 5.627813947769514e-06}, run_summary=None) As you can see, the strongest hyperparameters reached **0.802** Spearman correlation on the STS (dev) benchmark. For context, training with the default training arguments (``per_device_train_batch_size=8``, ``learning_rate=5e-5``) results in **0.736**, and hyperparameters chosen based on experience (``per_device_train_batch_size=64``, ``learning_rate=2e-5``) results in **0.783** Spearman correlation. Consequently, HPO proved quite effective here in improving the model performance. Example Scripts --------------- - `hpo_nli.py `_ - An example script that performs hyperparameter optimization on the AllNLI dataset.