Enable training for fraction of total steps; enable early stopping from trial 0

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/627

Enable training for a fraction of the total steps: when doing hyperparameter optimization (HPO), users may want to train for only a fraction of the training steps of a regular (baseline) training run. In this case, it is not enough to just change SOLVER.MAX_ITER, because that also changes the learning rate schedule. We introduce a multiplier, SOLVER.EARLY_STOPPING_FRACTION (a value between 0 and 1), that is applied on top of SOLVER.MAX_ITER when deciding how many steps to train for. This multiplier does not scale the number of steps over which the learning rate schedule is defined.

Reviewed By: raghuramank100

Differential Revision: D48699087

fbshipit-source-id: 903f7c957ee471f36365c1449e9cd6a919fd260a
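To make the distinction in the summary concrete, here is a minimal sketch (not d2go code). It assumes a cosine learning rate schedule purely for illustration, since the summary does not say which schedule is in use; `cosine_lr`, `MAX_ITER`, `FRACTION`, and `BASE_LR` are hypothetical names and values. It compares the learning rate at the last trained step when the schedule length is shrunk versus when only the stopping point is moved.

```python
import math


def cosine_lr(step, base_lr, schedule_steps):
    """Cosine decay from base_lr towards 0 over schedule_steps (illustrative only)."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / schedule_steps))


MAX_ITER = 1000   # hypothetical baseline training length
FRACTION = 0.25   # hypothetical fraction of the baseline run to train for
BASE_LR = 0.1     # hypothetical base learning rate

# Number of steps actually trained, computed the same way as in the diff.
stop_iter = int(MAX_ITER * FRACTION)
last_step = stop_iter - 1

# Option A: shrink MAX_ITER itself. The schedule is compressed, so the
# learning rate has already decayed to almost zero at the last step.
lr_shrunk_schedule = cosine_lr(last_step, BASE_LR, schedule_steps=stop_iter)

# Option B: keep the schedule defined over MAX_ITER and only stop early.
# The learning rate matches what the baseline run sees at the same step.
lr_early_stop = cosine_lr(last_step, BASE_LR, schedule_steps=MAX_ITER)
lr_baseline = cosine_lr(last_step, BASE_LR, schedule_steps=MAX_ITER)

print(f"stop_iter={stop_iter}")
print(f"shrunk schedule lr at last step: {lr_shrunk_schedule:.6f}")
print(f"early-stop lr at last step:      {lr_early_stop:.6f}")
```

Under these assumptions, the early-stopped run follows the baseline trajectory exactly up to the stopping point, which is what makes partial runs comparable across HPO trials.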

Commit 3c724416 authored Oct 12, 2023 by Igor Fedorov Committed by Facebook GitHub Bot Oct 12, 2023

Showing with 12 additions and 1 deletion

d2go/runner/default_runner.py +12 -1

--- a/d2go/runner/default_runner.py
+++ b/d2go/runner/default_runner.py
@@ -572,7 +572,18 @@ class Detectron2GoRunner(D2GoDataAPIMixIn, BaseRunner):
            # The checkpoint stores the training iteration that just finished, thus we start
            # at the next iteration (or iter zero if there's no checkpoint).
            start_iter += 1
-            max_iter = cfg.SOLVER.MAX_ITER
+
+            if "EARLY_STOPPING_FRACTION" in cfg.SOLVER:
+                assert (
+                    cfg.SOLVER.EARLY_STOPPING_FRACTION >= 0
+                ), f"Early stopping fraction must be non-negative, but is {cfg.SOLVER.EARLY_STOPPING_FRACTION}"
+                assert (
+                    cfg.SOLVER.EARLY_STOPPING_FRACTION <= 1
+                ), f"Early stopping fraction must not be larger than 1, but is {cfg.SOLVER.EARLY_STOPPING_FRACTION}"
+                max_iter = int(cfg.SOLVER.MAX_ITER * cfg.SOLVER.EARLY_STOPPING_FRACTION)
+            else:
+                max_iter = cfg.SOLVER.MAX_ITER
+
            periodic_checkpointer = PeriodicCheckpointer(
                checkpointer, cfg.SOLVER.CHECKPOINT_PERIOD, max_iter=max_iter
            )
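
The branch added in this hunk can be tried outside the runner with a small stand-alone sketch. The helper name `resolve_max_iter` and the plain-dict config are assumptions for illustration; the commit itself reads from `cfg.SOLVER` and validates with `assert`, whereas this sketch raises `ValueError` so the check still runs when Python is started with `-O`.

```python
def resolve_max_iter(solver_cfg):
    """Return the number of iterations to train for.

    solver_cfg is a plain dict standing in for cfg.SOLVER (an assumption of
    this sketch). If EARLY_STOPPING_FRACTION is absent, MAX_ITER is used as is.
    """
    max_iter = solver_cfg["MAX_ITER"]
    if "EARLY_STOPPING_FRACTION" not in solver_cfg:
        return max_iter

    fraction = solver_cfg["EARLY_STOPPING_FRACTION"]
    # Same bounds as the asserts in the diff: 0 <= fraction <= 1.
    if not 0 <= fraction <= 1:
        raise ValueError(
            f"Early stopping fraction must be in [0, 1], but is {fraction}"
        )
    # int() truncates toward zero, matching the diff.
    return int(max_iter * fraction)


full = resolve_max_iter({"MAX_ITER": 90000})
half = resolve_max_iter({"MAX_ITER": 90000, "EARLY_STOPPING_FRACTION": 0.5})
truncated = resolve_max_iter({"MAX_ITER": 10, "EARLY_STOPPING_FRACTION": 0.25})
print(full, half, truncated)
```

Two consequences of the computation are worth keeping in mind: because `int()` truncates, a product such as 10 * 0.25 = 2.5 becomes 2 iterations, and a fraction of 0 passes validation but yields a `max_iter` of 0.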