We employ a multi-set fine-tuning stage where we uniformly sample from multiple datasets. Given that some of these datasets contain extremely large images (``2048x2048`` or more), we opt for a very aggressive scale range of ``[0.2 - 0.8]`` so that as much of the original frame composition as possible is captured inside the ``384x512`` crop.
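As a rough illustration (the function names and the resize-then-crop order below are assumptions, not the actual transform pipeline), drawing a scale from ``[0.2 - 0.8]`` shrinks the frame before the fixed-size crop, so a single ``384x512`` window covers a much larger share of the original composition:

```python
import random

# Illustrative sketch only: the names and the resize-then-crop order are
# assumptions, not the actual training transform pipeline.
def sample_scale(scale_range=(0.2, 0.8)):
    """Draw a resize scale uniformly from the configured range."""
    return random.uniform(*scale_range)

def resized_dims(orig_hw, scale):
    """Frame dimensions after scaling, before the fixed 384x512 crop."""
    h, w = orig_hw
    return int(h * scale), int(w * scale)

# At scale 0.4 a 2048x2048 frame shrinks to 819x819, so the 384x512 crop
# covers a far larger portion of the scene than it would at full resolution.
dims = resized_dims((2048, 2048), 0.4)
```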
This should give an **mae of about 1.416** on the train set of `Middlebury2014`. Results may vary slightly depending on the batch size and the number of GPUs. For the most accurate results, use 1 GPU and `--batch-size 1`. The created log file should look like this, where the first key is the number of cascades and the nested key is the number of recursive iterations:
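As an illustration only, the nested structure can be sketched as follows; the cascade and iteration key values here are placeholders, and only the ~1.416 mae figure comes from the run described above:

```python
# Illustrative shape of the evaluation log: the outer key is the number of
# cascades, the nested key the number of recursive iterations. The key values
# are placeholders; only the ~1.416 mae figure comes from the text above.
log = {
    "2": {              # number of cascades (placeholder)
        "10": {         # number of recursive iterations (placeholder)
            "mae": 1.416,
        },
    },
}
```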
We encourage users to be aware of the **aspect-ratio** and **disparity scale** they are targeting when doing any sort of training or fine-tuning. The model is highly sensitive to these two factors; as a consequence, naive multi-set fine-tuning can achieve `0.2 mae` relatively fast. We recommend that users pay close attention to how they **balance dataset sizing** when training such networks.
Ideally, dataset scaling should be treated at an individual level, and a thorough **EDA** of the disparity distribution in random crops at the desired training / inference size should be performed prior to any large compute investments.
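A minimal sketch of such an EDA, assuming a full-resolution ground-truth disparity map is available as a `numpy` array (all names here are illustrative):

```python
import numpy as np

# Hypothetical EDA sketch: summarize the ground-truth disparity observed inside
# random crops at the intended training size. `disparity` stands in for a
# full-resolution ground-truth disparity map; the names are illustrative.
def crop_disparity_stats(disparity, crop_hw=(384, 512), n_crops=100, seed=0):
    rng = np.random.default_rng(seed)
    ch, cw = crop_hw
    h, w = disparity.shape
    samples = []
    for _ in range(n_crops):
        top = rng.integers(0, h - ch + 1)
        left = rng.integers(0, w - cw + 1)
        samples.append(disparity[top:top + ch, left:left + cw].ravel())
    values = np.concatenate(samples)
    return {"mean": float(values.mean()), "p95": float(np.percentile(values, 95))}

# Example on a synthetic 2048x2048 ramp where disparity grows linearly with x;
# comparing these stats across datasets reveals mismatched disparity scales.
fake = np.tile(np.linspace(0, 256, 2048, dtype=np.float32), (2048, 1))
stats = crop_disparity_stats(fake)
```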
### Disparity scaling
##### Sample A
The top row contains a sample from `Sintel` whereas the bottom row one from `Middlebury`.

From left to right (`left_image`, `right_image`, `valid_mask`, `valid_mask & ground_truth`, `prediction`). **Darker is further away, lighter is closer**. In the case of `Sintel`, which is more closely aligned to the original distribution of `CREStereo`, we notice that the model accurately predicts the background scale, whereas in the case of `Middlebury2014` it cannot correctly estimate the continuous disparity. Notice that the frame composition is similar for both examples. The blue skybox in the `Sintel` scene behaves similarly to the `Middlebury` black background. However, because the `Middlebury` sample comes from an extremely large scene, the crop size of `384x512` does not correctly capture the general training distribution.
##### Sample B
The top row contains a scene from `Sceneflow` using the `Monkaa` split whilst the bottom row is a scene from `Middlebury`. This sample exhibits the same issues when it comes to **background estimation**. Given the exaggerated size of the `Middlebury` samples, the model **collapses the smooth background** of the sample to what it considers to be a mean background disparity value.
For more detail on why this behaviour occurs based on the training distribution proportions you can read more about the network at: https://github.com/pytorch/vision/pull/6629#discussion_r978160493
### Metric overfitting
##### Learning is critical in the beginning
We also advise users to make use of faster training schedules, as the performance gain over long periods of time is marginal. Here we exhibit the difference between a faster decay schedule and a later decay schedule.

In **grey** we set the lr decay to begin after `30000` steps whilst in **orange** we opt for a very late learning rate decay at around `180000` steps. Although it exhibits stronger variance, we can notice that unfreezing the learning rate earlier whilst employing `gradient-norm` outperforms the default configuration.
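The two schedules can be sketched as follows, assuming a flat hold followed by a linear decay to zero (the base lr value and the linear shape are assumptions; the decay-start steps come from the runs above):

```python
# Sketch of the compared schedules: hold the base lr flat, then decay linearly
# to zero. base_lr=4e-4 and the linear shape are assumptions; the 30_000 /
# 180_000 decay-start steps come from the grey / orange runs described above.
def lr_at(step, base_lr=4e-4, decay_start=30_000, total_steps=300_000):
    if step < decay_start:
        return base_lr
    frac = (step - decay_start) / (total_steps - decay_start)
    return base_lr * (1.0 - frac)

early = lr_at(150_000, decay_start=30_000)    # grey: already well into decay
late = lr_at(150_000, decay_start=180_000)    # orange: still at the base lr
```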
##### Gradient norm saves time

In **grey** we keep ``gradient norm`` enabled whilst in **orange** we do not. We can notice that removing the gradient norm exacerbates the performance decrease in the early stages whilst also showcasing an almost complete collapse around the `60000`-step mark where we started decaying the lr for **orange**.
Although both runs achieve an improvement of about ``0.1`` mae after the lr decay starts, the benefits are observable much faster when ``gradient norm`` is employed, as the recovery period is no longer accounted for.
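In PyTorch this corresponds to calling `torch.nn.utils.clip_grad_norm_` on the model parameters before each optimizer step; the pure-Python sketch below only shows the mechanics, and the `max_norm=1.0` threshold is an assumption:

```python
import math

# Sketch of gradient-norm clipping in pure Python. In practice this is
# torch.nn.utils.clip_grad_norm_ applied to the model parameters before each
# optimizer step; max_norm=1.0 is an illustrative threshold.
def clip_grad_norm(grads, max_norm=1.0):
    total = math.sqrt(sum(g * g for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads, total

clipped, norm = clip_grad_norm([3.0, 4.0])  # norm 5.0 -> rescaled to norm 1.0
```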
```python
parser.add_argument("--name", default="crestereo", help="name of the experiment")
parser.add_argument("--resume", type=str, default=None, help="from which checkpoint to resume")
parser.add_argument("--checkpoint-dir", type=str, default="checkpoints", help="path to the checkpoint directory")
# dataset
parser.add_argument("--dataset-root", type=str, default="", help="path to the dataset root directory")
parser.add_argument(
    "--train-datasets",
    type=str,
    nargs="+",
    default=["crestereo"],
    help="dataset(s) to train on",
    choices=list(VALID_DATASETS.keys()),
)
parser.add_argument(
    "--dataset-steps", type=int, nargs="+", default=[300_000], help="number of steps for each dataset"
)
parser.add_argument(
    "--steps-is-epochs", action="store_true", help="if set, dataset-steps are interpreted as epochs"
)
parser.add_argument(
    "--test-datasets",
    type=str,
    nargs="+",
    default=["middlebury2014-train"],
    help="dataset(s) to test on",
    choices=["middlebury2014-train"],
)
parser.add_argument("--dataset-shuffle", type=bool, help="shuffle the dataset", default=True)
parser.add_argument("--dataset-order-shuffle", type=bool, help="shuffle the dataset order", default=True)
parser.add_argument("--batch-size", type=int, default=2, help="batch size per GPU")
parser.add_argument("--workers", type=int, default=4, help="number of workers per GPU")
parser.add_argument(
    "--threads",
    type=int,
    default=16,
    help="number of CPU threads per GPU. This can be changed around to speed-up transforms if needed. This can lead to worker thread contention so use with care.",
)
# model architecture
parser.add_argument(
    "--model",
    type=str,
    default="crestereo_base",
    help="model architecture",
    choices=["crestereo_base", "raft_stereo"],
)
parser.add_argument("--recurrent-updates", type=int, default=10, help="number of recurrent updates")
```