Unverified commit 7e8186e0 authored by Vasilis Vryniotis, committed by GitHub

Add support of MViTv2 video variants (#6373)

* Extending to support MViTv2

* Fix docs, mypy and linter

* Refactor the relative positional code.

* Code refactoring.

* Rename vars.

* Update docs.

* Replace assert with exception.

* Update docs.

* Minor refactoring.

* Remove the square input limitation.

* Moving methods around.

* Modify the shortcut in the attention layer.

* Add ported weights.

* Introduce a `residual_cls` config on the attention layer.

* Make the patch_embed kernel/padding/stride configurable.

* Apply changes from code-review.

* Remove stale todo.
parent 6908129a
@@ -12,7 +12,7 @@ The MViT model is based on the
Model builders
--------------
The following model builders can be used to instantiate a MViT v1 or v2 model, with or
without pre-trained weights. All the model builders internally rely on the
``torchvision.models.video.MViT`` base class. Please refer to the `source
code
@@ -24,3 +24,4 @@ more details about this class.
:template: function.rst
mvit_v1_b
mvit_v2_s
@@ -309,6 +309,9 @@ _model_params = {
"mvit_v1_b": {
"input_shape": (1, 3, 16, 224, 224),
},
"mvit_v2_s": {
"input_shape": (1, 3, 16, 224, 224),
},
}
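A sketch of how a test harness might consume these per-model overrides to build a dummy input. Only the `_model_params` entries come from the diff above; the helper name and the fallback shape are assumptions for illustration:

```python
import torch

# Per-model overrides, as added in this PR; video models need a
# 5-D clip tensor (batch, channels, frames, height, width).
_model_params = {
    "mvit_v1_b": {"input_shape": (1, 3, 16, 224, 224)},
    "mvit_v2_s": {"input_shape": (1, 3, 16, 224, 224)},
}


def make_dummy_input(model_name, default_shape=(1, 3, 224, 224)):
    # Hypothetical helper: use the model's override if present,
    # otherwise fall back to a generic image-sized input.
    shape = _model_params.get(model_name, {}).get("input_shape", default_shape)
    return torch.rand(shape)


x = make_dummy_input("mvit_v2_s")
print(tuple(x.shape))  # (1, 3, 16, 224, 224)
```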
# speeding up slow models:
slow_models = [