"...text-generation-inference.git" did not exist on "c4422e567811051e40469551c58a8078158a6656"
Unverified commit b1054cbb, authored by Aditya Oke, committed by GitHub

Add Video SwinTransformer (#6521)



* Just start adding mere copy paste

* Replace d with t and D with T

* Align swin transformer video to image a bit

* Rename d -> t

* align with 2d impl

* align with 2d impl

* Add helpful comments and config for 3d

* add docs

* Add docs

* Add configurations

* Add docs

* Fix bugs

* Fix wrong edit

* Fix wrong edit

* Fix bugs

* Fix bugs

* Fix as per fx suggestions

* Update torchvision/models/video/swin_transformer.py

* Fix as per fx suggestions

* Fix expect files and code

* Update the expect files

* Modify video swin

* Add min size and min temporal size, num params

* Add flops and size

* Fix types

* Fix url recipe
Co-authored-by: Yosua Michael Maranatha <yosuamichael@fb.com>
parent 3f4dcae6
@@ -518,6 +518,7 @@ pre-trained weights:
     models/video_mvit
     models/video_resnet
     models/video_s3d
+    models/video_swin_transformer
     |
@@ -15,7 +15,7 @@ Model builders
 --------------
 The following model builders can be used to instantiate a SwinTransformer model (original and V2) with and without pre-trained weights.
 All the model builders internally rely on the ``torchvision.models.swin_transformer.SwinTransformer``
 base class. Please refer to the `source code
 <https://github.com/pytorch/vision/blob/main/torchvision/models/swin_transformer.py>`_ for
 more details about this class.
Video SwinTransformer
=====================

.. currentmodule:: torchvision.models.video

The Video SwinTransformer model is based on the `Video Swin Transformer <https://arxiv.org/abs/2106.13230>`__ paper.

.. betastatus:: video module

Model builders
--------------

The following model builders can be used to instantiate a Video SwinTransformer model, with or
without pre-trained weights. All the model builders internally rely on the
``torchvision.models.video.swin_transformer.SwinTransformer3d`` base class. Please refer to the `source
code
<https://github.com/pytorch/vision/blob/main/torchvision/models/video/swin_transformer.py>`_ for
more details about this class.

.. autosummary::
    :toctree: generated/
    :template: function.rst

    swin3d_t
    swin3d_s
    swin3d_b
File suppressed by a .gitattributes entry or the file's encoding is unsupported.
@@ -494,6 +494,8 @@ class SwinTransformerBlockV2(SwinTransformerBlock):
         )

     def forward(self, x: Tensor):
+        # Here is the difference, we apply norm after the attention in V2.
+        # In V1 we applied norm before the attention.
         x = x + self.stochastic_depth(self.norm1(self.attn(x)))
         x = x + self.stochastic_depth(self.norm2(self.mlp(x)))
         return x
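The comment added in this hunk describes the only difference between the V1 and V2 blocks: where the norm sits on each residual branch. A minimal sketch with plain callables (stochastic depth omitted; the function names are illustrative, not torchvision's API):

```python
def v1_block(x, attn, mlp, norm1, norm2):
    # V1 ("pre-norm"): normalize first, then attention / MLP, then residual add.
    x = x + attn(norm1(x))
    x = x + mlp(norm2(x))
    return x

def v2_block(x, attn, mlp, norm1, norm2):
    # V2 ("res-post-norm"): attention / MLP first, then normalize the branch
    # output before the residual add.
    x = x + norm1(attn(x))
    x = x + norm2(mlp(x))
    return x
```

With identity norms the two orderings coincide; with any non-trivial norm they diverge, which is the point of the V2 change (the Swin V2 paper reports better training stability at scale from post-norm residual branches).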
@@ -587,7 +589,7 @@ class SwinTransformer(nn.Module):
         num_features = embed_dim * 2 ** (len(depths) - 1)
         self.norm = norm_layer(num_features)
-        self.permute = Permute([0, 3, 1, 2])
+        self.permute = Permute([0, 3, 1, 2])  # B H W C -> B C H W
         self.avgpool = nn.AdaptiveAvgPool2d(1)
         self.flatten = nn.Flatten(1)
         self.head = nn.Linear(num_features, num_classes)
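The classifier tail shown in the hunk above (norm, permute, average-pool, flatten, linear) is equivalent to taking the mean of the ``B H W C`` feature map over its spatial positions and applying a linear layer. A small sketch assuming PyTorch, with illustrative sizes rather than the model's real ones:

```python
import torch
from torch import nn

# Illustrative sizes; in the real model num_features would be
# embed_dim * 2 ** (len(depths) - 1), as in the diff above.
num_features, num_classes = 8, 5
norm = nn.LayerNorm(num_features)
head = nn.Linear(num_features, num_classes)

x = torch.randn(2, 7, 7, num_features)   # B H W C feature map
y = norm(x).permute(0, 3, 1, 2)          # B C H W
y = nn.AdaptiveAvgPool2d(1)(y)           # B C 1 1
y = torch.flatten(y, 1)                  # B C
logits = head(y)                         # B num_classes

# The same computation written as a plain mean over the spatial dims:
logits_direct = head(norm(x).mean(dim=(1, 2)))
```

The permute is needed only because `AdaptiveAvgPool2d` expects channels-first input; the `# B H W C -> B C H W` comment added in this commit documents exactly that.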
from .mvit import *
from .resnet import *
from .s3d import *
from .swin_transformer import *