* We provide video classification models with two backbones: [SlowOnly](https://arxiv.org/abs/1812.03982) and 3D-ResNet (R3D), the latter as used in [Spatiotemporal Contrastive Video Representation Learning](https://arxiv.org/abs/2008.03800).
* Training and evaluation details:
  * All models are trained from scratch on the visual modality (RGB) only, for 200 epochs.
  * We use a batch size of 1024 and cosine learning rate decay, with linear warmup over the first 5 epochs.
  * We follow [SlowFast](https://arxiv.org/abs/1812.03982) and perform 30-view evaluation (10 temporal clips x 3 spatial crops per video).
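
The learning rate schedule described above can be sketched as follows. This is a minimal illustration, not the training code used for these models: the function name `learning_rate` and the `base_lr` value are hypothetical, and the sketch assumes that warmup starts from zero and that the cosine decay spans the epochs remaining after warmup and ends at zero.

```python
import math


def learning_rate(epoch, base_lr=1.0, total_epochs=200, warmup_epochs=5):
    """Return the learning rate at a (possibly fractional) epoch.

    base_lr is a placeholder; the peak learning rate is not stated here.
    """
    if epoch < warmup_epochs:
        # Linear warmup: ramp from 0 up to base_lr (assumed starting point).
        return base_lr * epoch / warmup_epochs
    # Cosine decay over the remaining epochs, from base_lr down to 0 (assumed).
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate peaks at `base_lr` at the end of epoch 5 and falls to half of `base_lr` midway through the decay phase (epoch 102.5).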
### Kinetics-400 Action Recognition Baselines
| model | input (frame x stride) | Top-1 | Top-5 | download |