Commit 60443d7f authored by A. Unique TensorFlower's avatar A. Unique TensorFlower

Merge pull request #9536 from PurdueCAM2Project:yolo

PiperOrigin-RevId: 347467379
parents 3b7017a5 84df9351
# YOLO Object Detectors, You Only Look Once
[![Paper](http://img.shields.io/badge/Paper-arXiv.1804.02767-B3181B?logo=arXiv)](https://arxiv.org/abs/1804.02767)
[![Paper](http://img.shields.io/badge/Paper-arXiv.2004.10934-B3181B?logo=arXiv)](https://arxiv.org/abs/2004.10934)
This repository is an unofficial implementation of the following papers. We
have taken great care to keep every component faithful to the original papers
and to the original Darknet repository.
* [YOLOv3: An Incremental Improvement](https://arxiv.org/abs/1804.02767)
* [YOLOv4: Optimal Speed and Accuracy of Object Detection](https://arxiv.org/abs/2004.10934)
## Description
YOLOv1, the original implementation, was released in 2015, providing a
ground-breaking algorithm that quickly processes images and locates objects in
a single pass through the detector. The original implementation used a
backbone derived from state-of-the-art image classifiers of the time, such as
[GoogLeNet](https://arxiv.org/abs/1409.4842) and
[VGG](https://arxiv.org/abs/1409.1556). More attention was given to the novel
YOLO detection head, which allowed object detection in a single pass over an
image. Though limited, the network could predict up to 90 bounding boxes per
image and was tested on about 80 classes per box. The model could also only
make predictions at a single scale. These attributes made YOLOv1 more limited
and less versatile, so as the years passed, the developers continued to update
and improve the model.
YOLOv3 and YOLOv4 are the most up-to-date and capable versions of the YOLO
family. These models use a custom backbone called Darknet53 that applies
lessons from the ResNet paper to improve their predictions. The new backbone
also allows objects to be detected at multiple scales. As for the new
detection head, the model now predicts bounding boxes using a set of anchor
box priors (anchor boxes) as suggestions. The multiscale predictions, in
combination with the anchor boxes, allow the network to make up to 1000 object
predictions on a single image. Finally, the new loss function forces the
network to make better predictions by using Intersection over Union (IoU) to
inform the model's confidence rather than relying on the mean squared error of
the entire output.
## Authors
* Vishnu Samardh Banna ([@GitHub vishnubanna](https://github.com/vishnubanna))
* Anirudh Vegesana ([@GitHub anivegesana](https://github.com/anivegesana))
* Akhil Chinnakotla ([@GitHub The-Indian-Chinna](https://github.com/The-Indian-Chinna))
* Tristan Yan ([@GitHub Tyan3001](https://github.com/Tyan3001))
* Naveen Vivek ([@GitHub naveen-vivek](https://github.com/naveen-vivek))
## Table of Contents
* [Our Goal](#our-goal)
* [Models in the library](#models-in-the-library)
* [References](#references)
## Our Goal
Our goal with this conversion is to provide implementations of the Darknet
backbones and the YOLO detection head. We have built the model in such a way
that the YOLO head can be connected to a new, more powerful backbone if a
person chooses to, as sketched below.
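As a rough sketch of what this might look like (assuming this project's
package paths; the loop at the end only stands in for a real detection head),
the backbone can be built on its own and its multilevel endpoints handed to
any head:

```python
import tensorflow as tf

from official.vision.beta.projects.yolo.modeling.backbones import darknet

# Build a CSPDarknet53 backbone that reports feature maps for levels 3-5.
backbone = darknet.Darknet(model_id="cspdarknet53", min_level=3, max_level=5)

inputs = tf.keras.Input(shape=(416, 416, 3))
endpoints = backbone(inputs)  # dict keyed by level: {"3": ..., "4": ..., "5": ...}

# Any head that accepts a dict of multilevel feature maps can be attached here.
for level, features in endpoints.items():
  print(level, features.shape)
```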
## Models in the library
| Object Detectors | Classifiers  |
| :--------------: | :----------: |
| Yolo-v3          | Darknet53    |
| Yolo-v3 tiny     | CSPDarknet53 |
| Yolo-v3 spp      |              |
| Yolo-v4          |              |
| Yolo-v4 tiny     |              |
## Requirements
[![TensorFlow 2.2](https://img.shields.io/badge/TensorFlow-2.2-FF6F00?logo=tensorflow)](https://github.com/tensorflow/tensorflow/releases/tag/v2.2.0)
[![Python 3.8](https://img.shields.io/badge/Python-3.8-3776AB)](https://www.python.org/downloads/release/python-380/)
# Copyright 2020 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""All necessary imports for registration."""
# pylint: disable=unused-import
from official.common import registry_imports
from official.vision.beta.projects.yolo.configs import darknet_classification
from official.vision.beta.projects.yolo.modeling.backbones import darknet
from official.vision.beta.projects.yolo.tasks import image_classification
# Lint as: python3
# Copyright 2020 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Backbones configurations."""
import dataclasses
from official.modeling import hyperparams
from official.vision.beta.configs import backbones
@dataclasses.dataclass
class DarkNet(hyperparams.Config):
"""DarkNet config."""
model_id: str = "darknet53"
@dataclasses.dataclass
class Backbone(backbones.Backbone):
darknet: DarkNet = DarkNet()
# Lint as: python3
# Copyright 2020 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Image classification with darknet configs."""
from typing import List, Optional
import dataclasses
from official.core import config_definitions as cfg
from official.core import exp_factory
from official.modeling import hyperparams
from official.vision.beta.configs import common
from official.vision.beta.configs import image_classification as imc
from official.vision.beta.projects.yolo.configs import backbones
@dataclasses.dataclass
class ImageClassificationModel(hyperparams.Config):
num_classes: int = 0
input_size: List[int] = dataclasses.field(default_factory=list)
backbone: backbones.Backbone = backbones.Backbone(
type='darknet', darknet=backbones.DarkNet())
dropout_rate: float = 0.0
norm_activation: common.NormActivation = common.NormActivation()
# Adds a BatchNormalization layer pre-GlobalAveragePooling in classification
add_head_batch_norm: bool = False
@dataclasses.dataclass
class Losses(hyperparams.Config):
one_hot: bool = True
label_smoothing: float = 0.0
l2_weight_decay: float = 0.0
@dataclasses.dataclass
class ImageClassificationTask(cfg.TaskConfig):
"""The model config."""
model: ImageClassificationModel = ImageClassificationModel()
train_data: imc.DataConfig = imc.DataConfig(is_training=True)
validation_data: imc.DataConfig = imc.DataConfig(is_training=False)
evaluation: imc.Evaluation = imc.Evaluation()
losses: Losses = Losses()
gradient_clip_norm: float = 0.0
logging_dir: Optional[str] = None
@exp_factory.register_config_factory('darknet_classification')
def image_classification() -> cfg.ExperimentConfig:
"""Image classification general."""
return cfg.ExperimentConfig(
task=ImageClassificationTask(),
trainer=cfg.TrainerConfig(),
restrictions=[
'task.train_data.is_training != None',
'task.validation_data.is_training != None'
])
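# Usage sketch (illustrative, not part of this module's API): once this module
# has been imported, the decorator above registers 'darknet_classification'
# with the experiment factory, and individual fields can be overridden. The
# values below are examples, not defaults.
def _example_get_experiment() -> cfg.ExperimentConfig:
  """Fetches the registered 'darknet_classification' experiment config."""
  config = exp_factory.get_exp_config('darknet_classification')
  config.task.model.num_classes = 1001
  config.task.model.input_size = [256, 256, 3]
  config.task.model.backbone.darknet.model_id = 'cspdarknet53'
  return config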
runtime:
distribution_strategy: 'mirrored'
mixed_precision_dtype: 'float32'
task:
model:
num_classes: 1001
input_size: [256, 256, 3]
backbone:
type: 'darknet'
darknet:
model_id: 'cspdarknet53'
norm_activation:
activation: 'mish'
losses:
l2_weight_decay: 0.0005
one_hot: true
label_smoothing: 0.1
train_data:
input_path: 'imagenet-2012-tfrecord/train*'
is_training: true
global_batch_size: 128
dtype: 'float16'
validation_data:
input_path: 'imagenet-2012-tfrecord/valid*'
is_training: true
global_batch_size: 128
dtype: 'float16'
drop_remainder: false
trainer:
train_steps: 1200000 # epochs: 120
validation_steps: 400 # size of validation data
validation_interval: 10000
steps_per_loop: 10000
summary_interval: 10000
checkpoint_interval: 10000
optimizer_config:
optimizer:
type: 'sgd'
sgd:
momentum: 0.9
learning_rate:
type: 'polynomial'
polynomial:
initial_learning_rate: 0.1
end_learning_rate: 0.0001
power: 4.0
decay_steps: 1200000
warmup:
type: 'linear'
linear:
warmup_steps: 1000 # learning rate rises from 0 to 0.1 over 1000 steps
runtime:
distribution_strategy: 'mirrored'
mixed_precision_dtype: 'float16'
loss_scale: 'dynamic'
num_gpus: 2
task:
model:
num_classes: 1001
input_size: [256, 256, 3]
backbone:
type: 'darknet'
darknet:
model_id: 'cspdarknet53'
norm_activation:
activation: 'mish'
losses:
l2_weight_decay: 0.0005
one_hot: true
train_data:
tfds_name: 'imagenet2012'
tfds_split: 'train'
tfds_data_dir: '~/tensorflow_datasets/'
tfds_download: true
is_training: true
global_batch_size: 16 # default = 128
dtype: 'float16'
shuffle_buffer_size: 100
validation_data:
tfds_name: 'imagenet2012'
tfds_split: 'validation'
tfds_data_dir: '~/tensorflow_datasets/'
tfds_download: true
is_training: true
global_batch_size: 16 # default = 128
dtype: 'float16'
drop_remainder: false
shuffle_buffer_size: 100
trainer:
train_steps: 9600000 # epochs: 120, 1200000 * 128/batchsize
validation_steps: 3200 # size of validation data, 400 * 128/batchsize
validation_interval: 10000 # 10000
steps_per_loop: 10000
summary_interval: 10000
checkpoint_interval: 10000
optimizer_config:
optimizer:
type: 'sgd'
sgd:
momentum: 0.9
learning_rate:
type: 'polynomial'
polynomial:
initial_learning_rate: 0.0125 # 0.1 * batchsize/128, default = 0.1
end_learning_rate: 0.0000125 # 0.0001 * batchsize/128, default = 0.0001
power: 4.0
decay_steps: 9592000 # 1199000 * 128/batchsize, default = 1200000 - 1000 = 1199000
warmup:
type: 'linear'
linear:
warmup_steps: 8000 # 1000 * 128/batchsize, default = 1000
runtime:
distribution_strategy: 'mirrored'
mixed_precision_dtype: 'float32'
task:
model:
num_classes: 1001
input_size: [256, 256, 3]
backbone:
type: 'darknet'
darknet:
model_id: 'darknet53'
norm_activation:
activation: 'mish'
losses:
l2_weight_decay: 0.0005
one_hot: true
train_data:
input_path: 'imagenet-2012-tfrecord/train*'
is_training: true
global_batch_size: 128
dtype: 'float16'
validation_data:
input_path: 'imagenet-2012-tfrecord/valid*'
is_training: true
global_batch_size: 128
dtype: 'float16'
drop_remainder: false
trainer:
train_steps: 800000 # epochs: 80
validation_steps: 400 # size of validation data
validation_interval: 10000
steps_per_loop: 10000
summary_interval: 10000
checkpoint_interval: 10000
optimizer_config:
optimizer:
type: 'sgd'
sgd:
momentum: 0.9
learning_rate:
type: 'polynomial'
polynomial:
initial_learning_rate: 0.1
end_learning_rate: 0.0001
power: 4.0
decay_steps: 800000
warmup:
type: 'linear'
linear:
warmup_steps: 1000 # learning rate rises from 0 to 0.1 over 1000 steps
runtime:
distribution_strategy: 'mirrored'
mixed_precision_dtype: 'float16'
loss_scale: 'dynamic'
num_gpus: 2
task:
model:
num_classes: 1001
input_size: [256, 256, 3]
backbone:
type: 'darknet'
darknet:
model_id: 'darknet53'
norm_activation:
activation: 'mish'
losses:
l2_weight_decay: 0.0005
one_hot: true
train_data:
tfds_name: 'imagenet2012'
tfds_split: 'train'
tfds_data_dir: '~/tensorflow_datasets/'
tfds_download: true
is_training: true
global_batch_size: 16 # default = 128
dtype: 'float16'
shuffle_buffer_size: 100
validation_data:
tfds_name: 'imagenet2012'
tfds_split: 'validation'
tfds_data_dir: '~/tensorflow_datasets/'
tfds_download: true
is_training: true
global_batch_size: 16 # default = 128
dtype: 'float16'
drop_remainder: false
shuffle_buffer_size: 100
trainer:
train_steps: 6400000 # epochs: 80, 800000 * 128/batchsize
validation_steps: 3200 # size of validation data, 400 * 128/batchsize
validation_interval: 10000 # 10000
steps_per_loop: 10000
summary_interval: 10000
checkpoint_interval: 10000
optimizer_config:
optimizer:
type: 'sgd'
sgd:
momentum: 0.9
learning_rate:
type: 'polynomial'
polynomial:
initial_learning_rate: 0.0125 # 0.1 * batchsize/128, default = 0.1
end_learning_rate: 0.0000125 # 0.0001 * batchsize/128, default = 0.0001
power: 4.0
decay_steps: 6392000 # 799000 * 128/batchsize, default = 800000 - 1000 = 799000
warmup:
type: 'linear'
linear:
warmup_steps: 8000 # 1000 * 128/batchsize, default = 1000
# Copyright 2020 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""TFDS Classification decoder."""
import tensorflow as tf
from official.vision.beta.dataloaders import decoder
class Decoder(decoder.Decoder):
"""A tf.Example decoder for classification task."""
def __init__(self):
return
def decode(self, serialized_example):
sample_dict = {
'image/encoded': tf.io.encode_jpeg(
serialized_example['image'], quality=100),
'image/class/label': serialized_example['label'],
}
return sample_dict
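# Usage sketch (illustrative): the sample dict below mimics a decoded TFDS
# imagenet2012 example, with a raw uint8 'image' tensor and an integer 'label'.
def _example_decode():
  """Re-encodes a TFDS-style sample into the tf.Example feature layout."""
  sample = {
      'image': tf.zeros([224, 224, 3], dtype=tf.uint8),
      'label': tf.constant(1, dtype=tf.int64),
  }
  decoded = Decoder().decode(sample)
  # decoded['image/encoded'] holds a JPEG-encoded string tensor and
  # decoded['image/class/label'] holds the original label.
  return decoded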
# Lint as: python3
# Copyright 2020 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Contains definitions of Darknet Backbone Networks.
The models are inspired by ResNet and CSPNet.
Residual networks (ResNets) were proposed in:
[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
Deep Residual Learning for Image Recognition. arXiv:1512.03385
Cross Stage Partial networks (CSPNets) were proposed in:
[1] Chien-Yao Wang, Hong-Yuan Mark Liao, I-Hau Yeh, Yueh-Hua Wu, Ping-Yang Chen,
Jun-Wei Hsieh
CSPNet: A New Backbone that can Enhance Learning Capability of CNN.
arXiv:1911.11929
Darknets are used mainly for object detection in:
[1] Joseph Redmon, Ali Farhadi
YOLOv3: An Incremental Improvement. arXiv:1804.02767
[2] Alexey Bochkovskiy, Chien-Yao Wang, Hong-Yuan Mark Liao
YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv:2004.10934
"""
import collections
import tensorflow as tf
from official.vision.beta.modeling.backbones import factory
from official.vision.beta.projects.yolo.modeling.layers import nn_blocks
class BlockConfig(object):
"""Get layer config to make code more readable.
Args:
layer: string layer name
stack: the type of layer ordering to use for this specific level
reps: integer for the number of times to repeat the block
bottleneck: boolean for whether this stack has a bottleneck layer
filters: integer for the output depth of the level
pool_size: integer pool size for max pool layers
kernel_size: optional integer, for convolution kernel size
strides: integer or tuple to indicate convolution strides
padding: the padding to apply to layers in this stack
activation: string for the activation to use for this stack
route: integer for what level to route from to get the next input
output_name: the name to use for this output
is_output: is this layer an output in the default model
"""
def __init__(self, layer, stack, reps, bottleneck, filters, pool_size,
kernel_size, strides, padding, activation, route, output_name,
is_output):
self.layer = layer
self.stack = stack
self.repetitions = reps
self.bottleneck = bottleneck
self.filters = filters
self.kernel_size = kernel_size
self.pool_size = pool_size
self.strides = strides
self.padding = padding
self.activation = activation
self.route = route
self.output_name = output_name
self.is_output = is_output
def build_block_specs(config):
specs = []
for layer in config:
specs.append(BlockConfig(*layer))
return specs
class LayerFactory(object):
"""Class for quick look up of default layers.
Used by darknet to connect, introduce or exit a level. Used in place of an if
condition or switch to make adding new layers easier and to reduce redundant
code.
"""
def __init__(self):
self._layer_dict = {
"ConvBN": (nn_blocks.ConvBN, self.conv_bn_config_todict),
"MaxPool": (tf.keras.layers.MaxPool2D, self.maxpool_config_todict)
}
def conv_bn_config_todict(self, config, kwargs):
dictvals = {
"filters": config.filters,
"kernel_size": config.kernel_size,
"strides": config.strides,
"padding": config.padding
}
dictvals.update(kwargs)
return dictvals
def darktiny_config_todict(self, config, kwargs):
dictvals = {"filters": config.filters, "strides": config.strides}
dictvals.update(kwargs)
return dictvals
def maxpool_config_todict(self, config, kwargs):
return {
"pool_size": config.pool_size,
"strides": config.strides,
"padding": config.padding,
"name": kwargs["name"]
}
def __call__(self, config, kwargs):
layer, get_param_dict = self._layer_dict[config.layer]
param_dict = get_param_dict(config, kwargs)
return layer(**param_dict)
# model configs
LISTNAMES = [
"default_layer_name", "level_type", "number_of_layers_in_level",
"bottleneck", "filters", "kernal_size", "pool_size", "strides", "padding",
"default_activation", "route", "level/name", "is_output"
]
# pylint: disable=line-too-long
CSPDARKNET53 = {
"list_names": LISTNAMES,
"splits": {"backbone_split": 106,
"neck_split": 138},
"backbone": [
["ConvBN", None, 1, False, 32, None, 3, 1, "same", "mish", -1, 0, False],
["DarkRes", "csp", 1, True, 64, None, None, None, None, "mish", -1, 1, False],
["DarkRes", "csp", 2, False, 128, None, None, None, None, "mish", -1, 2, False],
["DarkRes", "csp", 8, False, 256, None, None, None, None, "mish", -1, 3, True],
["DarkRes", "csp", 8, False, 512, None, None, None, None, "mish", -1, 4, True],
["DarkRes", "csp", 4, False, 1024, None, None, None, None, "mish", -1, 5, True],
]
}
DARKNET53 = {
"list_names": LISTNAMES,
"splits": {"backbone_split": 76},
"backbone": [
["ConvBN", None, 1, False, 32, None, 3, 1, "same", "leaky", -1, 0, False],
["DarkRes", "residual", 1, True, 64, None, None, None, None, "leaky", -1, 1, False],
["DarkRes", "residual", 2, False, 128, None, None, None, None, "leaky", -1, 2, False],
["DarkRes", "residual", 8, False, 256, None, None, None, None, "leaky", -1, 3, True],
["DarkRes", "residual", 8, False, 512, None, None, None, None, "leaky", -1, 4, True],
["DarkRes", "residual", 4, False, 1024, None, None, None, None, "leaky", -1, 5, True],
]
}
CSPDARKNETTINY = {
"list_names": LISTNAMES,
"splits": {"backbone_split": 28},
"backbone": [
["ConvBN", None, 1, False, 32, None, 3, 2, "same", "leaky", -1, 0, False],
["ConvBN", None, 1, False, 64, None, 3, 2, "same", "leaky", -1, 1, False],
["CSPTiny", "csp_tiny", 1, False, 64, None, 3, 2, "same", "leaky", -1, 2, False],
["CSPTiny", "csp_tiny", 1, False, 128, None, 3, 2, "same", "leaky", -1, 3, False],
["CSPTiny", "csp_tiny", 1, False, 256, None, 3, 2, "same", "leaky", -1, 4, True],
["ConvBN", None, 1, False, 512, None, 3, 1, "same", "leaky", -1, 5, True],
]
}
DARKNETTINY = {
"list_names": LISTNAMES,
"splits": {"backbone_split": 14},
"backbone": [
["ConvBN", None, 1, False, 16, None, 3, 1, "same", "leaky", -1, 0, False],
["DarkTiny", "tiny", 1, True, 32, None, 3, 2, "same", "leaky", -1, 1, False],
["DarkTiny", "tiny", 1, True, 64, None, 3, 2, "same", "leaky", -1, 2, False],
["DarkTiny", "tiny", 1, False, 128, None, 3, 2, "same", "leaky", -1, 3, False],
["DarkTiny", "tiny", 1, False, 256, None, 3, 2, "same", "leaky", -1, 4, True],
["DarkTiny", "tiny", 1, False, 512, None, 3, 2, "same", "leaky", -1, 5, False],
["DarkTiny", "tiny", 1, False, 1024, None, 3, 1, "same", "leaky", -1, 5, True],
]
}
# pylint: enable=line-too-long
BACKBONES = {
"darknettiny": DARKNETTINY,
"darknet53": DARKNET53,
"cspdarknet53": CSPDARKNET53,
"cspdarknettiny": CSPDARKNETTINY
}
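# Usage sketch (illustrative): the tables above expand into BlockConfig
# objects whose fields follow the names listed in LISTNAMES.
def _example_block_specs():
  """Builds the darknet53 block specs and returns (repetitions, filters)."""
  specs = build_block_specs(DARKNET53["backbone"])
  return [(spec.repetitions, spec.filters) for spec in specs]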
@tf.keras.utils.register_keras_serializable(package="yolo")
class Darknet(tf.keras.Model):
"""Darknet backbone."""
def __init__(
self,
model_id="darknet53",
input_specs=tf.keras.layers.InputSpec(shape=[None, None, None, 3]),
min_level=None,
max_level=5,
activation=None,
use_sync_bn=False,
norm_momentum=0.99,
norm_epsilon=0.001,
kernel_initializer="glorot_uniform",
kernel_regularizer=None,
bias_regularizer=None,
**kwargs):
layer_specs, splits = Darknet.get_model_config(model_id)
self._model_name = model_id
self._splits = splits
self._input_shape = input_specs
self._registry = LayerFactory()
# default layer look up
self._min_size = min_level
self._max_size = max_level
self._output_specs = None
self._kernel_initializer = kernel_initializer
self._bias_regularizer = bias_regularizer
self._norm_momentum = norm_momentum
self._norm_epsilon = norm_epsilon
self._use_sync_bn = use_sync_bn
self._activation = activation
self._kernel_regularizer = kernel_regularizer
self._default_dict = {
"kernel_initializer": self._kernel_initializer,
"kernel_regularizer": self._kernel_regularizer,
"bias_regularizer": self._bias_regularizer,
"norm_momentum": self._norm_momentum,
"norm_epsilon": self._norm_epislon,
"use_sync_bn": self._use_sync_bn,
"activation": self._activation,
"name": None
}
inputs = tf.keras.layers.Input(shape=self._input_shape.shape[1:])
output = self._build_struct(layer_specs, inputs)
super().__init__(inputs=inputs, outputs=output, name=self._model_name)
@property
def input_specs(self):
return self._input_shape
@property
def output_specs(self):
return self._output_specs
@property
def splits(self):
return self._splits
def _build_struct(self, net, inputs):
endpoints = collections.OrderedDict()
stack_outputs = [inputs]
for i, config in enumerate(net):
if config.stack is None:
x = self._build_block(stack_outputs[config.route],
config,
name=f"{config.layer}_{i}")
stack_outputs.append(x)
elif config.stack == "residual":
x = self._residual_stack(stack_outputs[config.route],
config,
name=f"{config.layer}_{i}")
stack_outputs.append(x)
elif config.stack == "csp":
x = self._csp_stack(stack_outputs[config.route],
config,
name=f"{config.layer}_{i}")
stack_outputs.append(x)
elif config.stack == "csp_tiny":
x_pass, x = self._csp_tiny_stack(stack_outputs[config.route],
config, name=f"{config.layer}_{i}")
stack_outputs.append(x_pass)
elif config.stack == "tiny":
x = self._tiny_stack(stack_outputs[config.route],
config,
name=f"{config.layer}_{i}")
stack_outputs.append(x)
if (config.is_output and self._min_size is None):
endpoints[str(config.output_name)] = x
elif self._min_size is not None and config.output_name >= self._min_size and config.output_name <= self._max_size:
endpoints[str(config.output_name)] = x
self._output_specs = {l: endpoints[l].get_shape() for l in endpoints.keys()}
return endpoints
def _get_activation(self, activation):
if self._activation is None:
return activation
else:
return self._activation
def _csp_stack(self, inputs, config, name):
if config.bottleneck:
csp_filter_scale = 1
residual_filter_scale = 2
scale_filters = 1
else:
csp_filter_scale = 2
residual_filter_scale = 1
scale_filters = 2
self._default_dict["activation"] = self._get_activation(config.activation)
self._default_dict["name"] = f"{name}_csp_down"
x, x_route = nn_blocks.CSPRoute(filters=config.filters,
filter_scale=csp_filter_scale,
downsample=True,
**self._default_dict)(inputs)
for i in range(config.repetitions):
self._default_dict["name"] = f"{name}_{i}"
x = nn_blocks.DarkResidual(filters=config.filters // scale_filters,
filter_scale=residual_filter_scale,
**self._default_dict)(x)
self._default_dict["name"] = f"{name}_csp_connect"
output = nn_blocks.CSPConnect(filters=config.filters,
filter_scale=csp_filter_scale,
**self._default_dict)([x, x_route])
self._default_dict["activation"] = self._activation
self._default_dict["name"] = None
return output
def _csp_tiny_stack(self, inputs, config, name):
self._default_dict["activation"] = self._get_activation(config.activation)
self._default_dict["name"] = f"{name}_csp_tiny"
x, x_route = nn_blocks.CSPTiny(filters=config.filters,
**self._default_dict)(inputs)
self._default_dict["activation"] = self._activation
self._default_dict["name"] = None
return x, x_route
def _tiny_stack(self, inputs, config, name):
x = tf.keras.layers.MaxPool2D(pool_size=2,
strides=config.strides,
padding="same",
data_format=None,
name=f"{name}_tiny/pool")(inputs)
self._default_dict["activation"] = self._get_activation(config.activation)
self._default_dict["name"] = f"{name}_tiny/conv"
x = nn_blocks.ConvBN(
filters=config.filters,
kernel_size=(3, 3),
strides=(1, 1),
padding="same",
**self._default_dict)(
x)
self._default_dict["activation"] = self._activation
self._default_dict["name"] = None
return x
def _residual_stack(self, inputs, config, name):
self._default_dict["activation"] = self._get_activation(config.activation)
self._default_dict["name"] = f"{name}_residual_down"
x = nn_blocks.DarkResidual(filters=config.filters,
downsample=True,
**self._default_dict)(inputs)
for i in range(config.repetitions - 1):
self._default_dict["name"] = f"{name}_{i}"
x = nn_blocks.DarkResidual(filters=config.filters,
**self._default_dict)(x)
self._default_dict["activation"] = self._activation
self._default_dict["name"] = None
return x
def _build_block(self, inputs, config, name):
x = inputs
i = 0
self._default_dict["activation"] = self._get_activation(config.activation)
while i < config.repetitions:
self._default_dict["name"] = f"{name}_{i}"
layer = self._registry(config, self._default_dict)
x = layer(x)
i += 1
self._default_dict["activation"] = self._activation
self._default_dict["name"] = None
return x
@staticmethod
def get_model_config(name):
name = name.lower()
backbone = BACKBONES[name]["backbone"]
splits = BACKBONES[name]["splits"]
return build_block_specs(backbone), splits
@property
def model_id(self):
return self._model_name
@classmethod
def from_config(cls, config, custom_objects=None):
return cls(**config)
def get_config(self):
layer_config = {
"model_id": self._model_name,
"min_level": self._min_size,
"max_level": self._max_size,
"kernel_initializer": self._kernel_initializer,
"kernel_regularizer": self._kernel_regularizer,
"bias_regularizer": self._bias_regularizer,
"norm_momentum": self._norm_momentum,
"norm_epsilon": self._norm_epislon,
"use_sync_bn": self._use_sync_bn,
"activation": self._activation
}
return layer_config
@factory.register_backbone_builder("darknet")
def build_darknet(
input_specs: tf.keras.layers.InputSpec,
model_config,
l2_regularizer: tf.keras.regularizers.Regularizer = None) -> tf.keras.Model:
"""Builds darknet backbone."""
backbone_cfg = model_config.backbone.get()
norm_activation_config = model_config.norm_activation
model = Darknet(
model_id=backbone_cfg.model_id,
input_specs=input_specs,
activation=norm_activation_config.activation,
use_sync_bn=norm_activation_config.use_sync_bn,
norm_momentum=norm_activation_config.norm_momentum,
norm_epsilon=norm_activation_config.norm_epsilon,
kernel_regularizer=l2_regularizer)
return model
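# Usage sketch (illustrative): the backbone can also be instantiated directly,
# without going through the config system, to inspect the endpoint shapes it
# reports for levels 3 through 5.
def _example_darknet_backbone():
  """Builds a CSPDarknet53 backbone and returns its output specs."""
  backbone = Darknet(model_id="cspdarknet53", min_level=3, max_level=5)
  return backbone.output_specs  # keyed by level: {"3": ..., "4": ..., "5": ...}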
# Lint as: python3
# Copyright 2020 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Tests for resnet."""
from absl.testing import parameterized
import numpy as np
import tensorflow as tf
from tensorflow.python.distribute import combinations
from tensorflow.python.distribute import strategy_combinations
from official.vision.beta.projects.yolo.modeling.backbones import darknet
class DarkNetTest(parameterized.TestCase, tf.test.TestCase):
@parameterized.parameters(
(224, "darknet53", 2, 1),
(224, "darknettiny", 1, 2),
(224, "cspdarknettiny", 1, 1),
(224, "cspdarknet53", 2, 1),
)
def test_network_creation(self, input_size, model_id,
endpoint_filter_scale, scale_final):
"""Test creation of ResNet family models."""
tf.keras.backend.set_image_data_format("channels_last")
network = darknet.Darknet(model_id=model_id, min_level=3, max_level=5)
self.assertEqual(network.model_id, model_id)
inputs = tf.keras.Input(shape=(input_size, input_size, 3), batch_size=1)
endpoints = network(inputs)
self.assertAllEqual(
[1, input_size / 2**3, input_size / 2**3, 128 * endpoint_filter_scale],
endpoints["3"].shape.as_list())
self.assertAllEqual(
[1, input_size / 2**4, input_size / 2**4, 256 * endpoint_filter_scale],
endpoints["4"].shape.as_list())
self.assertAllEqual([
1, input_size / 2**5, input_size / 2**5,
512 * endpoint_filter_scale * scale_final
], endpoints["5"].shape.as_list())
@combinations.generate(
combinations.combine(
strategy=[
strategy_combinations.cloud_tpu_strategy,
strategy_combinations.one_device_strategy_gpu,
],
use_sync_bn=[False, True],
))
def test_sync_bn_multiple_devices(self, strategy, use_sync_bn):
"""Test for sync bn on TPU and GPU devices."""
inputs = np.random.rand(1, 224, 224, 3)
tf.keras.backend.set_image_data_format("channels_last")
with strategy.scope():
network = darknet.Darknet(model_id="darknet53", min_level=3, max_level=5)
_ = network(inputs)
@parameterized.parameters(1, 3, 4)
def test_input_specs(self, input_dim):
"""Test different input feature dimensions."""
tf.keras.backend.set_image_data_format("channels_last")
input_specs = tf.keras.layers.InputSpec(shape=[None, None, None, input_dim])
network = darknet.Darknet(
model_id="darknet53", min_level=3, max_level=5, input_specs=input_specs)
inputs = tf.keras.Input(shape=(224, 224, input_dim), batch_size=1)
_ = network(inputs)
def test_serialize_deserialize(self):
# Create a network object that sets all of its config options.
kwargs = dict(
model_id="darknet53",
min_level=3,
max_level=5,
use_sync_bn=False,
activation="relu",
norm_momentum=0.99,
norm_epsilon=0.001,
kernel_initializer="VarianceScaling",
kernel_regularizer=None,
bias_regularizer=None,
)
network = darknet.Darknet(**kwargs)
expected_config = dict(kwargs)
self.assertEqual(network.get_config(), expected_config)
# Create another network object from the first object's config.
new_network = darknet.Darknet.from_config(network.get_config())
# Validate that the config can be forced to JSON.
_ = new_network.to_json()
# If the serialization was successful, the new config should match the old.
self.assertAllEqual(network.get_config(), new_network.get_config())
if __name__ == "__main__":
tf.test.main()
# Lint as: python3
# Copyright 2020 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Contains common building blocks for yolo neural networks."""
from typing import Callable, List
import tensorflow as tf
from official.modeling import tf_utils
@tf.keras.utils.register_keras_serializable(package="yolo")
class Identity(tf.keras.layers.Layer):
def call(self, inputs):
return inputs
@tf.keras.utils.register_keras_serializable(package="yolo")
class ConvBN(tf.keras.layers.Layer):
"""Modified Convolution layer to match that of the DarkNet Library.
The layer is a standard combination of Conv, BatchNorm, and Activation;
however, the use of bias in the conv is determined by the use of batch norm.
Cross Stage Partial networks (CSPNets) were proposed in:
[1] Chien-Yao Wang, Hong-Yuan Mark Liao, I-Hau Yeh, Yueh-Hua Wu, Ping-Yang
Chen, Jun-Wei Hsieh.
CSPNet: A New Backbone that can Enhance Learning Capability of CNN.
arXiv:1911.11929
"""
def __init__(self,
filters=1,
kernel_size=(1, 1),
strides=(1, 1),
padding="same",
dilation_rate=(1, 1),
kernel_initializer="glorot_uniform",
bias_initializer="zeros",
kernel_regularizer=None,
bias_regularizer=None,
use_bn=True,
use_sync_bn=False,
norm_momentum=0.99,
norm_epsilon=0.001,
activation="leaky",
leaky_alpha=0.1,
**kwargs):
"""Initializes ConvBN layer.
Args:
filters: integer for output depth, or the number of features to learn
kernel_size: integer or tuple for the shape of the weight matrix or kernel
to learn.
strides: integer or tuple for how much to move the kernel after each kernel
  use.
padding: string 'valid' or 'same'; if 'same', pad the image, else do not.
dilation_rate: tuple to indicate how much to modulate kernel weights and
how many pixels in a feature map to skip.
kernel_initializer: string to indicate which function to use to initialize
weights.
bias_initializer: string to indicate which function to use to initialize
bias.
kernel_regularizer: string to indicate which function to use to regularize
  weights.
bias_regularizer: string to indicate which function to use to regularize
  bias.
use_bn: boolean for whether to use batch normalization.
use_sync_bn: boolean for whether to use synchronized batch normalization.
norm_momentum: float for momentum to use for batch normalization.
norm_epsilon: float for batch normalization epsilon.
activation: string or None for activation function to use in layer,
if None activation is replaced by linear.
leaky_alpha: float to use as alpha if activation function is leaky.
**kwargs: Keyword Arguments
"""
# convolution params
self._filters = filters
self._kernel_size = kernel_size
self._strides = strides
self._padding = padding
self._dilation_rate = dilation_rate
self._kernel_initializer = kernel_initializer
self._bias_initializer = bias_initializer
self._kernel_regularizer = kernel_regularizer
self._bias_regularizer = bias_regularizer
# batch normalization params
self._use_bn = use_bn
self._use_sync_bn = use_sync_bn
self._norm_moment = norm_momentum
self._norm_epsilon = norm_epsilon
if tf.keras.backend.image_data_format() == "channels_last":
# format: (batch_size, height, width, channels)
self._bn_axis = -1
else:
# format: (batch_size, channels, width, height)
self._bn_axis = 1
# activation params
self._activation = activation
self._leaky_alpha = leaky_alpha
super(ConvBN, self).__init__(**kwargs)
def build(self, input_shape):
use_bias = not self._use_bn
self.conv = tf.keras.layers.Conv2D(
filters=self._filters,
kernel_size=self._kernel_size,
strides=self._strides,
padding=self._padding,
dilation_rate=self._dilation_rate,
use_bias=use_bias,
kernel_initializer=self._kernel_initializer,
bias_initializer=self._bias_initializer,
kernel_regularizer=self._kernel_regularizer,
bias_regularizer=self._bias_regularizer)
if self._use_bn:
if self._use_sync_bn:
self.bn = tf.keras.layers.experimental.SyncBatchNormalization(
momentum=self._norm_moment,
epsilon=self._norm_epsilon,
axis=self._bn_axis)
else:
self.bn = tf.keras.layers.BatchNormalization(
momentum=self._norm_moment,
epsilon=self._norm_epsilon,
axis=self._bn_axis)
else:
self.bn = Identity()
if self._activation == "leaky":
self._activation_fn = tf.keras.layers.LeakyReLU(alpha=self._leaky_alpha)
elif self._activation == "mish":
self._activation_fn = lambda x: x * tf.math.tanh(tf.math.softplus(x))
else:
self._activation_fn = tf_utils.get_activation(self._activation)
def call(self, x):
x = self.conv(x)
x = self.bn(x)
x = self._activation_fn(x)
return x
def get_config(self):
# used to store/share parameters to reconstruct the model
layer_config = {
"filters": self._filters,
"kernel_size": self._kernel_size,
"strides": self._strides,
"padding": self._padding,
"dilation_rate": self._dilation_rate,
"kernel_initializer": self._kernel_initializer,
"bias_initializer": self._bias_initializer,
"bias_regularizer": self._bias_regularizer,
"kernel_regularizer": self._kernel_regularizer,
"use_bn": self._use_bn,
"use_sync_bn": self._use_sync_bn,
"norm_moment": self._norm_moment,
"norm_epsilon": self._norm_epsilon,
"activation": self._activation,
"leaky_alpha": self._leaky_alpha
}
layer_config.update(super(ConvBN, self).get_config())
return layer_config
def __repr__(self):
return repr(self.get_config())
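# Usage sketch (illustrative): a ConvBN block chains Conv2D, optional
# BatchNorm, and the requested activation; bias is dropped when batch norm is
# used. The shapes below are arbitrary.
def _example_conv_bn():
  """Applies a strided ConvBN block with the mish activation."""
  x = tf.random.uniform([1, 64, 64, 3])
  layer = ConvBN(filters=32, kernel_size=(3, 3), strides=(2, 2),
                 padding="same", activation="mish")
  return layer(x)  # shape: (1, 32, 32, 32)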
@tf.keras.utils.register_keras_serializable(package="yolo")
class DarkResidual(tf.keras.layers.Layer):
"""DarkNet block with Residual connection for Yolo v3 Backbone.
"""
def __init__(self,
filters=1,
filter_scale=2,
kernel_initializer="glorot_uniform",
bias_initializer="zeros",
kernel_regularizer=None,
bias_regularizer=None,
use_bn=True,
use_sync_bn=False,
norm_momentum=0.99,
norm_epsilon=0.001,
activation="leaky",
leaky_alpha=0.1,
sc_activation="linear",
downsample=False,
**kwargs):
"""Initializes DarkResidual.
Args:
filters: integer for output depth, or the number of features to learn.
filter_scale: `int`, scale factor for number of filters.
kernel_initializer: string to indicate which function to use to initialize
weights
bias_initializer: string to indicate which function to use to initialize
bias
kernel_regularizer: string to indicate which function to use to regularize
  weights.
bias_regularizer: string to indicate which function to use to regularize
  bias.
use_bn: boolean for whether to use batch normalization.
use_sync_bn: boolean for whether to use synchronized batch normalization.
norm_momentum: float for momentum to use for batch normalization.
norm_epsilon: float for batch normalization epsilon.
activation: string for activation function to use in conv layers.
leaky_alpha: float to use as alpha if activation function is leaky
sc_activation: string for activation function to use in layer
downsample: boolean for if image input is larger than layer output, set
downsample to True so the dimensions are forced to match
**kwargs: Keyword Arguments
"""
# downsample
self._downsample = downsample
# ConvBN params
self._filters = filters
self._filter_scale = filter_scale
self._kernel_initializer = kernel_initializer
self._bias_initializer = bias_initializer
self._bias_regularizer = bias_regularizer
self._use_bn = use_bn
self._use_sync_bn = use_sync_bn
self._kernel_regularizer = kernel_regularizer
# normal params
self._norm_moment = norm_momentum
self._norm_epsilon = norm_epsilon
# activation params
self._conv_activation = activation
self._leaky_alpha = leaky_alpha
self._sc_activation = sc_activation
super().__init__(**kwargs)
def build(self, input_shape):
self._dark_conv_args = {
"kernel_initializer": self._kernel_initializer,
"bias_initializer": self._bias_initializer,
"bias_regularizer": self._bias_regularizer,
"use_bn": self._use_bn,
"use_sync_bn": self._use_sync_bn,
"norm_momentum": self._norm_moment,
"norm_epsilon": self._norm_epsilon,
"activation": self._conv_activation,
"kernel_regularizer": self._kernel_regularizer,
"leaky_alpha": self._leaky_alpha
}
if self._downsample:
self._dconv = ConvBN(
filters=self._filters,
kernel_size=(3, 3),
strides=(2, 2),
padding="same",
**self._dark_conv_args)
else:
self._dconv = Identity()
self._conv1 = ConvBN(
filters=self._filters // self._filter_scale,
kernel_size=(1, 1),
strides=(1, 1),
padding="same",
**self._dark_conv_args)
self._conv2 = ConvBN(
filters=self._filters,
kernel_size=(3, 3),
strides=(1, 1),
padding="same",
**self._dark_conv_args)
self._shortcut = tf.keras.layers.Add()
if self._sc_activation == "leaky":
self._activation_fn = tf.keras.layers.LeakyReLU(
alpha=self._leaky_alpha)
elif self._sc_activation == "mish":
self._activation_fn = lambda x: x * tf.math.tanh(tf.math.softplus(x))
else:
self._activation_fn = tf_utils.get_activation(self._sc_activation)
super().build(input_shape)
def call(self, inputs):
shortcut = self._dconv(inputs)
x = self._conv1(shortcut)
x = self._conv2(x)
x = self._shortcut([x, shortcut])
return self._activation_fn(x)
def get_config(self):
# used to store/share parameters to reconstruct the model
layer_config = {
"filters": self._filters,
"kernel_initializer": self._kernel_initializer,
"bias_initializer": self._bias_initializer,
"kernel_regularizer": self._kernel_regularizer,
"use_bn": self._use_bn,
"use_sync_bn": self._use_sync_bn,
"norm_moment": self._norm_moment,
"norm_epsilon": self._norm_epsilon,
"activation": self._conv_activation,
"leaky_alpha": self._leaky_alpha,
"sc_activation": self._sc_activation,
"downsample": self._downsample
}
layer_config.update(super().get_config())
return layer_config
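# Usage sketch (illustrative): with downsample=True the residual block halves
# the spatial dimensions before applying the 1x1 / 3x3 convolution pair.
def _example_dark_residual():
  """Applies a downsampling DarkResidual block to a random feature map."""
  x = tf.random.uniform([1, 56, 56, 64])
  block = DarkResidual(filters=128, downsample=True)
  return block(x)  # shape: (1, 28, 28, 128)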
@tf.keras.utils.register_keras_serializable(package="yolo")
class CSPTiny(tf.keras.layers.Layer):
"""A Small size convolution block proposed in the CSPNet.
The layer uses shortcuts, routing (concatenation), and feature grouping
in order to improve gradient variability and allow for high-efficiency,
low-power residual learning for small networks.
Cross Stage Partial networks (CSPNets) were proposed in:
[1] Chien-Yao Wang, Hong-Yuan Mark Liao, I-Hau Yeh, Yueh-Hua Wu, Ping-Yang
Chen, Jun-Wei Hsieh
CSPNet: A New Backbone that can Enhance Learning Capability of CNN.
arXiv:1911.11929
"""
def __init__(self,
filters=1,
kernel_initializer="glorot_uniform",
bias_initializer="zeros",
kernel_regularizer=None,
bias_regularizer=None,
use_bn=True,
use_sync_bn=False,
group_id=1,
groups=2,
norm_momentum=0.99,
norm_epsilon=0.001,
activation="leaky",
downsample=True,
leaky_alpha=0.1,
**kwargs):
"""Initializes CSPTiny.
Args:
filters: integer for output depth, or the number of features to learn
kernel_initializer: string to indicate which function to use to initialize
weights
bias_initializer: string to indicate which function to use to initialize
bias
kernel_regularizer: string to indicate which function to use to regularize
  weights.
bias_regularizer: string to indicate which function to use to regularize
  bias.
use_bn: boolean for whether to use batch normalization.
use_sync_bn: boolean for whether to synchronize the batch normalization
  statistics of all batch norm layers to the model's global statistics
  (across all input batches).
group_id: integer for which group of features to pass through the csp tiny
  stack.
groups: integer for how many splits there should be in the convolution
  feature stack output.
norm_momentum: float for momentum to use for batch normalization.
norm_epsilon: float for batch normalization epsilon.
activation: string or None for activation function to use in layer,
if None activation is replaced by linear
downsample: boolean for if image input is larger than layer output, set
downsample to True so the dimensions are forced to match
leaky_alpha: float to use as alpha if activation function is leaky
**kwargs: Keyword Arguments
"""
# ConvBN params
self._filters = filters
self._kernel_initializer = kernel_initializer
self._bias_initializer = bias_initializer
self._bias_regularizer = bias_regularizer
self._use_bn = use_bn
self._use_sync_bn = use_sync_bn
self._kernel_regularizer = kernel_regularizer
self._groups = groups
self._group_id = group_id
self._downsample = downsample
# normal params
self._norm_moment = norm_momentum
self._norm_epsilon = norm_epsilon
# activation params
self._conv_activation = activation
self._leaky_alpha = leaky_alpha
super().__init__(**kwargs)
def build(self, input_shape):
self._dark_conv_args = {
"kernel_initializer": self._kernel_initializer,
"bias_initializer": self._bias_initializer,
"bias_regularizer": self._bias_regularizer,
"use_bn": self._use_bn,
"use_sync_bn": self._use_sync_bn,
"norm_momentum": self._norm_moment,
"norm_epsilon": self._norm_epsilon,
"activation": self._conv_activation,
"kernel_regularizer": self._kernel_regularizer,
"leaky_alpha": self._leaky_alpha
}
self._convlayer1 = ConvBN(
filters=self._filters,
kernel_size=(3, 3),
strides=(1, 1),
padding="same",
**self._dark_conv_args)
self._convlayer2 = ConvBN(
filters=self._filters // 2,
kernel_size=(3, 3),
strides=(1, 1),
padding="same",
kernel_initializer=self._kernel_initializer,
bias_initializer=self._bias_initializer,
bias_regularizer=self._bias_regularizer,
kernel_regularizer=self._kernel_regularizer,
use_bn=self._use_bn,
use_sync_bn=self._use_sync_bn,
norm_momentum=self._norm_moment,
norm_epsilon=self._norm_epsilon,
activation=self._conv_activation,
leaky_alpha=self._leaky_alpha)
self._convlayer3 = ConvBN(
filters=self._filters // 2,
kernel_size=(3, 3),
strides=(1, 1),
padding="same",
**self._dark_conv_args)
self._convlayer4 = ConvBN(
filters=self._filters,
kernel_size=(1, 1),
strides=(1, 1),
padding="same",
**self._dark_conv_args)
self._maxpool = tf.keras.layers.MaxPool2D(
pool_size=2, strides=2, padding="same", data_format=None)
super().build(input_shape)
def call(self, inputs):
x1 = self._convlayer1(inputs)
x1_group = tf.split(x1, self._groups, axis=-1)[self._group_id]
x2 = self._convlayer2(x1_group) # grouping
x3 = self._convlayer3(x2)
x4 = tf.concat([x3, x2], axis=-1) # csp partial using grouping
x5 = self._convlayer4(x4)
x = tf.concat([x1, x5], axis=-1) # csp connect
if self._downsample:
x = self._maxpool(x)
return x, x5
def get_config(self):
# used to store/share parameters to reconstruct the model
layer_config = {
"filters": self._filters,
"strides": self._strides,
"kernel_initializer": self._kernel_initializer,
"bias_initializer": self._bias_initializer,
"kernel_regularizer": self._kernel_regularizer,
"use_bn": self._use_bn,
"use_sync_bn": self._use_sync_bn,
"norm_moment": self._norm_moment,
"norm_epsilon": self._norm_epsilon,
"activation": self._conv_activation,
"leaky_alpha": self._leaky_alpha,
"sc_activation": self._sc_activation,
}
layer_config.update(super().get_config())
return layer_config
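# Usage sketch (illustrative): CSPTiny returns both the (optionally
# downsampled) concatenated output and the csp partial used for routing.
def _example_csp_tiny():
  """Runs a CSPTiny block on a random feature map."""
  x = tf.random.uniform([1, 64, 64, 64])
  block = CSPTiny(filters=64)
  y, y_partial = block(x)  # y: (1, 32, 32, 128), y_partial: (1, 64, 64, 64)
  return y, y_partial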
@tf.keras.utils.register_keras_serializable(package="yolo")
class CSPRoute(tf.keras.layers.Layer):
"""Down sampling layer to take the place of down sampleing.
It is applied in Residual networks. This is the first of 2 layers needed to
convert any Residual Network model to a CSPNet. At the start of a new level
change, this CSPRoute layer creates a learned identity that will act as a
cross stage connection, that is used to inform the inputs to the next stage.
It is called cross stage partial because the number of filters required in
every intermitent Residual layer is reduced by half. The sister layer will
take the partial generated by this layer and concatnate it with the output of
the final residual layer in the stack to create a fully feature level output.
This concatnation merges the partial blocks of 2 levels as input to the next
allowing the gradients of each level to be more unique, and reducing the
number of parameters required by each level by 50% while keeping accuracy
consistent.
Cross Stage Partial networks (CSPNets) were proposed in:
[1] Chien-Yao Wang, Hong-Yuan Mark Liao, I-Hau Yeh, Yueh-Hua Wu, Ping-Yang
Chen, Jun-Wei Hsieh.
CSPNet: A New Backbone that can Enhance Learning Capability of CNN.
arXiv:1911.11929
"""
def __init__(self,
filters,
filter_scale=2,
activation="mish",
downsample=True,
kernel_initializer="glorot_uniform",
bias_initializer="zeros",
kernel_regularizer=None,
bias_regularizer=None,
use_bn=True,
use_sync_bn=False,
norm_momentum=0.99,
norm_epsilon=0.001,
**kwargs):
"""Initializes CSPRoute.
Args:
filters: integer for output depth, or the number of features to learn
filter_scale: integer divisor for the number of filters in the partial
  feature stack (filters // filter_scale).
activation: string for activation function to use in layer.
downsample: boolean for whether to downsample the input.
kernel_initializer: string to indicate which function to use to initialize
weights.
bias_initializer: string to indicate which function to use to initialize
bias.
kernel_regularizer: string to indicate which function to use to regularize
  weights.
bias_regularizer: string to indicate which function to use to regularize
  bias.
use_bn: boolean for whether to use batch normalization.
use_sync_bn: boolean for whether to use synchronized batch normalization.
norm_momentum: float for momentum to use for batch normalization.
norm_epsilon: float for batch normalization epsilon.
**kwargs: Keyword Arguments
"""
super().__init__(**kwargs)
# Layer params.
self._filters = filters
self._filter_scale = filter_scale
self._activation = activation
# Convoultion params.
self._kernel_initializer = kernel_initializer
self._bias_initializer = bias_initializer
self._kernel_regularizer = kernel_regularizer
self._bias_regularizer = bias_regularizer
self._use_bn = use_bn
self._use_sync_bn = use_sync_bn
self._norm_moment = norm_momentum
self._norm_epsilon = norm_epsilon
self._downsample = downsample
def build(self, input_shape):
self._dark_conv_args = {
"kernel_initializer": self._kernel_initializer,
"bias_initializer": self._bias_initializer,
"bias_regularizer": self._bias_regularizer,
"use_bn": self._use_bn,
"use_sync_bn": self._use_sync_bn,
"norm_momentum": self._norm_moment,
"norm_epsilon": self._norm_epsilon,
"activation": self._activation,
"kernel_regularizer": self._kernel_regularizer,
}
if self._downsample:
self._conv1 = ConvBN(filters=self._filters,
kernel_size=(3, 3),
strides=(2, 2),
**self._dark_conv_args)
else:
self._conv1 = ConvBN(filters=self._filters,
kernel_size=(3, 3),
strides=(1, 1),
**self._dark_conv_args)
self._conv2 = ConvBN(filters=self._filters // self._filter_scale,
kernel_size=(1, 1),
strides=(1, 1),
**self._dark_conv_args)
self._conv3 = ConvBN(filters=self._filters // self._filter_scale,
kernel_size=(1, 1),
strides=(1, 1),
**self._dark_conv_args)
def call(self, inputs):
x = self._conv1(inputs)
y = self._conv2(x)
x = self._conv3(x)
return (x, y)
@tf.keras.utils.register_keras_serializable(package="yolo")
class CSPConnect(tf.keras.layers.Layer):
"""Sister Layer to the CSPRoute layer.
Merges the partial feature stacks generated by the CSPDownsampling layer,
and the finaly output of the residual stack. Suggested in the CSPNet paper.
Cross Stage Partial networks (CSPNets) were proposed in:
[1] Chien-Yao Wang, Hong-Yuan Mark Liao, I-Hau Yeh, Yueh-Hua Wu, Ping-Yang
Chen, Jun-Wei Hsieh.
CSPNet: A New Backbone that can Enhance Learning Capability of CNN.
arXiv:1911.11929
"""
def __init__(self,
filters,
filter_scale=2,
activation="mish",
kernel_initializer="glorot_uniform",
bias_initializer="zeros",
kernel_regularizer=None,
bias_regularizer=None,
use_bn=True,
use_sync_bn=False,
norm_momentum=0.99,
norm_epsilon=0.001,
**kwargs):
"""Initializes CSPConnect.
Args:
filters: integer for output depth, or the number of features to learn.
filter_scale: integer divisor for the number of filters in the partial
  feature stack (filters // filter_scale).
activation: string for activation function to use in layer.
kernel_initializer: string to indicate which function to use to initialize
weights.
bias_initializer: string to indicate which function to use to initialize
bias.
kernel_regularizer: string to indicate which function to use to regularize
  weights.
bias_regularizer: string to indicate which function to use to regularize
  bias.
use_bn: boolean for whether to use batch normalization.
use_sync_bn: boolean for whether to use synchronized batch normalization.
norm_momentum: float for momentum to use for batch normalization.
norm_epsilon: float for batch normalization epsilon.
**kwargs: Keyword Arguments
"""
super().__init__(**kwargs)
# layer params.
self._filters = filters
self._filter_scale = filter_scale
self._activation = activation
# Convoultion params.
self._kernel_initializer = kernel_initializer
self._bias_initializer = bias_initializer
self._kernel_regularizer = kernel_regularizer
self._bias_regularizer = bias_regularizer
self._use_bn = use_bn
self._use_sync_bn = use_sync_bn
self._norm_moment = norm_momentum
self._norm_epsilon = norm_epsilon
def build(self, input_shape):
self._dark_conv_args = {
"kernel_initializer": self._kernel_initializer,
"bias_initializer": self._bias_initializer,
"bias_regularizer": self._bias_regularizer,
"use_bn": self._use_bn,
"use_sync_bn": self._use_sync_bn,
"norm_momentum": self._norm_moment,
"norm_epsilon": self._norm_epsilon,
"activation": self._activation,
"kernel_regularizer": self._kernel_regularizer,
}
self._conv1 = ConvBN(filters=self._filters // self._filter_scale,
kernel_size=(1, 1),
strides=(1, 1),
**self._dark_conv_args)
self._concat = tf.keras.layers.Concatenate(axis=-1)
self._conv2 = ConvBN(filters=self._filters,
kernel_size=(1, 1),
strides=(1, 1),
**self._dark_conv_args)
def call(self, inputs):
x_prev, x_csp = inputs
x = self._conv1(x_prev)
x = self._concat([x, x_csp])
x = self._conv2(x)
return x
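# Usage sketch (illustrative): CSPRoute and CSPConnect are used as a pair,
# with the wrapped residual blocks in between, mirroring Darknet._csp_stack
# for the non-bottleneck case. The shapes below are arbitrary.
def _example_csp_route_connect():
  """Chains CSPRoute -> DarkResidual -> CSPConnect on a random feature map."""
  x = tf.random.uniform([1, 64, 64, 64])
  route = CSPRoute(filters=128, filter_scale=2)
  residual = DarkResidual(filters=128 // 2, filter_scale=1)
  connect = CSPConnect(filters=128, filter_scale=2)
  y, y_route = route(x)         # downsampled partial and cross-stage route
  y = residual(y)
  return connect([y, y_route])  # shape: (1, 32, 32, 128)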
class CSPStack(tf.keras.layers.Layer):
"""CSP full stack.
Combines the route and the connect in case you want to just quickly wrap an
existing callable or list of layers to make it cross stage partial. Added for
ease of use. You should be able to wrap any layer stack with a CSP independent
of whether it belongs to the Darknet family. If filter_scale = 2, then the
blocks in the stack passed into the CSP stack should also have
filters = filters / filter_scale.
Cross Stage Partial networks (CSPNets) were proposed in:
[1] Chien-Yao Wang, Hong-Yuan Mark Liao, I-Hau Yeh, Yueh-Hua Wu, Ping-Yang
Chen, Jun-Wei Hsieh
CSPNet: A New Backbone that can Enhance Learning Capability of CNN.
arXiv:1911.11929
"""
def __init__(self,
filters,
model_to_wrap=None,
filter_scale=2,
activation="mish",
kernel_initializer="glorot_uniform",
bias_initializer="zeros",
kernel_regularizer=None,
bias_regularizer=None,
downsample=True,
use_bn=True,
use_sync_bn=False,
norm_momentum=0.99,
norm_epsilon=0.001,
**kwargs):
"""Initializes CSPStack.
Args:
filters: integer for output depth, or the number of features to learn.
model_to_wrap: callable Model or a list of callable objects that will
process the output of CSPRoute, and be input into CSPConnect. List will
be called sequentially.
filter_scale: integer divisor for the number of filters in the partial
  feature stack (filters // filter_scale).
activation: string for activation function to use in layer.
kernel_initializer: string to indicate which function to use to initialize
weights.
bias_initializer: string to indicate which function to use to initialize
bias.
kernel_regularizer: string to indicate which function to use to regularize
  weights.
bias_regularizer: string to indicate which function to use to regularize
  bias.
downsample: boolean for whether to downsample the input.
use_bn: boolean for whether to use batch normalization.
use_sync_bn: boolean for whether to use synchronized batch normalization.
norm_momentum: float for momentum to use for batch normalization.
norm_epsilon: float for batch normalization epsilon.
**kwargs: Keyword Arguments
"""
super().__init__(**kwargs)
# Layer params.
self._filters = filters
self._filter_scale = filter_scale
self._activation = activation
self._downsample = downsample
# Convoultion params.
self._kernel_initializer = kernel_initializer
self._bias_initializer = bias_initializer
self._kernel_regularizer = kernel_regularizer
self._bias_regularizer = bias_regularizer
self._use_bn = use_bn
self._use_sync_bn = use_sync_bn
self._norm_moment = norm_momentum
self._norm_epsilon = norm_epsilon
if model_to_wrap is not None:
if isinstance(model_to_wrap, Callable):
self._model_to_wrap = [model_to_wrap]
elif isinstance(model_to_wrap, List):
self._model_to_wrap = model_to_wrap
else:
raise ValueError("The input to the CSPStack must be a list of layers"
"that we can iterate through, or \n a callable")
else:
self._model_to_wrap = []
def build(self, input_shape):
self._dark_conv_args = {
"filters": self._filters,
"filter_scale": self._filter_scale,
"activation": self._activation,
"kernel_initializer": self._kernel_initializer,
"bias_initializer": self._bias_initializer,
"bias_regularizer": self._bias_regularizer,
"use_bn": self._use_bn,
"use_sync_bn": self._use_sync_bn,
"norm_momentum": self._norm_moment,
"norm_epsilon": self._norm_epsilon,
"kernel_regularizer": self._kernel_regularizer,
}
self._route = CSPRoute(downsample=self._downsample, **self._dark_conv_args)
self._connect = CSPConnect(**self._dark_conv_args)
return
def call(self, inputs):
x, x_route = self._route(inputs)
for layer in self._model_to_wrap:
x = layer(x)
x = self._connect([x, x_route])
return x
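# A minimal usage sketch of CSPStack (illustration only, not used elsewhere in
# this module). It mirrors the "residual_stack" case in the unit tests: the
# wrapped DarkResidual blocks use filters // filter_scale so their width
# matches the partial branch produced by CSPRoute. The helper name and the
# input shape below are arbitrary choices made for the example.
def _csp_stack_example():
  blocks = [
      DarkResidual(filters=32, filter_scale=2),
      DarkResidual(filters=32, filter_scale=2),
  ]
  stack = CSPStack(
      filters=64, filter_scale=2, downsample=True, model_to_wrap=blocks)
  x = tf.ones([1, 224, 224, 64])
  # With downsample=True the spatial dims are halved: (1, 112, 112, 64).
  return stack(x)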
# Lint as: python3
# Copyright 2020 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
from absl.testing import parameterized
import numpy as np
import tensorflow as tf
from official.vision.beta.projects.yolo.modeling.layers import nn_blocks
class CSPConnectTest(tf.test.TestCase, parameterized.TestCase):
@parameterized.named_parameters(("same", 224, 224, 64, 1),
("downsample", 224, 224, 64, 2))
def test_pass_through(self, width, height, filters, mod):
x = tf.keras.Input(shape=(width, height, filters))
test_layer = nn_blocks.CSPRoute(filters=filters, filter_scale=mod)
test_layer2 = nn_blocks.CSPConnect(filters=filters, filter_scale=mod)
outx, px = test_layer(x)
outx = test_layer2([outx, px])
print(outx)
print(outx.shape.as_list())
self.assertAllEqual(
outx.shape.as_list(),
[None, np.ceil(width // 2),
np.ceil(height // 2), (filters)])
@parameterized.named_parameters(("same", 224, 224, 64, 1),
("downsample", 224, 224, 128, 2))
def test_gradient_pass_though(self, filters, width, height, mod):
loss = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.SGD()
test_layer = nn_blocks.CSPRoute(filters, filter_scale=mod)
path_layer = nn_blocks.CSPConnect(filters, filter_scale=mod)
init = tf.random_normal_initializer()
x = tf.Variable(
initial_value=init(shape=(1, width, height, filters), dtype=tf.float32))
y = tf.Variable(initial_value=init(shape=(1, int(np.ceil(width // 2)),
int(np.ceil(height // 2)),
filters),
dtype=tf.float32))
with tf.GradientTape() as tape:
x_hat, x_prev = test_layer(x)
x_hat = path_layer([x_hat, x_prev])
grad_loss = loss(x_hat, y)
grad = tape.gradient(grad_loss, test_layer.trainable_variables)
optimizer.apply_gradients(zip(grad, test_layer.trainable_variables))
self.assertNotIn(None, grad)
class CSPRouteTest(tf.test.TestCase, parameterized.TestCase):
@parameterized.named_parameters(("same", 224, 224, 64, 1),
("downsample", 224, 224, 64, 2))
def test_pass_through(self, width, height, filters, mod):
x = tf.keras.Input(shape=(width, height, filters))
test_layer = nn_blocks.CSPRoute(filters=filters, filter_scale=mod)
outx, _ = test_layer(x)
print(outx)
print(outx.shape.as_list())
self.assertAllEqual(
outx.shape.as_list(),
[None, np.ceil(width // 2),
np.ceil(height // 2), (filters / mod)])
@parameterized.named_parameters(("same", 224, 224, 64, 1),
("downsample", 224, 224, 128, 2))
def test_gradient_pass_though(self, filters, width, height, mod):
loss = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.SGD()
test_layer = nn_blocks.CSPRoute(filters, filter_scale=mod)
path_layer = nn_blocks.CSPConnect(filters, filter_scale=mod)
init = tf.random_normal_initializer()
x = tf.Variable(
initial_value=init(shape=(1, width, height, filters), dtype=tf.float32))
y = tf.Variable(initial_value=init(shape=(1, int(np.ceil(width // 2)),
int(np.ceil(height // 2)),
filters),
dtype=tf.float32))
with tf.GradientTape() as tape:
x_hat, x_prev = test_layer(x)
x_hat = path_layer([x_hat, x_prev])
grad_loss = loss(x_hat, y)
grad = tape.gradient(grad_loss, test_layer.trainable_variables)
optimizer.apply_gradients(zip(grad, test_layer.trainable_variables))
self.assertNotIn(None, grad)
class CSPStackTest(tf.test.TestCase, parameterized.TestCase):
def build_layer(
self, layer_type, filters, filter_scale, count, stack_type, downsample):
if stack_type is not None:
layers = []
if layer_type == "residual":
for _ in range(count):
layers.append(
nn_blocks.DarkResidual(
filters=filters // filter_scale, filter_scale=filter_scale))
else:
for _ in range(count):
layers.append(nn_blocks.ConvBN(filters=filters))
if stack_type == "model":
layers = tf.keras.Sequential(layers=layers)
else:
layers = None
stack = nn_blocks.CSPStack(
filters=filters,
filter_scale=filter_scale,
downsample=downsample,
model_to_wrap=layers)
return stack
@parameterized.named_parameters(
("no_stack", 224, 224, 64, 2, "residual", None, 0, True),
("residual_stack", 224, 224, 64, 2, "residual", "list", 2, True),
("conv_stack", 224, 224, 64, 2, "conv", "list", 3, False),
("callable_no_scale", 224, 224, 64, 1, "residual", "model", 5, False))
def test_pass_through(self, width, height, filters, mod, layer_type,
stack_type, count, downsample):
x = tf.keras.Input(shape=(width, height, filters))
test_layer = self.build_layer(layer_type, filters, mod, count, stack_type,
downsample)
outx = test_layer(x)
print(outx)
print(outx.shape.as_list())
if downsample:
self.assertAllEqual(outx.shape.as_list(),
[None, width // 2, height // 2, filters])
else:
self.assertAllEqual(outx.shape.as_list(), [None, width, height, filters])
@parameterized.named_parameters(
("no_stack", 224, 224, 64, 2, "residual", None, 0, True),
("residual_stack", 224, 224, 64, 2, "residual", "list", 2, True),
("conv_stack", 224, 224, 64, 2, "conv", "list", 3, False),
("callable_no_scale", 224, 224, 64, 1, "residual", "model", 5, False))
def test_gradient_pass_though(self, width, height, filters, mod, layer_type,
stack_type, count, downsample):
loss = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.SGD()
init = tf.random_normal_initializer()
x = tf.Variable(
initial_value=init(shape=(1, width, height, filters), dtype=tf.float32))
if not downsample:
y = tf.Variable(
initial_value=init(
shape=(1, width, height, filters), dtype=tf.float32))
else:
y = tf.Variable(
initial_value=init(
shape=(1, width // 2, height // 2, filters), dtype=tf.float32))
test_layer = self.build_layer(layer_type, filters, mod, count, stack_type,
downsample)
with tf.GradientTape() as tape:
x_hat = test_layer(x)
grad_loss = loss(x_hat, y)
grad = tape.gradient(grad_loss, test_layer.trainable_variables)
optimizer.apply_gradients(zip(grad, test_layer.trainable_variables))
self.assertNotIn(None, grad)
class ConvBNTest(tf.test.TestCase, parameterized.TestCase):
@parameterized.named_parameters(
("valid", (3, 3), "valid", (1, 1)), ("same", (3, 3), "same", (1, 1)),
("downsample", (3, 3), "same", (2, 2)), ("test", (1, 1), "valid", (1, 1)))
def test_pass_through(self, kernel_size, padding, strides):
if padding == "same":
pad_const = 1
else:
pad_const = 0
x = tf.keras.Input(shape=(224, 224, 3))
test_layer = nn_blocks.ConvBN(
filters=64,
kernel_size=kernel_size,
padding=padding,
strides=strides,
trainable=False)
outx = test_layer(x)
print(outx.shape.as_list())
test = [
None,
int((224 - kernel_size[0] + (2 * pad_const)) / strides[0] + 1),
int((224 - kernel_size[1] + (2 * pad_const)) / strides[1] + 1), 64
]
print(test)
self.assertAllEqual(outx.shape.as_list(), test)
@parameterized.named_parameters(("filters", 3))
def test_gradient_pass_though(self, filters):
loss = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.SGD()
with tf.device("/CPU:0"):
test_layer = nn_blocks.ConvBN(filters, kernel_size=(3, 3), padding="same")
init = tf.random_normal_initializer()
x = tf.Variable(initial_value=init(shape=(1, 224, 224,
3), dtype=tf.float32))
y = tf.Variable(
initial_value=init(shape=(1, 224, 224, filters), dtype=tf.float32))
with tf.GradientTape() as tape:
x_hat = test_layer(x)
grad_loss = loss(x_hat, y)
grad = tape.gradient(grad_loss, test_layer.trainable_variables)
optimizer.apply_gradients(zip(grad, test_layer.trainable_variables))
self.assertNotIn(None, grad)
class DarkResidualTest(tf.test.TestCase, parameterized.TestCase):
@parameterized.named_parameters(("same", 224, 224, 64, False),
("downsample", 223, 223, 32, True),
("oddball", 223, 223, 32, False))
def test_pass_through(self, width, height, filters, downsample):
mod = 1
if downsample:
mod = 2
x = tf.keras.Input(shape=(width, height, filters))
test_layer = nn_blocks.DarkResidual(filters=filters, downsample=downsample)
outx = test_layer(x)
print(outx)
print(outx.shape.as_list())
self.assertAllEqual(
outx.shape.as_list(),
[None, np.ceil(width / mod),
np.ceil(height / mod), filters])
@parameterized.named_parameters(("same", 64, 224, 224, False),
("downsample", 32, 223, 223, True),
("oddball", 32, 223, 223, False))
def test_gradient_pass_though(self, filters, width, height, downsample):
loss = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.SGD()
test_layer = nn_blocks.DarkResidual(filters, downsample=downsample)
if downsample:
mod = 2
else:
mod = 1
init = tf.random_normal_initializer()
x = tf.Variable(
initial_value=init(shape=(1, width, height, filters), dtype=tf.float32))
y = tf.Variable(initial_value=init(shape=(1, int(np.ceil(width / mod)),
int(np.ceil(height / mod)),
filters),
dtype=tf.float32))
with tf.GradientTape() as tape:
x_hat = test_layer(x)
grad_loss = loss(x_hat, y)
grad = tape.gradient(grad_loss, test_layer.trainable_variables)
optimizer.apply_gradients(zip(grad, test_layer.trainable_variables))
self.assertNotIn(None, grad)
if __name__ == "__main__":
tf.test.main()
# Lint as: python3
# Copyright 2020 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Image classification task definition."""
import tensorflow as tf
from official.core import input_reader
from official.core import task_factory
from official.vision.beta.dataloaders import classification_input
from official.vision.beta.projects.yolo.configs import darknet_classification as exp_cfg
from official.vision.beta.projects.yolo.dataloaders import classification_tfds_decoder as cli
from official.vision.beta.tasks import image_classification
@task_factory.register_task_cls(exp_cfg.ImageClassificationTask)
class ImageClassificationTask(image_classification.ImageClassificationTask):
"""A task for image classification."""
def build_inputs(self, params, input_context=None):
"""Builds classification input."""
num_classes = self.task_config.model.num_classes
input_size = self.task_config.model.input_size
if params.tfds_name:
decoder = cli.Decoder()
else:
decoder = classification_input.Decoder()
parser = classification_input.Parser(
output_size=input_size[:2],
num_classes=num_classes,
dtype=params.dtype)
reader = input_reader.InputReader(
params,
dataset_fn=tf.data.TFRecordDataset,
decoder_fn=decoder.decode,
parser_fn=parser.parse_fn(params.is_training))
dataset = reader.read(input_context=input_context)
return dataset
def train_step(self, inputs, model, optimizer, metrics=None):
"""Does forward and backward.
Args:
inputs: a dictionary of input tensors.
model: the model, forward pass definition.
optimizer: the optimizer for this training step.
metrics: a nested structure of metrics objects.
Returns:
A dictionary of logs.
"""
features, labels = inputs
if self.task_config.losses.one_hot:
labels = tf.one_hot(labels, self.task_config.model.num_classes)
num_replicas = tf.distribute.get_strategy().num_replicas_in_sync
with tf.GradientTape() as tape:
outputs = model(features, training=True)
      # Casting the output layer as float32 is necessary when mixed_precision
      # is mixed_float16 or mixed_bfloat16 to ensure the output is cast to
      # float32.
outputs = tf.nest.map_structure(
lambda x: tf.cast(x, tf.float32), outputs)
# Computes per-replica loss.
loss = self.build_losses(
model_outputs=outputs, labels=labels, aux_losses=model.losses)
# Scales loss as the default gradients allreduce performs sum inside the
# optimizer.
scaled_loss = loss / num_replicas
# For mixed_precision policy, when LossScaleOptimizer is used, loss is
# scaled for numerical stability.
if isinstance(
optimizer, tf.keras.mixed_precision.experimental.LossScaleOptimizer):
scaled_loss = optimizer.get_scaled_loss(scaled_loss)
tvars = model.trainable_variables
grads = tape.gradient(scaled_loss, tvars)
# Scales back gradient before apply_gradients when LossScaleOptimizer is
# used.
if isinstance(
optimizer, tf.keras.mixed_precision.experimental.LossScaleOptimizer):
grads = optimizer.get_unscaled_gradients(grads)
# Apply gradient clipping.
if self.task_config.gradient_clip_norm > 0:
grads, _ = tf.clip_by_global_norm(
grads, self.task_config.gradient_clip_norm)
optimizer.apply_gradients(list(zip(grads, tvars)))
logs = {self.loss: loss}
if metrics:
self.process_metrics(metrics, labels, outputs)
logs.update({m.name: m.result() for m in metrics})
elif model.compiled_metrics:
self.process_compiled_metrics(model.compiled_metrics, labels, outputs)
logs.update({m.name: m.result() for m in model.metrics})
return logs
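# Reference-only sketch (not used by the task above): the LossScaleOptimizer
# pattern from train_step, reduced to a standalone toy example so the scale /
# unscale steps are easier to follow. The tiny Dense model and the random data
# are hypothetical and exist purely for illustration.
def _loss_scale_example():
  model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
  optimizer = tf.keras.mixed_precision.experimental.LossScaleOptimizer(
      tf.keras.optimizers.SGD(), loss_scale='dynamic')
  features = tf.random.normal([4, 8])
  labels = tf.random.normal([4, 1])
  with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(model(features, training=True) - labels))
    # Scale the loss up so small float16 gradients do not underflow.
    scaled_loss = optimizer.get_scaled_loss(loss)
  grads = tape.gradient(scaled_loss, model.trainable_variables)
  # Undo the scaling before the gradients are applied.
  grads = optimizer.get_unscaled_gradients(grads)
  optimizer.apply_gradients(zip(grads, model.trainable_variables))
  return loss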
# Lint as: python3
# Copyright 2020 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""TensorFlow Model Garden Vision training driver."""
from absl import app
from absl import flags
import gin
from official.common import distribute_utils
from official.common import flags as tfm_flags
from official.core import task_factory
from official.core import train_lib
from official.core import train_utils
from official.modeling import performance
from official.vision.beta.projects.yolo.common import registry_imports # pylint: disable=unused-import
FLAGS = flags.FLAGS
'''
python3 -m official.vision.beta.projects.yolo.train --mode=train_and_eval --experiment=darknet_classification --model_dir=training_dir --config_file=official/vision/beta/projects/yolo/configs/experiments/darknet53_tfds.yaml
'''
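# An eval-only run of the same experiment can be launched by changing the mode
# flag (assuming the standard Model Garden modes such as 'eval' and
# 'train_and_eval' supported by train_lib.run_experiment), for example:
# python3 -m official.vision.beta.projects.yolo.train --mode=eval --experiment=darknet_classification --model_dir=training_dir --config_file=official/vision/beta/projects/yolo/configs/experiments/darknet53_tfds.yaml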
def main(_):
gin.parse_config_files_and_bindings(FLAGS.gin_file, FLAGS.gin_params)
print(FLAGS.experiment)
params = train_utils.parse_configuration(FLAGS)
model_dir = FLAGS.model_dir
if 'train' in FLAGS.mode:
# Pure eval modes do not output yaml files. Otherwise continuous eval job
# may race against the train job for writing the same file.
train_utils.serialize_config(params, model_dir)
  # Sets the mixed_precision policy. Using 'mixed_float16' or 'mixed_bfloat16'
  # can have a significant impact on model speed by utilizing float16 in the
  # case of GPUs and bfloat16 in the case of TPUs. loss_scale takes effect only
  # when the dtype is float16.
if params.runtime.mixed_precision_dtype:
performance.set_mixed_precision_policy(params.runtime.mixed_precision_dtype,
params.runtime.loss_scale)
distribution_strategy = distribute_utils.get_distribution_strategy(
distribution_strategy=params.runtime.distribution_strategy,
all_reduce_alg=params.runtime.all_reduce_alg,
num_gpus=params.runtime.num_gpus,
tpu_address=params.runtime.tpu)
with distribution_strategy.scope():
task = task_factory.get_task(params.task, logging_dir=model_dir)
train_lib.run_experiment(
distribution_strategy=distribution_strategy,
task=task,
mode=FLAGS.mode,
params=params,
model_dir=model_dir)
if __name__ == '__main__':
tfm_flags.define_flags()
app.run(main)