"docs/git@developer.sourcefind.cn:renzhc/diffusers_dcu.git" did not exist on "62cbde8d41ac39e4b3a1f5bbbbc546cc93f1d84d"
Commit 9276aa93 authored by VoVAllen's avatar VoVAllen
Browse files

add tutorial

parent 8801154b
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Capsule Network\n",
"================\n",
"**Author**: `Jinjing Zhou`\n",
"\n",
"This Tutorial is for blablabla"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"\n",
"Capsule Network is proposed by blablabla"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What's a capsule?\n",
"> A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or an object part. \n",
"-- <cite>Geoffrey E. Hinton</cite>\n",
"\n",
"Generally Speaking, "
]
},
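{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch (following equation 1 of Sabour et al.), the `squash` non-linearity below turns a capsule's total input `s_j` into an output vector `v_j` whose length lies in [0, 1) while preserving its direction. The `eps` term is an added safeguard, not part of the paper's formula:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"\n",
"def squash(s, dim=-1, eps=1e-8):\n",
"    \"\"\"Squash non-linearity (equation 1 in the paper).\n",
"\n",
"    Shrinks short vectors toward zero and long vectors toward unit\n",
"    length, keeping the direction unchanged.\n",
"    \"\"\"\n",
"    mag_sq = torch.sum(s ** 2, dim=dim, keepdim=True)\n",
"    mag = torch.sqrt(mag_sq + eps)  # eps guards against division by zero\n",
"    return (mag_sq / (1.0 + mag_sq)) * (s / mag)\n",
"\n",
"# A long input keeps its direction but its norm stays below 1.\n",
"v = squash(torch.tensor([[3.0, 4.0]]))\n",
"print(v, v.norm())"
]
},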
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
# Changelog
All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).
## [0.4.0] - 2018-01-30
### Added
- Supports and works with CIFAR10 dataset.
### Changed
- Upgrade to PyTorch 0.3.0.
- Supports CUDA 9.
- Drop our custom softmax function and switch to PyTorch softmax function.
- Modify the save_image utils function to handle 3-channel (RGB) images.
### Fixed
- Compatibilities with PyTorch 0.3.0.
## [0.3.0] - 2017-11-27
### Added
- Decoder network PyTorch module.
- Reconstruct image with Decoder network during testing.
- Save the original and reconstructed images into the file system.
- Log the original and reconstructed images using TensorBoard.
### Changed
- Refactor reconstruction loss function and decoder network.
- Remove image reconstruction from training.
## [0.2.0] - 2017-11-26
### Added
- New dependencies for TensorBoard and tqdm.
- Logging losses and accuracies with TensorBoard.
- New utils functions for:
  - computing accuracy
  - converting values of the model parameters to numpy.array
  - parsing boolean values with argparse
- Softmax function that takes a dimension.
- More detailed code comments.
- Show margin loss and reconstruction loss in logs.
- Show accuracy in train logs.
### Changed
- Refactor loss functions.
- Clean codes.
### Fixed
- Runtime error during `pip install -r requirements.txt`.
- Bug in routing algorithm.
## [0.1.0] - 2017-11-12
### Added
- Implemented reconstruction loss.
- Saving reconstructed image as file.
- Improve training speed by using PyTorch DataParallel to wrap our model.
  - PyTorch will parallelize the model and data over multiple GPUs.
- Supports training:
  - on CPU (tested with macOS Sierra)
  - on one GPU (tested with 1 Tesla K80 GPU)
  - on multiple GPUs (tested with 8 GPUs)
  - with or without CUDA (tested with CUDA version 8.0.61)
  - with cuDNN 5 (tested with cuDNN 5.1.3)
### Changed
- More intuitive variable naming.
### Fixed
- Resolve Pylint warnings and reformat code.
- Missing square in equation 4 for margin (class) loss.
## 0.0.1 - 2017-11-04
### Added
- Initial release. This is the first beta version: the API is stable and the code runs, so it should be safe for development use, but it is not ready for general production usage.
[Unreleased]: https://github.com/cedrickchee/capsule-net-pytorch/compare/v0.4.0...HEAD
[0.1.0]: https://github.com/cedrickchee/capsule-net-pytorch/compare/v0.0.1...v0.1.0
[0.2.0]: https://github.com/cedrickchee/capsule-net-pytorch/compare/v0.1.0...v0.2.0
[0.3.0]: https://github.com/cedrickchee/capsule-net-pytorch/compare/v0.2.0...v0.3.0
[0.4.0]: https://github.com/cedrickchee/capsule-net-pytorch/compare/v0.3.0...v0.4.0
COPYRIGHT
All contributions by Cedric Chee:
Copyright (c) 2017, Cedric Chee.
All rights reserved.
All other contributions:
Copyright (c) 2017, the respective contributors.
All rights reserved.
Each contributor holds copyright over their respective contributions.
The project versioning (Git) records all such contribution source information.
LICENSE
The MIT License (MIT)
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# PyTorch CapsNet: Capsule Network for PyTorch
[![license](https://img.shields.io/github/license/mashape/apistatus.svg?maxAge=2592000)](https://github.com/cedrickchee/capsule-net-pytorch/blob/master/LICENSE)
![completion](https://img.shields.io/badge/completion%20state-95%25-green.svg?style=plastic)
A CUDA-enabled PyTorch implementation of CapsNet (Capsule Network) based on this paper:
[Sara Sabour, Nicholas Frosst, Geoffrey E Hinton. Dynamic Routing Between Capsules. NIPS 2017](https://arxiv.org/abs/1710.09829)
The current `test error is 0.21%` and the `best test error is 0.20%`. The current `test accuracy is 99.31%` and the `best test accuracy is 99.32%`.
**What is a Capsule**
> A Capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or object part.
You can learn more about Capsule Networks [here](#learning-resources).
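Concretely, the length of a capsule's output vector can be read as the probability that the entity it represents is present. The sketch below (mirroring what `utils.accuracy` in this repo does, with a randomly generated stand-in for the real `DigitCaps` output) turns capsule outputs into a class prediction:

```python
import torch

# Stand-in for DigitCaps output: [batch_size, 10 classes, 16D capsule, 1].
output = torch.randn(128, 10, 16, 1)

# The length (L2 norm) of each 16D capsule vector encodes how likely
# the corresponding digit is to be present.
v_length = torch.sqrt((output ** 2).sum(dim=2, keepdim=True))  # [128, 10, 1, 1]

# Predict the digit whose capsule output is longest.
pred = v_length.max(dim=1)[1].view(-1)  # [128]
```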
**Why another CapsNet implementation?**
I wanted a decent PyTorch implementation of CapsNet and I couldn't find one when I started. The goal of this implementation is to help newcomers learn and understand the CapsNet architecture and the idea of Capsules. The implementation is **NOT** focused on rigorous correctness of the results. In addition, the code is not optimized for speed. To make the code easier to read and understand, it comes with ample comments, and the Python classes and functions are documented with docstrings.
I will try my best to check and fix reported issues. Contributions are highly welcome. If you find any bugs or errors in the code, please do not hesitate to open an issue or a pull request. Thank you.
**Status and Latest Updates:**
See the [CHANGELOG](CHANGELOG.md)
**Datasets**
The model was trained on the standard [MNIST](http://yann.lecun.com/exdb/mnist/) data.
*Note: you don't have to manually download, preprocess, and load the MNIST dataset as [TorchVision](https://github.com/pytorch/vision) will take care of this step for you.*
I have tried using other datasets. See the [Other Datasets](#other-datasets) section below for more details.
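For reference, here is a minimal sketch of what `utils.load_mnist` in this repo does under the hood (the normalization constants are MNIST's mean and standard deviation):

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Normalize MNIST with its per-channel mean and standard deviation.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])

# download=True fetches the dataset into ./data on first use.
train_loader = DataLoader(
    datasets.MNIST('./data', train=True, download=True, transform=transform),
    batch_size=128, shuffle=True)
```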
## Requirements
- Python 3
  - Tested with version 3.6.4
- [PyTorch](http://pytorch.org/)
  - Tested with version 0.3.0.post4
  - Migrating existing code to work with version 0.4.0 is Work-In-Progress.
  - Code will not run with version 0.1.2 because `keepdim` is not available in that version.
  - Code will not run with version 0.2.0 because the `softmax` function does not take a dimension argument in that version.
- CUDA 8 and above
  - Tested with CUDA 8 and CUDA 9.
- [TorchVision](https://github.com/pytorch/vision)
- [tensorboardX](https://github.com/lanpa/tensorboard-pytorch)
- [tqdm](https://github.com/tqdm/tqdm)
## Usage
### Training and Evaluation
**Step 1.**
Clone this repository with ``git`` and install project dependencies.
```bash
$ git clone https://github.com/cedrickchee/capsule-net-pytorch.git
$ cd capsule-net-pytorch
$ pip install -r requirements.txt
```
**Step 2.**
Start the CapsNet on MNIST training and evaluation:
- Training with default settings:
```bash
$ python main.py
```
- Training on 8 GPUs with 30 epochs and 1 routing iteration:
```bash
$ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py --epochs 30 --num-routing 1 --threads 16 --batch-size 128 --test-batch-size 128
```
**Step 3.**
Test a pre-trained model:
If you have trained a model in Step 2 above, then the weights for the trained model will be saved to `results/trained_model/model_epoch_10.pth`. [WIP] Now just run the following command to get test results.
```bash
$ python main.py --is-training 0 --weights results/trained_model/model_epoch_10.pth
```
### Pre-trained Model
You can download the weights for the pre-trained model from my Google Drive. We saved the weights (model state dict) and the optimizer state for the model at the end of every training epoch.
- Weights from [epoch 50 checkpoint](https://drive.google.com/uc?export=download&id=1lYtOMSreP4I9hr9un4DsBJZrzodI6l2d) [84 MB].
- Weights from [epoch 40 to 50](https://drive.google.com/uc?export=download&id=1VMuVtJrecz47czsT5HqLxZpFjkLoMKaL) checkpoints [928 MB].
Uncompress and put the weights (.pth files) into `./results/trained_model/`.
*Note: the model was **last trained** on 2017-11-26 and the weights **last updated** on 2017-11-28.*
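If you want to inspect or resume from one of these checkpoints, the sketch below shows one way to load them (assuming the `{'epoch', 'state_dict', 'optimizer'}` layout saved by `utils.checkpoint`; the epoch number in the path is just an example):

```python
import torch

from model import Net

# Load on CPU regardless of where the checkpoint was saved.
checkpoint = torch.load('results/trained_model/model_epoch_50.pth',
                        map_location=lambda storage, loc: storage)

# Default MNIST hyperparameters (see the table below).
model = Net(num_conv_in_channel=1, num_conv_out_channel=256,
            num_primary_unit=8, primary_unit_size=1152,
            num_classes=10, output_unit_size=16, num_routing=3,
            use_reconstruction_loss=True, regularization_scale=0.0005,
            input_width=28, input_height=28, cuda_enabled=False)

# Checkpoints saved while the model was wrapped in DataParallel have
# keys prefixed with 'module.'; strip that prefix before loading.
state_dict = {k.replace('module.', '', 1): v
              for k, v in checkpoint['state_dict'].items()}
model.load_state_dict(state_dict)
print('Loaded checkpoint from epoch', checkpoint['epoch'])
```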
### The Default Hyper Parameters
| Parameter | Value | CLI arguments |
| --- | --- | --- |
| Training epochs | 10 | --epochs 10 |
| Learning rate | 0.01 | --lr 0.01 |
| Training batch size | 128 | --batch-size 128 |
| Testing batch size | 128 | --test-batch-size 128 |
| Log interval | 10 | --log-interval 10 |
| Disables CUDA training | false | --no-cuda |
| Num. of channels produced by the convolution | 256 | --num-conv-out-channel 256 |
| Num. of input channels to the convolution | 1 | --num-conv-in-channel 1 |
| Num. of primary unit | 8 | --num-primary-unit 8 |
| Primary unit size | 1152 | --primary-unit-size 1152 |
| Num. of digit classes | 10 | --num-classes 10 |
| Output unit size | 16 | --output-unit-size 16 |
| Num. routing iteration | 3 | --num-routing 3 |
| Use reconstruction loss | true | --use-reconstruction-loss |
| Regularization coefficient for reconstruction loss | 0.0005 | --regularization-scale 0.0005 |
| Dataset name (mnist, cifar10) | mnist | --dataset mnist |
| Input image width to the convolution | 28 | --input-width 28 |
| Input image height to the convolution | 28 | --input-height 28 |
## Results
### Test Error
CapsNet classification test error on MNIST. The MNIST average and standard deviation results are reported from 3 trials.
The results can be reproduced by running the following commands.
```bash
python main.py --epochs 50 --num-routing 1 --use-reconstruction-loss no --regularization-scale 0.0 #CapsNet-v1
python main.py --epochs 50 --num-routing 1 --use-reconstruction-loss yes --regularization-scale 0.0005 #CapsNet-v2
python main.py --epochs 50 --num-routing 3 --use-reconstruction-loss no --regularization-scale 0.0 #CapsNet-v3
python main.py --epochs 50 --num-routing 3 --use-reconstruction-loss yes --regularization-scale 0.0005 #CapsNet-v4
```
Method | Routing | Reconstruction | MNIST (%) | *Paper*
:---------|:------:|:---:|:----:|:----:
Baseline | -- | -- | -- | *0.39*
CapsNet-v1 | 1 | no | -- | *0.34 (0.032)*
CapsNet-v2 | 1 | yes | -- | *0.29 (0.011)*
CapsNet-v3 | 3 | no | -- | *0.35 (0.036)*
CapsNet-v4 | 3 | yes | 0.21 | *0.25 (0.005)*
### Training Loss and Accuracy
The training losses and accuracies for CapsNet-v4 (50 epochs, 3 routing iterations, using reconstruction, regularization scale of 0.0005):
![](results/train_loss_accuracy.png)
Training accuracy. Highest training accuracy: 100%
![](results/train_accuracy.png)
Training loss. Lowest training error: 0.1938%
![](results/train_loss.png)
### Test Loss and Accuracy
The test losses and accuracies for CapsNet-v4 (50 epochs, 3 routing iterations, using reconstruction, regularization scale of 0.0005):
![](results/test_loss_accuracy.png)
Test accuracy. Highest test accuracy: 99.32%
![](results/test_accuracy.png)
Test loss. Lowest test error: 0.2002%
![](results/test_loss.png)
### Training Speed
- Around `5.97s / batch` or `8min / epoch` on a single Tesla K80 GPU with a batch size of 704.
- Around `3.25s / batch` or `25min / epoch` on a single Tesla K80 GPU with a batch size of 128.
![](results/training_speed.png)
In my case, these are the hyperparameters I used for the training setup:
- Batch size: 128
- Epochs: 50
- Num. of routing: 3
- Use reconstruction loss: yes
- Regularization scale for reconstruction loss: 0.0005
### Reconstruction
The results of CapsNet-v4.
Digits at left are reconstructed images.
<table>
<tr>
<td>
<img src="results/reconstructed_images.png"/>
</td>
<td>
<p>[WIP] Ground truth image from dataset</p>
</td>
</tr>
</table>
### Model Design
```bash
Model architecture:
------------------
Net (
(conv1): ConvLayer (
(conv0): Conv2d(1, 256, kernel_size=(9, 9), stride=(1, 1))
(relu): ReLU (inplace)
)
(primary): CapsuleLayer (
(conv_units): ModuleList (
(0): Conv2d(256, 32, kernel_size=(9, 9), stride=(2, 2))
(1): Conv2d(256, 32, kernel_size=(9, 9), stride=(2, 2))
(2): Conv2d(256, 32, kernel_size=(9, 9), stride=(2, 2))
(3): Conv2d(256, 32, kernel_size=(9, 9), stride=(2, 2))
(4): Conv2d(256, 32, kernel_size=(9, 9), stride=(2, 2))
(5): Conv2d(256, 32, kernel_size=(9, 9), stride=(2, 2))
(6): Conv2d(256, 32, kernel_size=(9, 9), stride=(2, 2))
(7): Conv2d(256, 32, kernel_size=(9, 9), stride=(2, 2))
)
)
(digits): CapsuleLayer (
)
(decoder): Decoder (
(fc1): Linear (160 -> 512)
(fc2): Linear (512 -> 1024)
(fc3): Linear (1024 -> 784)
(relu): ReLU (inplace)
(sigmoid): Sigmoid ()
)
)
Parameters and size:
-------------------
conv1.conv0.weight: [256, 1, 9, 9]
conv1.conv0.bias: [256]
primary.conv_units.0.weight: [32, 256, 9, 9]
primary.conv_units.0.bias: [32]
primary.conv_units.1.weight: [32, 256, 9, 9]
primary.conv_units.1.bias: [32]
primary.conv_units.2.weight: [32, 256, 9, 9]
primary.conv_units.2.bias: [32]
primary.conv_units.3.weight: [32, 256, 9, 9]
primary.conv_units.3.bias: [32]
primary.conv_units.4.weight: [32, 256, 9, 9]
primary.conv_units.4.bias: [32]
primary.conv_units.5.weight: [32, 256, 9, 9]
primary.conv_units.5.bias: [32]
primary.conv_units.6.weight: [32, 256, 9, 9]
primary.conv_units.6.bias: [32]
primary.conv_units.7.weight: [32, 256, 9, 9]
primary.conv_units.7.bias: [32]
digits.weight: [1, 1152, 10, 16, 8]
decoder.fc1.weight: [512, 160]
decoder.fc1.bias: [512]
decoder.fc2.weight: [1024, 512]
decoder.fc2.bias: [1024]
decoder.fc3.weight: [784, 1024]
decoder.fc3.bias: [784]
Total number of parameters (with reconstruction network): 8227088 (8.2 million)
```
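The `--primary-unit-size` values (1152 for MNIST, 2048 for CIFAR10) follow directly from the convolution arithmetic above. A small sketch (using the standard output-size formula for a valid convolution; not code from this repo) that reproduces both numbers:

```python
def conv_out(size, kernel, stride):
    """Spatial output size of a convolution with no padding."""
    return (size - kernel) // stride + 1

def primary_unit_size(input_size, num_capsule_channels=32):
    # Conv1: 9x9 kernel, stride 1; PrimaryCaps conv units: 9x9 kernel, stride 2.
    after_conv1 = conv_out(input_size, kernel=9, stride=1)
    after_primary = conv_out(after_conv1, kernel=9, stride=2)
    return num_capsule_channels * after_primary ** 2

print(primary_unit_size(28))  # MNIST:   32 * 6 * 6 = 1152
print(primary_unit_size(32))  # CIFAR10: 32 * 8 * 8 = 2048
```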
### TensorBoard
We logged the training and test losses and accuracies using tensorboardX. TensorBoard helps us visualize how the model learns over time. We can visualize statistics, such as how the objective function changes, or how weights and accuracies vary during training.
TensorBoard operates by reading TensorFlow data (event files).
#### How to Use TensorBoard
1. Download a [copy of the events files](https://drive.google.com/uc?export=download&id=1lZVffeZTkUQfSxmZmYDViRzmhb59wBWL) for the latest run from my Google Drive.
2. Uncompress the file and put it into `./runs`.
3. Check that you have TensorFlow (CPU version) installed. We need it for the TensorBoard server and dashboard.
4. Start TensorBoard.
```bash
$ tensorboard --logdir runs
```
5. Open the TensorBoard dashboard in your web browser using this URL: http://localhost:6006
### Other Datasets
#### CIFAR10
In the spirit of experimentation, I have tried using other datasets. I have updated the implementation so that it supports and works with CIFAR10. Note that I have not thoroughly tested the capsule model on CIFAR10.
Here's how to train and test the model on CIFAR10:
```bash
python main.py --dataset cifar10 --num-conv-in-channel 3 --input-width 32 --input-height 32 --primary-unit-size 2048 --epochs 80 --num-routing 1 --use-reconstruction-loss yes --regularization-scale 0.0005
```
##### Training Loss and Accuracy
The training losses and accuracies for CapsNet-v4 (80 epochs, 3 routing iterations, using reconstruction, regularization scale of 0.0005):
![](results/cifar10/train_loss_accuracy.png)
- Highest training accuracy: 100%
- Lowest training error: 0.3589%
##### Test Loss and Accuracy
The test losses and accuracies for CapsNet-v4 (80 epochs, 3 routing iterations, using reconstruction, regularization scale of 0.0005):
![](results/cifar10/test_loss_accuracy.png)
- Highest test accuracy: 71%
- Lowest test error: 0.5735%
## TODO
- [x] Publish results.
- [x] More testing.
- [ ] Inference mode - command to test a pre-trained model.
- [ ] Jupyter Notebook version.
- [x] Create a sample to show how we can apply CapsNet to a real-world application.
- [ ] Experiment with CapsNet:
* [x] Try using another dataset.
* [ ] Come up with a more creative model structure.
- [x] Pre-trained model and weights.
- [x] Add visualization for training and evaluation metrics.
- [x] Implement reconstruction loss.
- [x] Check algorithm for correctness.
- [x] Update results from TensorBoard after making improvements and bug fixes.
- [x] Publish updated pre-trained model weights.
- [x] Log the original and reconstructed images using TensorBoard.
- [ ] Update results with reconstructed image and original image.
- [ ] Resume training by loading model checkpoint.
- [ ] Migrate existing code to work in PyTorch 0.4.0.
*WIP is an acronym for Work-In-Progress*
## Credits
I referenced these implementations mainly for sanity checks:
1. [TensorFlow implementation by @naturomics](https://github.com/naturomics/CapsNet-Tensorflow)
## Learning Resources
Here are some resources that we think will be helpful if you want to learn more about Capsule Networks:
- Articles and blog posts:
- [Understanding Hinton's Capsule Networks. Part I: Intuition.](https://medium.com/@pechyonkin/understanding-hintons-capsule-networks-part-i-intuition-b4b559d1159b)
- [Dynamic routing between capsules](https://blog.acolyer.org/2017/11/13/dynamic-routing-between-capsules/)
- [What is a CapsNet or Capsule Network?](https://hackernoon.com/what-is-a-capsnet-or-capsule-network-2bfbe48769cc)
- [Capsule Networks Are Shaking up AI — Here's How to Use Them](https://hackernoon.com/capsule-networks-are-shaking-up-ai-heres-how-to-use-them-c233a0971952)
- [Capsule Networks Explained](https://kndrck.co/posts/capsule_networks_explained/)
- Videos:
- [Capsule Networks: An Improvement to Convolutional Networks](https://www.youtube.com/watch?v=VKoLGnq15RM)
- [Capsule Networks (CapsNets) – Tutorial](https://www.youtube.com/watch?v=pPN8d0E3900)
## Other Implementations
- TensorFlow:
- The first author of the paper, [Sara Sabour has released the code](https://github.com/Sarasra/models/tree/master/research/capsules).
## Real-world Application of CapsNet
The following are a few samples in the wild that show how we can apply CapsNet to real-world use cases.
- [An attempt to implement CapsNet for car make-model classification](https://www.reddit.com/r/MachineLearning/comments/80eiz3/p_implementing_a_capsnet_for_car_makemodel/)
- [A Keras implementation of Capsule Network on Fashion MNIST dataset](https://github.com/XifengGuo/CapsNet-Fashion-MNIST)
"""Capsule layer
PyTorch implementation of CapsNet in Sabour, Hinton et al.'s paper
Dynamic Routing Between Capsules. NIPS 2017.
https://arxiv.org/abs/1710.09829
Author: Cedric Chee
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
import utils
class CapsuleLayer(nn.Module):
"""
The core implementation of the idea of capsules
"""
def __init__(self, in_unit, in_channel, num_unit, unit_size, use_routing,
num_routing, cuda_enabled):
super(CapsuleLayer, self).__init__()
self.in_unit = in_unit
self.in_channel = in_channel
self.num_unit = num_unit
self.use_routing = use_routing
self.num_routing = num_routing
self.cuda_enabled = cuda_enabled
if self.use_routing:
"""
Based on the paper, DigitCaps which is capsule layer(s) with
capsule inputs use a routing algorithm that uses this weight matrix, Wij
"""
# weight shape:
# [1 x primary_unit_size x num_classes x output_unit_size x num_primary_unit]
# == [1 x 1152 x 10 x 16 x 8]
self.weight = nn.Parameter(torch.randn(1, in_channel, num_unit, unit_size, in_unit))
else:
"""
According to the CapsNet architecture section in the paper,
we have routing only between two consecutive capsule layers (e.g. PrimaryCapsules and DigitCaps).
No routing is used between Conv1 and PrimaryCapsules.
This means PrimaryCapsules is composed of several convolutional units.
"""
# Define 8 convolutional units.
self.conv_units = nn.ModuleList([
nn.Conv2d(self.in_channel, 32, 9, 2) for u in range(self.num_unit)
])
def forward(self, x):
if self.use_routing:
# Currently used by DigitCaps layer.
return self.routing(x)
else:
# Currently used by PrimaryCaps layer.
return self.no_routing(x)
def routing(self, x):
"""
Routing algorithm for capsule.
:input: tensor x of shape [128, 8, 1152]
:return: vector output of capsule j
"""
batch_size = x.size(0)
x = x.transpose(1, 2) # dim 1 and dim 2 are swapped. out tensor shape: [128, 1152, 8]
# Stacking and adding a dimension to a tensor.
# stack ops output shape: [128, 1152, 10, 8]
# unsqueeze ops output shape: [128, 1152, 10, 8, 1]
x = torch.stack([x] * self.num_unit, dim=2).unsqueeze(4)
# Convert single weight to batch weight.
# [1 x 1152 x 10 x 16 x 8] to: [128, 1152, 10, 16, 8]
batch_weight = torch.cat([self.weight] * batch_size, dim=0)
# u_hat is "prediction vectors" from the capsules in the layer below.
# Transform inputs by weight matrix.
# Matrix product of 2 tensors with shape: [128, 1152, 10, 16, 8] x [128, 1152, 10, 8, 1]
# u_hat shape: [128, 1152, 10, 16, 1]
u_hat = torch.matmul(batch_weight, x)
# All the routing logits (b_ij in the paper) are initialized to zero.
# self.in_channel = primary_unit_size = 32 * 6 * 6 = 1152
# self.num_unit = num_classes = 10
# b_ij shape: [1, 1152, 10, 1]
b_ij = Variable(torch.zeros(1, self.in_channel, self.num_unit, 1))
if self.cuda_enabled:
b_ij = b_ij.cuda()
# From the paper in the "Capsules on MNIST" section,
# the sample MNIST test reconstructions of a CapsNet with 3 routing iterations.
num_iterations = self.num_routing
for iteration in range(num_iterations):
# Routing algorithm
# Calculate routing or also known as coupling coefficients (c_ij).
# c_ij shape: [1, 1152, 10, 1]
c_ij = F.softmax(b_ij, dim=2) # Convert routing logits (b_ij) to softmax.
            # c_ij shape from: [1, 1152, 10, 1] to: [128, 1152, 10, 1, 1]
c_ij = torch.cat([c_ij] * batch_size, dim=0).unsqueeze(4)
# Implement equation 2 in the paper.
            # s_j is the total input to a capsule: a weighted sum over all "prediction vectors".
            # u_hat is the weighted input: the prediction û_j|i made by capsule i.
# c_ij * u_hat shape: [128, 1152, 10, 16, 1]
# s_j output shape: [batch_size=128, 1, 10, 16, 1]
# Sum of Primary Capsules outputs, 1152D becomes 1D.
s_j = (c_ij * u_hat).sum(dim=1, keepdim=True)
# Squash the vector output of capsule j.
# v_j shape: [batch_size, weighted sum of PrimaryCaps output,
# num_classes, output_unit_size from u_hat, 1]
# == [128, 1, 10, 16, 1]
# So, the length of the output vector of a capsule is 16, which is in dim 3.
v_j = utils.squash(s_j, dim=3)
# in_channel is 1152.
# v_j1 shape: [128, 1152, 10, 16, 1]
v_j1 = torch.cat([v_j] * self.in_channel, dim=1)
# The agreement.
# Transpose u_hat with shape [128, 1152, 10, 16, 1] to [128, 1152, 10, 1, 16],
# so we can do matrix product u_hat and v_j1.
# u_vj1 shape: [1, 1152, 10, 1]
u_vj1 = torch.matmul(u_hat.transpose(3, 4), v_j1).squeeze(4).mean(dim=0, keepdim=True)
# Update routing (b_ij) by adding the agreement to the initial logit.
b_ij = b_ij + u_vj1
return v_j.squeeze(1) # shape: [128, 10, 16, 1]
def no_routing(self, x):
"""
Get output for each unit.
A unit has batch, channels, height, width.
An example of a unit output shape is [128, 32, 6, 6]
:return: vector output of capsule j
"""
# Create 8 convolutional unit.
# A convolutional unit uses normal convolutional layer with a non-linearity (squash).
        unit = [conv_unit(x) for conv_unit in self.conv_units]
# Stack all unit outputs.
# Stacked of 8 unit output shape: [128, 8, 32, 6, 6]
unit = torch.stack(unit, dim=1)
batch_size = x.size(0)
# Flatten the 32 of 6x6 grid into 1152.
# Shape: [128, 8, 1152]
unit = unit.view(batch_size, self.num_unit, -1)
# Add non-linearity
# Return squashed outputs of shape: [128, 8, 1152]
return utils.squash(unit, dim=2) # dim 2 is the third dim (1152D array) in our tensor
"""Convolutional layer
PyTorch implementation of CapsNet in Sabour, Hinton et al.'s paper
Dynamic Routing Between Capsules. NIPS 2017.
https://arxiv.org/abs/1710.09829
Author: Cedric Chee
"""
import torch
import torch.nn as nn
class ConvLayer(nn.Module):
"""
Conventional Conv2d layer
"""
def __init__(self, in_channel, out_channel, kernel_size):
super(ConvLayer, self).__init__()
self.conv0 = nn.Conv2d(in_channels=in_channel,
out_channels=out_channel,
kernel_size=kernel_size,
stride=1)
self.relu = nn.ReLU(inplace=True)
def forward(self, x):
"""Forward pass"""
# x shape: [128, 1, 28, 28]
# out_conv0 shape: [128, 256, 20, 20]
out_conv0 = self.conv0(x)
# out_relu shape: [128, 256, 20, 20]
out_relu = self.relu(out_conv0)
return out_relu
"""Decoder Network
PyTorch implementation of CapsNet in Sabour, Hinton et al.'s paper
Dynamic Routing Between Capsules. NIPS 2017.
https://arxiv.org/abs/1710.09829
Author: Cedric Chee
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
import utils
class Decoder(nn.Module):
"""
Implement Decoder structure in section 4.1, Figure 2 to reconstruct a digit
from the `DigitCaps` layer representation.
The decoder network consists of 3 fully connected layers. For each
[10, 16] output, we mask out the incorrect predictions, and send
the [16,] vector to the decoder network to reconstruct a [784,] size
image.
This Decoder network is used in training and prediction (testing).
"""
def __init__(self, num_classes, output_unit_size, input_width,
input_height, num_conv_in_channel, cuda_enabled):
"""
The decoder network consists of 3 fully connected layers, with
512, 1024, 784 (or 3072 for CIFAR10) neurons each.
"""
super(Decoder, self).__init__()
self.cuda_enabled = cuda_enabled
fc1_output_size = 512
fc2_output_size = 1024
self.fc3_output_size = input_width * input_height * num_conv_in_channel
self.fc1 = nn.Linear(num_classes * output_unit_size, fc1_output_size) # input dim 10 * 16.
self.fc2 = nn.Linear(fc1_output_size, fc2_output_size)
self.fc3 = nn.Linear(fc2_output_size, self.fc3_output_size)
# Activation functions
self.relu = nn.ReLU(inplace=True)
self.sigmoid = nn.Sigmoid()
def forward(self, x, target):
"""
We send the outputs of the `DigitCaps` layer, which is a
[batch_size, 10, 16] size tensor into the Decoder network, and
reconstruct a [batch_size, fc3_output_size] size tensor representing the image.
Args:
x: [batch_size, 10, 16] The output of the digit capsule.
target: [batch_size, 10] One-hot MNIST dataset labels.
Returns:
reconstruction: [batch_size, fc3_output_size] Tensor of reconstructed images.
"""
batch_size = target.size(0)
"""
First, do masking.
"""
# Method 1: mask with y.
        # Note: we have not implemented method 2, which masks with the true label.
# masked_caps shape: [batch_size, 10, 16, 1]
masked_caps = utils.mask(x, self.cuda_enabled)
"""
Second, reconstruct the images with 3 Fully Connected layers.
"""
# vector_j shape: [batch_size, 160=10*16]
vector_j = masked_caps.view(x.size(0), -1) # reshape the masked_caps tensor
# Forward pass of the network
fc1_out = self.relu(self.fc1(vector_j))
fc2_out = self.relu(self.fc2(fc1_out)) # shape: [batch_size, 1024]
reconstruction = self.sigmoid(self.fc3(fc2_out)) # shape: [batch_size, fc3_output_size]
assert reconstruction.size() == torch.Size([batch_size, self.fc3_output_size])
return reconstruction
import dgl
import torch
import torch.nn.functional as F
from torch import nn
from capsule_layer import CapsuleLayer
# import main
from utils import writer, step
# global_step = main.global_step
device = "cuda" if torch.cuda.is_available() else "cpu"
class DGLFeature():
"""
To wrap different shape of representation tensor into the same shape
"""
def __init__(self, tensor, pad_to):
# self.tensor = tensor
self.node_num = tensor.size(0)
self.flat_tensor = tensor.contiguous().view(self.node_num, -1)
self.node_feature_dim = self.flat_tensor.size(1)
self.flat_pad_tensor = F.pad(self.flat_tensor, (0, pad_to - self.flat_tensor.size(1)))
self.shape = tensor.shape
@property
def tensor(self):
"""
:return: Tensor with original shape
"""
return self.flat_tensor.index_select(1, torch.arange(0, self.node_feature_dim).to(device)).view(self.shape)
@property
def padded_tensor(self):
"""
:return: Flatted and padded Tensor
"""
return self.flat_pad_tensor
class DGLBatchCapsuleLayer(CapsuleLayer):
def __init__(self, in_unit, in_channel, num_unit, unit_size, use_routing,
num_routing, cuda_enabled):
super(DGLBatchCapsuleLayer, self).__init__(in_unit, in_channel, num_unit, unit_size, use_routing,
num_routing, cuda_enabled)
self.unit_size = unit_size
self.weight = nn.Parameter(torch.randn(in_channel, num_unit, unit_size, in_unit))
def routing(self, x):
self.batch_size = x.size(0)
self.g = dgl.DGLGraph()
self.g.add_nodes_from([i for i in range(self.in_channel)])
self.g.add_nodes_from([i + self.in_channel for i in range(self.num_unit)])
for i in range(self.in_channel):
for j in range(self.num_unit):
index_j = j + self.in_channel
self.g.add_edge(i, index_j)
        self.edge_features = torch.zeros(self.in_channel, self.num_unit).to(device)
x_ = x.transpose(0, 2)
x_ = DGLFeature(x_, self.batch_size * self.unit_size)
x = x.transpose(1, 2)
x = torch.stack([x] * self.num_unit, dim=2).unsqueeze(4)
W = torch.cat([self.weight.unsqueeze(0)] * self.batch_size, dim=0)
u_hat = torch.matmul(W, x).permute(1, 2, 0, 3, 4).squeeze().contiguous()
        self.node_feature = DGLFeature(torch.zeros(self.num_unit, self.batch_size, self.unit_size).to(device),
                                       self.batch_size * self.unit_size)
nf = torch.cat([x_.padded_tensor, self.node_feature.padded_tensor], dim=0)
self.g.set_e_repr({'b_ij': self.edge_features.view(-1)})
self.g.set_n_repr({'h': nf})
self.g.set_e_repr({'u_hat': u_hat.view(-1, self.batch_size, self.unit_size)})
for i in range(self.num_routing):
self.i = i
self.g.update_all(self.capsule_msg, self.capsule_reduce,
lambda x: {'h': DGLFeature(x['h'], self.batch_size * self.unit_size).padded_tensor},
batchable=True)
self.g.update_edge(dgl.base.ALL, dgl.base.ALL, self.update_edge, batchable=True)
self.node_feature = self.g.get_n_repr()['h'] \
.index_select(0, torch.arange(self.in_channel, self.in_channel + self.num_unit).to(device)) \
.view(self.num_unit, self.batch_size, self.unit_size)
return self.node_feature.transpose(0, 1).unsqueeze(1).unsqueeze(4).squeeze(1)
def update_edge(self, u, v, edge):
return {
'b_ij': edge['b_ij'] + (v['h'].view(-1, self.batch_size, self.unit_size) * edge['u_hat']).mean(dim=1).sum(
dim=1)}
@staticmethod
def capsule_msg(src, edge):
return {'b_ij': edge['b_ij'], 'h': src['h'], 'u_hat': edge['u_hat']}
def capsule_reduce(self, node, msg):
b_ij_c, h_c, u_hat_c = msg['b_ij'], msg['h'], msg['u_hat']
u_hat = u_hat_c
c_i = F.softmax(b_ij_c, dim=0)
writer.add_histogram(f"c_i{self.i}", c_i, step['step'])
s_j = (c_i.unsqueeze(2).unsqueeze(3) * u_hat).sum(dim=1)
v_j = self.squash(s_j)
return {'h': v_j.view(-1, self.batch_size * self.unit_size)}
@staticmethod
def squash(s):
# This is equation 1 from the paper.
mag_sq = torch.sum(s ** 2, dim=2, keepdim=True)
mag = torch.sqrt(mag_sq)
s = (mag_sq / (1.0 + mag_sq)) * (s / mag)
return s
"""
PyTorch implementation of CapsNet in Sabour, Hinton et al.'s paper
Dynamic Routing Between Capsules. NIPS 2017.
https://arxiv.org/abs/1710.09829
Usage:
python main.py
python main.py --epochs 30
python main.py --epochs 30 --num-routing 1
Author: Cedric Chee
"""
from __future__ import print_function
import argparse
import os
from timeit import default_timer as timer
import torch
import torch.optim as optim
import torchvision.utils as vutils
from torch.autograd import Variable
from torch.backends import cudnn
from tqdm import tqdm
import utils
from model import Net
from utils import writer, step
def train(model, data_loader, optimizer, epoch, writer):
"""
Train CapsuleNet model on training set
Args:
model: The CapsuleNet model.
        data_loader: An iterator over the dataset. It combines a dataset and a sampler.
optimizer: Optimization algorithm.
epoch: Current epoch.
"""
print('===> Training mode')
num_batches = len(data_loader) # iteration per epoch. e.g: 469
total_step = args.epochs * num_batches
epoch_tot_acc = 0
# Switch to train mode
model.train()
if args.cuda:
# When we wrap a Module in DataParallel for multi-GPUs
model = model.module
start_time = timer()
for batch_idx, (data, target) in enumerate(tqdm(data_loader, unit='batch')):
batch_size = data.size(0)
global_step = batch_idx + (epoch * num_batches) - num_batches
step['step'] = global_step
labels = target
target_one_hot = utils.one_hot_encode(target, length=args.num_classes)
assert target_one_hot.size() == torch.Size([batch_size, 10])
data, target = Variable(data), Variable(target_one_hot)
if args.cuda:
data = data.cuda()
target = target.cuda()
# Train step - forward, backward and optimize
optimizer.zero_grad()
output = model(data) # output from DigitCaps (out_digit_caps)
loss, margin_loss, recon_loss = model.loss(data, output, target)
loss.backward()
optimizer.step()
# Calculate accuracy for each step and average accuracy for each epoch
acc = utils.accuracy(output, labels, args.cuda)
epoch_tot_acc += acc
epoch_avg_acc = epoch_tot_acc / (batch_idx + 1)
# TensorBoard logging
# 1) Log the scalar values
writer.add_scalar('train/total_loss', loss.item(), global_step)
writer.add_scalar('train/margin_loss', margin_loss.item(), global_step)
if args.use_reconstruction_loss:
writer.add_scalar('train/reconstruction_loss', recon_loss.item(), global_step)
writer.add_scalar('train/batch_accuracy', acc, global_step)
writer.add_scalar('train/accuracy', epoch_avg_acc, global_step)
# 2) Log values and gradients of the parameters (histogram)
# for tag, value in model.named_parameters():
# tag = tag.replace('.', '/')
# writer.add_histogram(tag, utils.to_np(value), global_step)
# writer.add_histogram(tag + '/grad', utils.to_np(value.grad), global_step)
# Print losses
if batch_idx % args.log_interval == 0:
template = 'Epoch {}/{}, ' \
'Step {}/{}: ' \
'[Total loss: {:.6f},' \
'\tMargin loss: {:.6f},' \
'\tReconstruction loss: {:.6f},' \
'\tBatch accuracy: {:.6f},' \
'\tAccuracy: {:.6f}]'
tqdm.write(template.format(
epoch,
args.epochs,
global_step,
total_step,
loss.item(),
margin_loss.item(),
recon_loss.item() if args.use_reconstruction_loss else 0,
acc,
epoch_avg_acc))
# Print time elapsed for an epoch
end_time = timer()
print('Time elapsed for epoch {}: {:.0f}s.'.format(epoch, end_time - start_time))
def test(model, data_loader, num_train_batches, epoch, writer):
"""
Evaluate model on validation set
Args:
model: The CapsuleNet model.
        data_loader: An iterator over the dataset. It combines a dataset and a sampler.
"""
print('===> Evaluate mode')
# Switch to evaluate mode
model.eval()
if args.cuda:
# When we wrap a Module in DataParallel for multi-GPUs
model = model.module
loss = 0
margin_loss = 0
recon_loss = 0
correct = 0
num_batches = len(data_loader)
global_step = epoch * num_train_batches + num_train_batches
step['step'] = global_step
for data, target in data_loader:
batch_size = data.size(0)
target_indices = target
target_one_hot = utils.one_hot_encode(target_indices, length=args.num_classes)
assert target_one_hot.size() == torch.Size([batch_size, 10])
data, target = Variable(data, volatile=True), Variable(target_one_hot)
if args.cuda:
data = data.cuda()
target = target.cuda()
# Output predictions
output = model(data) # output from DigitCaps (out_digit_caps)
# Sum up batch loss
t_loss, m_loss, r_loss = model.loss(data, output, target, size_average=False)
loss += t_loss.data[0]
margin_loss += m_loss.data[0]
recon_loss += r_loss.data[0]
# Count number of correct predictions
# v_magnitude shape: [128, 10, 1, 1]
v_magnitude = torch.sqrt((output ** 2).sum(dim=2, keepdim=True))
# pred shape: [128, 1, 1, 1]
pred = v_magnitude.data.max(1, keepdim=True)[1].cpu()
correct += pred.eq(target_indices.view_as(pred)).sum()
# Get the reconstructed images of the last batch
if args.use_reconstruction_loss:
reconstruction = model.decoder(output, target)
# Input image size and number of channel.
# By default, for MNIST, the image width and height is 28x28 and 1 channel for black/white.
image_width = args.input_width
image_height = args.input_height
image_channel = args.num_conv_in_channel
recon_img = reconstruction.view(-1, image_channel, image_width, image_height)
assert recon_img.size() == torch.Size([batch_size, image_channel, image_width, image_height])
# Save the image into file system
utils.save_image(recon_img, 'results/recons_image_test_{}_{}.png'.format(epoch, global_step))
utils.save_image(data, 'results/original_image_test_{}_{}.png'.format(epoch, global_step))
# Add and visualize the image in TensorBoard
recon_img = vutils.make_grid(recon_img.data, normalize=True, scale_each=True)
original_img = vutils.make_grid(data.data, normalize=True, scale_each=True)
writer.add_image('test/recons-image-{}-{}'.format(epoch, global_step), recon_img, global_step)
writer.add_image('test/original-image-{}-{}'.format(epoch, global_step), original_img, global_step)
# Log test losses
loss /= num_batches
margin_loss /= num_batches
recon_loss /= num_batches
# Log test accuracies
num_test_data = len(data_loader.dataset)
accuracy = correct / num_test_data
accuracy_percentage = 100. * accuracy
# TensorBoard logging
# 1) Log the scalar values
writer.add_scalar('test/total_loss', loss, global_step)
writer.add_scalar('test/margin_loss', margin_loss, global_step)
if args.use_reconstruction_loss:
writer.add_scalar('test/reconstruction_loss', recon_loss, global_step)
writer.add_scalar('test/accuracy', accuracy, global_step)
# Print test losses and accuracy
print('Test: [Loss: {:.6f},' \
'\tMargin loss: {:.6f},' \
'\tReconstruction loss: {:.6f}]'.format(
loss,
margin_loss,
recon_loss if args.use_reconstruction_loss else 0))
print('Test Accuracy: {}/{} ({:.0f}%)\n'.format(
correct, num_test_data, accuracy_percentage))
def main():
"""The main function
Entry point.
"""
global args
# Setting the hyper parameters
parser = argparse.ArgumentParser(description='Example of Capsule Network')
parser.add_argument('--epochs', type=int, default=10,
help='number of training epochs. default=10')
parser.add_argument('--lr', type=float, default=0.01,
help='learning rate. default=0.01')
parser.add_argument('--batch-size', type=int, default=128,
help='training batch size. default=128')
parser.add_argument('--test-batch-size', type=int,
default=128, help='testing batch size. default=128')
parser.add_argument('--log-interval', type=int, default=10,
help='how many batches to wait before logging training status. default=10')
parser.add_argument('--no-cuda', action='store_true', default=False,
help='disables CUDA training. default=false')
parser.add_argument('--threads', type=int, default=4,
help='number of threads for data loader to use. default=4')
parser.add_argument('--seed', type=int, default=42,
help='random seed for training. default=42')
parser.add_argument('--num-conv-out-channel', type=int, default=256,
help='number of channels produced by the convolution. default=256')
parser.add_argument('--num-conv-in-channel', type=int, default=1,
help='number of input channels to the convolution. default=1')
parser.add_argument('--num-primary-unit', type=int, default=8,
help='number of primary unit. default=8')
parser.add_argument('--primary-unit-size', type=int,
default=1152, help='primary unit size is 32 * 6 * 6. default=1152')
parser.add_argument('--num-classes', type=int, default=10,
help='number of digit classes. 1 unit for one MNIST digit. default=10')
parser.add_argument('--output-unit-size', type=int,
default=16, help='output unit size. default=16')
parser.add_argument('--num-routing', type=int,
default=3, help='number of routing iteration. default=3')
parser.add_argument('--use-reconstruction-loss', type=utils.str2bool, nargs='?', default=True,
help='use an additional reconstruction loss. default=True')
parser.add_argument('--regularization-scale', type=float, default=0.0005,
help='regularization coefficient for reconstruction loss. default=0.0005')
parser.add_argument('--dataset', help='the name of dataset (mnist, cifar10)', default='mnist')
parser.add_argument('--input-width', type=int,
default=28, help='input image width to the convolution. default=28 for MNIST')
parser.add_argument('--input-height', type=int,
default=28, help='input image height to the convolution. default=28 for MNIST')
args = parser.parse_args()
print(args)
# Check GPU or CUDA is available
args.cuda = not args.no_cuda and torch.cuda.is_available()
# Get reproducible results by manually seed the random number generator
torch.manual_seed(args.seed)
if args.cuda:
torch.cuda.manual_seed(args.seed)
# Load data
train_loader, test_loader = utils.load_data(args)
# Build Capsule Network
print('===> Building model')
model = Net(num_conv_in_channel=args.num_conv_in_channel,
num_conv_out_channel=args.num_conv_out_channel,
num_primary_unit=args.num_primary_unit,
primary_unit_size=args.primary_unit_size,
num_classes=args.num_classes,
output_unit_size=args.output_unit_size,
num_routing=args.num_routing,
use_reconstruction_loss=args.use_reconstruction_loss,
regularization_scale=args.regularization_scale,
input_width=args.input_width,
input_height=args.input_height,
cuda_enabled=args.cuda)
if args.cuda:
print('Utilize GPUs for computation')
print('Number of GPU available', torch.cuda.device_count())
model.cuda()
cudnn.benchmark = True
model = torch.nn.DataParallel(model)
# Print the model architecture and parameters
print('Model architectures:\n{}\n'.format(model))
print('Parameters and size:')
for name, param in model.named_parameters():
print('{}: {}'.format(name, list(param.size())))
# CapsNet has:
# - 8.2M parameters and 6.8M parameters without the reconstruction subnet on MNIST.
# - 11.8M parameters and 8.0M parameters without the reconstruction subnet on CIFAR10.
num_params = sum([param.nelement() for param in model.parameters()])
    # The coupling coefficients c_ij are not included in the parameter list,
    # so we add them manually: 1152 * 10 = 11520 (on MNIST) or 2048 * 10 = 20480 (on CIFAR10).
print('\nTotal number of parameters: {}\n'.format(num_params + (11520 if args.dataset == 'mnist' else 20480)))
# Optimizer
optimizer = optim.Adam(model.parameters(), lr=args.lr)
# Make model checkpoint directory
if not os.path.exists('results/trained_model'):
os.makedirs('results/trained_model')
# Train and test
for epoch in range(1, args.epochs + 1):
train(model, train_loader, optimizer, epoch, writer)
test(model, test_loader, len(train_loader), epoch, writer)
# Save model checkpoint
utils.checkpoint({
'epoch': epoch + 1,
'state_dict': model.state_dict(),
'optimizer': optimizer.state_dict()
}, epoch)
writer.close()
if __name__ == "__main__":
main()
"""CapsNet Architecture
PyTorch implementation of CapsNet in Sabour, Hinton et al.'s paper
Dynamic Routing Between Capsules. NIPS 2017.
https://arxiv.org/abs/1710.09829
Author: Cedric Chee
"""
import torch
import torch.nn as nn
from torch.autograd import Variable
from capsule_layer import CapsuleLayer
from conv_layer import ConvLayer
from decoder import Decoder
from dgl_capsule_batch import DGLBatchCapsuleLayer
class Net(nn.Module):
"""
A simple CapsNet with 3 layers
"""
def __init__(self, num_conv_in_channel, num_conv_out_channel, num_primary_unit,
primary_unit_size, num_classes, output_unit_size, num_routing,
use_reconstruction_loss, regularization_scale, input_width, input_height,
cuda_enabled):
"""
In the constructor we instantiate one ConvLayer module and two CapsuleLayer modules
and assign them as member variables.
"""
super(Net, self).__init__()
self.cuda_enabled = cuda_enabled
# Configurations used for image reconstruction.
self.use_reconstruction_loss = use_reconstruction_loss
# Input image size and number of channel.
# By default, for MNIST, the image width and height is 28x28
# and 1 channel for black/white.
self.image_width = input_width
self.image_height = input_height
self.image_channel = num_conv_in_channel
# Also known as lambda reconstruction. Default value is 0.0005.
# We use sum of squared errors (SSE) similar to paper.
self.regularization_scale = regularization_scale
# Layer 1: Conventional Conv2d layer.
self.conv1 = ConvLayer(in_channel=num_conv_in_channel,
out_channel=num_conv_out_channel,
kernel_size=9)
# PrimaryCaps
# Layer 2: Conv2D layer with `squash` activation.
self.primary = CapsuleLayer(in_unit=0,
in_channel=num_conv_out_channel,
num_unit=num_primary_unit,
unit_size=primary_unit_size, # capsule outputs
use_routing=False,
num_routing=num_routing,
cuda_enabled=cuda_enabled)
# DigitCaps
# Final layer: Capsule layer where the routing algorithm is.
self.digits = CapsuleLayer(in_unit=num_primary_unit,
in_channel=primary_unit_size,
num_unit=num_classes,
unit_size=output_unit_size, # 16D capsule per digit class
use_routing=True,
num_routing=num_routing,
cuda_enabled=cuda_enabled)
# Reconstruction network
if use_reconstruction_loss:
self.decoder = Decoder(num_classes, output_unit_size, input_width,
input_height, num_conv_in_channel, cuda_enabled)
def forward(self, x):
"""
Defines the computation performed at every forward pass.
"""
# x shape: [128, 1, 28, 28]. 128 is for the batch size.
# out_conv1 shape: [128, 256, 20, 20]
out_conv1 = self.conv1(x)
# out_primary_caps shape: [128, 8, 1152].
# Total PrimaryCapsules has [32 × 6 × 6 = 1152] capsule outputs.
out_primary_caps = self.primary(out_conv1)
# out_digit_caps shape: [128, 10, 16, 1]
# batch size: 128, 10 digit class, 16D capsule per digit class.
out_digit_caps = self.digits(out_primary_caps)
return out_digit_caps
def loss(self, image, out_digit_caps, target, size_average=True):
"""Custom loss function
Args:
image: [batch_size, 1, 28, 28] MNIST samples.
out_digit_caps: [batch_size, 10, 16, 1] The output from `DigitCaps` layer.
target: [batch_size, 10] One-hot MNIST dataset labels.
size_average: A boolean to enable mean loss (average loss over batch size).
Returns:
total_loss: A scalar Variable of total loss.
m_loss: A scalar of margin loss.
recon_loss: A scalar of reconstruction loss.
"""
recon_loss = 0
m_loss = self.margin_loss(out_digit_caps, target)
if size_average:
m_loss = m_loss.mean()
total_loss = m_loss
if self.use_reconstruction_loss:
# Reconstruct the image from the Decoder network
reconstruction = self.decoder(out_digit_caps, target)
recon_loss = self.reconstruction_loss(reconstruction, image)
# Mean squared error
if size_average:
recon_loss = recon_loss.mean()
# In order to keep in line with the paper,
# they scale down the reconstruction loss by 0.0005
# so that it does not dominate the margin loss.
total_loss = m_loss + recon_loss * self.regularization_scale
return total_loss, m_loss, (recon_loss * self.regularization_scale)
def margin_loss(self, input, target):
"""
Class loss
Implement equation 4 in section 3 'Margin loss for digit existence' in the paper.
Args:
input: [batch_size, 10, 16, 1] The output from `DigitCaps` layer.
            target: [batch_size, 10] One-hot MNIST labels.
        Returns:
            l_c: A scalar of class loss, also known as margin loss.
"""
batch_size = input.size(0)
# ||vc|| also known as norm.
v_c = torch.sqrt((input ** 2).sum(dim=2, keepdim=True))
# Calculate left and right max() terms.
zero = Variable(torch.zeros(1))
if self.cuda_enabled:
zero = zero.cuda()
m_plus = 0.9
m_minus = 0.1
loss_lambda = 0.5
max_left = torch.max(m_plus - v_c, zero).view(batch_size, -1) ** 2
max_right = torch.max(v_c - m_minus, zero).view(batch_size, -1) ** 2
t_c = target
# Lc is margin loss for each digit of class c
l_c = t_c * max_left + loss_lambda * (1.0 - t_c) * max_right
l_c = l_c.sum(dim=1)
return l_c
def reconstruction_loss(self, reconstruction, image):
"""
The reconstruction loss is the sum of squared differences between
the reconstructed image (outputs of the logistic units) and
the original image (input image).
Implement section 4.1 'Reconstruction as a regularization method' in the paper.
Based on naturomics's implementation.
Args:
reconstruction: [batch_size, 784] Decoder outputs of reconstructed image tensor.
image: [batch_size, 1, 28, 28] MNIST samples.
Returns:
recon_error: A scalar Variable of reconstruction loss.
"""
# Calculate reconstruction loss.
batch_size = image.size(0) # or another way recon_img.size(0)
# error = (recon_img - image).view(batch_size, -1)
image = image.view(batch_size, -1) # flatten 28x28 by reshaping to [batch_size, 784]
error = reconstruction - image
squared_error = error ** 2
# Scalar Variable
recon_error = torch.sum(squared_error, dim=1)
return recon_error
http://download.pytorch.org/whl/cu90/torch-0.3.0.post4-cp36-cp36m-linux_x86_64.whl ; sys_platform == "linux"
http://download.pytorch.org/whl/torch-0.3.0.post4-cp36-cp36m-macosx_10_7_x86_64.whl ; sys_platform == "darwin"
torchvision
tensorboardX
tensorflow
tqdm
"""Utilities
PyTorch implementation of CapsNet in Sabour, Hinton et al.'s paper
Dynamic Routing Between Capsules. NIPS 2017.
https://arxiv.org/abs/1710.09829
Author: Cedric Chee
"""
import argparse
import torch
import torch.nn.functional as F
import torchvision.utils as vutils
from tensorboardX import SummaryWriter
from torch.autograd import Variable
from torch.utils.data import DataLoader
from torchvision import transforms, datasets
# Set the logger
writer = SummaryWriter()
step = {'step': 0}
def one_hot_encode(target, length):
"""Converts batches of class indices to classes of one-hot vectors."""
batch_s = target.size(0)
one_hot_vec = torch.zeros(batch_s, length)
for i in range(batch_s):
one_hot_vec[i, target[i]] = 1.0
return one_hot_vec
def checkpoint(state, epoch):
"""Save checkpoint"""
model_out_path = 'results/trained_model/model_epoch_{}.pth'.format(epoch)
torch.save(state, model_out_path)
print('Checkpoint saved to {}'.format(model_out_path))
def load_mnist(args):
"""Load MNIST dataset.
The data is split and normalized between train and test sets.
"""
# Normalize MNIST dataset.
data_transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])
kwargs = {'num_workers': args.threads,
'pin_memory': True} if args.cuda else {}
print('===> Loading MNIST training datasets')
# MNIST dataset
training_set = datasets.MNIST(
'./data', train=True, download=True, transform=data_transform)
# Input pipeline
training_data_loader = DataLoader(
training_set, batch_size=args.batch_size, shuffle=True, **kwargs)
print('===> Loading MNIST testing datasets')
testing_set = datasets.MNIST(
'./data', train=False, download=True, transform=data_transform)
testing_data_loader = DataLoader(
testing_set, batch_size=args.test_batch_size, shuffle=True, **kwargs)
return training_data_loader, testing_data_loader
def load_cifar10(args):
"""Load CIFAR10 dataset.
The data is split and normalized between train and test sets.
"""
# Normalize CIFAR10 dataset.
data_transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
kwargs = {'num_workers': args.threads,
'pin_memory': True} if args.cuda else {}
print('===> Loading CIFAR10 training datasets')
# CIFAR10 dataset
training_set = datasets.CIFAR10(
'./data', train=True, download=True, transform=data_transform)
# Input pipeline
training_data_loader = DataLoader(
training_set, batch_size=args.batch_size, shuffle=True, **kwargs)
print('===> Loading CIFAR10 testing datasets')
testing_set = datasets.CIFAR10(
'./data', train=False, download=True, transform=data_transform)
testing_data_loader = DataLoader(
testing_set, batch_size=args.test_batch_size, shuffle=True, **kwargs)
return training_data_loader, testing_data_loader
def load_data(args):
"""
Load dataset.
"""
dst = args.dataset
if dst == 'mnist':
return load_mnist(args)
elif dst == 'cifar10':
return load_cifar10(args)
else:
raise Exception('Invalid dataset, please check the name of dataset:', dst)
def squash(sj, dim=2):
"""
    The non-linear activation used in Capsule.
    It drives the length of a large vector to near 1 and a small vector toward 0.
    This implements equation 1 from the paper.
"""
sj_mag_sq = torch.sum(sj ** 2, dim, keepdim=True)
# ||sj||
sj_mag = torch.sqrt(sj_mag_sq)
v_j = (sj_mag_sq / (1.0 + sj_mag_sq)) * (sj / sj_mag)
return v_j
def mask(out_digit_caps, cuda_enabled=True):
"""
In the paper, they mask out all but the activity vector of the correct digit capsule.
This means:
a) during training, mask all but the capsule (1x16 vector) which match the ground-truth.
b) during testing, mask all but the longest capsule (1x16 vector).
Args:
out_digit_caps: [batch_size, 10, 16] Tensor output of `DigitCaps` layer.
Returns:
masked: [batch_size, 10, 16, 1] The masked capsules tensors.
"""
# a) Get capsule outputs lengths, ||v_c||
v_length = torch.sqrt((out_digit_caps ** 2).sum(dim=2))
# b) Pick out the index of longest capsule output, v_length by
# masking the tensor by the max value in dim=1.
_, max_index = v_length.max(dim=1)
max_index = max_index.data
# Method 1: masking with y.
# c) In all batches, get the most active capsule
    # It's not easy to understand the indexing process with max_index
    # as we are 3D animals.
batch_size = out_digit_caps.size(0)
masked_v = [None] * batch_size # Python list
for batch_ix in range(batch_size):
# Batch sample
sample = out_digit_caps[batch_ix]
# Masks out the other capsules in this sample.
v = Variable(torch.zeros(sample.size()))
if cuda_enabled:
v = v.cuda()
# Get the maximum capsule index from this batch sample.
max_caps_index = max_index[batch_ix]
v[max_caps_index] = sample[max_caps_index]
masked_v[batch_ix] = v # append v to masked_v
# Concatenates sequence of masked capsules tensors along the batch dimension.
masked = torch.stack(masked_v, dim=0)
return masked
def save_image(image, file_name):
"""
Save a given image into an image file
"""
# Check number of channels in an image.
if image.size(1) == 2:
# 2-channel image
zeros = torch.zeros(image.size(0), 1, image.size(2), image.size(3))
image_tensor = torch.cat([zeros, image.data.cpu()], dim=1)
else:
# Grayscale or RGB image
image_tensor = image.data.cpu() # get Tensor from Variable
vutils.save_image(image_tensor, file_name)
def accuracy(output, target, cuda_enabled=True):
"""
Compute accuracy.
Args:
output: [batch_size, 10, 16, 1] The output from DigitCaps layer.
target: [batch_size] Labels for dataset.
Returns:
accuracy (float): The accuracy for a batch.
"""
batch_size = target.size(0)
v_length = torch.sqrt((output ** 2).sum(dim=2, keepdim=True))
softmax_v = F.softmax(v_length, dim=1)
assert softmax_v.size() == torch.Size([batch_size, 10, 1, 1])
_, max_index = softmax_v.max(dim=1)
assert max_index.size() == torch.Size([batch_size, 1, 1])
pred = max_index.squeeze() # max_index.view(batch_size)
assert pred.size() == torch.Size([batch_size])
if cuda_enabled:
target = target.cuda()
pred = pred.cuda()
correct_pred = torch.eq(target, pred.data) # tensor
# correct_pred_sum = correct_pred.sum() # scalar. e.g: 6 correct out of 128 images.
acc = correct_pred.float().mean() # e.g: 6 / 128 = 0.046875
return acc
def to_np(param):
"""
Convert values of the model parameters to numpy.array.
"""
return param.clone().cpu().data.numpy()
def str2bool(v):
"""
Parsing boolean values with argparse.
"""
if v.lower() in ('yes', 'true', 't', 'y', '1'):
return True
elif v.lower() in ('no', 'false', 'f', 'n', '0'):
return False
else:
raise argparse.ArgumentTypeError('Boolean value expected.')