" <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/models/blob/master/research/nst_blogpost/4_Neural_Style_Transfer_with_Eager_Execution.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
" </td>\n",
" <td>\n",
" <a target=\"_blank\" href=\"https://github.com/tensorflow/models/blob/master/research/nst_blogpost/4_Neural_Style_Transfer_with_Eager_Execution.ipynb\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View source on GitHub</a>\n",
" </td>\n",
"</table>"
]
},
{
"metadata": {
"id": "aDyGj8DmXCJI",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## Overview\n",
"\n",
"In this tutorial, we will learn how to use deep learning to compose images in the style of another image (ever wish you could paint like Picasso or Van Gogh?). This is known as **neural style transfer**! This is a technique outlined in [Leon A. Gatys' paper, A Neural Algorithm of Artistic Style](https://arxiv.org/abs/1508.06576), which is a great read, and you should definitely check it out. \n",
"Is this magic or just deep learning? Fortunately, this doesn’t involve any witchcraft: style transfer is a fun and interesting technique that showcases the capabilities and internal representations of neural networks. \n",
"\n",
"The principle of neural style transfer is to define two distance functions, one that describes how different the content of two images are , $L_{content}$, and one that describes the difference between two images in terms of their style, $L_{style}$. Then, given three images, a desired style image, a desired content image, and the input image (initialized with the content image), we try to transform the input image to minimize the content distance with the content image and its style distance with the style image. \n",
"In summary, we’ll take the base input image, a content image that we want to match, and the style image that we want to match. We’ll transform the base input image by minimizing the content and style distances (losses) with backpropagation, creating an image that matches the content of the content image and the style of the style image. \n",
"\n",
"## Specific concepts that will be covered:\n",
"### Specific concepts that will be covered:\n",
"In the process, we will build practical experience and develop intuition around the following concepts\n",
"\n",
"* **Eager Execution** - use TensorFlow's imperative programming environment that evaluates operations immediately \n",
" * [Learn more about eager execution](https://www.tensorflow.org/programmers_guide/eager)\n",
" * [See it in action](https://www.tensorflow.org/get_started/eager)\n",
...
...
@@ -59,6 +81,7 @@
"* **Create custom training loops** - we'll examine how to set up an optimizer to minimize a given loss with respect to input parameters\n",
"\n",
"### We will follow the general steps to perform style transfer:\n",
" <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/models/blob/master/research/nst_blogpost/4_Neural_Style_Transfer_with_Eager_Execution.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
" </td>\n",
" <td>\n",
" <a target=\"_blank\" href=\"https://github.com/tensorflow/models/blob/master/research/nst_blogpost/4_Neural_Style_Transfer_with_Eager_Execution.ipynb\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View source on GitHub</a>\n",
"Let's create methods that will allow us to load and preprocess our images easily. We perform the same preprocessing process as are expected according to the VGG training process. VGG networks are trained on image with each channel normalized by `mean = [103.939, 116.779, 123.68]`and with channels BGR."
]
},
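The preprocessing described above, and its inverse, can be sketched in NumPy. This is a minimal sketch that assumes RGB float images of shape [height, width, 3] with values from 0 to 255. It mirrors the "caffe" convention used by `tf.keras.applications.vgg19.preprocess_input`: flip RGB to BGR, subtract the per-channel means, and apply no rescaling. The helper names `vgg_preprocess` and `vgg_deprocess` are hypothetical and are not the notebook's own functions.

```python
import numpy as np

# Per-channel means in BGR order, as given in the text above.
VGG_MEANS_BGR = np.array([103.939, 116.779, 123.68])

def vgg_preprocess(img_rgb):
    """Hypothetical helper: RGB [H, W, 3] in [0, 255] -> mean-centered BGR."""
    x = np.asarray(img_rgb, dtype=np.float64)
    x = x[..., ::-1]           # reverse the channel axis: RGB -> BGR
    return x - VGG_MEANS_BGR   # broadcast subtraction over the last axis

def vgg_deprocess(processed):
    """Hypothetical helper: undo vgg_preprocess, returning a displayable uint8 RGB image."""
    x = np.array(processed, dtype=np.float64)  # copy so the caller's array is untouched
    if x.ndim == 4:                            # drop a leading batch dimension of size 1
        x = np.squeeze(x, 0)
    if x.ndim != 3:
        raise ValueError("expected [1, height, width, channel] or [height, width, channel]")
    x = x + VGG_MEANS_BGR                      # add the means back
    x = x[..., ::-1]                           # BGR -> RGB
    # Round before casting so float error cannot truncate e.g. 199.99999 down to 199.
    return np.clip(np.rint(x), 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(4, 5, 3)).astype(np.float64)
restored = vgg_deprocess(vgg_preprocess(img)[np.newaxis])  # round trip with a batch axis
```

In this sketch, the clip to [0, 255] matters once the image has been optimized. At that point its values no longer come from a real photo and can drift outside the displayable range.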
...
...
@@ -328,9 +355,9 @@
"def deprocess_img(processed_img):\n",
" x = processed_img.copy()\n",
" if len(x.shape) == 4:\n",
" x = x.reshape(img_shape)\n",
" x = np.squeeze(x, 0)\n",
" assert len(x.shape) == 3, (\"Input to deprocess image must be an image of \"\n",
" \"dimension [batch, height, width, channel] or [height_width_channel]\")\n",
" \"dimension [1, height, width, channel] or [height, width, channel]\")\n",
" if len(x.shape) != 3:\n",
" raise ValueError(\"Invalid input to deprocessing image\")\n",
" \n",
...
...
@@ -353,10 +380,10 @@
},
"cell_type": "markdown",
"source": [
"## Define content and style representationst\n",
"### Define content and style representationst\n",
"In order to get both the content and style representations of our image, we will look at some intermediate layers within our model. As we go deeper into the model, these intermediate layers represent higher and higher order features. In this case, we are using the network architecture VGG19, a pretrained image classification network. These intermediate layers are necessary to define the representation of content and style from our images. For an input image, we will try to match the corresponding style and content target representations at these intermediate layers. \n",
"\n",
"### Why intermediate layers?\n",
"#### Why intermediate layers?\n",
"\n",
"You may be wondering why these intermediate outputs within our pretrained image classification network allow us to define style and content representations. At a high level, this phenomenon can be explained by the fact that in order for a network to perform image classification (which our network has been trained to do), it must understand the image. This involves taking the raw image as input pixels and building an internal representation through transformations that turn the raw image pixels into a complex understanding of the features present within the image. This is also partly why convolutional neural networks are able to generalize well: they’re able to capture the invariances and defining features within classes (e.g., cats vs. dogs) that are agnostic to background noise and other nuisances. Thus, somewhere between where the raw image is fed in and the classification label is output, the model serves as a complex feature extractor; hence by accessing intermediate layers, we’re able to describe the content and style of input images. \n",
"\n",
...
...
@@ -396,7 +423,7 @@
},
"cell_type": "markdown",
"source": [
"# Model \n",
"## Build the Model \n",
"In this case, we load [VGG19](https://keras.io/applications/#vgg19), and feed in our input tensor to the model. This will allow us to extract the feature maps (and subsequently the content and style representations) of the content, style, and generated images.\n",
"\n",
"We use VGG19, as suggested in the paper. In addition, since VGG19 is a relatively simple model (compared with ResNet, Inception, etc) the feature maps actually work better for style transfer. "
...
...
@@ -465,7 +492,7 @@
},
"cell_type": "markdown",
"source": [
"# Define and create our loss functions (content and style distances)"
"## Define and create our loss functions (content and style distances)"
]
},
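Following Gatys et al., the two distances are combined into one objective, $L_{total} = \alpha L_{content} + \beta L_{style}$. The `content_weight` and `style_weight` arguments of `run_style_transfer` play the roles of $\alpha$ and $\beta$. Below is a minimal sketch that assumes the per-layer losses are already computed. It also assumes that every layer contributes with equal weight, which is an assumption about how layers are weighted. `total_loss` is a hypothetical helper.

```python
def total_loss(content_layer_losses, style_layer_losses,
               content_weight=1e3, style_weight=1e-2):
    """Hypothetical helper: weighted sum of the averaged per-layer losses."""
    # Equal weight per layer: average the losses within each group.
    content_score = sum(content_layer_losses) / len(content_layer_losses)
    style_score = sum(style_layer_losses) / len(style_layer_losses)
    # Scale each term by its weight (alpha and beta in the paper's notation).
    return content_weight * content_score + style_weight * style_score

loss = total_loss([2.0], [100.0, 300.0])  # uses the default weights from run_style_transfer
```

The ratio of the two weights is what matters in practice. Raising `style_weight` relative to `content_weight` trades fidelity to the content image for a stronger stylization.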
{
...
...
@@ -475,7 +502,7 @@
},
"cell_type": "markdown",
"source": [
"# Content Loss"
"### Content Loss"
]
},
{
...
...
@@ -502,7 +529,7 @@
},
"cell_type": "markdown",
"source": [
"## Computing content loss\n",
"### Computing content loss\n",
"We will actually add our content losses at each desired layer. This way, each iteration when we feed our input image through the model (which in eager is simply `model(input_image)`!) all the content losses through the model will be properly compute and because we are executing eagerly, all the gradients will be computed. "
]
},
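Here is a NumPy sketch of the per-layer content distance and of adding it up over the chosen layers. It assumes the distance is the mean squared difference between the generated image's feature map and the content image's feature map at the same layer. The paper uses a ½-scaled sum of squares instead; a constant factor like that only changes the effective content weight. Feature maps are assumed to be [height, width, channels] arrays, passed in dicts keyed by layer name. `content_loss` and `summed_content_loss` are hypothetical names.

```python
import numpy as np

def content_loss(generated_features, content_features):
    """Mean squared distance between two feature maps from the same layer."""
    generated_features = np.asarray(generated_features, dtype=np.float64)
    content_features = np.asarray(content_features, dtype=np.float64)
    if generated_features.shape != content_features.shape:
        raise ValueError("feature maps must come from the same layer and image size")
    return float(np.mean(np.square(generated_features - content_features)))

def summed_content_loss(generated_by_layer, content_by_layer):
    """Add the content distance from every chosen layer (dicts keyed by layer name)."""
    return sum(content_loss(generated_by_layer[name], content_by_layer[name])
               for name in content_by_layer)

zeros = np.zeros((2, 2, 1))
per_layer = summed_content_loss({"a": zeros, "b": zeros},
                                {"a": np.ones((2, 2, 1)), "b": np.full((2, 2, 1), 2.0)})
```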
...
...
@@ -527,7 +554,7 @@
},
"cell_type": "markdown",
"source": [
"# Style Loss"
"## Style Loss"
]
},
{
...
...
@@ -556,7 +583,7 @@
},
"cell_type": "markdown",
"source": [
"## Computing style loss\n",
"### Computing style loss\n",
"Again, we implement our loss as a distance metric . "
]
},
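Following Gatys et al., the style distance compares Gram matrices rather than raw feature maps. The Gram matrix records which channels fire together while discarding where in the image they fire. Below is a NumPy sketch under three assumptions: [height, width, channels] feature maps, a Gram matrix normalized by the number of spatial positions, and a mean-squared distance between Gram matrices. The paper uses a different constant factor, which only rescales the style weight. `gram_matrix` and `style_loss` are hypothetical names.

```python
import numpy as np

def gram_matrix(features):
    """Channel correlations of an [H, W, C] feature map, averaged over positions."""
    features = np.asarray(features, dtype=np.float64)
    channels = features.shape[-1]
    a = features.reshape(-1, channels)  # one row per spatial position: [H*W, C]
    return a.T @ a / a.shape[0]         # [C, C], divided by H*W

def style_loss(generated_features, style_features):
    """Mean squared distance between the Gram matrices of two feature maps."""
    diff = gram_matrix(generated_features) - gram_matrix(style_features)
    return float(np.mean(np.square(diff)))

# Shuffling spatial positions leaves the Gram matrix, and hence the style loss, unchanged.
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4, 5))
shuffled = feats.reshape(-1, 5)[rng.permutation(12)].reshape(3, 4, 5)
loss_after_shuffle = style_loss(feats, shuffled)  # zero up to floating-point error
```

Because each Gram matrix is [C, C], the two feature maps may differ in height and width, as long as they have the same number of channels.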
...
...
@@ -595,7 +622,7 @@
},
"cell_type": "markdown",
"source": [
"# Apply style transfer to our images\n"
"## Apply style transfer to our images\n"
]
},
{
...
...
@@ -605,7 +632,7 @@
},
"cell_type": "markdown",
"source": [
"## Run Gradient Descent \n",
"### Run Gradient Descent \n",
"If you aren't familiar with gradient descent/backpropagation or need a refresher, you should definitely check out this [awesome resource](https://developers.google.com/machine-learning/crash-course/reducing-loss/gradient-descent).\n",
"\n",
"In this case, we use the [Adam](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam)* optimizer in order to minimize our loss. We iteratively update our output image such that it minimizes our loss: we don't update the weights associated with our network, but instead we train our input image to minimize loss. In order to do this, we must know how we calculate our loss and gradients. \n",
" # Get the style and content feature representations from our model \n",
" style_features = [style_layer[0] for style_layer in model_outputs[:num_style_layers]]\n",
" content_features = [content_layer[1] for content_layer in model_outputs[num_style_layers:]]\n",
" style_features = [style_layer[0] for style_layer in style_outputs[:num_style_layers]]\n",
" content_features = [content_layer[0] for content_layer in content_outputs[num_style_layers:]]\n",
" return style_features, content_features"
],
"execution_count": 0,
...
...
@@ -669,7 +697,7 @@
},
"cell_type": "markdown",
"source": [
"## Computing the loss and gradients\n",
"### Computing the loss and gradients\n",
"Here we use [**tf.GradientTape**](https://www.tensorflow.org/programmers_guide/eager#computing_gradients) to compute the gradient. It allows us to take advantage of the automatic differentiation available by tracing operations for computing the gradient later. It records the operations during the forward pass and then is able to compute the gradient of our loss function with respect to our input image for the backwards pass."
]
},
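`tf.GradientTape` derives these gradients automatically. To make concrete what it returns, here is a NumPy stand-in (not the tape itself) for the content term alone. For the mean-squared content loss $L = \frac{1}{n}\sum_i (x_i - p_i)^2$, the gradient with respect to the image is $\frac{2}{n}(x - p)$, and a central finite-difference check confirms it. As a simplifying assumption, the "features" here are the raw pixels, so no network is involved. All function names are hypothetical.

```python
import numpy as np

def content_loss(image, target):
    """Mean squared distance; here the 'features' are simply the raw pixels."""
    return float(np.mean(np.square(image - target)))

def content_loss_grad(image, target):
    # Closed form: d/dx mean((x - p)^2) = 2 * (x - p) / n, with n = number of elements.
    return 2.0 * (image - target) / image.size

def finite_difference_grad(loss_fn, x, eps=1e-6):
    """Central differences, one element at a time (slow; only for checking)."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        bumped = x.copy()
        bumped[idx] += eps
        loss_up = loss_fn(bumped)
        bumped[idx] -= 2 * eps
        loss_down = loss_fn(bumped)
        grad[idx] = (loss_up - loss_down) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
image = rng.normal(size=(3, 3, 3))   # stand-in for the image being optimized
target = rng.normal(size=(3, 3, 3))  # stand-in for the content target
analytic = content_loss_grad(image, target)
numeric = finite_difference_grad(lambda img: content_loss(img, target), image)
```

In the notebook, the tape does the equivalent through VGG19 and both loss terms, which would be impractical to write out by hand.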
...
...
@@ -768,7 +796,7 @@
},
"cell_type": "markdown",
"source": [
"## Apply and run the style transfer process"
"### Optimization loop"
]
},
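The key idea of the loop is that the image is updated, while the network weights are not. Below is a NumPy sketch of that idea under simplifying assumptions. First, the loss is only a pixel-space content distance to a target: there are no VGG features and no style term. Second, the update is the textbook Adam rule written out by hand, which may differ from `tf.keras` Adam in how epsilon is applied. Third, after each step the image is clipped to the range a VGG-preprocessed image can take, `[-mean, 255 - mean]` per channel; this clipping is one reasonable choice rather than something the text above prescribes. The hyperparameters and all names are hypothetical.

```python
import numpy as np

# Valid range of a VGG-preprocessed image: a 0 pixel maps to -mean, a 255 pixel to 255 - mean.
VGG_MEANS_BGR = np.array([103.939, 116.779, 123.68])
MIN_VALS = -VGG_MEANS_BGR
MAX_VALS = 255.0 - VGG_MEANS_BGR

def toy_loss_and_grad(image, target):
    """Hypothetical stand-in for the real loss: pixel-space mean squared distance."""
    diff = image - target
    return float(np.mean(diff ** 2)), 2.0 * diff / diff.size

def optimize_image(init_image, target, num_iterations=500,
                   lr=1.0, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam applied to the image itself; there are no network weights to update."""
    image = np.array(init_image, dtype=np.float64)  # the only "trainable" variable
    m = np.zeros_like(image)  # running mean of the gradient (first moment)
    v = np.zeros_like(image)  # running mean of the squared gradient (second moment)
    losses = []
    for t in range(1, num_iterations + 1):
        loss, grad = toy_loss_and_grad(image, target)
        losses.append(loss)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        image -= lr * m_hat / (np.sqrt(v_hat) + eps)
        # Keep every pixel inside the range a real preprocessed image can take.
        image = np.clip(image, MIN_VALS, MAX_VALS)
    return image, losses

rng = np.random.default_rng(0)
target = rng.uniform(-50.0, 50.0, size=(4, 4, 3))
result, losses = optimize_image(np.zeros_like(target), target)
```

Because of the clip, a target outside the valid range is approached only as far as the boundary. The image stays a decodable picture, even when the loss would prefer otherwise.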
{
...
...
@@ -779,12 +807,13 @@
},
"cell_type": "code",
"source": [
"import IPython.display\n",
"\n",
"def run_style_transfer(content_path, \n",
" style_path,\n",
" num_iterations=1000,\n",
" content_weight=1e3, \n",
" style_weight=1e-2): \n",
" display_num = 100\n",
" # We don't need to (or want to) train any layers of our model, so we set their\n",
"Photo By: Andreas Praefcke [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC BY 3.0 (https://creativecommons.org/licenses/by/3.0)], from Wikimedia Commons"
...
...
@@ -957,7 +1023,7 @@
},
"cell_type": "markdown",
"source": [
"## Starry night + Tuebingen"
"### Starry night + Tuebingen"
]
},
{
...
...
@@ -995,7 +1061,7 @@
},
"cell_type": "markdown",
"source": [
"## Pillars of Creation + Tuebingen"
"### Pillars of Creation + Tuebingen"
]
},
{
...
...
@@ -1029,12 +1095,12 @@
},
{
"metadata": {
"id": "Eteg3glPNF3O",
"id": "bTZdTOdW3s8H",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## Kandinsky Composition 7 + Tuebingen"
"### Kandinsky Composition 7 + Tuebingen"
]
},
{
...
...
@@ -1068,12 +1134,12 @@
},
{
"metadata": {
"id": "ACA8-74rNF3U",
"id": "cg68lW2A3s8N",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## Pillars of Creation + Sea Turtle"
"### Pillars of Creation + Sea Turtle"
]
},
{
...
...
@@ -1112,8 +1178,10 @@
},
"cell_type": "markdown",
"source": [
"# Key Takeaways\n",
"## What we covered:\n",
"## Key Takeaways\n",
"\n",
"### What we covered:\n",
"\n",
"* We built several different loss functions and used backpropagation to transform our input image in order to minimize these losses\n",
" * In order to do this we had to load in an a **pretrained model** and used its learned feature maps to describe the content and style representation of our images.\n",
" * Our main loss functions were primarily computing the distance in terms of these different representations\n",