Merge pull request #4529 from MarkDaoust/basic-regression

Add basic-regression

ec4e0271 · Mark Daoust · GitHub · ea4d9bc8 · 52308ca9 · ec4e0271
Unverified Commit ec4e0271 authored Jun 18, 2018 by Mark Daoust Committed by GitHub Jun 18, 2018

Showing with 635 additions and 0 deletions

samples/core/get_started/basic_regression.ipynb +635 -0

--- a/samples/core/get_started/basic_regression.ipynb
+++ b/samples/core/get_started/basic_regression.ipynb
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "name": "basic-regression.ipynb",
+      "version": "0.3.2",
+      "views": {},
+      "default_view": {},
+      "provenance": [],
+      "private_outputs": true,
+      "collapsed_sections": [
+        "FhGuhbZ6M5tl"
+      ],
+      "toc_visible": true
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    }
+  },
+  "cells": [
+    {
+      "metadata": {
+        "id": "FhGuhbZ6M5tl",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "##### Copyright 2018 The TensorFlow Authors."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "AwOEIRJC6Une",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "cellView": "form"
+      },
+      "cell_type": "code",
+      "source": [
+        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
+        "# you may not use this file except in compliance with the License.\n",
+        "# You may obtain a copy of the License at\n",
+        "#\n",
+        "# https://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing, software\n",
+        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
+        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
+        "# See the License for the specific language governing permissions and\n",
+        "# limitations under the License."
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "KyPEtTqk6VdG",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "cellView": "form"
+      },
+      "cell_type": "code",
+      "source": [
+        "#@title MIT License\n",
+        "#\n",
+        "# Copyright (c) 2017 François Chollet\n",
+        "#\n",
+        "# Permission is hereby granted, free of charge, to any person obtaining a\n",
+        "# copy of this software and associated documentation files (the \"Software\"),\n",
+        "# to deal in the Software without restriction, including without limitation\n",
+        "# the rights to use, copy, modify, merge, publish, distribute, sublicense,\n",
+        "# and/or sell copies of the Software, and to permit persons to whom the\n",
+        "# Software is furnished to do so, subject to the following conditions:\n",
+        "#\n",
+        "# The above copyright notice and this permission notice shall be included in\n",
+        "# all copies or substantial portions of the Software.\n",
+        "#\n",
+        "# THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n",
+        "# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n",
+        "# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL\n",
+        "# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n",
+        "# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING\n",
+        "# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER\n",
+        "# DEALINGS IN THE SOFTWARE."
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "EIdT9iu_Z4Rb",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "# Predicting house prices: a regression example\n",
+        "\n"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "bBIlTPscrIT9",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "<table align=\"left\"><td>\n",
+        "<a target=\"_blank\"  href=\"https://colab.sandbox.google.com/github/tensorflow/models/blob/master/samples/core/get_started/basic_regression.ipynb\">\n",
+        "    <img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>  \n",
+        "</td><td>\n",
+        "<a target=\"_blank\"  href=\"https://github.com/tensorflow/models/blob/master/samples/core/get_started/basic_regression.ipynb\"><img width=32px src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View source on GitHub</a></td></table>"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "AHp3M9ZmrIxj",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "In a *regression* problem, we aim to predict a continuous value, like a price or a probability. Contrast this with a *classification* problem, where we aim to predict a discrete label (for example, whether a picture contains an apple or an orange).\n",
+        "\n",
+        "This notebook builds a model to predict the median price of homes in a Boston suburb during the mid-1970s. To do this, we'll provide the model with some data points about the suburb, such as the crime rate and the local property tax rate.\n",
+        "\n",
+        "This example uses the `tf.keras` API; see [this guide](https://www.tensorflow.org/programmers_guide/keras) for details."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "1rRo8oNqZ-Rj",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "import tensorflow as tf\n",
+        "from tensorflow import keras\n",
+        "\n",
+        "import numpy as np\n",
+        "\n",
+        "print(tf.__version__)"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "F_72b0LCNbjx",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "## The Boston Housing Prices dataset\n",
+        "\n",
+        "This [dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) is accessible directly in TensorFlow. Download and shuffle the training set:"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "p9kxxgzvzlyz",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "boston_housing = keras.datasets.boston_housing\n",
+        "\n",
+        "(train_data, train_labels), (test_data, test_labels) = boston_housing.load_data()\n",
+        "\n",
+        "# Shuffle the training set\n",
+        "order = np.argsort(np.random.random(train_labels.shape))\n",
+        "train_data = train_data[order]\n",
+        "train_labels = train_labels[order]"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "PwEKwRJgsgJ6",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "### Examples and features \n",
+        "\n",
+        "This dataset is much smaller than the others we've worked with so far: it has 506 total examples, split between 404 training examples and 102 test examples:"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "Ujqcgkipr65P",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "print(\"Training set: {}\".format(train_data.shape))  # 404 examples, 13 features\n",
+        "print(\"Testing set:  {}\".format(test_data.shape))   # 102 examples, 13 features"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "0LRPXE3Oz3Nq",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "The dataset contains 13 different features:\n",
+        "\n",
+        "1.   Per capita crime rate.\n",
+        "2.   Proportion of residential land zoned for lots over 25,000 square feet.\n",
+        "3.   Proportion of non-retail business acres per town.\n",
+        "4.   Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).\n",
+        "5.   Nitric oxides concentration (parts per 10 million).\n",
+        "6.   Average number of rooms per dwelling.\n",
+        "7.   Proportion of owner-occupied units built prior to 1940.\n",
+        "8.   Weighted distances to five Boston employment centres.\n",
+        "9.   Index of accessibility to radial highways.\n",
+        "10.  Full-value property-tax rate per $10,000.\n",
+        "11.  Pupil-teacher ratio by town.\n",
+        "12.  1000 * (Bk - 0.63) ** 2 where Bk is the proportion of Black people by town.\n",
+        "13.  Percentage lower status of the population.\n",
+        "\n",
+        "Each of these input features is stored using a different scale. Some features are represented by a proportion between 0 and 1, others range between 1 and 12, some range between 0 and 100, and so on. This is often the case with real-world data, and understanding how to explore and clean such data is an important skill to develop."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "8tYsm8Gs03J4",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "print(train_data[0])  # Display sample features; notice the different scales"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "Q7muNf-d1-ne",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "Use the [pandas](https://pandas.pydata.org) library to display the first few rows of the dataset in a nicely formatted table:"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "pYVyGhdyCpIM",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "import pandas as pd\n",
+        "\n",
+        "column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',\n",
+        "                'TAX', 'PTRATIO', 'B', 'LSTAT']\n",
+        "\n",
+        "df = pd.DataFrame(train_data, columns=column_names)\n",
+        "df.head()"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "wb9S7Mia2lpf",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "### Labels\n",
+        "\n",
+        "The labels are the house prices in thousands of dollars. (You may notice the mid-1970s prices.)"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "I8NwI2ND2t4Y",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "print(train_labels[0:10])  # Display first 10 entries"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "mRklxK5s388r",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "## Normalize features\n",
+        "\n",
+        "It's recommended to normalize features that use different scales and ranges. For each feature, subtract the mean of the feature and divide by the standard deviation:"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "ze5WQP8R1TYg",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "# Test data is *not* used when calculating the mean and std.\n",
+        "\n",
+        "mean = train_data.mean(axis=0)\n",
+        "std = train_data.std(axis=0)\n",
+        "train_data = (train_data - mean) / std\n",
+        "test_data = (test_data - mean) / std\n",
+        "\n",
+        "print(train_data[0])  # First training sample, normalized"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "BuiClDk45eS4",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "Although the model *might* converge without feature normalization, skipping it makes training more difficult, and it makes the resulting model more dependent on the choice of units used in the input."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "SmjdzxKzEu1-",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "## Create the model\n",
+        "\n",
+        "Let's build our model. Here, we'll use a `Sequential` model with two densely connected hidden layers, and an output layer that returns a single, continuous value. The model building steps are wrapped in a function, `build_model`, since we'll create a second model later on."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "c26juK7ZG8j-",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "def build_model():\n",
+        "  model = keras.Sequential()\n",
+        "  \n",
+        "  model.add(keras.layers.Dense(64, activation=tf.nn.relu,\n",
+        "                               input_shape=(train_data.shape[1],)))\n",
+        "  model.add(keras.layers.Dense(64, activation=tf.nn.relu))\n",
+        "  model.add(keras.layers.Dense(1))\n",
+        "\n",
+        "  optimizer = tf.train.RMSPropOptimizer(0.001)\n",
+        "\n",
+        "  model.compile(loss='mse',\n",
+        "                optimizer=optimizer,\n",
+        "                metrics=['mae'])\n",
+        "  return model\n",
+        "\n",
+        "model = build_model()\n",
+        "model.summary()"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "0-qWCsh6DlyH",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "## Train the model\n",
+        "\n",
+        "The model is trained for 500 epochs, and the training and validation loss and mean absolute error are recorded in the `history` object."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "sD7qHCmNIOY0",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "# Display training progress by printing a single dot for each completed epoch.\n",
+        "class PrintDot(keras.callbacks.Callback):\n",
+        "  def on_epoch_end(self, epoch, logs):\n",
+        "    if epoch % 100 == 0: print('')\n",
+        "    print('.', end='')\n",
+        "\n",
+        "EPOCHS = 500\n",
+        "\n",
+        "# Store training stats\n",
+        "history = model.fit(train_data, train_labels, epochs=EPOCHS,\n",
+        "                    validation_split=0.2, verbose=0,\n",
+        "                    callbacks=[PrintDot()])"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "tQm3pc0FYPQB",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "Visualize the model's training progress using the stats stored in the `history` object. We want to use this data to determine how long to train *before* the model stops making progress."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "B6XriGbVPh2t",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "import matplotlib.pyplot as plt\n",
+        "\n",
+        "\n",
+        "def plot_history(history):\n",
+        "  plt.figure()\n",
+        "  plt.xlabel('Epoch')\n",
+        "  plt.ylabel('Mean Abs Error [1000$]')\n",
+        "  plt.plot(history.epoch, np.array(history.history['mean_absolute_error']), label='Train MAE')\n",
+        "  plt.plot(history.epoch, np.array(history.history['val_mean_absolute_error']), label='Val MAE')\n",
+        "  plt.legend()\n",
+        "  plt.ylim([0,5])\n",
+        "\n",
+        "plot_history(history)"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "AqsuANc11FYv",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "This graph shows little improvement in the model after about 200 epochs. Let's update the `model.fit` call to automatically stop training when the validation score doesn't improve. We'll use a *callback* that tests a training condition after every epoch. If a set number of epochs elapses without showing improvement, the callback automatically stops the training.\n",
+        "\n",
+        "You can learn more about this callback [here](https://www.tensorflow.org/versions/master/api_docs/python/tf/keras/callbacks/EarlyStopping)."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "fdMZuhUgzMZ4",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "model = build_model()\n",
+        "\n",
+        "# The patience parameter is the number of epochs to wait for an improvement before stopping.\n",
+        "early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)\n",
+        "\n",
+        "history = model.fit(train_data, train_labels, epochs=EPOCHS,\n",
+        "                    validation_split=0.2, verbose=0,\n",
+        "                    callbacks=[early_stop, PrintDot()])\n",
+        "\n",
+        "plot_history(history)"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "3St8-DmrX8P4",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "The graph shows the average error is about \\$2,500. Is this good? Well, \\$2,500 is not an insignificant amount when some of the labels are only \\$15,000.\n",
+        "\n",
+        "Let's see how the model performs on the test set:"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "jl_yNr5n1kms",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "[loss, mae] = model.evaluate(test_data, test_labels, verbose=0)\n",
+        "\n",
+        "print(\"Testing set Mean Abs Error: ${:7.2f}\".format(mae * 1000))"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "ft603OzXuEZC",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "## Predict\n",
+        "\n",
+        "Finally, predict some housing prices using data in the testing set:"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "Xe7RXH3N3CWU",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "test_predictions = model.predict(test_data).flatten()\n",
+        "\n",
+        "print(test_predictions)"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "vgGQuV-yqYZH",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "## Conclusion\n",
+        "\n",
+        "This notebook introduced a few techniques to handle a regression problem.\n",
+        "\n",
+        "* Mean Squared Error (MSE) is a common loss function used for regression problems (classification problems use different loss functions).\n",
+        "* Similarly, evaluation metrics used for regression differ from those used for classification. A common regression metric is Mean Absolute Error (MAE).\n",
+        "* When input data features have values with different ranges, each feature should be scaled independently.\n",
+        "* If there is not much training data, prefer a small network with few hidden layers to avoid overfitting.\n",
+        "* Early stopping is a useful technique to prevent overfitting."
+      ]
+    }
+  ]
+}
\ No newline at end of file
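
A note on the early-stopping cell in the notebook: the effect of `patience` can be illustrated without TensorFlow. The following is a minimal sketch of the idea, assuming the monitored quantity is a validation loss that should decrease. `epochs_until_stop` is a hypothetical helper written for this note, not part of Keras, and the loss values are made up; the real `EarlyStopping` callback has further options (such as `min_delta`) that are left out here.

```python
def epochs_until_stop(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None.

    Simplified illustration of patience-based early stopping: training
    stops once `patience` consecutive epochs pass without a new best
    (lowest) validation loss.
    """
    best = float("inf")  # lowest validation loss seen so far
    wait = 0             # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None  # the patience budget was never exhausted


# Made-up validation losses, one per epoch (illustrative only).
losses = [5.0, 4.0, 3.5, 3.6, 3.7, 3.4, 3.5, 3.6, 3.7]
print(epochs_until_stop(losses, patience=3))
print(epochs_until_stop(losses, patience=2))
```

With a larger `patience`, the short plateau after the third epoch is tolerated and training continues until a longer run without improvement occurs; with a smaller value, training stops at the first plateau.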