Commit ec4e0271 authored by Mark Daoust, committed by GitHub

Merge pull request #4529 from MarkDaoust/basic-regression

Add basic-regression
parents ea4d9bc8 52308ca9
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "basic-regression.ipynb",
"version": "0.3.2",
"views": {},
"default_view": {},
"provenance": [],
"private_outputs": true,
"collapsed_sections": [
"FhGuhbZ6M5tl"
],
"toc_visible": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"metadata": {
"id": "FhGuhbZ6M5tl",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"##### Copyright 2018 The TensorFlow Authors."
]
},
{
"metadata": {
"id": "AwOEIRJC6Une",
"colab_type": "code",
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"cellView": "form"
},
"cell_type": "code",
"source": [
"#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "KyPEtTqk6VdG",
"colab_type": "code",
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"cellView": "form"
},
"cell_type": "code",
"source": [
"#@title MIT License\n",
"#\n",
"# Copyright (c) 2017 François Chollet\n",
"#\n",
"# Permission is hereby granted, free of charge, to any person obtaining a\n",
"# copy of this software and associated documentation files (the \"Software\"),\n",
"# to deal in the Software without restriction, including without limitation\n",
"# the rights to use, copy, modify, merge, publish, distribute, sublicense,\n",
"# and/or sell copies of the Software, and to permit persons to whom the\n",
"# Software is furnished to do so, subject to the following conditions:\n",
"#\n",
"# The above copyright notice and this permission notice shall be included in\n",
"# all copies or substantial portions of the Software.\n",
"#\n",
"# THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n",
"# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n",
"# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL\n",
"# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n",
"# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING\n",
"# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER\n",
"# DEALINGS IN THE SOFTWARE."
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "EIdT9iu_Z4Rb",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"# Predicting house prices: a regression example\n",
"\n"
]
},
{
"metadata": {
"id": "bBIlTPscrIT9",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"<table align=\"left\"><td>\n",
"<a target=\"_blank\" href=\"https://colab.sandbox.google.com/github/tensorflow/models/blob/master/samples/core/get_started/basic_regression.ipynb\">\n",
" <img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a> \n",
"</td><td>\n",
"<a target=\"_blank\" href=\"https://github.com/tensorflow/models/blob/master/samples/core/get_started/basic_regression.ipynb\"><img width=32px src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View source on Github</a></td></table>"
]
},
{
"metadata": {
"id": "AHp3M9ZmrIxj",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"In a *regression* problem, we aim to predict the output of a continous value, like a price or a probability. Contrast this with a *classification* problem, where we aim to predict a discrete label (for example, where a picture contains an apple or an orange). \n",
"\n",
"This notebook builds a model to predict the median price of homes in a Boston suburb during the mid-1970s. To do this, we'll provide the model with some data points about the suburb, such as the crime rate and the local property tax rate.\n",
"\n",
"This example uses the `tf.keras` API, see [this guide](https://www.tensorflow.org/programmers_guide/keras) for details."
]
},
{
"metadata": {
"id": "1rRo8oNqZ-Rj",
"colab_type": "code",
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
"import tensorflow as tf\n",
"from tensorflow import keras\n",
"\n",
"import numpy as np\n",
"\n",
"print(tf.__version__)"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "F_72b0LCNbjx",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## The Boston Housing Prices dataset\n",
"\n",
"This [dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) is accessible directly in TensorFlow. Download and shuffle the training set:"
]
},
{
"metadata": {
"id": "p9kxxgzvzlyz",
"colab_type": "code",
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
"boston_housing = keras.datasets.boston_housing\n",
"\n",
"(train_data, train_labels), (test_data, test_labels) = boston_housing.load_data()\n",
"\n",
"# Shuffle the training set\n",
"order = np.argsort(np.random.random(train_labels.shape))\n",
"train_data = train_data[order]\n",
"train_labels = train_labels[order]"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "PwEKwRJgsgJ6",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"### Examples and features \n",
"\n",
"This dataset is much smaller than the others we've worked with so far: it has 506 total examples are split between 404 training examples and 102 test examples:"
]
},
{
"metadata": {
"id": "Ujqcgkipr65P",
"colab_type": "code",
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
"print(\"Training set: {}\".format(train_data.shape)) # 404 examples, 13 features\n",
"print(\"Testing set: {}\".format(test_data.shape)) # 102 examples, 13 features"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "0LRPXE3Oz3Nq",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"The dataset contains 13 different features:\n",
"\n",
"1. Per capita crime rate.\n",
"2. Proportion of residential land zoned for lots over 25,000 square feet.\n",
"3. Proportion of non-retail business acres per town.\n",
"4. Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).\n",
"5. Nitric oxides concentration (parts per 10 million).\n",
"6. Average number of rooms per dwelling.\n",
"7. Proportion of owner-occupied units built prior to 1940.\n",
"8. Weighted distances to five Boston employment centres.\n",
"9. Index of accessibility to radial highways.\n",
"10. Full-value property-tax rate per $10,000.\n",
"11. Pupil-teacher ratio by town.\n",
"12. 1000 * (Bk - 0.63) ** 2 where Bk is the proportion of Black people by town.\n",
"13. Percentage lower status of the population.\n",
"\n",
"Each one of these input data features is stored using a different scale. Some feature are represented by a proportion between 0 and 1, other features are ranges between 1 and 12, some are ranges between 0 and 100, and so on. This is often the case with real-world data, and understanding how to explore and clean such data is an important skill to develop."
]
},
{
"metadata": {
"id": "8tYsm8Gs03J4",
"colab_type": "code",
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
"print(train_data[0]) # Display sample features, notice they different scales"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "Q7muNf-d1-ne",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Use the [pandas](https://pandas.pydata.org) library to display the first few rows of the dataset in a nicely formatted table:"
]
},
{
"metadata": {
"id": "pYVyGhdyCpIM",
"colab_type": "code",
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
"import pandas as pd\n",
"\n",
"column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',\n",
" 'TAX', 'PTRATIO', 'B', 'LSTAT']\n",
"\n",
"df = pd.DataFrame(train_data, columns=column_names)\n",
"df.head()"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "wb9S7Mia2lpf",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"### Labels\n",
"\n",
"The labels are the house prices in thousands of dollars. (You may notice the mid-1970s prices.)"
]
},
{
"metadata": {
"id": "I8NwI2ND2t4Y",
"colab_type": "code",
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
"print(train_labels[0:10]) # Display first 10 entries"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "mRklxK5s388r",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## Normalize features\n",
"\n",
"It's recommended to normalize features that use different scales and ranges. For each feature, subtract the mean of the feature and divide by the standard deviation:"
]
},
{
"metadata": {
"id": "ze5WQP8R1TYg",
"colab_type": "code",
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
"# Test data is *not* used when calculating the mean and std.\n",
"\n",
"mean = train_data.mean(axis=0)\n",
"std = train_data.std(axis=0)\n",
"train_data = (train_data - mean) / std\n",
"test_data = (test_data - mean) / std\n",
"\n",
"print(train_data[0]) # First training sample, normalized"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "BuiClDk45eS4",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Although the model *might* converge without feature normalization, it makes training more difficult, and it makes the resulting model more dependant on the choice of units used in the input."
]
},
{
"metadata": {
"id": "SmjdzxKzEu1-",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## Create the model\n",
"\n",
"Let's build our model. Here, we'll use a `Sequential` model with two densely connected hidden layers, and an output later that returns a single, continous value. The model building steps are wrapped in a function, `build_model`, since we'll create a second model, later on."
]
},
{
"metadata": {
"id": "c26juK7ZG8j-",
"colab_type": "code",
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
"def build_model():\n",
" model = keras.Sequential()\n",
" \n",
" model.add(keras.layers.Dense(64, activation=tf.nn.relu,\n",
" input_shape=(train_data.shape[1],)))\n",
" model.add(keras.layers.Dense(64, activation=tf.nn.relu))\n",
" model.add(keras.layers.Dense(1))\n",
"\n",
" optimizer = tf.train.RMSPropOptimizer(0.001)\n",
"\n",
" model.compile(loss='mse',\n",
" optimizer=optimizer,\n",
" metrics=['mae'])\n",
" return model\n",
"\n",
"model = build_model()\n",
"model.summary()"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "0-qWCsh6DlyH",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## Train the model\n",
"\n",
"The model is trained for 500 epochs, and record the training and validation accuracy in the `history` object."
]
},
{
"metadata": {
"id": "sD7qHCmNIOY0",
"colab_type": "code",
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
"# Display training progress by printing a single dot for each completed epoch.\n",
"class PrintDot(keras.callbacks.Callback):\n",
" def on_epoch_end(self,epoch,logs):\n",
" if epoch % 100 == 0: print('')\n",
" print('.', end='')\n",
"\n",
"EPOCHS = 500\n",
"\n",
"# Store training stats\n",
"history = model.fit(train_data, train_labels, epochs=EPOCHS,\n",
" validation_split=0.2, verbose=0,\n",
" callbacks=[PrintDot()])"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "tQm3pc0FYPQB",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Visualize the model's training progress using the stats stored in the `history` object. We want to use this data to determine how long to train *before* the model stops making progress."
]
},
{
"metadata": {
"id": "B6XriGbVPh2t",
"colab_type": "code",
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"\n",
"def plot_history(history):\n",
" plt.figure()\n",
" plt.xlabel('Epoch')\n",
" plt.ylabel('Mean Abs Error [1000$]')\n",
" plt.plot(history.epoch, np.array(history.history['mean_absolute_error']), label='Train Loss')\n",
" plt.plot(history.epoch, np.array(history.history['val_mean_absolute_error']), label = 'Val loss')\n",
" plt.legend()\n",
" plt.ylim([0,5])\n",
"\n",
"plot_history(history)"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "AqsuANc11FYv",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"This graph shows little improvement in the model after about 200 epochs. Let's update the `model.fit` method to automatically stop training when the validation score doesn't improve. We'll use a *callback* that tests a training condition for every epoch. If a set amount of epochs elapses without showing improvement, then automatically stop the training.\n",
"\n",
"You can learn more about this callback [here](https://www.tensorflow.org/versions/master/api_docs/python/tf/keras/callbacks/EarlyStopping)."
]
},
{
"metadata": {
"id": "fdMZuhUgzMZ4",
"colab_type": "code",
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
"model = build_model()\n",
"\n",
"# The patience parameter is the amount of epochs to check for improvement.\n",
"early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)\n",
"\n",
"history = model.fit(train_data, train_labels, epochs=EPOCHS,\n",
" validation_split=0.2, verbose=0,\n",
" callbacks=[early_stop, PrintDot()])\n",
"\n",
"plot_history(history)"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "3St8-DmrX8P4",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"The graph shows the average error is about \\\\$2,500 dollars. Is this good? Well, \\$2,500 is not an insignificant amount when some of the labels are only $15,000.\n",
"\n",
"Let's see how did the model performs on the test set:"
]
},
{
"metadata": {
"id": "jl_yNr5n1kms",
"colab_type": "code",
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
"[loss, mae] = model.evaluate(test_data, test_labels, verbose=0)\n",
"\n",
"print(\"Testing set Mean Abs Error: ${:7.2f}\".format(mae * 1000))"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "ft603OzXuEZC",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## Predict\n",
"\n",
"Finally, predict some housing prices using data in the testing set:"
]
},
{
"metadata": {
"id": "Xe7RXH3N3CWU",
"colab_type": "code",
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
"test_predictions = model.predict(test_data).flatten()\n",
"\n",
"print(test_predictions)"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "vgGQuV-yqYZH",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## Conclusion\n",
"\n",
"This notebook i a few techniques to introduce a regresson problem.\n",
"\n",
"* Mean Squared Error (MSE) is a common loss function used for regression problems (different than classification problems).\n",
"* Similarly, evaluation metrics used for regression differ from classification. A common regression metric is Mean Absolute Error (MAE).\n",
"* When input data features have values with different ranges, each feature should be scaled independently.\n",
"* If there is not much training data, prefer a small network with few hidden layers to avoid overfitting.\n",
"* Early stopping is a useful technique to prevent overfitting."
]
}
]
}
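The normalization step in the notebook above (subtract each feature's training mean, divide by its training standard deviation, and reuse those same statistics for the test set) can be sketched without TensorFlow or NumPy. This is only an illustrative sketch: the helper names `fit_scaler` and `transform` and the toy data are assumptions made for the example, not part of the notebook.

```python
import statistics


def fit_scaler(rows):
    """Return per-feature means and standard deviations computed column-wise."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # Population standard deviation, which matches NumPy's default for .std().
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Scale every row with the given (already fitted) statistics."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical toy data: two features on very different scales.
train = [[0.1, 200.0], [0.3, 400.0], [0.5, 600.0]]
test = [[0.3, 800.0]]

# The statistics come from the training rows only, then are applied to both sets.
means, stds = fit_scaler(train)
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)

print(train_scaled)
print(test_scaled)
```

Fitting the statistics on the training rows alone keeps information about the test set out of preprocessing. It also means test values can fall well outside the range seen during training, as the second test feature does here.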
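The `patience` idea behind the notebook's early-stopping callback can be illustrated with a short standalone sketch. It is a simplified model of the behavior described in the notebook (stop once a set number of epochs pass without the validation loss improving), not the Keras implementation; the function name `epochs_until_stop` and the loss values are hypothetical.

```python
def epochs_until_stop(val_losses, patience):
    """Return how many epochs run before a patience-based stop.

    Training stops once `patience` consecutive epochs pass without the
    validation loss dropping below the best value seen so far. If that
    never happens, every epoch runs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best value: remember it and reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)


# Hypothetical validation losses, one per epoch.
losses = [10.0, 8.0, 7.5, 7.6, 7.7, 7.4, 7.5, 7.6, 7.7, 7.8]
stopped_at = epochs_until_stop(losses, patience=3)
print(stopped_at)
```

A larger `patience` tolerates longer plateaus and short-lived increases in the validation loss, at the cost of running more epochs before stopping.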