" <img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a> \n",
"</td><td>\n",
"<a target=\"_blank\" href=\"https://github.com/tensorflow/models/blob/master/samples/core/get_started/basic_regression.ipynb\"><img width=32px src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View source on Github</a></td></table>"
]
},
{
"metadata": {
"id": "AHp3M9ZmrIxj",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"In a *regression* problem, we aim to predict the output of a continous value, like a price or a probability. Contrast this with a *classification* problem, where we aim to predict a discrete label (for example, where a picture contains an apple or an orange). \n",
"\n",
"This notebook builds a model to predict the median price of homes in a Boston suburb during the mid-1970s. To do this, we'll provide the model with some data points about the suburb, such as the crime rate and the local property tax rate.\n",
"\n",
"This example uses the `tf.keras` API, see [this guide](https://www.tensorflow.org/programmers_guide/keras) for details."
]
},
{
"metadata": {
"id": "1rRo8oNqZ-Rj",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"import tensorflow as tf\n",
"from tensorflow import keras\n",
"\n",
"import numpy as np\n",
"\n",
"print(tf.__version__)"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "F_72b0LCNbjx",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## The Boston Housing Prices dataset\n",
"\n",
"This [dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) is accessible directly in TensorFlow. Download and shuffle the training set:"
"This dataset is much smaller than the others we've worked with so far: it has 506 total examples are split between 404 training examples and 102 test examples:"
"2. Proportion of residential land zoned for lots over 25,000 square feet.\n",
"3. Proportion of non-retail business acres per town.\n",
"4. Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).\n",
"5. Nitric oxides concentration (parts per 10 million).\n",
"6. Average number of rooms per dwelling.\n",
"7. Proportion of owner-occupied units built prior to 1940.\n",
"8. Weighted distances to five Boston employment centres.\n",
"9. Index of accessibility to radial highways.\n",
"10. Full-value property-tax rate per $10,000.\n",
"11. Pupil-teacher ratio by town.\n",
"12. 1000 * (Bk - 0.63) ** 2 where Bk is the proportion of Black people by town.\n",
"13. Percentage lower status of the population.\n",
"\n",
"Each one of these input data features is stored using a different scale. Some feature are represented by a proportion between 0 and 1, other features are ranges between 1 and 12, some are ranges between 0 and 100, and so on. This is often the case with real-world data, and understanding how to explore and clean such data is an important skill to develop."
]
},
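{
"metadata": {
"id": "Gp9BPHZkLNlu",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"A minimal sketch of this step, assuming the `keras.datasets.boston_housing` loader; the variable names `train_data`, `train_labels`, `test_data`, and `test_labels` are the ones the cells below expect:"
]
},
{
"metadata": {
"id": "Gp9BPHZkLNlv",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"boston_housing = keras.datasets.boston_housing\n",
"\n",
"(train_data, train_labels), (test_data, test_labels) = boston_housing.load_data()\n",
"\n",
"# Shuffle the training set so the examples are in a random order.\n",
"order = np.argsort(np.random.random(train_labels.shape))\n",
"train_data = train_data[order]\n",
"train_labels = train_labels[order]\n",
"\n",
"print(\"Training set: {}\".format(train_data.shape))  # 404 examples, 13 features\n",
"print(\"Testing set:  {}\".format(test_data.shape))   # 102 examples, 13 features"
],
"execution_count": 0,
"outputs": []
},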
{
"metadata": {
"id": "8tYsm8Gs03J4",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"print(train_data[0]) # Display sample features, notice they different scales"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "Q7muNf-d1-ne",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Use the [pandas](https://pandas.pydata.org) library to display the first few rows of the dataset in a nicely formatted table:"
"The labels are the house prices in thousands of dollars. (You may notice the mid-1970s prices.)"
]
},
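{
"metadata": {
"id": "pYVYGCrBLnxA",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"A sketch of the pandas display. The column abbreviations (`CRIM` through `LSTAT`) are the dataset's standard feature names, in the order of the list above:"
]
},
{
"metadata": {
"id": "pYVYGCrBLnxB",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"import pandas as pd\n",
"\n",
"# Standard abbreviations for the 13 features, in the order listed above.\n",
"column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',\n",
"                'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']\n",
"\n",
"df = pd.DataFrame(train_data, columns=column_names)\n",
"df.head()"
],
"execution_count": 0,
"outputs": []
},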
{
"metadata": {
"id": "I8NwI2ND2t4Y",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"print(train_labels[0:10]) # Display first 10 entries"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "mRklxK5s388r",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## Normalize features\n",
"\n",
"It's recommended to normalize features that use different scales and ranges. For each feature, subtract the mean of the feature and divide by the standard deviation:"
]
},
{
"metadata": {
"id": "ze5WQP8R1TYg",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"# Test data is *not* used when calculating the mean and std.\n",
"\n",
"mean = train_data.mean(axis=0)\n",
"std = train_data.std(axis=0)\n",
"train_data = (train_data - mean) / std\n",
"test_data = (test_data - mean) / std\n",
"\n",
"print(train_data[0]) # First training sample, normalized"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "BuiClDk45eS4",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Although the model *might* converge without feature normalization, it makes training more difficult, and it makes the resulting model more dependant on the choice of units used in the input."
]
},
{
"metadata": {
"id": "SmjdzxKzEu1-",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## Create the model\n",
"\n",
"Let's build our model. Here, we'll use a `Sequential` model with two densely connected hidden layers, and an output later that returns a single, continous value. The model building steps are wrapped in a function, `build_model`, since we'll create a second model, later on."
"Visualize the model's training progress using the stats stored in the `history` object. We want to use this data to determine how long to train *before* the model stops making progress."
"This graph shows little improvement in the model after about 200 epochs. Let's update the `model.fit` method to automatically stop training when the validation score doesn't improve. We'll use a *callback* that tests a training condition for every epoch. If a set amount of epochs elapses without showing improvement, then automatically stop the training.\n",
"\n",
"You can learn more about this callback [here](https://www.tensorflow.org/versions/master/api_docs/python/tf/keras/callbacks/EarlyStopping)."
]
},
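{
"metadata": {
"id": "sk1A0J4zR9Pa",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"The next three cells are a minimal sketch of these steps, assuming two `Dense` hidden layers of 64 units each (a reasonable default, not the only choice) and the `RMSprop` optimizer. Depending on the TensorFlow version, the history keys may be `mean_absolute_error`/`val_mean_absolute_error` or `mae`/`val_mae`:"
]
},
{
"metadata": {
"id": "sk1A0J4zR9Pb",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"def build_model():\n",
"  model = keras.Sequential([\n",
"    keras.layers.Dense(64, activation=tf.nn.relu,\n",
"                       input_shape=(train_data.shape[1],)),\n",
"    keras.layers.Dense(64, activation=tf.nn.relu),\n",
"    keras.layers.Dense(1)  # A single, continuous output: the price\n",
"  ])\n",
"\n",
"  model.compile(loss='mse',\n",
"                optimizer=keras.optimizers.RMSprop(0.001),\n",
"                metrics=['mae'])\n",
"  return model\n",
"\n",
"model = build_model()\n",
"model.summary()"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "sk1A0J4zR9Pc",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"EPOCHS = 500\n",
"\n",
"# Store training stats; hold out 20% of the training data for validation.\n",
"history = model.fit(train_data, train_labels, epochs=EPOCHS,\n",
"                    validation_split=0.2, verbose=0)"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "sk1A0J4zR9Pd",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"def plot_history(history):\n",
"  # The metric key may be 'mae' rather than 'mean_absolute_error',\n",
"  # depending on the TensorFlow version.\n",
"  plt.figure()\n",
"  plt.xlabel('Epoch')\n",
"  plt.ylabel('Mean Abs Error [1000$]')\n",
"  plt.plot(history.epoch, np.array(history.history['mean_absolute_error']),\n",
"           label='Train Loss')\n",
"  plt.plot(history.epoch, np.array(history.history['val_mean_absolute_error']),\n",
"           label='Val Loss')\n",
"  plt.legend()\n",
"  plt.ylim([0, 5])\n",
"\n",
"plot_history(history)"
],
"execution_count": 0,
"outputs": []
},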
{
"metadata": {
"id": "fdMZuhUgzMZ4",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"model = build_model()\n",
"\n",
"# The patience parameter is the amount of epochs to check for improvement.\n",
"The graph shows the average error is about \\\\$2,500 dollars. Is this good? Well, \\$2,500 is not an insignificant amount when some of the labels are only $15,000.\n",
"\n",
"Let's see how did the model performs on the test set:"