" <a target=\"_blank\" href=\"https://www.tensorflow.org/tutorials/estimators/linear\"><img src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" />View on TensorFlow.org</a>\n",
" </td>\n",
" <td>\n",
" <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/models/blob/master/samples/core/tutorials/estimators/linear.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
" </td>\n",
" <td>\n",
" <a target=\"_blank\" href=\"https://github.com/tensorflow/models/blob/master/samples/core/tutorials/estimators/linear.ipynb\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View source on GitHub</a>\n",
" </td>\n",
"</table>"
]
},
{
...
...
@@ -74,31 +95,34 @@
},
"cell_type": "markdown",
"source": [
"In this tutorial, we will use the `tf.estimator` API in TensorFlow to solve a\n",
"benchmark binary classification problem.\n",
"This tutorial uses the `tf.estimator` API in TensorFlow to solve a benchmark binary classification problem. Estimators are TensorFlow's most scalable and production-oriented model type. For more information see the [Estimator guide](../../guide/estimators.md).\n",
"\n",
"Estimators are TensorFlow's most scalable and production oriented type of model. For more information see the [Estimator guide](../../guide/estimators).\n",
"## Overview\n",
"\n",
"The problem is: Given census data about a person such as age, education, marital status, and occupation (the features), we will try to predict whether or not the person earns more than 50,000 dollars a year (the target label). We will train a **logistic regression** model, and given an individual's information our model will output a number between 0 and 1, which can be interpreted as the probability that the individual has an annual income of over 50,000 dollars.\n",
"Using census data which contains data a person's age, education, marital status, and occupation (the *features*), we will try to predict whether or not the person earns more than 50,000 dollars a year (the target *label*). We will train a *logistic regression* model that, given an individual's information, outputs a number between 0 and 1—this can be interpreted as the probability that the individual has an annual income of over 50,000 dollars.\n",
"\n",
"Key Point: As a modeler and developer, think about how this data is used and the potential benefits and harm a model's predictions can cause. A model like this could reinforce societal biases and disparities. Is each feature relevant to the problem you want to solve or will it introduce bias? For more information, read about [ML fairness](https://developers.google.com/machine-learning/fairness-overview/).\n",
"\n",
"## Setup\n",
"\n",
"To try this tutorial, first import the relavant packages:"
"Import TensorFlow, feature column support, and supporting modules:"
]
},
{
"metadata": {
"id": "NQgONe5ecYvE",
"colab_type": "code",
"colab": {}
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
"import tensorflow as tf\n",
"import tensorflow.feature_column as fc \n",
"tf.enable_eager_execution()\n",
"\n",
"import os\n",
"import sys\n",
...
...
@@ -109,31 +133,54 @@
},
{
"metadata": {
"id": "-MPr95UccYvL",
"id": "Rpb1JSMj1nqk",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## Official implementation\n",
"\n"
"And let's enable [eager execution](../../guide/eager.md) to inspect this program as we run it:"
]
},
{
"metadata": {
"id": "tJqF5E6rtyCI",
"id": "tQzxON782Eby",
"colab_type": "code",
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
"tf.enable_eager_execution()"
],
"execution_count": 0,
"outputs": []
},
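{
"metadata": {
"id": "eager-check-md",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"With eager execution enabled, operations return concrete values immediately instead of building a graph to run later. A quick sketch of what this buys us:\n",
"\n",
"```python\n",
"t = tf.constant([[1.0, 2.0]])\n",
"print(t + 1)  # the tensor's values print right away, no Session needed\n",
"```"
]
},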
{
"metadata": {
"id": "-MPr95UccYvL",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Download the [tutorial code from github](https://github.com/tensorflow/models/tree/master/official/wide_deep/),\n",
" add the root directory to your python path, and jump to the `wide_deep` directory:"
"## Download the official implementation\n",
"\n",
"We'll use the [wide and deep model](https://github.com/tensorflow/models/tree/master/official/wide_deep/) available in TensorFlow's [model repository](https://github.com/tensorflow/models/). Download the code, add the root directory to your Python path, and jump to the `wide_deep` directory:"
]
},
{
"metadata": {
"id": "tTwQzWcn8aBu",
"colab_type": "code",
"colab": {}
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
...
...
@@ -142,11 +189,26 @@
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "sRpuysc73Eb-",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Add the root directory of the repository to your Python path:"
]
},
{
"metadata": {
"id": "yVvFyhnkcYvL",
"colab_type": "code",
"colab": {}
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
...
...
@@ -171,7 +233,12 @@
"metadata": {
"id": "6QilS4-0cYvQ",
"colab_type": "code",
"colab": {}
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
...
...
@@ -201,7 +268,12 @@
"metadata": {
"id": "DYOkY8boUptJ",
"colab_type": "code",
"colab": {}
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
...
...
@@ -225,7 +297,12 @@
"metadata": {
"id": "1_3tBaLW4YM4",
"colab_type": "code",
"colab": {}
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
...
...
@@ -248,7 +325,12 @@
"metadata": {
"id": "py7MarZl5Yh6",
"colab_type": "code",
"colab": {}
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
...
...
@@ -257,16 +339,6 @@
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "AmZ4CpaOcYvV",
...
...
@@ -274,28 +346,25 @@
},
"cell_type": "markdown",
"source": [
"## Reading The Census Data\n",
"## Read the U.S. Census data\n",
"\n",
"The dataset we're using is the\n",
"[Census Income Dataset](https://archive.ics.uci.edu/ml/datasets/Census+Income).\n",
"which downloads the code and performs some additional cleanup.\n",
"This example uses the [U.S Census Income Dataset](https://archive.ics.uci.edu/ml/datasets/Census+Income) from 1994 and 1995. We have provided the [census_dataset.py](https://github.com/tensorflow/models/tree/master/official/wide_deep/census_dataset.py) script to download the data and perform a little cleanup.\n",
"\n",
"Since the task is a binary classification problem, we'll construct a label\n",
"column named \"label\" whose value is 1 if the income is over 50K, and 0\n",
"Since the task is a *binary classification problem*, we'll construct a label column named \"label\" whose value is 1 if the income is over 50K, and 0 otherwise. For reference, see the `input_fn` in [census_main.py](https://github.com/tensorflow/models/tree/master/official/wide_deep/census_main.py).\n",
"\n",
"Next, let's take a look at the data and see which columns we can use to\n",
"predict the target label. "
"Let's look at the data to see which columns we can use to predict the target label:"
]
},
{
"metadata": {
"id": "N6Tgye8bcYvX",
"colab_type": "code",
"colab": {}
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
...
...
@@ -308,7 +377,12 @@
"metadata": {
"id": "6y3mj9zKcYva",
"colab_type": "code",
"colab": {}
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
...
...
@@ -318,15 +392,31 @@
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "EO_McKgE5il2",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"[pandas](https://pandas.pydata.org/) provides some convenient utilities for data analysis. Here's a list of columns available in the Census Income dataset:"
"The columns can be grouped into two types—categorical\n",
"and continuous columns:\n",
"The columns are grouped into two types: *categorical* and *continuous* columns:\n",
"\n",
"* A column is called **categorical** if its value can only be one of the\n",
" categories in a finite set. For example, the relationship status of a person\n",
" (wife, husband, unmarried, etc.) or the education level (high school,\n",
" college, etc.) are categorical columns.\n",
"* A column is called **continuous** if its value can be any numerical value in\n",
" a continuous range. For example, the capital gain of a person (e.g. $14,084)\n",
" is a continuous column.\n",
"\n",
"Here's a list of columns available in the Census Income dataset:\n",
"* A column is called *categorical* if its value can only be one of the categories in a finite set. For example, the relationship status of a person (wife, husband, unmarried, etc.) or the education level (high school, college, etc.) are categorical columns.\n",
"* A column is called *continuous* if its value can be any numerical value in a continuous range. For example, the capital gain of a person (e.g. $14,084) is a continuous column.\n",
"\n",
"## Converting Data into Tensors\n",
"\n",
"When building a tf.estimator model, the input data is specified by means of an\n",
"input function or `input_fn`. This builder function returns a `tf.data.Dataset`\n",
"of batches of `(features-dict,label)` pairs. It will not be called until it is\n",
"later passed to `tf.estimator.Estimator` methods such as `train` and `evaluate`.\n",
"When building a `tf.estimator` model, the input data is specified by using an *input function* (or `input_fn`). This builder function returns a `tf.data.Dataset` of batches of `(features-dict, label)` pairs. It is not called until it is passed to `tf.estimator.Estimator` methods such as `train` and `evaluate`.\n",
"\n",
"In more detail, the input builder function returns the following as a pair:\n",
"The input builder function returns the following pair:\n",
"\n",
"1. `features`: A dict from feature names to `Tensors` or\n",
" `SparseTensors` containing batches of features.\n",
"2. `labels`: A `Tensor` containing batches of labels.\n",
"1. `features`: A dict from feature names to `Tensors` or `SparseTensors` containing batches of features.\n",
"2. `labels`: A `Tensor` containing batches of labels.\n",
"\n",
"The keys of the `features` will be used to configure the model's input layer.\n",
"The keys of the `features` are used to configure the model's input layer.\n",
"\n",
"Note that the input function will be called while\n",
"constructing the TensorFlow graph, not while running the graph. What it is\n",
"returning is a representation of the input data as sequence of tensorflow graph\n",
"operations.\n",
"Note: The input function is called while constructing the TensorFlow graph, *not* while running the graph. It is returning a representation of the input data as a sequence of TensorFlow graph operations.\n",
"\n",
"For small problems like this it's easy to make a `tf.data.Dataset` by slicing the `pandas.DataFrame`:"
"For small problems like this, it's easy to make a `tf.data.Dataset` by slicing the `pandas.DataFrame`:"
]
},
{
"metadata": {
"id": "N7zNJflKcYvg",
"colab_type": "code",
"colab": {}
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
...
...
@@ -407,14 +487,19 @@
},
"cell_type": "markdown",
"source": [
"Since we have eager execution enabled it is easy to inspect the resulting dataset:"
"Since we have eager execution enabled, it's easy to inspect the resulting dataset:"
]
},
{
"metadata": {
"id": "ygaKuikecYvi",
"colab_type": "code",
"colab": {}
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
...
...
@@ -437,8 +522,7 @@
},
"cell_type": "markdown",
"source": [
"But this approach has severly-limited scalability. For larger data it should be streamed off disk.\n",
"The `census_dataset.input_fn` provides an example of how to do this using `tf.decode_csv` and `tf.data.TextLineDataset`: \n",
"But this approach has severly-limited scalability. Larger datasets should be streamed from disk. The `census_dataset.input_fn` provides an example of how to do this using `tf.decode_csv` and `tf.data.TextLineDataset`: \n",
"\n",
"<!-- TODO(markdaoust): This `input_fn` should use `tf.contrib.data.make_csv_dataset` -->"
"## Selecting and Engineering Features for the Model\n",
"\n",
"Estimators use a system called `feature_columns` to describe how the model\n",
"should interpret each of the raw input features. An Estimator exepcts a vector\n",
"of numeric inputs, and feature columns describe how the model should convert\n",
"each feature.\n",
"\n",
"Selecting and crafting the right set of feature columns is key to learning an\n",
"effective model. A **feature column** can be either one of the raw columns in\n",
"the original dataframe (let's call them **base feature columns**), or any new\n",
"columns created based on some transformations defined over one or multiple base\n",
"columns (let's call them **derived feature columns**). Basically, \"feature\n",
"column\" is an abstract concept of any raw or derived variable that can be used\n",
"to predict the target label.\n",
"Estimators use a system called [feature columns](https://www.tensorflow.org/guide/feature_columns) to describe how the model should interpret each of the raw input features. An Estimator expects a vector of numeric inputs, and feature columns describe how the model should convert each feature.\n",
"\n",
"### Base Feature Columns\n",
"Selecting and crafting the right set of feature columns is key to learning an effective model. A *feature column* can be either one of the raw columns in the original data frame (a *base feature column*), or any new columns created using transformations defined over one or multiple base columns (a *derived feature columns*).\n",
"\n",
"A feature column is an abstract concept of any raw or derived variable that can be used to predict the target label."
]
},
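{
"metadata": {
"id": "feature-column-mechanics-md",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"As a minimal sketch of the mechanics (with hypothetical values), `tf.feature_column.input_layer` applies a list of feature columns to a features-dict and produces the numeric vector the model consumes:\n",
"\n",
"```python\n",
"features = {'age': [[23.], [31.]]}  # a small batch from a features-dict\n",
"age_column = tf.feature_column.numeric_column('age')\n",
"print(tf.feature_column.input_layer(features, [age_column]))  # shape (2, 1)\n",
"```"
]
},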
{
"metadata": {
"id": "_hh-cWdU__Lq",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"### Base Feature Columns"
]
},
{
"metadata": {
"id": "BKz6LA8_ACI7",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"#### Numeric columns\n",
"\n",
"The simplest `feature_column` is `numeric_column`. This indicates that a feature is a numeric value that should be input to the model directly. For example:"
...
...
@@ -556,7 +654,12 @@
"metadata": {
"id": "ZX0r2T5OcYv6",
"colab_type": "code",
"colab": {}
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
...
...
@@ -579,7 +682,12 @@
"metadata": {
"id": "kREtIPfwcYv_",
"colab_type": "code",
"colab": {}
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
...
...
@@ -595,14 +703,19 @@
},
"cell_type": "markdown",
"source": [
"The following code will train and evaluate a model on only the `age` feature."
"The following will train and evaluate a model using only the `age` feature:"
]
},
{
"metadata": {
"id": "9R5eSJ1pcYwE",
"colab_type": "code",
"colab": {}
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
...
...
@@ -610,7 +723,7 @@
"classifier.train(train_inpf)\n",
"result = classifier.evaluate(test_inpf)\n",
"\n",
"clear_output()\n",
"clear_output() # used for display in notebook\n",
"You could retrain a model on these features with, just by changing the `feature_columns` argument to the constructor:"
"You could retrain a model on these features by changing the `feature_columns` argument to the constructor:"
]
},
{
"metadata": {
"id": "XN8k5S95cYwR",
"colab_type": "code",
"colab": {}
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
...
...
@@ -693,6 +794,7 @@
"result = classifier.evaluate(test_inpf)\n",
"\n",
"clear_output()\n",
"\n",
"for key,value in sorted(result.items()):\n",
" print('%s: %s' % (key, value))"
],
...
...
@@ -708,24 +810,27 @@
"source": [
"#### Categorical columns\n",
"\n",
"To define a feature column for a categorical feature, we can create a\n",
"`CategoricalColumn` using one of the `tf.feature_column.categorical_column*` functions.\n",
"To define a feature column for a categorical feature, create a `CategoricalColumn` using one of the `tf.feature_column.categorical_column*` functions.\n",
"\n",
"If you know the set of all possible feature values of a columnand there are only a few of them, you can use `categorical_column_with_vocabulary_list`. Each key in the list will get assigned an auto-incremental ID starting from 0. For example, for the `relationship` column we can assign the feature string `Husband` to an integer ID of 0 and \"Not-in-family\" to 1, etc., by doing:"
"If you know the set of all possible feature values of a column—and there are only a few of them—use `categorical_column_with_vocabulary_list`. Each key in the list is assigned an auto-incremented ID starting from 0. For example, for the `relationship` column we can assign the feature string `Husband` to an integer ID of 0 and \"Not-in-family\" to 1, etc."
"This will create a sparse one-hot vector from the raw input feature.\n",
"This creates a sparse one-hot vector from the raw input feature.\n",
"\n",
"The `input_layer` function we're using for demonstration is designed for DNN models, and so expects dense inputs. To demonstrate the categorical column we must wrap it in a `tf.feature_column.indicator_column` to create the dense one-hot output (Linear `Estimators` can often skip this dense-step).\n",
"The `input_layer` function we're using is designed for DNN models and expects dense inputs. To demonstrate the categorical column we must wrap it in a `tf.feature_column.indicator_column` to create the dense one-hot output (Linear `Estimators` can often skip this dense-step).\n",
"\n",
"Note: the other sparse-to-dense option is `tf.feature_column.embedding_column`.\n",
"\n",
...
...
@@ -750,7 +855,12 @@
"metadata": {
"id": "kI43CYlncYwY",
"colab_type": "code",
"colab": {}
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
...
...
@@ -766,15 +876,19 @@
},
"cell_type": "markdown",
"source": [
"What if we don't know the set of possible values in advance? Not a problem. We\n",
"can use `categorical_column_with_hash_bucket` instead:"
"If we don't know the set of possible values in advance, use the `categorical_column_with_hash_bucket` instead:"
]
},
{
"metadata": {
"id": "8pSBaliCcYwb",
"colab_type": "code",
"colab": {}
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
...
...
@@ -791,15 +905,19 @@
},
"cell_type": "markdown",
"source": [
"What will happen is that each possible value in the feature column `occupation`\n",
"will be hashed to an integer ID as we encounter them in training. The example batch has a few different occupations:"
"Here, each possible value in the feature column `occupation` is hashed to an integer ID as we encounter them in training. The example batch has a few different occupations:"
]
},
{
"metadata": {
"id": "dCvQNv36cYwe",
"colab_type": "code",
"colab": {}
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
...
...
@@ -816,14 +934,19 @@
},
"cell_type": "markdown",
"source": [
"If we run `input_layer` with the hashed column we see that the output shape is `(batch_size, hash_bucket_size)`"
"If we run `input_layer` with the hashed column, we see that the output shape is `(batch_size, hash_bucket_size)`:"
]
},
{
"metadata": {
"id": "0Y16peWacYwh",
"colab_type": "code",
"colab": {}
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
...
...
@@ -841,18 +964,19 @@
},
"cell_type": "markdown",
"source": [
"It's easier to see the actual results if we take the tf.argmax over the `hash_bucket_size` dimension.\n",
"\n",
"In the output below, note how any duplicate occupations are mapped to the same pseudo-random index:\n",
"\n",
"Note: Hash collisions are unavoidable, but often have minimal impact on model quiality. The effeect may be noticable if the hash buckets are being used to compress the input space. See [this notebook](https://colab.research.google.com/github/tensorflow/models/blob/master/samples/outreach/blogs/housing_prices.ipynb) for a more visual example of the effect of these hash collisions."
"It's easier to see the actual results if we take the `tf.argmax` over the `hash_bucket_size` dimension. Notice how any duplicate occupations are mapped to the same pseudo-random index:"
]
},
{
"metadata": {
"id": "q_ryRglmcYwk",
"colab_type": "code",
"colab": {}
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
...
...
@@ -868,21 +992,23 @@
},
"cell_type": "markdown",
"source": [
"No matter which way we choose to define a `SparseColumn`, each feature string\n",
"will be mapped into an integer ID by looking up a fixed mapping or by hashing.\n",
"Under the hood, the `LinearModel` class is responsible for\n",
"managing the mapping and creating `tf.Variable` to store the model parameters\n",
"(also known as model weights) for each feature ID. The model parameters will be\n",
"learned through the model training process we'll go through later.\n",
"Note: Hash collisions are unavoidable, but often have minimal impact on model quiality. The effeect may be noticable if the hash buckets are being used to compress the input space. See [this notebook](https://colab.research.google.com/github/tensorflow/models/blob/master/samples/outreach/blogs/housing_prices.ipynb) for a more visual example of the effect of these hash collisions.\n",
"\n",
"No matter how we choose to define a `SparseColumn`, each feature string is mapped into an integer ID by looking up a fixed mapping or by hashing. Under the hood, the `LinearModel` class is responsible for managing the mapping and creating `tf.Variable` to store the model parameters (model *weights*) for each feature ID. The model parameters are learned through the model training process described later.\n",
"\n",
"We'll do the similar trick to define the other categorical features:"
"Let's do the similar trick to define the other categorical features:"
"#### Making Continuous Features Categorical through Bucketization\n",
"### Derived feature columns"
]
},
{
"metadata": {
"id": "RgYaf_48FSU2",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"#### Make Continuous Features Categorical through Bucketization\n",
"\n",
"Sometimes the relationship between a continuous feature and the label is not\n",
"linear. As a hypothetical example, a person's income may grow with age in the\n",
"early stage of one's career, then the growth may slow at some point, and finally\n",
"the income decreases after retirement. In this scenario, using the raw `age` as\n",
"a real-valued feature column might not be a good choice because the model can\n",
"only learn one of the three cases:\n",
"Sometimes the relationship between a continuous feature and the label is not linear. For example, *age* and *income*—a person's income may grow in the early stage of their career, then the growth may slow at some point, and finally, the income decreases after retirement. In this scenario, using the raw `age` as a real-valued feature column might not be a good choice because the model can only learn one of the three cases:\n",
"\n",
"1. Income always increases at some rate as age grows (positive correlation),\n",
"1. Income always decreases at some rate as age grows (negative correlation), or\n",
"1. Income stays the same no matter at what age (no correlation)\n",
"\n",
"If we want to learn the fine-grained correlation between income and each age\n",
"group separately, we can leverage **bucketization**. Bucketization is a process\n",
"of dividing the entire range of a continuous feature into a set of consecutive\n",
"bins/buckets, and then converting the original numerical feature into a bucket\n",
"ID (as a categorical feature) depending on which bucket that value falls into.\n",
"So, we can define a `bucketized_column` over `age` as:"
"2. Income always decreases at some rate as age grows (negative correlation), or\n",
"3. Income stays the same no matter at what age (no correlation).\n",
"\n",
"If we want to learn the fine-grained correlation between income and each age group separately, we can leverage *bucketization*. Bucketization is a process of dividing the entire range of a continuous feature into a set of consecutive buckets, and then converting the original numerical feature into a bucket ID (as a categorical feature) depending on which bucket that value falls into. So, we can define a `bucketized_column` over `age` as:"
]
},
{
"metadata": {
"id": "KT4pjD9AcYww",
"colab_type": "code",
"colab": {}
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
...
...
@@ -998,9 +1123,7 @@
},
"cell_type": "markdown",
"source": [
"where the `boundaries` is a list of bucket boundaries. In this case, there are\n",
"10 boundaries, resulting in 11 age group buckets (from age 17 and below, 18-24,\n",
"25-29, ..., to 65 and over).\n",
"`boundaries` is a list of bucket boundaries. In this case, there are 10 boundaries, resulting in 11 age group buckets (from age 17 and below, 18-24, 25-29, ..., to 65 and over).\n",
"\n",
"With bucketing, the model sees each bucket a one-hot feature:"
]
...
...
@@ -1009,7 +1132,12 @@
"metadata": {
"id": "Lr40vm3qcYwy",
"colab_type": "code",
"colab": {}
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
...
...
@@ -1027,22 +1155,21 @@
"source": [
"#### Learn complex relationships with crossed column\n",
"\n",
"Using each base feature column separately may not be enough to explain the data.\n",
"For example, the correlation between education and the label (earning > 50,000\n",
"dollars) may be different for different occupations. Therefore, if we only learn\n",
"a single model weight for `education=\"Bachelors\"` and `education=\"Masters\"`, we\n",
"won't be able to capture every single education-occupation combination (e.g.\n",
"distinguishing between `education=\"Bachelors\" AND occupation=\"Exec-managerial\"`\n",
"and `education=\"Bachelors\" AND occupation=\"Craft-repair\"`). To learn the\n",
"differences between different feature combinations, we can add **crossed feature\n",
"columns** to the model."
"Using each base feature column separately may not be enough to explain the data. For example, the correlation between education and the label (earning > 50,000 dollars) may be different for different occupations. Therefore, if we only learn a single model weight for `education=\"Bachelors\"` and `education=\"Masters\"`, we won't capture every education-occupation combination (e.g. distinguishing between `education=\"Bachelors\"` AND `occupation=\"Exec-managerial\"` AND `education=\"Bachelors\" AND occupation=\"Craft-repair\"`).\n",
"\n",
"To learn the differences between different feature combinations, we can add *crossed feature columns* to the model:"
]
},
{
"metadata": {
"id": "IAPhPzXscYw1",
"colab_type": "code",
"colab": {}
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
...
...
@@ -1059,17 +1186,19 @@
},
"cell_type": "markdown",
"source": [
"We can also create a `crossed_column` over more than two columns. Each\n",
"constituent column can be either a base feature column that is categorical\n",
"(`SparseColumn`), a bucketized real-valued feature column, or even another\n",
"`CrossColumn`. Here's an example:"
"We can also create a `crossed_column` over more than two columns. Each constituent column can be either a base feature column that is categorical (`SparseColumn`), a bucketized real-valued feature column, or even another `CrossColumn`. For example:"
]
},
{
"metadata": {
"id": "y8UaBld9cYw7",
"colab_type": "code",
"colab": {}
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
...
...
@@ -1088,8 +1217,7 @@
"source": [
"These crossed columns always use hash buckets to avoid the exponential explosion in the number of categories, and put the control over number of model weights in the hands of the user.\n",
"\n",
"For a visual example the effect of hash-buckets with crossed columns see [this notebook](https://colab.research.google.com/github/tensorflow/models/blob/master/samples/outreach/blogs/housing_prices.ipynb)\n",
"\n"
"For a visual example the effect of hash-buckets with crossed columns see [this notebook](https://colab.research.google.com/github/tensorflow/models/blob/master/samples/outreach/blogs/housing_prices.ipynb)\n"
]
},
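{
"metadata": {
"id": "crossed-hash-size-md",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"For instance, a sketch of how `hash_bucket_size` caps the number of weights for a cross (assuming the `education` and `occupation` columns from this dataset):\n",
"\n",
"```python\n",
"# At most 1000 weights for the education-occupation cross, no matter how\n",
"# many distinct combinations occur in the data.\n",
"education_x_occupation = tf.feature_column.crossed_column(\n",
"    ['education', 'occupation'], hash_bucket_size=1000)\n",
"```"
]
},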
{
...
...
@@ -1099,27 +1227,28 @@
},
"cell_type": "markdown",
"source": [
"## Defining The Logistic Regression Model\n",
"## Define the logistic regression model\n",
"\n",
"After processing the input data and defining all the feature columns, we're now\n",
"ready to put them all together and build a Logistic Regression model. In the\n",
"previous section we've seen several types of base and derived feature columns,\n",
"including:\n",
"After processing the input data and defining all the feature columns, we can put them together and build a *logistic regression* model. The previous section showed several types of base and derived feature columns, including:\n",
"\n",
"* `CategoricalColumn`\n",
"* `NumericColumn`\n",
"* `BucketizedColumn`\n",
"* `CrossedColumn`\n",
"\n",
"All of these are subclasses of the abstract `FeatureColumn` class, and can be\n",
"added to the `feature_columns` field of a model:"
"All of these are subclasses of the abstract `FeatureColumn` class and can be added to the `feature_columns` field of a model:"
"The model also automatically learns a bias term, which controls the prediction\n",
"one would make without observing any features (see the section [How Logistic\n",
"Regression Works](#how_it_works) for more explanations). The learned model files will be stored\n",
"in `model_dir`.\n",
"The model automatically learns a bias term, which controls the prediction made without observing any features. The learned model files are stored in `model_dir`.\n",
"\n",
"## Training and evaluating our model\n",
"## Train and evaluate the model\n",
"\n",
"After adding all the features to the model, now let's look at how to actually\n",
"train the model. Training a model is just a single command using the\n",
"tf.estimator API:"
"After adding all the features to the model, let's train the model. Training a model is just a single command using the `tf.estimator` API:"
]
},
{
"metadata": {
"id": "ZlrIBuoecYxD",
"colab_type": "code",
"colab": {}
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
"model.train(train_inpf)\n",
"clear_output()"
"\n",
"clear_output() # used for notebook display"
],
"execution_count": 0,
"outputs": []
...
...
@@ -1182,20 +1313,26 @@
},
"cell_type": "markdown",
"source": [
"After the model is trained, we can evaluate how good our model is at predicting\n",
"the labels of the holdout data:"
"After the model is trained, evaluate the accuracy of the model by predicting the labels of the holdout data:"
]
},
{
"metadata": {
"id": "L9nVJEO8cYxI",
"colab_type": "code",
"colab": {}
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
}
},
"cell_type": "code",
"source": [
"results = model.evaluate(test_inpf)\n",
"\n",
"clear_output()\n",
"\n",
"for key,value in sorted(result.items()):\n",
" print('%s: %0.2f' % (key, value))"
],
...
...
@@ -1209,25 +1346,28 @@
},
"cell_type": "markdown",
"source": [
"The first line of the final output should be something like\n",
"`accuracy: 0.83`, which means the accuracy is 83%. Feel free to try more\n",
"features and transformations and see if you can do even better!\n",
"The first line of the output should display something like: `accuracy: 0.83`, which means the accuracy is 83%. You can try using more features and transformations to see if you can do better!\n",
"\n",
"After the model is evaluated, we can use the model to predict whether an individual has an annual income of over\n",
"50,000 dollars given an individual's information input.\n",
"After the model is evaluated, we can use it to predict whether an individual has an annual income of over 50,000 dollars given an individual's information input.\n",
"\n",
"Let's look in more detail how the model did:"
"Let's look in more detail how the model performed:"
"where \\\\(\\mathbf{w}=[w_1, w_2, ..., w_d]\\\\) are the model weights for the\n",
"features \\\\(\\mathbf{x}=[x_1, x_2, ..., x_d]\\\\). \\\\(b\\\\) is a constant that is\n",
"often called the **bias** of the model. The equation consists of two parts—A\n",
"linear model and a logistic function:\n",
"\n",
"* **Linear Model**: First, we can see that \\\\(\\mathbf{w}^T\\mathbf{x}+b = b +\n",
" w_1x_1 + ... +w_dx_d\\\\) is a linear model where the output is a linear\n",
" function of the input features \\\\(\\mathbf{x}\\\\). The bias \\\\(b\\\\) is the\n",
" prediction one would make without observing any features. The model weight\n",
" \\\\(w_i\\\\) reflects how the feature \\\\(x_i\\\\) is correlated with the positive\n",
" label. If \\\\(x_i\\\\) is positively correlated with the positive label, the\n",
" weight \\\\(w_i\\\\) increases, and the probability \\\\(P(Y=1|\\mathbf{x})\\\\) will\n",
" be closer to 1. On the other hand, if \\\\(x_i\\\\) is negatively correlated\n",
" with the positive label, then the weight \\\\(w_i\\\\) decreases and the\n",
" probability \\\\(P(Y=1|\\mathbf{x})\\\\) will be closer to 0.\n",
"\n",
"* **Logistic Function**: Second, we can see that there's a logistic function\n",
" (also known as the sigmoid function) \\\\(S(t) = 1/(1+\\exp(-t))\\\\) being\n",
" applied to the linear model. The logistic function is used to convert the\n",
" output of the linear model \\\\(\\mathbf{w}^T\\mathbf{x}+b\\\\) from any real\n",
" number into the range of \\\\([0, 1]\\\\), which can be interpreted as a\n",
" probability.\n",
"\n",
"Model training is an optimization problem: The goal is to find a set of model\n",
"weights (i.e. model parameters) to minimize a **loss function** defined over the\n",
"training data, such as logistic loss for Logistic Regression models. The loss\n",
"function measures the discrepancy between the ground-truth label and the model's\n",
"prediction. If the prediction is very close to the ground-truth label, the loss\n",
"value will be low; if the prediction is very far from the label, then the loss\n",
"value would be high."
]
},
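{
"metadata": {
"id": "logistic-sketch-md",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"As a quick numeric sketch of these two parts (the weights, bias, and features below are hypothetical, not learned values):\n",
"\n",
"```python\n",
"import math\n",
"\n",
"w, b = [0.8, -0.5], 0.1                        # model weights and bias\n",
"x = [1.0, 2.0]                                 # input features\n",
"t = sum(wi * xi for wi, xi in zip(w, x)) + b   # linear model: -0.1\n",
"p = 1 / (1 + math.exp(-t))                     # logistic function: ~0.475\n",
"print(p)                                       # interpreted as P(Y=1|x)\n",
"```"
]
},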
{
"metadata": {
"id": "hbXuPYQIcYxV",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## Next Steps\n",
"\n",
"For more about estimators see:\n",
"\n",
"- The [Estimator Guide](tensorlfow.org/guide/estimators).\n",
"- The [TensorFlow Hub text classification tutorial](https://www.tensorflow.org/hub/tutorials/text_classification_with_tf_hub), which uses `hub.text_embedding_column` to easily ingest free form text. \n",
"- The [Gradient-boosted-trees estimator tutorial](https://github.com/tensorflow/models/tree/master/official/boosted_trees)\n",
"- This [blog post]( https://medium.com/tensorflow/classifying-text-with-tensorflow-estimators) on processing text with `Estimators`\n",
"- How to [build a custom CNN estimator](https://www.tensorflow.org/tutorials/estimators/cnn)"
"For a working end-to-end example, download our [example code](https://github.com/tensorflow/models/tree/master/official/wide_deep/census_main.py) and set the `model_type` flag to `wide`."