Commit 9ba5b316 authored by Mark Daoust

Convert to colab format

parent 2c929976
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# TensorFlow Linear Model Tutorial\n",
"\n",
"In this tutorial, we will use the `tf.estimator` API in TensorFlow to solve a\n",
"binary classification problem: Given census data about a person such as age,\n",
"education, marital status, and occupation (the features), we will try to predict\n",
"whether or not the person earns more than 50,000 dollars a year (the target\n",
"label). We will train a **logistic regression** model, and given an individual's\n",
"information our model will output a number between 0 and 1, which can be\n",
"interpreted as the probability that the individual has an annual income of over\n",
"50,000 dollars.\n",
"\n",
"## Setup\n",
"\n",
"To try the code for this tutorial:\n",
"\n",
"[Install TensorFlow](https://www.tensorflow.org/install) if you haven't already.\n",
"\n",
"Next, import the relevant packages:"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"ename": "ValueError",
"evalue": "tf.enable_eager_execution must be called at program startup.",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-42-04d0fb7a9ec6>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mtensorflow\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mtf\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mtensorflow\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfeature_column\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mfc\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mtf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0menable_eager_execution\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mos\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/venv3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py\u001b[0m in \u001b[0;36menable_eager_execution\u001b[0;34m(config, device_policy, execution_mode)\u001b[0m\n\u001b[1;32m 5238\u001b[0m \"\"\"\n\u001b[1;32m 5239\u001b[0m return enable_eager_execution_internal(\n\u001b[0;32m-> 5240\u001b[0;31m config, device_policy, execution_mode, None)\n\u001b[0m\u001b[1;32m 5241\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5242\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/venv3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py\u001b[0m in \u001b[0;36menable_eager_execution_internal\u001b[0;34m(config, device_policy, execution_mode, server_def)\u001b[0m\n\u001b[1;32m 5306\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5307\u001b[0m raise ValueError(\n\u001b[0;32m-> 5308\u001b[0;31m \"tf.enable_eager_execution must be called at program startup.\")\n\u001b[0m\u001b[1;32m 5309\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5310\u001b[0m \u001b[0;31m# Monkey patch to get rid of an unnecessary conditional since the context is\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mValueError\u001b[0m: tf.enable_eager_execution must be called at program startup."
]
}
],
"source": [
"import tensorflow as tf\n",
"import tensorflow.feature_column as fc \n",
"tf.enable_eager_execution()\n",
"\n",
"import os\n",
"import sys\n",
"from IPython.display import clear_output"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Download the [tutorial code from GitHub](https://github.com/tensorflow/models/tree/master/official/wide_deep/),\n",
" add the root directory to your Python path, and jump to the `wide_deep` directory:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"fatal: destination path 'models' already exists and is not an empty directory.\r\n"
]
}
],
"source": [
"if \"wide_deep\" not in os.getcwd():\n",
" ! git clone --depth 1 https://github.com/tensorflow/models\n",
" models_path = os.path.join(os.getcwd(), 'models')\n",
" sys.path.append(models_path) \n",
" # PYTHONPATH may be unset, so fall back to an empty string.\n",
" os.environ['PYTHONPATH'] = os.environ.get('PYTHONPATH', '') + os.pathsep + models_path\n",
" os.chdir(\"models/official/wide_deep\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Execute the data download script:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import census_dataset\n",
"import census_main\n",
"\n",
"census_dataset.download(\"/tmp/census_data/\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To train the model described in this tutorial, run the tutorial code from the command line with the following command:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['I0711 14:47:25.747490 139708077598464 tf_logging.py:115] accuracy: 0.833794']\n"
]
}
],
"source": [
"output = !python -m census_main --model_type=wide --train_epochs=2\n",
"print([line for line in output if 'accuracy:' in line])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read on to find out how this code builds its linear model.\n",
"\n",
"## Reading The Census Data\n",
"\n",
"The dataset we're using is the\n",
"[Census Income Dataset](https://archive.ics.uci.edu/ml/datasets/Census+Income).\n",
"We have provided\n",
"[census_dataset.py](https://github.com/tensorflow/models/tree/master/official/wide_deep/census_dataset.py)\n",
"which downloads the data and performs some additional cleanup.\n",
"\n",
"Since the task is a binary classification problem, we'll construct a label\n",
"column named \"label\" whose value is 1 if the income is over 50K, and 0\n",
"otherwise. For reference, see `input_fn` in\n",
"[census_main.py](https://github.com/tensorflow/models/tree/master/official/wide_deep/census_main.py).\n",
"\n",
"Next, let's take a look at the data and see which columns we can use to\n",
"predict the target label. "
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"adult.data adult.test\r\n"
]
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "wide.ipynb",
"version": "0.3.2",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
}
],
"source": [
"!ls /tmp/census_data/"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"train_file = \"/tmp/census_data/adult.data\"\n",
"test_file = \"/tmp/census_data/adult.test\""
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>age</th>\n",
" <th>workclass</th>\n",
" <th>fnlwgt</th>\n",
" <th>education</th>\n",
" <th>education_num</th>\n",
" <th>marital_status</th>\n",
" <th>occupation</th>\n",
" <th>relationship</th>\n",
" <th>race</th>\n",
" <th>gender</th>\n",
" <th>capital_gain</th>\n",
" <th>capital_loss</th>\n",
" <th>hours_per_week</th>\n",
" <th>native_country</th>\n",
" <th>income_bracket</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>39</td>\n",
" <td>State-gov</td>\n",
" <td>77516</td>\n",
" <td>Bachelors</td>\n",
" <td>13</td>\n",
" <td>Never-married</td>\n",
" <td>Adm-clerical</td>\n",
" <td>Not-in-family</td>\n",
" <td>White</td>\n",
" <td>Male</td>\n",
" <td>2174</td>\n",
" <td>0</td>\n",
" <td>40</td>\n",
" <td>United-States</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>50</td>\n",
" <td>Self-emp-not-inc</td>\n",
" <td>83311</td>\n",
" <td>Bachelors</td>\n",
" <td>13</td>\n",
" <td>Married-civ-spouse</td>\n",
" <td>Exec-managerial</td>\n",
" <td>Husband</td>\n",
" <td>White</td>\n",
" <td>Male</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>13</td>\n",
" <td>United-States</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>38</td>\n",
" <td>Private</td>\n",
" <td>215646</td>\n",
" <td>HS-grad</td>\n",
" <td>9</td>\n",
" <td>Divorced</td>\n",
" <td>Handlers-cleaners</td>\n",
" <td>Not-in-family</td>\n",
" <td>White</td>\n",
" <td>Male</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>40</td>\n",
" <td>United-States</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>53</td>\n",
" <td>Private</td>\n",
" <td>234721</td>\n",
" <td>11th</td>\n",
" <td>7</td>\n",
" <td>Married-civ-spouse</td>\n",
" <td>Handlers-cleaners</td>\n",
" <td>Husband</td>\n",
" <td>Black</td>\n",
" <td>Male</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>40</td>\n",
" <td>United-States</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>28</td>\n",
" <td>Private</td>\n",
" <td>338409</td>\n",
" <td>Bachelors</td>\n",
" <td>13</td>\n",
" <td>Married-civ-spouse</td>\n",
" <td>Prof-specialty</td>\n",
" <td>Wife</td>\n",
" <td>Black</td>\n",
" <td>Female</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>40</td>\n",
" <td>Cuba</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
"cells": [
{
"metadata": {
"id": "Zr7KpBhMcYvE",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"# TensorFlow Linear Model Tutorial\n",
"\n"
]
},
{
"metadata": {
"id": "77aETSYDcdoK",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"In this tutorial, we will use the `tf.estimator` API in TensorFlow to solve a\n",
"binary classification problem: Given census data about a person such as age,\n",
"education, marital status, and occupation (the features), we will try to predict\n",
"whether or not the person earns more than 50,000 dollars a year (the target\n",
"label). We will train a **logistic regression** model, and given an individual's\n",
"information our model will output a number between 0 and 1, which can be\n",
"interpreted as the probability that the individual has an annual income of over\n",
"50,000 dollars.\n",
"\n",
"## Setup\n",
"\n",
"To try the code for this tutorial:\n",
"\n",
"[Install TensorFlow](https://www.tensorflow.org/install) if you haven't already.\n",
"\n",
"Next, import the relevant packages:"
]
},
{
"metadata": {
"id": "NQgONe5ecYvE",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 17
},
"outputId": "7ab0889a-32f9-4ace-f848-6c808893b88c"
},
"cell_type": "code",
"source": [
"import tensorflow as tf\n",
"import tensorflow.feature_column as fc \n",
"tf.enable_eager_execution()\n",
"\n",
"import os\n",
"import sys\n",
"from IPython.display import clear_output"
],
"text/plain": [
" age workclass fnlwgt education education_num \\\n",
"0 39 State-gov 77516 Bachelors 13 \n",
"1 50 Self-emp-not-inc 83311 Bachelors 13 \n",
"2 38 Private 215646 HS-grad 9 \n",
"3 53 Private 234721 11th 7 \n",
"4 28 Private 338409 Bachelors 13 \n",
"\n",
" marital_status occupation relationship race gender \\\n",
"0 Never-married Adm-clerical Not-in-family White Male \n",
"1 Married-civ-spouse Exec-managerial Husband White Male \n",
"2 Divorced Handlers-cleaners Not-in-family White Male \n",
"3 Married-civ-spouse Handlers-cleaners Husband Black Male \n",
"4 Married-civ-spouse Prof-specialty Wife Black Female \n",
"\n",
" capital_gain capital_loss hours_per_week native_country income_bracket \n",
"0 2174 0 40 United-States <=50K \n",
"1 0 0 13 United-States <=50K \n",
"2 0 0 40 United-States <=50K \n",
"3 0 0 40 United-States <=50K \n",
"4 0 0 40 Cuba <=50K "
"execution_count": 1,
"outputs": []
},
{
"metadata": {
"id": "-MPr95UccYvL",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Download the [tutorial code from GitHub](https://github.com/tensorflow/models/tree/master/official/wide_deep/),\n",
" add the root directory to your Python path, and jump to the `wide_deep` directory:"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas\n",
"train_df = pandas.read_csv(train_file, header = None, names = census_dataset._CSV_COLUMNS)\n",
"test_df = pandas.read_csv(test_file, header = None, names = census_dataset._CSV_COLUMNS)\n",
"\n",
"train_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The columns can be grouped into two types, categorical\n",
"and continuous:\n",
"\n",
"* A column is called **categorical** if its value can only be one of the\n",
" categories in a finite set. For example, the relationship status of a person\n",
" (wife, husband, unmarried, etc.) or the education level (high school,\n",
" college, etc.) are categorical columns.\n",
"* A column is called **continuous** if its value can be any numerical value in\n",
" a continuous range. For example, the capital gain of a person (e.g. $14,084)\n",
" is a continuous column.\n",
"\n",
"Here's a list of columns available in the Census Income dataset:\n",
"\n",
"## Converting Data into Tensors\n",
"\n",
"When building a `tf.estimator` model, the input data is specified by means of an\n",
"input function or `input_fn`. This builder function returns a `tf.data.Dataset`\n",
"of batches of `(features-dict,label)` pairs. It will not be called until it is\n",
"later passed to `tf.estimator.Estimator` methods such as `train` and `evaluate`.\n",
"\n",
"In more detail, the input builder function returns the following as a pair:\n",
"\n",
"1. `features`: A dict from feature names to `Tensors` or\n",
" `SparseTensors` containing batches of features.\n",
"2. `labels`: A `Tensor` containing batches of labels.\n",
"\n",
"The keys of the `features` will be used to configure the model's input layer.\n",
"\n",
"Note that the input function will be called while\n",
"constructing the TensorFlow graph, not while running the graph. What it\n",
"returns is a representation of the input data as a sequence of TensorFlow graph\n",
"operations.\n",
"\n",
"For small problems like this, it's easy to make a `tf.data.Dataset` by slicing the `pandas.DataFrame`:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"def easy_input_function(df, label_key, num_epochs, shuffle, batch_size):\n",
" df = df.copy()\n",
" label = df.pop(label_key)\n",
" ds = tf.data.Dataset.from_tensor_slices((dict(df),label))\n",
"\n",
" if shuffle:\n",
" ds = ds.shuffle(10000)\n",
"\n",
" ds = ds.batch(batch_size).repeat(num_epochs)\n",
"\n",
" return ds"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we have eager execution enabled it is easy to inspect the resulting dataset:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Some feature keys: ['capital_gain', 'occupation', 'gender', 'capital_loss', 'workclass']\n",
"\n",
"A batch of Ages : tf.Tensor([61 18 37 47 47 32 18 23 28 37], shape=(10,), dtype=int32)\n",
"\n",
"A batch of Labels: tf.Tensor(\n",
"[b'>50K' b'<=50K' b'>50K' b'>50K' b'>50K' b'>50K' b'<=50K' b'<=50K'\n",
" b'<=50K' b'<=50K'], shape=(10,), dtype=string)\n"
]
}
],
"source": [
"ds = easy_input_function(train_df, label_key='income_bracket', num_epochs=5, shuffle=True, batch_size=10)\n",
"\n",
"for feature_batch, label_batch in ds:\n",
" break\n",
" \n",
"print('Some feature keys:', list(feature_batch.keys())[:5])\n",
"print()\n",
"print('A batch of Ages :', feature_batch['age'])\n",
"print()\n",
"print('A batch of Labels:', label_batch )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But this approach has severely limited scalability. Larger datasets should be streamed from disk.\n",
"The `census_dataset.input_fn` provides an example of how to do this using `tf.decode_csv` and `tf.data.TextLineDataset`: \n",
"\n",
"TODO(markdaoust): This `input_fn` should use `tf.contrib.data.make_csv_dataset`"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"def input_fn(data_file, num_epochs, shuffle, batch_size):\n",
" \"\"\"Generate an input function for the Estimator.\"\"\"\n",
" assert tf.gfile.Exists(data_file), (\n",
" '%s not found. Please make sure you have run census_dataset.py and '\n",
" 'set the --data_dir argument to the correct path.' % data_file)\n",
"\n",
" def parse_csv(value):\n",
" tf.logging.info('Parsing {}'.format(data_file))\n",
" columns = tf.decode_csv(value, record_defaults=_CSV_COLUMN_DEFAULTS)\n",
" features = dict(zip(_CSV_COLUMNS, columns))\n",
" labels = features.pop('income_bracket')\n",
" classes = tf.equal(labels, '>50K') # binary classification\n",
" return features, classes\n",
"\n",
" # Extract lines from input files using the Dataset API.\n",
" dataset = tf.data.TextLineDataset(data_file)\n",
"\n",
" if shuffle:\n",
" dataset = dataset.shuffle(buffer_size=_NUM_EXAMPLES['train'])\n",
"\n",
" dataset = dataset.map(parse_csv, num_parallel_calls=5)\n",
"\n",
" # We call repeat after shuffling, rather than before, to prevent separate\n",
" # epochs from blending together.\n",
" dataset = dataset.repeat(num_epochs)\n",
" dataset = dataset.batch(batch_size)\n",
" return dataset\n",
"\n"
]
}
],
"source": [
"import inspect\n",
"print(inspect.getsource(census_dataset.input_fn))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This `input_fn` gives equivalent output:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"INFO:tensorflow:Parsing /tmp/census_data/adult.data\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"WARNING: Logging before flag parsing goes to stderr.\n",
"I0711 14:47:26.362334 140466218788608 tf_logging.py:115] Parsing /tmp/census_data/adult.data\n"
]
}
],
"source": [
"ds = census_dataset.input_fn(train_file, num_epochs=5, shuffle=True, batch_size=10)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Feature keys: ['capital_gain', 'occupation', 'gender', 'capital_loss', 'workclass']\n",
"\n",
"Age batch : tf.Tensor([46 38 42 37 29 48 46 40 73 49], shape=(10,), dtype=int32)\n",
"\n",
"Label batch : tf.Tensor([False False False False False False False False True False], shape=(10,), dtype=bool)\n"
]
}
],
"source": [
"for feature_batch, label_batch in ds:\n",
" break\n",
" \n",
"print('Feature keys:', list(feature_batch.keys())[:5])\n",
"print()\n",
"print('Age batch :', feature_batch['age'])\n",
"print()\n",
"print('Label batch :', label_batch )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because `Estimators` expect an `input_fn` that takes no arguments, we typically wrap a configurable input function into an object with the expected signature. For this notebook, configure `train_inpf` to iterate over the data twice:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"import functools\n",
"train_inpf = functools.partial(census_dataset.input_fn, train_file, num_epochs=2, shuffle=True, batch_size=64)\n",
"test_inpf = functools.partial(census_dataset.input_fn, test_file, num_epochs=1, shuffle=False, batch_size=64)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Selecting and Engineering Features for the Model\n",
"\n",
"Estimators use a system called `feature_columns` to describe how the model\n",
"should interpret each of the raw input features. An Estimator expects a vector\n",
"of numeric inputs, and feature columns describe how the model should convert\n",
"each feature.\n",
"\n",
"Selecting and crafting the right set of feature columns is key to learning an\n",
"effective model. A **feature column** can be either one of the raw columns in\n",
"the original dataframe (let's call them **base feature columns**), or any new\n",
"columns created based on some transformations defined over one or multiple base\n",
"columns (let's call them **derived feature columns**). Basically, \"feature\n",
"column\" is an abstract concept of any raw or derived variable that can be used\n",
"to predict the target label.\n",
"\n",
"### Base Feature Columns\n",
"\n",
"#### Numeric columns\n",
"\n",
"The simplest `feature_column` is `numeric_column`. This indicates that a feature is a numeric value that should be input to the model directly. For example:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"age = fc.numeric_column('age')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model will use the `feature_column` definitions to build the model input. You can inspect the resulting output using the `input_layer` function:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"<tf.Tensor: id=237, shape=(10, 1), dtype=float32, numpy=\n",
"array([[46.],\n",
" [38.],\n",
" [42.],\n",
" [37.],\n",
" [29.],\n",
" [48.],\n",
" [46.],\n",
" [40.],\n",
" [73.],\n",
" [49.]], dtype=float32)>"
},
{
"metadata": {
"id": "yVvFyhnkcYvL",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 136
},
"outputId": "e57030d7-7f5c-455e-ea0f-55038e909d97"
},
"cell_type": "code",
"source": [
"if \"wide_deep\" not in os.getcwd():\n",
" ! git clone --depth 1 https://github.com/tensorflow/models\n",
" models_path = os.path.join(os.getcwd(), 'models')\n",
" sys.path.append(models_path) \n",
" # PYTHONPATH may be unset, so fall back to an empty string.\n",
" os.environ['PYTHONPATH'] = os.environ.get('PYTHONPATH', '') + os.pathsep + models_path\n",
" os.chdir(\"models/official/wide_deep\")"
],
"execution_count": 2,
"outputs": [
{
"output_type": "stream",
"text": [
"Cloning into 'models'...\n",
"remote: Counting objects: 2826, done.\u001b[K\n",
"remote: Compressing objects: 100% (2375/2375), done.\u001b[K\n",
"remote: Total 2826 (delta 543), reused 1731 (delta 382), pack-reused 0\u001b[K\n",
"Receiving objects: 100% (2826/2826), 371.22 MiB | 39.17 MiB/s, done.\n",
"Resolving deltas: 100% (543/543), done.\n",
"Checking out files: 100% (2934/2934), done.\n"
],
"name": "stdout"
}
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fc.input_layer(feature_batch, [age]).numpy()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following code will train and evaluate a model on only the `age` feature."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'precision': 0.29166666, 'auc_precision_recall': 0.31132147, 'average_loss': 0.5239897, 'label/mean': 0.23622628, 'auc': 0.6781367, 'loss': 33.4552, 'prediction/mean': 0.22513431, 'accuracy': 0.7631595, 'recall': 0.0018200728, 'global_step': 1018, 'accuracy_baseline': 0.76377374}\n"
]
}
],
"source": [
"classifier = tf.estimator.LinearClassifier(feature_columns=[age], n_classes=2)\n",
"classifier.train(train_inpf)\n",
"result = classifier.evaluate(test_inpf)\n",
"\n",
"clear_output()\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similarly, we can define a `NumericColumn` for each continuous feature column\n",
"that we want to use in the model:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"education_num = tf.feature_column.numeric_column('education_num')\n",
"capital_gain = tf.feature_column.numeric_column('capital_gain')\n",
"capital_loss = tf.feature_column.numeric_column('capital_loss')\n",
"hours_per_week = tf.feature_column.numeric_column('hours_per_week')"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"my_numeric_columns = [age,education_num, capital_gain, capital_loss, hours_per_week]"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<tf.Tensor: id=2160, shape=(10, 5), dtype=float32, numpy=\n",
"array([[4.600e+01, 0.000e+00, 0.000e+00, 6.000e+00, 4.000e+01],\n",
" [3.800e+01, 4.508e+03, 0.000e+00, 1.300e+01, 4.000e+01],\n",
" [4.200e+01, 0.000e+00, 0.000e+00, 1.400e+01, 4.000e+01],\n",
" [3.700e+01, 0.000e+00, 0.000e+00, 1.100e+01, 4.000e+01],\n",
" [2.900e+01, 0.000e+00, 0.000e+00, 9.000e+00, 4.000e+01],\n",
" [4.800e+01, 0.000e+00, 0.000e+00, 1.300e+01, 5.500e+01],\n",
" [4.600e+01, 0.000e+00, 0.000e+00, 9.000e+00, 5.000e+01],\n",
" [4.000e+01, 0.000e+00, 0.000e+00, 9.000e+00, 4.000e+01],\n",
" [7.300e+01, 6.418e+03, 0.000e+00, 4.000e+00, 9.900e+01],\n",
" [4.900e+01, 0.000e+00, 0.000e+00, 4.000e+00, 4.000e+01]],\n",
" dtype=float32)>"
},
{
"metadata": {
"id": "15Ethw-wcYvP",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Execute the data download script:"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fc.input_layer(feature_batch, my_numeric_columns).numpy()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You could retrain a model on these features just by changing the `feature_columns` argument to the constructor:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"accuracy: 0.7817087\n",
"accuracy_baseline: 0.76377374\n",
"auc: 0.8027547\n",
"auc_precision_recall: 0.5611528\n",
"average_loss: 1.0698086\n",
"global_step: 1018\n",
"label/mean: 0.23622628\n",
"loss: 68.30414\n",
"precision: 0.57025987\n",
"prediction/mean: 0.36397633\n",
"recall: 0.30811232\n"
]
}
],
"source": [
"classifier = tf.estimator.LinearClassifier(feature_columns=my_numeric_columns, n_classes=2)\n",
"classifier.train(train_inpf)\n",
"\n",
"result = classifier.evaluate(test_inpf)\n",
"\n",
"clear_output()\n",
"for key,value in sorted(result.items()):\n",
" print('%s: %s' % (key, value))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Categorical columns\n",
"\n",
"To define a feature column for a categorical feature, we can create a\n",
"`CategoricalColumn` using one of the `tf.feature_column.categorical_column*` functions.\n",
"\n",
"If you know the set of all possible feature values of a column and there are only a few of them, you can use `categorical_column_with_vocabulary_list`. Each key in the list will get assigned an auto-incremental ID starting from 0. For example, for the `relationship` column we can assign the feature string `Husband` to an integer ID of 0 and `Not-in-family` to 1, etc., by doing:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"relationship = fc.categorical_column_with_vocabulary_list(\n",
" 'relationship', [\n",
" 'Husband', 'Not-in-family', 'Wife', 'Own-child', 'Unmarried',\n",
" 'Other-relative'])\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This will create a sparse one-hot vector from the raw input feature.\n",
"\n",
"The `input_layer` function we're using for demonstration is designed for DNN models, and so expects dense inputs. To demonstrate the categorical column we must wrap it in a `tf.feature_column.indicator_column` to create the dense one-hot output (Linear `Estimators` can often skip this dense-step).\n",
"\n",
"Note: the other sparse-to-dense option is `tf.feature_column.embedding_column`.\n",
"\n",
"Run the input layer, configured with both the `age` and `relationship` columns:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"<tf.Tensor: id=4490, shape=(10, 7), dtype=float32, numpy=\n",
"array([[46., 0., 0., 0., 0., 1., 0.],\n",
" [38., 1., 0., 0., 0., 0., 0.],\n",
" [42., 0., 1., 0., 0., 0., 0.],\n",
" [37., 1., 0., 0., 0., 0., 0.],\n",
" [29., 1., 0., 0., 0., 0., 0.],\n",
" [48., 1., 0., 0., 0., 0., 0.],\n",
" [46., 1., 0., 0., 0., 0., 0.],\n",
" [40., 1., 0., 0., 0., 0., 0.],\n",
" [73., 1., 0., 0., 0., 0., 0.],\n",
" [49., 1., 0., 0., 0., 0., 0.]], dtype=float32)>"
},
{
"metadata": {
"id": "6QilS4-0cYvQ",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 17
},
"outputId": "3faf2df7-677e-4a91-c09b-3d81ca30c9c1"
},
"cell_type": "code",
"source": [
"import census_dataset\n",
"import census_main\n",
"\n",
"census_dataset.download(\"/tmp/census_data/\")"
],
"execution_count": 3,
"outputs": []
},
{
"metadata": {
"id": "cD5e3ibAcYvS",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"To train the model described in this tutorial, run the tutorial code from the command line with the following command:"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fc.input_layer(feature_batch, [age, fc.indicator_column(relationship)])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What if we don't know the set of possible values in advance? Not a problem. We\n",
"can use `categorical_column_with_hash_bucket` instead:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"occupation = tf.feature_column.categorical_column_with_hash_bucket(\n",
" 'occupation', hash_bucket_size=1000)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What will happen is that each possible value in the feature column `occupation`\n",
"will be hashed to an integer ID as we encounter them in training. The example batch has a few different occupations:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Machine-op-inspct\n",
"Transport-moving\n",
"Prof-specialty\n",
"Adm-clerical\n",
"Handlers-cleaners\n",
"Prof-specialty\n",
"Other-service\n",
"Farming-fishing\n",
"Farming-fishing\n",
"Handlers-cleaners\n"
]
}
],
"source": [
"for item in feature_batch['occupation'].numpy():\n",
" print(item.decode())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we run `input_layer` with the hashed column, we see that the output shape is `(batch_size, hash_bucket_size)`:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(10, 1000)"
},
{
"metadata": {
"id": "vbJ8jPAhcYvT",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "cc0182c0-90d7-4f9c-b421-0dd67166c6d2"
},
"cell_type": "code",
"source": [
"output = !python -m census_main --model_type=wide --train_epochs=2\n",
"print([line for line in output if 'accuracy:' in line])"
],
"execution_count": 4,
"outputs": [
{
"output_type": "stream",
"text": [
"['I0711 22:27:15.442501 140285526747008 tf_logging.py:115] accuracy: 0.8360666']\n"
],
"name": "stdout"
}
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"occupation_result = fc.input_layer(feature_batch, [fc.indicator_column(occupation)])\n",
"\n",
"occupation_result.numpy().shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's easier to see the actual results if we take the `tf.argmax` over the `hash_bucket_size` dimension.\n",
"\n",
"In the output below, note how any duplicate occupations are mapped to the same pseudo-random index:\n",
"\n",
"Note: Hash collisions are unavoidable, but often have minimal impact on model quality. The effect may be noticeable if the hash buckets are being used to compress the input space. See [this notebook](https://colab.research.google.com/github/tensorflow/models/blob/master/samples/outreach/blogs/housing_prices.ipynb) for a more visual example of the effect of these hash collisions."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([911, 420, 979, 96, 10, 979, 527, 936, 936, 10])"
},
{
"metadata": {
"id": "AmZ4CpaOcYvV",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Read on to find out how this code builds its linear model.\n",
"\n",
"## Reading The Census Data\n",
"\n",
"The dataset we're using is the\n",
"[Census Income Dataset](https://archive.ics.uci.edu/ml/datasets/Census+Income).\n",
"We have provided\n",
"[census_dataset.py](https://github.com/tensorflow/models/tree/master/official/wide_deep/census_dataset.py)\n",
"which downloads the code and performs some additional cleanup.\n",
"\n",
"Since the task is a binary classification problem, we'll construct a label\n",
"column named \"label\" whose value is 1 if the income is over 50K, and 0\n",
"otherwise. For reference, see `input_fn` in\n",
"[census_main.py](https://github.com/tensorflow/models/tree/master/official/wide_deep/census_main.py).\n",
"\n",
"Next, let's take a look at the data and see which columns we can use to\n",
"predict the target label. "
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tf.argmax(occupation_result, axis=1).numpy()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"No matter which way we choose to define a `SparseColumn`, each feature string\n",
"will be mapped into an integer ID by looking up a fixed mapping or by hashing.\n",
"Under the hood, the `LinearModel` class is responsible for\n",
"managing the mapping and creating `tf.Variable` to store the model parameters\n",
"(also known as model weights) for each feature ID. The model parameters will be\n",
"learned through the model training process we'll go through later.\n",
"\n",
"We'll do the similar trick to define the other categorical features:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"education = tf.feature_column.categorical_column_with_vocabulary_list(\n",
" 'education', [\n",
" 'Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',\n",
" 'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',\n",
" '5th-6th', '10th', '1st-4th', 'Preschool', '12th'])\n",
"\n",
"marital_status = tf.feature_column.categorical_column_with_vocabulary_list(\n",
" 'marital_status', [\n",
" 'Married-civ-spouse', 'Divorced', 'Married-spouse-absent',\n",
" 'Never-married', 'Separated', 'Married-AF-spouse', 'Widowed'])\n",
"\n",
"workclass = tf.feature_column.categorical_column_with_vocabulary_list(\n",
" 'workclass', [\n",
" 'Self-emp-not-inc', 'Private', 'State-gov', 'Federal-gov',\n",
" 'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked'])\n"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"my_categorical_columns = [relationship, occupation, education, marital_status, workclass]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's easy to use both sets of columns to configure a model that uses all these features:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"accuracy: 0.83342546\n",
"accuracy_baseline: 0.76377374\n",
"auc: 0.8807037\n",
"auc_precision_recall: 0.6601031\n",
"average_loss: 0.8671454\n",
"global_step: 1018\n",
"label/mean: 0.23622628\n",
"loss: 55.36468\n",
"precision: 0.6496042\n",
"prediction/mean: 0.2628341\n",
"recall: 0.6401456\n"
]
}
],
"source": [
"classifier = tf.estimator.LinearClassifier(feature_columns=my_numeric_columns+my_categorical_columns, n_classes=2)\n",
"classifier.train(train_inpf)\n",
"result = classifier.evaluate(test_inpf)\n",
"\n",
"clear_output()\n",
"for key,value in sorted(result.items()):\n",
" print('%s: %s' % (key, value))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Derived feature columns\n",
"\n",
"#### Making Continuous Features Categorical through Bucketization\n",
"\n",
"Sometimes the relationship between a continuous feature and the label is not\n",
"linear. As a hypothetical example, a person's income may grow with age in the\n",
"early stage of one's career, then the growth may slow at some point, and finally\n",
"the income decreases after retirement. In this scenario, using the raw `age` as\n",
"a real-valued feature column might not be a good choice because the model can\n",
"only learn one of the three cases:\n",
"\n",
"1. Income always increases at some rate as age grows (positive correlation),\n",
"1. Income always decreases at some rate as age grows (negative correlation), or\n",
"1. Income stays the same no matter at what age (no correlation)\n",
"\n",
"If we want to learn the fine-grained correlation between income and each age\n",
"group separately, we can leverage **bucketization**. Bucketization is a process\n",
"of dividing the entire range of a continuous feature into a set of consecutive\n",
"bins/buckets, and then converting the original numerical feature into a bucket\n",
"ID (as a categorical feature) depending on which bucket that value falls into.\n",
"So, we can define a `bucketized_column` over `age` as:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"age_buckets = tf.feature_column.bucketized_column(\n",
" age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"where the `boundaries` is a list of bucket boundaries. In this case, there are\n",
"10 boundaries, resulting in 11 age group buckets (from age 17 and below, 18-24,\n",
"25-29, ..., to 65 and over).\n",
"\n",
"With bucketing, the model sees each bucket a one-hot feature:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[46., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],\n",
" [38., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],\n",
" [42., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],\n",
" [37., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],\n",
" [29., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],\n",
" [48., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],\n",
" [46., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],\n",
" [40., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],\n",
" [73., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],\n",
" [49., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.]],\n",
" dtype=float32)"
},
{
"metadata": {
"id": "N6Tgye8bcYvX",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "75152d8d-6afa-4e4e-cc0e-3eac7127f8fd"
},
"cell_type": "code",
"source": [
"!ls /tmp/census_data/"
],
"execution_count": 5,
"outputs": [
{
"output_type": "stream",
"text": [
"adult.data adult.test\r\n"
],
"name": "stdout"
}
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fc.input_layer(feature_batch, [age, age_buckets]).numpy()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Learn complex relationships with crossed column\n",
"\n",
"Using each base feature column separately may not be enough to explain the data.\n",
"For example, the correlation between education and the label (earning > 50,000\n",
"dollars) may be different for different occupations. Therefore, if we only learn\n",
"a single model weight for `education=\"Bachelors\"` and `education=\"Masters\"`, we\n",
"won't be able to capture every single education-occupation combination (e.g.\n",
"distinguishing between `education=\"Bachelors\" AND occupation=\"Exec-managerial\"`\n",
"and `education=\"Bachelors\" AND occupation=\"Craft-repair\"`). To learn the\n",
"differences between different feature combinations, we can add **crossed feature\n",
"columns** to the model."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"education_x_occupation = tf.feature_column.crossed_column(\n",
" ['education', 'occupation'], hash_bucket_size=1000)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also create a `crossed_column` over more than two columns. Each\n",
"constituent column can be either a base feature column that is categorical\n",
"(`SparseColumn`), a bucketized real-valued feature column, or even another\n",
"`CrossColumn`. Here's an example:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"age_buckets_x_education_x_occupation = tf.feature_column.crossed_column(\n",
" [age_buckets, 'education', 'occupation'], hash_bucket_size=1000)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These crossed columns always use hash buckets to avoid the exponential explosion in the number of categories, and put the control over number of model weights in the hands of the user.\n",
"\n",
"For a visual example the effect of hash-buckets with crossed columns see [this notebook](https://colab.research.google.com/github/tensorflow/models/blob/master/samples/outreach/blogs/housing_prices.ipynb)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Defining The Logistic Regression Model\n",
"\n",
"After processing the input data and defining all the feature columns, we're now\n",
"ready to put them all together and build a Logistic Regression model. In the\n",
"previous section we've seen several types of base and derived feature columns,\n",
"including:\n",
"\n",
"* `CategoricalColumn`\n",
"* `NumericColumn`\n",
"* `BucketizedColumn`\n",
"* `CrossedColumn`\n",
"\n",
"All of these are subclasses of the abstract `FeatureColumn` class, and can be\n",
"added to the `feature_columns` field of a model:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"INFO:tensorflow:Using default config.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"I0711 14:48:54.071429 140466218788608 tf_logging.py:115] Using default config.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"INFO:tensorflow:Using config: {'_global_id_in_cluster': 0, '_is_chief': True, '_keep_checkpoint_every_n_hours': 10000, '_tf_random_seed': None, '_num_worker_replicas': 1, '_device_fn': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fc03341f668>, '_evaluation_master': '', '_train_distribute': None, '_model_dir': '/tmp/tmpligbanno', '_session_config': None, '_save_checkpoints_steps': None, '_master': '', '_num_ps_replicas': 0, '_task_type': 'worker', '_log_step_count_steps': 100, '_save_summary_steps': 100, '_service': None, '_task_id': 0, '_save_checkpoints_secs': 600, '_keep_checkpoint_max': 5}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"I0711 14:48:54.073915 140466218788608 tf_logging.py:115] Using config: {'_global_id_in_cluster': 0, '_is_chief': True, '_keep_checkpoint_every_n_hours': 10000, '_tf_random_seed': None, '_num_worker_replicas': 1, '_device_fn': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fc03341f668>, '_evaluation_master': '', '_train_distribute': None, '_model_dir': '/tmp/tmpligbanno', '_session_config': None, '_save_checkpoints_steps': None, '_master': '', '_num_ps_replicas': 0, '_task_type': 'worker', '_log_step_count_steps': 100, '_save_summary_steps': 100, '_service': None, '_task_id': 0, '_save_checkpoints_secs': 600, '_keep_checkpoint_max': 5}\n"
]
}
],
"source": [
"import tempfile\n",
"\n",
"base_columns = [\n",
" education, marital_status, relationship, workclass, occupation,\n",
" age_buckets,\n",
"]\n",
"crossed_columns = [\n",
" tf.feature_column.crossed_column(\n",
" ['education', 'occupation'], hash_bucket_size=1000),\n",
" tf.feature_column.crossed_column(\n",
" [age_buckets, 'education', 'occupation'], hash_bucket_size=1000),\n",
"]\n",
"\n",
"model_dir = tempfile.mkdtemp()\n",
"model = tf.estimator.LinearClassifier(\n",
" model_dir=model_dir, feature_columns=base_columns + crossed_columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model also automatically learns a bias term, which controls the prediction\n",
"one would make without observing any features (see the section [How Logistic\n",
"Regression Works](#how_it_works) for more explanations). The learned model files will be stored\n",
"in `model_dir`.\n",
"\n",
"## Training and evaluating our model\n",
"\n",
"After adding all the features to the model, now let's look at how to actually\n",
"train the model. Training a model is just a single command using the\n",
"tf.estimator API:"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"model.train(train_inpf)\n",
"clear_output()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After the model is trained, we can evaluate how good our model is at predicting\n",
"the labels of the holdout data:"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"accuracy: 0.84\n",
"accuracy_baseline: 0.76\n",
"auc: 0.88\n",
"auc_precision_recall: 0.70\n",
"average_loss: 0.35\n",
"global_step: 1018.00\n",
"label/mean: 0.24\n",
"loss: 22.37\n",
"precision: 0.69\n",
"prediction/mean: 0.24\n",
"recall: 0.57\n"
]
}
],
"source": [
"results = model.evaluate(test_inpf)\n",
"clear_output()\n",
"for key in sorted(results):\n",
" print('%s: %0.2f' % (key, results[key]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first line of the final output should be something like\n",
"`accuracy: 0.83`, which means the accuracy is 83%. Feel free to try more\n",
"features and transformations and see if you can do even better!\n",
"\n",
"After the model is evaluated, we can use the model to predict whether an individual has an annual income of over\n",
"50,000 dollars given an individual's information input.\n",
"\n",
"Let's look in more detail how the model did:"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>income_bracket</th>\n",
" <th>predicted_class</th>\n",
" <th>correct</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>&gt;50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>&gt;50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>&gt;50K</td>\n",
" <td>&gt;50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>&gt;50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&gt;50K</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>&gt;50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>&gt;50K</td>\n",
" <td>&gt;50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>&gt;50K</td>\n",
" <td>&gt;50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
},
{
"metadata": {
"id": "6y3mj9zKcYva",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 17
},
"outputId": "3b44b7dd-5a2d-4943-eb19-20f26d5c7098"
},
"cell_type": "code",
"source": [
"train_file = \"/tmp/census_data/adult.data\"\n",
"test_file = \"/tmp/census_data/adult.test\""
],
"text/plain": [
" income_bracket predicted_class correct\n",
"0 <=50K <=50K True\n",
"1 <=50K <=50K True\n",
"2 >50K <=50K False\n",
"3 >50K <=50K False\n",
"4 <=50K <=50K True\n",
"5 <=50K <=50K True\n",
"6 <=50K <=50K True\n",
"7 >50K >50K True\n",
"8 <=50K <=50K True\n",
"9 <=50K <=50K True\n",
"10 >50K <=50K False\n",
"11 <=50K >50K False\n",
"12 <=50K <=50K True\n",
"13 <=50K <=50K True\n",
"14 >50K <=50K False\n",
"15 >50K >50K True\n",
"16 <=50K <=50K True\n",
"17 <=50K <=50K True\n",
"18 <=50K <=50K True\n",
"19 >50K >50K True"
"execution_count": 6,
"outputs": []
},
{
"metadata": {
"id": "vkn1FNmpcYvb",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"outputId": "4e27b186-b76c-4f19-ea9d-abe19110e93b"
},
"cell_type": "code",
"source": [
"import pandas\n",
"train_df = pandas.read_csv(train_file, header = None, names = census_dataset._CSV_COLUMNS)\n",
"test_df = pandas.read_csv(test_file, header = None, names = census_dataset._CSV_COLUMNS)\n",
"\n",
"train_df.head()"
],
"execution_count": 7,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>age</th>\n",
" <th>workclass</th>\n",
" <th>fnlwgt</th>\n",
" <th>education</th>\n",
" <th>education_num</th>\n",
" <th>marital_status</th>\n",
" <th>occupation</th>\n",
" <th>relationship</th>\n",
" <th>race</th>\n",
" <th>gender</th>\n",
" <th>capital_gain</th>\n",
" <th>capital_loss</th>\n",
" <th>hours_per_week</th>\n",
" <th>native_country</th>\n",
" <th>income_bracket</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>39</td>\n",
" <td>State-gov</td>\n",
" <td>77516</td>\n",
" <td>Bachelors</td>\n",
" <td>13</td>\n",
" <td>Never-married</td>\n",
" <td>Adm-clerical</td>\n",
" <td>Not-in-family</td>\n",
" <td>White</td>\n",
" <td>Male</td>\n",
" <td>2174</td>\n",
" <td>0</td>\n",
" <td>40</td>\n",
" <td>United-States</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>50</td>\n",
" <td>Self-emp-not-inc</td>\n",
" <td>83311</td>\n",
" <td>Bachelors</td>\n",
" <td>13</td>\n",
" <td>Married-civ-spouse</td>\n",
" <td>Exec-managerial</td>\n",
" <td>Husband</td>\n",
" <td>White</td>\n",
" <td>Male</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>13</td>\n",
" <td>United-States</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>38</td>\n",
" <td>Private</td>\n",
" <td>215646</td>\n",
" <td>HS-grad</td>\n",
" <td>9</td>\n",
" <td>Divorced</td>\n",
" <td>Handlers-cleaners</td>\n",
" <td>Not-in-family</td>\n",
" <td>White</td>\n",
" <td>Male</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>40</td>\n",
" <td>United-States</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>53</td>\n",
" <td>Private</td>\n",
" <td>234721</td>\n",
" <td>11th</td>\n",
" <td>7</td>\n",
" <td>Married-civ-spouse</td>\n",
" <td>Handlers-cleaners</td>\n",
" <td>Husband</td>\n",
" <td>Black</td>\n",
" <td>Male</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>40</td>\n",
" <td>United-States</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>28</td>\n",
" <td>Private</td>\n",
" <td>338409</td>\n",
" <td>Bachelors</td>\n",
" <td>13</td>\n",
" <td>Married-civ-spouse</td>\n",
" <td>Prof-specialty</td>\n",
" <td>Wife</td>\n",
" <td>Black</td>\n",
" <td>Female</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>40</td>\n",
" <td>Cuba</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" age workclass fnlwgt education education_num \\\n",
"0 39 State-gov 77516 Bachelors 13 \n",
"1 50 Self-emp-not-inc 83311 Bachelors 13 \n",
"2 38 Private 215646 HS-grad 9 \n",
"3 53 Private 234721 11th 7 \n",
"4 28 Private 338409 Bachelors 13 \n",
"\n",
" marital_status occupation relationship race gender \\\n",
"0 Never-married Adm-clerical Not-in-family White Male \n",
"1 Married-civ-spouse Exec-managerial Husband White Male \n",
"2 Divorced Handlers-cleaners Not-in-family White Male \n",
"3 Married-civ-spouse Handlers-cleaners Husband Black Male \n",
"4 Married-civ-spouse Prof-specialty Wife Black Female \n",
"\n",
" capital_gain capital_loss hours_per_week native_country income_bracket \n",
"0 2174 0 40 United-States <=50K \n",
"1 0 0 13 United-States <=50K \n",
"2 0 0 40 United-States <=50K \n",
"3 0 0 40 United-States <=50K \n",
"4 0 0 40 Cuba <=50K "
]
},
"metadata": {
"tags": []
},
"execution_count": 7
}
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"predict_df = test_df[:20].copy()\n",
"\n",
"pred_iter = model.predict(\n",
" lambda:easy_input_function(predict_df, label_key='income_bracket',\n",
" num_epochs=1, shuffle=False, batch_size=10))\n",
"\n",
"classes = np.array(['<=50K', '>50K'])\n",
"pred_class_id = []\n",
"for pred_dict in pred_iter:\n",
" pred_class_id.append(pred_dict['class_ids'])\n",
"\n",
"predict_df['predicted_class'] = classes[np.array(pred_class_id)]\n",
"predict_df['correct'] = predict_df['predicted_class'] == predict_df['income_bracket']\n",
"\n",
"clear_output()\n",
"predict_df[['income_bracket','predicted_class', 'correct']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you'd like to see a working end-to-end example, you can download our\n",
"[example code](https://github.com/tensorflow/models/tree/master/official/wide_deep/census_main.py)\n",
"and set the `model_type` flag to `wide`.\n",
"\n",
"## Adding Regularization to Prevent Overfitting\n",
"\n",
"Regularization is a technique used to avoid **overfitting**. Overfitting happens\n",
"when your model does well on the data it is trained on, but worse on test data\n",
"that the model has not seen before, such as live traffic. Overfitting generally\n",
"occurs when a model is excessively complex, such as having too many parameters\n",
"relative to the number of observed training data. Regularization allows for you\n",
"to control your model's complexity and makes the model more generalizable to\n",
"unseen data.\n",
"\n",
"In the Linear Model library, you can add L1 and L2 regularizations to the model\n",
"as:"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"accuracy: 0.84\n",
"accuracy_baseline: 0.76\n",
"auc: 0.89\n",
"auc_precision_recall: 0.70\n",
"average_loss: 0.35\n",
"global_step: 2036.00\n",
"label/mean: 0.24\n",
"loss: 22.29\n",
"precision: 0.69\n",
"prediction/mean: 0.24\n",
"recall: 0.56\n"
]
},
{
"metadata": {
"id": "QZZtXes4cYvf",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"The columns can be grouped into two types—categorical\n",
"and continuous columns:\n",
"\n",
"* A column is called **categorical** if its value can only be one of the\n",
" categories in a finite set. For example, the relationship status of a person\n",
" (wife, husband, unmarried, etc.) or the education level (high school,\n",
" college, etc.) are categorical columns.\n",
"* A column is called **continuous** if its value can be any numerical value in\n",
" a continuous range. For example, the capital gain of a person (e.g. $14,084)\n",
" is a continuous column.\n",
"\n",
"Here's a list of columns available in the Census Income dataset:\n",
"\n",
"## Converting Data into Tensors\n",
"\n",
"When building a tf.estimator model, the input data is specified by means of an\n",
"input function or `input_fn`. This builder function returns a `tf.data.Dataset`\n",
"of batches of `(features-dict,label)` pairs. It will not be called until it is\n",
"later passed to `tf.estimator.Estimator` methods such as `train` and `evaluate`.\n",
"\n",
"In more detail, the input builder function returns the following as a pair:\n",
"\n",
"1. `features`: A dict from feature names to `Tensors` or\n",
" `SparseTensors` containing batches of features.\n",
"2. `labels`: A `Tensor` containing batches of labels.\n",
"\n",
"The keys of the `features` will be used to configure the model's input layer.\n",
"\n",
"Note that the input function will be called while\n",
"constructing the TensorFlow graph, not while running the graph. What it is\n",
"returning is a representation of the input data as sequence of tensorflow graph\n",
"operations.\n",
"\n",
"For small problems like this it's easy to make a `tf.data.Dataset` by slicing the `pandas.DataFrame`:"
]
},
{
"metadata": {
"id": "N7zNJflKcYvg",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 17
},
"outputId": "4aebe747-0fca-4209-cf28-3164080ab89f"
},
"cell_type": "code",
"source": [
"def easy_input_function(df, label_key, num_epochs, shuffle, batch_size):\n",
" df = df.copy()\n",
" label = df.pop(label_key)\n",
" ds = tf.data.Dataset.from_tensor_slices((dict(df),label))\n",
"\n",
" if shuffle:\n",
" ds = ds.shuffle(10000)\n",
"\n",
" ds = ds.batch(batch_size).repeat(num_epochs)\n",
"\n",
" return ds"
],
"execution_count": 8,
"outputs": []
},
{
"metadata": {
"id": "WeEgNR9AcYvh",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Since we have eager execution enabled it is easy to inspect the resulting dataset:"
]
},
{
"metadata": {
"id": "ygaKuikecYvi",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 136
},
"outputId": "071665a2-d23f-4c15-da43-ce0d106d473f"
},
"cell_type": "code",
"source": [
"ds = easy_input_function(train_df, label_key='income_bracket', num_epochs=5, shuffle=True, batch_size=10)\n",
"\n",
"for feature_batch, label_batch in ds:\n",
" break\n",
" \n",
"print('Some feature keys:', list(feature_batch.keys())[:5])\n",
"print()\n",
"print('A batch of Ages :', feature_batch['age'])\n",
"print()\n",
"print('A batch of Labels:', label_batch )"
],
"execution_count": 9,
"outputs": [
{
"output_type": "stream",
"text": [
"Some feature keys: ['age', 'workclass', 'fnlwgt', 'education', 'education_num']\n",
"\n",
"A batch of Ages : tf.Tensor([52 57 31 33 34 22 32 66 35 44], shape=(10,), dtype=int32)\n",
"\n",
"A batch of Labels: tf.Tensor(\n",
"[b'<=50K' b'<=50K' b'<=50K' b'<=50K' b'<=50K' b'<=50K' b'<=50K' b'<=50K'\n",
" b'<=50K' b'>50K'], shape=(10,), dtype=string)\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "O_KZxQUucYvm",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"But this approach has severly-limited scalability. For larger data it should be streamed off disk.\n",
"the `census_dataset.input_fn` provides an example of how to do this using `tf.decode_csv` and `tf.data.TextLineDataset`: \n",
"\n",
"TODO(markdaoust): This `input_fn` should use `tf.contrib.data.make_csv_dataset`"
]
},
{
"metadata": {
"id": "vUTeXaEUcYvn",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 493
},
"outputId": "2da7413a-5e54-4e86-f3c5-07387156ab79"
},
"cell_type": "code",
"source": [
"import inspect\n",
"print(inspect.getsource(census_dataset.input_fn))"
],
"execution_count": 10,
"outputs": [
{
"output_type": "stream",
"text": [
"def input_fn(data_file, num_epochs, shuffle, batch_size):\n",
" \"\"\"Generate an input function for the Estimator.\"\"\"\n",
" assert tf.gfile.Exists(data_file), (\n",
" '%s not found. Please make sure you have run census_dataset.py and '\n",
" 'set the --data_dir argument to the correct path.' % data_file)\n",
"\n",
" def parse_csv(value):\n",
" tf.logging.info('Parsing {}'.format(data_file))\n",
" columns = tf.decode_csv(value, record_defaults=_CSV_COLUMN_DEFAULTS)\n",
" features = dict(zip(_CSV_COLUMNS, columns))\n",
" labels = features.pop('income_bracket')\n",
" classes = tf.equal(labels, '>50K') # binary classification\n",
" return features, classes\n",
"\n",
" # Extract lines from input files using the Dataset API.\n",
" dataset = tf.data.TextLineDataset(data_file)\n",
"\n",
" if shuffle:\n",
" dataset = dataset.shuffle(buffer_size=_NUM_EXAMPLES['train'])\n",
"\n",
" dataset = dataset.map(parse_csv, num_parallel_calls=5)\n",
"\n",
" # We call repeat after shuffling, rather than before, to prevent separate\n",
" # epochs from blending together.\n",
" dataset = dataset.repeat(num_epochs)\n",
" dataset = dataset.batch(batch_size)\n",
" return dataset\n",
"\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "yyGcv_e-cYvq",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"This input_fn gives equivalent output:"
]
},
{
"metadata": {
"id": "DlsqRZS5cYvr",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 68
},
"outputId": "31dee63f-80f7-4c7e-f749-a5531d33ab95"
},
"cell_type": "code",
"source": [
"ds = census_dataset.input_fn(train_file, num_epochs=5, shuffle=True, batch_size=10)"
],
"execution_count": 11,
"outputs": [
{
"output_type": "stream",
"text": [
"INFO:tensorflow:Parsing /tmp/census_data/adult.data\n"
],
"name": "stdout"
},
{
"output_type": "stream",
"text": [
"WARNING: Logging before flag parsing goes to stderr.\n",
"I0711 22:27:19.570451 140174775953280 tf_logging.py:115] Parsing /tmp/census_data/adult.data\n"
],
"name": "stderr"
}
]
},
{
"metadata": {
"id": "Mv3as_CEcYvu",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 102
},
"outputId": "3834b00d-9655-488f-d6d2-8d7405848d78"
},
"cell_type": "code",
"source": [
"for feature_batch, label_batch in ds:\n",
" break\n",
" \n",
"print('Feature keys:', list(feature_batch.keys())[:5])\n",
"print()\n",
"print('Age batch :', feature_batch['age'])\n",
"print()\n",
"print('Label batch :', label_batch )"
],
"execution_count": 12,
"outputs": [
{
"output_type": "stream",
"text": [
"Feature keys: ['age', 'workclass', 'fnlwgt', 'education', 'education_num']\n",
"\n",
"Age batch : tf.Tensor([31 88 36 46 20 51 30 40 31 49], shape=(10,), dtype=int32)\n",
"\n",
"Label batch : tf.Tensor([False False True True False True True False False True], shape=(10,), dtype=bool)\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "810fnfY5cYvz",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Because `Estimators` expect an `input_fn` that takes no arguments, we typically wrap configurable input function into an obejct with the expected signature. For this notebook configure the `train_inpf` to iterate over the data twice:"
]
},
{
"metadata": {
"id": "wnQdpEcVcYv0",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 17
},
"outputId": "b9050d80-e603-4363-dbe9-11c2b368e29d"
},
"cell_type": "code",
"source": [
"import functools\n",
"train_inpf = functools.partial(census_dataset.input_fn, train_file, num_epochs=2, shuffle=True, batch_size=64)\n",
"test_inpf = functools.partial(census_dataset.input_fn, test_file, num_epochs=1, shuffle=False, batch_size=64)"
],
"execution_count": 13,
"outputs": []
},
{
"metadata": {
"id": "pboNpNWhcYv4",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## Selecting and Engineering Features for the Model\n",
"\n",
"Estimators use a system called `feature_columns` to describe how the model\n",
"should interpret each of the raw input features. An Estimator expects a vector\n",
"of numeric inputs, and feature columns describe how the model should convert\n",
"each feature.\n",
"\n",
"Selecting and crafting the right set of feature columns is key to learning an\n",
"effective model. A **feature column** can be either one of the raw columns in\n",
"the original dataframe (let's call them **base feature columns**), or any new\n",
"columns created based on some transformations defined over one or multiple base\n",
"columns (let's call them **derived feature columns**). Basically, \"feature\n",
"column\" is an abstract concept of any raw or derived variable that can be used\n",
"to predict the target label.\n",
"\n",
"### Base Feature Columns\n",
"\n",
"#### Numeric columns\n",
"\n",
"The simplest `feature_column` is `numeric_column`. This indicates that a feature is a numeric value that should be input to the model directly. For example:"
]
},
{
"metadata": {
"id": "ZX0r2T5OcYv6",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 17
},
"outputId": "283bf438-2a96-4bf3-fa89-94da99f93927"
},
"cell_type": "code",
"source": [
"age = fc.numeric_column('age')"
],
"execution_count": 14,
"outputs": []
},
{
"metadata": {
"id": "tnLUiaHxcYv-",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"The model will use the `feature_column` definitions to build the model input. You can inspect the resulting output using the `input_layer` function:"
]
},
{
"metadata": {
"id": "kREtIPfwcYv_",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 187
},
"outputId": "197a798b-9809-45e1-a8d4-ed5d237eea9d"
},
"cell_type": "code",
"source": [
"fc.input_layer(feature_batch, [age]).numpy()"
],
"execution_count": 15,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([[31.],\n",
" [88.],\n",
" [36.],\n",
" [46.],\n",
" [20.],\n",
" [51.],\n",
" [30.],\n",
" [40.],\n",
" [31.],\n",
" [49.]], dtype=float32)"
]
},
"metadata": {
"tags": []
},
"execution_count": 15
}
]
},
{
"metadata": {
"id": "OPuLduCucYwD",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"The following code will train and evaluate a model on only the `age` feature."
]
},
{
"metadata": {
"id": "9R5eSJ1pcYwE",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 54
},
"outputId": "ea791197-8300-4f31-cee1-f7d1b8209838"
},
"cell_type": "code",
"source": [
"classifier = tf.estimator.LinearClassifier(feature_columns=[age], n_classes=2)\n",
"classifier.train(train_inpf)\n",
"result = classifier.evaluate(test_inpf)\n",
"\n",
"clear_output()\n",
"print(result)"
],
"execution_count": 16,
"outputs": [
{
"output_type": "stream",
"text": [
"{'accuracy': 0.76334375, 'accuracy_baseline': 0.76377374, 'auc': 0.67818105, 'auc_precision_recall': 0.31133735, 'average_loss': 0.52437353, 'label/mean': 0.23622628, 'loss': 33.479706, 'precision': 0.31578946, 'prediction/mean': 0.22410269, 'recall': 0.0015600624, 'global_step': 1018}\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "YDZGcdTdcYwI",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Similarly, we can define a `NumericColumn` for each continuous feature column\n",
"that we want to use in the model:"
]
},
{
"metadata": {
"id": "uqPbUqlxcYwJ",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 17
},
"outputId": "68f4ccfd-d71b-4327-b8e8-25c40e986bed"
},
"cell_type": "code",
"source": [
"education_num = tf.feature_column.numeric_column('education_num')\n",
"capital_gain = tf.feature_column.numeric_column('capital_gain')\n",
"capital_loss = tf.feature_column.numeric_column('capital_loss')\n",
"hours_per_week = tf.feature_column.numeric_column('hours_per_week')"
],
"execution_count": 17,
"outputs": []
},
{
"metadata": {
"id": "yqCF0a4DcYwM",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 17
},
"outputId": "0f9097a4-bc79-4e67-bd63-6a4d4461736d"
},
"cell_type": "code",
"source": [
"my_numeric_columns = [age,education_num, capital_gain, capital_loss, hours_per_week]"
],
"execution_count": 18,
"outputs": []
},
{
"metadata": {
"id": "xDrZtAZ0cYwO",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"outputId": "6fd558ea-9f0c-4deb-cb8a-6211ec233016"
},
"cell_type": "code",
"source": [
"fc.input_layer(feature_batch, my_numeric_columns).numpy()"
],
"execution_count": 19,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([[3.1000e+01, 0.0000e+00, 0.0000e+00, 1.4000e+01, 4.3000e+01],\n",
" [8.8000e+01, 0.0000e+00, 0.0000e+00, 1.5000e+01, 4.0000e+01],\n",
" [3.6000e+01, 1.5024e+04, 0.0000e+00, 9.0000e+00, 4.0000e+01],\n",
" [4.6000e+01, 0.0000e+00, 0.0000e+00, 1.4000e+01, 5.5000e+01],\n",
" [2.0000e+01, 0.0000e+00, 0.0000e+00, 1.0000e+01, 1.0000e+01],\n",
" [5.1000e+01, 5.1780e+03, 0.0000e+00, 1.2000e+01, 4.5000e+01],\n",
" [3.0000e+01, 1.5024e+04, 0.0000e+00, 1.4000e+01, 6.0000e+01],\n",
" [4.0000e+01, 0.0000e+00, 0.0000e+00, 9.0000e+00, 4.0000e+01],\n",
" [3.1000e+01, 0.0000e+00, 0.0000e+00, 1.0000e+01, 1.0000e+01],\n",
" [4.9000e+01, 0.0000e+00, 0.0000e+00, 1.3000e+01, 4.0000e+01]],\n",
" dtype=float32)"
]
},
"metadata": {
"tags": []
},
"execution_count": 19
}
]
},
{
"metadata": {
"id": "cBGDN97IcYwQ",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"You could retrain a model on these features just by changing the `feature_columns` argument to the constructor:"
]
},
{
"metadata": {
"id": "XN8k5S95cYwR",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"outputId": "72be27c1-e25c-4609-a703-8297c936177a"
},
"cell_type": "code",
"source": [
"classifier = tf.estimator.LinearClassifier(feature_columns=my_numeric_columns, n_classes=2)\n",
"classifier.train(train_inpf)\n",
"\n",
"result = classifier.evaluate(test_inpf)\n",
"\n",
"clear_output()\n",
"for key,value in sorted(result.items()):\n",
" print('%s: %s' % (key, value))"
],
"execution_count": 20,
"outputs": [
{
"output_type": "stream",
"text": [
"accuracy: 0.76377374\n",
"accuracy_baseline: 0.76377374\n",
"auc: 0.539677\n",
"auc_precision_recall: 0.334656\n",
"average_loss: 1.4886041\n",
"global_step: 1018\n",
"label/mean: 0.23622628\n",
"loss: 95.04299\n",
"precision: 0.0\n",
"prediction/mean: 0.21315515\n",
"recall: 0.0\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "jBRq9_AzcYwU",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"#### Categorical columns\n",
"\n",
"To define a feature column for a categorical feature, we can create a\n",
"`CategoricalColumn` using one of the `tf.feature_column.categorical_column*` functions.\n",
"\n",
"If you know the set of all possible feature values of a column and there are only a few of them, you can use `categorical_column_with_vocabulary_list`. Each key in the list is assigned an auto-incrementing ID starting from 0. For example, for the `relationship` column we can assign the feature string `Husband` to an integer ID of 0, `Not-in-family` to 1, and so on, by doing:"
]
},
{
"metadata": {
"id": "0IjqSi9tcYwV",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 37
},
"outputId": "859f282d-7a9c-417b-a615-643a15d10118"
},
"cell_type": "code",
"source": [
"relationship = fc.categorical_column_with_vocabulary_list(\n",
" 'relationship', [\n",
" 'Husband', 'Not-in-family', 'Wife', 'Own-child', 'Unmarried',\n",
" 'Other-relative'])\n"
],
"execution_count": 21,
"outputs": []
},
{
"metadata": {
"id": "-RjoWv-7cYwW",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"This will create a sparse one-hot vector from the raw input feature.\n",
"\n",
"The `input_layer` function we're using for demonstration is designed for DNN models, and so expects dense inputs. To demonstrate the categorical column we must wrap it in a `tf.feature_column.indicator_column` to create the dense one-hot output (Linear `Estimators` can often skip this dense-step).\n",
"\n",
"Note: the other sparse-to-dense option is `tf.feature_column.embedding_column`.\n",
"\n",
"Run the input layer, configured with both the `age` and `relationship` columns:"
]
},
{
"metadata": {
"id": "kI43CYlncYwY",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 224
},
"outputId": "458177e5-4bc0-48f2-b1fb-614b91dd99e6"
},
"cell_type": "code",
"source": [
"fc.input_layer(feature_batch, [age, fc.indicator_column(relationship)])"
],
"execution_count": 22,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"<tf.Tensor: id=4361, shape=(10, 7), dtype=float32, numpy=\n",
"array([[31., 0., 1., 0., 0., 0., 0.],\n",
" [88., 1., 0., 0., 0., 0., 0.],\n",
" [36., 1., 0., 0., 0., 0., 0.],\n",
" [46., 1., 0., 0., 0., 0., 0.],\n",
" [20., 0., 1., 0., 0., 0., 0.],\n",
" [51., 1., 0., 0., 0., 0., 0.],\n",
" [30., 1., 0., 0., 0., 0., 0.],\n",
" [40., 1., 0., 0., 0., 0., 0.],\n",
" [31., 0., 0., 1., 0., 0., 0.],\n",
" [49., 0., 1., 0., 0., 0., 0.]], dtype=float32)>"
]
},
"metadata": {
"tags": []
},
"execution_count": 22
}
]
},
{
"metadata": {
"id": "tTudP7WHcYwb",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"What if we don't know the set of possible values in advance? Not a problem. We\n",
"can use `categorical_column_with_hash_bucket` instead:"
]
},
{
"metadata": {
"id": "8pSBaliCcYwb",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 37
},
"outputId": "e9b2e611-1311-4933-af0a-489e03fdc960"
},
"cell_type": "code",
"source": [
"occupation = tf.feature_column.categorical_column_with_hash_bucket(\n",
" 'occupation', hash_bucket_size=1000)"
],
"execution_count": 23,
"outputs": []
},
{
"metadata": {
"id": "fSAPrqQkcYwd",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"What will happen is that each possible value in the feature column `occupation`\n",
"will be hashed to an integer ID as we encounter them in training. The example batch has a few different occupations:"
]
},
{
"metadata": {
"id": "dCvQNv36cYwe",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 207
},
"outputId": "23ebfedd-faf8-425b-a855-9897aba20341"
},
"cell_type": "code",
"source": [
"for item in feature_batch['occupation'].numpy():\n",
" print(item.decode())"
],
"execution_count": 24,
"outputs": [
{
"output_type": "stream",
"text": [
"Prof-specialty\n",
"Exec-managerial\n",
"Prof-specialty\n",
"Exec-managerial\n",
"Tech-support\n",
"Sales\n",
"Exec-managerial\n",
"Machine-op-inspct\n",
"?\n",
"Exec-managerial\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "KP5hN2rAcYwh",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"If we run `input_layer` with the hashed column, we see that the output shape is `(batch_size, hash_bucket_size)`:"
]
},
{
"metadata": {
"id": "0Y16peWacYwh",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 54
},
"outputId": "524b1af5-c492-4d0e-b736-7974ca618089"
},
"cell_type": "code",
"source": [
"occupation_result = fc.input_layer(feature_batch, [fc.indicator_column(occupation)])\n",
"\n",
"occupation_result.numpy().shape"
],
"execution_count": 25,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(10, 1000)"
]
},
"metadata": {
"tags": []
},
"execution_count": 25
}
]
},
{
"metadata": {
"id": "HMW2MzWAcYwk",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"It's easier to see the actual results if we take the `tf.argmax` over the `hash_bucket_size` dimension.\n",
"\n",
"Note: Hash collisions are unavoidable, but often have minimal impact on model quality. The effect may be noticeable if the hash buckets are being used to compress the input space. See [this notebook](https://colab.research.google.com/github/tensorflow/models/blob/master/samples/outreach/blogs/housing_prices.ipynb) for a more visual example of the effect of these hash collisions.\n",
"\n",
"In the output below, note how any duplicate occupations are mapped to the same pseudo-random index:"
]
},
{
"metadata": {
"id": "q_ryRglmcYwk",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 54
},
"outputId": "e1797664-1200-48e3-c774-52e7e0a18f00"
},
"cell_type": "code",
"source": [
"tf.argmax(occupation_result, axis=1).numpy()"
],
"execution_count": 26,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([979, 800, 979, 800, 413, 631, 800, 911, 65, 800])"
]
},
"metadata": {
"tags": []
},
"execution_count": 26
}
]
},
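{
"metadata": {
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"To get a rough feel for collisions, the sketch below hashes a few occupation strings into buckets using plain Python. It is only an illustration: `bucket_id` is a hypothetical helper, and MD5 stands in for whatever hash function TensorFlow uses internally, so the IDs will not match the `tf.argmax` output above.\n",
"\n",
"```python\n",
"import hashlib\n",
"\n",
"def bucket_id(value, num_buckets):\n",
"    # Stable stand-in hash (MD5); NOT TensorFlow's internal hash function.\n",
"    digest = hashlib.md5(value.encode('utf-8')).hexdigest()\n",
"    return int(digest, 16) % num_buckets\n",
"\n",
"occupations = ['Prof-specialty', 'Exec-managerial', 'Tech-support', 'Sales',\n",
"               'Machine-op-inspct', '?']\n",
"\n",
"for num_buckets in (1000, 4):\n",
"    ids = [bucket_id(o, num_buckets) for o in occupations]\n",
"    # Equal strings always get equal IDs; distinct strings may collide.\n",
"    print(num_buckets, 'buckets:', len(set(ids)), 'distinct IDs for',\n",
"          len(occupations), 'values')\n",
"```\n",
"\n",
"With fewer buckets than distinct values, some values are forced to share an ID, and therefore a model weight. With 1000 buckets and a handful of occupations, collisions are possible but unlikely."
]
},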
{
"metadata": {
"id": "j1e5NfyKcYwn",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"No matter which way we choose to define a `CategoricalColumn`, each feature string\n",
"will be mapped into an integer ID by looking up a fixed mapping or by hashing.\n",
"Under the hood, the `LinearModel` class is responsible for\n",
"managing the mapping and creating `tf.Variable` to store the model parameters\n",
"(also known as model weights) for each feature ID. The model parameters will be\n",
"learned through the model training process we'll go through later.\n",
"\n",
"We'll use the same approach to define the other categorical features:"
]
},
{
"metadata": {
"id": "0Z5eUrd_cYwo",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 37
},
"outputId": "becd1bda-9014-4b9e-92ef-ba4ee2ed52fa"
},
"cell_type": "code",
"source": [
"education = tf.feature_column.categorical_column_with_vocabulary_list(\n",
" 'education', [\n",
" 'Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',\n",
" 'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',\n",
" '5th-6th', '10th', '1st-4th', 'Preschool', '12th'])\n",
"\n",
"marital_status = tf.feature_column.categorical_column_with_vocabulary_list(\n",
" 'marital_status', [\n",
" 'Married-civ-spouse', 'Divorced', 'Married-spouse-absent',\n",
" 'Never-married', 'Separated', 'Married-AF-spouse', 'Widowed'])\n",
"\n",
"workclass = tf.feature_column.categorical_column_with_vocabulary_list(\n",
" 'workclass', [\n",
" 'Self-emp-not-inc', 'Private', 'State-gov', 'Federal-gov',\n",
" 'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked'])\n"
],
"execution_count": 27,
"outputs": []
},
{
"metadata": {
"id": "a03l9ozUcYwp",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 37
},
"outputId": "374c7f00-8d2e-458f-ec32-b4cbc6b7386f"
},
"cell_type": "code",
"source": [
"my_categorical_columns = [relationship, occupation, education, marital_status, workclass]"
],
"execution_count": 28,
"outputs": []
},
{
"metadata": {
"id": "ASQJM1pEcYwr",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"It's easy to use both sets of columns to configure a model that uses all these features:"
]
},
{
"metadata": {
"id": "_i_MLoo9cYws",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 224
},
"outputId": "95ab18a4-2ec1-4fad-c207-2f86b607a333"
},
"cell_type": "code",
"source": [
"classifier = tf.estimator.LinearClassifier(feature_columns=my_numeric_columns+my_categorical_columns, n_classes=2)\n",
"classifier.train(train_inpf)\n",
"result = classifier.evaluate(test_inpf)\n",
"\n",
"clear_output()\n",
"for key,value in sorted(result.items()):\n",
" print('%s: %s' % (key, value))"
],
"execution_count": 29,
"outputs": [
{
"output_type": "stream",
"text": [
"accuracy: 0.81978995\n",
"accuracy_baseline: 0.76377374\n",
"auc: 0.869223\n",
"auc_precision_recall: 0.6459037\n",
"average_loss: 1.9878242\n",
"global_step: 1018\n",
"label/mean: 0.23622628\n",
"loss: 126.916725\n",
"precision: 0.60679156\n",
"prediction/mean: 0.2908891\n",
"recall: 0.6736869\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "zdKEqF6xcYwv",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"### Derived feature columns\n",
"\n",
"#### Making Continuous Features Categorical through Bucketization\n",
"\n",
"Sometimes the relationship between a continuous feature and the label is not\n",
"linear. As a hypothetical example, a person's income may grow with age in the\n",
"early stage of one's career, then the growth may slow at some point, and finally\n",
"the income decreases after retirement. In this scenario, using the raw `age` as\n",
"a real-valued feature column might not be a good choice because the model can\n",
"only learn one of the three cases:\n",
"\n",
"1. Income always increases at some rate as age grows (positive correlation),\n",
"1. Income always decreases at some rate as age grows (negative correlation), or\n",
"1. Income stays the same no matter at what age (no correlation)\n",
"\n",
"If we want to learn the fine-grained correlation between income and each age\n",
"group separately, we can leverage **bucketization**. Bucketization is a process\n",
"of dividing the entire range of a continuous feature into a set of consecutive\n",
"bins/buckets, and then converting the original numerical feature into a bucket\n",
"ID (as a categorical feature) depending on which bucket that value falls into.\n",
"So, we can define a `bucketized_column` over `age` as:"
]
},
{
"metadata": {
"id": "KT4pjD9AcYww",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 17
},
"outputId": "633c1bb5-e5e2-4cf3-8392-5caf473607da"
},
"cell_type": "code",
"source": [
"age_buckets = tf.feature_column.bucketized_column(\n",
" age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])"
],
"execution_count": 30,
"outputs": []
},
{
"metadata": {
"id": "S-XOscrEcYwx",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Here, `boundaries` is a list of bucket boundaries. In this case, there are\n",
"10 boundaries, resulting in 11 age group buckets (from age 17 and below, 18-24,\n",
"25-29, ..., to 65 and over).\n",
"\n",
"With bucketing, the model sees each bucket as a one-hot feature:"
]
},
{
"metadata": {
"id": "Lr40vm3qcYwy",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"outputId": "e53a3d92-f8d4-4ff7-da5e-46f498eb2316"
},
"cell_type": "code",
"source": [
"fc.input_layer(feature_batch, [age, age_buckets]).numpy()"
],
"execution_count": 31,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([[31., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],\n",
" [88., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],\n",
" [36., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],\n",
" [46., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],\n",
" [20., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n",
" [51., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],\n",
" [30., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],\n",
" [40., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],\n",
" [31., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],\n",
" [49., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.]],\n",
" dtype=float32)"
]
},
"metadata": {
"tags": []
},
"execution_count": 31
}
]
},
{
"metadata": {
"id": "Z_tQI9j8cYw1",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"#### Learn complex relationships with crossed column\n",
"\n",
"Using each base feature column separately may not be enough to explain the data.\n",
"For example, the correlation between education and the label (earning > 50,000\n",
"dollars) may be different for different occupations. Therefore, if we only learn\n",
"a single model weight for `education=\"Bachelors\"` and `education=\"Masters\"`, we\n",
"won't be able to capture every single education-occupation combination (e.g.\n",
"distinguishing between `education=\"Bachelors\" AND occupation=\"Exec-managerial\"`\n",
"and `education=\"Bachelors\" AND occupation=\"Craft-repair\"`). To learn the\n",
"differences between different feature combinations, we can add **crossed feature\n",
"columns** to the model."
]
},
{
"metadata": {
"id": "IAPhPzXscYw1",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 37
},
"outputId": "4dd22eaf-3917-449d-9068-5306ae60b6a6"
},
"cell_type": "code",
"source": [
"education_x_occupation = tf.feature_column.crossed_column(\n",
" ['education', 'occupation'], hash_bucket_size=1000)"
],
"execution_count": 32,
"outputs": []
},
{
"metadata": {
"id": "UeTxMunbcYw5",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"We can also create a `crossed_column` over more than two columns. Each\n",
"constituent column can be either a base feature column that is categorical\n",
"(`CategoricalColumn`), a bucketized real-valued feature column, or even another\n",
"`CrossedColumn`. Here's an example:"
]
},
{
"metadata": {
"id": "y8UaBld9cYw7",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 37
},
"outputId": "4abb43e7-c406-4caf-f15e-71af723ec8df"
},
"cell_type": "code",
"source": [
"age_buckets_x_education_x_occupation = tf.feature_column.crossed_column(\n",
" [age_buckets, 'education', 'occupation'], hash_bucket_size=1000)"
],
"execution_count": 33,
"outputs": []
},
{
"metadata": {
"id": "HvKmW6U5cYw8",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"These crossed columns always use hash buckets to avoid the exponential explosion in the number of categories, and to put control over the number of model weights in the hands of the user.\n",
"\n",
"For a visual example of the effect of hash buckets with crossed columns, see [this notebook](https://colab.research.google.com/github/tensorflow/models/blob/master/samples/outreach/blogs/housing_prices.ipynb).\n",
"\n"
]
},
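{
"metadata": {
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"As a back-of-the-envelope check on why hashing helps, the sketch below compares the number of possible combinations in the three-way cross with the number of hash buckets. The education and age-bucket counts come from the definitions above; the occupation count is an assumed, illustrative number, since `occupation` is hashed rather than enumerated in this notebook.\n",
"\n",
"```python\n",
"num_age_buckets = 11    # 10 boundaries give 11 buckets\n",
"num_education = 16      # entries in the education vocabulary list\n",
"num_occupation = 15     # ASSUMPTION: illustrative count, not taken from this notebook\n",
"hash_bucket_size = 1000\n",
"\n",
"possible_combinations = num_age_buckets * num_education * num_occupation\n",
"# Ceiling division: if every combination occurred, at least one bucket\n",
"# would have to hold this many of them.\n",
"worst_case_sharing = -(-possible_combinations // hash_bucket_size)\n",
"\n",
"print('possible combinations:', possible_combinations)\n",
"print('weights allocated    :', hash_bucket_size)\n",
"print('combinations in the fullest bucket (at least):', worst_case_sharing)\n",
"```\n",
"\n",
"The number of weights stays fixed at `hash_bucket_size` no matter how many columns are crossed; the price is that some combinations share a weight."
]
},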
{
"metadata": {
"id": "HtjpheB6cYw9",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## Defining The Logistic Regression Model\n",
"\n",
"After processing the input data and defining all the feature columns, we're now\n",
"ready to put them all together and build a Logistic Regression model. In the\n",
"previous section we've seen several types of base and derived feature columns,\n",
"including:\n",
"\n",
"* `CategoricalColumn`\n",
"* `NumericColumn`\n",
"* `BucketizedColumn`\n",
"* `CrossedColumn`\n",
"\n",
"All of these are subclasses of the abstract `FeatureColumn` class, and can be\n",
"added to the `feature_columns` field of a model:"
]
},
{
"metadata": {
"id": "Klmf3OxpcYw-",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 105
},
"outputId": "a8f46b90-a9d0-4d33-fff5-38b530e35d43"
},
"cell_type": "code",
"source": [
"import tempfile\n",
"\n",
"base_columns = [\n",
" education, marital_status, relationship, workclass, occupation,\n",
" age_buckets,\n",
"]\n",
"crossed_columns = [\n",
" tf.feature_column.crossed_column(\n",
" ['education', 'occupation'], hash_bucket_size=1000),\n",
" tf.feature_column.crossed_column(\n",
" [age_buckets, 'education', 'occupation'], hash_bucket_size=1000),\n",
"]\n",
"\n",
"model_dir = tempfile.mkdtemp()\n",
"model = tf.estimator.LinearClassifier(\n",
" model_dir=model_dir, feature_columns=base_columns + crossed_columns)"
],
"execution_count": 34,
"outputs": [
{
"output_type": "stream",
"text": [
"INFO:tensorflow:Using default config.\n"
],
"name": "stdout"
},
{
"output_type": "stream",
"text": [
"I0711 22:27:55.502184 140174775953280 tf_logging.py:115] Using default config.\n"
],
"name": "stderr"
},
{
"output_type": "stream",
"text": [
"INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmp93vf5hp6', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f7cc6df0ba8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}\n"
],
"name": "stdout"
},
{
"output_type": "stream",
"text": [
"I0711 22:27:55.509107 140174775953280 tf_logging.py:115] Using config: {'_model_dir': '/tmp/tmp93vf5hp6', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f7cc6df0ba8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}\n"
],
"name": "stderr"
}
]
},
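{
"metadata": {
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Conceptually, a two-class `LinearClassifier` is a logistic regression model: it sums the weights of the active features, adds a bias, and squashes the result with a sigmoid. The sketch below illustrates that computation in plain Python; `predict_proba` and the weight values are hypothetical and are not read from the trained model.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"def predict_proba(active_weights, bias):\n",
"    # Logistic regression: sigmoid of (sum of active feature weights + bias).\n",
"    logit = bias + sum(active_weights)\n",
"    return 1.0 / (1.0 + math.exp(-logit))\n",
"\n",
"# Hypothetical bias chosen so that the no-feature prediction is 24%.\n",
"bias = math.log(0.24 / 0.76)\n",
"\n",
"print('no features       : %0.2f' % predict_proba([], bias))\n",
"print('two active weights: %0.2f' % predict_proba([1.2, -0.3], bias))\n",
"```"
]
},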
{
"metadata": {
"id": "jRhnPxUucYxC",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"The model also automatically learns a bias term, which controls the prediction\n",
"one would make without observing any features (see the section [How Logistic\n",
"Regression Works](#how_it_works) for more explanations). The learned model files will be stored\n",
"in `model_dir`.\n",
"\n",
"## Training and evaluating our model\n",
"\n",
"After adding all the features to the model, now let's look at how to actually\n",
"train the model. Training a model is just a single command using the\n",
"tf.estimator API:"
]
},
{
"metadata": {
"id": "ZlrIBuoecYxD",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 17
},
"outputId": "5aa0bc8c-9496-4301-963a-78bcef54e17a"
},
"cell_type": "code",
"source": [
"model.train(train_inpf)\n",
"clear_output()"
],
"execution_count": 35,
"outputs": []
},
{
"metadata": {
"id": "IvY3a9pzcYxH",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"After the model is trained, we can evaluate how good our model is at predicting\n",
"the labels of the holdout data:"
]
},
{
"metadata": {
"id": "L9nVJEO8cYxI",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"outputId": "8eb14bd7-9030-4381-c18a-6a5c7c17c569"
},
"cell_type": "code",
"source": [
"results = model.evaluate(test_inpf)\n",
"clear_output()\n",
"for key in sorted(results):\n",
" print('%s: %0.2f' % (key, results[key]))"
],
"execution_count": 36,
"outputs": [
{
"output_type": "stream",
"text": [
"accuracy: 0.84\n",
"accuracy_baseline: 0.76\n",
"auc: 0.88\n",
"auc_precision_recall: 0.70\n",
"average_loss: 0.35\n",
"global_step: 1018.00\n",
"label/mean: 0.24\n",
"loss: 22.42\n",
"precision: 0.71\n",
"prediction/mean: 0.22\n",
"recall: 0.52\n"
],
"name": "stdout"
}
]
},
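{
"metadata": {
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"To put the accuracy in context, compare it with `accuracy_baseline`. As far as we understand the metric, it is the accuracy of always predicting the majority class, which can be recomputed from `label/mean`. A minimal sketch, using the unrounded `label/mean` value printed by the earlier evaluations:\n",
"\n",
"```python\n",
"label_mean = 0.23622628  # fraction of test examples labeled >50K ('label/mean')\n",
"\n",
"# Always predicting the more common class is right this often.\n",
"majority_baseline = max(label_mean, 1.0 - label_mean)\n",
"accuracy = 0.84  # rounded accuracy reported above\n",
"\n",
"print('baseline: %0.2f' % majority_baseline)\n",
"print('improvement over baseline: %0.2f' % (accuracy - majority_baseline))\n",
"```\n",
"\n",
"A model is only useful to the extent that it beats this baseline, which is why accuracy alone can be misleading on imbalanced labels."
]
},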
{
"metadata": {
"id": "E0fAibNDcYxL",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"The first line of the final output should be something like\n",
"`accuracy: 0.84`, which means the accuracy is 84%. Feel free to try more\n",
"features and transformations and see if you can do even better!\n",
"\n",
"After the model is evaluated, we can use it to predict whether an individual has an annual income of over\n",
"50,000 dollars, given that individual's information.\n",
"\n",
"Let's look in more detail at how the model did:"
]
},
{
"metadata": {
"id": "8R5bz5CxcYxL",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 669
},
"outputId": "71f5e775-0d24-4356-d785-3b06aa385957"
},
"cell_type": "code",
"source": [
"import numpy as np\n",
"predict_df = test_df[:20].copy()\n",
"\n",
"pred_iter = model.predict(\n",
" lambda:easy_input_function(predict_df, label_key='income_bracket',\n",
" num_epochs=1, shuffle=False, batch_size=10))\n",
"\n",
"classes = np.array(['<=50K', '>50K'])\n",
"pred_class_id = []\n",
"for pred_dict in pred_iter:\n",
" pred_class_id.append(pred_dict['class_ids'])\n",
"\n",
"predict_df['predicted_class'] = classes[np.array(pred_class_id)]\n",
"predict_df['correct'] = predict_df['predicted_class'] == predict_df['income_bracket']\n",
"\n",
"clear_output()\n",
"predict_df[['income_bracket','predicted_class', 'correct']]"
],
"execution_count": 37,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>income_bracket</th>\n",
" <th>predicted_class</th>\n",
" <th>correct</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>&gt;50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>&gt;50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>&gt;50K</td>\n",
" <td>&gt;50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>&gt;50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&gt;50K</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>&gt;50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>&gt;50K</td>\n",
" <td>&gt;50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>&gt;50K</td>\n",
" <td>&gt;50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" income_bracket predicted_class correct\n",
"0 <=50K <=50K True\n",
"1 <=50K <=50K True\n",
"2 >50K <=50K False\n",
"3 >50K <=50K False\n",
"4 <=50K <=50K True\n",
"5 <=50K <=50K True\n",
"6 <=50K <=50K True\n",
"7 >50K >50K True\n",
"8 <=50K <=50K True\n",
"9 <=50K <=50K True\n",
"10 >50K <=50K False\n",
"11 <=50K >50K False\n",
"12 <=50K <=50K True\n",
"13 <=50K <=50K True\n",
"14 >50K <=50K False\n",
"15 >50K >50K True\n",
"16 <=50K <=50K True\n",
"17 <=50K <=50K True\n",
"18 <=50K <=50K True\n",
"19 >50K >50K True"
]
},
"metadata": {
"tags": []
},
"execution_count": 37
}
]
},
{
"metadata": {
"id": "N_uCpFTicYxN",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"If you'd like to see a working end-to-end example, you can download our\n",
"[example code](https://github.com/tensorflow/models/tree/master/official/wide_deep/census_main.py)\n",
"and set the `model_type` flag to `wide`.\n",
"\n",
"## Adding Regularization to Prevent Overfitting\n",
"\n",
"Regularization is a technique used to avoid **overfitting**. Overfitting happens\n",
"when your model does well on the data it is trained on, but worse on test data\n",
"that the model has not seen before, such as live traffic. Overfitting generally\n",
"occurs when a model is excessively complex, such as having too many parameters\n",
"relative to the amount of observed training data. Regularization lets you\n",
"control your model's complexity and makes the model more generalizable to\n",
"unseen data.\n",
"\n",
"In the Linear Model library, you can add L1 and L2 regularization to the model\n",
"as follows:"
]
},
{
"metadata": {
"id": "cVv2HsqocYxO",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"outputId": "68504270-5bcc-4a87-dbfa-7fd94cf54dff"
},
"cell_type": "code",
"source": [
"#TODO(markdaoust): is the regularization strength here not working?\n",
"model = tf.estimator.LinearClassifier(\n",
" model_dir=model_dir, feature_columns=base_columns + crossed_columns,\n",
" optimizer=tf.train.FtrlOptimizer(\n",
" learning_rate=0.1,\n",
" l1_regularization_strength=0.1,\n",
" l2_regularization_strength=0.1))\n",
"\n",
"model.train(train_inpf)\n",
"\n",
"results = model.evaluate(test_inpf)\n",
"clear_output()\n",
"for key in sorted(results):\n",
" print('%s: %0.2f' % (key, results[key]))"
],
"execution_count": 38,
"outputs": [
{
"output_type": "stream",
"text": [
"accuracy: 0.84\n",
"accuracy_baseline: 0.76\n",
"auc: 0.89\n",
"auc_precision_recall: 0.70\n",
"average_loss: 0.35\n",
"global_step: 2036.00\n",
"label/mean: 0.24\n",
"loss: 22.28\n",
"precision: 0.70\n",
"prediction/mean: 0.24\n",
"recall: 0.55\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "5AqvPEQwcYxU",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"One important difference between L1 and L2 regularization is that L1\n",
"regularization tends to make model weights stay at zero, creating sparser\n",
"models, whereas L2 regularization also tries to make the model weights closer to\n",
"zero but not necessarily zero. Therefore, if you increase the strength of L1\n",
"regularization, you will have a smaller model size because many of the model\n",
"weights will be zero. This is often desirable when the feature space is very\n",
"large but sparse, and when there are resource constraints that prevent you from\n",
"serving a model that is too large.\n",
"\n",
"In practice, you should try various combinations of L1, L2 regularization\n",
"strengths and find the best parameters that best control overfitting and give\n",
"you a desirable model size."
]
},
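{
"metadata": {
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"As a schematic sketch of the difference described above (the symbols \\\\(L\\\\),\n",
"\\\\(\\lambda_1\\\\) and \\\\(\\lambda_2\\\\) are notation introduced here for illustration, and\n",
"this is not necessarily the exact objective that `FtrlOptimizer` optimizes\n",
"internally), regularized training minimizes the data loss plus two penalty terms:\n",
"\n",
"$$ \\min_{\\mathbf{w}} L(\\mathbf{w}) + \\lambda_1 \\sum_i |w_i| + \\lambda_2 \\sum_i w_i^2 $$\n",
"\n",
"Here \\\\(L(\\mathbf{w})\\\\) stands for the loss on the training data, and\n",
"\\\\(\\lambda_1\\\\) and \\\\(\\lambda_2\\\\) play the role of `l1_regularization_strength` and\n",
"`l2_regularization_strength`. For a nonzero weight, the gradient of the L1 term has\n",
"constant magnitude \\\\(\\lambda_1\\\\) no matter how small the weight is, so it can push\n",
"the weight all the way to zero. The gradient of the L2 term, \\\\(2\\lambda_2 w_i\\\\),\n",
"shrinks along with the weight, so it rarely makes the weight exactly zero."
]
},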
{
"metadata": {
"id": "i5119iMWcYxU",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"<a id=\"how_it_works\"> </a>\n",
"## How Logistic Regression Works\n",
"\n",
"Finally, let's take a minute to talk about what the Logistic Regression model\n",
"actually looks like in case you're not already familiar with it. We'll denote\n",
"the label as \\\\(Y\\\\), and the set of observed features as a feature vector\n",
"\\\\(\\mathbf{x}=[x_1, x_2, ..., x_d]\\\\). We define \\\\(Y=1\\\\) if an individual\n",
"earned > 50,000 dollars and \\\\(Y=0\\\\) otherwise. In Logistic Regression, the\n",
"probability of the label being positive (\\\\(Y=1\\\\)) given the features\n",
"\\\\(\\mathbf{x}\\\\) is given as:\n",
"\n",
"$$ P(Y=1|\\mathbf{x}) = \\frac{1}{1+\\exp(-(\\mathbf{w}^T\\mathbf{x}+b))}$$\n",
"\n",
"where \\\\(\\mathbf{w}=[w_1, w_2, ..., w_d]\\\\) are the model weights for the\n",
"features \\\\(\\mathbf{x}=[x_1, x_2, ..., x_d]\\\\). \\\\(b\\\\) is a constant that is\n",
"often called the **bias** of the model. The equation consists of two parts—A\n",
"linear model and a logistic function:\n",
"\n",
"* **Linear Model**: First, we can see that \\\\(\\mathbf{w}^T\\mathbf{x}+b = b +\n",
" w_1x_1 + ... +w_dx_d\\\\) is a linear model where the output is a linear\n",
" function of the input features \\\\(\\mathbf{x}\\\\). The bias \\\\(b\\\\) is the\n",
" prediction one would make without observing any features. The model weight\n",
" \\\\(w_i\\\\) reflects how the feature \\\\(x_i\\\\) is correlated with the positive\n",
" label. If \\\\(x_i\\\\) is positively correlated with the positive label, the\n",
" weight \\\\(w_i\\\\) increases, and the probability \\\\(P(Y=1|\\mathbf{x})\\\\) will\n",
" be closer to 1. On the other hand, if \\\\(x_i\\\\) is negatively correlated\n",
" with the positive label, then the weight \\\\(w_i\\\\) decreases and the\n",
" probability \\\\(P(Y=1|\\mathbf{x})\\\\) will be closer to 0.\n",
"\n",
"* **Logistic Function**: Second, we can see that there's a logistic function\n",
" (also known as the sigmoid function) \\\\(S(t) = 1/(1+\\exp(-t))\\\\) being\n",
" applied to the linear model. The logistic function is used to convert the\n",
" output of the linear model \\\\(\\mathbf{w}^T\\mathbf{x}+b\\\\) from any real\n",
" number into the range of \\\\([0, 1]\\\\), which can be interpreted as a\n",
" probability.\n",
"\n",
"Model training is an optimization problem: The goal is to find a set of model\n",
"weights (i.e. model parameters) to minimize a **loss function** defined over the\n",
"training data, such as logistic loss for Logistic Regression models. The loss\n",
"function measures the discrepancy between the ground-truth label and the model's\n",
"prediction. If the prediction is very close to the ground-truth label, the loss\n",
"value will be low; if the prediction is very far from the label, then the loss\n",
"value would be high."
]
},
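{
"metadata": {
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"As a sketch of the logistic loss mentioned above (the symbols \\\\(y\\\\), \\\\(p\\\\) and\n",
"\\\\(\\ell\\\\) are notation introduced here for illustration), for a single example\n",
"with label \\\\(y\\\\) equal to 0 or 1 and predicted probability\n",
"\\\\(p = P(Y=1|\\mathbf{x})\\\\), the loss can be written with the natural logarithm as:\n",
"\n",
"$$ \\ell(y, p) = -\\left( y \\log p + (1-y) \\log(1-p) \\right) $$\n",
"\n",
"For a positive example (\\\\(y=1\\\\)), a confident correct prediction of\n",
"\\\\(p=0.9\\\\) gives a loss of \\\\(-\\log 0.9 \\approx 0.105\\\\), while a confident wrong\n",
"prediction of \\\\(p=0.1\\\\) gives \\\\(-\\log 0.1 \\approx 2.303\\\\)."
]
},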
{
"metadata": {
"id": "hbXuPYQIcYxV",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## What Next\n",
"\n",
"For more about estimators:\n",
"\n",
"- The [TensorFlow Hub transfer-learning tutorial](https://www.tensorflow.org/hub/tutorials/text_classification_with_tf_hub)\n",
"- The [Gradient-boosted-trees estimator tutorial](https://github.com/tensorflow/models/tree/master/official/boosted_trees)\n",
"- This [blog post]( https://medium.com/tensorflow/classifying-text-with-tensorflow-estimators) on processing text with `Estimators`\n",
"- How to [build a custom CNN estimator](https://www.tensorflow.org/tutorials/estimators/cnn)"
]
},
{
"metadata": {
"id": "jpdw2z5WcYxV",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 17
},
"outputId": "403d18f6-d01e-47dc-dfc7-8c95d9a8ec34"
},
"cell_type": "code",
"source": [
""
],
"execution_count": 38,
"outputs": []
}
],
"source": [
"#TODO(markdaoust): is the regularization strength here not working?\n",
"model = tf.estimator.LinearClassifier(\n",
" model_dir=model_dir, feature_columns=base_columns + crossed_columns,\n",
" optimizer=tf.train.FtrlOptimizer(\n",
" learning_rate=0.1,\n",
" l1_regularization_strength=0.1,\n",
" l2_regularization_strength=0.1))\n",
"\n",
"model.train(train_inpf)\n",
"\n",
"results = model.evaluate(test_inpf)\n",
"clear_output()\n",
"for key in sorted(results):\n",
" print('%s: %0.2f' % (key, results[key]))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
]
}