Commit 2c929976 authored by Mark Daoust

Fix `wide.ipynb`

parent c8eda499
...@@ -6,7 +6,7 @@ ...@@ -6,7 +6,7 @@
"source": [ "source": [
"# TensorFlow Linear Model Tutorial\n", "# TensorFlow Linear Model Tutorial\n",
"\n", "\n",
"In this tutorial, we will use the tf.estimator API in TensorFlow to solve a\n", "In this tutorial, we will use the `tf.estimator` API in TensorFlow to solve a\n",
"binary classification problem: Given census data about a person such as age,\n", "binary classification problem: Given census data about a person such as age,\n",
"education, marital status, and occupation (the features), we will try to predict\n", "education, marital status, and occupation (the features), we will try to predict\n",
"whether or not the person earns more than 50,000 dollars a year (the target\n", "whether or not the person earns more than 50,000 dollars a year (the target\n",
...@@ -19,37 +19,112 @@ ...@@ -19,37 +19,112 @@
"\n", "\n",
"To try the code for this tutorial:\n", "To try the code for this tutorial:\n",
"\n", "\n",
"1. @{$install$Install TensorFlow} if you haven't already.\n", "[Install TensorFlow](tensorlfow.org/install) if you haven't already.\n",
"\n", "\n",
"2. Download [the tutorial code](https://github.com/tensorflow/models/tree/master/official/wide_deep/).\n", "Next import the relavant packages:"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"ename": "ValueError",
"evalue": "tf.enable_eager_execution must be called at program startup.",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-42-04d0fb7a9ec6>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mtensorflow\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mtf\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mtensorflow\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfeature_column\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mfc\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mtf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0menable_eager_execution\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mos\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/venv3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py\u001b[0m in \u001b[0;36menable_eager_execution\u001b[0;34m(config, device_policy, execution_mode)\u001b[0m\n\u001b[1;32m 5238\u001b[0m \"\"\"\n\u001b[1;32m 5239\u001b[0m return enable_eager_execution_internal(\n\u001b[0;32m-> 5240\u001b[0;31m config, device_policy, execution_mode, None)\n\u001b[0m\u001b[1;32m 5241\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5242\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/venv3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py\u001b[0m in \u001b[0;36menable_eager_execution_internal\u001b[0;34m(config, device_policy, execution_mode, server_def)\u001b[0m\n\u001b[1;32m 5306\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5307\u001b[0m raise ValueError(\n\u001b[0;32m-> 5308\u001b[0;31m \"tf.enable_eager_execution must be called at program startup.\")\n\u001b[0m\u001b[1;32m 5309\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5310\u001b[0m \u001b[0;31m# Monkey patch to get rid of an unnecessary conditional since the context is\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mValueError\u001b[0m: tf.enable_eager_execution must be called at program startup."
]
}
],
"source": [
"import tensorflow as tf\n",
"import tensorflow.feature_column as fc \n",
"tf.enable_eager_execution()\n",
"\n", "\n",
"3. Execute the data download script we provide to you:" "import os\n",
"import sys\n",
"from IPython.display import clear_output"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Download the [tutorial code from github](https://github.com/tensorflow/models/tree/master/official/wide_deep/),\n",
" add the root directory to your python path, and jump to the `wide_deep` directory:"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 2,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"fatal: destination path 'models' already exists and is not an empty directory.\r\n"
]
}
],
"source": [ "source": [
"$ python data_download.py" "if \"wide_deep\" not in os.getcwd():\n",
" ! git clone --depth 1 https://github.com/tensorflow/models\n",
" models_path = os.path.join(os.getcwd(), 'models')\n",
" sys.path.append(models_path) \n",
" os.environ['PYTHONPATH'] += os.pathsep+models_path\n",
" os.chdir(\"models/official/wide_deep\")"
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"4. Execute the tutorial code with the following command to train the linear\n", "Execute the data download script:"
"model described in this tutorial:"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 3,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"$ python wide_deep.py --model_type=wide" "import census_dataset\n",
"import census_main\n",
"\n",
"census_dataset.download(\"/tmp/census_data/\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Execute the tutorial code with the following command to train the model described in this tutorial, from the command line:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['I0711 14:47:25.747490 139708077598464 tf_logging.py:115] accuracy: 0.833794']\n"
]
}
],
"source": [
"output = !python -m census_main --model_type=wide --train_epochs=2\n",
"print([line for line in output if 'accuracy:' in line])"
] ]
}, },
{ {
...@@ -60,19 +135,227 @@ ...@@ -60,19 +135,227 @@
"\n", "\n",
"## Reading The Census Data\n", "## Reading The Census Data\n",
"\n", "\n",
"The dataset we'll be using is the\n", "The dataset we're using is the\n",
"[Census Income Dataset](https://archive.ics.uci.edu/ml/datasets/Census+Income).\n", "[Census Income Dataset](https://archive.ics.uci.edu/ml/datasets/Census+Income).\n",
"We have provided\n", "We have provided\n",
"[data_download.py](https://github.com/tensorflow/models/tree/master/official/wide_deep/data_download.py)\n", "[census_dataset.py](https://github.com/tensorflow/models/tree/master/official/wide_deep/census_dataset.py)\n",
"which downloads the code and performs some additional cleanup.\n", "which downloads the code and performs some additional cleanup.\n",
"\n", "\n",
"Since the task is a binary classification problem, we'll construct a label\n", "Since the task is a binary classification problem, we'll construct a label\n",
"column named \"label\" whose value is 1 if the income is over 50K, and 0\n", "column named \"label\" whose value is 1 if the income is over 50K, and 0\n",
"otherwise. For reference, see `input_fn` in\n", "otherwise. For reference, see `input_fn` in\n",
"[wide_deep.py](https://github.com/tensorflow/models/tree/master/official/wide_deep/wide_deep.py).\n", "[census_main.py](https://github.com/tensorflow/models/tree/master/official/wide_deep/census_main.py).\n",
"\n", "\n",
"Next, let's take a look at the dataframe and see which columns we can use to\n", "Next, let's take a look at the data and see which columns we can use to\n",
"predict the target label. The columns can be grouped into two types—categorical\n", "predict the target label. "
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"adult.data adult.test\r\n"
]
}
],
"source": [
"!ls /tmp/census_data/"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"train_file = \"/tmp/census_data/adult.data\"\n",
"test_file = \"/tmp/census_data/adult.test\""
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>age</th>\n",
" <th>workclass</th>\n",
" <th>fnlwgt</th>\n",
" <th>education</th>\n",
" <th>education_num</th>\n",
" <th>marital_status</th>\n",
" <th>occupation</th>\n",
" <th>relationship</th>\n",
" <th>race</th>\n",
" <th>gender</th>\n",
" <th>capital_gain</th>\n",
" <th>capital_loss</th>\n",
" <th>hours_per_week</th>\n",
" <th>native_country</th>\n",
" <th>income_bracket</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>39</td>\n",
" <td>State-gov</td>\n",
" <td>77516</td>\n",
" <td>Bachelors</td>\n",
" <td>13</td>\n",
" <td>Never-married</td>\n",
" <td>Adm-clerical</td>\n",
" <td>Not-in-family</td>\n",
" <td>White</td>\n",
" <td>Male</td>\n",
" <td>2174</td>\n",
" <td>0</td>\n",
" <td>40</td>\n",
" <td>United-States</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>50</td>\n",
" <td>Self-emp-not-inc</td>\n",
" <td>83311</td>\n",
" <td>Bachelors</td>\n",
" <td>13</td>\n",
" <td>Married-civ-spouse</td>\n",
" <td>Exec-managerial</td>\n",
" <td>Husband</td>\n",
" <td>White</td>\n",
" <td>Male</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>13</td>\n",
" <td>United-States</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>38</td>\n",
" <td>Private</td>\n",
" <td>215646</td>\n",
" <td>HS-grad</td>\n",
" <td>9</td>\n",
" <td>Divorced</td>\n",
" <td>Handlers-cleaners</td>\n",
" <td>Not-in-family</td>\n",
" <td>White</td>\n",
" <td>Male</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>40</td>\n",
" <td>United-States</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>53</td>\n",
" <td>Private</td>\n",
" <td>234721</td>\n",
" <td>11th</td>\n",
" <td>7</td>\n",
" <td>Married-civ-spouse</td>\n",
" <td>Handlers-cleaners</td>\n",
" <td>Husband</td>\n",
" <td>Black</td>\n",
" <td>Male</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>40</td>\n",
" <td>United-States</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>28</td>\n",
" <td>Private</td>\n",
" <td>338409</td>\n",
" <td>Bachelors</td>\n",
" <td>13</td>\n",
" <td>Married-civ-spouse</td>\n",
" <td>Prof-specialty</td>\n",
" <td>Wife</td>\n",
" <td>Black</td>\n",
" <td>Female</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>40</td>\n",
" <td>Cuba</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" age workclass fnlwgt education education_num \\\n",
"0 39 State-gov 77516 Bachelors 13 \n",
"1 50 Self-emp-not-inc 83311 Bachelors 13 \n",
"2 38 Private 215646 HS-grad 9 \n",
"3 53 Private 234721 11th 7 \n",
"4 28 Private 338409 Bachelors 13 \n",
"\n",
" marital_status occupation relationship race gender \\\n",
"0 Never-married Adm-clerical Not-in-family White Male \n",
"1 Married-civ-spouse Exec-managerial Husband White Male \n",
"2 Divorced Handlers-cleaners Not-in-family White Male \n",
"3 Married-civ-spouse Handlers-cleaners Husband Black Male \n",
"4 Married-civ-spouse Prof-specialty Wife Black Female \n",
"\n",
" capital_gain capital_loss hours_per_week native_country income_bracket \n",
"0 2174 0 40 United-States <=50K \n",
"1 0 0 13 United-States <=50K \n",
"2 0 0 40 United-States <=50K \n",
"3 0 0 40 United-States <=50K \n",
"4 0 0 40 Cuba <=50K "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas\n",
"train_df = pandas.read_csv(train_file, header = None, names = census_dataset._CSV_COLUMNS)\n",
"test_df = pandas.read_csv(test_file, header = None, names = census_dataset._CSV_COLUMNS)\n",
"\n",
"train_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The columns can be grouped into two types—categorical\n",
"and continuous columns:\n", "and continuous columns:\n",
"\n", "\n",
"* A column is called **categorical** if its value can only be one of the\n", "* A column is called **categorical** if its value can only be one of the\n",
...@@ -85,89 +368,125 @@ ...@@ -85,89 +368,125 @@
"\n", "\n",
"Here's a list of columns available in the Census Income dataset:\n", "Here's a list of columns available in the Census Income dataset:\n",
"\n", "\n",
"| Column Name | Type | Description |\n",
"| -------------- | ----------- | --------------------------------- |\n",
"| age | Continuous | The age of the individual |\n",
"| workclass | Categorical | The type of employer the |\n",
": : : individual has (government, :\n",
": : : military, private, etc.). :\n",
"| fnlwgt | Continuous | The number of people the census |\n",
": : : takers believe that observation :\n",
": : : represents (sample weight). Final :\n",
": : : weight will not be used. :\n",
"| education | Categorical | The highest level of education |\n",
": : : achieved for that individual. :\n",
"| education_num | Continuous | The highest level of education in |\n",
": : : numerical form. :\n",
"| marital_status | Categorical | Marital status of the individual. |\n",
"| occupation | Categorical | The occupation of the individual. |\n",
"| relationship | Categorical | Wife, Own-child, Husband, |\n",
": : : Not-in-family, Other-relative, :\n",
": : : Unmarried. :\n",
"| race | Categorical | Amer-Indian-Eskimo, Asian-Pac- |\n",
": : : Islander, Black, White, Other. :\n",
"| gender | Categorical | Female, Male. |\n",
"| capital_gain | Continuous | Capital gains recorded. |\n",
"| capital_loss | Continuous | Capital Losses recorded. |\n",
"| hours_per_week | Continuous | Hours worked per week. |\n",
"| native_country | Categorical | Country of origin of the |\n",
": : : individual. :\n",
"| income_bracket | Categorical | \">50K\" or \"<=50K\", meaning |\n",
": : : whether the person makes more :\n",
": : : than $50,000 annually. :\n",
"\n",
"## Converting Data into Tensors\n", "## Converting Data into Tensors\n",
"\n", "\n",
"When building a tf.estimator model, the input data is specified by means of an\n", "When building a tf.estimator model, the input data is specified by means of an\n",
"Input Builder function. This builder function will not be called until it is\n", "input function or `input_fn`. This builder function returns a `tf.data.Dataset`\n",
"later passed to tf.estimator.Estimator methods such as `train` and `evaluate`.\n", "of batches of `(features-dict,label)` pairs. It will not be called until it is\n",
"The purpose of this function is to construct the input data, which is\n", "later passed to `tf.estimator.Estimator` methods such as `train` and `evaluate`.\n",
"represented in the form of @{tf.Tensor}s or @{tf.SparseTensor}s.\n", "\n",
"In more detail, the input builder function returns the following as a pair:\n", "In more detail, the input builder function returns the following as a pair:\n",
"\n", "\n",
"1. `features`: A dict from feature column names to `Tensors` or\n", "1. `features`: A dict from feature names to `Tensors` or\n",
" `SparseTensors`.\n", " `SparseTensors` containing batches of features.\n",
"2. `labels`: A `Tensor` containing the label column.\n", "2. `labels`: A `Tensor` containing batches of labels.\n",
"\n",
"The keys of the `features` will be used to configure the model's input layer.\n",
"\n", "\n",
"The keys of the `features` will be used to construct columns in the next\n", "Note that the input function will be called while\n",
"section. Because we want to call the `train` and `evaluate` methods with\n",
"different data, we define a method that returns an input function based on the\n",
"given data. Note that the returned input function will be called while\n",
"constructing the TensorFlow graph, not while running the graph. What it is\n", "constructing the TensorFlow graph, not while running the graph. What it is\n",
"returning is a representation of the input data as the fundamental unit of\n", "returning is a representation of the input data as sequence of tensorflow graph\n",
"TensorFlow computations, a `Tensor` (or `SparseTensor`).\n", "operations.\n",
"\n", "\n",
"Each continuous column in the train or test data will be converted into a\n", "For small problems like this it's easy to make a `tf.data.Dataset` by slicing the `pandas.DataFrame`:"
"`Tensor`, which in general is a good format to represent dense data. For\n",
"categorical data, we must represent the data as a `SparseTensor`. This data\n",
"format is good for representing sparse data. Our `input_fn` uses the `tf.data`\n",
"API, which makes it easy to apply transformations to our dataset:"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 8,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"def easy_input_function(df, label_key, num_epochs, shuffle, batch_size):\n",
" df = df.copy()\n",
" label = df.pop(label_key)\n",
" ds = tf.data.Dataset.from_tensor_slices((dict(df),label))\n",
"\n",
" if shuffle:\n",
" ds = ds.shuffle(10000)\n",
"\n",
" ds = ds.batch(batch_size).repeat(num_epochs)\n",
"\n",
" return ds"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we have eager execution enabled it is easy to inspect the resulting dataset:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Some feature keys: ['capital_gain', 'occupation', 'gender', 'capital_loss', 'workclass']\n",
"\n",
"A batch of Ages : tf.Tensor([61 18 37 47 47 32 18 23 28 37], shape=(10,), dtype=int32)\n",
"\n",
"A batch of Labels: tf.Tensor(\n",
"[b'>50K' b'<=50K' b'>50K' b'>50K' b'>50K' b'>50K' b'<=50K' b'<=50K'\n",
" b'<=50K' b'<=50K'], shape=(10,), dtype=string)\n"
]
}
],
"source": [
"ds = easy_input_function(train_df, label_key='income_bracket', num_epochs=5, shuffle=True, batch_size=10)\n",
"\n",
"for feature_batch, label_batch in ds:\n",
" break\n",
" \n",
"print('Some feature keys:', list(feature_batch.keys())[:5])\n",
"print()\n",
"print('A batch of Ages :', feature_batch['age'])\n",
"print()\n",
"print('A batch of Labels:', label_batch )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But this approach has severly-limited scalability. For larger data it should be streamed off disk.\n",
"the `census_dataset.input_fn` provides an example of how to do this using `tf.decode_csv` and `tf.data.TextLineDataset`: \n",
"\n",
"TODO(markdaoust): This `input_fn` should use `tf.contrib.data.make_csv_dataset`"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"def input_fn(data_file, num_epochs, shuffle, batch_size):\n", "def input_fn(data_file, num_epochs, shuffle, batch_size):\n",
" \"\"\"Generate an input function for the Estimator.\"\"\"\n", " \"\"\"Generate an input function for the Estimator.\"\"\"\n",
" assert tf.gfile.Exists(data_file), (\n", " assert tf.gfile.Exists(data_file), (\n",
" '%s not found. Please make sure you have either run data_download.py or '\n", " '%s not found. Please make sure you have run census_dataset.py and '\n",
" 'set both arguments --train_data and --test_data.' % data_file)\n", " 'set the --data_dir argument to the correct path.' % data_file)\n",
"\n", "\n",
" def parse_csv(value):\n", " def parse_csv(value):\n",
" print('Parsing', data_file)\n", " tf.logging.info('Parsing {}'.format(data_file))\n",
" columns = tf.decode_csv(value, record_defaults=_CSV_COLUMN_DEFAULTS)\n", " columns = tf.decode_csv(value, record_defaults=_CSV_COLUMN_DEFAULTS)\n",
" features = dict(zip(_CSV_COLUMNS, columns))\n", " features = dict(zip(_CSV_COLUMNS, columns))\n",
" labels = features.pop('income_bracket')\n", " labels = features.pop('income_bracket')\n",
" return features, tf.equal(labels, '>50K')\n", " classes = tf.equal(labels, '>50K') # binary classification\n",
" return features, classes\n",
"\n", "\n",
" # Extract lines from input files using the Dataset API.\n", " # Extract lines from input files using the Dataset API.\n",
" dataset = tf.data.TextLineDataset(data_file)\n", " dataset = tf.data.TextLineDataset(data_file)\n",
"\n", "\n",
" if shuffle:\n", " if shuffle:\n",
" dataset = dataset.shuffle(buffer_size=_SHUFFLE_BUFFER)\n", " dataset = dataset.shuffle(buffer_size=_NUM_EXAMPLES['train'])\n",
"\n", "\n",
" dataset = dataset.map(parse_csv, num_parallel_calls=5)\n", " dataset = dataset.map(parse_csv, num_parallel_calls=5)\n",
"\n", "\n",
...@@ -175,10 +494,92 @@ ...@@ -175,10 +494,92 @@
" # epochs from blending together.\n", " # epochs from blending together.\n",
" dataset = dataset.repeat(num_epochs)\n", " dataset = dataset.repeat(num_epochs)\n",
" dataset = dataset.batch(batch_size)\n", " dataset = dataset.batch(batch_size)\n",
" return dataset\n",
"\n"
]
}
],
"source": [
"import inspect\n",
"print(inspect.getsource(census_dataset.input_fn))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This input_fn gives equivalent output:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"INFO:tensorflow:Parsing /tmp/census_data/adult.data\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"WARNING: Logging before flag parsing goes to stderr.\n",
"I0711 14:47:26.362334 140466218788608 tf_logging.py:115] Parsing /tmp/census_data/adult.data\n"
]
}
],
"source": [
"ds = census_dataset.input_fn(train_file, num_epochs=5, shuffle=True, batch_size=10)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Feature keys: ['capital_gain', 'occupation', 'gender', 'capital_loss', 'workclass']\n",
"\n",
"Age batch : tf.Tensor([46 38 42 37 29 48 46 40 73 49], shape=(10,), dtype=int32)\n",
"\n", "\n",
" iterator = dataset.make_one_shot_iterator()\n", "Label batch : tf.Tensor([False False False False False False False False True False], shape=(10,), dtype=bool)\n"
" features, labels = iterator.get_next()\n", ]
" return features, labels" }
],
"source": [
"for feature_batch, label_batch in ds:\n",
" break\n",
" \n",
"print('Feature keys:', list(feature_batch.keys())[:5])\n",
"print()\n",
"print('Age batch :', feature_batch['age'])\n",
"print()\n",
"print('Label batch :', label_batch )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because `Estimators` expect an `input_fn` that takes no arguments, we typically wrap configurable input function into an obejct with the expected signature. For this notebook configure the `train_inpf` to iterate over the data twice:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"import functools\n",
"train_inpf = functools.partial(census_dataset.input_fn, train_file, num_epochs=2, shuffle=True, batch_size=64)\n",
"test_inpf = functools.partial(census_dataset.input_fn, test_file, num_epochs=1, shuffle=False, batch_size=64)"
] ]
}, },
{ {
...@@ -187,6 +588,11 @@ ...@@ -187,6 +588,11 @@
"source": [ "source": [
"## Selecting and Engineering Features for the Model\n", "## Selecting and Engineering Features for the Model\n",
"\n", "\n",
"Estimators use a system called `feature_columns` to describe how the model\n",
"should interpret each of the raw input features. An Estimator exepcts a vector\n",
"of numeric inputs, and feature columns describe how the model shoukld convert\n",
"each feature.\n",
"\n",
"Selecting and crafting the right set of feature columns is key to learning an\n", "Selecting and crafting the right set of feature columns is key to learning an\n",
"effective model. A **feature column** can be either one of the raw columns in\n", "effective model. A **feature column** can be either one of the raw columns in\n",
"the original dataframe (let's call them **base feature columns**), or any new\n", "the original dataframe (let's call them **base feature columns**), or any new\n",
...@@ -195,27 +601,258 @@ ...@@ -195,27 +601,258 @@
"column\" is an abstract concept of any raw or derived variable that can be used\n", "column\" is an abstract concept of any raw or derived variable that can be used\n",
"to predict the target label.\n", "to predict the target label.\n",
"\n", "\n",
"### Base Categorical Feature Columns\n", "### Base Feature Columns\n",
"\n",
"#### Numeric columns\n",
"\n",
"The simplest `feature_column` is `numeric_column`. This indicates that a feature is a numeric value that should be input to the model directly. For example:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"age = fc.numeric_column('age')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model will use the `feature_column` definitions to build the model input. You can inspect the resulting output using the `input_layer` function:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"<tf.Tensor: id=237, shape=(10, 1), dtype=float32, numpy=\n",
"array([[46.],\n",
" [38.],\n",
" [42.],\n",
" [37.],\n",
" [29.],\n",
" [48.],\n",
" [46.],\n",
" [40.],\n",
" [73.],\n",
" [49.]], dtype=float32)>"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fc.input_layer(feature_batch, [age]).numpy()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following code will train and evaluate a model on only the `age` feature."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'precision': 0.29166666, 'auc_precision_recall': 0.31132147, 'average_loss': 0.5239897, 'label/mean': 0.23622628, 'auc': 0.6781367, 'loss': 33.4552, 'prediction/mean': 0.22513431, 'accuracy': 0.7631595, 'recall': 0.0018200728, 'global_step': 1018, 'accuracy_baseline': 0.76377374}\n"
]
}
],
"source": [
"classifier = tf.estimator.LinearClassifier(feature_columns=[age], n_classes=2)\n",
"classifier.train(train_inpf)\n",
"result = classifier.evaluate(test_inpf)\n",
"\n",
"clear_output()\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similarly, we can define a `NumericColumn` for each continuous feature column\n",
"that we want to use in the model:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"education_num = tf.feature_column.numeric_column('education_num')\n",
"capital_gain = tf.feature_column.numeric_column('capital_gain')\n",
"capital_loss = tf.feature_column.numeric_column('capital_loss')\n",
"hours_per_week = tf.feature_column.numeric_column('hours_per_week')"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"my_numeric_columns = [age,education_num, capital_gain, capital_loss, hours_per_week]"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<tf.Tensor: id=2160, shape=(10, 5), dtype=float32, numpy=\n",
"array([[4.600e+01, 0.000e+00, 0.000e+00, 6.000e+00, 4.000e+01],\n",
" [3.800e+01, 4.508e+03, 0.000e+00, 1.300e+01, 4.000e+01],\n",
" [4.200e+01, 0.000e+00, 0.000e+00, 1.400e+01, 4.000e+01],\n",
" [3.700e+01, 0.000e+00, 0.000e+00, 1.100e+01, 4.000e+01],\n",
" [2.900e+01, 0.000e+00, 0.000e+00, 9.000e+00, 4.000e+01],\n",
" [4.800e+01, 0.000e+00, 0.000e+00, 1.300e+01, 5.500e+01],\n",
" [4.600e+01, 0.000e+00, 0.000e+00, 9.000e+00, 5.000e+01],\n",
" [4.000e+01, 0.000e+00, 0.000e+00, 9.000e+00, 4.000e+01],\n",
" [7.300e+01, 6.418e+03, 0.000e+00, 4.000e+00, 9.900e+01],\n",
" [4.900e+01, 0.000e+00, 0.000e+00, 4.000e+00, 4.000e+01]],\n",
" dtype=float32)>"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fc.input_layer(feature_batch, my_numeric_columns).numpy()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You could retrain a model on these features with, just by changing the `feature_columns` argument to the constructor:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"accuracy: 0.7817087\n",
"accuracy_baseline: 0.76377374\n",
"auc: 0.8027547\n",
"auc_precision_recall: 0.5611528\n",
"average_loss: 1.0698086\n",
"global_step: 1018\n",
"label/mean: 0.23622628\n",
"loss: 68.30414\n",
"precision: 0.57025987\n",
"prediction/mean: 0.36397633\n",
"recall: 0.30811232\n"
]
}
],
"source": [
"classifier = tf.estimator.LinearClassifier(feature_columns=my_numeric_columns, n_classes=2)\n",
"classifier.train(train_inpf)\n",
"\n",
"result = classifier.evaluate(test_inpf)\n",
"\n",
"clear_output()\n",
"for key,value in sorted(result.items()):\n",
" print('%s: %s' % (key, value))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Categorical columns\n",
"\n", "\n",
"To define a feature column for a categorical feature, we can create a\n", "To define a feature column for a categorical feature, we can create a\n",
"`CategoricalColumn` using the tf.feature_column API. If you know the set of all\n", "`CategoricalColumn` using one of the `tf.feature_column.categorical_column*` functions.\n",
"possible feature values of a column and there are only a few of them, you can\n", "\n",
"use `categorical_column_with_vocabulary_list`. Each key in the list will get\n", "If you know the set of all possible feature values of a column and there are only a few of them, you can use `categorical_column_with_vocabulary_list`. Each key in the list will get assigned an auto-incremental ID starting from 0. For example, for the `relationship` column we can assign the feature string `Husband` to an integer ID of 0 and \"Not-in-family\" to 1, etc., by doing:"
"assigned an auto-incremental ID starting from 0. For example, for the\n",
"`relationship` column we can assign the feature string \"Husband\" to an integer\n",
"ID of 0 and \"Not-in-family\" to 1, etc., by doing:"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 21,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"relationship = tf.feature_column.categorical_column_with_vocabulary_list(\n", "relationship = fc.categorical_column_with_vocabulary_list(\n",
" 'relationship', [\n", " 'relationship', [\n",
" 'Husband', 'Not-in-family', 'Wife', 'Own-child', 'Unmarried',\n", " 'Husband', 'Not-in-family', 'Wife', 'Own-child', 'Unmarried',\n",
" 'Other-relative'])" " 'Other-relative'])\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This will create a sparse one-hot vector from the raw input feature.\n",
"\n",
"The `input_layer` function we're using for demonstration is designed for DNN models, and so expects dense inputs. To demonstrate the categorical column we must wrap it in a `tf.feature_column.indicator_column` to create the dense one-hot output (Linear `Estimators` can often skip this dense-step).\n",
"\n",
"Note: the other sparse-to-dense option is `tf.feature_column.embedding_column`.\n",
"\n",
"Run the input layer, configured with both the `age` and `relationship` columns:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"<tf.Tensor: id=4490, shape=(10, 7), dtype=float32, numpy=\n",
"array([[46., 0., 0., 0., 0., 1., 0.],\n",
" [38., 1., 0., 0., 0., 0., 0.],\n",
" [42., 0., 1., 0., 0., 0., 0.],\n",
" [37., 1., 0., 0., 0., 0., 0.],\n",
" [29., 1., 0., 0., 0., 0., 0.],\n",
" [48., 1., 0., 0., 0., 0., 0.],\n",
" [46., 1., 0., 0., 0., 0., 0.],\n",
" [40., 1., 0., 0., 0., 0., 0.],\n",
" [73., 1., 0., 0., 0., 0., 0.],\n",
" [49., 1., 0., 0., 0., 0., 0.]], dtype=float32)>"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fc.input_layer(feature_batch, [age, fc.indicator_column(relationship)])"
] ]
}, },
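The vocabulary lookup and the dense one-hot encoding described above can be sketched in plain Python. This is only an illustration of the idea, not how `tf.feature_column` is implemented; the helper names `build_lookup` and `one_hot` are hypothetical, and mapping unknown values to an all-zero vector is an assumption made for this sketch.

```python
# Vocabulary copied from the `relationship` column defined above.
VOCABULARY = ['Husband', 'Not-in-family', 'Wife', 'Own-child', 'Unmarried',
              'Other-relative']


def build_lookup(vocabulary):
    """Assign each key an auto-incremented integer ID starting from 0."""
    return {key: index for index, key in enumerate(vocabulary)}


def one_hot(value, lookup):
    """Return a dense one-hot list for `value`.

    Assumption for this sketch: values outside the vocabulary produce
    an all-zero vector.
    """
    vector = [0.0] * len(lookup)
    if value in lookup:
        vector[lookup[value]] = 1.0
    return vector


lookup = build_lookup(VOCABULARY)
print(lookup['Husband'], lookup['Not-in-family'])  # 0 1
print(one_hot('Wife', lookup))  # [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```

Each row of the `input_layer` output above has the same layout: the raw `age` value followed by six one-hot slots, one per vocabulary entry.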
{ {
...@@ -228,7 +865,7 @@ ...@@ -228,7 +865,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 24,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
...@@ -241,23 +878,103 @@ ...@@ -241,23 +878,103 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"What will happen is that each possible value in the feature column `occupation`\n", "What will happen is that each possible value in the feature column `occupation`\n",
"will be hashed to an integer ID as we encounter them in training. See an example\n", "will be hashed to an integer ID as we encounter them in training. The example batch has a few different occupations:"
"illustration below:\n", ]
"\n", },
"ID | Feature\n", {
"--- | -------------\n", "cell_type": "code",
"... |\n", "execution_count": 25,
"9 | `\"Machine-op-inspct\"`\n", "metadata": {},
"... |\n", "outputs": [
"103 | `\"Farming-fishing\"`\n", {
"... |\n", "name": "stdout",
"375 | `\"Protective-serv\"`\n", "output_type": "stream",
"... |\n", "text": [
"Machine-op-inspct\n",
"Transport-moving\n",
"Prof-specialty\n",
"Adm-clerical\n",
"Handlers-cleaners\n",
"Prof-specialty\n",
"Other-service\n",
"Farming-fishing\n",
"Farming-fishing\n",
"Handlers-cleaners\n"
]
}
],
"source": [
"for item in feature_batch['occupation'].numpy():\n",
" print(item.decode())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"if we run `input_layer` with the hashed column we see that the output shape is `(batch_size, hash_bucket_size)`"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(10, 1000)"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"occupation_result = fc.input_layer(feature_batch, [fc.indicator_column(occupation)])\n",
"\n",
"occupation_result.numpy().shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's easier to see the actual results if we take the tf.argmax over the `hash_bucket_size` dimension.\n",
"\n",
"In the output below, note how any duplicate occupations are mapped to the same pseudo-random index:\n",
"\n", "\n",
"Note: Hash collisions are unavoidable, but often have minimal impact on model quiality. The effeect may be noticable if the hash buckets are being used to compress the input space. See [this notebook](https://colab.research.google.com/github/tensorflow/models/blob/master/samples/outreach/blogs/housing_prices.ipynb) for a more visual example of the effect of these hash collisions."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([911, 420, 979, 96, 10, 979, 527, 936, 936, 10])"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tf.argmax(occupation_result, axis=1).numpy()"
]
},
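To make the hashing behavior concrete, here is a minimal plain-Python sketch. It uses `hashlib.md5` as a stand-in hash function, which is an assumption: TensorFlow uses its own fingerprint function, so the bucket IDs produced here will not match the output above. The helper name `hash_bucket` is hypothetical.

```python
import hashlib


def hash_bucket(value, hash_bucket_size):
    """Map a string to a stable bucket ID in [0, hash_bucket_size)."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % hash_bucket_size


# The occupations from the example batch printed earlier.
occupations = [
    'Machine-op-inspct', 'Transport-moving', 'Prof-specialty', 'Adm-clerical',
    'Handlers-cleaners', 'Prof-specialty', 'Other-service', 'Farming-fishing',
    'Farming-fishing', 'Handlers-cleaners']

# Duplicate strings always land in the same bucket.
bucket_ids = [hash_bucket(item, 1000) for item in occupations]
print(bucket_ids)

# With far fewer buckets than distinct values, collisions are guaranteed:
# seven distinct occupations cannot fit into three buckets without sharing.
distinct = sorted(set(occupations))
small_ids = {hash_bucket(item, 3) for item in distinct}
print(len(distinct), 'distinct values share', len(small_ids), 'buckets')
```

The second half shows the compression case mentioned in the note: once `hash_bucket_size` is smaller than the number of distinct values, several values must share a single weight.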
{
"cell_type": "markdown",
"metadata": {},
"source": [
"No matter which way we choose to define a `SparseColumn`, each feature string\n", "No matter which way we choose to define a `SparseColumn`, each feature string\n",
"will be mapped into an integer ID by looking up a fixed mapping or by hashing.\n", "will be mapped into an integer ID by looking up a fixed mapping or by hashing.\n",
"Note that hashing collisions are possible, but may not significantly impact the\n", "Under the hood, the `LinearModel` class is responsible for\n",
"model quality. Under the hood, the `LinearModel` class is responsible for\n",
"managing the mapping and creating `tf.Variable` to store the model parameters\n", "managing the mapping and creating `tf.Variable` to store the model parameters\n",
"(also known as model weights) for each feature ID. The model parameters will be\n", "(also known as model weights) for each feature ID. The model parameters will be\n",
"learned through the model training process we'll go through later.\n", "learned through the model training process we'll go through later.\n",
...@@ -267,7 +984,7 @@ ...@@ -267,7 +984,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 29,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
...@@ -282,49 +999,68 @@ ...@@ -282,49 +999,68 @@
" 'Married-civ-spouse', 'Divorced', 'Married-spouse-absent',\n", " 'Married-civ-spouse', 'Divorced', 'Married-spouse-absent',\n",
" 'Never-married', 'Separated', 'Married-AF-spouse', 'Widowed'])\n", " 'Never-married', 'Separated', 'Married-AF-spouse', 'Widowed'])\n",
"\n", "\n",
"relationship = tf.feature_column.categorical_column_with_vocabulary_list(\n",
" 'relationship', [\n",
" 'Husband', 'Not-in-family', 'Wife', 'Own-child', 'Unmarried',\n",
" 'Other-relative'])\n",
"\n",
"workclass = tf.feature_column.categorical_column_with_vocabulary_list(\n", "workclass = tf.feature_column.categorical_column_with_vocabulary_list(\n",
" 'workclass', [\n", " 'workclass', [\n",
" 'Self-emp-not-inc', 'Private', 'State-gov', 'Federal-gov',\n", " 'Self-emp-not-inc', 'Private', 'State-gov', 'Federal-gov',\n",
" 'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked'])\n", " 'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked'])\n"
"\n", ]
"# To show an example of hashing:\n", },
"occupation = tf.feature_column.categorical_column_with_hash_bucket(\n", {
" 'occupation', hash_bucket_size=1000)" "cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"my_categorical_columns = [relationship, occupation, education, marital_status, workclass]"
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"### Base Continuous Feature Columns\n", "It's easy to use both sets of columns to configure a model that uses all these features:"
"\n",
"Similarly, we can define a `NumericColumn` for each continuous feature column\n",
"that we want to use in the model:"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 31,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"accuracy: 0.83342546\n",
"accuracy_baseline: 0.76377374\n",
"auc: 0.8807037\n",
"auc_precision_recall: 0.6601031\n",
"average_loss: 0.8671454\n",
"global_step: 1018\n",
"label/mean: 0.23622628\n",
"loss: 55.36468\n",
"precision: 0.6496042\n",
"prediction/mean: 0.2628341\n",
"recall: 0.6401456\n"
]
}
],
"source": [ "source": [
"age = tf.feature_column.numeric_column('age')\n", "classifier = tf.estimator.LinearClassifier(feature_columns=my_numeric_columns+my_categorical_columns, n_classes=2)\n",
"education_num = tf.feature_column.numeric_column('education_num')\n", "classifier.train(train_inpf)\n",
"capital_gain = tf.feature_column.numeric_column('capital_gain')\n", "result = classifier.evaluate(test_inpf)\n",
"capital_loss = tf.feature_column.numeric_column('capital_loss')\n", "\n",
"hours_per_week = tf.feature_column.numeric_column('hours_per_week')" "clear_output()\n",
"for key,value in sorted(result.items()):\n",
" print('%s: %s' % (key, value))"
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"### Making Continuous Features Categorical through Bucketization\n", "### Derived feature columns\n",
"\n",
"#### Making Continuous Features Categorical through Bucketization\n",
"\n", "\n",
"Sometimes the relationship between a continuous feature and the label is not\n", "Sometimes the relationship between a continuous feature and the label is not\n",
"linear. As a hypothetical example, a person's income may grow with age in the\n", "linear. As a hypothetical example, a person's income may grow with age in the\n",
...@@ -347,7 +1083,7 @@ ...@@ -347,7 +1083,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 32,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
...@@ -363,7 +1099,44 @@ ...@@ -363,7 +1099,44 @@
"10 boundaries, resulting in 11 age group buckets (from age 17 and below, 18-24,\n", "10 boundaries, resulting in 11 age group buckets (from age 17 and below, 18-24,\n",
"25-29, ..., to 65 and over).\n", "25-29, ..., to 65 and over).\n",
"\n", "\n",
"### Intersecting Multiple Columns with CrossedColumn\n", "With bucketing, the model sees each bucket a one-hot feature:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[46., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],\n",
" [38., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],\n",
" [42., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],\n",
" [37., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],\n",
" [29., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],\n",
" [48., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],\n",
" [46., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],\n",
" [40., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],\n",
" [73., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],\n",
" [49., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.]],\n",
" dtype=float32)"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fc.input_layer(feature_batch, [age, age_buckets]).numpy()"
]
},
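The bucket assignment can be reproduced with the standard library's `bisect` module. This is a sketch only: the boundary list below is an assumption chosen to match the 11 buckets described above (17 and below, 18-24, 25-29, ..., 65 and over), and the helper names are hypothetical.

```python
import bisect

# Assumed boundaries, consistent with the 11 buckets described above.
AGE_BOUNDARIES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]


def bucket_index(age, boundaries=AGE_BOUNDARIES):
    """Return the bucket index; each bucket includes its lower boundary."""
    return bisect.bisect_right(boundaries, age)


def bucket_one_hot(age, boundaries=AGE_BOUNDARIES):
    """Return a one-hot list with len(boundaries) + 1 slots."""
    vector = [0.0] * (len(boundaries) + 1)
    vector[bucket_index(age, boundaries)] = 1.0
    return vector


for age in [46, 38, 29, 73]:
    print(age, bucket_index(age))
# 46 6
# 38 4
# 29 2
# 73 10
```

These indices line up with the positions of the `1.` entries in the `input_layer` output above, once the leading raw `age` column is skipped.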
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Learn complex relationships with crossed column\n",
"\n", "\n",
"Using each base feature column separately may not be enough to explain the data.\n", "Using each base feature column separately may not be enough to explain the data.\n",
"For example, the correlation between education and the label (earning > 50,000\n", "For example, the correlation between education and the label (earning > 50,000\n",
...@@ -378,7 +1151,7 @@ ...@@ -378,7 +1151,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 34,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
...@@ -390,15 +1163,15 @@ ...@@ -390,15 +1163,15 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"We can also create a `CrossedColumn` over more than two columns. Each\n", "We can also create a `crossed_column` over more than two columns. Each\n",
"constituent column can be either a base feature column that is categorical\n", "constituent column can be either a base feature column that is categorical\n",
"(`SparseColumn`), a bucketized real-valued feature column (`BucketizedColumn`),\n", "(`SparseColumn`), a bucketized real-valued feature column, or even another\n",
"or even another `CrossColumn`. Here's an example:" "`CrossColumn`. Here's an example:"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 35,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
...@@ -406,6 +1179,16 @@ ...@@ -406,6 +1179,16 @@
" [age_buckets, 'education', 'occupation'], hash_bucket_size=1000)" " [age_buckets, 'education', 'occupation'], hash_bucket_size=1000)"
] ]
}, },
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These crossed columns always use hash buckets to avoid the exponential explosion in the number of categories, and put the control over number of model weights in the hands of the user.\n",
"\n",
"For a visual example the effect of hash-buckets with crossed columns see [this notebook](https://colab.research.google.com/github/tensorflow/models/blob/master/samples/outreach/blogs/housing_prices.ipynb)\n",
"\n"
]
},
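A rough way to picture a crossed column is to join the constituent values into one key and hash that key into a fixed number of buckets. The sketch below does this in plain Python. It is an assumption-laden illustration, not TensorFlow's implementation: `hashlib.md5`, the `_X_` separator, and the helper names `crossed_bucket` and `full_cross_size` are all choices made here.

```python
import hashlib


def crossed_bucket(values, hash_bucket_size):
    """Hash a combination of feature values into a single bucket ID."""
    key = '_X_'.join(str(value) for value in values)
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    return int(digest, 16) % hash_bucket_size


def full_cross_size(*vocabulary_sizes):
    """Number of categories an unhashed cross would need."""
    size = 1
    for count in vocabulary_sizes:
        size *= count
    return size


# Cross an age bucket index with an occupation string.
print(crossed_bucket([6, 'Prof-specialty'], 1000))
print(crossed_bucket([6, 'Farming-fishing'], 1000))

# Crossing the 11 age buckets with relationship (6 values) and
# workclass (9 values) would need this many weights without hashing:
print(full_cross_size(11, 6, 9))  # 594
```

The product grows with every column added to the cross, while `hash_bucket_size` keeps the number of weights fixed at whatever value we choose.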
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
...@@ -428,10 +1211,41 @@ ...@@ -428,10 +1211,41 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 36,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"INFO:tensorflow:Using default config.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"I0711 14:48:54.071429 140466218788608 tf_logging.py:115] Using default config.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"INFO:tensorflow:Using config: {'_global_id_in_cluster': 0, '_is_chief': True, '_keep_checkpoint_every_n_hours': 10000, '_tf_random_seed': None, '_num_worker_replicas': 1, '_device_fn': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fc03341f668>, '_evaluation_master': '', '_train_distribute': None, '_model_dir': '/tmp/tmpligbanno', '_session_config': None, '_save_checkpoints_steps': None, '_master': '', '_num_ps_replicas': 0, '_task_type': 'worker', '_log_step_count_steps': 100, '_save_summary_steps': 100, '_service': None, '_task_id': 0, '_save_checkpoints_secs': 600, '_keep_checkpoint_max': 5}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"I0711 14:48:54.073915 140466218788608 tf_logging.py:115] Using config: {'_global_id_in_cluster': 0, '_is_chief': True, '_keep_checkpoint_every_n_hours': 10000, '_tf_random_seed': None, '_num_worker_replicas': 1, '_device_fn': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fc03341f668>, '_evaluation_master': '', '_train_distribute': None, '_model_dir': '/tmp/tmpligbanno', '_session_config': None, '_save_checkpoints_steps': None, '_master': '', '_num_ps_replicas': 0, '_task_type': 'worker', '_log_step_count_steps': 100, '_save_summary_steps': 100, '_service': None, '_task_id': 0, '_save_checkpoints_secs': 600, '_keep_checkpoint_max': 5}\n"
]
}
],
"source": [ "source": [
"import tempfile\n",
"\n",
"base_columns = [\n", "base_columns = [\n",
" education, marital_status, relationship, workclass, occupation,\n", " education, marital_status, relationship, workclass, occupation,\n",
" age_buckets,\n", " age_buckets,\n",
...@@ -453,11 +1267,11 @@ ...@@ -453,11 +1267,11 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"The model also automatically learns a bias term, which controls the prediction\n", "The model also automatically learns a bias term, which controls the prediction\n",
"one would make without observing any features (see the section \"How Logistic\n", "one would make without observing any features (see the section [How Logistic\n",
"Regression Works\" for more explanations). The learned model files will be stored\n", "Regression Works](#how_it_works) for more explanations). The learned model files will be stored\n",
"in `model_dir`.\n", "in `model_dir`.\n",
"\n", "\n",
"## Training and Evaluating Our Model\n", "## Training and evaluating our model\n",
"\n", "\n",
"After adding all the features to the model, now let's look at how to actually\n", "After adding all the features to the model, now let's look at how to actually\n",
"train the model. Training a model is just a single command using the\n", "train the model. Training a model is just a single command using the\n",
...@@ -466,11 +1280,12 @@ ...@@ -466,11 +1280,12 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 38,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"model.train(input_fn=lambda: input_fn(train_data, num_epochs, True, batch_size))" "model.train(train_inpf)\n",
"clear_output()"
] ]
}, },
{ {
...@@ -483,14 +1298,32 @@ ...@@ -483,14 +1298,32 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 39,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"accuracy: 0.84\n",
"accuracy_baseline: 0.76\n",
"auc: 0.88\n",
"auc_precision_recall: 0.70\n",
"average_loss: 0.35\n",
"global_step: 1018.00\n",
"label/mean: 0.24\n",
"loss: 22.37\n",
"precision: 0.69\n",
"prediction/mean: 0.24\n",
"recall: 0.57\n"
]
}
],
"source": [ "source": [
"results = model.evaluate(input_fn=lambda: input_fn(\n", "results = model.evaluate(test_inpf)\n",
" test_data, 1, False, batch_size))\n", "clear_output()\n",
"for key in sorted(results):\n", "for key in sorted(results):\n",
" print('%s: %s' % (key, results[key]))" " print('%s: %0.2f' % (key, results[key]))"
] ]
}, },
{ {
...@@ -498,32 +1331,226 @@ ...@@ -498,32 +1331,226 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"The first line of the final output should be something like\n", "The first line of the final output should be something like\n",
"`accuracy: 0.83557522`, which means the accuracy is 83.6%. Feel free to try more\n", "`accuracy: 0.83`, which means the accuracy is 83%. Feel free to try more\n",
"features and transformations and see if you can do even better!\n", "features and transformations and see if you can do even better!\n",
"\n", "\n",
"After the model is evaluated, we can use the model to predict whether an individual has an annual income of over\n", "After the model is evaluated, we can use the model to predict whether an individual has an annual income of over\n",
"50,000 dollars given an individual's information input." "50,000 dollars given an individual's information input.\n",
"\n",
"Let's look in more detail how the model did:"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 40,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>income_bracket</th>\n",
" <th>predicted_class</th>\n",
" <th>correct</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>&gt;50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>&gt;50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>&gt;50K</td>\n",
" <td>&gt;50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>&gt;50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&gt;50K</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>&gt;50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>&gt;50K</td>\n",
" <td>&gt;50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>&lt;=50K</td>\n",
" <td>&lt;=50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>&gt;50K</td>\n",
" <td>&gt;50K</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" income_bracket predicted_class correct\n",
"0 <=50K <=50K True\n",
"1 <=50K <=50K True\n",
"2 >50K <=50K False\n",
"3 >50K <=50K False\n",
"4 <=50K <=50K True\n",
"5 <=50K <=50K True\n",
"6 <=50K <=50K True\n",
"7 >50K >50K True\n",
"8 <=50K <=50K True\n",
"9 <=50K <=50K True\n",
"10 >50K <=50K False\n",
"11 <=50K >50K False\n",
"12 <=50K <=50K True\n",
"13 <=50K <=50K True\n",
"14 >50K <=50K False\n",
"15 >50K >50K True\n",
"16 <=50K <=50K True\n",
"17 <=50K <=50K True\n",
"18 <=50K <=50K True\n",
"19 >50K >50K True"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [ "source": [
" pred_iter = model.predict(input_fn=lambda: input_fn(FLAGS.test_data, 1, False, 1))\n", "import numpy as np\n",
" for pred in pred_iter:\n", "predict_df = test_df[:20].copy()\n",
" print(pred['classes'])" "\n",
"pred_iter = model.predict(\n",
" lambda:easy_input_function(predict_df, label_key='income_bracket',\n",
" num_epochs=1, shuffle=False, batch_size=10))\n",
"\n",
"classes = np.array(['<=50K', '>50K'])\n",
"pred_class_id = []\n",
"for pred_dict in pred_iter:\n",
" pred_class_id.append(pred_dict['class_ids'])\n",
"\n",
"predict_df['predicted_class'] = classes[np.array(pred_class_id)]\n",
"predict_df['correct'] = predict_df['predicted_class'] == predict_df['income_bracket']\n",
"\n",
"clear_output()\n",
"predict_df[['income_bracket','predicted_class', 'correct']]"
] ]
}, },
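As a sanity check on what metrics such as `accuracy`, `precision`, and `recall` mean, we can recompute them by hand for the 20 rows shown in the table above. This is a plain-Python sketch; it covers only those 20 rows, so the numbers differ from the full test-set metrics reported by `model.evaluate`. Treating `>50K` as the positive class is an assumption made here.

```python
# Labels and predictions copied from the 20-row table above.
labels = [
    '<=50K', '<=50K', '>50K', '>50K', '<=50K', '<=50K', '<=50K', '>50K',
    '<=50K', '<=50K', '>50K', '<=50K', '<=50K', '<=50K', '>50K', '>50K',
    '<=50K', '<=50K', '<=50K', '>50K']
predictions = [
    '<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '>50K',
    '<=50K', '<=50K', '<=50K', '>50K', '<=50K', '<=50K', '<=50K', '>50K',
    '<=50K', '<=50K', '<=50K', '>50K']

POSITIVE = '>50K'  # Assumed positive class for precision and recall.

pairs = list(zip(labels, predictions))
correct = sum(1 for label, pred in pairs if label == pred)
true_positives = sum(
    1 for label, pred in pairs if label == POSITIVE and pred == POSITIVE)
predicted_positives = sum(1 for pred in predictions if pred == POSITIVE)
actual_positives = sum(1 for label in labels if label == POSITIVE)

accuracy = correct / len(pairs)
precision = true_positives / predicted_positives
recall = true_positives / actual_positives

print('accuracy: %0.2f' % accuracy)    # accuracy: 0.75
print('precision: %0.2f' % precision)  # precision: 0.75
print('recall: %0.2f' % recall)        # recall: 0.43
```

On this small sample the model finds only three of the seven `>50K` individuals, which is the same pattern the full evaluation shows: recall is noticeably lower than precision.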
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"The model prediction output would be like `[b'1']` or `[b'0']` which means whether corresponding individual has an annual income of over 50,000 dollars or not.\n",
"\n",
"If you'd like to see a working end-to-end example, you can download our\n", "If you'd like to see a working end-to-end example, you can download our\n",
"[example code](https://github.com/tensorflow/models/tree/master/official/wide_deep/wide_deep.py)\n", "[example code](https://github.com/tensorflow/models/tree/master/official/wide_deep/census_main.py)\n",
"and set the `model_type` flag to `wide`.\n", "and set the `model_type` flag to `wide`.\n",
"\n", "\n",
"## Adding Regularization to Prevent Overfitting\n", "## Adding Regularization to Prevent Overfitting\n",
...@@ -542,16 +1569,42 @@ ...@@ -542,16 +1569,42 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 41,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"accuracy: 0.84\n",
"accuracy_baseline: 0.76\n",
"auc: 0.89\n",
"auc_precision_recall: 0.70\n",
"average_loss: 0.35\n",
"global_step: 2036.00\n",
"label/mean: 0.24\n",
"loss: 22.29\n",
"precision: 0.69\n",
"prediction/mean: 0.24\n",
"recall: 0.56\n"
]
}
],
"source": [ "source": [
"#TODO(markdaoust): is the regularization strength here not working?\n",
"model = tf.estimator.LinearClassifier(\n", "model = tf.estimator.LinearClassifier(\n",
" model_dir=model_dir, feature_columns=base_columns + crossed_columns,\n", " model_dir=model_dir, feature_columns=base_columns + crossed_columns,\n",
" optimizer=tf.train.FtrlOptimizer(\n", " optimizer=tf.train.FtrlOptimizer(\n",
" learning_rate=0.1,\n", " learning_rate=0.1,\n",
" l1_regularization_strength=1.0,\n", " l1_regularization_strength=0.1,\n",
" l2_regularization_strength=1.0))" " l2_regularization_strength=0.1))\n",
"\n",
"model.train(train_inpf)\n",
"\n",
"results = model.evaluate(test_inpf)\n",
"clear_output()\n",
"for key in sorted(results):\n",
" print('%s: %0.2f' % (key, results[key]))"
] ]
}, },
{ {
...@@ -569,8 +1622,14 @@ ...@@ -569,8 +1622,14 @@
"\n", "\n",
"In practice, you should try various combinations of L1, L2 regularization\n", "In practice, you should try various combinations of L1, L2 regularization\n",
"strengths and find the best parameters that best control overfitting and give\n", "strengths and find the best parameters that best control overfitting and give\n",
"you a desirable model size.\n", "you a desirable model size."
"\n", ]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"how_it_works\"> </a>\n",
"## How Logistic Regression Works\n", "## How Logistic Regression Works\n",
"\n", "\n",
"Finally, let's take a minute to talk about what the Logistic Regression model\n", "Finally, let's take a minute to talk about what the Logistic Regression model\n",
...@@ -612,18 +1671,50 @@ ...@@ -612,18 +1671,50 @@
"function measures the discrepancy between the ground-truth label and the model's\n", "function measures the discrepancy between the ground-truth label and the model's\n",
"prediction. If the prediction is very close to the ground-truth label, the loss\n", "prediction. If the prediction is very close to the ground-truth label, the loss\n",
"value will be low; if the prediction is very far from the label, then the loss\n", "value will be low; if the prediction is very far from the label, then the loss\n",
"value would be high.\n", "value would be high."
]
},
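The relationship between prediction and loss can be sketched with a few lines of Python. The example assumes the standard formulation of logistic regression: a sigmoid applied to a weighted sum of features plus a bias, scored with log loss (binary cross-entropy). The weights and feature values below are made up for illustration.

```python
import math


def sigmoid(logit):
    """Squash a real-valued logit into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def predict_probability(features, weights, bias):
    """Logistic regression prediction: sigmoid(bias + sum(w_i * x_i))."""
    logit = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(logit)


def log_loss(label, probability):
    """Binary cross-entropy for one example; `label` is 0 or 1."""
    return -(label * math.log(probability)
             + (1 - label) * math.log(1.0 - probability))


# Hypothetical weights and features, for illustration only.
probability = predict_probability(
    features=[1.0, 0.0, 1.0], weights=[0.8, -0.4, 1.2], bias=-0.5)

print('probability: %0.3f' % probability)
print('loss if label is 1: %0.3f' % log_loss(1, probability))
print('loss if label is 0: %0.3f' % log_loss(0, probability))
```

With no features active, the logit is just the bias, so `sigmoid(bias)` is the prediction the model would make without observing anything. A confident prediction on the wrong side of the label yields a much larger loss than a confident correct one.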
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What Next\n",
"\n", "\n",
"## Learn Deeper\n", "For more about estimators:\n",
"\n", "\n",
"If you're interested in learning more, check out our\n", "- The [TensorFlow Hub transfer-learning tutorial](https://www.tensorflow.org/hub/tutorials/text_classification_with_tf_hub)\n",
"@{$wide_and_deep$Wide & Deep Learning Tutorial} where we'll show you how to\n", "- The [Gradient-boosted-trees estimator tutorial](https://github.com/tensorflow/models/tree/master/official/boosted_trees)\n",
"combine the strengths of linear models and deep neural networks by jointly\n", "- This [blog post]( https://medium.com/tensorflow/classifying-text-with-tensorflow-estimators) on processing text with `Estimators`\n",
"training them using the tf.estimator API." "- How to [build a custom CNN estimator](https://www.tensorflow.org/tutorials/estimators/cnn)"
] ]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
} }
], ],
"metadata": {}, "metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.3"
}
},
"nbformat": 4, "nbformat": 4,
"nbformat_minor": 2 "nbformat_minor": 2
} }