add wide.md from tensorflow/tensorflow

from https://github.com/tensorflow/tensorflow/blob/8a6ef2cb4f98bacc1f821f60c21914b4bd5faaef/tensorflow/docs_src/tutorials/representation/wide.md

add wide.md from tensorflow/tensorflow
from https://github.com/tensorflow/tensorflow/blob/8a6ef2cb4f98bacc1f821f60c21914b4bd5faaef/tensorflow/docs_src/tutorials/representation/wide.md
b0cfb2a6 · Mark Daoust · 31587277 · b0cfb2a6
Commit b0cfb2a6 authored Jul 11, 2018 by Mark Daoust
Hide whitespace changes
Inline Side-by-side

Showing with 461 additions and 0 deletions

samples/core/tutorials/estimators/wide.md samples/core/tutorials/estimators/wide.md +461 -0

No files found.
--- a/samples/core/tutorials/estimators/wide.md
+++ b/samples/core/tutorials/estimators/wide.md
+# TensorFlow Linear Model Tutorial
+In this tutorial, we will use the tf.estimator API in TensorFlow to solve a
+binary classification problem: Given census data about a person such as age,
+education, marital status, and occupation (the features), we will try to predict
+whether or not the person earns more than 50,000 dollars a year (the target
+label). We will train a **logistic regression** model, and given an individual's
+information our model will output a number between 0 and 1, which can be
+interpreted as the probability that the individual has an annual income of over
+50,000 dollars.
+## Setup
+To try the code for this tutorial:
+1.  @{$install$Install TensorFlow} if you haven't already.
+2.  Download [the tutorial code](https://github.com/tensorflow/models/tree/master/official/wide_deep/).
+3. Execute the data download script we provide to you:
+        $ python data_download.py
+4. Execute the tutorial code with the following command to train the linear
+model described in this tutorial:
+        $ python wide_deep.py --model_type=wide
+Read on to find out how this code builds its linear model.
+## Reading The Census Data
+The dataset we'll be using is the
+[Census Income Dataset](https://archive.ics.uci.edu/ml/datasets/Census+Income).
+We have provided
+[data_download.py](https://github.com/tensorflow/models/tree/master/official/wide_deep/data_download.py)
+which downloads the code and performs some additional cleanup.
+Since the task is a binary classification problem, we'll construct a label
+column named "label" whose value is 1 if the income is over 50K, and 0
+otherwise. For reference, see `input_fn` in
+[wide_deep.py](https://github.com/tensorflow/models/tree/master/official/wide_deep/wide_deep.py).
+Next, let's take a look at the dataframe and see which columns we can use to
+predict the target label. The columns can be grouped into two types—categorical
+and continuous columns:
+*   A column is called **categorical** if its value can only be one of the
+    categories in a finite set. For example, the relationship status of a person
+    (wife, husband, unmarried, etc.) or the education level (high school,
+    college, etc.) are categorical columns.
+*   A column is called **continuous** if its value can be any numerical value in
+    a continuous range. For example, the capital gain of a person (e.g. $14,084)
+    is a continuous column.
+Here's a list of columns available in the Census Income dataset:
+| Column Name    | Type        | Description                       |
+| -------------- | ----------- | --------------------------------- |
+| age            | Continuous  | The age of the individual         |
+| workclass      | Categorical | The type of employer the          |
+:                :             : individual has (government,       :
+:                :             : military, private, etc.).         :
+| fnlwgt         | Continuous  | The number of people the census   |
+:                :             : takers believe that observation   :
+:                :             : represents (sample weight). Final :
+:                :             : weight will not be used.          :
+| education      | Categorical | The highest level of education    |
+:                :             : achieved for that individual.     :
+| education_num  | Continuous  | The highest level of education in |
+:                :             : numerical form.                   :
+| marital_status | Categorical | Marital status of the individual. |
+| occupation     | Categorical | The occupation of the individual. |
+| relationship   | Categorical | Wife, Own-child, Husband,         |
+:                :             : Not-in-family, Other-relative,    :
+:                :             : Unmarried.                        :
+| race           | Categorical | Amer-Indian-Eskimo, Asian-Pac-    |
+:                :             : Islander, Black, White, Other.    :
+| gender         | Categorical | Female, Male.                     |
+| capital_gain   | Continuous  | Capital gains recorded.           |
+| capital_loss   | Continuous  | Capital Losses recorded.          |
+| hours_per_week | Continuous  | Hours worked per week.            |
+| native_country | Categorical | Country of origin of the          |
+:                :             : individual.                       :
+| income_bracket | Categorical | ">50K" or "<=50K", meaning        |
+:                :             : whether the person makes more     :
+:                :             : than $50,000 annually.            :
+## Converting Data into Tensors
+When building a tf.estimator model, the input data is specified by means of an
+Input Builder function. This builder function will not be called until it is
+later passed to tf.estimator.Estimator methods such as `train` and `evaluate`.
+The purpose of this function is to construct the input data, which is
+represented in the form of @{tf.Tensor}s or @{tf.SparseTensor}s.
+In more detail, the input builder function returns the following as a pair:
+1.  `features`: A dict from feature column names to `Tensors` or
+    `SparseTensors`.
+2.  `labels`: A `Tensor` containing the label column.
+The keys of the `features` will be used to construct columns in the next
+section. Because we want to call the `train` and `evaluate` methods with
+different data, we define a method that returns an input function based on the
+given data. Note that the returned input function will be called while
+constructing the TensorFlow graph, not while running the graph. What it is
+returning is a representation of the input data as the fundamental unit of
+TensorFlow computations, a `Tensor` (or `SparseTensor`).
+Each continuous column in the train or test data will be converted into a
+`Tensor`, which in general is a good format to represent dense data. For
+categorical data, we must represent the data as a `SparseTensor`. This data
+format is good for representing sparse data. Our `input_fn` uses the `tf.data`
+API, which makes it easy to apply transformations to our dataset:
+```python
+def input_fn(data_file, num_epochs, shuffle, batch_size):
+  """Generate an input function for the Estimator."""
+  assert tf.gfile.Exists(data_file), (
+      '%s not found. Please make sure you have either run data_download.py or '
+      'set both arguments --train_data and --test_data.' % data_file)
+  def parse_csv(value):
+    print('Parsing', data_file)
+    columns = tf.decode_csv(value, record_defaults=_CSV_COLUMN_DEFAULTS)
+    features = dict(zip(_CSV_COLUMNS, columns))
+    labels = features.pop('income_bracket')
+    return features, tf.equal(labels, '>50K')
+  # Extract lines from input files using the Dataset API.
+  dataset = tf.data.TextLineDataset(data_file)
+  if shuffle:
+    dataset = dataset.shuffle(buffer_size=_SHUFFLE_BUFFER)
+  dataset = dataset.map(parse_csv, num_parallel_calls=5)
+  # We call repeat after shuffling, rather than before, to prevent separate
+  # epochs from blending together.
+  dataset = dataset.repeat(num_epochs)
+  dataset = dataset.batch(batch_size)
+  iterator = dataset.make_one_shot_iterator()
+  features, labels = iterator.get_next()
+  return features, labels
+```
+## Selecting and Engineering Features for the Model
+Selecting and crafting the right set of feature columns is key to learning an
+effective model. A **feature column** can be either one of the raw columns in
+the original dataframe (let's call them **base feature columns**), or any new
+columns created based on some transformations defined over one or multiple base
+columns (let's call them **derived feature columns**). Basically, "feature
+column" is an abstract concept of any raw or derived variable that can be used
+to predict the target label.
+### Base Categorical Feature Columns
+To define a feature column for a categorical feature, we can create a
+`CategoricalColumn` using the tf.feature_column API. If you know the set of all
+possible feature values of a column and there are only a few of them, you can
+use `categorical_column_with_vocabulary_list`. Each key in the list will get
+assigned an auto-incremental ID starting from 0. For example, for the
+`relationship` column we can assign the feature string "Husband" to an integer
+ID of 0 and "Not-in-family" to 1, etc., by doing:
+```python
+relationship = tf.feature_column.categorical_column_with_vocabulary_list(
+    'relationship', [
+        'Husband', 'Not-in-family', 'Wife', 'Own-child', 'Unmarried',
+        'Other-relative'])
+```
+What if we don't know the set of possible values in advance? Not a problem. We
+can use `categorical_column_with_hash_bucket` instead:
+```python
+occupation = tf.feature_column.categorical_column_with_hash_bucket(
+    'occupation', hash_bucket_size=1000)
+```
+What will happen is that each possible value in the feature column `occupation`
+will be hashed to an integer ID as we encounter them in training. See an example
+illustration below:
+ID  | Feature
+--- | -------------
+... |
+9   | `"Machine-op-inspct"`
+... |
+103 | `"Farming-fishing"`
+... |
+375 | `"Protective-serv"`
+... |
+No matter which way we choose to define a `SparseColumn`, each feature string
+will be mapped into an integer ID by looking up a fixed mapping or by hashing.
+Note that hashing collisions are possible, but may not significantly impact the
+model quality. Under the hood, the `LinearModel` class is responsible for
+managing the mapping and creating `tf.Variable` to store the model parameters
+(also known as model weights) for each feature ID. The model parameters will be
+learned through the model training process we'll go through later.
+We'll do the similar trick to define the other categorical features:
+```python
+education = tf.feature_column.categorical_column_with_vocabulary_list(
+    'education', [
+        'Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
+        'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',
+        '5th-6th', '10th', '1st-4th', 'Preschool', '12th'])
+marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
+    'marital_status', [
+        'Married-civ-spouse', 'Divorced', 'Married-spouse-absent',
+        'Never-married', 'Separated', 'Married-AF-spouse', 'Widowed'])
+relationship = tf.feature_column.categorical_column_with_vocabulary_list(
+    'relationship', [
+        'Husband', 'Not-in-family', 'Wife', 'Own-child', 'Unmarried',
+        'Other-relative'])
+workclass = tf.feature_column.categorical_column_with_vocabulary_list(
+    'workclass', [
+        'Self-emp-not-inc', 'Private', 'State-gov', 'Federal-gov',
+        'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked'])
+# To show an example of hashing:
+occupation = tf.feature_column.categorical_column_with_hash_bucket(
+    'occupation', hash_bucket_size=1000)
+```
+### Base Continuous Feature Columns
+Similarly, we can define a `NumericColumn` for each continuous feature column
+that we want to use in the model:
+```python
+age = tf.feature_column.numeric_column('age')
+education_num = tf.feature_column.numeric_column('education_num')
+capital_gain = tf.feature_column.numeric_column('capital_gain')
+capital_loss = tf.feature_column.numeric_column('capital_loss')
+hours_per_week = tf.feature_column.numeric_column('hours_per_week')
+```
+### Making Continuous Features Categorical through Bucketization
+Sometimes the relationship between a continuous feature and the label is not
+linear. As a hypothetical example, a person's income may grow with age in the
+early stage of one's career, then the growth may slow at some point, and finally
+the income decreases after retirement. In this scenario, using the raw `age` as
+a real-valued feature column might not be a good choice because the model can
+only learn one of the three cases:
+1.  Income always increases at some rate as age grows (positive correlation),
+1.  Income always decreases at some rate as age grows (negative correlation), or
+1.  Income stays the same no matter at what age (no correlation)
+If we want to learn the fine-grained correlation between income and each age
+group separately, we can leverage **bucketization**. Bucketization is a process
+of dividing the entire range of a continuous feature into a set of consecutive
+bins/buckets, and then converting the original numerical feature into a bucket
+ID (as a categorical feature) depending on which bucket that value falls into.
+So, we can define a `bucketized_column` over `age` as:
+```python
+age_buckets = tf.feature_column.bucketized_column(
+    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
+```
+where the `boundaries` is a list of bucket boundaries. In this case, there are
+10 boundaries, resulting in 11 age group buckets (from age 17 and below, 18-24,
+25-29, ..., to 65 and over).
+### Intersecting Multiple Columns with CrossedColumn
+Using each base feature column separately may not be enough to explain the data.
+For example, the correlation between education and the label (earning > 50,000
+dollars) may be different for different occupations. Therefore, if we only learn
+a single model weight for `education="Bachelors"` and `education="Masters"`, we
+won't be able to capture every single education-occupation combination (e.g.
+distinguishing between `education="Bachelors" AND occupation="Exec-managerial"`
+and `education="Bachelors" AND occupation="Craft-repair"`). To learn the
+differences between different feature combinations, we can add **crossed feature
+columns** to the model.
+```python
+education_x_occupation = tf.feature_column.crossed_column(
+    ['education', 'occupation'], hash_bucket_size=1000)
+```
+We can also create a `CrossedColumn` over more than two columns. Each
+constituent column can be either a base feature column that is categorical
+(`SparseColumn`), a bucketized real-valued feature column (`BucketizedColumn`),
+or even another `CrossColumn`. Here's an example:
+```python
+age_buckets_x_education_x_occupation = tf.feature_column.crossed_column(
+    [age_buckets, 'education', 'occupation'], hash_bucket_size=1000)
+```
+## Defining The Logistic Regression Model
+After processing the input data and defining all the feature columns, we're now
+ready to put them all together and build a Logistic Regression model. In the
+previous section we've seen several types of base and derived feature columns,
+including:
+*   `CategoricalColumn`
+*   `NumericColumn`
+*   `BucketizedColumn`
+*   `CrossedColumn`
+All of these are subclasses of the abstract `FeatureColumn` class, and can be
+added to the `feature_columns` field of a model:
+```python
+base_columns = [
+    education, marital_status, relationship, workclass, occupation,
+    age_buckets,
+]
+crossed_columns = [
+    tf.feature_column.crossed_column(
+        ['education', 'occupation'], hash_bucket_size=1000),
+    tf.feature_column.crossed_column(
+        [age_buckets, 'education', 'occupation'], hash_bucket_size=1000),
+]
+model_dir = tempfile.mkdtemp()
+model = tf.estimator.LinearClassifier(
+    model_dir=model_dir, feature_columns=base_columns + crossed_columns)
+```
+The model also automatically learns a bias term, which controls the prediction
+one would make without observing any features (see the section "How Logistic
+Regression Works" for more explanations). The learned model files will be stored
+in `model_dir`.
+## Training and Evaluating Our Model
+After adding all the features to the model, now let's look at how to actually
+train the model. Training a model is just a single command using the
+tf.estimator API:
+```python
+model.train(input_fn=lambda: input_fn(train_data, num_epochs, True, batch_size))
+```
+After the model is trained, we can evaluate how good our model is at predicting
+the labels of the holdout data:
+```python
+results = model.evaluate(input_fn=lambda: input_fn(
+    test_data, 1, False, batch_size))
+for key in sorted(results):
+  print('%s: %s' % (key, results[key]))
+```
+The first line of the final output should be something like
+`accuracy: 0.83557522`, which means the accuracy is 83.6%. Feel free to try more
+features and transformations and see if you can do even better!
+After the model is evaluated, we can use the model to predict whether an individual has an annual income of over
+50,000 dollars given an individual's information input.
+```python
+  pred_iter = model.predict(input_fn=lambda: input_fn(FLAGS.test_data, 1, False, 1))
+  for pred in pred_iter:
+    print(pred['classes'])
+```
+The model prediction output would be like `[b'1']` or `[b'0']` which means whether corresponding individual has an annual income of over 50,000 dollars or not.
+If you'd like to see a working end-to-end example, you can download our
+[example code](https://github.com/tensorflow/models/tree/master/official/wide_deep/wide_deep.py)
+and set the `model_type` flag to `wide`.
+## Adding Regularization to Prevent Overfitting
+Regularization is a technique used to avoid **overfitting**. Overfitting happens
+when your model does well on the data it is trained on, but worse on test data
+that the model has not seen before, such as live traffic. Overfitting generally
+occurs when a model is excessively complex, such as having too many parameters
+relative to the number of observed training data. Regularization allows for you
+to control your model's complexity and makes the model more generalizable to
+unseen data.
+In the Linear Model library, you can add L1 and L2 regularizations to the model
+as:
+```
+model = tf.estimator.LinearClassifier(
+    model_dir=model_dir, feature_columns=base_columns + crossed_columns,
+    optimizer=tf.train.FtrlOptimizer(
+        learning_rate=0.1,
+        l1_regularization_strength=1.0,
+        l2_regularization_strength=1.0))
+```
+One important difference between L1 and L2 regularization is that L1
+regularization tends to make model weights stay at zero, creating sparser
+models, whereas L2 regularization also tries to make the model weights closer to
+zero but not necessarily zero. Therefore, if you increase the strength of L1
+regularization, you will have a smaller model size because many of the model
+weights will be zero. This is often desirable when the feature space is very
+large but sparse, and when there are resource constraints that prevent you from
+serving a model that is too large.
+In practice, you should try various combinations of L1, L2 regularization
+strengths and find the best parameters that best control overfitting and give
+you a desirable model size.
+## How Logistic Regression Works
+Finally, let's take a minute to talk about what the Logistic Regression model
+actually looks like in case you're not already familiar with it. We'll denote
+the label as \\(Y\\), and the set of observed features as a feature vector
+\\(\mathbf{x}=[x_1, x_2, ..., x_d]\\). We define \\(Y=1\\) if an individual
+earned > 50,000 dollars and \\(Y=0\\) otherwise. In Logistic Regression, the
+probability of the label being positive (\\(Y=1\\)) given the features
+\\(\mathbf{x}\\) is given as:
+$$ P(Y=1|\mathbf{x}) = \frac{1}{1+\exp(-(\mathbf{w}^T\mathbf{x}+b))}$$
+where \\(\mathbf{w}=[w_1, w_2, ..., w_d]\\) are the model weights for the
+features \\(\mathbf{x}=[x_1, x_2, ..., x_d]\\). \\(b\\) is a constant that is
+often called the **bias** of the model. The equation consists of two parts—A
+linear model and a logistic function:
+*   **Linear Model**: First, we can see that \\(\mathbf{w}^T\mathbf{x}+b = b +
+    w_1x_1 + ... +w_dx_d\\) is a linear model where the output is a linear
+    function of the input features \\(\mathbf{x}\\). The bias \\(b\\) is the
+    prediction one would make without observing any features. The model weight
+    \\(w_i\\) reflects how the feature \\(x_i\\) is correlated with the positive
+    label. If \\(x_i\\) is positively correlated with the positive label, the
+    weight \\(w_i\\) increases, and the probability \\(P(Y=1|\mathbf{x})\\) will
+    be closer to 1. On the other hand, if \\(x_i\\) is negatively correlated
+    with the positive label, then the weight \\(w_i\\) decreases and the
+    probability \\(P(Y=1|\mathbf{x})\\) will be closer to 0.
+*   **Logistic Function**: Second, we can see that there's a logistic function
+    (also known as the sigmoid function) \\(S(t) = 1/(1+\exp(-t))\\) being
+    applied to the linear model. The logistic function is used to convert the
+    output of the linear model \\(\mathbf{w}^T\mathbf{x}+b\\) from any real
+    number into the range of \\([0, 1]\\), which can be interpreted as a
+    probability.
+Model training is an optimization problem: The goal is to find a set of model
+weights (i.e. model parameters) to minimize a **loss function** defined over the
+training data, such as logistic loss for Logistic Regression models. The loss
+function measures the discrepancy between the ground-truth label and the model's
+prediction. If the prediction is very close to the ground-truth label, the loss
+value will be low; if the prediction is very far from the label, then the loss
+value would be high.
+## Learn Deeper
+If you're interested in learning more, check out our
+@{$wide_and_deep$Wide & Deep Learning Tutorial} where we'll show you how to
+combine the strengths of linear models and deep neural networks by jointly
+training them using the tf.estimator API.