"git@developer.sourcefind.cn:change/sglang.git" did not exist on "54b9a2de0a457709607d6df917d5e6ac5004f72b"
Commit ca0b7cae authored by Billy Lamberta

Fix notebook text

parent 78bd4c9a
...@@ -178,7 +178,7 @@
"source": [
"## The Iris classification problem\n",
"\n",
"Imagine you are a botanist seeking an automated way to categorize each Iris flower you find. Machine learning provides many algorithms to classify flowers statistically. For instance, a sophisticated machine learning program could classify flowers based on photographs. Our ambitions are more modest—we're going to classify Iris flowers based on the length and width measurements of their [sepals](https://en.wikipedia.org/wiki/Sepal) and [petals](https://en.wikipedia.org/wiki/Petal).\n",
"\n",
"The Iris genus entails about 300 species, but our program will only classify the following three:\n", "The Iris genus entails about 300 species, but our program will only classify the following three:\n",
"\n", "\n",
...@@ -208,7 +208,7 @@ ...@@ -208,7 +208,7 @@
"source": [ "source": [
"## Import and parse the training dataset\n", "## Import and parse the training dataset\n",
"\n", "\n",
"Download the dataset file and convert it to a structure that can be used by this Python program.\n", "Download the dataset file and convert it into a structure that can be used by this Python program.\n",
"\n", "\n",
"### Download the dataset\n", "### Download the dataset\n",
"\n", "\n",
...@@ -555,7 +555,7 @@ ...@@ -555,7 +555,7 @@
"\n", "\n",
"### Why model?\n", "### Why model?\n",
"\n", "\n",
"A *[model](https://developers.google.com/machine-learning/crash-course/glossary#model)* is the relationship between features and the label. For the Iris classification problem, the model defines the relationship between the sepal and petal measurements and the predicted Iris species. Some simple models can be described with a few lines of algebra, but complex machine learning models have a large number of parameters that are difficult to summarize.\n", "A *[model](https://developers.google.com/machine-learning/crash-course/glossary#model)* is a relationship between features and the label. For the Iris classification problem, the model defines the relationship between the sepal and petal measurements and the predicted Iris species. Some simple models can be described with a few lines of algebra, but complex machine learning models have a large number of parameters that are difficult to summarize.\n",
"\n", "\n",
"Could you determine the relationship between the four features and the Iris species *without* using machine learning? That is, could you use traditional programming techniques (for example, a lot of conditional statements) to create a model? Perhaps—if you analyzed the dataset long enough to determine the relationships between petal and sepal measurements to a particular species. And this becomes difficult—maybe impossible—on more complicated datasets. A good machine learning approach *determines the model for you*. If you feed enough representative examples into the right machine learning model type, the program will figure out the relationships for you.\n", "Could you determine the relationship between the four features and the Iris species *without* using machine learning? That is, could you use traditional programming techniques (for example, a lot of conditional statements) to create a model? Perhaps—if you analyzed the dataset long enough to determine the relationships between petal and sepal measurements to a particular species. And this becomes difficult—maybe impossible—on more complicated datasets. A good machine learning approach *determines the model for you*. If you feed enough representative examples into the right machine learning model type, the program will figure out the relationships for you.\n",
"\n", "\n",
...@@ -905,7 +905,7 @@ ...@@ -905,7 +905,7 @@
"5. Keep track of some stats for visualization.\n", "5. Keep track of some stats for visualization.\n",
"6. Repeat for each epoch.\n", "6. Repeat for each epoch.\n",
"\n", "\n",
"The `num_epochs` variable is the amount of times to loop over the dataset collection. Counter-intuitively, training a model longer does not guarantee a better model. `num_epochs` is a *[hyperparameter](https://developers.google.com/machine-learning/glossary/#hyperparameter)* that you can tune. Choosing the right number usually requires both experience and experimentation." "The `num_epochs` variable is the number of times to loop over the dataset collection. Counter-intuitively, training a model longer does not guarantee a better model. `num_epochs` is a *[hyperparameter](https://developers.google.com/machine-learning/glossary/#hyperparameter)* that you can tune. Choosing the right number usually requires both experience and experimentation."
] ]
}, },
{ {
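A training loop of the shape described above can be sketched as follows. This is a minimal illustration assuming TensorFlow 2.x eager execution; `model`, `optimizer`, `loss_fn`, and `train_dataset` stand in for the objects built earlier in the notebook and are not defined here.

```python
import tensorflow as tf

num_epochs = 201  # hyperparameter: how many passes to make over the dataset

train_loss_results = []      # stats kept for later visualization
train_accuracy_results = []

for epoch in range(num_epochs):
    epoch_loss_avg = tf.keras.metrics.Mean()
    epoch_accuracy = tf.keras.metrics.SparseCategoricalAccuracy()

    for x, y in train_dataset:                       # iterate over batches of examples
        with tf.GradientTape() as tape:
            y_pred = model(x, training=True)         # make predictions
            loss = loss_fn(y, y_pred)                # compare predictions to labels
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))  # update the model

        epoch_loss_avg.update_state(loss)            # track some stats
        epoch_accuracy.update_state(y, y_pred)

    train_loss_results.append(epoch_loss_avg.result())
    train_accuracy_results.append(epoch_accuracy.result())
```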
...
...@@ -133,7 +133,7 @@
},
"cell_type": "markdown",
"source": [
"In this guide, we will train a neural network model to classify images of clothing, like sneakers and shirts. It's fine if you don't understand all the details, this is a fast-paced overview of a complete TensorFlow program with the details explained as we go.\n", "This guide trains a neural network model to classify images of clothing, like sneakers and shirts. It's okay if you don't understand all the details, this is a fast-paced overview of a complete TensorFlow program with the details explained as we go.\n",
"\n", "\n",
"This guide uses [tf.keras](https://www.tensorflow.org/guide/keras), a high-level API to build and train models in TensorFlow." "This guide uses [tf.keras](https://www.tensorflow.org/guide/keras), a high-level API to build and train models in TensorFlow."
] ]
...@@ -195,9 +195,9 @@ ...@@ -195,9 +195,9 @@
"\n", "\n",
"Fashion MNIST is intended as a drop-in replacement for the classic [MNIST](http://yann.lecun.com/exdb/mnist/) dataset—often used as the \"Hello, World\" of machine learning programs for computer vision. The MNIST dataset contains images of handwritten digits (0, 1, 2, etc) in an identical format to the articles of clothing we'll use here.\n", "Fashion MNIST is intended as a drop-in replacement for the classic [MNIST](http://yann.lecun.com/exdb/mnist/) dataset—often used as the \"Hello, World\" of machine learning programs for computer vision. The MNIST dataset contains images of handwritten digits (0, 1, 2, etc) in an identical format to the articles of clothing we'll use here.\n",
"\n", "\n",
"This guide uses Fashion MNIST for variety, and because it's a slightly more challenging problem than regular MNIST. Both datasets are relatively small and are useful to verify that an algorithm works as expected. They're good starting points to test and debug code. \n", "This guide uses Fashion MNIST for variety, and because it's a slightly more challenging problem than regular MNIST. Both datasets are relatively small and are used to verify that an algorithm works as expected. They're good starting points to test and debug code. \n",
"\n", "\n",
"We will use 60,000 images to train the network and 10,000 images to evaluate how accurately the network learned to classify images. You can access the Fashon MNIST directly from TensorFlow, just import and load the data:" "We will use 60,000 images to train the network and 10,000 images to evaluate how accurately the network learned to classify images. You can access the Fashion MNIST directly from TensorFlow, just import and load the data:"
]
},
{
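The import-and-load step referenced above looks roughly like the following sketch (not necessarily the notebook's exact cell):

```python
import tensorflow as tf

fashion_mnist = tf.keras.datasets.fashion_mnist

# load_data() returns four NumPy arrays: 60,000 training and 10,000 test examples.
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

print(train_images.shape)  # (60000, 28, 28)
print(test_images.shape)   # (10000, 28, 28)
```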
...@@ -229,10 +229,10 @@
"source": [
"Loading the dataset returns four NumPy arrays:\n",
"\n",
"* The `train_images` and `train_labels` arrays are the *training set*, this is the data the model uses to learn.\n", "* The `train_images` and `train_labels` arrays are the *training set*the data the model uses to learn.\n",
"* The model is tested against the *test set*, the `test_images` and `test_labels` arrays.\n", "* The model is tested against the *test set*, the `test_images`, and `test_labels` arrays.\n",
"\n", "\n",
"The images are 28x28 numpy arrays, with pixel values ranging between 0 and 255. The *labels* are an array of integers, ranging from 0 to 9. These correspond to the *class* of clothing the image represents:\n", "The images are 28x28 NumPy arrays, with pixel values ranging between 0 and 255. The *labels* are an array of integers, ranging from 0 to 9. These correspond to the *class* of clothing the image represents:\n",
"\n", "\n",
"<table>\n", "<table>\n",
" <tr>\n", " <tr>\n",
...@@ -485,7 +485,7 @@ ...@@ -485,7 +485,7 @@
}, },
"cell_type": "markdown", "cell_type": "markdown",
"source": [ "source": [
"We will scale these values to a range of 0 to 1 before feeding to the neural network model. For this, cast the datatype of the image components from and integer to a float, and divide by 255. Here's the function to preprocess the images:" "We scale these values to a range of 0 to 1 before feeding to the neural network model. For this, cast the datatype of the image components from an integer to a float, and divide by 255. Here's the function to preprocess the images:"
]
},
{
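A minimal preprocessing function along these lines might look like this sketch (the helper name `preprocess` is illustrative, not taken from the notebook):

```python
import tensorflow as tf

def preprocess(image, label):
    # Cast pixel values from integers to floats, then scale them to the [0, 1] range.
    image = tf.cast(image, tf.float32) / 255.0
    return image, label
```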
...@@ -611,9 +611,9 @@
},
"cell_type": "markdown",
"source": [
"The first layer in this network, `tf.keras.layers.Flatten`, transforms the format of the images from a 2d-array (of 28 by 28 pixels), to a 1d-array of 28 * 28 = 784 pixels. Think of this layer as unstacking rows of pixels in the image and lining them up. This layer has no parameters to learn, it only reformats the data.\n", "The first layer in this network, `tf.keras.layers.Flatten`, transforms the format of the images from a 2d-array (of 28 by 28 pixels), to a 1d-array of 28 * 28 = 784 pixels. Think of this layer as unstacking rows of pixels in the image and lining them up. This layer has no parameters to learn; it only reformats the data.\n",
"\n", "\n",
"After the pixels are flattened, the network consists of a sequence of two `tf.keras.layers.Dense` layers. These are densely-connected, or fully-connected, neural layers. The first `Dense` layer has 128 nodes, or neurons. The second (and last) layer is a 10-node *softmax* layer—this returns an array of 10 probability scores that sum to 1. Each node contains a score that indicates the probability that the current image belongs to one of the 10 digit classes.\n", "After the pixels are flattened, the network consists of a sequence of two `tf.keras.layers.Dense` layers. These are densely-connected, or fully-connected, neural layers. The first `Dense` layer has 128 nodes (or neurons). The second (and last) layer is a 10-node *softmax* layer—this returns an array of 10 probability scores that sum to 1. Each node contains a score that indicates the probability that the current image belongs to one of the 10 digit classes.\n",
"\n", "\n",
"### Compile the model\n", "### Compile the model\n",
"\n", "\n",
...@@ -657,7 +657,7 @@
"\n",
"1. Feed the training data to the model—in this example, the `train_images` and `train_labels` arrays.\n",
"2. The model learns to associate images and labels.\n",
"3. We ask the model to make predictions about a test set—in this example, the `test_images` array. We verify that the predictions match the labels from the `test_labels` array. \n",
"\n",
"To start training, call the `model.fit` method—the model is \"fit\" to the training data:"
]
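Concretely, the training call is along these lines (a sketch; it assumes `model` has already been compiled with an optimizer, a loss function, and metrics, and the epoch count is an illustrative choice):

```python
# "Fit" the model to the training data.
model.fit(train_images, train_labels, epochs=5)
```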
...@@ -797,7 +797,7 @@
},
"cell_type": "markdown",
"source": [
"A prediction is an array of 10 numbers. These describe the \"confidence\" of the model that the image corresponds to each of the 10 different articles of clothing. We can see which label has the highest confidence value:"
]
},
{
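For a single prediction vector, the highest-confidence label can be read off with `np.argmax`, for example:

```python
import numpy as np

predictions = model.predict(test_images)  # shape: (num_test_images, 10)
print(np.argmax(predictions[0]))          # index of the most confident class for the first image
```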
...
...@@ -241,20 +241,20 @@
"The dataset contains 13 different features:\n",
"\n",
"1. Per capita crime rate.\n",
"2. The proportion of residential land zoned for lots over 25,000 square feet.\n",
"3. The proportion of non-retail business acres per town.\n",
"4. Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).\n",
"5. Nitric oxides concentration (parts per 10 million).\n",
"6. The average number of rooms per dwelling.\n",
"7. The proportion of owner-occupied units built before 1940.\n",
"8. Weighted distances to five Boston employment centers.\n",
"9. Index of accessibility to radial highways.\n",
"10. Full-value property-tax rate per $10,000.\n",
"11. Pupil-teacher ratio by town.\n",
"12. 1000 * (Bk - 0.63) ** 2 where Bk is the proportion of Black people by town.\n",
"13. Percentage lower status of the population.\n",
"\n",
"Each one of these input data features is stored using a different scale. Some features are represented by a proportion between 0 and 1, other features are ranges between 1 and 12, some are ranges between 0 and 100, and so on. This is often the case with real-world data, and understanding how to explore and clean such data is an important skill to develop.\n",
"\n",
"Key Point: As a modeler and developer, think about how this data is used and the potential benefits and harm a model's predictions can cause. A model like this could reinforce societal biases and disparities. Is a feature relevant to the problem you want to solve or will it introduce bias? For more information, read about [ML fairness](https://developers.google.com/machine-learning/fairness-overview/)."
]
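One common way to put such differently scaled features on a comparable footing is to standardize each column using statistics computed from the training set only. The following is a sketch of that idea, not necessarily how this notebook does it; `train_data` and `test_data` are the NumPy arrays loaded earlier.

```python
# Standardize each feature to zero mean and unit variance.
mean = train_data.mean(axis=0)
std = train_data.std(axis=0)

train_data = (train_data - mean) / std
test_data = (test_data - mean) / std  # reuse the training-set statistics for the test set
```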
...@@ -272,7 +272,7 @@
},
"cell_type": "code",
"source": [
"print(train_data[0]) # Display sample features, notice the different scales"
],
"execution_count": 0,
"outputs": []
...@@ -397,7 +397,7 @@
"source": [
"## Create the model\n",
"\n",
"Let's build our model. Here, we'll use a `Sequential` model with two densely connected hidden layers, and an output later that returns a single, continuous value. The model building steps are wrapped in a function, `build_model`, since we'll create a second model, later on." "Let's build our model. Here, we'll use a `Sequential` model with two densely connected hidden layers, and an output layer that returns a single, continuous value. The model building steps are wrapped in a function, `build_model`, since we'll create a second model, later on."
]
},
{
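A `build_model` function matching that description might look like the sketch below; the layer widths and optimizer settings are illustrative assumptions, not the notebook's exact values.

```python
import tensorflow as tf

def build_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu',
                              input_shape=(train_data.shape[1],)),  # first hidden layer
        tf.keras.layers.Dense(64, activation='relu'),               # second hidden layer
        tf.keras.layers.Dense(1)                                    # single continuous output
    ])
    model.compile(loss='mse',                                 # mean squared error for regression
                  optimizer=tf.keras.optimizers.RMSprop(0.001),
                  metrics=['mae'])                             # mean absolute error as a metric
    return model
```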
...@@ -629,7 +629,7 @@
"source": [
"## Conclusion\n",
"\n",
"This notebook introduced a few techniques to handle a regression problem.\n",
"\n",
"* Mean Squared Error (MSE) is a common loss function used for regression problems (different than classification problems).\n",
"* Similarly, evaluation metrics used for regression differ from classification. A common regression metric is Mean Absolute Error (MAE).\n",
...
...@@ -205,7 +205,7 @@
},
"cell_type": "markdown",
"source": [
"The argument `num_words=10000` keeps the top 10,000 most frequently occurring words in the training data. The rare words are discarded to keep the size of the data manageable."
]
},
{
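The load call this argument belongs to looks roughly like the following sketch:

```python
import tensorflow as tf

imdb = tf.keras.datasets.imdb

# Keep only the 10,000 most frequently occurring words; rarer words are discarded.
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
```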
...@@ -217,7 +217,7 @@
"source": [
"## Explore the data \n",
"\n",
"Let's take a moment to understand the format of the data. The dataset comes preprocessed: each example is an array of integers representing the words of the movie review. Each label is an integer value of either 0 or 1, where 0 is a negative review, and 1 is a positive review."
]
},
{
...@@ -374,13 +374,13 @@
"source": [
"## Prepare the data\n",
"\n",
"The reviews—the arrays of integers—must be converted to tensors before fed into the neural network. This conversion can be done a couple ways:\n", "The reviews—the arrays of integers—must be converted to tensors before fed into the neural network. This conversion can be done a couple of ways:\n",
"\n", "\n",
"* One-hot-encode the arrays to convert them into vectors of 0s and 1s. For example, the sequence [3, 5] would become a 10,000-dimensional vector that is all zeros except for indices 3 and 5, which are ones. Then, make this the first layer in our network—a Dense layer—that can handle floating point vector data. This approach is memory intensive, though, requiring a `num_words * num_reviews` size matrix.\n", "* One-hot-encode the arrays to convert them into vectors of 0s and 1s. For example, the sequence [3, 5] would become a 10,000-dimensional vector that is all zeros except for indices 3 and 5, which are ones. Then, make this the first layer in our network—a Dense layer—that can handle floating point vector data. This approach is memory intensive, though, requiring a `num_words * num_reviews` size matrix.\n",
"\n", "\n",
"* Alternatively, we can pad the arrays so they all have the same length, then create an integer tensor of shape `num_examples * max_length`. We can use an embedding layer capable of handing this shape as the first layer in our network.\n", "* Alternatively, we can pad the arrays so they all have the same length, then create an integer tensor of shape `num_examples * max_length`. We can use an embedding layer capable of handling this shape as the first layer in our network.\n",
"\n", "\n",
"In this tutorial, we willl use the second approach. \n", "In this tutorial, we will use the second approach. \n",
"\n", "\n",
"Since the movie reviews must be the same length, we will use the [pad_sequences](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences) function to standardize the lengths:" "Since the movie reviews must be the same length, we will use the [pad_sequences](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences) function to standardize the lengths:"
] ]
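A padding step with `pad_sequences` can be sketched as follows; the `maxlen` and padding value are illustrative choices (the notebook itself pads with its own `<PAD>` token index):

```python
import tensorflow as tf

# Pad (or truncate) every review to the same length; maxlen=256 is an illustrative choice.
train_data = tf.keras.preprocessing.sequence.pad_sequences(
    train_data, value=0, padding='post', maxlen=256)
test_data = tf.keras.preprocessing.sequence.pad_sequences(
    test_data, value=0, padding='post', maxlen=256)
```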
...@@ -481,7 +481,7 @@
"* How many layers to use in the model?\n",
"* How many *hidden units* to use for each layer?\n",
"\n",
"In this example, the input data consists of an array of word-indices. The labels to predict are either 0 or 1. Let's build a model for this problem:"
]
},
{
...@@ -523,7 +523,7 @@
"1. The first layer is an `Embedding` layer. This layer takes the integer-encoded vocabulary and looks up the embedding vector for each word-index. These vectors are learned as the model trains. The vectors add a dimension to the output array. The resulting dimensions are: `(batch, sequence, embedding)`.\n",
"2. Next, a `GlobalAveragePooling1D` layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This allows the model can handle input of variable length, in the simplest way possible.\n", "2. Next, a `GlobalAveragePooling1D` layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This allows the model can handle input of variable length, in the simplest way possible.\n",
"3. This fixed-length output vector is piped through a fully-connected (`Dense`) layer with 16 hidden units.\n", "3. This fixed-length output vector is piped through a fully-connected (`Dense`) layer with 16 hidden units.\n",
"4. The last layer is densely connected with a single output node. Using the `sigmoid` activation function, this value is a float between 0 and 1, representing a probabilty, or confidence level." "4. The last layer is densely connected with a single output node. Using the `sigmoid` activation function, this value is a float between 0 and 1, representing a probability, or confidence level."
] ]
}, },
{ {
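The four layers described in that list could be stacked as in the following sketch; the vocabulary size and embedding width are the usual choices for this tutorial but are assumptions here:

```python
import tensorflow as tf

vocab_size = 10000

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 16),       # (batch, sequence) -> (batch, sequence, 16)
    tf.keras.layers.GlobalAveragePooling1D(),        # average over the sequence dimension
    tf.keras.layers.Dense(16, activation='relu'),    # fully-connected layer with 16 hidden units
    tf.keras.layers.Dense(1, activation='sigmoid')   # single probability between 0 and 1
])
```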
...@@ -535,7 +535,7 @@
"source": [
"### Hidden units\n",
"\n",
"The above model has two intermediate or \"hidden\" layers, between the input and output. The number of outputs (units, nodes, or neurons) is the dimension of the representational space for the layer. In other words, the amount of freedom the network is allowed when learning an internal representation.\n",
"\n",
"If a model has more hidden units (a higher-dimensional representation space), and/or more layers, then the network can learn more complex representations. However, it makes the network more computationally expensive and may lead to learning unwanted patterns—patterns that improve performance on training data but not on the test data. This is called *overfitting*, and we'll explore it later."
]
...@@ -549,9 +549,9 @@
"source": [
"### Loss function and optimizer\n",
"\n",
"A model need a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs of a probability (a single-unit layer with a sigmoid activation), we'll use the `binary_crossentropy` loss function. \n", "A model needs a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs of a probability (a single-unit layer with a sigmoid activation), we'll use the `binary_crossentropy` loss function. \n",
"\n", "\n",
"This isn't the only choice of loss function, you could, for instance, choose `mean_squared_error`. But, generally, `binary_crossentropy` is better for dealing with out probabilities—it measures the \"distance\" between probability distributions, or in our case, between the ground-truth distribution and the predictions.\n", "This isn't the only choice for a loss function, you could, for instance, choose `mean_squared_error`. But, generally, `binary_crossentropy` is better for dealing without probabilities—it measures the \"distance\" between probability distributions, or in our case, between the ground-truth distribution and the predictions.\n",
"\n", "\n",
"Later, when we are exploring regression problems (say, to predict the price of a house), we will see how to use another loss function called mean squared error.\n", "Later, when we are exploring regression problems (say, to predict the price of a house), we will see how to use another loss function called mean squared error.\n",
"\n", "\n",
...@@ -809,7 +809,7 @@
"\n",
"This isn't the case for the validation loss and accuracy—they seem to peak after about twenty epochs. This is an example of overfitting: the model performs better on the training data than it does on data it has never seen before. After this point, the model over-optimizes and learns representations *specific* to the training data that do not *generalize* to test data.\n",
"\n",
"For this particular case, we could prevent overfitting by simply stopping the training after twenty or so epochs. Later, you'll see how to do this automatically with a callback."
]
}
]
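As a preview of the callback idea mentioned above, one hedged sketch uses `tf.keras.callbacks.EarlyStopping`; here `partial_x_train`, `partial_y_train`, `x_val`, and `y_val` are stand-ins for the training and validation splits created earlier in the notebook:

```python
import tensorflow as tf

# Stop training once the validation loss has not improved for two consecutive epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2)

history = model.fit(partial_x_train, partial_y_train,
                    epochs=40,
                    validation_data=(x_val, y_val),
                    callbacks=[early_stop])
```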
...
...@@ -144,9 +144,9 @@
"\n",
"The opposite of overfitting is *underfitting*. Underfitting occurs when there is still room for improvement on the test data if you continue to train for more epochs. This means the network has not yet learned all the relevant patterns in the training data. \n",
"\n",
"If you train for too long though, the model will start to overfit and learn patterns from the training data that don't generalize to the test data. We need to strike a balance. Understanding how to train for an appriopriate number of epochs as we'll explore below is a useful skill.\n", "If you train for too long though, the model will start to overfit and learn patterns from the training data that don't generalize to the test data. We need to strike a balance. Understanding how to train for an appropriate number of epochs as we'll explore below is a useful skill.\n",
"\n", "\n",
"To prevent overfitting, the best solution is use more training data. A model trained on more data will naturally generalize better. When that is no longer possible, the next best solution is to use techniques like regularization. These place constaints on the quantity and type of information your model is able to store. If a network can only afford to memorize a small number of patterns, the optimization process will force it to focus on the most prominent patterns, which have a better chance of generalizing well.\n", "To prevent overfitting, the best solution is to use more training data. A model trained on more data will naturally generalize better. When that is no longer possible, the next best solution is to use techniques like regularization. These place constraints on the quantity and type of information your model can store. If a network can only afford to memorize a small number of patterns, the optimization process will force it to focus on the most prominent patterns, which have a better chance of generalizing well.\n",
"\n", "\n",
"In this notebook, we'll explore two common regularization techniques—weight regularization and dropout—and use them to improve our IMDB movie review classification notebook." "In this notebook, we'll explore two common regularization techniques—weight regularization and dropout—and use them to improve our IMDB movie review classification notebook."
] ]
...@@ -582,7 +582,7 @@ ...@@ -582,7 +582,7 @@
"source": [ "source": [
"You may be familiar with Occam's Razor principle: given two explanations for something, the explanation most likely to be correct is the \"simplest\" one, the one that makes the least amount of assumptions. This also applies to the models learned by neural networks: given some training data and a network architecture, there are multiple sets of weights values (multiple models) that could explain the data, and simpler models are less likely to overfit than complex ones.\n", "You may be familiar with Occam's Razor principle: given two explanations for something, the explanation most likely to be correct is the \"simplest\" one, the one that makes the least amount of assumptions. This also applies to the models learned by neural networks: given some training data and a network architecture, there are multiple sets of weights values (multiple models) that could explain the data, and simpler models are less likely to overfit than complex ones.\n",
"\n", "\n",
"A \"simple model\" in this context is a model where the distribution of parameter values has less entropy (or a model with fewer parameters altogether, as we saw in the section above). Thus a common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights to only take small values, which makes the distribution of weight values more \"regular\". This is called \"weight regularization\", and it is done by adding to the loss function of the network a cost associated with having large weights. This cost comes in two flavors:\n", "A \"simple model\" in this context is a model where the distribution of parameter values has less entropy (or a model with fewer parameters altogether, as we saw in the section above). Thus a common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights only to take small values, which makes the distribution of weight values more \"regular\". This is called \"weight regularization\", and it is done by adding to the loss function of the network a cost associated with having large weights. This cost comes in two flavors:\n",
"\n", "\n",
"* L1 regularization, where the cost added is proportional to the absolute value of the weights coefficients (i.e. to what is called the \"L1 norm\" of the weights).\n", "* L1 regularization, where the cost added is proportional to the absolute value of the weights coefficients (i.e. to what is called the \"L1 norm\" of the weights).\n",
"\n", "\n",
...@@ -675,7 +675,7 @@
"source": [
"### Add dropout\n",
"\n",
"Dropout is one of the most effective and most commonly used regularization techniques for neural networks, developed by Hinton and his students at the University of Toronto. Dropout, applied to a layer, consists of randomly \"dropping out\" (i.e. setting to zero) a number of output features of the layer during training. Let's say a given layer would normally have returned a vector [0.2, 0.5, 1.3, 0.8, 1.1] for a given input sample during training; after applying dropout, this vector will have a few zero entries distributed at random, e.g. [0, 0.5, \n", "Dropout is one of the most effective and most commonly used regularization techniques for neural networks, developed by Hinton and his students at the University of Toronto. Dropout, applied to a layer, consists of randomly \"dropping out\" (i.e. set to zero) a number of output features of the layer during training. Let's say a given layer would normally have returned a vector [0.2, 0.5, 1.3, 0.8, 1.1] for a given input sample during training; after applying dropout, this vector will have a few zero entries distributed at random, e.g. [0, 0.5, \n",
"1.3, 0, 1.1]. The \"dropout rate\" is the fraction of the features that are being zeroed-out; it is usually set between 0.2 and 0.5. At test time, no units are dropped out, and instead the layer's output values are scaled down by a factor equal to the dropout rate, so as to balance for the fact that more units are active than at training time.\n", "1.3, 0, 1.1]. The \"dropout rate\" is the fraction of the features that are being zeroed-out; it is usually set between 0.2 and 0.5. At test time, no units are dropped out, and instead the layer's output values are scaled down by a factor equal to the dropout rate, so as to balance for the fact that more units are active than at training time.\n",
"\n", "\n",
"In tf.keras you can introduce dropout in a network via the Dropout layer, which gets applied to the output of layer right before.\n", "In tf.keras you can introduce dropout in a network via the Dropout layer, which gets applied to the output of layer right before.\n",
...@@ -743,15 +743,15 @@
},
"cell_type": "markdown",
"source": [
"Adding dropout is a clear improvement over the baseline model. \n",
"\n",
"\n",
"To recap: here the most common ways to prevent overfitting in neural networks:\n", "To recap: here the most common ways to prevent overfitting in neural networks:\n",
"\n", "\n",
"* Getting more training data.\n", "* Get more training data.\n",
"* Reducing the capacity of the network.\n", "* Reduce the capacity of the network.\n",
"* Adding weight regularization.\n", "* Add weight regularization.\n",
"* Adding dropout.\n", "* Add dropout.\n",
"\n", "\n",
"And two important approaches not covered in this guide are data-augmentation and batch normalization." "And two important approaches not covered in this guide are data-augmentation and batch normalization."
] ]
...
...@@ -134,7 +134,7 @@
},
"cell_type": "markdown",
"source": [
"Model progress can be saved during—and after—training. This means a model can resume where it left off and avoid long training times. Saving also means you can share your model and others can recreate your work. When publishing research models and techniques, most machine learning practitioners share:\n",
"\n",
"* code to create the model, and\n",
"* the trained weights, or parameters, for the model\n",
...@@ -581,7 +581,7 @@
},
"cell_type": "markdown",
"source": [
"The above code stores the weights to a collection of [checkpoint](https://www.tensorflow.org/guide/saved_model#save_and_restore_variables)-formatted files that contain only the trained weights in a binary format. Checkpoints contain:\n",
"* One or more shards that contain your model's weights. \n",
"* An index file that indicates which weights are stored in a which shard. \n", "* An index file that indicates which weights are stored in a which shard. \n",
"\n", "\n",
...@@ -595,7 +595,7 @@ ...@@ -595,7 +595,7 @@
}, },
"cell_type": "markdown", "cell_type": "markdown",
"source": [ "source": [
"## Manualy save weights\n", "## Manually save weights\n",
"\n", "\n",
"Above you saw how to load the weights into a model.\n", "Above you saw how to load the weights into a model.\n",
"\n", "\n",
...