"git@developer.sourcefind.cn:modelzoo/resnet50_tensorflow.git" did not exist on "e5b652ef5f93db1ef1afa526faa9db15d73a1f77"
Commit 61ec6026 authored by Hyungjun Lim, committed by Karmel Allison

Sentiment analysis implementation in pure Keras. (#4806)

* Sentiment analysis implementation in pure Keras.

- This is an update of the sentiment analysis model to a pure Keras version.
-- It converts the model from the TensorFlow Estimator version, which had an issue that negatively affected the model's accuracy.
- The implementation references the Paddle version.
-- The hyperparameters were adjusted to achieve an accuracy of ~90%.

* addressing comments

* addressing comments

also adding

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

to each module, as it appears to be the standard in this repo.

* addressing the final comment.
parent 76d3d72e
@@ -2,7 +2,7 @@
## Overview
This is an implementation of the sentiment analysis model described in [this paper](https://arxiv.org/abs/1412.1058). The implementation references the [Paddle version](https://github.com/mlperf/reference/tree/master/sentiment_analysis/paddle).
The model concatenates two CNN layers with different kernel sizes. Batch normalization and dropout layers are used to prevent over-fitting.

## Dataset
The [Keras](https://keras.io) [IMDB movie reviews sentiment classification](https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification) dataset is used. The download is handled by the Keras module, and the downloaded files are stored in the `~/.keras/datasets` directory. The compressed file is about 17 MB as of June 15, 2018.

@@ -14,10 +14,9 @@ To train and evaluate the model, issue the following command:
```
python sentiment_main.py
```
Arguments:
* `--vocabulary_size`: The number of words included in the dataset. The most frequent words are chosen. The default is 6000.
* `--sentence_length`: The number of words in each sentence. Longer sentences are truncated, shorter ones padded.
* `--dataset`: The dataset name to be downloaded and preprocessed. By default, it is `imdb`.

There are other arguments for the model and the training process. Use the `--help` or `-h` flag to get a full list of possible arguments with detailed descriptions.
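For example, the following invocation (the flag values here are only illustrative, not tuned recommendations) trains with a shorter sentence length and fewer epochs:
```
python sentiment_main.py --sentence_length 400 --batch_size 250 --epochs 10
```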
## Benchmarks
The model achieves an accuracy of 90.1% on the IMDB dataset.
"""Dataset module for sentiment analysis.
Currently imdb dataset is available.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import data.imdb as imdb import data.imdb as imdb
DATASET_IMDB = "imdb" DATASET_IMDB = "imdb"
def construct_input_fns(dataset, batch_size, vocabulary_size, def load(dataset, vocabulary_size, sentence_length):
sentence_length, repeat=1): """Returns training and evaluation input.
"""Returns training and evaluation input functions.
Args: Args:
dataset: Dataset to be trained and evaluated. dataset: Dataset to be trained and evaluated.
Currently only imdb is supported. Currently only imdb is supported.
batch_size: Number of data in each batch.
vocabulary_size: The number of the most frequent tokens vocabulary_size: The number of the most frequent tokens
to be used from the corpus. to be used from the corpus.
sentence_length: The number of words in each sentence. sentence_length: The number of words in each sentence.
Longer sentences get cut, shorter ones padded. Longer sentences get cut, shorter ones padded.
repeat: The number of epoch.
Raises: Raises:
ValueError: if the dataset value is not valid. ValueError: if the dataset value is not valid.
Returns: Returns:
A tuple of training and evaluation input function. A tuple of length 4, for training sentences, labels,
evaluation sentences, and evaluation labels,
each being an numpy array.
""" """
if dataset == DATASET_IMDB: if dataset == DATASET_IMDB:
train_input_fn, eval_input_fn = imdb.construct_input_fns( return imdb.load(vocabulary_size, sentence_length)
vocabulary_size, sentence_length, batch_size, repeat=repeat)
return train_input_fn, eval_input_fn
else: else:
raise ValueError("unsupported dataset: " + dataset) raise ValueError("unsupported dataset: " + dataset)
@@ -38,7 +44,7 @@ def get_num_class(dataset):
  Raises:
    ValueError: if the dataset value is not valid.

  Returns:
    int: The number of label classes.
  """
  if dataset == DATASET_IMDB:
    return imdb.NUM_CLASS
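As a usage sketch for the loader above (the argument values are illustrative, not the training defaults), the returned tuple unpacks into padded word-id arrays and one-hot labels:
```
from data import dataset

x_train, y_train, x_test, y_test = dataset.load(
    dataset.DATASET_IMDB, vocabulary_size=6000, sentence_length=200)

print(x_train.shape)  # (num_train_reviews, 200): padded word-id sequences
print(y_train.shape)  # (num_train_reviews, 2): one-hot labels
print(dataset.get_num_class(dataset.DATASET_IMDB))  # 2
```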

data/imdb.py

"""IMDB Dataset module for sentiment analysis."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import tensorflow as tf

from data.util import OOV_CHAR
from data.util import pad_sentence
from data.util import START_CHAR

NUM_CLASS = 2


def load(vocabulary_size, sentence_length):
  """Returns training and evaluation input for the imdb dataset.

  Args:
    vocabulary_size: The number of the most frequent tokens
      to be used from the corpus.
    sentence_length: The number of words in each sentence.
      Longer sentences get cut, shorter ones padded.

  Raises:
    ValueError: if the dataset value is not valid.

  Returns:
    A tuple of length 4, for training and evaluation data,
    each being a numpy array.
  """
  (x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(
      path="imdb.npz",
@@ -28,19 +36,19 @@ def construct_input_fns(vocabulary_size, sentence_length,
      seed=113,
      start_char=START_CHAR,
      oov_char=OOV_CHAR,
      index_from=OOV_CHAR + 1)

  # Truncate or pad every review to sentence_length.
  x_train_processed = []
  for sen in x_train:
    sen = pad_sentence(sen, sentence_length)
    x_train_processed.append(np.array(sen))
  x_train_processed = np.array(x_train_processed)

  x_test_processed = []
  for sen in x_test:
    sen = pad_sentence(sen, sentence_length)
    x_test_processed.append(np.array(sen))
  x_test_processed = np.array(x_test_processed)

  # Labels are one-hot encoded via identity-matrix indexing.
  return x_train_processed, np.eye(NUM_CLASS)[y_train], \
      x_test_processed, np.eye(NUM_CLASS)[y_test]
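The `np.eye(NUM_CLASS)[labels]` indexing above is a compact one-hot encoding. A small self-contained example:
```
import numpy as np

labels = np.array([0, 1, 1])
print(np.eye(2)[labels])
# [[1. 0.]
#  [0. 1.]
#  [0. 1.]]
```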
"""Utility module for sentiment analysis."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np import numpy as np
import tensorflow as tf
START_CHAR = 1 START_CHAR = 1
END_CHAR = 2 END_CHAR = 2
OOV_CHAR = 3 OOV_CHAR = 3
def pad_sentence(sen, sentence_length): def pad_sentence(sentence, sentence_length):
sen = sen[:sentence_length] """Pad the given sentense at the end.
if len(sen) < sentence_length:
sen = np.pad(sen, (0, sentence_length - len(sen)), "constant",
constant_values=(START_CHAR, END_CHAR))
return sen
def to_dataset(x, y, batch_size, repeat): If the input is longer than sentence_length,
dataset = tf.data.Dataset.from_tensor_slices((x, y)) the remaining portion is dropped.
END_CHAR is used for the padding.
# Repeat and batch the dataset Args:
dataset = dataset.repeat(repeat) sentence: A numpy array of integers.
dataset = dataset.batch(batch_size) sentence_length: The length of the input after the padding.
Returns:
A numpy array of integers of the given length.
"""
sentence = sentence[:sentence_length]
if len(sentence) < sentence_length:
sentence = np.pad(sentence, (0, sentence_length - len(sentence)),
"constant", constant_values=(START_CHAR, END_CHAR))
# Prefetch to improve speed of input pipeline. return sentence
dataset = dataset.prefetch(10)
return dataset
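To make the padding behavior concrete: since the pad width before the sentence is always 0, only `END_CHAR` (2) is actually used as the pad value. A short sketch, assuming the `data.util` module above is importable:
```
import numpy as np
from data.util import pad_sentence

print(pad_sentence(np.array([5, 7, 42]), 5))           # -> [5, 7, 42, 2, 2]  padded with END_CHAR
print(pad_sentence(np.array([5, 7, 42, 9, 8, 6]), 4))  # -> [5, 7, 42, 9]     truncated
```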
"""The main module for sentiment analysis. """Main function for the sentiment analysis model.
The model makes use of concatenation of two CNN layers The model makes use of concatenation of two CNN layers with
with different kernel sizes. different kernel sizes. See `sentiment_model.py`
See `sentiment_model.py` for more details about the models. for more details about the models.
""" """
from __future__ import absolute_import from __future__ import absolute_import
from __future__ import division from __future__ import division
from __future__ import print_function from __future__ import print_function
from absl import app as absl_app import argparse
from absl import flags
from data import dataset
from official.utils.flags import core as flags_core
from official.utils.logs import hooks_helper
from official.utils.logs import logger
from official.utils.misc import distribution_utils
import sentiment_model
import tensorflow as tf
def convert_keras_to_estimator(keras_model, num_gpus, model_dir=None):
"""Convert keras model into tensorflow estimator."""
keras_model.compile(optimizer="rmsprop",
loss="categorical_crossentropy", metrics=["accuracy"])
distribution = distribution_utils.get_distribution_strategy(
num_gpus, all_reduce_alg=None)
run_config = tf.estimator.RunConfig(train_distribute=distribution)
estimator = tf.keras.estimator.model_to_estimator(
keras_model=keras_model, model_dir=model_dir, config=run_config)
return estimator
def run_model(flags_obj):
"""Run training and eval loop."""
num_class = dataset.get_num_class(flags_obj.dataset) import tensorflow as tf
tf.logging.info("Loading the dataset...")
train_input_fn, eval_input_fn = dataset.construct_input_fns(
flags_obj.dataset, flags_obj.batch_size, flags_obj.vocabulary_size,
flags_obj.sentence_length, repeat=flags_obj.epochs_between_evals)
keras_model = sentiment_model.CNN(
flags_obj.embedding_dim, flags_obj.vocabulary_size,
flags_obj.sentence_length,
flags_obj.cnn_filters, num_class, flags_obj.dropout_rate)
num_gpus = flags_core.get_num_gpus(FLAGS)
tf.logging.info("Creating Estimator from Keras model...")
estimator = convert_keras_to_estimator(
keras_model, num_gpus, flags_obj.model_dir)
# Create hooks that log information about the training and metric values
train_hooks = hooks_helper.get_train_hooks(
flags_obj.hooks,
batch_size=flags_obj.batch_size # for ExamplesPerSecondHook
)
run_params = {
"batch_size": flags_obj.batch_size,
"train_epochs": flags_obj.train_epochs,
}
benchmark_logger = logger.get_benchmark_logger()
benchmark_logger.log_run_info(
model_name="sentiment_analysis",
dataset_name=flags_obj.dataset,
run_params=run_params,
test_id=flags_obj.benchmark_test_id)
# Training and evaluation cycle
total_training_cycle = flags_obj.train_epochs\
// flags_obj.epochs_between_evals
for cycle_index in range(total_training_cycle):
tf.logging.info("Starting a training cycle: {}/{}".format(
cycle_index + 1, total_training_cycle))
# Train the model
estimator.train(input_fn=train_input_fn, hooks=train_hooks)
# Evaluate the model
eval_results = estimator.evaluate(input_fn=eval_input_fn)
# Benchmark the evaluation results
benchmark_logger.log_evaluation_result(eval_results)
tf.logging.info("Iteration {}".format(eval_results)) from data import dataset
import sentiment_model
# Clear the session explicitly to avoid session delete error _DROPOUT_RATE = 0.95
tf.keras.backend.clear_session()
def main(_): def run_model(dataset_name, emb_dim, voc_size, sen_len,
with logger.benchmark_context(FLAGS): hid_dim, batch_size, epochs):
run_model(FLAGS) """Run training loop and an evaluation at the end.
Args:
dataset_name: Dataset name to be trained and evaluated.
emb_dim: The dimension of the Embedding layer.
voc_size: The number of the most frequent tokens
to be used from the corpus.
sen_len: The number of words in each sentence.
Longer sentences get cut, shorter ones padded.
hid_dim: The dimension of the Embedding layer.
batch_size: The size of each batch during training.
epochs: The number of the iteration over the training set for training.
"""
def define_flags(): model = sentiment_model.CNN(emb_dim, voc_size, sen_len,
"""Add flags to run the main function.""" hid_dim, dataset.get_num_class(dataset_name),
_DROPOUT_RATE)
model.summary()
# Add common flags model.compile(loss="categorical_crossentropy",
flags_core.define_base(export_dir=False) optimizer="rmsprop",
flags_core.define_performance( metrics=["accuracy"])
num_parallel_calls=False,
inter_op=False,
intra_op=False,
synthetic_data=False,
max_train_steps=False,
dtype=False
)
flags_core.define_benchmark()
flags.adopt_module_key_flags(flags_core) tf.logging.info("Loading the data")
x_train, y_train, x_test, y_test = dataset.load(
dataset_name, voc_size, sen_len)
flags_core.set_defaults( model.fit(x_train, y_train, batch_size=batch_size,
model_dir=None, validation_split=0.4, epochs=epochs)
train_epochs=30, score = model.evaluate(x_test, y_test, batch_size=batch_size)
batch_size=30, tf.logging.info("Score: {}".format(score))
hooks="")
# Add domain-specific flags
flags.DEFINE_enum(
name="dataset", default=dataset.DATASET_IMDB,
enum_values=[dataset.DATASET_IMDB], case_sensitive=False,
help=flags_core.help_wrap(
"Dataset to be trained and evaluated."))
flags.DEFINE_integer(
name="vocabulary_size", default=6000,
help=flags_core.help_wrap(
"The number of the most frequent tokens"
"to be used from the corpus."))
flags.DEFINE_integer(
name="sentence_length", default=200,
help=flags_core.help_wrap(
"The number of words in each sentence. Longer sentences get cut,"
"shorter ones padded."))
flags.DEFINE_integer(
name="embedding_dim", default=256,
help=flags_core.help_wrap("The dimension of the Embedding layer."))
flags.DEFINE_integer(
name="cnn_filters", default=512,
help=flags_core.help_wrap("The number of the CNN layer filters."))
flags.DEFINE_float(
name="dropout_rate", default=0.7,
help=flags_core.help_wrap("The rate for the Dropout layer."))
if __name__ == "__main__": if __name__ == "__main__":
tf.logging.set_verbosity(tf.logging.INFO) parser = argparse.ArgumentParser()
define_flags() parser.add_argument("-d", "--dataset", help="Dataset to be trained "
FLAGS = flags.FLAGS "and evaluated.",
absl_app.run(main) type=str, choices=["imdb"], default="imdb")
parser.add_argument("-e", "--embedding_dim",
help="The dimension of the Embedding layer.",
type=int, default=512)
parser.add_argument("-v", "--vocabulary_size",
help="The number of the words to be considered "
"in the dataset corpus.",
type=int, default=6000)
parser.add_argument("-s", "--sentence_length",
help="The number of words in a data point."
"Entries of smaller length are padded.",
type=int, default=600)
parser.add_argument("-c", "--hidden_dim",
help="The number of the CNN layer filters.",
type=int, default=512)
parser.add_argument("-b", "--batch_size",
help="The size of each batch for training.",
type=int, default=500)
parser.add_argument("-p", "--epochs",
help="The number of epochs for training.",
type=int, default=55)
args = parser.parse_args()
run_model(args.dataset, args.embedding_dim, args.vocabulary_size,
args.sentence_length, args.hidden_dim,
args.batch_size, args.epochs)
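The training entry point can also be driven from Python rather than the command line; a minimal sketch that simply mirrors the argparse defaults above:
```
import sentiment_main

# Positional order: dataset_name, emb_dim, voc_size, sen_len,
# hid_dim, batch_size, epochs (values shown are the CLI defaults).
sentiment_main.run_model("imdb", 512, 6000, 600, 512, 500, 55)
```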
"""Model for sentiment analysis.
The model makes use of concatenation of two CNN layers with
different kernel sizes.
"""
from __future__ import absolute_import from __future__ import absolute_import
from __future__ import division from __future__ import division
from __future__ import print_function from __future__ import print_function
...@@ -5,53 +11,11 @@ from __future__ import print_function ...@@ -5,53 +11,11 @@ from __future__ import print_function
import tensorflow as tf import tensorflow as tf
def _dynamic_pooling(w_embs):
"""Dynamic Pooling layer.
Given the variable-sized output of the convolution layer,
the pooling with a fixed pooling kernel size and stride would
produce variable-sized output, whereas the following fully-connected
layer expects fixes input layer size.
Thus we fix the number of pooling units (to 2) and dynamically
determine the pooling region size on each data point.
Args:
w_embs: a input tensor with dimensionality of 1.
Returns:
A tensor of size 2.
"""
# a Lambda layer maintain separate context, so that tf should be imported
# here.
import tensorflow as tf
t = tf.expand_dims(w_embs, 2)
pool_size = w_embs.shape[1].value / 2
pooled = tf.keras.backend.pool2d(t, (pool_size, 1), strides=(
pool_size, 1), data_format="channels_last")
return tf.squeeze(pooled, 2)
def _dynamic_pooling_output_shape(input_shape):
"""Output shape for the dynamic pooling layer.
This function is used for keras Lambda layer to indicate
the output shape of the dynamic poolic layer.
Args:
input_shape: A tuple for the input shape.
Returns:
output shape for the dynamic pooling layer.
"""
shape = list(input_shape)
assert len(shape) == 2 # only valid for 2D tensors
shape[1] = 2
return tuple(shape)
class CNN(tf.keras.models.Model): class CNN(tf.keras.models.Model):
"""CNN for sentimental analysis.""" """CNN for sentimental analysis."""
def __init__(self, emb_dim, num_words, sentence_length, hid_dim, def __init__(self, emb_dim, num_words, sentence_length, hid_dim,
class_dim, dropout_rate): class_dim, dropout_rate):
"""Initialize CNN model. """Initialize CNN model.
Args: Args:
...@@ -64,27 +28,23 @@ class CNN(tf.keras.models.Model): ...@@ -64,27 +28,23 @@ class CNN(tf.keras.models.Model):
class_dim: The number of the CNN layer filters. class_dim: The number of the CNN layer filters.
dropout_rate: The portion of kept value in the Dropout layer. dropout_rate: The portion of kept value in the Dropout layer.
Returns: Returns:
tf.keras.models.Model: A model. tf.keras.models.Model: A Keras model.
""" """
input = tf.keras.layers.Input(shape=(sentence_length,), dtype=tf.int32) input_layer = tf.keras.layers.Input(shape=(sentence_length,), dtype=tf.int32)
layer = tf.keras.layers.Embedding(num_words, output_dim=emb_dim)(input) layer = tf.keras.layers.Embedding(num_words, output_dim=emb_dim)(input_layer)
layer_conv3 = tf.keras.layers.Conv1D(hid_dim, 3, activation="relu")(layer) layer_conv3 = tf.keras.layers.Conv1D(hid_dim, 3, activation="relu")(layer)
layer_conv3 = tf.keras.layers.Lambda(_dynamic_pooling, layer_conv3 = tf.keras.layers.GlobalMaxPooling1D()(layer_conv3)
output_shape=_dynamic_pooling_output_shape)(layer_conv3)
layer_conv3 = tf.keras.layers.Flatten()(layer_conv3)
layer_conv2 = tf.keras.layers.Conv1D(hid_dim, 2, activation="relu")(layer) layer_conv4 = tf.keras.layers.Conv1D(hid_dim, 2, activation="relu")(layer)
layer_conv2 = tf.keras.layers.Lambda(_dynamic_pooling, layer_conv4 = tf.keras.layers.GlobalMaxPooling1D()(layer_conv4)
output_shape=_dynamic_pooling_output_shape)(layer_conv2)
layer_conv2 = tf.keras.layers.Flatten()(layer_conv2)
layer = tf.keras.layers.concatenate([layer_conv2, layer_conv3], axis=1) layer = tf.keras.layers.concatenate([layer_conv4, layer_conv3], axis=1)
layer = tf.keras.layers.Dropout(dropout_rate)(layer)
layer = tf.keras.layers.BatchNormalization()(layer) layer = tf.keras.layers.BatchNormalization()(layer)
layer = tf.keras.layers.Dropout(dropout_rate)(layer)
output = tf.keras.layers.Dense(class_dim, activation="softmax")(layer) output = tf.keras.layers.Dense(class_dim, activation="softmax")(layer)
super(CNN, self).__init__(inputs=[input], outputs=output) super(CNN, self).__init__(inputs=[input_layer], outputs=output)
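To see the two-branch structure concretely, this sketch (the hyperparameter values are illustrative only) builds the model and prints its layer summary; the concatenated feature vector has length 2 * hid_dim before batch normalization:
```
import sentiment_model

model = sentiment_model.CNN(emb_dim=32, num_words=6000, sentence_length=200,
                            hid_dim=64, class_dim=2, dropout_rate=0.5)
model.summary()  # the concatenate layer outputs shape (None, 128) == 2 * hid_dim
```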