
# Adversarial Text Classification
Code for [*Adversarial Training Methods for Semi-Supervised Text Classification*](https://arxiv.org/abs/1605.07725) and [*Semi-Supervised Sequence Learning*](https://arxiv.org/abs/1511.01432).
The directory `/tmp/aclImdb` is assumed to contain the raw IMDB data (the Large Movie Review Dataset, available from http://ai.stanford.edu/~amaas/data/sentiment/).
### Generate vocabulary
```bash
$ IMDB_DATA_DIR=/tmp/imdb
$ python gen_vocab.py \
--output_dir=$IMDB_DATA_DIR \
--dataset=imdb \
--imdb_input_dir=/tmp/aclImdb \
--lowercase=False
```
Vocabulary and frequency files will be generated in `$IMDB_DATA_DIR`.
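
As a quick sanity check you can list what was written. A minimal sketch, assuming the vocabulary lands in a `vocab.txt` file (the exact file names come from `gen_vocab.py`, so adjust if yours differ):

```python
import os

imdb_data_dir = '/tmp/imdb'
# List everything gen_vocab.py wrote.
print(sorted(os.listdir(imdb_data_dir)))

# Peek at the first few vocabulary entries (file name is an assumption).
with open(os.path.join(imdb_data_dir, 'vocab.txt')) as f:
    for i, token in enumerate(f):
        if i >= 10:
            break
        print(token.rstrip())
```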
### Generate training, validation, and test data
```bash
$ python gen_data.py \
--output_dir=$IMDB_DATA_DIR \
--dataset=imdb \
--imdb_input_dir=/tmp/aclImdb \
--lowercase=False \
--label_gain=False
```
`$IMDB_DATA_DIR` contains TFRecords files.
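
To verify the generated data before training, you can count the records in each file. A minimal TF 1.x sketch that simply skips anything in the directory that is not a TFRecords file:

```python
import glob
import os

import tensorflow as tf

imdb_data_dir = '/tmp/imdb'
for path in sorted(glob.glob(os.path.join(imdb_data_dir, '*'))):
    if not os.path.isfile(path):
        continue
    try:
        count = sum(1 for _ in tf.python_io.tf_record_iterator(path))
        print('%s: %d records' % (path, count))
    except tf.errors.OpError:
        print('%s: not a TFRecords file' % path)
```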
### Pretrain IMDB Language Model
```bash
$ PRETRAIN_DIR=/tmp/models/imdb_pretrain
$ python pretrain.py \
--train_dir=$PRETRAIN_DIR \
--data_dir=$IMDB_DATA_DIR \
--vocab_size=86934 \
--embedding_dims=256 \
--rnn_cell_size=1024 \
--num_candidate_samples=1024 \
--batch_size=256 \
--learning_rate=0.001 \
--learning_rate_decay_factor=0.9999 \
--max_steps=100000 \
--max_grad_norm=1.0 \
--num_timesteps=400 \
--keep_prob_emb=0.5 \
--normalize_embeddings
```
`$PRETRAIN_DIR` contains checkpoints of the pretrained language model.
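
To confirm pretraining produced a usable checkpoint before moving on, you can list the saved variables. A minimal TF 1.x sketch (the directory is the `$PRETRAIN_DIR` used above):

```python
import tensorflow as tf

ckpt = tf.train.latest_checkpoint('/tmp/models/imdb_pretrain')
assert ckpt is not None, 'no checkpoint found yet'
reader = tf.train.NewCheckpointReader(ckpt)
# Print every variable name and shape stored in the checkpoint, e.g. the
# embedding and LSTM variables the classifier will load later.
for name, shape in sorted(reader.get_variable_to_shape_map().items()):
    print(name, shape)
```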
### Train classifier
Most flags stay the same. The differences are the removal of candidate sampling,
the addition of `pretrained_model_dir` (from which the classifier loads the
pretrained embedding and LSTM variables), and the flags related to adversarial
training and classification.
```bash
$ TRAIN_DIR=/tmp/models/imdb_classify
$ python train_classifier.py \
--train_dir=$TRAIN_DIR \
--pretrained_model_dir=$PRETRAIN_DIR \
--data_dir=$IMDB_DATA_DIR \
--vocab_size=86934 \
--embedding_dims=256 \
--rnn_cell_size=1024 \
--cl_num_layers=1 \
--cl_hidden_size=30 \
--batch_size=64 \
--learning_rate=0.0005 \
--learning_rate_decay_factor=0.9998 \
--max_steps=15000 \
--max_grad_norm=1.0 \
--num_timesteps=400 \
--keep_prob_emb=0.5 \
--normalize_embeddings \
--adv_training_method=vat \
--perturb_norm_length=5.0
```
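
For intuition about the `--adv_training_method` and `--perturb_norm_length` flags: adversarial training perturbs the (normalized) word embeddings in the direction that most increases the loss, with the perturbation rescaled to a fixed L2 norm; virtual adversarial training (`vat`) applies the same idea with a KL-divergence loss so that unlabeled examples can be used as well. Below is a plain-NumPy sketch of the perturbation itself, not the project's TensorFlow code:

```python
import numpy as np

def adversarial_perturbation(grad_wrt_embedding, perturb_norm_length=5.0):
    """r_adv = epsilon * g / ||g||_2, where g is the gradient of the loss
    with respect to the embedded input of a single example."""
    g = grad_wrt_embedding.reshape(-1)
    norm = np.linalg.norm(g) + 1e-12  # guard against a zero gradient
    r_adv = perturb_norm_length * g / norm
    return r_adv.reshape(grad_wrt_embedding.shape)

# During training the classification loss is evaluated a second time on
# (embedding + r_adv) and added to the objective.
```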
### Evaluate on test data
```bash
$ EVAL_DIR=/tmp/models/imdb_eval
$ python evaluate.py \
--eval_dir=$EVAL_DIR \
--checkpoint_dir=$TRAIN_DIR \
--eval_data=test \
--run_once \
--num_examples=25000 \
--data_dir=$IMDB_DATA_DIR \
--vocab_size=86934 \
--embedding_dims=256 \
--rnn_cell_size=1024 \
--batch_size=256 \
--num_timesteps=400 \
--normalize_embeddings
```
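
Evaluation metrics are written as TensorFlow summaries under `--eval_dir`, so you can point TensorBoard at `$EVAL_DIR` or read the event files directly. A minimal TF 1.x sketch, assuming the standard `events.out.tfevents*` summary-writer naming:

```python
import glob
import os

import tensorflow as tf

eval_dir = '/tmp/models/imdb_eval'
for events_file in sorted(glob.glob(os.path.join(eval_dir, 'events.out.tfevents*'))):
    for event in tf.train.summary_iterator(events_file):
        for value in event.summary.value:
            # Prints every scalar summary recorded during evaluation.
            print(event.step, value.tag, value.simple_value)
```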
## Code Overview
The main entry points are the binaries listed below. Each training binary builds
a `VatxtModel`, defined in `graphs.py`, which in turn uses graph building blocks
defined in `inputs.py` (defines input data reading and parsing), `layers.py`
(defines core model components), and `adversarial_losses.py` (defines
adversarial training losses). The training loop itself is defined in
`train_utils.py`.
### Binaries
* Pretraining: `pretrain.py`
* Classifier Training: `train_classifier.py`
* Evaluation: `evaluate.py`
### Command-Line Flags
Flags related to distributed training and the training loop itself are defined
in [`train_utils.py`](https://github.com/tensorflow/models/tree/master/research/adversarial_text/train_utils.py).
Flags related to model hyperparameters are defined in [`graphs.py`](https://github.com/tensorflow/models/tree/master/research/adversarial_text/graphs.py).
Flags related to adversarial training are defined in [`adversarial_losses.py`](https://github.com/tensorflow/models/tree/master/research/adversarial_text/adversarial_losses.py).
Flags particular to each job are defined in the main binary files.
* Data generation: [`gen_data.py`](https://github.com/tensorflow/models/tree/master/research/adversarial_text/gen_data.py)

Command-line flags controlling which dataset is processed, and how, are defined in [`document_generators.py`](https://github.com/tensorflow/models/tree/master/research/adversarial_text/data/document_generators.py).
You will also need to include the new dataset in `datasets/__init__.py` and
specify the dataset name on the command line:
```bash
$ python train.py --dataset_name=newtextdataset
```
Note that `eval.py` will also require the same flag.
To learn how to store data in the FSNS format, refer to
https://stackoverflow.com/a/44461910/743658.
2. Define a new dataset format. The model needs the following data to train:
- images: input images, shape [batch_size x H x W x 3];
- labels: ground truth label ids, shape [batch_size x seq_length];
- labels_one_hot: labels in one-hot encoding, shape [batch_size x seq_length x num_char_classes].
Refer to [python/data_provider.py](https://github.com/tensorflow/models/blob/master/research/attention_ocr/python/data_provider.py#L33)
for more details. You can use [python/datasets/fsns.py](https://github.com/tensorflow/models/blob/master/research/attention_ocr/python/datasets/fsns.py)
as an example.
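
A minimal sketch of how `labels_one_hot` relates to `labels` (TF 1.x; `num_char_classes` here stands for the size of your dataset's character set):

```python
import tensorflow as tf

def labels_to_one_hot(labels, num_char_classes):
    """Converts int labels [batch_size, seq_length] to one-hot encoding
    [batch_size, seq_length, num_char_classes]."""
    return tf.one_hot(labels, depth=num_char_classes)
```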
## How to use a pre-trained model
The inference part has not been released yet, but it is fairly straightforward to
implement in Python or C++.
The recommended way is to use the [Serving infrastructure][serving].
Alternatively you can:
1. define a placeholder for images (or directly use a numpy array)
2. [create a graph](https://github.com/tensorflow/models/blob/master/research/attention_ocr/python/eval.py#L60)