Unfortunately, we are not able to release the exact pre-processed dataset used
in the paper. Users need to crawl the URLs themselves; the recommended
pre-processing is to use an open-source library to download and parse the news
content, including the title and leading paragraphs. To ease this process, we
provide a config for [news-please](https://github.com/fhamborg/news-please) that
will crawl and extract news articles on a local machine.
First, install the `news-please` CLI (requires Python 3.x):
```shell
$ pip3 install news-please
```
Next, run the crawler with our provided config and URL list:
```shell
# Set DATA_FOLDER to the path of the downloaded data folder.
$ DATA_FOLDER=/path/to/downloaded_dataset
# Use the CLI to crawl with our provided config.
$ news-please -c $DATA_FOLDER/news_please
```
By default, it will store crawled
articles under `/tmp/nhnet/`. To terminate the process, press `CTRL+C`.
Crawling may take a few days (48 hours in our test), depending on the network
environment and the number of threads set in the config. The crawling tool does
not stop automatically, and it is not straightforward to check its progress. We
suggest terminating the job once no new articles have been crawled for a short
period (e.g., 10 minutes), which you can monitor by running
```shell
$ find /tmp/nhnet -type f | wc -l
```
Please note that some URLs are expected to become unavailable on the web over
time.
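If you want to spot-check that the crawler extracted titles and article text
correctly, the sketch below loads a few crawled files. It assumes the provided
config stores one JSON file per article with `title` and `maintext` fields
(news-please's standard JSON output); adjust the glob if your storage settings
differ.
```python
# sanity_check_crawl.py -- a minimal sketch, assuming one JSON file per
# crawled article with `title` and `maintext` fields (news-please's standard
# JSON output). Adjust the glob if your config writes a different layout.
import glob
import json

CRAWL_DIR = "/tmp/nhnet"

paths = glob.glob(f"{CRAWL_DIR}/**/*.json", recursive=True)
print(f"Found {len(paths)} crawled article files.")

# Print the title and a snippet of body text for a few articles.
for path in paths[:3]:
  with open(path, "r", encoding="utf-8") as f:
    article = json.load(f)
  print(article.get("title"), "|", (article.get("maintext") or "")[:80])
```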
### Data Processing
Given the crawled articles under `/tmp/nhnet/`, we next transform these
textual articles into a set of `TFRecord` files containing serialized
tensorflow.Example protocol buffers, with feature keys following the BERT
[[2]](#2) convention but extended for multiple text segments. We will later
use these processed TFRecords for training and evaluation.
To do this, please first download a [BERT pretrained checkpoint](https://github.com/tensorflow/models/tree/master/official/nlp/bert#access-to-pretrained-checkpoints)
(`BERT-Base,Uncased` is preferred for efficiency) and decompress the `tar.gz`
file. We need the vocabulary file here and will later use the checkpoint for
NHNet initialization.
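Optionally, you can sanity-check the vocabulary file (and see what
`do_lower_case` does) by tokenizing a sample sentence. This is a minimal
sketch assuming the Model Garden's BERT tokenizer is importable as
`official.nlp.bert.tokenization`; the module path may differ across Model
Garden releases.
```python
# check_vocab.py -- a minimal sketch, assuming the Model Garden BERT
# tokenization module is importable as `official.nlp.bert.tokenization`
# (the exact module path may differ across Model Garden releases).
from official.nlp.bert import tokenization

VOCAB_FILE = "/path/to/bert_checkpoint/vocab.txt"

tokenizer = tokenization.FullTokenizer(
    vocab_file=VOCAB_FILE, do_lower_case=True)
tokens = tokenizer.tokenize("A quick sanity check of the BERT vocabulary.")
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokens))
```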
Next, we can run the following data preprocessing script, which may take a few
hours to read the files and tokenize the article content.
```shell
# Recall that we use DATA_FOLDER=/path/to/downloaded_dataset
$ python3 raw_data_preprocess.py \
    -crawled_articles=/tmp/nhnet \
    -vocab=/path/to/bert_checkpoint/vocab.txt \
    -do_lower_case=True \
    -len_title=15 \
    -len_passage=200 \
    -max_num_articles=5 \
    -data_folder=$DATA_FOLDER
```
This Python script will export the processed train/valid/eval files under
`$DATA_FOLDER/processed/`.
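To verify the output, you can peek at one of the generated files and list its
feature keys. The sketch below only assumes the files hold serialized
tensorflow.Example records, as described above; the exact file names under
`$DATA_FOLDER/processed/` may differ.
```python
# inspect_tfrecords.py -- a minimal sketch for peeking at the processed data.
# It only assumes the files contain serialized tf.train.Example records; the
# exact file names under $DATA_FOLDER/processed/ may differ.
import glob
import tensorflow as tf

DATA_FOLDER = "/path/to/downloaded_dataset"  # same as $DATA_FOLDER above
files = glob.glob(f"{DATA_FOLDER}/processed/*")
print("Processed files:", files)

# Decode the first record of the first file and list its feature keys.
dataset = tf.data.TFRecordDataset(files[:1])
for raw_record in dataset.take(1):
  example = tf.train.Example.FromString(raw_record.numpy())
  for key, feature in example.features.feature.items():
    kind = feature.WhichOneof("kind")
    values = getattr(feature, kind).value
    print(f"{key}: {kind}[{len(values)}]")
```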
## Training
Please first install TensorFlow 2 and the TensorFlow Model Garden following the