Commit ddeb86f9 authored by ziqiaomeng, committed by Da Zheng

[Example] metapath2vec (#764)

* Prepare for metapath sampler

This file is just for reviewing metapath sampling algorithm (Python version).

* Delete metapath_sampler

* Prepare for metapath sampler

This file is just for reviewing metapath sampling algorithm (Python code).

* Add files via upload

* Create metapath2vec.md

* Add files via upload

* Delete data_handler.py

* Delete word2vec.py

* Delete word_train.py

* Add files via upload

Metapath2vec implementations. Metapath2vec++ needs negative sampler optimization.

* Delete shuffle_training.py

* Delete test.py

* Add files via upload

* Delete sampler.py

* Delete metapath_sampler.md

* Add files via upload

* Update and rename shuffle_training.py to metapath2vec.py

* Update reading_data.py

* Update metapath2vec.md

* Update metapath2vec.md

* Update metapath2vec.md

* Update metapath2vec.md

* Update metapath2vec.md

* Create label 2

* Delete label 2

* Create testing.md

* Add files via upload

* Create sample.md

* Add files via upload

* Delete sampler.py

* Add files via upload

* Delete googlescholar.8area.author.label.txt

* Delete googlescholar.8area.venue.label.txt

* Delete testing.md

* Delete id_author.txt

* Delete id_conf.txt

* Delete paper.txt

* Delete paper_author.txt

* Delete paper_conf.txt

* Delete sample.md

* Delete sampler.py

* Add files via upload

* Add files via upload

* Add files via upload

* Delete reading_data.py

* Add files via upload

* Add files via upload

* Delete metapath2vec.py

* Add files via upload

* Rename shuffle_training.py to metapath2vec.py

* Update metapath2vec.md

* Delete reading_data.py

* add comments and remove commented code
parent 98954c56
# download.py
import os
import torch as th
import torch.nn as nn


class AminerDataset:
    """
    Download the Aminer dataset from an Amazon S3 bucket.
    """
    def __init__(self, path):
        self.url = 'https://s3.us-east-2.amazonaws.com/dgl.ai/dataset/aminer.zip'
        if not os.path.exists(os.path.join(path, 'aminer')):
            print('File not found. Downloading from', self.url)
            self._download_and_extract(path, 'aminer.zip')

    def _download_and_extract(self, path, filename):
        import shutil, zipfile, zlib
        from tqdm import tqdm
        import requests

        fn = os.path.join(path, filename)

        if os.path.exists(path):
            shutil.rmtree(path, ignore_errors=True)
        os.makedirs(path)
        f_remote = requests.get(self.url, stream=True)
        assert f_remote.status_code == 200, 'fail to open {}'.format(self.url)
        with open(fn, 'wb') as writer:
            for chunk in tqdm(f_remote.iter_content(chunk_size=1024 * 1024 * 3)):
                writer.write(chunk)
        print('Download finished. Unzipping the file...')
        with zipfile.ZipFile(fn) as zf:
            zf.extractall(path)
        print('Unzip finished.')
        self.fn = fn
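A minimal usage sketch (not part of the commit) for the downloader above; `'./data'` is a placeholder path, and `fn` is what `DataReader` in reading_data.py opens to read the sampled walks:

```python
from download import AminerDataset

# Placeholder path; the class fetches and unzips aminer.zip only when
# './data/aminer' does not exist yet (fn is set on that download path).
dataset = AminerDataset('./data')
print(dataset.fn)
```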
Metapath2vec
============
- Paper link: [metapath2vec: Scalable Representation Learning for Heterogeneous Networks](https://ericdongyx.github.io/papers/KDD17-dong-chawla-swami-metapath2vec.pdf)
- Author's code repo: [https://ericdongyx.github.io/metapath2vec/m2v.html](https://ericdongyx.github.io/metapath2vec/m2v.html).

Dependencies
------------
- PyTorch 1.0.1+

How to run the code
-----
Follow these steps:
1. Run sampler.py on your graph dataset. Note that the input text files should be lists of mappings, so you may need to preprocess your graph dataset first. Files in the expected format are available in the "net_dbis" folder. You can also plug in your own metapath sampler implementation.
2. Run the following command:
```bash
python metapath2vec.py --download "where/you/want/to/download" --output_file "your_output_file_path"
```
Tips: adjust `num_workers` to match your machine or GPU instance; running 3 or 4 epochs (the `--iterations` flag) is usually enough. The embedding file it writes starts with a `<count> <dimension>` header followed by one `<node> <vector>` line per node; see the loading sketch below.
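A minimal sketch (not part of this example) for loading the embeddings written by `save_embedding` in model.py back into memory; the path is the same placeholder as above:

```python
import numpy as np

embeddings = {}
with open("your_output_file_path") as f:
    count, dim = map(int, f.readline().split())   # header: "<count> <dimension>"
    for line in f:
        parts = line.rstrip("\n").split(" ")
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
print("loaded %d embeddings of dimension %d" % (len(embeddings), dim))
```
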
Tricks included in the implementation:
-------
1. Sub-sampling of frequent nodes;
2. Negative sampling without repeatedly calling `numpy.random.choice` (a pre-built, pre-shuffled table is sliced instead; see the sketch below);
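Both tricks live in `DataReader` (reading_data.py). A condensed sketch of the idea, using toy counts in place of the real node frequencies:

```python
import numpy as np

word_frequency = {0: 1000, 1: 50, 2: 5}        # toy node-id -> count
freqs = np.array(list(word_frequency.values()), dtype=np.float64)

# Sub-sampling: a frequent node is kept with probability sqrt(t/f) + t/f,
# checked against np.random.rand() when (center, context) pairs are built.
t = 1e-4
f = freqs / freqs.sum()
discards = np.sqrt(t / f) + (t / f)

# Negative sampling: build one table proportional to count**0.75, shuffle it once,
# then hand out consecutive slices instead of calling np.random.choice per pair.
TABLE_SIZE = 10000                              # 1e8 in DataReader
ratio = freqs ** 0.75 / np.sum(freqs ** 0.75)
negatives = np.repeat(np.arange(len(freqs)), np.round(ratio * TABLE_SIZE).astype(int))
np.random.shuffle(negatives)

negpos = 0
def get_negatives(size):
    global negpos
    response = negatives[negpos:negpos + size]
    negpos = (negpos + size) % len(negatives)
    if len(response) != size:                   # wrapped around the table end
        response = np.concatenate((response, negatives[0:negpos]))
    return response

print(discards, get_negatives(5))
```
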
Performance and Explanations:
-------
Venue Classification Results for Metapath2vec:

| Metric | 5% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% |
| ------ | -- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Macro-F1 | 0.3033 | 0.5247 | 0.8033 | 0.8971 | 0.9406 | 0.9532 | 0.9529 | 0.9701 | 0.9683 | 0.9670 |
| Micro-F1 | 0.4173 | 0.5975 | 0.8327 | 0.9011 | 0.9400 | 0.9522 | 0.9537 | 0.9725 | 0.9815 | 0.9857 |

Author Classification Results for Metapath2vec:

| Metric | 5% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% |
| ------ | -- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Macro-F1 | 0.9216 | 0.9262 | 0.9292 | 0.9303 | 0.9309 | 0.9314 | 0.9315 | 0.9316 | 0.9319 | 0.9320 |
| Micro-F1 | 0.9279 | 0.9319 | 0.9346 | 0.9356 | 0.9361 | 0.9365 | 0.9365 | 0.9365 | 0.9367 | 0.9369 |

Note that:

Testing files are available in the "label 2" folder.

The numbers above are the results reported in the paper; in actual experiments the exact numbers may differ slightly:

1. For venue node classification, the variance of the results is large when the training set is small (e.g. 5%), since only a few labeled venues are available.
2. For author node classification, the results are stable since a large number of labeled authors is available, so even 5% of the data is sufficient for training.
3. In test.py you can change the number of experiment repetitions; author classification in particular is slow to test, so running it only 1 or 2 times may be all you need.
# metapath2vec.py
import torch
import argparse
import torch.optim as optim
from torch.utils.data import DataLoader
from tqdm import tqdm
from reading_data import DataReader, Metapath2vecDataset
from model import SkipGramModel


class Metapath2VecTrainer:
    def __init__(self, args):
        self.data = DataReader(args.download, args.min_count, args.care_type)
        dataset = Metapath2vecDataset(self.data, args.window_size)
        self.dataloader = DataLoader(dataset, batch_size=args.batch_size,
                                     shuffle=True, num_workers=args.num_workers, collate_fn=dataset.collate)

        self.output_file_name = args.output_file
        self.emb_size = len(self.data.word2id)
        self.emb_dimension = args.dim
        self.batch_size = args.batch_size
        self.iterations = args.iterations
        self.initial_lr = args.initial_lr
        self.skip_gram_model = SkipGramModel(self.emb_size, self.emb_dimension)

        self.use_cuda = torch.cuda.is_available()
        self.device = torch.device("cuda" if self.use_cuda else "cpu")
        if self.use_cuda:
            self.skip_gram_model.cuda()

    def train(self):
        for iteration in range(self.iterations):
            print("\n\n\nIteration: " + str(iteration + 1))
            optimizer = optim.SparseAdam(self.skip_gram_model.parameters(), lr=self.initial_lr)
            scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, len(self.dataloader))

            running_loss = 0.0
            for i, sample_batched in enumerate(tqdm(self.dataloader)):
                if len(sample_batched[0]) > 1:
                    pos_u = sample_batched[0].to(self.device)
                    pos_v = sample_batched[1].to(self.device)
                    neg_v = sample_batched[2].to(self.device)

                    scheduler.step()
                    optimizer.zero_grad()
                    loss = self.skip_gram_model.forward(pos_u, pos_v, neg_v)
                    loss.backward()
                    optimizer.step()

                    running_loss = running_loss * 0.9 + loss.item() * 0.1
                    if i > 0 and i % 500 == 0:
                        print(" Loss: " + str(running_loss))

            self.skip_gram_model.save_embedding(self.data.id2word, self.output_file_name)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Metapath2vec")
    # parser.add_argument('--input_file', type=str, help="input_file")
    parser.add_argument('--download', type=str, help="download_path")
    parser.add_argument('--output_file', type=str, help='output_file')
    parser.add_argument('--dim', default=128, type=int, help="embedding dimensions")
    parser.add_argument('--window_size', default=7, type=int, help="context window size")
    parser.add_argument('--iterations', default=5, type=int, help="iterations")
    parser.add_argument('--batch_size', default=50, type=int, help="batch size")
    parser.add_argument('--care_type', default=0, type=int, help="if 1, heterogeneous negative sampling, else normal negative sampling")
    parser.add_argument('--initial_lr', default=0.025, type=float, help="learning rate")
    parser.add_argument('--min_count', default=5, type=int, help="min count")
    parser.add_argument('--num_workers', default=16, type=int, help="number of workers")
    args = parser.parse_args()
    m2v = Metapath2VecTrainer(args)
    m2v.train()
# model.py
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import init

"""
u_embedding: Embedding for center word.
v_embedding: Embedding for neighbor words.
"""


class SkipGramModel(nn.Module):

    def __init__(self, emb_size, emb_dimension):
        super(SkipGramModel, self).__init__()
        self.emb_size = emb_size
        self.emb_dimension = emb_dimension
        self.u_embeddings = nn.Embedding(emb_size, emb_dimension, sparse=True)
        self.v_embeddings = nn.Embedding(emb_size, emb_dimension, sparse=True)

        initrange = 1.0 / self.emb_dimension
        init.uniform_(self.u_embeddings.weight.data, -initrange, initrange)
        init.constant_(self.v_embeddings.weight.data, 0)

    def forward(self, pos_u, pos_v, neg_v):
        emb_u = self.u_embeddings(pos_u)
        emb_v = self.v_embeddings(pos_v)
        emb_neg_v = self.v_embeddings(neg_v)

        score = torch.sum(torch.mul(emb_u, emb_v), dim=1)
        score = torch.clamp(score, max=10, min=-10)
        score = -F.logsigmoid(score)

        neg_score = torch.bmm(emb_neg_v, emb_u.unsqueeze(2)).squeeze()
        neg_score = torch.clamp(neg_score, max=10, min=-10)
        neg_score = -torch.sum(F.logsigmoid(-neg_score), dim=1)

        return torch.mean(score + neg_score)

    def save_embedding(self, id2word, file_name):
        embedding = self.u_embeddings.weight.cpu().data.numpy()
        with open(file_name, 'w') as f:
            f.write('%d %d\n' % (len(id2word), self.emb_dimension))
            for wid, w in id2word.items():
                e = ' '.join(map(lambda x: str(x), embedding[wid]))
                f.write('%s %s\n' % (w, e))
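A quick sanity check (not part of the commit) of the skip-gram negative-sampling loss above, with random indices and arbitrary sizes:

```python
import torch
from model import SkipGramModel

model = SkipGramModel(emb_size=100, emb_dimension=16)
pos_u = torch.randint(0, 100, (8,))      # 8 center nodes
pos_v = torch.randint(0, 100, (8,))      # their context nodes
neg_v = torch.randint(0, 100, (8, 5))    # 5 negatives per pair, as drawn by the dataset
loss = model(pos_u, pos_v, neg_v)        # mean of -log sig(u.v) - sum log sig(-u.v_neg)
loss.backward()                          # sparse gradients, matching optim.SparseAdam
print(loss.item())
```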
# reading_data.py
import numpy as np
import torch
from torch.utils.data import Dataset
from download import AminerDataset

np.random.seed(12345)


class DataReader:
    NEGATIVE_TABLE_SIZE = 1e8

    def __init__(self, download, min_count, care_type):
        self.negatives = []
        self.discards = []
        self.negpos = 0
        self.care_type = care_type
        self.word2id = dict()
        self.id2word = dict()
        self.sentences_count = 0
        self.token_count = 0
        self.word_frequency = dict()
        self.download = download

        FB = AminerDataset(self.download)
        self.inputFileName = FB.fn
        self.read_words(min_count)
        self.initTableNegatives()
        self.initTableDiscards()

    def read_words(self, min_count):
        word_frequency = dict()
        for line in open(self.inputFileName, encoding="ISO-8859-1"):
            line = line.split()
            if len(line) > 1:
                self.sentences_count += 1
                for word in line:
                    if len(word) > 0:
                        self.token_count += 1
                        word_frequency[word] = word_frequency.get(word, 0) + 1

                        if self.token_count % 1000000 == 0:
                            print("Read " + str(int(self.token_count / 1000000)) + "M words.")

        wid = 0
        for w, c in word_frequency.items():
            if c < min_count:
                continue
            self.word2id[w] = wid
            self.id2word[wid] = w
            self.word_frequency[wid] = c
            wid += 1

        self.word_count = len(self.word2id)
        print("Total embeddings: " + str(len(self.word2id)))

    def initTableDiscards(self):
        # get a frequency table for sub-sampling. Note that the frequency is adjusted by
        # sub-sampling tricks.
        t = 0.0001
        f = np.array(list(self.word_frequency.values())) / self.token_count
        self.discards = np.sqrt(t / f) + (t / f)

    def initTableNegatives(self):
        # get a table for negative sampling: if the word with index 2 appears twice, then 2 will be
        # listed in the table twice.
        pow_frequency = np.array(list(self.word_frequency.values())) ** 0.75
        words_pow = sum(pow_frequency)
        ratio = pow_frequency / words_pow
        count = np.round(ratio * DataReader.NEGATIVE_TABLE_SIZE)
        for wid, c in enumerate(count):
            self.negatives += [wid] * int(c)
        self.negatives = np.array(self.negatives)
        np.random.shuffle(self.negatives)
        self.sampling_prob = ratio

    def getNegatives(self, target, size):  # TODO check equality with target
        if self.care_type == 0:
            response = self.negatives[self.negpos:self.negpos + size]
            self.negpos = (self.negpos + size) % len(self.negatives)
            if len(response) != size:
                return np.concatenate((response, self.negatives[0:self.negpos]))
            return response


# -----------------------------------------------------------------------------------------------------------------


class Metapath2vecDataset(Dataset):
    def __init__(self, data, window_size):
        # read in data, window_size and input filename
        self.data = data
        self.window_size = window_size
        self.input_file = open(data.inputFileName, encoding="ISO-8859-1")

    def __len__(self):
        # return the number of walks
        return self.data.sentences_count

    def __getitem__(self, idx):
        # return the list of pairs (center, context, 5 negatives)
        while True:
            line = self.input_file.readline()
            if not line:
                self.input_file.seek(0, 0)
                line = self.input_file.readline()

            if len(line) > 1:
                words = line.split()

                if len(words) > 1:
                    word_ids = [self.data.word2id[w] for w in words if
                                w in self.data.word2id and np.random.rand() < self.data.discards[self.data.word2id[w]]]

                    pair_catch = []
                    for i, u in enumerate(word_ids):
                        for j, v in enumerate(
                                word_ids[max(i - self.window_size, 0):i + self.window_size]):
                            assert u < self.data.word_count
                            assert v < self.data.word_count
                            if i == j:
                                continue
                            pair_catch.append((u, v, self.data.getNegatives(v, 5)))
                    return pair_catch

    @staticmethod
    def collate(batches):
        all_u = [u for batch in batches for u, _, _ in batch if len(batch) > 0]
        all_v = [v for batch in batches for _, v, _ in batch if len(batch) > 0]
        all_neg_v = [neg_v for batch in batches for _, _, neg_v in batch if len(batch) > 0]

        return torch.LongTensor(all_u), torch.LongTensor(all_v), torch.LongTensor(all_neg_v)
# sampler.py
import numpy as np
import torch
import torchvision
from torch.autograd import Variable
import random
import time

Metapath = "Conference-Paper-Author-Paper-Conference"
num_walks_per_node = 1000
walk_length = 100


# construct mapping from text, could be changed to DGL later
def construct_id_dict():
    id_to_paper = {}
    id_to_author = {}
    id_to_conf = {}
    f_3 = open(".../id_author.txt", encoding="ISO-8859-1")
    f_4 = open(".../id_conf.txt", encoding="ISO-8859-1")
    f_5 = open(".../paper.txt", encoding="ISO-8859-1")
    while True:
        z = f_3.readline()
        if not z:
            break
        z = z.split('\t')
        identity = int(z[0])
        id_to_author[identity] = z[1].strip("\n")
    while True:
        w = f_4.readline()
        if not w:
            break
        w = w.split('\t')
        identity = int(w[0])
        id_to_conf[identity] = w[1].strip("\n")
    while True:
        v = f_5.readline()
        if not v:
            break
        v = v.split(' ')
        identity = int(v[0])
        paper_name = ""
        for s in range(5, len(v)):
            paper_name += v[s]
        paper_name = 'p' + paper_name
        id_to_paper[identity] = paper_name.strip('\n')
    f_3.close()
    f_4.close()
    f_5.close()
    return id_to_paper, id_to_author, id_to_conf


# construct mapping from text, could be changed to DGL later
def construct_types_mappings():
    paper_to_author = {}
    author_to_paper = {}
    paper_to_conf = {}
    conf_to_paper = {}
    f_1 = open(".../paper_author.txt", "r")
    f_2 = open(".../paper_conf.txt", "r")
    for x in f_1:
        x = x.split('\t')
        x[0] = int(x[0])
        x[1] = int(x[1].strip('\n'))
        if x[0] in paper_to_author:
            paper_to_author[x[0]].append(x[1])
        else:
            paper_to_author[x[0]] = []
            paper_to_author[x[0]].append(x[1])
        if x[1] in author_to_paper:
            author_to_paper[x[1]].append(x[0])
        else:
            author_to_paper[x[1]] = []
            author_to_paper[x[1]].append(x[0])
    for y in f_2:
        y = y.split('\t')
        y[0] = int(y[0])
        y[1] = int(y[1].strip('\n'))
        if y[0] in paper_to_conf:
            paper_to_conf[y[0]].append(y[1])
        else:
            paper_to_conf[y[0]] = []
            paper_to_conf[y[0]].append(y[1])
        if y[1] in conf_to_paper:
            conf_to_paper[y[1]].append(y[0])
        else:
            conf_to_paper[y[1]] = []
            conf_to_paper[y[1]].append(y[0])
    f_1.close()
    f_2.close()
    return paper_to_author, author_to_paper, paper_to_conf, conf_to_paper


# "conference - paper - author - paper - conference" metapath sampling
def generate_metapath():
    output_path = open(".../output_path.txt", "w")
    id_to_paper, id_to_author, id_to_conf = construct_id_dict()
    paper_to_author, author_to_paper, paper_to_conf, conf_to_paper = construct_types_mappings()
    count = 0
    # loop over all conferences
    for conf_id in conf_to_paper.keys():
        start_time = time.time()
        print("sampling " + str(count))
        conf = id_to_conf[conf_id]
        conf0 = conf
        # for each conference, simulate num_walks_per_node walks
        for i in range(num_walks_per_node):
            outline = conf0
            # each walk with length walk_length
            for j in range(walk_length):
                # C - P
                paper_list_1 = conf_to_paper[conf_id]
                # check whether the paper nodes link to any author node
                connections_1 = False
                available_paper_1 = []
                for k in range(len(paper_list_1)):
                    if paper_list_1[k] in paper_to_author:
                        available_paper_1.append(paper_list_1[k])
                num_p_1 = len(available_paper_1)
                if num_p_1 != 0:
                    connections_1 = True
                    paper_1_index = random.randrange(num_p_1)
                    # paper_id_1 = paper_list_1[paper_1_index]
                    paper_id_1 = available_paper_1[paper_1_index]
                    paper_1 = id_to_paper[paper_id_1]
                    outline += " " + paper_1
                else:
                    break
                # C - P - A
                author_list = paper_to_author[paper_id_1]
                num_a = len(author_list)
                # no need to check
                author_index = random.randrange(num_a)
                author_id = author_list[author_index]
                author = id_to_author[author_id]
                outline += " " + author
                # C - P - A - P
                paper_list_2 = author_to_paper[author_id]
                # check whether the paper nodes link to any conference node
                connections_2 = False
                available_paper_2 = []
                for m in range(len(paper_list_2)):
                    if paper_list_2[m] in paper_to_conf:
                        available_paper_2.append(paper_list_2[m])
                num_p_2 = len(available_paper_2)
                if num_p_2 != 0:
                    connections_2 = True
                    paper_2_index = random.randrange(num_p_2)
                    paper_id_2 = available_paper_2[paper_2_index]
                    paper_2 = id_to_paper[paper_id_2]
                    outline += " " + paper_2
                else:
                    break
                # C - P - A - P - C
                conf_list = paper_to_conf[paper_id_2]
                num_c = len(conf_list)
                conf_index = random.randrange(num_c)
                conf_id = conf_list[conf_index]
                conf = id_to_conf[conf_id]
                outline += " " + conf
            if connections_1 and connections_2:
                output_path.write(outline + "\n")
            else:
                break
        # Note that the original mapping text has a type indicator in front of each node, e.g. "cVLDB",
        # so the sampled sequence looks like "cconference ppaper aauthor ppaper cconference".
        count += 1
        print("--- %s seconds ---" % (time.time() - start_time))
    output_path.close()


if __name__ == "__main__":
    generate_metapath()
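For reference, every line that `generate_metapath()` writes is one walk of type-prefixed node names (c/p/a for conference/paper/author); the names below are made up and only the format matters. These lines are the "sentences" that `DataReader` and `Metapath2vecDataset` later split on whitespace:

```python
walk = "cVLDB pQueryOptimization aJaneDoe pStreamProcessing cSIGMOD"
print(walk.split())   # ['cVLDB', 'pQueryOptimization', 'aJaneDoe', ...]
```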
# test.py
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

if __name__ == "__main__":
    venue_count = 133
    author_count = 246678
    experiment_times = 1
    percent = 0.05
    file = open(".../output_file_path/...")
    file_1 = open(".../label 2/googlescholar.8area.venue.label.txt")   # venue labels from the "label 2" folder
    file_2 = open(".../label 2/googlescholar.8area.author.label.txt")  # author labels from the "label 2" folder
    check_venue = {}
    check_author = {}
    for line in file_1:
        venue_label = line.strip().split(" ")
        check_venue[venue_label[0]] = int(venue_label[1])
    for line in file_2:
        author_label = line.strip().split(" ")
        check_author[author_label[0]] = int(author_label[1])

    venue_embed_dict = {}
    author_embed_dict = {}
    # collect embeddings separately in dictionary form
    file.readline()
    print("read line by line")
    for line in file:
        embed = line.strip().split(' ')
        if embed[0] in check_venue:
            venue_embed_dict[embed[0]] = []
            for i in range(1, len(embed), 1):
                venue_embed_dict[embed[0]].append(float(embed[i]))
        if embed[0] in check_author:
            author_embed_dict[embed[0]] = []
            for j in range(1, len(embed), 1):
                author_embed_dict[embed[0]].append(float(embed[j]))

    # get venue and author embeddings
    print("reading finished")
    venues = list(venue_embed_dict.keys())
    authors = list(author_embed_dict.keys())
    macro_average_venue = 0
    micro_average_venue = 0
    macro_average_author = 0
    micro_average_author = 0
    for time in range(experiment_times):
        print("one more time")
        np.random.shuffle(venues)
        np.random.shuffle(authors)
        venue_embedding = np.array([])
        author_embedding = np.array([])
        print("collecting venue embeddings")
        for venue in venues:
            temp = np.array(venue_embed_dict[venue])
            if len(venue_embedding) == 0:
                venue_embedding = temp
            else:
                venue_embedding = np.vstack((venue_embedding, temp))
        print("collecting author embeddings")
        count = 0
        for author in authors:
            count += 1
            print("one more author " + str(count))
            temp_1 = np.array(author_embed_dict[author])
            if len(author_embedding) == 0:
                author_embedding = temp_1
            else:
                author_embedding = np.vstack((author_embedding, temp_1))

        # split data into training and testing
        print("splitting")
        venue_split = int(venue_count * percent)
        venue_training = venue_embedding[:venue_split, :]
        venue_testing = venue_embedding[venue_split:, :]
        author_split = int(author_count * percent)
        author_training = author_embedding[:author_split, :]
        author_testing = author_embedding[author_split:, :]

        # split labels into training and testing
        venue_label = []
        venue_true = []
        author_label = []
        author_true = []
        for i in range(len(venues)):
            if i < venue_split:
                venue_label.append(check_venue[venues[i]])
            else:
                venue_true.append(check_venue[venues[i]])
        venue_label = np.array(venue_label)
        venue_true = np.array(venue_true)
        for j in range(len(authors)):
            if j < author_split:
                author_label.append(check_author[authors[j]])
            else:
                author_true.append(check_author[authors[j]])
        author_label = np.array(author_label)
        author_true = np.array(author_true)
        file.close()

        print("begin predicting")
        clf_venue = LogisticRegression(random_state=0, solver="lbfgs", multi_class="multinomial").fit(venue_training, venue_label)
        y_pred_venue = clf_venue.predict(venue_testing)
        clf_author = LogisticRegression(random_state=0, solver="lbfgs", multi_class="multinomial").fit(author_training, author_label)
        y_pred_author = clf_author.predict(author_testing)
        macro_average_venue += f1_score(venue_true, y_pred_venue, average="macro")
        micro_average_venue += f1_score(venue_true, y_pred_venue, average="micro")
        macro_average_author += f1_score(author_true, y_pred_author, average="macro")
        micro_average_author += f1_score(author_true, y_pred_author, average="micro")

    print(macro_average_venue / float(experiment_times))
    print(micro_average_venue / float(experiment_times))
    print(macro_average_author / float(experiment_times))
    print(micro_average_author / float(experiment_times))