Unverified Commit fe6e01ad authored by yifeim, committed by GitHub

[Model] Lda subgraph (#3206)



* add word_ids and simplify

* simplify

* add word_ids to be removed later

* remove word_ids

* seems to work

* tweak

* transpose word_z

* add word_ids example

* check api compatibility

* improve compatibility

* update doc

* tweak verbose

* restore word_z layout; tweak

* tweak

* tweak doc

* word_cT

* use log_weight and some other tweaks

* rewrite README

* update equations

* rewrite for clarity and pass tests

* tweak

* bugfix import

* fix unit test

* fix mult to be the same as old versions

* tweak

* could be a bugfix

* 0/0=nan

* add doc_subgraph utility function

* minor cache optimization

* minor cache tweak

* add environmental variable to trade cache speed for memory

* update README

* tweak

* add sparse update pass unit test

* simplify sparse update

* improve low-memory efficiency

* tweak

* add sample expectation scores to allow resampling

* simplify

* update comment

* avoid edge cases

* bugfix pred scores

* simplify

* add save function
Co-authored-by: Yifei Ma <yifeim@amazon.com>
Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>
Co-authored-by: Jinjing Zhou <VoVAllen@users.noreply.github.com>
parent 9c41e97c
@@ -8,64 +8,137 @@
There is no back-propagation, because gradient descent is typically considered
inefficient on the probability simplex.
On the provided small-scale example on the 20 newsgroups dataset, our DGL-LDA model runs
50% faster on GPU than the sklearn model without joblib parallelism.
For larger graphs, thanks to subgraph sampling and a low-memory implementation, we may fit 100 million unique words with 256 topic dimensions on a large multi-GPU machine.
(The runtime memory is often less than 2x the parameter storage.)

Key equations
---

<!-- https://editor.codecogs.com/ -->

Let k be the topic index variable with one-hot encoded vector representation z. The rest of the variables are:

| | z_d\~p(θ_d) | w_k\~p(β_k) | z_dw\~q(ϕ_dw) |
|-------------|-------------|-------------|---------------|
| Prior | Dir(α) | Dir(η) | (n/a) |
| Posterior | Dir(γ_d) | Dir(λ_k) | (n/a) |

We overload w with the bold symbol **w**, which represents the entire observed document-word multi-graph. The difference is better shown in the original paper.

**Multinomial PCA**

Multinomial PCA is a "latent allocation" model without the "Dirichlet".
Its data likelihood sums over the latent topic-index variable k,
<img src="https://latex.codecogs.com/svg.image?\inline&space;p(w_{di}|\theta_d,\beta)=\sum_k\theta_{dk}\beta_{kw}"/>,
where θ_d and β_k are shared within the same document and topic, respectively.

If we perform gradient descent, we may need additional steps to project the parameters onto the probability simplices:
<img src="https://latex.codecogs.com/svg.image?\inline&space;\sum_k\theta_{dk}=1"/>
and
<img src="https://latex.codecogs.com/svg.image?\inline&space;\sum_w\beta_{kw}=1"/>.
Instead, a more efficient solution is to borrow ideas from evidence lower-bound (ELBO) decomposition:
<!--
\log p(w) \geq \mathcal{L}(w,\phi)
\stackrel{def}{=}
\mathbb{E}_q [\log p(w,z;\theta,\beta) - \log q(z;\phi)]
\\=
\mathbb{E}_q [\log p(w|z;\beta) + \log p(z;\theta) - \log q(z;\phi)]
\\=
\sum_{dwk}n_{dw}\phi_{dwk} [\log\beta_{kw} + \log \theta_{dk} - \log \phi_{dwk}]
-->
<img src="https://latex.codecogs.com/svg.image?\log&space;p(w)&space;\geq&space;\mathcal{L}(w,\phi)\stackrel{def}{=}\mathbb{E}_q&space;[\log&space;p(w,z;\theta,\beta)&space;-&space;\log&space;q(z;\phi)]\\=\mathbb{E}_q&space;[\log&space;p(w|z;\beta)&space;&plus;&space;\log&space;p(z;\theta)&space;-&space;\log&space;q(z;\phi)]\\=\sum_{dwk}n_{dw}\phi_{dwk}&space;[\log\beta_{kw}&space;&plus;&space;\log&space;\theta_{dk}&space;-&space;\log&space;\phi_{dwk}]"/>
The solutions for
<img src="https://latex.codecogs.com/svg.image?\inline&space;\theta_{dk}\propto\sum_wn_{dw}\phi_{dwk}"/>
and
<img src="https://latex.codecogs.com/svg.image?\inline&space;\beta_{kw}\propto\sum_dn_{dw}\phi_{dwk}"/>
follow from maximizing the corresponding cross-entropy terms.
The solution for
<img src="https://latex.codecogs.com/svg.image?\inline&space;\phi_{dwk}\propto&space;\theta_{dk}\beta_{kw}"/>
follows from minimizing a Kullback-Leibler divergence.
After normalizing to
<img src="https://latex.codecogs.com/svg.image?\inline&space;\sum_k\phi_{dwk}=1"/>,
the difference
<img src="https://latex.codecogs.com/svg.image?\inline&space;\ell_{dw}=\log\beta_{kw}+\log\theta_{dk}-\log\phi_{dwk}"/>
becomes constant in k,
which is connected to the likelihood for the observed document-word pairs.
Note that after learning, the document vector θ_d accounts for the correlations between all words in d, and similarly the topic distribution vector β_k accounts for the correlations across all observed documents.
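
To make the alternating updates above concrete, the following is a minimal NumPy sketch of multinomial-PCA coordinate ascent on a dense toy count matrix. It only illustrates the equations; it is not the DGL implementation, and all names (`n`, `theta`, `beta`, `phi`) are local to the snippet.

```python
import numpy as np

# Toy multinomial-PCA EM sketch (illustration only, not the DGL implementation).
# n[d, w] holds the observed count of word w in document d.
rng = np.random.default_rng(0)
D, W, K = 5, 11, 3
n = rng.poisson(1.0, size=(D, W)).astype(float)

theta = rng.dirichlet(np.ones(K), size=D)   # theta[d, k], each row sums to 1
beta = rng.dirichlet(np.ones(W), size=K)    # beta[k, w], each row sums to 1

for _ in range(50):
    # phi[d, w, k] ∝ theta[d, k] * beta[k, w], normalized over k
    phi = theta[:, None, :] * beta.T[None, :, :]
    phi /= phi.sum(axis=-1, keepdims=True)
    # theta[d, k] ∝ sum_w n[d, w] phi[d, w, k];  beta[k, w] ∝ sum_d n[d, w] phi[d, w, k]
    nphi = n[:, :, None] * phi
    theta = nphi.sum(axis=1)
    theta /= theta.sum(axis=1, keepdims=True)
    beta = nphi.sum(axis=0).T
    beta /= beta.sum(axis=1, keepdims=True)

# With the optimal phi the bound is tight and equals the data log-likelihood.
loglik = (n * np.log((theta @ beta).clip(1e-12))).sum()
print("log-likelihood:", float(loglik))
```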

**Variational Bayes**

A Bayesian model adds Dirichlet priors to θ_d and β_k, which leads to a similar ELBO if we assume the independence
<img src="https://latex.codecogs.com/svg.image?\inline&space;q(z,\theta,\beta;\phi,\gamma,\lambda)=q(z;\phi)q(\theta;\gamma)q(\beta;\lambda)"/>,
i.e.:
<!--
\log p(w;\alpha,\eta) \geq \mathcal{L}(w,\phi,\gamma,\lambda)
\stackrel{def}{=}
\mathbb{E}_q [\log p(w,z,\theta,\beta;\alpha,\eta) - \log q(z,\theta,\beta;\phi,\gamma,\lambda)]
\\=
\mathbb{E}_q \left[
\log p(w|z,\beta) + \log p(z|\theta) - \log q(z;\phi)
+\log p(\theta;\alpha) - \log q(\theta;\gamma)
+\log p(\beta;\eta) - \log q(\beta;\lambda)
\right]
\\=
\sum_{dwk}n_{dw}\phi_{dwk} (\mathbb{E}_{\lambda_k}[\log\beta_{kw}] + \mathbb{E}_{\gamma_d}[\log \theta_{dk}] - \log \phi_{dwk})
\\+\sum_{d}\left[
(\alpha-\gamma_d)^\top\mathbb{E}_{\gamma_d}[\log\theta_d]
-(\log B(\alpha 1_K) - \log B(\gamma_d))
\right]
\\+\sum_{k}\left[
(\eta-\lambda_k)^\top\mathbb{E}_{\lambda_k}[\log\beta_k]
-(\log B(\eta 1_W) - \log B(\lambda_k))
\right]
-->
<img src="https://latex.codecogs.com/svg.image?\log&space;p(w;\alpha,\eta)&space;\geq&space;\mathcal{L}(w,\phi,\gamma,\lambda)\stackrel{def}{=}\mathbb{E}_q&space;[\log&space;p(w,z,\theta,\beta;\alpha,\eta)&space;-&space;\log&space;q(z,\theta,\beta;\phi,\gamma,\lambda)]\\=\mathbb{E}_q&space;\left[\log&space;p(w|z,\beta)&space;&plus;&space;\log&space;p(z|\theta)&space;-&space;\log&space;q(z;\phi)&plus;\log&space;p(\theta;\alpha)&space;-&space;\log&space;q(\theta;\gamma)&plus;\log&space;p(\beta;\eta)&space;-&space;\log&space;q(\beta;\lambda)\right]\\=\sum_{dwk}n_{dw}\phi_{dwk}&space;(\mathbb{E}_{\lambda_k}[\log\beta_{kw}]&space;&plus;&space;\mathbb{E}_{\gamma_d}[\log&space;\theta_{dk}]&space;-&space;\log&space;\phi_{dwk})\\&plus;\sum_{d}\left[(\alpha-\gamma_d)^\top\mathbb{E}_{\gamma_d}[\log\theta_d]-(\log&space;B(\alpha&space;1_K)&space;-&space;\log&space;B(\gamma_d))\right]\\&plus;\sum_{k}\left[(\eta-\lambda_k)^\top\mathbb{E}_{\lambda_k}[\log\beta_k]-(\log&space;B(\eta&space;1_W)&space;-&space;\log&space;B(\lambda_k))\right]"/>
**Solutions**
The solutions to VB subsume the solutions to multinomial PCA as n goes to infinity.
The solution for ϕ is
<img src="https://latex.codecogs.com/svg.image?\inline&space;\log\phi_{dwk}=\mathbb{E}_{\gamma_d}[\log\theta_{dk}]+\mathbb{E}_{\lambda_k}[\log\beta_{kw}]-\ell_{dw}"/>,
where the additional expectations can be expressed via digamma functions
and
<img src="https://latex.codecogs.com/svg.image?\inline&space;\ell_{dw}=\log\sum_k\exp(\mathbb{E}_{\gamma_d}[\log\theta_{dk}]+\mathbb{E}_{\lambda_k}[\log\beta_{kw}])"/>
is the log-partition function.
The solutions for
<img src="https://latex.codecogs.com/svg.image?\inline&space;\gamma_{dk}=\alpha+\sum_wn_{dw}\phi_{dwk}"/>
and
<img src="https://latex.codecogs.com/svg.image?\inline&space;\lambda_{kw}=\eta+\sum_dn_{dw}\phi_{dwk}"/>
come from direct gradient calculation.
After substituting the optimal solutions, we compute the marginal likelihood by adding the three terms, which are all connected to (the negative of) Kullback-Leibler divergence.
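
As a compact illustration of these solutions (again a dense toy sketch with local names, not the subgraph-based DGL code), the digamma identities for the Dirichlet expectations and the ϕ/γ/λ coordinate updates look like this:

```python
import numpy as np
from scipy.special import digamma

# Toy variational-Bayes LDA updates (illustration only).
rng = np.random.default_rng(0)
D, W, K = 5, 11, 3
n = rng.poisson(1.0, size=(D, W)).astype(float)   # document-word counts
alpha, eta = 0.1, 0.1                              # symmetric Dirichlet priors

gamma = np.ones((D, K))   # q(theta_d) = Dir(gamma_d)
lam = np.ones((K, W))     # q(beta_k)  = Dir(lambda_k)

for _ in range(50):
    # E[log theta_dk], E[log beta_kw] under Dirichlet posteriors (digamma identity)
    Elog_theta = digamma(gamma) - digamma(gamma.sum(axis=1, keepdims=True))
    Elog_beta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
    # log phi_dwk = E[log theta_dk] + E[log beta_kw] - ell_dw (log-partition)
    log_phi = Elog_theta[:, None, :] + Elog_beta.T[None, :, :]
    log_phi -= log_phi.max(axis=-1, keepdims=True)   # for numerical stability
    phi = np.exp(log_phi)
    phi /= phi.sum(axis=-1, keepdims=True)
    # gamma_dk = alpha + sum_w n_dw phi_dwk;  lambda_kw = eta + sum_d n_dw phi_dwk
    nphi = n[:, :, None] * phi
    gamma = alpha + nphi.sum(axis=1)
    lam = eta + nphi.sum(axis=0).T

print("per-topic pseudo-counts:", lam.sum(axis=1))
```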

DGL usage
---

The corpus is represented as a bipartite multi-graph G.
We use DGL to propagate information through the edges and aggregate the distributions at the doc/word nodes.
For scalability, the phi variables are transient and updated during message passing.
The gamma/lambda variables are updated after the nodes receive all edge messages.
Following the conventions in [1], the gamma update is called the E-step and the lambda update is called the M-step.
The lambda variable is further recorded by the trainer.
A separate function produces perplexity, which is based on the ELBO objective function divided by the total number of word/doc occurrences.
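
For a rough picture of the data layout, here is a sketch of building such a bipartite doc-word multi-graph and converting an ELBO value into perplexity. The node/edge type names, the toy edges, and the placeholder ELBO value are assumptions for illustration, not necessarily what the example script uses.

```python
import dgl
import torch

# Toy bipartite doc-word multi-graph: one edge per word occurrence,
# so repeated (doc, word) pairs encode word counts.
# The node/edge type names below are illustrative assumptions.
doc_ids = torch.tensor([0, 0, 0, 1, 1, 2])
word_ids = torch.tensor([1, 1, 3, 0, 3, 2])
G = dgl.heterograph(
    {('doc', 'topic', 'word'): (doc_ids, word_ids)},
    num_nodes_dict={'doc': 3, 'word': 4})

# Perplexity from an ELBO value: exp(-ELBO / total number of occurrences).
elbo = -42.0                                  # placeholder number for illustration
ppl = float(torch.exp(torch.tensor(-elbo / G.num_edges())))
print(G.num_nodes('word'), G.num_edges(), ppl)
```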

Example
---

`%run example_20newsgroups.py`

* Approximately matches scikit-learn training perplexity after 10 rounds of training.
* Exactly matches scikit-learn training perplexity if word_z is set to lda.components_.T
* There is a difference in how we compute testing perplexity: we weigh the beta contributions by the training word counts, whereas sklearn weighs them by the test word counts.
* The DGL-LDA model runs 50% faster on GPU devices compared with sklearn without joblib parallelism.
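
For reference, here is a condensed sketch of the main calls from `example_20newsgroups.py`, assuming a training graph `G` and a held-out graph `Gt` built as above. The import path is an assumption (the example may import or alias the class differently); `model.word_data.nphi` follows the updated example shown in the diff below.

```python
import numpy as np
from lda_model import LDAModel   # assumed import; see example_20newsgroups.py

n_components = 10

# Fit on the bipartite doc-word graph G; Gt holds the held-out documents.
model = LDAModel(G.num_nodes('word'), n_components)
model.fit(G)

print(f"dgl-lda training perplexity {model.perplexity(G):.3f}")
print(f"dgl-lda testing perplexity {model.perplexity(Gt):.3f}")

# Topic-word pseudo-counts, playing the role of sklearn's lda.components_.
word_nphi = np.vstack([nphi.tolist() for nphi in model.word_data.nphi])
print(word_nphi.shape)
```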

Advanced configurations
---

* Set `0<rho<1` for online learning with partial_fit (see the hedged sketch below).
* Set `mult["doc"]=100` or `mult["word"]=100`, or some other large value, to disable the corresponding Bayesian priors.
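
The snippet below only restates the two bullets above as keyword arguments; the parameter names `rho` and `mult`, their placement in the constructor, and the minibatch loop are assumptions to be checked against `lda_model.py`.

```python
# Hypothetical configuration sketch -- check lda_model.py for the actual signature.
model = LDAModel(
    G.num_nodes('word'), n_components,
    rho=0.7,                        # assumed kwarg: 0<rho<1 enables online updates
    mult={'doc': 1, 'word': 100},   # assumed kwarg: a large mult disables that prior
)
for sub in doc_minibatches:         # hypothetical iterable of document subgraphs
    model.partial_fit(sub)
```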

References
---
...
@@ -103,7 +103,7 @@ print()
 
 print("Training dgl-lda model...")
 t0 = time()
-model = LDAModel(G, n_components)
+model = LDAModel(G.num_nodes('word'), n_components)
 model.fit(G)
 print("done in %0.3fs." % (time() - t0))
 print()
@@ -111,8 +111,9 @@ print()
 print(f"dgl-lda training perplexity {model.perplexity(G):.3f}")
 print(f"dgl-lda testing perplexity {model.perplexity(Gt):.3f}")
 
+word_nphi = np.vstack([nphi.tolist() for nphi in model.word_data.nphi])
 plot_top_words(
-    type('dummy', (object,), {'components_': G.ndata['z']['word'].cpu().numpy().T}),
+    type('dummy', (object,), {'components_': word_nphi}),
     tf_feature_names, n_top_words, 'Topics in LDA model')
 
 print("Training scikit-learn model...")
...