"magic_pdf/vscode:/vscode.git/clone" did not exist on "3271cf75d3f01895dcada70830a73edf95ef23a2"
Commit 0d8720c2 authored by Chris Shallue

Remove inline math from README.md (not supported by GitHub). Replace
with italics and subscript markers.
parent 4f9d1024
@@ -67,13 +67,17 @@ The following diagram illustrates the model architecture.
![Show and Tell Architecture](g3doc/show_and_tell_architecture.png)
</center>
-In this diagram, $$\{ s_0, s_1, ..., s_{N-1} \}$$ are the words of the caption
-and $$\{ w_e s_0, w_e s_1, ..., w_e s_{N-1} \}$$ are their corresponding word
-embedding vectors. The outputs $$\{ p_1, p_2, ..., p_N \}$$ of the LSTM are
-probability distributions generated by the model for the next word in the
-sentence. The terms $$\{ \log p_1(s_1), \log p_2(s_2), ..., \log p_N(s_N) \}$$
-are the log-likelihoods of the correct word at each step; the negated sum of
-these terms is the minimization objective of the model.
+In this diagram, \{*s*<sub>0</sub>, *s*<sub>1</sub>, ..., *s*<sub>*N*-1</sub>\}
+are the words of the caption and \{*w*<sub>*e*</sub>*s*<sub>0</sub>,
+*w*<sub>*e*</sub>*s*<sub>1</sub>, ..., *w*<sub>*e*</sub>*s*<sub>*N*-1</sub>\}
+are their corresponding word embedding vectors. The outputs \{*p*<sub>1</sub>,
+*p*<sub>2</sub>, ..., *p*<sub>*N*</sub>\} of the LSTM are probability
+distributions generated by the model for the next word in the sentence. The
+terms \{log *p*<sub>1</sub>(*s*<sub>1</sub>),
+log *p*<sub>2</sub>(*s*<sub>2</sub>), ...,
+log *p*<sub>*N*</sub>(*s*<sub>*N*</sub>)\} are the log-likelihoods of the
+correct word at each step; the negated sum of these terms is the minimization
+objective of the model.
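
The objective described above amounts to summed cross-entropy over the caption
words. A minimal numpy sketch of that computation; the names
`step_distributions` and `caption_ids` are illustrative, not im2txt
identifiers:

```python
import numpy as np

def caption_loss(step_distributions, caption_ids):
    """Negated sum of log-likelihoods of the correct word at each step.

    step_distributions: N probability vectors p_1..p_N over the vocabulary,
        one per LSTM step.
    caption_ids: the N ground-truth word ids s_1..s_N.
    """
    return -sum(np.log(p[s]) for p, s in zip(step_distributions, caption_ids))
```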
During the first phase of training the parameters of the *Inception v3* model
are kept fixed: it is simply a static image encoder function. A single trainable
@@ -85,11 +89,11 @@ training, all parameters - including the parameters of *Inception v3* - are
trained to jointly fine-tune the image encoder and the LSTM.
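
One way such a two-phase schedule can be expressed in TF1-style code is by
passing a restricted variable list to the optimizer during phase one. This is a
rough sketch under assumptions: the scope name `InceptionV3`, the toy
variables, and the learning rates are placeholders, not values from the im2txt
code:

```python
import tensorflow as tf  # TF1-style API, matching the im2txt era

# Toy stand-ins; the real model builds these variables in its graph.
image_encoder_w = tf.get_variable("InceptionV3/w", shape=[4])
lstm_w = tf.get_variable("lstm/w", shape=[4])
loss = tf.reduce_sum(tf.square(image_encoder_w - lstm_w))

all_vars = tf.trainable_variables()
# Phase 1: Inception v3 parameters stay fixed (static image encoder).
phase1_vars = [v for v in all_vars if not v.name.startswith("InceptionV3")]
phase1_op = tf.train.GradientDescentOptimizer(0.01).minimize(
    loss, var_list=phase1_vars)

# Phase 2: jointly fine-tune everything, including Inception v3.
phase2_op = tf.train.GradientDescentOptimizer(0.0005).minimize(
    loss, var_list=all_vars)
```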
Given a trained model and an image we use *beam search* to generate captions for
-that image. Captions are generated word-by-word, where at each step $$t$$ we use
-the set of sentences already generated with length $$t-1$$ to generate a new set
-of sentences with length $$t$$. We keep only the top $$k$$ candidates at each
-step, where the hyperparameter $$k$$ is called the *beam size*. We have found
-the best performance with $$k=3$$.
+that image. Captions are generated word-by-word, where at each step *t* we use
+the set of sentences already generated with length *t* - 1 to generate a new set
+of sentences with length *t*. We keep only the top *k* candidates at each step,
+where the hyperparameter *k* is called the *beam size*. We have found the best
+performance with *k* = 3.
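
A generic beam-search sketch, assuming a `next_word_logprobs(sentence)`
callback that returns `(word_id, log_prob)` pairs for candidate next words;
this is illustrative, not the im2txt implementation:

```python
import heapq

def beam_search(next_word_logprobs, start_id, end_id, beam_size=3, max_len=20):
    """Keep only the top-`beam_size` partial sentences at every step."""
    beam = [(0.0, [start_id])]  # (cumulative log-probability, partial sentence)
    complete = []
    for _ in range(max_len):
        candidates = []
        for score, sent in beam:
            for word, logp in next_word_logprobs(sent):
                item = (score + logp, sent + [word])
                (complete if word == end_id else candidates).append(item)
        if not candidates:
            break
        beam = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    # Return the highest-scoring sentence, finished or not.
    return max(complete + beam, key=lambda c: c[0])[1]
```

With `beam_size=1` this degenerates to greedy decoding; *k* = 3 trades a small
amount of extra computation for better captions.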
## Getting Started