Commit 0d8720c2 authored by Chris Shallue

Remove inline math from README.md (not supported by GitHub). Replace with
italics and subscript markers.
parent 4f9d1024
@@ -67,13 +67,17 @@ The following diagram illustrates the model architecture.
![Show and Tell Architecture](g3doc/show_and_tell_architecture.png)
</center>
-In this diagram, $$\{ s_0, s_1, ..., s_{N-1} \}$$ are the words of the caption
-and $$\{ w_e s_0, w_e s_1, ..., w_e s_{N-1} \}$$ are their corresponding word
-embedding vectors. The outputs $$\{ p_1, p_2, ..., p_N \}$$ of the LSTM are
-probability distributions generated by the model for the next word in the
-sentence. The terms $$\{ \log p_1(s_1), \log p_2(s_2), ..., \log p_N(s_N) \}$$
-are the log-likelihoods of the correct word at each step; the negated sum of
-these terms is the minimization objective of the model.
+In this diagram, \{*s*<sub>0</sub>, *s*<sub>1</sub>, ..., *s*<sub>*N*-1</sub>\}
+are the words of the caption and \{*w*<sub>*e*</sub>*s*<sub>0</sub>,
+*w*<sub>*e*</sub>*s*<sub>1</sub>, ..., *w*<sub>*e*</sub>*s*<sub>*N*-1</sub>\}
+are their corresponding word embedding vectors. The outputs \{*p*<sub>1</sub>,
+*p*<sub>2</sub>, ..., *p*<sub>*N*</sub>\} of the LSTM are probability
+distributions generated by the model for the next word in the sentence. The
+terms \{log *p*<sub>1</sub>(*s*<sub>1</sub>),
+log *p*<sub>2</sub>(*s*<sub>2</sub>), ...,
+log *p*<sub>*N*</sub>(*s*<sub>*N*</sub>)\} are the log-likelihoods of the
+correct word at each step; the negated sum of these terms is the minimization
+objective of the model.
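
To make the objective concrete, here is a minimal sketch of the loss computed
from the quantities defined above; the names `probs` and `caption` are
illustrative, not taken from the repository.

```python
import numpy as np

def caption_loss(probs, caption):
    """Negated sum of log-likelihoods of the correct words.

    probs:   length-N list of probability distributions over the
             vocabulary, where probs[t - 1] is the LSTM output p_t.
    caption: length-N list of word ids (s_1, ..., s_N).
    """
    return -sum(np.log(p_t[s_t]) for p_t, s_t in zip(probs, caption))
```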
During the first phase of training the parameters of the *Inception v3* model
are kept fixed: it is simply a static image encoder function. A single trainable
@@ -85,11 +89,11 @@ training, all parameters - including the parameters of *Inception v3* - are
trained to jointly fine-tune the image encoder and the LSTM.
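
A minimal TF1-style sketch of these two phases (via `tf.compat.v1` in current
TensorFlow), assuming a scalar loss tensor like the objective above and
Inception v3 variables living under an `InceptionV3/` name scope; the function
and learning rates are illustrative, not the repository's actual training code.

```python
import tensorflow as tf

def build_train_ops(total_loss, lr_phase1=0.01, lr_phase2=0.001):
    """Return one train op per training phase (hypothetical values)."""
    # Phase 1: keep the Inception v3 encoder fixed by optimizing only
    # the variables outside its name scope.
    caption_vars = [v for v in tf.trainable_variables()
                    if not v.name.startswith("InceptionV3/")]
    phase1 = tf.train.GradientDescentOptimizer(lr_phase1).minimize(
        total_loss, var_list=caption_vars)
    # Phase 2: fine-tune all parameters jointly, including Inception v3.
    phase2 = tf.train.GradientDescentOptimizer(lr_phase2).minimize(
        total_loss)
    return phase1, phase2
```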
Given a trained model and an image we use *beam search* to generate captions for
-that image. Captions are generated word-by-word, where at each step $$t$$ we use
-the set of sentences already generated with length $$t-1$$ to generate a new set
-of sentences with length $$t$$. We keep only the top $$k$$ candidates at each
-step, where the hyperparameter $$k$$ is called the *beam size*. We have found
-the best performance with $$k=3$$.
+that image. Captions are generated word-by-word, where at each step *t* we use
+the set of sentences already generated with length *t* - 1 to generate a new set
+of sentences with length *t*. We keep only the top *k* candidates at each step,
+where the hyperparameter *k* is called the *beam size*. We have found the best
+performance with *k* = 3.
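
A minimal beam search sketch under these definitions: `log_prob_next_words` is
a hypothetical function returning the model's log-probability for each
candidate next word, and `end_id` marks caption termination; the names are
illustrative, not from the repository.

```python
import heapq

def beam_search(log_prob_next_words, start_id, end_id, beam_size=3,
                max_steps=20):
    """Generate a caption word by word, keeping only the top-k partial
    sentences (the beams) at each step; k is the beam size.

    log_prob_next_words(sentence) -> iterable of (word_id, log_prob)
    pairs for the next word given the sentence so far.
    """
    beams = [(0.0, [start_id])]  # (cumulative log-probability, sentence)
    complete = []
    for _ in range(max_steps):
        candidates = []
        for score, sentence in beams:
            for word, log_p in log_prob_next_words(sentence):
                entry = (score + log_p, sentence + [word])
                if word == end_id:
                    complete.append(entry)
                else:
                    candidates.append(entry)
        if not candidates:
            break
        # Keep only the top-k candidates at each step.
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return max(complete + beams, key=lambda c: c[0])[1]
```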
## Getting Started