Commit 0d8720c2 authored by Chris Shallue's avatar Chris Shallue

Remove inline math from README.md (not supported by GitHub). Replace with
italics and subscript markers.
parent 4f9d1024
@@ -67,13 +67,17 @@ The following diagram illustrates the model architecture.
 ![Show and Tell Architecture](g3doc/show_and_tell_architecture.png)
 </center>
 
-In this diagram, $$\{ s_0, s_1, ..., s_{N-1} \}$$ are the words of the caption
-and $$\{ w_e s_0, w_e s_1, ..., w_e s_{N-1} \}$$ are their corresponding word
-embedding vectors. The outputs $$\{ p_1, p_2, ..., p_N \}$$ of the LSTM are
-probability distributions generated by the model for the next word in the
-sentence. The terms $$\{ \log p_1(s_1), \log p_2(s_2), ..., \log p_N(s_N) \}$$
-are the log-likelihoods of the correct word at each step; the negated sum of
-these terms is the minimization objective of the model.
+In this diagram, \{*s*<sub>0</sub>, *s*<sub>1</sub>, ..., *s*<sub>*N*-1</sub>\}
+are the words of the caption and \{*w*<sub>*e*</sub>*s*<sub>0</sub>,
+*w*<sub>*e*</sub>*s*<sub>1</sub>, ..., *w*<sub>*e*</sub>*s*<sub>*N*-1</sub>\}
+are their corresponding word embedding vectors. The outputs \{*p*<sub>1</sub>,
+*p*<sub>2</sub>, ..., *p*<sub>*N*</sub>\} of the LSTM are probability
+distributions generated by the model for the next word in the sentence. The
+terms \{log *p*<sub>1</sub>(*s*<sub>1</sub>),
+log *p*<sub>2</sub>(*s*<sub>2</sub>), ...,
+log *p*<sub>*N*</sub>(*s*<sub>*N*</sub>)\} are the log-likelihoods of the
+correct word at each step; the negated sum of these terms is the minimization
+objective of the model.
 
 During the first phase of training the parameters of the *Inception v3* model
 are kept fixed: it is simply a static image encoder function. A single trainable
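The objective in the paragraph above (the negated sum of per-step log-likelihoods of the correct words) can be sketched in plain Python. This is a minimal illustration of the formula, not code from the im2txt repository; the vocabulary, probabilities, and the `caption_loss` name are invented for the example.

```python
import math

def caption_loss(step_probs, caption):
    """Negated sum of log-likelihoods of the correct word at each step.

    step_probs: one dict per step t = 1..N; step_probs[t-1] maps each
        vocabulary word to p_t(word), the model's predicted probability.
    caption: the target words [s_1, ..., s_N] (s_0, the start token, is
        fed as input only and is never predicted).
    """
    return -sum(math.log(p[s]) for p, s in zip(step_probs, caption))

# Toy three-step example with hypothetical probabilities.
probs = [
    {"a": 0.7, "the": 0.3},
    {"giraffe": 0.6, "zebra": 0.4},
    {"</S>": 0.9, "standing": 0.1},
]
loss = caption_loss(probs, ["a", "giraffe", "</S>"])
# loss = -(log 0.7 + log 0.6 + log 0.9)
```

Minimizing this quantity over a training set is equivalent to maximizing the likelihood the model assigns to the reference captions.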
@@ -85,11 +89,11 @@ training, all parameters - including the parameters of *Inception v3* - are
 trained to jointly fine-tune the image encoder and the LSTM.
 
 Given a trained model and an image we use *beam search* to generate captions for
-that image. Captions are generated word-by-word, where at each step $$t$$ we use
-the set of sentences already generated with length $$t-1$$ to generate a new set
-of sentences with length $$t$$. We keep only the top $$k$$ candidates at each
-step, where the hyperparameter $$k$$ is called the *beam size*. We have found
-the best performance with $$k=3$$.
+that image. Captions are generated word-by-word, where at each step *t* we use
+the set of sentences already generated with length *t* - 1 to generate a new set
+of sentences with length *t*. We keep only the top *k* candidates at each step,
+where the hyperparameter *k* is called the *beam size*. We have found the best
+performance with *k* = 3.
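The beam-search procedure just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's actual implementation: `next_word_probs` stands in for one LSTM decoding step, and `toy_model` with its vocabulary and probabilities is invented for the example.

```python
import heapq
import math

def beam_search(next_word_probs, start_token, end_token, beam_size=3, max_len=20):
    """Keep the top `beam_size` partial captions at each step.

    next_word_probs(words) -> dict mapping each candidate next word to its
    probability given the caption so far. Candidates are scored by summed
    log-probability; finished captions (ending in end_token) are set aside.
    """
    beams = [(0.0, [start_token])]  # (log-probability, words so far)
    complete = []
    for _ in range(max_len):
        candidates = []
        for score, words in beams:
            for word, p in next_word_probs(words).items():
                candidates.append((score + math.log(p), words + [word]))
        beams = []
        for score, words in heapq.nlargest(beam_size, candidates):
            if words[-1] == end_token:
                complete.append((score, words))
            else:
                beams.append((score, words))
        if not beams:
            break
    return max(complete + beams) if (complete or beams) else None

# Toy language model over a tiny vocabulary (hypothetical probabilities):
# the next-word distribution depends only on the most recent word.
def toy_model(words):
    table = {
        "<S>": {"a": 0.6, "the": 0.4},
        "a": {"dog": 0.5, "cat": 0.5},
        "the": {"dog": 0.9, "cat": 0.1},
        "dog": {"</S>": 1.0},
        "cat": {"</S>": 1.0},
    }
    return table[words[-1]]

score, caption = beam_search(toy_model, "<S>", "</S>", beam_size=3)
# Highest-scoring caption: "<S> the dog </S>" (0.4 * 0.9 beats 0.6 * 0.5).
```

Note that greedily taking the single most likely first word ("a") would miss the best overall caption here; keeping *k* candidates per step is what lets the search recover it.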
 
 ## Getting Started