"...models/git@developer.sourcefind.cn:OpenDAS/vision.git" did not exist on "9bbb777d934e00f4f26c69ac1c4730286cfec80c"
Commit 0d8720c2 authored by Chris Shallue

Remove inline math from README.md (not supported by GitHub). Replace with italics and subscript markers.
parent 4f9d1024
@@ -67,13 +67,17 @@ The following diagram illustrates the model architecture.
 ![Show and Tell Architecture](g3doc/show_and_tell_architecture.png)
 </center>
 
-In this diagram, $$\{ s_0, s_1, ..., s_{N-1} \}$$ are the words of the caption
-and $$\{ w_e s_0, w_e s_1, ..., w_e s_{N-1} \}$$ are their corresponding word
-embedding vectors. The outputs $$\{ p_1, p_2, ..., p_N \}$$ of the LSTM are
-probability distributions generated by the model for the next word in the
-sentence. The terms $$\{ \log p_1(s_1), \log p_2(s_2), ..., \log p_N(s_N) \}$$
-are the log-likelihoods of the correct word at each step; the negated sum of
-these terms is the minimization objective of the model.
+In this diagram, \{*s*<sub>0</sub>, *s*<sub>1</sub>, ..., *s*<sub>*N*-1</sub>\}
+are the words of the caption and \{*w*<sub>*e*</sub>*s*<sub>0</sub>,
+*w*<sub>*e*</sub>*s*<sub>1</sub>, ..., *w*<sub>*e*</sub>*s*<sub>*N*-1</sub>\}
+are their corresponding word embedding vectors. The outputs \{*p*<sub>1</sub>,
+*p*<sub>2</sub>, ..., *p*<sub>*N*</sub>\} of the LSTM are probability
+distributions generated by the model for the next word in the sentence. The
+terms \{log *p*<sub>1</sub>(*s*<sub>1</sub>),
+log *p*<sub>2</sub>(*s*<sub>2</sub>), ...,
+log *p*<sub>*N*</sub>(*s*<sub>*N*</sub>)\} are the log-likelihoods of the
+correct word at each step; the negated sum of these terms is the minimization
+objective of the model.
 
 During the first phase of training the parameters of the *Inception v3* model
 are kept fixed: it is simply a static image encoder function. A single trainable
@@ -85,11 +89,11 @@ training, all parameters - including the parameters of *Inception v3* - are
 trained to jointly fine-tune the image encoder and the LSTM.
 
 Given a trained model and an image we use *beam search* to generate captions for
-that image. Captions are generated word-by-word, where at each step $$t$$ we use
-the set of sentences already generated with length $$t-1$$ to generate a new set
-of sentences with length $$t$$. We keep only the top $$k$$ candidates at each
-step, where the hyperparameter $$k$$ is called the *beam size*. We have found
-the best performance with $$k=3$$.
+that image. Captions are generated word-by-word, where at each step *t* we use
+the set of sentences already generated with length *t* - 1 to generate a new set
+of sentences with length *t*. We keep only the top *k* candidates at each step,
+where the hyperparameter *k* is called the *beam size*. We have found the best
+performance with *k* = 3.
 
 ## Getting Started
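The minimization objective in the first hunk is just the per-word negative log-likelihood summed over the caption. As a rough sketch of that formula (plain Python/NumPy rather than the repository's TensorFlow code; `caption_loss`, `step_probs`, and `caption_ids` are names invented for this illustration):

```python
import numpy as np

def caption_loss(step_probs, caption_ids):
    """Negated sum of log p_t(s_t) over the caption, as described above.

    step_probs[t] is the LSTM's probability distribution over the vocabulary
    at step t+1, and caption_ids[t] is the correct next word s_{t+1}.
    """
    return -sum(np.log(p[s]) for p, s in zip(step_probs, caption_ids))

# Toy check: vocabulary of 4 words, caption of length 2.
probs = [np.array([0.1, 0.7, 0.1, 0.1]),
         np.array([0.2, 0.2, 0.5, 0.1])]
print(caption_loss(probs, [1, 2]))  # -(log 0.7 + log 0.5) ~= 1.05
```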
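The beam search described in the second hunk can likewise be sketched in a few lines. This is an illustration of the procedure the paragraph describes, not the actual inference code; `step_log_probs` is a hypothetical stand-in for the trained model that maps a partial caption (a list of word ids) to a log-probability for every vocabulary word:

```python
import heapq

def beam_search(step_log_probs, beam_size=3, max_len=20, end_id=0):
    """Grow sentences word-by-word, keeping the top `beam_size` candidates."""
    beams = [(0.0, [])]  # (cumulative log-probability, word-id sequence)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            # Extend each length t-1 sentence by one word to get length t.
            for word, lp in enumerate(step_log_probs(seq)):
                candidates.append((score + lp, seq + [word]))
        # Keep only the top `beam_size` candidates at this step.
        top = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
        finished += [c for c in top if c[1][-1] == end_id]
        beams = [c for c in top if c[1][-1] != end_id]
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[0])[1]
```

The default `beam_size=3` mirrors the *k* = 3 setting the README reports as giving the best performance.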