ModelZoo / ResNet50_tensorflow · Commit 0d8720c2
Authored Sep 08, 2016 by Chris Shallue
Remove inline math from README.md (not supported by GitHub). Replace
with italics and subscript markers.
parent 4f9d1024
Showing 1 changed file with 16 additions and 12 deletions: im2txt/README.md
...
...
@@ -67,13 +67,17 @@ The following diagram illustrates the model architecture.

</center>
-In this diagram, $$\{s_0, s_1, ..., s_{N-1}\}$$ are the words of the caption
-and $$\{w_e s_0, w_e s_1, ..., w_e s_{N-1}\}$$ are their corresponding word
-embedding vectors. The outputs $$\{p_1, p_2, ..., p_N\}$$ of the LSTM are
-probability distributions generated by the model for the next word in the
-sentence. The terms $$\{\log p_1(s_1), \log p_2(s_2), ..., \log p_N(s_N)\}$$
-are the log-likelihoods of the correct word at each step; the negated sum of
-these terms is the minimization objective of the model.
+In this diagram, \{*s*<sub>0</sub>, *s*<sub>1</sub>, ..., *s*<sub>*N*-1</sub>\}
+are the words of the caption and \{*w*<sub>*e*</sub>*s*<sub>0</sub>,
+*w*<sub>*e*</sub>*s*<sub>1</sub>, ..., *w*<sub>*e*</sub>*s*<sub>*N*-1</sub>\}
+are their corresponding word embedding vectors. The outputs
+\{*p*<sub>1</sub>, *p*<sub>2</sub>, ..., *p*<sub>*N*</sub>\} of the LSTM are
+probability distributions generated by the model for the next word in the
+sentence. The terms \{log *p*<sub>1</sub>(*s*<sub>1</sub>),
+log *p*<sub>2</sub>(*s*<sub>2</sub>), ..., log *p*<sub>*N*</sub>(*s*<sub>*N*</sub>)\}
+are the log-likelihoods of the correct word at each step; the negated sum of
+these terms is the minimization objective of the model.
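This negated-sum-of-log-likelihoods objective can be sketched in a few lines of plain Python with NumPy; the function name `caption_nll` and the toy 5-word vocabulary below are illustrative stand-ins, not part of the im2txt code:

```python
import numpy as np

def caption_nll(probs, target_ids):
    """Negated sum of log p_t(s_t) over the caption words: the training loss."""
    log_likelihoods = [np.log(p[s]) for p, s in zip(probs, target_ids)]
    return -sum(log_likelihoods)

# Per-step model outputs p_1, p_2 over a toy 5-word vocabulary.
probs = [np.array([0.1, 0.6, 0.1, 0.1, 0.1]),    # p_1
         np.array([0.2, 0.2, 0.5, 0.05, 0.05])]  # p_2
targets = [1, 2]                                  # correct words s_1, s_2
loss = caption_nll(probs, targets)                # -(log 0.6 + log 0.5)
```

Minimizing this quantity pushes each step's distribution toward the correct next word.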
During the first phase of training the parameters of the *Inception v3* model
are kept fixed: it is simply a static image encoder function. A single trainable
...
...
@@ -85,11 +89,11 @@ training, all parameters - including the parameters of *Inception v3* - are
trained to jointly fine-tune the image encoder and the LSTM.
Given a trained model and an image we use *beam search* to generate captions for
-that image. Captions are generated word-by-word, where at each step $$t$$ we use
-the set of sentences already generated with length $$t-1$$ to generate a new set
-of sentences with length $$t$$. We keep only the top $$k$$ candidates at each
-step, where the hyperparameter $$k$$ is called the *beam size*. We have found
-the best performance with $$k=3$$.
+that image. Captions are generated word-by-word, where at each step *t* we use
+the set of sentences already generated with length *t* - 1 to generate a new set
+of sentences with length *t*. We keep only the top *k* candidates at each step,
+where the hyperparameter *k* is called the *beam size*. We have found the best
+performance with *k* = 3.
## Getting Started
...
...