"...git@developer.sourcefind.cn:chenpangpang/transformers.git" did not exist on "abbc96a2148da0c91fa078bd021984b2cc10ef85"
Unverified commit df017c3c authored by Arthur, committed by GitHub

[CLAP] Doc nits (#22957)

clap nits
parent 137eb8e6
@@ -14,10 +14,10 @@ specific language governing permissions and limitations under the License.
 ## Overview
-The CLAP model was proposed in [Large Scale Constrastive Laungaue-Audio pretraining with
+The CLAP model was proposed in [Large Scale Contrastive Language-Audio pretraining with
 feature fusion and keyword-to-caption augmentation](https://arxiv.org/pdf/2211.06687.pdf) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
-CLAP (Constrastive Laungaue-Audio Pretraining) is a neural network trained on a variety of (audio, text) pairs. It can be instructed in to predict the most relevant text snippet, given an audio, without directly optimizing for the task. The CLAP model uses a SWINTransformer to get audio features from a log-Mel spectrogram input, and a RoBERTa model to get text features. Both the text and audio features are then projected to a latent space with identical dimension. The dot product between the projected audio and text features is then used as a similar score.
+CLAP (Contrastive Language-Audio Pretraining) is a neural network trained on a variety of (audio, text) pairs. It can be instructed in to predict the most relevant text snippet, given an audio, without directly optimizing for the task. The CLAP model uses a SWINTransformer to get audio features from a log-Mel spectrogram input, and a RoBERTa model to get text features. Both the text and audio features are then projected to a latent space with identical dimension. The dot product between the projected audio and text features is then used as a similarity score.
 The abstract from the paper is the following:
...
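The paragraph above describes how CLAP scores audio against text: both are embedded, projected to a shared latent space, and compared via a dot product. The following is a minimal sketch of zero-shot audio classification with the Transformers CLAP classes; the `laion/clap-htsat-unfused` checkpoint name and the random-noise waveform are assumptions used purely for illustration.

```python
import torch
from transformers import ClapModel, ClapProcessor

# Assumed checkpoint name; substitute any available CLAP checkpoint.
model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# One second of 48 kHz mono audio; random noise stands in for a real recording.
audio = torch.randn(48_000).numpy()
texts = ["the sound of a dog barking", "the sound of rain"]

inputs = processor(
    text=texts,
    audios=audio,
    sampling_rate=48_000,
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_audio holds the scaled dot products between the projected
# audio embedding and each projected text embedding.
probs = outputs.logits_per_audio.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```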