Commit dfffd623 authored by Dan Ellis, committed by Manoj Plakal

Adding research/audioset/yamnet, a pre-trained audio event classifier (#7850)

Add files for YamNet, a sound event classifier.
parent 9b98e3db
# YAMNet
YAMNet is a pretrained deep net that predicts 521 audio event classes. It was
trained on the [AudioSet-YouTube corpus](http://g.co/audioset) and uses the
[Mobilenet_v1](https://arxiv.org/pdf/1704.04861.pdf) depthwise-separable
convolution architecture.
This directory contains the Keras code to construct the model, and example code
for applying the model to input sound files.
## Installation
YAMNet depends on the following Python packages:
* [`numpy`](http://www.numpy.org/)
* [`scipy`](http://www.scipy.org/)
* [`resampy`](http://resampy.readthedocs.io/en/latest/)
* [`tensorflow`](http://www.tensorflow.org/)
* [`pysoundfile`](https://pysoundfile.readthedocs.io/)
These are all easily installable via, e.g., `pip install numpy` (as in the
example command sequence below).
Any reasonably recent version of these packages should work. TensorFlow should
be at least version 1.8 to ensure Keras support is included. We have tested
that everything works on Ubuntu and macOS with Python 3.7.2, NumPy v1.15.4,
SciPy v1.1.0, resampy v0.2.1, TensorFlow v1.14.0, and PySoundFile 0.9.0.
YAMNet also requires downloading the following data file:
* [YAMNet model weights](https://storage.googleapis.com/audioset/yamnet.h5),
stored as Keras weights in HDF5 format.
After downloading this file into the same directory as this README, you can
test the installation by running `python yamnet_test.py`, which runs some
synthetic signals through the model and checks the outputs.
Here's a sample installation and test session:
```shell
# Upgrade pip first.
python -m pip install --upgrade pip
# Install dependencies. Resampy needs to be installed after NumPy and SciPy
# are already installed.
pip install numpy scipy
pip install resampy tensorflow pysoundfile
# Clone TensorFlow models repo into a 'models' directory.
git clone https://github.com/tensorflow/models.git
cd models/research/audioset/yamnet
# Download data file into same directory as code.
curl -O https://storage.googleapis.com/audioset/yamnet.h5
# Installation ready, let's test it.
python yamnet_test.py
# If we see "Ran 4 tests ... OK ...", then we're all set.
```
## Usage
You can run the model over existing sound files using `inference.py`:
```shell
python inference.py input_sound.wav
```
The code reports the five highest-scoring classes, with scores averaged over
all the frames of the input. You can access greater detail by modifying the
example code in `inference.py`.
See the Jupyter notebook `yamnet_visualization.ipynb` for an example of
displaying the per-frame model output scores.
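
The clip-level summary described above (average the per-frame scores over time, then keep the highest-scoring classes) can be sketched in plain Python. This is only an illustration under stated assumptions: `top_k_classes`, the toy class names, and the toy scores are hypothetical and are not part of `inference.py`, which does the same thing with NumPy on the real model output.

```python
def top_k_classes(scores, class_names, k=5):
    """Average per-frame scores over time and return the k best (name, score) pairs.

    scores: list of frames, each a list with one score per class.
    class_names: list of display names, one per class.
    """
    num_frames = len(scores)
    num_classes = len(class_names)
    # Mean score of each class across all frames of the clip.
    mean_scores = [
        sum(frame[c] for frame in scores) / num_frames
        for c in range(num_classes)
    ]
    # Class indices sorted from highest to lowest mean score.
    ranked = sorted(range(num_classes), key=lambda c: mean_scores[c], reverse=True)
    return [(class_names[c], mean_scores[c]) for c in ranked[:k]]


# Toy data (hypothetical): 3 frames, 4 classes.
toy_names = ['Speech', 'Music', 'Dog', 'Silence']
toy_scores = [
    [0.9, 0.1, 0.0, 0.0],
    [0.7, 0.3, 0.0, 0.1],
    [0.8, 0.2, 0.3, 0.0],
]
top2 = top_k_classes(toy_scores, toy_names, k=2)
for name, score in top2:
    print('{:12s}: {:.3f}'.format(name, score))
```

Averaging over frames gives one score vector per clip; to see how the scores evolve over time, keep the per-frame rows instead of reducing them.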
## About the Model
The YAMNet code layout is as follows:
* `yamnet.py`: Model definition in Keras.
* `params.py`: Hyperparameters. You can usefully modify PATCH_HOP_SECONDS.
* `features.py`: Audio feature extraction helpers.
* `inference.py`: Example code to classify input wav files.
* `yamnet_test.py`: Simple test of YAMNet installation.
### Input: Audio Features
See `features.py`.
As with our previous release
[VGGish](https://github.com/tensorflow/models/tree/master/research/audioset/vggish),
YAMNet was trained with audio features computed as follows:
* All audio is resampled to 16 kHz mono.
* A spectrogram is computed using magnitudes of the Short-Time Fourier Transform
with a window size of 25 ms, a window hop of 10 ms, and a periodic Hann
window.
* A mel spectrogram is computed by mapping the spectrogram to 64 mel bins
covering the range 125-7500 Hz.
* A stabilized log mel spectrogram is computed by applying
log(mel-spectrum + 0.001) where the offset is used to avoid taking a logarithm
of zero.
* These features are then framed into 50%-overlapping examples of 0.96 seconds,
where each example covers 64 mel bands and 96 frames of 10 ms each.
These 96x64 patches are then fed into the Mobilenet_v1 model to yield a 3x2
array of activations for 1024 kernels at the top of the convolution. These are
averaged to give a 1024-dimension embedding, then put through a single logistic
layer to get the 521 per-class output scores corresponding to the 960 ms input
waveform segment. (Because of the window framing, you need at least 975 ms of
input waveform to get the first frame of output scores.)
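
The framing arithmetic above can be checked with a short sketch. It assumes that both the STFT and the patch framing emit only complete frames (no end padding) and that `'same'`-padded convolutions with stride `s` produce `ceil(n / s)` outputs. The helper names are hypothetical; the constants mirror `params.py` and the stride list mirrors the layer table in `yamnet.py`.

```python
import math

SAMPLE_RATE = 16000
STFT_WINDOW_SECONDS = 0.025
STFT_HOP_SECONDS = 0.010
PATCH_WINDOW_SECONDS = 0.96
PATCH_HOP_SECONDS = 0.48
# Strides of the 14 convolution layers, in order.
LAYER_STRIDES = [2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1]


def num_stft_frames(num_samples):
    """Number of complete STFT frames for a waveform (assumes no end padding)."""
    window = int(round(SAMPLE_RATE * STFT_WINDOW_SECONDS))  # 400 samples
    hop = int(round(SAMPLE_RATE * STFT_HOP_SECONDS))        # 160 samples
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop


def num_patches(num_samples):
    """Number of complete 0.96 s patches (assumes no end padding)."""
    frames = num_stft_frames(num_samples)
    patch_frames = int(round(PATCH_WINDOW_SECONDS / STFT_HOP_SECONDS))  # 96
    patch_hop = int(round(PATCH_HOP_SECONDS / STFT_HOP_SECONDS))        # 48
    if frames < patch_frames:
        return 0
    return 1 + (frames - patch_frames) // patch_hop


def conv_output_size(size, strides):
    """Spatial size after a chain of 'same'-padded strided convolutions."""
    for stride in strides:
        size = math.ceil(size / stride)
    return size


min_samples = 15600  # 0.975 s at 16 kHz
print(num_patches(min_samples - 1), num_patches(min_samples))
print(num_patches(3 * SAMPLE_RATE))
print(conv_output_size(96, LAYER_STRIDES), conv_output_size(64, LAYER_STRIDES))
```

Under these assumptions, 96 STFT frames need 400 + 95 × 160 = 15,600 samples, which is where the 975 ms minimum comes from, and the five stride-2 layers reduce a 96x64 patch to 3x2.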
### Class vocabulary
The file `yamnet_class_map.csv` describes the audio event classes associated
with each of the 521 outputs of the network. Its format is:
```text
index,mid,display_name
```
where `index` is the model output index (0..520), `mid` is the machine
identifier for that class (e.g. `/m/09x0r`), and `display_name` is a
human-readable description of the class (e.g. `Speech`).
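
As a minimal sketch of reading this format, the snippet below parses a few rows with Python's standard `csv` module. The `load_class_map` helper and the in-memory `SAMPLE_CSV` string are hypothetical stand-ins for illustration; the rows themselves are copied from `yamnet_class_map.csv`. Note that some display names contain commas and are therefore quoted, so a CSV parser is safer than splitting lines on `,`.

```python
import csv
import io

# A few rows copied from yamnet_class_map.csv, held in memory for the example.
SAMPLE_CSV = '''index,mid,display_name
0,/m/09x0r,Speech
1,/m/0ytgt,"Child speech, kid speaking"
2,/m/01h8n0,Conversation
'''


def load_class_map(csv_file):
    """Return a list of (index, mid, display_name) tuples from an open CSV file."""
    reader = csv.DictReader(csv_file)  # Uses the header row for field names.
    return [(int(row['index']), row['mid'], row['display_name'])
            for row in reader]


class_map = load_class_map(io.StringIO(SAMPLE_CSV))
for index, mid, name in class_map:
    print(index, mid, name)
```

To read the real file, pass an open file handle, for example `open('yamnet_class_map.csv', newline='')`, instead of the `io.StringIO` object.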
The original AudioSet data release had 527 classes. This model drops six of
them on the recommendation of our Fairness reviewers to avoid potentially
offensive mislabelings. We dropped the gendered versions (Male/Female) of
Speech and Singing. We also dropped Battle cry and Funny music.
### Performance
On the 20,366-segment AudioSet eval set, over the 521 included classes, the
balanced average d-prime is 2.318, balanced mAP is 0.306, and the balanced
average lwlrap is 0.393.
According to our calculations, the classifier has 3.7M weights and performs
69.2M multiplies for each 960 ms input frame.
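
The weight count can be roughly reproduced from the layer table in `yamnet.py`. The sketch below is an estimate under stated assumptions, not the authors' calculation: it counts only convolution kernels and the final dense layer, ignores batch-normalization parameters, and uses hypothetical names (`LAYER_DEFS`, `count_weights`).

```python
# (layer_type, kernel_size, num_filters), following the layer table in yamnet.py.
LAYER_DEFS = [
    ('conv', 3, 32),
    ('sep', 3, 64),
    ('sep', 3, 128),
    ('sep', 3, 128),
    ('sep', 3, 256),
    ('sep', 3, 256),
    ('sep', 3, 512),
    ('sep', 3, 512),
    ('sep', 3, 512),
    ('sep', 3, 512),
    ('sep', 3, 512),
    ('sep', 3, 512),
    ('sep', 3, 1024),
    ('sep', 3, 1024),
]
NUM_CLASSES = 521


def count_weights(layer_defs, in_channels=1, num_classes=NUM_CLASSES):
    """Count conv and dense weights (batch-norm parameters are ignored)."""
    total = 0
    channels = in_channels
    for kind, kernel, filters in layer_defs:
        if kind == 'conv':
            # Standard convolution: one kernel per (input, output) channel pair.
            total += kernel * kernel * channels * filters
        else:
            # Depthwise step: one kernel per input channel.
            total += kernel * kernel * channels
            # Pointwise 1x1 step mixes channels.
            total += channels * filters
        channels = filters
    # Final dense layer with bias on the pooled embedding.
    total += channels * num_classes + num_classes
    return total


total_weights = count_weights(LAYER_DEFS)
print('{:.2f}M weights'.format(total_weights / 1e6))
```

The result lands close to the 3.7M figure quoted above; most of the weights sit in the pointwise convolutions of the last few layers and in the 1024-to-521 dense layer.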
### Contact information
This model repository is maintained by [Manoj Plakal](https://github.com/plakal) and [Dan Ellis](https://github.com/dpwe).
# Copyright 2019 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Feature computation for YAMNet."""

import numpy as np
import tensorflow as tf


def waveform_to_log_mel_spectrogram(waveform, params):
  """Compute log mel spectrogram of a 1-D waveform."""
  with tf.name_scope('log_mel_features'):
    # waveform has shape [<# samples>]

    # Convert waveform into spectrogram using a Short-Time Fourier Transform.
    # Note that tf.signal.stft() uses a periodic Hann window by default.
    window_length_samples = int(
        round(params.SAMPLE_RATE * params.STFT_WINDOW_SECONDS))
    hop_length_samples = int(
        round(params.SAMPLE_RATE * params.STFT_HOP_SECONDS))
    fft_length = 2 ** int(np.ceil(np.log(window_length_samples) / np.log(2.0)))
    num_spectrogram_bins = fft_length // 2 + 1
    magnitude_spectrogram = tf.abs(tf.signal.stft(
        signals=waveform,
        frame_length=window_length_samples,
        frame_step=hop_length_samples,
        fft_length=fft_length))
    # magnitude_spectrogram has shape [<# STFT frames>, num_spectrogram_bins]

    # Convert spectrogram into log mel spectrogram.
    linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=params.MEL_BANDS,
        num_spectrogram_bins=num_spectrogram_bins,
        sample_rate=params.SAMPLE_RATE,
        lower_edge_hertz=params.MEL_MIN_HZ,
        upper_edge_hertz=params.MEL_MAX_HZ)
    mel_spectrogram = tf.matmul(
        magnitude_spectrogram, linear_to_mel_weight_matrix)
    log_mel_spectrogram = tf.math.log(mel_spectrogram + params.LOG_OFFSET)
    # log_mel_spectrogram has shape [<# STFT frames>, MEL_BANDS]

    return log_mel_spectrogram


def spectrogram_to_patches(spectrogram, params):
  """Break up a spectrogram into a stack of fixed-size patches."""
  with tf.name_scope('feature_patches'):
    # Frame spectrogram (shape [<# STFT frames>, MEL_BANDS]) into patches
    # (the input examples).
    # Only complete frames are emitted, so if there is less than
    # PATCH_WINDOW_SECONDS of waveform then nothing is emitted
    # (to avoid this, zero-pad before processing).
    hop_length_samples = int(
        round(params.SAMPLE_RATE * params.STFT_HOP_SECONDS))
    spectrogram_sr = params.SAMPLE_RATE / hop_length_samples
    patch_window_length_samples = int(
        round(spectrogram_sr * params.PATCH_WINDOW_SECONDS))
    patch_hop_length_samples = int(
        round(spectrogram_sr * params.PATCH_HOP_SECONDS))
    features = tf.signal.frame(
        signal=spectrogram,
        frame_length=patch_window_length_samples,
        frame_step=patch_hop_length_samples,
        axis=0)
    # features has shape [<# patches>, <# STFT frames in a patch>, MEL_BANDS]

    return features
# Copyright 2019 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Inference demo for YAMNet."""

from __future__ import division, print_function

import sys

import numpy as np
import resampy
import soundfile as sf

import params
import yamnet as yamnet_model


def main(argv):
  assert argv

  yamnet = yamnet_model.yamnet_frames_model(params)
  yamnet.load_weights('yamnet.h5')
  yamnet_classes = yamnet_model.class_names('yamnet_class_map.csv')

  for file_name in argv:
    # Decode the WAV file.
    wav_data, sr = sf.read(file_name, dtype=np.int16)
    assert wav_data.dtype == np.int16, 'Bad sample type: %r' % wav_data.dtype
    waveform = wav_data / 32768.0  # Convert to [-1.0, +1.0]

    # Convert to mono and the sample rate expected by YAMNet.
    if len(waveform.shape) > 1:
      waveform = np.mean(waveform, axis=1)
    if sr != params.SAMPLE_RATE:
      waveform = resampy.resample(waveform, sr, params.SAMPLE_RATE)

    # Predict YAMNet classes.
    # Second output is log-mel-spectrogram array (used for visualizations).
    # (steps=1 is a work around for Keras batching limitations.)
    scores, _ = yamnet.predict(np.reshape(waveform, [1, -1]), steps=1)
    # Scores is a matrix of (time_frames, num_classes) classifier scores.
    # Average them along time to get an overall classifier output for the clip.
    prediction = np.mean(scores, axis=0)
    # Report the highest-scoring classes and their scores.
    top5_i = np.argsort(prediction)[::-1][:5]
    print(file_name, ':\n' +
          '\n'.join(' {:12s}: {:.3f}'.format(yamnet_classes[i], prediction[i])
                    for i in top5_i))


if __name__ == '__main__':
  main(sys.argv[1:])
# Copyright 2019 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Hyperparameters for YAMNet."""
# The following hyperparameters (except PATCH_HOP_SECONDS) were used to train YAMNet,
# so expect some variability in performance if you change these. The patch hop can
# be changed arbitrarily: a smaller hop should give you more patches from the same
# clip and possibly better performance at a larger computational cost.
SAMPLE_RATE = 16000
STFT_WINDOW_SECONDS = 0.025
STFT_HOP_SECONDS = 0.010
MEL_BANDS = 64
MEL_MIN_HZ = 125
MEL_MAX_HZ = 7500
LOG_OFFSET = 0.001
PATCH_WINDOW_SECONDS = 0.96
PATCH_HOP_SECONDS = 0.48
PATCH_FRAMES = int(round(PATCH_WINDOW_SECONDS / STFT_HOP_SECONDS))
PATCH_BANDS = MEL_BANDS
NUM_CLASSES = 521
CONV_PADDING = 'same'
BATCHNORM_CENTER = True
BATCHNORM_SCALE = False
BATCHNORM_EPSILON = 1e-4
CLASSIFIER_ACTIVATION = 'sigmoid'
FEATURES_LAYER_NAME = 'features'
EXAMPLE_PREDICTIONS_LAYER_NAME = 'predictions'
# Copyright 2019 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Core model definition of YAMNet."""

import csv

import numpy as np
import tensorflow as tf
from tensorflow.keras import Model, layers

import features as features_lib
import params


def _batch_norm(name):
  def _bn_layer(layer_input):
    return layers.BatchNormalization(
        name=name,
        center=params.BATCHNORM_CENTER,
        scale=params.BATCHNORM_SCALE,
        epsilon=params.BATCHNORM_EPSILON)(layer_input)
  return _bn_layer


def _conv(name, kernel, stride, filters):
  def _conv_layer(layer_input):
    output = layers.Conv2D(name='{}/conv'.format(name),
                           filters=filters,
                           kernel_size=kernel,
                           strides=stride,
                           padding=params.CONV_PADDING,
                           use_bias=False,
                           activation=None)(layer_input)
    output = _batch_norm(name='{}/conv/bn'.format(name))(output)
    output = layers.ReLU(name='{}/relu'.format(name))(output)
    return output
  return _conv_layer


def _separable_conv(name, kernel, stride, filters):
  def _separable_conv_layer(layer_input):
    output = layers.DepthwiseConv2D(name='{}/depthwise_conv'.format(name),
                                    kernel_size=kernel,
                                    strides=stride,
                                    depth_multiplier=1,
                                    padding=params.CONV_PADDING,
                                    use_bias=False,
                                    activation=None)(layer_input)
    output = _batch_norm(name='{}/depthwise_conv/bn'.format(name))(output)
    output = layers.ReLU(name='{}/depthwise_conv/relu'.format(name))(output)
    output = layers.Conv2D(name='{}/pointwise_conv'.format(name),
                           filters=filters,
                           kernel_size=(1, 1),
                           strides=1,
                           padding=params.CONV_PADDING,
                           use_bias=False,
                           activation=None)(output)
    output = _batch_norm(name='{}/pointwise_conv/bn'.format(name))(output)
    output = layers.ReLU(name='{}/pointwise_conv/relu'.format(name))(output)
    return output
  return _separable_conv_layer


_YAMNET_LAYER_DEFS = [
    # (layer_function, kernel, stride, num_filters)
    (_conv,           [3, 3], 2,   32),
    (_separable_conv, [3, 3], 1,   64),
    (_separable_conv, [3, 3], 2,  128),
    (_separable_conv, [3, 3], 1,  128),
    (_separable_conv, [3, 3], 2,  256),
    (_separable_conv, [3, 3], 1,  256),
    (_separable_conv, [3, 3], 2,  512),
    (_separable_conv, [3, 3], 1,  512),
    (_separable_conv, [3, 3], 1,  512),
    (_separable_conv, [3, 3], 1,  512),
    (_separable_conv, [3, 3], 1,  512),
    (_separable_conv, [3, 3], 1,  512),
    (_separable_conv, [3, 3], 2, 1024),
    (_separable_conv, [3, 3], 1, 1024)
]


def yamnet(features):
  """Define the core YAMNet model in Keras."""
  net = layers.Reshape(
      (params.PATCH_FRAMES, params.PATCH_BANDS, 1),
      input_shape=(params.PATCH_FRAMES, params.PATCH_BANDS))(features)
  for (i, (layer_fun, kernel, stride, filters)) in enumerate(_YAMNET_LAYER_DEFS):
    net = layer_fun('layer{}'.format(i + 1), kernel, stride, filters)(net)
  net = layers.GlobalAveragePooling2D()(net)
  logits = layers.Dense(units=params.NUM_CLASSES, use_bias=True)(net)
  predictions = layers.Activation(
      name=params.EXAMPLE_PREDICTIONS_LAYER_NAME,
      activation=params.CLASSIFIER_ACTIVATION)(logits)
  return predictions


def yamnet_frames_model(feature_params):
  """Defines the YAMNet waveform-to-class-scores model.

  Args:
    feature_params: An object with parameter fields to control the feature
      calculation.

  Returns:
    A model accepting (1, num_samples) waveform input and emitting a
    (num_patches, num_classes) matrix of class scores per time frame as
    well as a (num_spectrogram_frames, num_mel_bins) spectrogram feature
    matrix.
  """
  waveform = layers.Input(batch_shape=(1, None))
  # Store the intermediate spectrogram features to use in visualization.
  spectrogram = features_lib.waveform_to_log_mel_spectrogram(
      tf.squeeze(waveform, axis=0), feature_params)
  patches = features_lib.spectrogram_to_patches(spectrogram, feature_params)
  predictions = yamnet(patches)
  frames_model = Model(name='yamnet_frames',
                       inputs=waveform, outputs=[predictions, spectrogram])
  return frames_model


def class_names(class_map_csv):
  """Read the class name definition file and return a list of strings."""
  with open(class_map_csv) as csv_file:
    reader = csv.reader(csv_file)
    next(reader)  # Skip header
    return np.array([display_name for (_, _, display_name) in reader])
index,mid,display_name
0,/m/09x0r,Speech
1,/m/0ytgt,"Child speech, kid speaking"
2,/m/01h8n0,Conversation
3,/m/02qldy,"Narration, monologue"
4,/m/0261r1,Babbling
5,/m/0brhx,Speech synthesizer
6,/m/07p6fty,Shout
7,/m/07q4ntr,Bellow
8,/m/07rwj3x,Whoop
9,/m/07sr1lc,Yell
10,/t/dd00135,Children shouting
11,/m/03qc9zr,Screaming
12,/m/02rtxlg,Whispering
13,/m/01j3sz,Laughter
14,/t/dd00001,Baby laughter
15,/m/07r660_,Giggle
16,/m/07s04w4,Snicker
17,/m/07sq110,Belly laugh
18,/m/07rgt08,"Chuckle, chortle"
19,/m/0463cq4,"Crying, sobbing"
20,/t/dd00002,"Baby cry, infant cry"
21,/m/07qz6j3,Whimper
22,/m/07qw_06,"Wail, moan"
23,/m/07plz5l,Sigh
24,/m/015lz1,Singing
25,/m/0l14jd,Choir
26,/m/01swy6,Yodeling
27,/m/02bk07,Chant
28,/m/01c194,Mantra
29,/t/dd00005,Child singing
30,/t/dd00006,Synthetic singing
31,/m/06bxc,Rapping
32,/m/02fxyj,Humming
33,/m/07s2xch,Groan
34,/m/07r4k75,Grunt
35,/m/01w250,Whistling
36,/m/0lyf6,Breathing
37,/m/07mzm6,Wheeze
38,/m/01d3sd,Snoring
39,/m/07s0dtb,Gasp
40,/m/07pyy8b,Pant
41,/m/07q0yl5,Snort
42,/m/01b_21,Cough
43,/m/0dl9sf8,Throat clearing
44,/m/01hsr_,Sneeze
45,/m/07ppn3j,Sniff
46,/m/06h7j,Run
47,/m/07qv_x_,Shuffle
48,/m/07pbtc8,"Walk, footsteps"
49,/m/03cczk,"Chewing, mastication"
50,/m/07pdhp0,Biting
51,/m/0939n_,Gargling
52,/m/01g90h,Stomach rumble
53,/m/03q5_w,"Burping, eructation"
54,/m/02p3nc,Hiccup
55,/m/02_nn,Fart
56,/m/0k65p,Hands
57,/m/025_jnm,Finger snapping
58,/m/0l15bq,Clapping
59,/m/01jg02,"Heart sounds, heartbeat"
60,/m/01jg1z,Heart murmur
61,/m/053hz1,Cheering
62,/m/028ght,Applause
63,/m/07rkbfh,Chatter
64,/m/03qtwd,Crowd
65,/m/07qfr4h,"Hubbub, speech noise, speech babble"
66,/t/dd00013,Children playing
67,/m/0jbk,Animal
68,/m/068hy,"Domestic animals, pets"
69,/m/0bt9lr,Dog
70,/m/05tny_,Bark
71,/m/07r_k2n,Yip
72,/m/07qf0zm,Howl
73,/m/07rc7d9,Bow-wow
74,/m/0ghcn6,Growling
75,/t/dd00136,Whimper (dog)
76,/m/01yrx,Cat
77,/m/02yds9,Purr
78,/m/07qrkrw,Meow
79,/m/07rjwbb,Hiss
80,/m/07r81j2,Caterwaul
81,/m/0ch8v,"Livestock, farm animals, working animals"
82,/m/03k3r,Horse
83,/m/07rv9rh,Clip-clop
84,/m/07q5rw0,"Neigh, whinny"
85,/m/01xq0k1,"Cattle, bovinae"
86,/m/07rpkh9,Moo
87,/m/0239kh,Cowbell
88,/m/068zj,Pig
89,/t/dd00018,Oink
90,/m/03fwl,Goat
91,/m/07q0h5t,Bleat
92,/m/07bgp,Sheep
93,/m/025rv6n,Fowl
94,/m/09b5t,"Chicken, rooster"
95,/m/07st89h,Cluck
96,/m/07qn5dc,"Crowing, cock-a-doodle-doo"
97,/m/01rd7k,Turkey
98,/m/07svc2k,Gobble
99,/m/09ddx,Duck
100,/m/07qdb04,Quack
101,/m/0dbvp,Goose
102,/m/07qwf61,Honk
103,/m/01280g,Wild animals
104,/m/0cdnk,"Roaring cats (lions, tigers)"
105,/m/04cvmfc,Roar
106,/m/015p6,Bird
107,/m/020bb7,"Bird vocalization, bird call, bird song"
108,/m/07pggtn,"Chirp, tweet"
109,/m/07sx8x_,Squawk
110,/m/0h0rv,"Pigeon, dove"
111,/m/07r_25d,Coo
112,/m/04s8yn,Crow
113,/m/07r5c2p,Caw
114,/m/09d5_,Owl
115,/m/07r_80w,Hoot
116,/m/05_wcq,"Bird flight, flapping wings"
117,/m/01z5f,"Canidae, dogs, wolves"
118,/m/06hps,"Rodents, rats, mice"
119,/m/04rmv,Mouse
120,/m/07r4gkf,Patter
121,/m/03vt0,Insect
122,/m/09xqv,Cricket
123,/m/09f96,Mosquito
124,/m/0h2mp,"Fly, housefly"
125,/m/07pjwq1,Buzz
126,/m/01h3n,"Bee, wasp, etc."
127,/m/09ld4,Frog
128,/m/07st88b,Croak
129,/m/078jl,Snake
130,/m/07qn4z3,Rattle
131,/m/032n05,Whale vocalization
132,/m/04rlf,Music
133,/m/04szw,Musical instrument
134,/m/0fx80y,Plucked string instrument
135,/m/0342h,Guitar
136,/m/02sgy,Electric guitar
137,/m/018vs,Bass guitar
138,/m/042v_gx,Acoustic guitar
139,/m/06w87,"Steel guitar, slide guitar"
140,/m/01glhc,Tapping (guitar technique)
141,/m/07s0s5r,Strum
142,/m/018j2,Banjo
143,/m/0jtg0,Sitar
144,/m/04rzd,Mandolin
145,/m/01bns_,Zither
146,/m/07xzm,Ukulele
147,/m/05148p4,Keyboard (musical)
148,/m/05r5c,Piano
149,/m/01s0ps,Electric piano
150,/m/013y1f,Organ
151,/m/03xq_f,Electronic organ
152,/m/03gvt,Hammond organ
153,/m/0l14qv,Synthesizer
154,/m/01v1d8,Sampler
155,/m/03q5t,Harpsichord
156,/m/0l14md,Percussion
157,/m/02hnl,Drum kit
158,/m/0cfdd,Drum machine
159,/m/026t6,Drum
160,/m/06rvn,Snare drum
161,/m/03t3fj,Rimshot
162,/m/02k_mr,Drum roll
163,/m/0bm02,Bass drum
164,/m/011k_j,Timpani
165,/m/01p970,Tabla
166,/m/01qbl,Cymbal
167,/m/03qtq,Hi-hat
168,/m/01sm1g,Wood block
169,/m/07brj,Tambourine
170,/m/05r5wn,Rattle (instrument)
171,/m/0xzly,Maraca
172,/m/0mbct,Gong
173,/m/016622,Tubular bells
174,/m/0j45pbj,Mallet percussion
175,/m/0dwsp,"Marimba, xylophone"
176,/m/0dwtp,Glockenspiel
177,/m/0dwt5,Vibraphone
178,/m/0l156b,Steelpan
179,/m/05pd6,Orchestra
180,/m/01kcd,Brass instrument
181,/m/0319l,French horn
182,/m/07gql,Trumpet
183,/m/07c6l,Trombone
184,/m/0l14_3,Bowed string instrument
185,/m/02qmj0d,String section
186,/m/07y_7,"Violin, fiddle"
187,/m/0d8_n,Pizzicato
188,/m/01xqw,Cello
189,/m/02fsn,Double bass
190,/m/085jw,"Wind instrument, woodwind instrument"
191,/m/0l14j_,Flute
192,/m/06ncr,Saxophone
193,/m/01wy6,Clarinet
194,/m/03m5k,Harp
195,/m/0395lw,Bell
196,/m/03w41f,Church bell
197,/m/027m70_,Jingle bell
198,/m/0gy1t2s,Bicycle bell
199,/m/07n_g,Tuning fork
200,/m/0f8s22,Chime
201,/m/026fgl,Wind chime
202,/m/0150b9,Change ringing (campanology)
203,/m/03qjg,Harmonica
204,/m/0mkg,Accordion
205,/m/0192l,Bagpipes
206,/m/02bxd,Didgeridoo
207,/m/0l14l2,Shofar
208,/m/07kc_,Theremin
209,/m/0l14t7,Singing bowl
210,/m/01hgjl,Scratching (performance technique)
211,/m/064t9,Pop music
212,/m/0glt670,Hip hop music
213,/m/02cz_7,Beatboxing
214,/m/06by7,Rock music
215,/m/03lty,Heavy metal
216,/m/05r6t,Punk rock
217,/m/0dls3,Grunge
218,/m/0dl5d,Progressive rock
219,/m/07sbbz2,Rock and roll
220,/m/05w3f,Psychedelic rock
221,/m/06j6l,Rhythm and blues
222,/m/0gywn,Soul music
223,/m/06cqb,Reggae
224,/m/01lyv,Country
225,/m/015y_n,Swing music
226,/m/0gg8l,Bluegrass
227,/m/02x8m,Funk
228,/m/02w4v,Folk music
229,/m/06j64v,Middle Eastern music
230,/m/03_d0,Jazz
231,/m/026z9,Disco
232,/m/0ggq0m,Classical music
233,/m/05lls,Opera
234,/m/02lkt,Electronic music
235,/m/03mb9,House music
236,/m/07gxw,Techno
237,/m/07s72n,Dubstep
238,/m/0283d,Drum and bass
239,/m/0m0jc,Electronica
240,/m/08cyft,Electronic dance music
241,/m/0fd3y,Ambient music
242,/m/07lnk,Trance music
243,/m/0g293,Music of Latin America
244,/m/0ln16,Salsa music
245,/m/0326g,Flamenco
246,/m/0155w,Blues
247,/m/05fw6t,Music for children
248,/m/02v2lh,New-age music
249,/m/0y4f8,Vocal music
250,/m/0z9c,A capella
251,/m/0164x2,Music of Africa
252,/m/0145m,Afrobeat
253,/m/02mscn,Christian music
254,/m/016cjb,Gospel music
255,/m/028sqc,Music of Asia
256,/m/015vgc,Carnatic music
257,/m/0dq0md,Music of Bollywood
258,/m/06rqw,Ska
259,/m/02p0sh1,Traditional music
260,/m/05rwpb,Independent music
261,/m/074ft,Song
262,/m/025td0t,Background music
263,/m/02cjck,Theme music
264,/m/03r5q_,Jingle (music)
265,/m/0l14gg,Soundtrack music
266,/m/07pkxdp,Lullaby
267,/m/01z7dr,Video game music
268,/m/0140xf,Christmas music
269,/m/0ggx5q,Dance music
270,/m/04wptg,Wedding music
271,/t/dd00031,Happy music
272,/t/dd00033,Sad music
273,/t/dd00034,Tender music
274,/t/dd00035,Exciting music
275,/t/dd00036,Angry music
276,/t/dd00037,Scary music
277,/m/03m9d0z,Wind
278,/m/09t49,Rustling leaves
279,/t/dd00092,Wind noise (microphone)
280,/m/0jb2l,Thunderstorm
281,/m/0ngt1,Thunder
282,/m/0838f,Water
283,/m/06mb1,Rain
284,/m/07r10fb,Raindrop
285,/t/dd00038,Rain on surface
286,/m/0j6m2,Stream
287,/m/0j2kx,Waterfall
288,/m/05kq4,Ocean
289,/m/034srq,"Waves, surf"
290,/m/06wzb,Steam
291,/m/07swgks,Gurgling
292,/m/02_41,Fire
293,/m/07pzfmf,Crackle
294,/m/07yv9,Vehicle
295,/m/019jd,"Boat, Water vehicle"
296,/m/0hsrw,"Sailboat, sailing ship"
297,/m/056ks2,"Rowboat, canoe, kayak"
298,/m/02rlv9,"Motorboat, speedboat"
299,/m/06q74,Ship
300,/m/012f08,Motor vehicle (road)
301,/m/0k4j,Car
302,/m/0912c9,"Vehicle horn, car horn, honking"
303,/m/07qv_d5,Toot
304,/m/02mfyn,Car alarm
305,/m/04gxbd,"Power windows, electric windows"
306,/m/07rknqz,Skidding
307,/m/0h9mv,Tire squeal
308,/t/dd00134,Car passing by
309,/m/0ltv,"Race car, auto racing"
310,/m/07r04,Truck
311,/m/0gvgw0,Air brake
312,/m/05x_td,"Air horn, truck horn"
313,/m/02rhddq,Reversing beeps
314,/m/03cl9h,"Ice cream truck, ice cream van"
315,/m/01bjv,Bus
316,/m/03j1ly,Emergency vehicle
317,/m/04qvtq,Police car (siren)
318,/m/012n7d,Ambulance (siren)
319,/m/012ndj,"Fire engine, fire truck (siren)"
320,/m/04_sv,Motorcycle
321,/m/0btp2,"Traffic noise, roadway noise"
322,/m/06d_3,Rail transport
323,/m/07jdr,Train
324,/m/04zmvq,Train whistle
325,/m/0284vy3,Train horn
326,/m/01g50p,"Railroad car, train wagon"
327,/t/dd00048,Train wheels squealing
328,/m/0195fx,"Subway, metro, underground"
329,/m/0k5j,Aircraft
330,/m/014yck,Aircraft engine
331,/m/04229,Jet engine
332,/m/02l6bg,"Propeller, airscrew"
333,/m/09ct_,Helicopter
334,/m/0cmf2,"Fixed-wing aircraft, airplane"
335,/m/0199g,Bicycle
336,/m/06_fw,Skateboard
337,/m/02mk9,Engine
338,/t/dd00065,Light engine (high frequency)
339,/m/08j51y,"Dental drill, dentist's drill"
340,/m/01yg9g,Lawn mower
341,/m/01j4z9,Chainsaw
342,/t/dd00066,Medium engine (mid frequency)
343,/t/dd00067,Heavy engine (low frequency)
344,/m/01h82_,Engine knocking
345,/t/dd00130,Engine starting
346,/m/07pb8fc,Idling
347,/m/07q2z82,"Accelerating, revving, vroom"
348,/m/02dgv,Door
349,/m/03wwcy,Doorbell
350,/m/07r67yg,Ding-dong
351,/m/02y_763,Sliding door
352,/m/07rjzl8,Slam
353,/m/07r4wb8,Knock
354,/m/07qcpgn,Tap
355,/m/07q6cd_,Squeak
356,/m/0642b4,Cupboard open or close
357,/m/0fqfqc,Drawer open or close
358,/m/04brg2,"Dishes, pots, and pans"
359,/m/023pjk,"Cutlery, silverware"
360,/m/07pn_8q,Chopping (food)
361,/m/0dxrf,Frying (food)
362,/m/0fx9l,Microwave oven
363,/m/02pjr4,Blender
364,/m/02jz0l,"Water tap, faucet"
365,/m/0130jx,Sink (filling or washing)
366,/m/03dnzn,Bathtub (filling or washing)
367,/m/03wvsk,Hair dryer
368,/m/01jt3m,Toilet flush
369,/m/012xff,Toothbrush
370,/m/04fgwm,Electric toothbrush
371,/m/0d31p,Vacuum cleaner
372,/m/01s0vc,Zipper (clothing)
373,/m/03v3yw,Keys jangling
374,/m/0242l,Coin (dropping)
375,/m/01lsmm,Scissors
376,/m/02g901,"Electric shaver, electric razor"
377,/m/05rj2,Shuffling cards
378,/m/0316dw,Typing
379,/m/0c2wf,Typewriter
380,/m/01m2v,Computer keyboard
381,/m/081rb,Writing
382,/m/07pp_mv,Alarm
383,/m/07cx4,Telephone
384,/m/07pp8cl,Telephone bell ringing
385,/m/01hnzm,Ringtone
386,/m/02c8p,"Telephone dialing, DTMF"
387,/m/015jpf,Dial tone
388,/m/01z47d,Busy signal
389,/m/046dlr,Alarm clock
390,/m/03kmc9,Siren
391,/m/0dgbq,Civil defense siren
392,/m/030rvx,Buzzer
393,/m/01y3hg,"Smoke detector, smoke alarm"
394,/m/0c3f7m,Fire alarm
395,/m/04fq5q,Foghorn
396,/m/0l156k,Whistle
397,/m/06hck5,Steam whistle
398,/t/dd00077,Mechanisms
399,/m/02bm9n,"Ratchet, pawl"
400,/m/01x3z,Clock
401,/m/07qjznt,Tick
402,/m/07qjznl,Tick-tock
403,/m/0l7xg,Gears
404,/m/05zc1,Pulleys
405,/m/0llzx,Sewing machine
406,/m/02x984l,Mechanical fan
407,/m/025wky1,Air conditioning
408,/m/024dl,Cash register
409,/m/01m4t,Printer
410,/m/0dv5r,Camera
411,/m/07bjf,Single-lens reflex camera
412,/m/07k1x,Tools
413,/m/03l9g,Hammer
414,/m/03p19w,Jackhammer
415,/m/01b82r,Sawing
416,/m/02p01q,Filing (rasp)
417,/m/023vsd,Sanding
418,/m/0_ksk,Power tool
419,/m/01d380,Drill
420,/m/014zdl,Explosion
421,/m/032s66,"Gunshot, gunfire"
422,/m/04zjc,Machine gun
423,/m/02z32qm,Fusillade
424,/m/0_1c,Artillery fire
425,/m/073cg4,Cap gun
426,/m/0g6b5,Fireworks
427,/g/122z_qxw,Firecracker
428,/m/07qsvvw,"Burst, pop"
429,/m/07pxg6y,Eruption
430,/m/07qqyl4,Boom
431,/m/083vt,Wood
432,/m/07pczhz,Chop
433,/m/07pl1bw,Splinter
434,/m/07qs1cx,Crack
435,/m/039jq,Glass
436,/m/07q7njn,"Chink, clink"
437,/m/07rn7sz,Shatter
438,/m/04k94,Liquid
439,/m/07rrlb6,"Splash, splatter"
440,/m/07p6mqd,Slosh
441,/m/07qlwh6,Squish
442,/m/07r5v4s,Drip
443,/m/07prgkl,Pour
444,/m/07pqc89,"Trickle, dribble"
445,/t/dd00088,Gush
446,/m/07p7b8y,Fill (with liquid)
447,/m/07qlf79,Spray
448,/m/07ptzwd,Pump (liquid)
449,/m/07ptfmf,Stir
450,/m/0dv3j,Boiling
451,/m/0790c,Sonar
452,/m/0dl83,Arrow
453,/m/07rqsjt,"Whoosh, swoosh, swish"
454,/m/07qnq_y,"Thump, thud"
455,/m/07rrh0c,Thunk
456,/m/0b_fwt,Electronic tuner
457,/m/02rr_,Effects unit
458,/m/07m2kt,Chorus effect
459,/m/018w8,Basketball bounce
460,/m/07pws3f,Bang
461,/m/07ryjzk,"Slap, smack"
462,/m/07rdhzs,"Whack, thwack"
463,/m/07pjjrj,"Smash, crash"
464,/m/07pc8lb,Breaking
465,/m/07pqn27,Bouncing
466,/m/07rbp7_,Whip
467,/m/07pyf11,Flap
468,/m/07qb_dv,Scratch
469,/m/07qv4k0,Scrape
470,/m/07pdjhy,Rub
471,/m/07s8j8t,Roll
472,/m/07plct2,Crushing
473,/t/dd00112,"Crumpling, crinkling"
474,/m/07qcx4z,Tearing
475,/m/02fs_r,"Beep, bleep"
476,/m/07qwdck,Ping
477,/m/07phxs1,Ding
478,/m/07rv4dm,Clang
479,/m/07s02z0,Squeal
480,/m/07qh7jl,Creak
481,/m/07qwyj0,Rustle
482,/m/07s34ls,Whir
483,/m/07qmpdm,Clatter
484,/m/07p9k1k,Sizzle
485,/m/07qc9xj,Clicking
486,/m/07rwm0c,Clickety-clack
487,/m/07phhsh,Rumble
488,/m/07qyrcz,Plop
489,/m/07qfgpx,"Jingle, tinkle"
490,/m/07rcgpl,Hum
491,/m/07p78v5,Zing
492,/t/dd00121,Boing
493,/m/07s12q4,Crunch
494,/m/028v0c,Silence
495,/m/01v_m0,Sine wave
496,/m/0b9m1,Harmonic
497,/m/0hdsk,Chirp tone
498,/m/0c1dj,Sound effect
499,/m/07pt_g0,Pulse
500,/t/dd00125,"Inside, small room"
501,/t/dd00126,"Inside, large room or hall"
502,/t/dd00127,"Inside, public space"
503,/t/dd00128,"Outside, urban or manmade"
504,/t/dd00129,"Outside, rural or natural"
505,/m/01b9nn,Reverberation
506,/m/01jnbd,Echo
507,/m/096m7z,Noise
508,/m/06_y0by,Environmental noise
509,/m/07rgkc5,Static
510,/m/06xkwv,Mains hum
511,/m/0g12c5,Distortion
512,/m/08p9q4,Sidetone
513,/m/07szfh9,Cacophony
514,/m/0chx_,White noise
515,/m/0cj0r,Pink noise
516,/m/07p_0gm,Throbbing
517,/m/01jwx6,Vibration
518,/m/07c52,Television
519,/m/06bz3,Radio
520,/m/07hvw1,Field recording
# Copyright 2019 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Installation test for YAMNet."""
import numpy as np
import tensorflow as tf
import params
import yamnet
class YAMNetTest(tf.test.TestCase):
_yamnet_graph = None
_yamnet = None
_yamnet_classes = None
@classmethod
def setUpClass(cls):
super(YAMNetTest, cls).setUpClass()
cls._yamnet_graph = tf.Graph()
with cls._yamnet_graph.as_default():
cls._yamnet = yamnet.yamnet_frames_model(params)
cls._yamnet.load_weights('yamnet.h5')
cls._yamnet_classes = yamnet.class_names('yamnet_class_map.csv')
def clip_test(self, waveform, expected_class_name, top_n=10):
"""Run the model on the waveform, check that expected class is in top-n."""
with YAMNetTest._yamnet_graph.as_default():
prediction = np.mean(YAMNetTest._yamnet.predict(
np.reshape(waveform, [1, -1]), steps=1)[0], axis=0)
top_n_class_names = YAMNetTest._yamnet_classes[
np.argsort(prediction)[-top_n:]]
self.assertIn(expected_class_name, top_n_class_names)
def testZeros(self):
self.clip_test(
waveform=np.zeros((1, int(3 * params.SAMPLE_RATE))),
expected_class_name='Silence')
def testRandom(self):
np.random.seed(51773) # Ensure repeatability.
self.clip_test(
waveform=np.random.uniform(-1.0, +1.0,
(1, int(3 * params.SAMPLE_RATE))),
expected_class_name='White noise')
def testSine(self):
self.clip_test(
waveform=np.reshape(
np.sin(2 * np.pi * 440 * np.linspace(
0, 3, int(3 *params.SAMPLE_RATE))),
[1, -1]),
expected_class_name='Sine wave')
if __name__ == '__main__':
tf.test.main()
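The `clip_test` helper above averages the per-frame scores over time and then checks that the expected class is among the `top_n` highest-scoring classes. The same logic can be sketched in plain Python, without the model or its weights. The scores and the short class list below are made-up illustrative values, not YAMNet outputs, and `top_n_names` is a hypothetical helper.

```python
def top_n_names(frame_scores, class_names, top_n):
  """Average scores over frames and return the top_n class names, best first."""
  num_frames = len(frame_scores)
  # zip(*frame_scores) iterates over classes, giving each class's scores.
  mean_scores = [sum(col) / num_frames for col in zip(*frame_scores)]
  ranked = sorted(range(len(mean_scores)),
                  key=lambda i: mean_scores[i], reverse=True)
  return [class_names[i] for i in ranked[:top_n]]


# Illustrative values only: two frames of scores for four classes.
names = ['Silence', 'White noise', 'Sine wave', 'Speech']
frame_scores = [
    [0.10, 0.70, 0.05, 0.02],
    [0.20, 0.60, 0.05, 0.04],
]
top = top_n_names(frame_scores, names, top_n=2)
print(top)
```

Unlike `np.argsort(prediction)[-top_n:]`, which lists the selected classes in ascending score order, this sketch returns them best first; membership in the top-n set is the same either way.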
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Copyright 2019 The TensorFlow Authors All Rights Reserved.\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# http://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License.\n",
"# =============================================================================="
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Visualization of the YAMNet audio event classification model.\n",
"# See https://github.com/tensorflow/models/tree/master/research/audioset/yamnet/"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Imports.\n",
"import numpy as np\n",
"import soundfile as sf\n",
"\n",
"import matplotlib.pyplot as plt\n",
"\n",
"import params\n",
"import yamnet as yamnet_model"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"WARNING: Logging before flag parsing goes to stderr.\n",
"W1121 08:50:31.581582 4453795264 deprecation.py:506] From /Applications/anaconda/envs/py3/lib/python3.7/site-packages/tensorflow/python/ops/init_ops.py:1251: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.\n",
"Instructions for updating:\n",
"Call initializer instance with the dtype argument instead of passing it to the constructor\n"
]
}
],
"source": [
"# Set up the YAMNet model.\n",
"params.PATCH_HOP_SECONDS = 0.1 # 10 Hz scores frame rate.\n",
"yamnet = yamnet_model.yamnet_frames_model(params)\n",
"yamnet.load_weights('yamnet.h5')\n",
"class_names = yamnet_model.class_names('yamnet_class_map.csv')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Read in the audio.\n",
"# You can get this example waveform via:\n",
"# curl -O https://storage.googleapis.com/audioset/speech_whistling2.wav\n",
"wav_data, sr = sf.read('speech_whistling2.wav', dtype=np.int16)\n",
"waveform = wav_data / 32768.0\n",
"# Sampling rate should be 16000 Hz.\n",
"assert sr == 16000"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Run the model.\n",
"scores, spectrogram = yamnet.predict(np.reshape(waveform, [1, -1]), steps=1)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 720x576 with 3 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# Visualize the results.\n",
"plt.figure(figsize=(10, 8))\n",
"\n",
"# Plot the waveform.\n",
"plt.subplot(3, 1, 1)\n",
"plt.plot(waveform)\n",
"plt.xlim([0, len(waveform)])\n",
"# Plot the log-mel spectrogram (returned by the model).\n",
"plt.subplot(3, 1, 2)\n",
"plt.imshow(spectrogram.T, aspect='auto', interpolation='nearest', origin='bottom')\n",
"\n",
"# Plot and label the model output scores for the top-scoring classes.\n",
"mean_scores = np.mean(scores, axis=0)\n",
"top_N = 10\n",
"top_class_indices = np.argsort(mean_scores)[::-1][:top_N]\n",
"plt.subplot(3, 1, 3)\n",
"plt.imshow(scores[:, top_class_indices].T, aspect='auto', interpolation='nearest', cmap='gray_r')\n",
"# Compensate for the PATCH_WINDOW_SECONDS (0.96 s) context window to align with spectrogram.\n",
"patch_padding = (params.PATCH_WINDOW_SECONDS / 2) / params.PATCH_HOP_SECONDS\n",
"plt.xlim([-patch_padding, scores.shape[0] + patch_padding])\n",
"# Label the top_N classes.\n",
"yticks = range(0, top_N, 1)\n",
"plt.yticks(yticks, [class_names[top_class_indices[x]] for x in yticks])\n",
"_ = plt.ylim(-0.5 + np.array([top_N, 0]))\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
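The notebook's final cell pads the x-axis of the score plot by half a patch window, expressed in score frames, so that the scores line up with the spectrogram. With a 0.96 s window (the value named in the notebook comment) and the 0.1 s hop set in the notebook, that is about 4.8 frames on each side. Below is a sketch of the arithmetic, with the constants hard-coded here rather than read from `params`; the function name `score_axis_limits` and the 30-frame count are hypothetical example choices.

```python
PATCH_WINDOW_SECONDS = 0.96  # Context window, as stated in the notebook comment.
PATCH_HOP_SECONDS = 0.1      # Hop set in the notebook (10 Hz score rate).


def score_axis_limits(num_score_frames):
  """Return (xmin, xmax) for the score plot, in units of score frames."""
  patch_padding = (PATCH_WINDOW_SECONDS / 2) / PATCH_HOP_SECONDS
  return -patch_padding, num_score_frames + patch_padding


xmin, xmax = score_axis_limits(30)
print(xmin, xmax)
```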