Building an audio recognition model using siamese networks
In the last tutorial, we saw how to use siamese networks to recognize a face. Now we will see how to use siamese networks to recognize audio. We will train our network to differentiate between the sound of a dog and the sound of a cat. The dataset of cat and dog audio can be downloaded from here: https://www.kaggle.com/mmoreaux/audio-cats-and-dogs#cats_dogs.zip.
Once we have downloaded the data, we organize it into three folders: Dogs, Sub_dogs, and Cats. In Dogs and Sub_dogs, we place the dog barking audio clips, and in the Cats folder, we place the cat audio clips. The objective of our network is to recognize whether an audio clip is a dog barking or some other sound. As we know, a siamese network needs its input as a pair: we select one audio clip from the Dogs folder and one from the Sub_dogs folder and mark them as a genuine pair, and we select one audio clip from the Dogs folder and one from the Cats folder and mark them as an imposter pair. That is, (Dogs, Sub_dogs) is a genuine pair and (Dogs, Cats) is an imposter pair.
Now, we will show, step by step, how to train our siamese network to recognize whether an audio clip is a dog barking or a different sound.
For better understanding, you can check the complete code, which is available as a Jupyter Notebook with an explanation here: https://github.com/sudharsan13296/Hands-On-Meta-Learning-With-Python/blob/master/02.%20Face%20and%20Audio%20Recognition%20using%20Siamese%20Networks/2.5%20Audio%20Recognition%20using%20Siamese%20Network.ipynb.
First, we will load all of the necessary libraries:
#basic imports
import glob
import IPython
from random import randint
#data processing
import librosa
import numpy as np
#modelling
from sklearn.model_selection import train_test_split
from keras import backend as K
from keras.layers import Activation
from keras.layers import Input, Lambda, Dense, Dropout, Flatten
from keras.models import Model
from keras.optimizers import RMSprop
Before going ahead, we load and listen to the audio clips:
IPython.display.Audio("data/audio/Dogs/dog_barking_0.wav")
IPython.display.Audio("data/audio/Cats/cat_13.wav")
So, how can we feed this raw audio to our network? How can we extract meaningful features from the raw audio? As we know, neural networks accept only vectorized input, so we need to convert our audio into a feature vector. How can we do that? Well, there are several mechanisms through which we can generate embeddings for audio. One such popular mechanism is Mel-Frequency Cepstral Coefficients (MFCC). MFCCs represent the short-term power spectrum of an audio signal, derived by applying a linear cosine transform to the log power spectrum on a nonlinear mel scale of frequency. To learn more about MFCC, check out this nice tutorial: http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/.
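To get a feel for what MFCC produces before we wrap it in a helper, here is a quick standalone check on the dog clip we listened to earlier; the exact number of frames depends on the clip length, and librosa returns 20 coefficients per frame by default:
#a quick look at the raw MFCC output for one clip
audio, sr = librosa.load('data/audio/Dogs/dog_barking_0.wav', mono=True)
mfcc = librosa.feature.mfcc(y=audio, sr=sr)
print(mfcc.shape)   #(n_mfcc, n_frames); the number of frames varies with the clip length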
We will use the MFCC function from the librosa library for generating the audio embeddings. So, we define a function called audio2vector, which returns the audio embeddings given an audio file:
def audio2vector(file_path, max_pad_len=400):
    #read the audio file
    audio, sr = librosa.load(file_path, mono=True)
    #reduce the sample count by keeping every third sample
    audio = audio[::3]
    #extract the audio embeddings using MFCC
    mfcc = librosa.feature.mfcc(y=audio, sr=sr)
    #as the embedding length varies between clips, we fix the maximum length at 400:
    #truncate longer clips and pad shorter ones with zeros
    mfcc = mfcc[:, :max_pad_len]
    pad_width = max_pad_len - mfcc.shape[1]
    mfcc = np.pad(mfcc, pad_width=((0, 0), (0, pad_width)), mode='constant')
    return mfcc
We will load one audio file and see the embeddings:
audio_file = 'data/audio/Dogs/dog_barking_0.wav'
audio2vector(audio_file)
array([[-297.54905127, -288.37618855, -314.92037769, ...,   0.        ,   0.        ,   0.        ],
       [  23.05969394,    9.55913148,   37.2173831 , ...,   0.        ,   0.        ,   0.        ],
       [-122.06299523, -115.02627567, -108.18703056, ...,   0.        ,   0.        ,   0.        ],
       ...,
       [  -6.40930836,   -2.8602708 ,   -2.12551478, ...,   0.        ,   0.        ,   0.        ],
       [   0.70572914,    4.21777791,    4.62429301, ...,   0.        ,   0.        ,   0.        ],
       [  -6.08997702,  -11.40687886,  -18.2415214 , ...,   0.        ,   0.        ,   0.        ]])
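Each embedding is a 2D array with librosa's default of 20 MFCC coefficients per frame, padded to our maximum length of 400 frames, which we can confirm as follows:
print(audio2vector(audio_file).shape)   #(20, 400)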
Now that we have understood how to generate audio embeddings, we need to create the data for our siamese network. As we know, a siamese network accepts data as a pair, so we define a function for getting our data. We will create the genuine pair as (Dogs, Sub_dogs) and assign it the label 1, and the imposter pair as (Dogs, Cats) and assign it the label 0:
def get_data():
    pairs = []
    labels = []
    Dogs = glob.glob('data/audio/Dogs/*.wav')
    Sub_dogs = glob.glob('data/audio/Sub_dogs/*.wav')
    Cats = glob.glob('data/audio/Cats/*.wav')
    np.random.shuffle(Sub_dogs)
    np.random.shuffle(Cats)
    for i in range(min(len(Cats), len(Sub_dogs))):
        #imposter pair: a dog clip paired with a cat clip, labeled 0
        #randint(0, 3) picks one of the first four dog clips at random
        if (i % 2) == 0:
            pairs.append([audio2vector(Dogs[randint(0, 3)]), audio2vector(Cats[i])])
            labels.append(0)
        #genuine pair: a dog clip paired with a Sub_dogs clip, labeled 1
        else:
            pairs.append([audio2vector(Dogs[randint(0, 3)]), audio2vector(Sub_dogs[i])])
            labels.append(1)
    return np.array(pairs), np.array(labels)
X, Y = get_data()
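As a quick sanity check, we can print the shapes of the pairs and labels; the exact number of pairs depends on how many clips you have in the Sub_dogs and Cats folders:
print(X.shape)   #(number of pairs, 2, 20, 400)
print(Y.shape)   #(number of pairs,)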
Next, we split our data for training and testing with 75% training and 25% testing proportions:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25)
Now that we have successfully generated our data, we build our siamese network. We define our base network, which is used for feature extraction, and we use three dense layers with a dropout layer in between:
def build_base_network(input_shape):
    input = Input(shape=input_shape)
    x = Flatten()(input)
    x = Dense(128, activation='relu')(x)
    x = Dropout(0.1)(x)
    x = Dense(128, activation='relu')(x)
    x = Dropout(0.1)(x)
    x = Dense(128, activation='relu')(x)
    return Model(input, x)
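If you want to inspect the base network before wiring it up, you can build a throwaway instance with our embedding shape and print its summary; the (20, 400) shape comes from the MFCC settings above:
build_base_network((20, 400)).summary()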
Next, we feed the audio pair to the base network, which will return the features:
input_dim = X_train.shape[2:]
audio_a = Input(shape=input_dim)
audio_b = Input(shape=input_dim)
base_network = build_base_network(input_dim)
feat_vecs_a = base_network(audio_a)
feat_vecs_b = base_network(audio_b)
feat_vecs_a and feat_vecs_b are the feature vectors of our audio pair. Next, we feed these feature vectors to the energy function to compute a distance between them, and we use Euclidean distance as our energy function:
def euclidean_distance(vects):
    x, y = vects
    #K.epsilon() keeps the value under the square root positive, which avoids NaN gradients
    return K.sqrt(K.maximum(K.sum(K.square(x - y), axis=1, keepdims=True), K.epsilon()))

def eucl_dist_output_shape(shapes):
    shape1, shape2 = shapes
    return (shape1[0], 1)

distance = Lambda(euclidean_distance, output_shape=eucl_dist_output_shape)([feat_vecs_a, feat_vecs_b])
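To verify that the energy function behaves as expected, here is a small standalone check on toy vectors; it builds a throwaway two-input model around the same Lambda layer purely for illustration:
#the distance between [1, 2, 2] and [0, 0, 0] should be about 3
toy_a = Input(shape=(3,))
toy_b = Input(shape=(3,))
toy_distance = Lambda(euclidean_distance, output_shape=eucl_dist_output_shape)([toy_a, toy_b])
toy_model = Model([toy_a, toy_b], toy_distance)
print(toy_model.predict([np.array([[1., 2., 2.]]), np.array([[0., 0., 0.]])]))   #roughly [[3.]]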
Next, we set the number of epochs to 13 and we use RMSprop for optimization:
epochs = 13
rms = RMSprop()
model = Model(inputs=[audio_a, audio_b], outputs=distance)
Lastly, we define our loss function as contrastive_loss and compile the model:
def contrastive_loss(y_true, y_pred):
margin = 1
return K.mean(y_true * K.square(y_pred) + (1 - y_true) * K.square(K.maximum(margin - y_pred, 0)))
model.compile(loss=contrastive_loss, optimizer=rms)
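To get a feel for what this loss encourages: for a genuine pair (y_true = 1), it reduces to the squared distance, pulling matching pairs together, while for an imposter pair (y_true = 0), it becomes the squared hinge max(margin - distance, 0), pushing non-matching pairs at least a margin apart. The quick check below evaluates the loss for a single imposter pair at distance 0.2; K.variable and K.eval are used here purely for a numerical illustration:
#an imposter pair (label 0) at distance 0.2 contributes (1 - 0.2)^2 = 0.64
print(K.eval(contrastive_loss(K.variable([[0.]]), K.variable([[0.2]]))))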
Now, we train our model:
audio_1 = X_train[:, 0]
audio_2 = X_train[:, 1]
model.fit([audio_1, audio_2], y_train, validation_split=.25,
          batch_size=128, verbose=2, epochs=epochs)
You can see how the loss decreases over the epochs:
Train on 8 samples, validate on 3 samples
Epoch 1/13 - 0s - loss: 23594.8965 - val_loss: 1598.8439
Epoch 2/13 - 0s - loss: 62360.9570 - val_loss: 816.7302
Epoch 3/13 - 0s - loss: 17967.6230 - val_loss: 970.0378
Epoch 4/13 - 0s - loss: 20030.3711 - val_loss: 358.9078
Epoch 5/13 - 0s - loss: 11196.0547 - val_loss: 339.9991
Epoch 6/13 - 0s - loss: 3837.2898 - val_loss: 381.9774
Epoch 7/13 - 0s - loss: 2037.2965 - val_loss: 303.6652
Epoch 8/13 - 0s - loss: 1434.4321 - val_loss: 229.1388
Epoch 9/13 - 0s - loss: 2553.0562 - val_loss: 215.1207
Epoch 10/13 - 0s - loss: 1046.6870 - val_loss: 197.1127
Epoch 11/13 - 0s - loss: 569.4632 - val_loss: 183.8586
Epoch 12/13 - 0s - loss: 759.0131 - val_loss: 162.3362
Epoch 13/13 - 0s - loss: 819.8594 - val_loss: 120.3017
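Once training finishes, a simple way to evaluate the model is to predict the distances for the test pairs and threshold them: a small distance should indicate a genuine pair (label 1) and a large distance an imposter pair (label 0). The 0.5 threshold below is an arbitrary choice for illustration; in practice, you would tune it on a validation set:
#predict the distance for each test pair and threshold it
pred = model.predict([X_test[:, 0], X_test[:, 1]])
pred_labels = (pred.ravel() < 0.5).astype(int)   #small distance -> genuine pair (label 1)
print(np.mean(pred_labels == y_test))   #fraction of test pairs classified correctly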