Acoustic Scene Recognition with Deep Learning

Wei Dai

Machine Learning Department, Carnegie Mellon University

Abstract

Background. Sound complements visual input and is an important modality for perceiving the environment. Machines deployed in varied environments, such as smartphones, autonomous robots, and security systems, increasingly have the ability to hear. This work applies state-of-the-art Deep Learning models, which have revolutionized speech recognition, to the understanding of general environmental sounds.

Aim. This work aims to classify 15 common indoor and outdoor locations using environmental sounds. We compare both conventional and Deep Learning models for this task.

Data. We use a dataset from the ongoing IEEE challenge on Detection and Classification of Acoustic Scenes and Events (DCASE). The dataset contains 15 diverse indoor and outdoor locations, such as buses, cafes, cars, city centers, forest paths, libraries, and trains, totaling 9.75 hours of audio recording.

Methods. We extract features using signal processing techniques, such as mel-frequency cepstral coefficients (MFCCs), various statistical functionals, and spectrograms. We extract 4 feature sets: MFCCs (60-dimensional), Smile983 (983-dimensional), Smile6k (6573-dimensional), and spectrograms (only for CNN-based models). On these features we apply 7 models: Gaussian Mixture Models (GMMs), Support Vector Machines (SVMs), Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs), Recurrent Deep Neural Networks (RDNNs), Convolutional Neural Networks (CNNs), and Recurrent Convolutional Neural Networks (RCNNs). Among them, GMMs and SVMs are popular conventional models for this task, while our use of RDNNs, CNNs, and RCNNs is, to our knowledge, the first application of these models to environmental sound.
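
To make the feature extraction step concrete, the following is a minimal sketch of an MFCC pipeline (framing, windowed power spectrum, mel filterbank, log, DCT) followed by two simple statistical functionals (mean and standard deviation over time). It is an illustration under stated assumptions, not the configuration used in this work: the frame length, hop size, filter count, coefficient count, and the function names mfcc and functionals are all hypothetical choices, and the output does not reproduce the 60-dimensional MFCC set or the Smile feature sets. It uses only the Python standard library so that it runs anywhere; a practical pipeline would use an FFT and a dedicated audio library.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Power spectrum (bins 0..N/2) of a Hamming-windowed frame via a naive DFT."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spec = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed))
        spec.append(abs(s) ** 2 / n)
    return spec


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    lo, hi = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mels = [lo + (hi - lo) * i / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate)) for m in mels]
    bank = []
    for j in range(1, n_filters + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):      # rising edge
            filt[k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge, peak of 1.0 at center
            filt[k] = (right - k) / (right - center)
        bank.append(filt)
    return bank


def mfcc(signal, sample_rate, frame_len=256, hop=128, n_filters=20, n_coeffs=13):
    """Return one list of n_coeffs cepstral coefficients per frame."""
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        spec = power_spectrum(signal[start:start + frame_len])
        # Log mel-band energies; the small constant avoids log(0).
        log_mel = [math.log(sum(w * p for w, p in zip(filt, spec)) + 1e-10)
                   for filt in bank]
        # Unnormalized DCT-II decorrelates the log energies.
        coeffs = [sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
                      for j, e in enumerate(log_mel))
                  for c in range(n_coeffs)]
        features.append(coeffs)
    return features


def functionals(frames):
    """Summarize a frame sequence as per-coefficient means followed by standard deviations."""
    n = len(frames)
    columns = list(zip(*frames))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((x - m) ** 2 for x in col) / n)
            for col, m in zip(columns, means)]
    return means + stds


# Synthetic stand-in for a recording: 0.25 s of a 440 Hz tone at 8 kHz.
sr = 8000
tone = [math.sin(2 * math.pi * 440 * t / sr) for t in range(sr // 4)]
frames = mfcc(tone, sr)             # frame-level sequence, e.g. for a temporal model
clip_vector = functionals(frames)   # fixed-length vector, e.g. for a non-temporal model
```

The two outputs mirror the distinction drawn in this abstract: the frame sequence keeps temporal order and is the kind of input a recurrent model consumes, while the functionals collapse time into a single fixed-length vector of the kind a GMM, SVM, or DNN can use directly.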

Results. Our experiments show that model performance varies with features. With the smaller feature sets (MFCCs and Smile983), temporal models (RNNs, RDNNs) outperform non-temporal models (GMMs, SVMs, DNNs). However, with the large feature set (Smile6k), DNNs outperform temporal models (RNNs and RDNNs) and achieve the best performance among all studied methods. The GMM with MFCC features, the baseline model provided by the DCASE contest, achieves 67.6% test accuracy, while the best performing model (a DNN with the Smile6k features) reaches 80% test accuracy. RNNs and RDNNs generally perform in the range of 68% to 77%, while SVMs vary between 56% and 73%. CNNs and RCNNs with spectrogram features lag behind the other Deep Learning models, reaching 63% to 64% accuracy.

Conclusions. We find that Deep Learning models compare favorably to conventional models (GMMs and SVMs). No single model outperforms all other models across all feature sets, showing that model performance varies significantly with the feature representation. The fact that the best performing model is a non-temporal DNN is evidence that environmental sounds do not exhibit strong temporal dynamics. This is consistent with our day-to-day experience that environmental sounds tend to be random and unpredictable.

2.2 Gaussian Mixture Models (GMMs)

2.3 Support Vector Machines (SVMs)

2.4 Deep Neural Networks (DNNs)

2.6 Recurrent Deep Neural Networks (RDNNs)

2.7 Convolutional Neural Networks (CNNs)

2.8 Recurrent Convolutional Neural Networks (RCNNs)

3.1 Dataset

3.4 Results

3.5 Discussion