Deep Learning using Linear Support Vector Machines

Yichuan Tang

Department of Computer Science, University of Toronto. Toronto, Ontario, Canada.

Abstract

Recently, fully-connected and convolutional neural networks have been trained to achieve state-of-the-art performance on a wide variety of tasks such as speech recognition, image classification, natural language processing, and bioinformatics. For classification tasks, most of these “deep learning” models employ the softmax activation function for prediction and minimize cross-entropy loss. In this paper, we demonstrate a small but consistent advantage of replacing the softmax layer with a linear support vector machine. Learning minimizes a margin-based loss instead of the cross-entropy loss. While there have been various combinations of neural nets and SVMs in prior art, our results using L2-SVMs show that simply replacing softmax with linear SVMs gives significant gains on popular deep learning datasets: MNIST, CIFAR-10, and the ICML 2013 Representation Learning Workshop’s face expression recognition challenge.

1. Introduction

Deep learning using neural networks has claimed state-of-the-art performance in a wide range of tasks. These include (but are not limited to) speech (Mohamed et al., 2009; Dahl et al., 2010) and vision (Jarrett et al., 2009; Ciresan et al., 2011; Rifai et al., 2011a; Krizhevsky et al., 2012). All of the above-mentioned papers use the softmax activation function (also known as multinomial logistic regression) for classification.
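
For reference, the softmax prediction and cross-entropy loss mentioned above can be sketched in a few lines. This is a minimal illustration only: the function names and the example scores are hypothetical and are not taken from any of the cited papers.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, target):
    """Negative log-probability assigned to the correct class index."""
    return -math.log(softmax(scores)[target])

# Hypothetical top-layer activations for a 3-class problem.
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
loss = cross_entropy(scores, target=0)
```

The loss shrinks toward zero as the score of the correct class grows relative to the others, but it never reaches zero exactly.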

International Conference on Machine Learning 2013: Challenges in Representation Learning Workshop. Atlanta, Georgia, USA.

The support vector machine is a widely used alternative to softmax for classification (Boser et al., 1992). Using SVMs (especially linear ones) in combination with convolutional nets has been proposed in the past as part of a multistage process. In particular, a deep convolutional net is first trained using supervised/unsupervised objectives to learn good invariant hidden latent representations. The corresponding hidden variables of data samples are then treated as input and fed into linear (or kernel) SVMs (Huang & LeCun, 2006; Lee et al., 2009; Quoc et al., 2010; Coates et al., 2011). This technique usually improves performance, but the drawback is that lower level features are not fine-tuned w.r.t. the SVM’s objective.
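
The multistage process can be sketched on a toy problem as follows. Everything here is an assumption made for illustration: `frozen_features` is a hypothetical stand-in for the hidden layer of an already trained network, the data are made up, and the SVM is a plain hinge-loss linear SVM fitted by full-batch subgradient descent rather than any solver used in the cited work.

```python
def frozen_features(x):
    """Hypothetical stand-in for a pretrained net's hidden layer.
    It is fixed: nothing below ever updates it."""
    return [x, x * x]

def train_linear_svm(feats, labels, lam=0.01, lr=0.1, epochs=2000):
    """Full-batch subgradient descent on 0.5*lam*||w||^2 + mean hinge loss."""
    dim, n = len(feats[0]), len(feats)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]
        grad_b = 0.0
        for h, t in zip(feats, labels):
            score = sum(wi * hi for wi, hi in zip(w, h)) + b
            if t * score < 1:  # margin violated, so the hinge term is active
                for i in range(dim):
                    grad_w[i] -= t * h[i] / n
                grad_b -= t / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, h):
    """Sign of the linear score, as a label in {-1, +1}."""
    return 1 if sum(wi * hi for wi, hi in zip(w, h)) + b >= 0 else -1

# Toy data: the label depends on |x|, which is not linear in x itself.
xs = [-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0]
labels = [1 if abs(x) > 1 else -1 for x in xs]

feats = [frozen_features(x) for x in xs]  # stage 1: extract fixed features
w, b = train_linear_svm(feats, labels)    # stage 2: fit the SVM on top
preds = [predict(w, b, h) for h in feats]
```

Only `w` and `b` are learned in stage 2; the feature map receives no gradient, which is the drawback noted above.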

Other papers have also proposed similar models but with joint training of weights at lower layers, using both standard neural nets and convolutional neural nets (Zhong & Ghosh, 2000; Collobert & Bengio, 2004; Nagi et al., 2012). In other related work, Weston et al. (2008) proposed a semi-supervised embedding algorithm for deep learning where the hinge loss is combined with the “contrastive loss” from siamese networks (Hadsell et al., 2006). Lower layer weights are learned using stochastic gradient descent. Vinyals et al. (2012) learn a recursive representation using linear SVMs at every layer, but without joint fine-tuning of the hidden representation.

In this paper, we show that for some deep architectures, a linear SVM top layer instead of a softmax is beneficial. We optimize the primal problem of the SVM, and the gradients can be backpropagated to learn lower level features. Our models are essentially the same as the ones proposed in (Zhong & Ghosh, 2000; Nagi et al., 2012), with the minor novelty of using the loss from the L2-SVM instead of the standard hinge loss.
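
To make the difference between the two losses concrete, the following sketch compares the standard hinge loss with the L2-SVM (squared hinge) loss for a single example in the binary case, and computes the gradient with respect to the top-layer input that would be backpropagated. It is an illustration under assumptions: the function names, the constant `C`, and the example values are hypothetical, and the weight-norm term of the primal objective is omitted.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hinge_loss(w, h, t, C=1.0):
    """Standard SVM data term for one example with label t in {-1, +1}."""
    return C * max(1.0 - t * dot(w, h), 0.0)

def l2svm_loss(w, h, t, C=1.0):
    """L2-SVM data term: the squared hinge, which is differentiable in h."""
    return C * max(1.0 - t * dot(w, h), 0.0) ** 2

def l2svm_grad_h(w, h, t, C=1.0):
    """Gradient of l2svm_loss with respect to the features h.
    This is the signal passed down to the layers below the SVM."""
    slack = max(1.0 - t * dot(w, h), 0.0)
    return [-2.0 * C * t * wi * slack for wi in w]

# Hypothetical top-layer weights, penultimate activations, and label.
w = [0.5, -0.25]
h = [1.0, 2.0]
t = 1

loss = l2svm_loss(w, h, t)    # w.h = 0, so the slack is 1 and the loss is 1.0
grad = l2svm_grad_h(w, h, t)  # [-1.0, 0.5]
```

When the margin is satisfied (for example `h = [4.0, 0.0]` here), both the loss and the gradient are zero, so such examples do not move the lower-layer weights.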

Compared to nets using a top layer softmax, we demonstrate superior performance on MNIST, CIFAR-10, and on a recent Kaggle competition on recognizing face expressions. Optimization is done using stochastic gradient descent on small minibatches. Comparing the two models in Sec. 3.4, we believe the performance gain is largely due to the superior regularization effects of the SVM loss function, rather than an advantage from better parameter optimization.
