







Yichuan Tang
Department of Computer Science, University of Toronto. Toronto, Ontario, Canada.
Recently, fully-connected and convolutional neural networks have been trained to achieve state-of-the-art performance on a wide variety of tasks such as speech recognition, image classification, natural language processing, and bioinformatics. For classification tasks, most of these “deep learning” models employ the softmax activation function for prediction and minimize cross-entropy loss. In this paper, we demonstrate a small but consistent advantage of replacing the softmax layer with a linear support vector machine. Learning minimizes a margin-based loss instead of the cross-entropy loss. While there have been various combinations of neural nets and SVMs in prior art, our results using L2-SVMs show that simply replacing softmax with linear SVMs gives significant gains on the popular deep learning datasets MNIST and CIFAR-10, and on the ICML 2013 Representation Learning Workshop’s face expression recognition challenge.
Deep learning using neural networks has claimed state-of-the-art performance in a wide range of tasks. These include (but are not limited to) speech (Mohamed et al., 2009; Dahl et al., 2010) and vision (Jarrett et al., 2009; Ciresan et al., 2011; Rifai et al., 2011a; Krizhevsky et al., 2012). All of the above-mentioned papers use the softmax activation function (also known as multinomial logistic regression) for classification.
The support vector machine is a widely used alternative to softmax for classification (Boser et al., 1992). Using SVMs (especially linear ones) in combination with convolutional nets has been proposed in the past as part of a multistage process. In particular, a deep convolutional net is first trained using supervised/unsupervised objectives to learn good invariant hidden latent representations. The corresponding hidden variables of data samples are then treated as input and fed into linear (or kernel) SVMs (Huang & LeCun, 2006; Lee et al., 2009; Quoc et al., 2010; Coates et al., 2011). This technique usually improves performance, but the drawback is that lower level features are not fine-tuned w.r.t. the SVM’s objective.
Other papers have also proposed similar models but with joint training of weights at lower layers, using both standard neural nets and convolutional neural nets (Zhong & Ghosh, 2000; Collobert & Bengio, 2004; Nagi et al., 2012). In other related work, Weston et al. (2008) proposed a semi-supervised embedding algorithm for deep learning where the hinge loss is combined with the “contrastive loss” from siamese networks (Hadsell et al., 2006). Lower layer weights are learned using stochastic gradient descent. Vinyals et al. (2012) learn a recursive representation using linear SVMs at every layer, but without joint fine-tuning of the hidden representation.
In this paper, we show that for some deep architectures, a linear SVM top layer instead of a softmax is beneficial. We optimize the primal problem of the SVM, and the gradients can be backpropagated to learn lower level features. Our models are essentially the same as the ones proposed in (Zhong & Ghosh, 2000; Nagi et al., 2012), with the minor novelty of using the loss from the L2-SVM instead of the standard hinge loss.
Compared to nets using a top layer softmax, we demonstrate superior performance on MNIST, CIFAR-10, and on a recent Kaggle competition on recognizing face expressions. Optimization is done using stochastic gradient descent on small minibatches. Comparing the two models in Sec. 3.4, we believe the performance gain is largely due to the superior regularization effects of the SVM loss function, rather than an advantage from better parameter optimization.
International Conference on Machine Learning 2013: Challenges in Representation Learning Workshop. Atlanta, Georgia, USA.
2.1. Softmax
For classification problems using deep learning techniques, it is standard to use the softmax, or 1-of-K encoding, at the top. For example, given 10 possible classes, the softmax layer has 10 nodes denoted by $p_i$, where $i = 1, \ldots, 10$. The $p_i$ specify a discrete probability distribution; therefore, $\sum_{i=1}^{10} p_i = 1$.
Let $h$ be the activation of the penultimate layer nodes and $W$ the weight matrix connecting the penultimate layer to the softmax layer. The total input into the softmax layer, denoted by $a$, is
$$a_i = \sum_k h_k W_{ki}, \tag{1}$$
then we have
$$p_i = \frac{\exp(a_i)}{\sum_{j=1}^{10} \exp(a_j)}. \tag{2}$$
The predicted class $\hat{i}$ would be
$$\hat{i} = \arg\max_i \, p_i = \arg\max_i \, a_i. \tag{3}$$
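As an illustration of Eqs. 1 to 3, the following is a minimal sketch in plain Python. The function names `softmax_layer` and `argmax`, and the toy values of `h` and `W`, are assumptions made for this example only. Subtracting the maximum activation before exponentiating is a common numerical-stability step that leaves the probabilities unchanged; it is not part of the equations above.

```python
import math

def softmax_layer(h, W):
    """Return (a, p) for penultimate activations h and weights W[k][i]."""
    num_classes = len(W[0])
    # Eq. 1: a_i = sum_k h_k * W_ki
    a = [sum(h[k] * W[k][i] for k in range(len(h))) for i in range(num_classes)]
    # Eq. 2, with max(a) subtracted for numerical stability (p is unchanged)
    m = max(a)
    exps = [math.exp(ai - m) for ai in a]
    z = sum(exps)
    p = [e / z for e in exps]
    return a, p

def argmax(values):
    """Index of the largest entry (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

# Toy example with 3 hidden units and 3 classes (illustrative values only)
h = [1.0, -2.0, 0.5]
W = [[0.2, -0.1, 0.4],
     [0.0, 0.3, -0.2],
     [0.5, 0.1, 0.1]]
a, p = softmax_layer(h, W)
predicted = argmax(p)  # Eq. 3: identical to argmax(a)
```

Because the exponential is monotonic, taking the arg max over `p` or over `a` selects the same class, which is the point of Eq. 3.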
2.2. Support Vector Machines
Linear support vector machines (SVMs) were originally formulated for binary classification. Given training data and its corresponding labels $(x_n, t_n)$, $n = 1, \ldots, N$, $x_n \in \mathbb{R}^D$, $t_n \in \{-1, +1\}$, SVM learning consists of the following constrained optimization:
$$\min_{w,\, \xi_n} \; \frac{1}{2} w^\top w + C \sum_{n=1}^{N} \xi_n \tag{4}$$
$$\text{s.t.} \quad w^\top x_n t_n \ge 1 - \xi_n \;\; \forall n, \qquad \xi_n \ge 0 \;\; \forall n$$
The $\xi_n$ are slack variables which penalize data points that violate the margin requirements. Note that we can include the bias by augmenting all data vectors $x_n$ with a scalar value of 1. The corresponding unconstrained optimization problem is the following:
$$\min_{w} \; \frac{1}{2} w^\top w + C \sum_{n=1}^{N} \max(1 - w^\top x_n t_n,\, 0) \tag{5}$$
The objective of Eq. 5 is known as the primal form problem of the L1-SVM, with the standard hinge loss. Since the L1-SVM objective is not differentiable, a popular variation is the L2-SVM, which minimizes the squared hinge loss:
$$\min_{w} \; \frac{1}{2} w^\top w + C \sum_{n=1}^{N} \max(1 - w^\top x_n t_n,\, 0)^2 \tag{6}$$
The L2-SVM objective is differentiable and imposes a bigger (quadratic vs. linear) loss on points which violate the margin. To predict the class label of a test point $x$:
$$\arg\max_{t} \; (w^\top x)\, t \tag{7}$$
For kernel SVMs, optimization must be performed in the dual. However, scalability is a problem with kernel SVMs, and in this paper we will only be using linear SVMs with standard deep learning models.
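To make the difference between the hinge and squared hinge objectives concrete, here is a small sketch in plain Python. It assumes the conventional factor of 1/2 on the regularizer $w^\top w$ and folds the bias into `w` by appending a constant 1 to every input, as described above. The function names and the toy data are illustrative assumptions, not code from the paper.

```python
def svm_objective(w, X, t, C, squared=True):
    """0.5 * w.w + C * sum of hinge losses (squared=False) or squared hinge losses."""
    reg = 0.5 * sum(wi * wi for wi in w)
    total = 0.0
    for x, tn in zip(X, t):
        hinge = max(1.0 - tn * sum(wi * xi for wi, xi in zip(w, x)), 0.0)
        total += hinge * hinge if squared else hinge
    return reg + C * total

def l2svm_gradient(w, X, t, C):
    """Gradient of the squared hinge objective: w - 2C * sum_n t_n x_n max(1 - w.x_n t_n, 0)."""
    grad = list(w)
    for x, tn in zip(X, t):
        hinge = max(1.0 - tn * sum(wi * xi for wi, xi in zip(w, x)), 0.0)
        for d in range(len(w)):
            grad[d] -= 2.0 * C * tn * x[d] * hinge
    return grad

# Toy data: one feature plus a constant 1 that plays the role of the bias input
X = [[3.0, 1.0], [-1.0, 1.0], [0.5, 1.0]]
t = [1.0, -1.0, 1.0]
w = [0.5, 0.0]
C = 1.0

l1_value = svm_objective(w, X, t, C, squared=False)
l2_value = svm_objective(w, X, t, C, squared=True)
grad = l2svm_gradient(w, X, t, C)
```

In this toy case the first point lies outside the margin and contributes nothing, while the other two violate the margin and are penalized linearly under the hinge loss and quadratically under the squared hinge loss. The squared hinge has a well-defined gradient everywhere, including at the margin boundary, which is what makes plain gradient descent straightforward.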
2.3. Multiclass SVMs
The simplest way to extend SVMs to multiclass problems is the so-called one-vs-rest approach (Vapnik, 1995). For K-class problems, K linear SVMs are trained independently, where the data from the other classes form the negative cases. Hsu & Lin (2002) discuss alternative multiclass SVM approaches, but we leave those to future work. Denoting the output of the k-th SVM, with weight vector $w_k$, as
$$a_k(x) = w_k^\top x, \tag{8}$$
the predicted class is
$$\arg\max_k \; a_k(x). \tag{9}$$
Note that prediction using SVMs is exactly the same as using a softmax (Eq. 3). The only difference between softmax and multiclass SVMs is in their objectives, parametrized by all of the weight matrices W. The softmax layer minimizes cross-entropy or, equivalently, maximizes the log-likelihood, while SVMs simply try to find the maximum margin between data points of different classes.
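The one-vs-rest scheme can be sketched on top of a vector of top-layer activations as follows. This is a hedged illustration rather than the paper's implementation: it omits the weight regularization term, and the helper names (`ovr_targets`, `ovr_l2svm_loss_and_grad`) and the toy activations are assumptions. The returned gradient with respect to each $a_k$ is the kind of quantity that would be passed down by backpropagation when the SVM sits on top of a network.

```python
def ovr_targets(label, num_classes):
    """+1 for the true class, -1 for every other class (one-vs-rest)."""
    return [1.0 if k == label else -1.0 for k in range(num_classes)]

def ovr_l2svm_loss_and_grad(a, label, C=1.0):
    """Squared hinge summed over K one-vs-rest SVMs, plus d(loss)/d(a_k)."""
    t = ovr_targets(label, len(a))
    hinges = [max(1.0 - ak * tk, 0.0) for ak, tk in zip(a, t)]
    loss = C * sum(hk * hk for hk in hinges)
    # d/da_k of C * max(1 - a_k t_k, 0)^2 is -2 C t_k max(1 - a_k t_k, 0)
    grad = [-2.0 * C * tk * hk for tk, hk in zip(t, hinges)]
    return loss, grad

# Toy activations for a 3-class problem whose true class is index 2
a = [0.45, -0.65, 0.85]
loss, grad = ovr_l2svm_loss_and_grad(a, label=2)
predicted = max(range(len(a)), key=lambda k: a[k])  # Eq. 9
```

Here the prediction is already correct, yet the loss is nonzero because none of the three outputs clears its margin of 1. The gradient pushes the true-class activation up and the other two down.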
2.4. Deep Learning with Support Vector Machines
Most deep learning methods for classification using fully connected layers and convolutional layers have used the softmax layer objective to learn the lower level parameters. There are exceptions, notably the papers by Zhong & Ghosh (2000), Collobert & Bengio (2004), and Nagi et al. (2012), supervised embedding with nonlinear NCA (Salakhutdinov & Hinton, 2007), and semi-supervised deep embedding (Weston et al., 2008). In this paper, we use L2-SVM’s objective to train deep
We can also look at the validation curve of the Softmax vs. L2-SVMs as a function of weight updates in Fig. 2. As the learning rate is lowered during the latter half of training, DLSVM maintains a small yet clear performance gain.
Figure 2. Cross validation performance of the two models. Result is averaged over 8 folds.
We also plotted the 1st layer convolutional filters of the two models:
Figure 3. Filters from convolutional net with softmax.
Figure 4. Filters from convolutional net with L2-SVM.
While not much can be gained from looking at these filters, the SVM-trained conv net appears to have more textured filters.
MNIST is a standard handwritten digit classification dataset and has been widely used as a benchmark dataset in deep learning. It is a 10-class classification problem with 60,000 training examples and 10,000 test cases.
We used a simple fully connected model, first performing PCA from 784 dimensions down to 70 dimensions. Two hidden layers of 512 units each are followed by a softmax or an L2-SVM. The data is then divided into 300 minibatches of 200 samples each. We trained using stochastic gradient descent with momentum on these 300 minibatches for over 400 epochs, totaling 120K weight updates. The learning rate is linearly decayed from 0.1 to 0.0. The L2 weight cost on the softmax layer is set to 0.001. To prevent overfitting, and critical to achieving good results, a lot of Gaussian noise is added to the input: noise with a standard deviation of 1. (linearly decayed to 0). The idea of adding Gaussian noise is taken from Raiko et al. (2012) and Rifai et al. (2011b).
Our learning algorithm is permutation invariant, uses no unsupervised pretraining, and obtains these error rates:
Softmax: 0.99%
DLSVM: 0.87%
An error of 0.87% on MNIST is probably (at this time) state-of-the-art for the above learning setting. The only difference between softmax and DLSVM is the last layer. This experiment is mainly meant to demonstrate the effectiveness of the last linear SVM layer vs. the softmax; we have not exhaustively explored other commonly used tricks such as Dropout, weight constraints, hidden unit sparsity, adding more hidden layers, and increasing the layer size.
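The training schedule described above can be sketched as follows. The arithmetic is taken from the text (60,000 examples in minibatches of 200 gives 300 minibatches; 300 minibatches over 400 epochs gives 120,000 updates). Two things are assumptions of this sketch: that the linear decay is applied per weight update rather than per epoch, and the starting noise level, which is left as the caller-supplied parameter `initial_std`. All function names are hypothetical.

```python
import random

def linear_decay(start, end, step, total_steps):
    """Linearly interpolate from start (step 0) to end (step == total_steps)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac

NUM_TRAIN = 60_000
BATCH_SIZE = 200
EPOCHS = 400
batches_per_epoch = NUM_TRAIN // BATCH_SIZE   # 300 minibatches
total_updates = batches_per_epoch * EPOCHS    # 120,000 weight updates

def learning_rate(step):
    """Learning rate decayed linearly from 0.1 to 0.0 over all updates."""
    return linear_decay(0.1, 0.0, step, total_updates)

def noise_std(step, initial_std):
    """Input-noise level decayed linearly to 0; initial_std is a placeholder."""
    return linear_decay(initial_std, 0.0, step, total_updates)

def add_input_noise(x, std, rng):
    """Add independent zero-mean Gaussian noise to every input dimension."""
    return [xi + rng.gauss(0.0, std) for xi in x]

rng = random.Random(0)
noisy = add_input_noise([0.0] * 70, noise_std(0, initial_std=0.5), rng)
```

Both schedules reach exactly zero at the final update, so the last steps of training see clean inputs and make vanishingly small weight changes.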
The Canadian Institute For Advanced Research 10 (CIFAR-10) dataset is a 10-class object dataset with 50,000 images for training and 10,000 for testing. The color images are 32 × 32 in resolution. We trained a Convolutional Neural Net with two alternating pooling and filtering layers. Horizontal reflection and jitter are applied to the data randomly before the weights are updated using a minibatch of 128 data cases.
The Convolutional Net part of both models is fairly standard: the first convolutional (C) layer has 32 5 × 5 filters with ReLU hidden units, and the second C layer has 64 5 × 5 filters. Both pooling layers use max pooling and downsample by a factor of 2.
The penultimate layer has 3072 hidden nodes and uses ReLU activation with a dropout rate of 0.2. The difference between the ConvNet+Softmax and the ConvNet with L2-SVM is mainly in the SVM’s C constant, the softmax’s weight decay constant, and the learning rate. We selected the values of these hyperparameters for each model separately using validation.
             ConvNet+Softmax   ConvNet+SVM
Test error        14.0%           11.9%
Table 2. Comparisons of the models in terms of % error on the test set.
In the literature, the state-of-the-art result (at the time of writing) is around 9.5%, by Snoek et al. (2012). However, that model is different, as it includes contrast normalization layers and uses Bayesian optimization to tune its hyperparameters.
3.4. Regularization or Optimization
To see whether the gain in DLSVM is due to the superiority of the objective function or to the ability to better optimize, we looked at each of the two final models’ loss under its own objective function as well as under the other objective. The results are in Table 3.
                      ConvNet+Softmax   ConvNet+SVM
Test error                 14.0%           11.9%
Avg. cross entropy         0.072           0.
Hinge loss squared         213.2           0.
Table 3. Training objective including the weight costs.
It is interesting to note here that a lower cross entropy actually led to a higher error (middle row). In addition, we also initialized a ConvNet+Softmax model with the weights of the DLSVM that had 11.9% error. As further training was performed, the network’s error rate gradually increased towards 14%.
This gives limited evidence that the gain of DLSVM is largely due to a better objective function.
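The comparison above amounts to scoring one set of top-layer activations under both objectives. The following sketch shows the idea on made-up numbers. It is an assumption-laden illustration: it averages per-example losses and leaves out the weight costs that Table 3 includes, so its values are not comparable to the table, and the function names are hypothetical.

```python
import math

def cross_entropy(a, label):
    """Negative log-probability of the true class under a softmax over a."""
    m = max(a)
    log_z = m + math.log(sum(math.exp(ai - m) for ai in a))
    return log_z - a[label]

def squared_hinge(a, label):
    """One-vs-rest squared hinge loss; the target is +1 for label, -1 otherwise."""
    return sum(max(1.0 - (ak if k == label else -ak), 0.0) ** 2
               for k, ak in enumerate(a))

def evaluate(activations, labels):
    """Error rate and both average losses for the same set of activations."""
    n = len(labels)
    errors = sum(1 for a, y in zip(activations, labels)
                 if max(range(len(a)), key=lambda k: a[k]) != y)
    return {
        "error_rate": errors / n,
        "avg_cross_entropy": sum(cross_entropy(a, y)
                                 for a, y in zip(activations, labels)) / n,
        "avg_squared_hinge": sum(squared_hinge(a, y)
                                 for a, y in zip(activations, labels)) / n,
    }

# Illustrative activations for three examples of a 2-class problem
activations = [[2.0, 0.0], [0.0, 0.0], [-1.0, 3.0]]
labels = [0, 1, 1]
report = evaluate(activations, labels)
```

Since the arg max prediction is shared by both models, the error rate depends only on the activations, while the two losses can rank the same pair of models differently, as the middle row of Table 3 illustrates.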
In conclusion, we have shown that DLSVM works better than softmax on two standard datasets and a recent dataset. Switching from softmax to SVMs is incredibly simple and appears to be useful for classification tasks. Further research is needed to explore other multiclass SVM formulations and to better understand where the gain comes from and how large it is.
Acknowledgment
Thanks to Alex Krizhevsky for making his very fast CUDA Conv kernels available! Many thanks to Relu Patrascu for making running experiments possible! Thanks to Ian Goodfellow, Dumitru Erhan, and Yoshua Bengio for organizing the contests.
Boser, Bernhard E., Guyon, Isabelle M., and Vapnik, Vladimir N. A training algorithm for optimal margin classifiers. In Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pp. 144–152. ACM Press, 1992.
Ciresan, D., Meier, U., Masci, J., Gambardella, L. M., and Schmidhuber, J. High-performance neural networks for visual object classification. CoRR, abs/1102.0183, 2011.
Coates, Adam, Ng, Andrew Y., and Lee, Honglak. An analysis of single-layer networks in unsupervised feature learning. Journal of Machine Learning Research - Proceedings Track, 15:215–223, 2011.
Collobert, R. and Bengio, S. A gentle hessian for efficient gradient descent. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2004.
Dahl, G. E., Ranzato, M., Mohamed, A., and Hinton, G. E. Phone recognition with the mean-covariance restricted Boltzmann machine. In NIPS 23. 2010.
Hadsell, Raia, Chopra, Sumit, and Lecun, Yann. Dimensionality reduction by learning an invariant mapping. In Proc. Computer Vision and Pattern Recognition Conference (CVPR06). IEEE Press, 2006.
Hsu, Chih-Wei and Lin, Chih-Jen. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425, 2002.
Huang, F. J. and LeCun, Y. Large-scale learning with SVM and convolutional nets for generic object categorization. In CVPR, pp. I: 284–291, 2006. URL http://dx.doi.org/10.1109/CVPR.2006.164.
Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. What is the best multi-stage architecture for object recognition? In Proc. Intl. Conf. on Computer Vision (ICCV’09). IEEE, 2009.
Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In NIPS, pp. 1106–1114, 2012.
Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Intl. Conf. on Machine Learning, pp. 609–616, 2009.
Mohamed, A., Dahl, G. E., and Hinton, G. E. Deep belief networks for phone recognition. In NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, 2009.