













































Machine Learning, 46, 225–254, 2002. © 2002 Kluwer Academic Publishers. Manufactured in The Netherlands.
AYHAN DEMIRIZ, Department of Decision Sciences and Eng. Systems, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
KRISTIN P. BENNETT, Department of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180, USA (while visiting Microsoft Research, Redmond, WA, USA)
JOHN SHAWE-TAYLOR, Department of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK
Editor: Nello Cristianini
Abstract. We examine linear program (LP) approaches to boosting and demonstrate their efficient solution using LPBoost, a column generation based simplex method. We formulate the problem as if all possible weak hypotheses had already been generated. The labels produced by the weak hypotheses become the new feature space of the problem. The boosting task becomes to construct a learning function in the label space that minimizes misclassification error and maximizes the soft margin. We prove that for classification, minimizing the 1-norm soft margin error function directly optimizes a generalization error bound. The equivalent linear program can be efficiently solved using column generation techniques developed for large-scale optimization problems. The resulting LPBoost algorithm can be used to solve any LP boosting formulation by iteratively optimizing the dual misclassification costs in a restricted LP and dynamically generating weak hypotheses to make new LP columns. We provide algorithms for soft margin classification, confidence-rated, and regression boosting problems. Unlike gradient boosting algorithms, which may converge in the limit only, LPBoost converges in a finite number of iterations to a global solution satisfying mathematically well-defined optimality conditions. The optimal solutions of LPBoost are very sparse in contrast with gradient based methods. Computationally, LPBoost is competitive in quality and computational cost to AdaBoost.
Keywords: ensemble learning, boosting, linear programming, sparseness, soft margin
1. Introduction
Recent papers (Schapire et al., 1998) have shown that boosting, arcing, and related ensemble methods (hereafter summarized as boosting) can be viewed as margin maximization in function space. By changing the cost function, different boosting methods such as AdaBoost can be viewed as gradient descent to minimize this cost function. Some authors have noted the possibility of choosing cost functions that can be formulated as linear programs (LP) but then dismiss the approach as intractable using standard LP algorithms (Rätsch et al., 2000a; Breiman, 1999).
The main contribution in the application of LP methods to boosting has been made by Grove and Schuurmans (1998) who derived an LP method DualLPBoost based on maximizing the margin of the combined classifier. They experienced difficulties, however, in getting the method to work in practice. We adopt a similar approach but optimize a new rigorous generalization bound obtained in terms of a soft margin measure. Using a soft margin ensures that the approach is able to handle noisy data more robustly, but also overcomes the convergence problems experienced by Grove and Schuurmans. We discuss in more detail the reasons for this improvement in Section 7. Furthermore in this paper we show that LP boosting is generally computationally feasible using a classic column generation simplex algorithm (Nash & Sofer, 1996). This method performs tractable boosting using any cost function expressible as an LP. We specifically examine the variations of the 1-norm soft margin cost function used for support vector machines (Rätsch et al., 2000b; Bennett, 1999; Mangasarian, 2000). One advantage of these approaches is that immediately the method of analysis for support vector machine problems becomes applicable to the boosting problem. In Section 2, we prove that the LPBoost approach to classification directly minimizes a bound on the generalization error. We adopt the LP formulations developed for support vector machines. In Section 3, we discuss the soft margin LP formulation. By adopting linear programming, we immediately have the tools of mathematical programming at our disposal. In Section 4 we examine how column generation approaches for solving large scale LPs can be adapted to boosting. For classification, we examine both standard and confidence-rated boosting. Standard boosting algorithms use weak hypotheses that are classifiers, that is, whose outputs are in the set $\{-1, +1\}$.
Schapire and Singer (1998) have considered boosting weak hypotheses whose outputs reflected not only a classification but also an associated confidence encoded by a value in the range $[-1, +1]$. They demonstrate that so-called confidence-rated boosting can speed convergence of the composite classifier, though the accuracy in the long term was not found to be significantly affected. In Section 5, we discuss the minor modifications needed for LPBoost to perform confidence-rated boosting. The methods we develop can be readily extended to any ensemble problem formulated as an LP. We demonstrate this by adapting the approach to regression in Section 6. In Section 7, we examine the hard margin LP formulation of Grove and Schuurmans (1998) which is also a special case of the column generation approach. By use of duality theory and optimality conditions, we can gain insight into how LP boosting works mathematically, specifically demonstrating the critical differences between the prior hard margin approach and the proposed soft margin approach. Computational results and practical issues for implementation of the method are given in Section 8.
2. Motivation for soft margin boosting
We begin with an analysis of the boosting problem using the methodology developed for support vector machines. The function classes that we will be considering are of the form $\mathrm{co}(\mathcal{H}) = \{\sum_{h \in \mathcal{H}} a_h h : a_h \ge 0\}$, where $\mathcal{H}$ is a set of weak hypotheses which we assume is closed under complementation. Initially, these will be classification functions with outputs in the set $\{-1, 1\}$, though this can be taken as $[-1, 1]$ in confidence-rated boosting. We begin,
We define the inner product of two functions $f, g \in L(X)$ by $\langle f \cdot g \rangle = \sum_{x \in \mathrm{supp}(f)} f(x) g(x)$. This implicitly defines a norm $\|\cdot\|_2$. We also introduce

$$\|f\|_1 = \sum_{x \in \mathrm{supp}(f)} |f(x)|.$$

Note that the sum that defines the inner product is well-defined by the Cauchy-Schwarz inequality. Clearly the space is closed under addition and multiplication by scalars. Furthermore, the inner product is linear in both arguments. We now form the product space $X \times L(X)$ with the corresponding function class $F \times L(X)$ acting on $X \times L(X)$ via the composition rule

$$(f, g) : (x, h) \longmapsto f(x) + \langle g \cdot h \rangle.$$
Now for any fixed $1 \ge \Delta > 0$ we define an embedding of $X$ into the product space $X \times L(X)$ as follows:

$$\tau_\Delta : x \longmapsto (x, \Delta \delta_x),$$

where $\delta_x \in L(X)$ is defined by $\delta_x(y) = 1$ if $y = x$, and 0 otherwise.
Definition 2.2. Consider using a class $F$ of real-valued functions on an input space $X$ for classification by thresholding at 0. We define the margin slack variable of an example $(x_i, y_i) \in X \times \{-1, 1\}$ with respect to a function $f \in F$ and target margin $\gamma$ to be the quantity $\xi((x_i, y_i), f, \gamma) = \xi_i = \max(0, \gamma - y_i f(x_i))$. Note that $\xi_i > \gamma$ implies incorrect classification of $(x_i, y_i)$.
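As an illustration of Definition 2.2, the following minimal sketch computes margin slack variables for a handful of points. The function name `margin_slacks` and the scores and labels are made up for this example and do not come from the paper.

```python
def margin_slacks(scores, labels, gamma=1.0):
    """Return xi_i = max(0, gamma - y_i * f(x_i)) for every example."""
    return [max(0.0, gamma - y * s) for s, y in zip(scores, labels)]


# Hypothetical real-valued outputs f(x_i) and labels y_i for four points.
scores = [2.0, 0.4, -0.3, -1.5]
labels = [1, 1, 1, -1]

xi = margin_slacks(scores, labels, gamma=1.0)
# A point is misclassified when y_i * f(x_i) < 0 (thresholding at 0).
misclassified = [y * s < 0 for s, y in zip(scores, labels)]

for s, y, slack, wrong in zip(scores, labels, xi, misclassified):
    print(f"f(x)={s:+.1f}  y={y:+d}  slack={slack:.2f}  misclassified={wrong}")
```

In this toy run the only point whose slack exceeds the target margin is also the only misclassified one, in line with the closing remark of the definition.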
The construction of the space $X \times L(X)$ allows us to obtain a margin separation of $\gamma$ by using an auxiliary function defined in terms of the margin slack variables. For a function $f$ and target margin $\gamma$ the auxiliary function with respect to the training set $S$ is

$$g_f = \frac{1}{\Delta} \sum_{i=1}^{\ell} \xi((x_i, y_i), f, \gamma)\, y_i \delta_{x_i} = \frac{1}{\Delta} \sum_{i=1}^{\ell} \xi_i y_i \delta_{x_i}.$$
It is now a simple calculation to check the following two properties of the function $(f, g_f) \in F \times L(X)$:
Together these facts imply that the generalization error of $f$ can be assessed by applying the large margin theorem to $(f, g_f)$. This gives the following theorem:
Theorem 2.2. Consider thresholding a real-valued function space $F$ on the domain $X$. Fix $\gamma \in \mathbb{R}^+$ and choose $G \subset F \times L(X)$. For any probability distribution $D$ on $X \times \{-1, 1\}$, with probability $1 - \delta$ over $\ell$ random examples $S$, any hypothesis $f \in F$ for which $(f, g_f) \in G$ has generalization error no more than

$$\mathrm{err}_D(f) \le \varepsilon(\ell, F, \delta, \gamma) = \frac{2}{\ell} \left( \log N\!\left(G, 2\ell, \frac{\gamma}{2}\right) + \log \frac{2}{\delta} \right),$$

provided $\ell > 2/\varepsilon$, and there is no discrete probability on misclassified training points.
We are now in a position to apply these results to our function class which will be in the form described above, $F = \mathrm{co}(\mathcal{H}) = \{\sum_{h \in \mathcal{H}} a_h h : a_h \ge 0\}$, where we have left open for the time being what the class $\mathcal{H}$ of weak hypotheses might contain. The sets $G$ of Theorem 2.2 will be chosen as follows:

$$G_B = \left\{ \left( \sum_{h \in \mathcal{H}} a_h h,\; g \right) : \sum_{h \in \mathcal{H}} a_h + \|g\|_1 \le B,\; a_h \ge 0 \right\}.$$
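Hence, the condition that a function $f = \sum_{h \in \mathcal{H}} a_h h$ satisfies the conditions of Theorem 2.2 for $G = G_B$ is simply

$$\sum_{h \in \mathcal{H}} a_h + \frac{1}{\Delta} \sum_{i=1}^{\ell} \xi((x_i, y_i), f, \gamma) = \sum_{h \in \mathcal{H}} a_h + \frac{1}{\Delta} \sum_{i=1}^{\ell} \xi_i \le B. \qquad (1)$$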
Note that this will be the quantity that we will minimize through the boosting iterations described in later sections, where we will use the parameter $C$ in place of $1/\Delta$ and the margin $\gamma$ will be set to 1. The final piece of the puzzle that we require to apply Theorem 2.2 is a bound on the covering numbers of $G_B$ in terms of the class of weak hypotheses $\mathcal{H}$, the bound $B$, and the margin $\gamma$. Before launching into this analysis, observe that for any input $x$,

$$\max_{h \in \mathcal{H}} \{|h(x)|\} = 1, \quad \text{while} \quad \max_{x_i} \Delta \delta_{x_i}(x) \le \Delta \le 1.$$
2.1. Covering numbers of convex hulls
In this subsection we analyze the covering numbers $N(G_B, \ell, \gamma)$ of the set

$$G_B = \left\{ \left( \sum_{h \in \mathcal{H}} a_h h,\; g \right) : \sum_{h \in \mathcal{H}} a_h + \|g\|_1 \le B,\; a_h \ge 0 \right\}$$

in terms of $B$, the class $\mathcal{H}$, and the scale $\gamma$. Assume first that we have an $\eta/B$-cover $G$ of the function class $\mathcal{H}$ with respect to the set $S = (x_1, x_2, \ldots, x_\ell)$ for some $\eta < \gamma$. If $\mathcal{H}$ is a class of binary-valued functions then we will take $\eta$ to be zero and $G$ will be the set of dichotomies that can be realized by the class. Now consider the set $V$ of
3. Boosting LP for classification
From the above discussion we can see that a soft margin cost function should be valuable for boosting classification functions. Once again using the techniques used in support vector machines, we can formulate this problem as a linear program. The quantity $B$ defined in Eq. (1) can be optimized directly using an LP. The LP is formulated as if all possible labelings of the training data by the weak hypotheses were known. The LP minimizes the 1-norm soft margin cost function used in support vector machines with the added restrictions that all the weights are positive and the threshold is assumed to be zero. This LP and variants can be practically solved using a column generation approach. Weak hypotheses are generated as needed to produce the optimal support vector machine based on the outputs of all the weak hypotheses. In essence the base learning algorithm becomes an 'oracle' that generates the necessary columns. The dual variables of the linear program provide the misclassification costs needed by the learning machine. The column generation procedure searches for the best possible misclassification costs in dual space. Only at optimality is the actual ensemble of weak hypotheses constructed.
3.1. LP formulation
Let the matrix $H$ be an $\ell$ by $m$ matrix of all the possible labelings of the training data using functions from $\mathcal{H}$. Specifically $H_{ij} = h_j(x_i)$ is the label ($1$ or $-1$) given by weak hypothesis $h_j \in \mathcal{H}$ on the training point $x_i$. Each column $H_{.j}$ of the matrix $H$ constitutes the output of weak hypothesis $h_j$ on the training data, while each row $H_i$ gives the outputs of all the weak hypotheses on the example $x_i$. There may be up to $2^\ell$ distinct weak hypotheses. The following linear program can be used to minimize the quantity in Eq. (1):

$$\begin{aligned}
\min_{a, \xi} \quad & \sum_{j=1}^{m} a_j + C \sum_{i=1}^{\ell} \xi_i \\
\text{s.t.} \quad & y_i H_i a + \xi_i \ge 1, \quad \xi_i \ge 0, \quad i = 1, \ldots, \ell \\
& a_j \ge 0, \quad j = 1, \ldots, m
\end{aligned} \qquad (2)$$
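where $C > 0$ is the tradeoff parameter between misclassification error and margin maximization. The dual of LP (2) is

$$\begin{aligned}
\max_{u} \quad & \sum_{i=1}^{\ell} u_i \\
\text{s.t.} \quad & \sum_{i=1}^{\ell} u_i y_i H_{ij} \le 1, \quad j = 1, \ldots, m \\
& 0 \le u_i \le C, \quad i = 1, \ldots, \ell
\end{aligned} \qquad (3)$$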
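The primal-dual pair (2) and (3) can be sanity-checked on a tiny instance: a feasible primal point and a feasible dual point with equal objective values certify that both are optimal. The sketch below is only an illustration; the helper names and the three-point, two-hypothesis data are invented for the example.

```python
def primal_value(a, xi, C):
    """Objective of LP (2): sum_j a_j + C * sum_i xi_i."""
    return sum(a) + C * sum(xi)


def primal_feasible(H, y, a, xi, tol=1e-9):
    """Check y_i * H_i a + xi_i >= 1, xi >= 0, a >= 0."""
    margins = [y[i] * sum(H[i][j] * a[j] for j in range(len(a)))
               for i in range(len(y))]
    return (all(m + s >= 1 - tol for m, s in zip(margins, xi))
            and min(a) >= -tol and min(xi) >= -tol)


def dual_feasible(H, y, u, C, tol=1e-9):
    """Check sum_i u_i y_i H_ij <= 1 for every column j and 0 <= u_i <= C."""
    columns = range(len(H[0]))
    return (all(sum(u[i] * y[i] * H[i][j] for i in range(len(y))) <= 1 + tol
                for j in columns)
            and all(-tol <= v <= C + tol for v in u))


# Toy instance (made up): three points, two weak hypotheses.
y = [1, 1, -1]
H = [[1, 1],
     [1, -1],
     [-1, 1]]
C = 1.0

a, xi = [1.0, 0.0], [0.0, 0.0, 0.0]   # candidate primal solution
u = [1.0, 0.0, 0.0]                   # candidate dual solution

assert primal_feasible(H, y, a, xi) and dual_feasible(H, y, u, C)
print(primal_value(a, xi, C), sum(u))  # equal values certify optimality
```

Here the first weak hypothesis classifies every point correctly, so it receives all the weight and no slack is needed.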
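Alternative soft margin LP formulations exist, such as this one for the ν-LP Boosting (Rätsch et al., 2000a):

$$\begin{aligned}
\max_{a, \xi, \rho} \quad & \rho - D \sum_{i=1}^{\ell} \xi_i \\
\text{s.t.} \quad & y_i H_i a + \xi_i \ge \rho, \quad i = 1, \ldots, \ell \\
& \sum_{j=1}^{m} a_j = 1, \quad \xi_i \ge 0, \quad i = 1, \ldots, \ell \\
& a_j \ge 0, \quad j = 1, \ldots, m
\end{aligned} \qquad (4)$$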
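The dual of this LP (4) is:

$$\begin{aligned}
\min_{u, \beta} \quad & \beta \\
\text{s.t.} \quad & \sum_{i=1}^{\ell} u_i y_i H_{ij} \le \beta, \quad j = 1, \ldots, m \\
& \sum_{i=1}^{\ell} u_i = 1, \quad 0 \le u_i \le D, \quad i = 1, \ldots, \ell
\end{aligned} \qquad (5)$$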
These LP formulations are exactly equivalent given the appropriate choice of the parameters $C$ and $D$. Proofs of this fact can be found in Rätsch et al. (2000b) and Bennett et al. (2000), so we only state the theorem here.
Theorem 3.1 (LP Formulation Equivalence). If LP (4) with parameter $D$ has a primal solution $(\bar{a}, \bar{\rho} > 0, \bar{\xi})$ and dual solution $(\bar{u}, \bar{\beta})$, then $(\hat{a} = \bar{a}/\bar{\rho},\ \hat{\xi} = \bar{\xi}/\bar{\rho})$ and $(\hat{u} = \bar{u}/\bar{\beta})$ are the primal and dual solutions of LP (2) with parameter $C = D/\bar{\beta}$. Similarly, if LP (2) with parameter $C$ has primal solution $(\hat{a} \ne 0, \hat{\xi})$ and dual solution $(\hat{u} \ne 0)$, then $(\bar{\rho} = 1/\sum_{j=1}^{m} \hat{a}_j,\ \bar{a} = \hat{a}\bar{\rho},\ \bar{\xi} = \hat{\xi}\bar{\rho})$ and $(\bar{\beta} = 1/\sum_{i=1}^{\ell} \hat{u}_i,\ \bar{u} = \hat{u}\bar{\beta})$ are the primal and dual solutions of LP (4) with parameter $D = C\bar{\beta}$.
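The dual half of this rescaling can be checked by hand. The short derivation below is a sketch that assumes $\bar{\beta} > 0$ (an assumption made here for the division to be valid) and verifies feasibility only, not optimality: dividing the constraints of LP (5) by $\bar{\beta}$ gives the constraints of LP (3) with $C = D/\bar{\beta}$.

```latex
\begin{aligned}
\sum_{i=1}^{\ell} \bar{u}_i y_i H_{ij} \le \bar{\beta}
  \;&\Longrightarrow\; \sum_{i=1}^{\ell} \hat{u}_i y_i H_{ij} \le 1,
  && \hat{u}_i = \bar{u}_i / \bar{\beta}, \\
0 \le \bar{u}_i \le D
  \;&\Longrightarrow\; 0 \le \hat{u}_i \le D / \bar{\beta} = C, \\
\sum_{i=1}^{\ell} \bar{u}_i = 1
  \;&\Longrightarrow\; \sum_{i=1}^{\ell} \hat{u}_i = 1 / \bar{\beta}.
\end{aligned}
```

The last line shows that the objective of LP (3) at $\hat{u}$ equals $1/\bar{\beta}$, which matches the relation $\bar{\beta} = 1/\sum_i \hat{u}_i$ used in the reverse direction of the theorem.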
Practically we found ν-LP (4) with $D = \frac{1}{\ell\nu}$, $\nu \in (0, 1)$ preferable because of the interpretability of the parameter. A more extensive discussion and development of these characteristics for SVM classification can be found in Rätsch et al. (2000b). To maintain dual feasibility, the parameter $\nu$ must be chosen so that $\frac{1}{\ell} \le D \le 1$. By picking $\nu$ appropriately we can force the minimum number of support vectors. We know that the number of support vectors will be the number of points misclassified plus the points on the margin, and this was used as a heuristic for choice of $\nu$. The reader should consult (Rätsch et al., 2000a, 2000b) for a more in-depth analysis of this family of cost functions.
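The relation between $\nu$, $D$, and the minimum number of support vectors can be made concrete with a few lines of code. This is a minimal sketch: the function names are hypothetical, and the lower bound simply uses the fact that costs summing to one with each $u_i \le D$ need at least $\lceil 1/D \rceil$ nonzero entries.

```python
import math


def cost_bound(num_points, nu):
    """Return D = 1 / (l * nu) for nu in (0, 1)."""
    if not 0.0 < nu < 1.0:
        raise ValueError("nu must lie strictly between 0 and 1")
    return 1.0 / (num_points * nu)


def min_nonzero_costs(D):
    """Smallest number of nonzero u_i allowed by sum(u) = 1 and u_i <= D."""
    return math.ceil(1.0 / D - 1e-9)  # small tolerance guards against round-off


num_points = 100                      # hypothetical training-set size l
for nu in (0.05, 0.1, 0.5):
    D = cost_bound(num_points, nu)
    assert 1.0 / num_points <= D <= 1.0   # range needed for dual feasibility
    print(f"nu={nu:<4}  D={D:.3f}  at least {min_nonzero_costs(D)} nonzero costs")
```

Larger $\nu$ therefore forces more points to carry a nonzero misclassification cost, which is what makes the parameter easy to interpret.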
3.2. Properties of LP formulation
We now examine the characteristics of LP (4) and its optimality conditions to gain insight into the properties of LP Boosting. This will be useful in understanding both the effects of
formulation solvable by LPBoost.) A notable difference is that LP (5) has an additional upper bound on the misclassification costs $u$, $0 \le u_i \le D$, $i = 1, \ldots, \ell$, that is produced by the introduction of the soft margin in the primal. From the LP optimality conditions and the fact that linear programs have extreme point solutions, we know that there exist very sparse solutions of both the primal and dual problems and that the degree of sparsity will be greatly influenced by the choice of parameter $D = \frac{1}{\ell\nu}$. The size of the dual feasible region depends on our choice of $\nu$. If $\nu$ is too large, forcing $D$ small, then the dual problem is infeasible. For large but still feasible $\nu$ ($D$ very small but still feasible), the problem degrades to something very close to the equal-cost case, $u_i = 1/\ell$. All the $u_i$ are forced to be nonzero. Practically, this means that as $\nu$ increases ($D$ becomes smaller), the optimal solution may be one or two weak hypotheses that are best assuming approximately equal costs. As $\nu$ decreases ($D$ grows), the misclassification costs, $u_i$, will increase for hard-to-classify points or points on the margin in the label space and will go to 0 for points that are easy to classify. Thus the misclassification costs $u$ become sparser. If $\nu$ is too small (and $D$ too large) then the meaningless null solution, $a = 0$, with all points classified as one class, becomes optimal. For a good choice of $\nu$, a sparse solution for the primal ensemble weights $a$ will be optimal. This implies that few weak hypotheses will be used. Also a sparse dual $u$ will be optimal. This means that the solution will be dependent only on a smaller subset of data (the support vectors). Data with $u_i = 0$ are well-classified with sufficient margin, so the performance on these data is not critical. From LP sensitivity analysis, we know that the $u_i$ are exactly the sensitivity of the optimal solution to small perturbations in the margin.
In some sense the sparseness of u is good because the weak hypotheses can be constructed using only smaller subsets of the data. But as we will see in Sections 7 and 8, this sparseness of the misclassification costs can lead to problems when implementing algorithms.
4. LPBoost algorithms
We now examine practical algorithms for solving the LP (4). Since the matrix H has a very large number of columns, prior authors have dismissed the idea of solving LP formulations for boosting as being intractable using standard LP techniques. But column generation techniques for solving such LPs have existed since the 1950s and can be found in LP textbooks; see for example (Nash & Sofer, 1996, Section 7.4). Column generation is frequently used in large-scale integer and linear programming algorithms so commercial codes such as CPLEX have been optimized to perform column generation very efficiently (CPLEX, 1994). The simplex method does not require that the matrix H be explicitly available. At each iteration, only a subset of the columns is used to determine the current solution (called a basic feasible solution). The simplex method needs some means for determining if the current solution is optimal, and if it is not, some means for generating some column that violates the optimality conditions. The tasks of verification of optimality and generating a column can be performed by the learning algorithm. A simplex-based boosting method will alternate between solving an LP for a reduced matrix Hˆ corresponding to the weak hypotheses generated so far and using the base learning algorithm to generate the best-scoring weak hypothesis based on the dual misclassification cost provided by the
LP. This will continue until the algorithm terminates at an exact or approximate optimal solution based on well-defined stopping criteria, or until some other stopping criterion, such as the maximum number of iterations, is reached. The idea of column generation (CG) is to restrict the primal problem (2) by considering only a subset of all the possible labelings based on the weak hypotheses generated so far; i.e., only a subset $\hat{H}$ of the columns of $H$ is used. The LP solved using $\hat{H}$ is typically referred to as the restricted master problem. Solving the restricted primal LP corresponds to solving a relaxation of the dual LP. The constraints for weak hypotheses that have not been generated yet are missing. One extreme case is when no weak hypotheses are considered. In this case the optimal dual solution is $\hat{u}_i = \frac{1}{\ell}$ (with appropriate choice of $D$). This will provide the initialization of the algorithm. If we consider the unused columns to have $\hat{a}_i = 0$, then $\hat{a}$ is feasible for the original primal LP. If $(\hat{u}, \hat{\beta})$ is feasible for the original dual problem then we are done since we have primal and dual feasibility with equal objectives. If $\hat{a}$ is not optimal then $(\hat{u}, \hat{\beta})$ is infeasible for the dual LP with full matrix $H$. Specifically, the constraint $\sum_{i=1}^{\ell} \hat{u}_i y_i H_{ij} \le \hat{\beta}$ is violated for at least one weak hypothesis. Or equivalently, $\sum_{i=1}^{\ell} \hat{u}_i y_i H_{ij} > \hat{\beta}$ for some $j$. Of course we do not want to a priori generate all columns of $H$ ($H_{.j}$), so we use our base learning algorithm as an oracle that either produces $H_{.j}$ with $\sum_{i=1}^{\ell} \hat{u}_i y_i H_{ij} > \hat{\beta}$ for some $j$ or a guarantee that no such $H_{.j}$ exists. To speed convergence we would like to find the one with maximum deviation, that is, the base learning algorithm $\mathcal{H}(S, u)$ must deliver a function $\hat{h}$ satisfying

$$\sum_{i=1}^{\ell} y_i \hat{h}(x_i) \hat{u}_i = \max_{h \in \mathcal{H}} \sum_{i=1}^{\ell} \hat{u}_i y_i h(x_i). \qquad (10)$$
Thus $\hat{u}_i$ becomes the new misclassification cost, for example $i$, that is given to the base learning machine to guide the choice of the next weak hypothesis. One of the big payoffs of the approach is that we have a stopping criterion that guarantees that the optimal ensemble has been found. If there is no weak hypothesis $h$ for which $\sum_{i=1}^{\ell} \hat{u}_i y_i h(x_i) > \hat{\beta}$, then the current combined hypothesis is the optimal solution over all linear combinations of weak hypotheses. We can also gauge the cost of early stopping since if $\max_{h \in \mathcal{H}} \sum_{i=1}^{\ell} \hat{u}_i y_i h(x_i) \le \hat{\beta} + \epsilon$, for some $\epsilon > 0$, we can obtain a feasible solution of the full dual problem by taking $(\hat{u}, \hat{\beta} + \epsilon)$. Hence, the value $V$ of the optimal solution can be bounded between $\hat{\beta} \le V < \hat{\beta} + \epsilon$. This implies that, even if we were to potentially include a non-zero coefficient for all the weak hypotheses, the value of the objective $\rho - D \sum_{i=1}^{\ell} \xi_i$ can only be increased by at most $\epsilon$. We assume the existence of the weak learning algorithm $\mathcal{H}(S, u)$ which selects the best weak hypothesis from a set $\mathcal{H}$ closed under complementation using the criterion of Eq. (10). The following algorithm results.
Algorithm 4.1 (LPBoost).

Given as input training set: $S$
$m \leftarrow 0$    No weak hypotheses
$a \leftarrow 0$    All coefficients are 0
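The oracle criterion of Eq. (10) and the associated stopping test can be illustrated with decision stumps. The following is a minimal sketch under stated assumptions: the exhaustive threshold search, the function names `best_stump` and `violates_dual`, and the toy data are all illustrative choices, not the base learner used in the paper.

```python
def best_stump(X, y, u):
    """Search one-feature threshold stumps h(x) = s * (+1 if x[k] > t else -1).

    Returns (score, k, t, s) maximizing score = sum_i u_i * y_i * h(x_i).
    Both signs s are tried, so the set is closed under complementation.
    """
    best = None
    for k in range(len(X[0])):
        values = sorted({row[k] for row in X})
        # One threshold below all points, then midpoints between neighbours.
        thresholds = [values[0] - 1.0]
        thresholds += [(lo + hi) / 2.0 for lo, hi in zip(values, values[1:])]
        for t in thresholds:
            score = sum(ui * yi * (1.0 if row[k] > t else -1.0)
                        for row, yi, ui in zip(X, y, u))
            for s, val in ((1, score), (-1, -score)):
                if best is None or val > best[0]:
                    best = (val, k, t, s)
    return best


def violates_dual(score, beta, tol=1e-9):
    """True if the returned hypothesis gives a violated dual constraint."""
    return score > beta + tol


# Made-up one-dimensional data with uniform initial costs u_i = 1/l.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [-1, -1, 1, 1]
u = [0.25] * 4

score, k, t, s = best_stump(X, y, u)
print(score, k, t, s, violates_dual(score, beta=0.0))
```

When `violates_dual` returns `False` for the best-scoring hypothesis, no column can improve the restricted LP and the column generation loop can stop.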
5. Confidence-rated boosting
The derivations and algorithm of the last two sections did not rely on the assumption that $H_{ij} \in \{-1, +1\}$. We can therefore apply the same reasoning to implementing a weak learning algorithm for a finite set of confidence-rated functions $\mathcal{F}$ whose outputs are real numbers. We again assume that $\mathcal{F}$ is closed under complementation. We simply define $H_{ij} = f_j(x_i)$ for each $f_j \in \mathcal{F}$ and apply the same algorithm as before. We again assume the existence of a base learner $\mathcal{F}(S, u)$, which finds a function $\hat{f} \in \mathcal{F}$ satisfying

$$\sum_{i=1}^{\ell} y_i \hat{f}(x_i) \hat{u}_i = \max_{f \in \mathcal{F}} \sum_{i=1}^{\ell} \hat{u}_i y_i f(x_i). \qquad (11)$$

The only difference in the associated algorithm is the base learner which now optimizes this equation.
Algorithm 5.1 (LPBoost-CRB).

Given as input training set: $S$
$m \leftarrow 0$    No weak hypotheses
$a \leftarrow 0$    All coefficients are 0
$\beta \leftarrow 0$
$u \leftarrow (\frac{1}{\ell}, \ldots, \frac{1}{\ell})$    Corresponding optimal dual
REPEAT
    $m \leftarrow m + 1$
    Find weak hypothesis using equation (11): $f_m \leftarrow \mathcal{F}(S, u)$
    Check for optimal solution: If $\sum_{i=1}^{\ell} u_i y_i f_m(x_i) \le \beta$, $m \leftarrow m - 1$, break
    $H_{im} \leftarrow f_m(x_i)$
    Solve restricted master for new costs:
    $(u, \beta) \leftarrow \arg\min \; \beta$
        s.t. $\sum_{i=1}^{\ell} u_i y_i f_j(x_i) \le \beta$, $j = 1, \ldots, m$
             $\sum_{i=1}^{\ell} u_i = 1$
             $0 \le u_i \le D$, $i = 1, \ldots, \ell$
END
$a \leftarrow$ Lagrangian multipliers from last LP
return $m$, $f = \sum_{j=1}^{m} a_j f_j$
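The loop of Algorithm 5.1 can be sketched in Python. This is an illustrative sketch rather than the authors' implementation, and it makes several assumptions: NumPy and SciPy are available, the restricted master is solved with `scipy.optimize.linprog` using the HiGHS solver (SciPy 1.7 or later, which exposes `res.ineqlin.marginals`), and an exhaustive pool of threshold stumps (`stump_pool`, a hypothetical helper) stands in for the base learner. The stumps output $\pm 1$ for brevity, but nothing in the loop depends on that.

```python
import numpy as np
from scipy.optimize import linprog


def stump_pool(X):
    """Outputs of all threshold stumps and their complements, one per column."""
    cols = []
    for k in range(X.shape[1]):
        vals = np.unique(X[:, k])
        for t in (vals[:-1] + vals[1:]) / 2.0:
            h = np.where(X[:, k] > t, 1.0, -1.0)
            cols += [h, -h]
    return np.array(cols).T


def lpboost(pool, y, nu=0.2, tol=1e-6, max_iter=100):
    """Column generation on dual (5); `pool` stands in for the base learner."""
    l = len(y)
    D = 1.0 / (l * nu)
    u, beta = np.full(l, 1.0 / l), 0.0
    chosen, a = [], np.zeros(0)
    for _ in range(max_iter):
        scores = (u * y) @ pool              # sum_i u_i y_i f(x_i) per candidate
        j = int(np.argmax(scores))           # oracle: most violated constraint
        if scores[j] <= beta + tol:          # none violated: optimal ensemble
            break
        chosen.append(j)
        m = len(chosen)
        # Restricted master over variables (u_1..u_l, beta): minimize beta.
        c = np.r_[np.zeros(l), 1.0]
        A_ub = np.c_[(pool[:, chosen] * y[:, None]).T, -np.ones(m)]
        A_eq = np.r_[np.ones(l), 0.0][None, :]
        res = linprog(c, A_ub=A_ub, b_ub=np.zeros(m), A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0.0, D)] * l + [(None, None)], method="highs")
        u, beta = res.x[:l], res.x[l]
        # Ensemble weights are the multipliers of the column constraints
        # (HiGHS reports them as non-positive marginals for a minimization).
        a = -res.ineqlin.marginals
    return chosen, a, u, beta


# Made-up separable data on one feature.
X = np.arange(6, dtype=float).reshape(-1, 1)
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
pool = stump_pool(X)
chosen, a, u, beta = lpboost(pool, y)
pred = np.sign(pool[:, chosen] @ a)
print(len(chosen), a.round(3), (pred == y).all())
```

Reading the ensemble weights from the multipliers of the last restricted master mirrors the final step of the algorithm; solving the restricted primal (4) explicitly would be an equally valid design choice.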
6. LPBoost for regression
The LPBoost algorithm can be extended to optimize any ensemble cost function that can be formulated as a linear program. To solve alternate formulations we need only change the LP restricted master problem solved at each iteration and the criteria given to the base learner. The only assumptions in the current approach are that the number of weak hypotheses be finite and that if an improving weak hypothesis exists then the base learner can generate it. To see a simple example of this consider the problem of boosting regression functions. We use the following adaptation of the SVM regression formulations. This LP was also adapted to boosting using a barrier algorithm in Rätsch et al. (2000c). We assume we are given a training set of data $S = ((x_1, y_1), \ldots, (x_\ell, y_\ell))$, but now $y_i$ may take on any real value.
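$$\begin{aligned}
\min_{a, \xi, \xi^*, \epsilon} \quad & \epsilon + C \sum_{i=1}^{\ell} (\xi_i + \xi_i^*) \\
\text{s.t.} \quad & H_i a - y_i - \xi_i \le \epsilon, \quad \xi_i \ge 0, \quad i = 1, \ldots, \ell \\
& H_i a - y_i + \xi_i^* \ge -\epsilon, \quad \xi_i^* \ge 0, \quad i = 1, \ldots, \ell \\
& \sum_{j=1}^{m} a_j = 1, \quad a_j \ge 0, \quad j = 1, \ldots, m
\end{aligned}$$

First we reformulate the problem slightly differently:

$$\begin{aligned}
\min_{a, \xi, \xi^*, \epsilon} \quad & \epsilon + C \sum_{i=1}^{\ell} (\xi_i + \xi_i^*) \\
\text{s.t.} \quad & -H_i a + \xi_i + \epsilon \ge -y_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, \ell \\
& H_i a + \xi_i^* + \epsilon \ge y_i, \quad \xi_i^* \ge 0, \quad i = 1, \ldots, \ell \\
& -\sum_{j=1}^{m} a_j = -1, \quad a_j \ge 0, \quad j = 1, \ldots, m
\end{aligned}$$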
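We introduce Lagrangian multipliers $(u, u^*, \beta)$, construct the dual, and convert to a minimization problem to yield:

$$\begin{aligned}
\min_{u, u^*, \beta} \quad & \beta + \sum_{i=1}^{\ell} y_i (u_i - u_i^*) \\
\text{s.t.} \quad & \sum_{i=1}^{\ell} (-u_i + u_i^*) H_{ij} \le \beta, \quad j = 1, \ldots, m \\
& \sum_{i=1}^{\ell} (u_i + u_i^*) = 1 \\
& 0 \le u_i \le C, \quad 0 \le u_i^* \le C, \quad i = 1, \ldots, \ell
\end{aligned} \qquad (14)$$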
LP (14) restricted to all weak hypotheses constructed so far becomes the new master problem. If the base learner returns any hypothesis $H_{.j}$ that is not dual feasible, i.e.
Algorithm 6.1 (LPBoost-Regression).

Given as input training set: $S$
$m \leftarrow 0$    No weak hypotheses
$a \leftarrow 0$    All coefficients are 0
$\beta \leftarrow 0$
$u_i \leftarrow \frac{(-y_i)_+}{\|y\|_1}$    Corresponding feasible dual
$u_i^* \leftarrow \frac{(y_i)_+}{\|y\|_1}$
REPEAT
    $m \leftarrow m + 1$
    Find weak hypothesis using equation (17): $h_m \leftarrow \mathcal{H}(S, (-u + u^*))$
    Check for optimal solution: If $\sum_{i=1}^{\ell} (-u_i + u_i^*) h_m(x_i) \le \beta$, $m \leftarrow m - 1$, break
    $H_{im} \leftarrow h_m(x_i)$
    Solve restricted master for new costs:
    $(u, u^*, \beta) \leftarrow \arg\min \; \beta + \sum_{i=1}^{\ell} (u_i - u_i^*) y_i$
        s.t. $\sum_{i=1}^{\ell} (-u_i + u_i^*) h_j(x_i) \le \beta$, $j = 1, \ldots, m$
             $\sum_{i=1}^{\ell} (u_i + u_i^*) = 1$
             $0 \le u_i, u_i^* \le C$, $i = 1, \ldots, \ell$
END
$a \leftarrow$ Lagrangian multipliers from last LP
return $m$, $f = \sum_{j=1}^{m} a_j h_j$
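The initialization step of the regression algorithm can be checked numerically. The sketch below reads the starting costs as $u_i = (-y_i)_+ / \|y\|_1$ and $u_i^* = (y_i)_+ / \|y\|_1$, which is an interpretation of the initialization rather than a confirmed transcription; the function name and target values are made up.

```python
def initial_regression_dual(y):
    """Return (u, u_star) with u_i = (-y_i)_+ / ||y||_1, u*_i = (y_i)_+ / ||y||_1."""
    norm1 = sum(abs(v) for v in y)
    if norm1 == 0:
        raise ValueError("targets must not all be zero")
    u = [max(-v, 0.0) / norm1 for v in y]
    u_star = [max(v, 0.0) / norm1 for v in y]
    return u, u_star


# Hypothetical real-valued targets.
y = [2.0, -1.0, 0.0, 1.0]
u, u_star = initial_regression_dual(y)

total = sum(u) + sum(u_star)                      # must equal 1 for feasibility
costs = [us - ui for ui, us in zip(u, u_star)]    # the vector -u + u* for the learner
print(total, costs)
```

With this choice the costs passed to the base learner are the targets scaled by $1/\|y\|_1$, so the first weak hypothesis is the one best correlated with the targets. The upper bounds $u_i, u_i^* \le C$ still have to hold, which requires $\max_i |y_i| / \|y\|_1 \le C$.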
7. Hard margins, soft margins, and sparsity
The column generation algorithm can also be applied to the hard margin LP error function for boosting. In fact the DualLPBoost proposed by Grove and Schuurmans (1998) does exactly this. Breiman (1999) also investigated an equivalent formulation using an asymptotic algorithm. Both papers found that optimizing the hard margin LP to construct ensembles did not work well in practice. In contrast the soft margin LP ensemble methods optimized using column generation investigated in this paper and using an arcing approach in Rätsch et al. (2000b) worked well (see Section 8). Poor performance of hard margin versus soft margin classification methods has been noted in other contexts as well. In a computational study of the hard margin Multisurface-Method (MSM) for classification (Mangasarian, 1965) and the soft margin Robust Linear Programming (RLP) method (Bennett & Mangasarian, 1992) (both closely related LP precursors to Boser et al.'s Support Vector Machine (Boser, Guyon, & Vapnik, 1992; Cortes & Vapnik, 1995)), the soft margin RLP performed uniformly better than the hard margin MSM. In this section we will examine the critical difference between hard and soft margin classifiers geometrically through a simple example. This discussion will also illustrate some of the practical issues of using a column generation approach to solving the soft margin problems.
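The hard margin ensemble LP found in Grove and Schuurmans (1998) expressed in the notation of this paper is:

$$\begin{aligned}
\max_{a, \rho} \quad & \rho \\
\text{s.t.} \quad & y_i H_i a \ge \rho, \quad i = 1, \ldots, \ell \\
& \sum_{j=1}^{m} a_j = 1, \quad a_j \ge 0, \quad j = 1, \ldots, m
\end{aligned}$$

This is the primal formulation. The dual of the hard margin problem is

$$\begin{aligned}
\min_{u, \beta} \quad & \beta \\
\text{s.t.} \quad & \sum_{i=1}^{\ell} u_i y_i H_{ij} \le \beta, \quad j = 1, \ldots, m \\
& \sum_{i=1}^{\ell} u_i = 1, \quad 0 \le u_i, \quad i = 1, \ldots, \ell
\end{aligned}$$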
Let us examine geometrically what the hard and soft margin formulations do using concepts used to describe the geometry of SVM in Bennett and Bredensteiner (2000). Consider the LP subproblem in the column generation algorithm after sufficient weak hypotheses have been generated such that two classes are linearly separable. Specifically, there exist $\rho > 0$ and $a$ such that $y_i H_i a \ge \rho > 0$ for $i = 1, \ldots, \ell$. Figure 1 gives an example of two confidence rated hypotheses (labels between 0 and 1). The left figure shows the separating hyperplane in the label space where each data point $x_i$ is plotted as $(h_1(x_i), h_2(x_i))$. The separating hyperplane is shown as a dotted line through the origin as there is no threshold. The minimum margin $\rho$ is positive and produces a very reasonable separating plane. The solution depends only on the two support vectors indicated by boxes. The right side shows the problem in dual or margin space where each point is plotted as $(y_i h_1(x_i), y_i h_2(x_i))$. Recall, a weak hypothesis is correct on a point if $y_i h(x_i)$ is positive. The convex hull of the points in the dual space is shown with dotted lines. The dual LP computes a point in the convex hull that is optimal by some criteria. When the data are linearly separable, the dual problem finds the
Figure 1. No noise hard margin LP solution for two confidence-rated hypotheses. Left is the separation in label space. Right is the separation in dual or margin space.
Figure 3. Noisy soft margin LP solution for two confidence-rated hypotheses. Left is the separation in label space. Right is the separation in dual or margin space.
margin LP (see Eq. (2)) both in the original label space on the left and the dual margin space on the right. On the left side, we see that the separating plane in the hypothesis space is very close to that of the hard margin solution for the no-noise case shown in figure 1. This seems to be a desirable solution. In general, the soft margin formulation is much less sensitive to noise. There are some notable differences in the dual solution shown on the right side of figure 3. The dual LP for the soft margin case is almost identical to the hard margin case with one critical difference: the dual variables are now bounded above. For clarity we repeat the dual LP here.
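$$\begin{aligned}
\min_{u, \beta} \quad & \beta \\
\text{s.t.} \quad & \sum_{i=1}^{\ell} u_i y_i H_{ij} \le \beta, \quad j = 1, \ldots, m \\
& \sum_{i=1}^{\ell} u_i = 1, \quad 0 \le u_i \le D, \quad i = 1, \ldots, \ell
\end{aligned} \qquad (20)$$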
In our example, we used a misclassification cost in the primal of $D = 1/4$. In the dual, this has the effect of reducing the set of feasible dual points to a smaller set called the reduced convex hull (Bennett & Bredensteiner, 2000). If $D$ is sufficiently small, the reduced convex hull no longer intersects the negative orthant and we once again can return to the case of finding the closest point in the now reduced convex hull to the origin as in the linearly separable case. By adding the upper bound to the dual variables, any optimal dual vector $u$ will still be sparse but not as sparse as in the hard margin case. For $D = 1/k$ it must have at least $k$ positive elements. In the example, there are four such support vectors outlined in figure 3 with squares. For $D$ sufficiently small, by the LP complementarity conditions, any misclassified point will have a corresponding positive dual multiplier. In this case the support vectors also were drawn from both classes, but note that there is nothing in the formulation that guarantees this. To summarize, if we are calculating the optimal hard margin ensemble over a large enough hypothesis space such that the data is separable in the label space, it may work very well.
But in the early iterations of a column generation algorithm, the hard margin LP will be optimized over a small set of hypotheses such that the classes are not linearly separable in the label space. In this case we observe several problematic characteristics of the hard margin formulation: extreme sensitivity to noise (producing undesirable hypotheses weightings), extreme sparsity of the dual vector especially in the early iterations of a column generation algorithm, failure to assign positive Lagrangian multipliers to misclassified examples, and no guarantee that points will be drawn from both distributions. Although we examined these potential problems using the confidence-rated case in 2 dimensions it is easy to see that they hold true and are somewhat worse for the more typical case where the labels are restricted to 1 and −1. The soft margin LP adopted in this paper addresses some but not all of these problems. Adding soft margins makes the LP much less sensitive to noise. Adding soft margins to the primal corresponds to adding bounds on the dual multipliers. The constraint that the dual multipliers sum to one forces more of the multipliers to be positive both in the separable and inseparable cases. Furthermore the complementarity conditions of the soft margin LP guarantee that any point that violates the soft margin will have a positive multiplier. Assuming $D$ is sufficiently small, this means that every misclassified point will have a positive multiplier. But this geometric analysis illustrates that there are some potential problems with the soft margin LP. The column generation algorithm uses the dual costs as misclassification costs for the base learner to generate new hypotheses. So the characteristics of the dual solution are critical. For a small set of hypotheses, the LP will be degenerate, and the dual solution may still be quite sparse.
Any method that finds extreme point solutions will be biased to the sparsest dual optimal solution, when in practice less sparse solutions would be better suited as misclassification costs for the base learner. If the parameter $D$ is chosen too large the margin may still be negative so the LP will still suffer from the many problems found in the hard margin case. If the parameter $D$ is chosen too small then the problem reduces to the equal cost case so little advantage will be gained through using an ensemble method. Potentially, the distribution of the support vectors may still be highly skewed towards one class. All of these are potential problems in an LP-based ensemble method. As we will see in the following sections, they can arise in practice.
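The sparsity contrast between the hard and soft margin duals is easiest to see in the first column generation iteration, when only one weak hypothesis is available. In that special case the restricted dual reduces to minimizing $\sum_i u_i y_i h(x_i)$ over costs that sum to one with $0 \le u_i \le D$, and a greedy fill solves it exactly. The sketch below is an illustration under that single-hypothesis assumption; the function name and margin values are invented.

```python
def single_hypothesis_dual(margins, D):
    """Minimize sum_i u_i * margins[i] s.t. sum(u) = 1 and 0 <= u_i <= D.

    `margins[i]` plays the role of y_i * h(x_i). The optimum puts as much
    cost as allowed on the points with the smallest margins.
    """
    if D * len(margins) < 1.0 - 1e-12:
        raise ValueError("infeasible: D is smaller than 1/l")
    u = [0.0] * len(margins)
    remaining = 1.0
    for i in sorted(range(len(margins)), key=lambda i: margins[i]):
        u[i] = min(D, remaining)
        remaining -= u[i]
        if remaining <= 1e-12:
            break
    beta = sum(ui * mi for ui, mi in zip(u, margins))
    return u, beta


# Made-up confidence-rated margins y_i * h(x_i) for eight points.
margins = [0.9, 0.8, -0.6, 0.7, -0.2, 0.5, 0.3, 0.95]

u_hard, beta_hard = single_hypothesis_dual(margins, D=1.0)    # hard margin
u_soft, beta_soft = single_hypothesis_dual(margins, D=0.25)   # soft margin

print(sum(v > 0 for v in u_hard), beta_hard)
print(sum(v > 0 for v in u_soft), beta_soft)
```

With $D = 1$ all the cost lands on the single worst point, so the base learner would next be trained on essentially one example. With $D = 1/4$ the cost is spread over four points, matching the statement that for $D = 1/k$ at least $k$ multipliers must be positive.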
8. Computational experiments
We performed three sets of experiments to compare the performance of LPBoost, CRB, and AdaBoost on three classification tasks: one boosting decision tree stumps on smaller datasets and two boosting C4.5 (Quinlan, 1996). For decision tree stumps six datasets were used, LPBoost was run until the optimal ensemble was found, and AdaBoost was stopped at 100 and 1000 iterations. For the C4.5 experiments, we report results for four large datasets with and without noise. Finally, to further validate C4.5, we experimented with ten additional datasets. The rationale was to first evaluate LPBoost where the base learner solves (10) exactly and the optimal ensemble can be found by LPBoost. Then our goal was to examine LPBoost in a more realistic environment by using C4.5 as a base learner