CS260: Machine Learning Algorithms
Lecture 6: Nonlinear mapping, Model complexity
Cho-Jui Hsieh
UCLA
Jan 28, 2019

Nonlinear Transformation

Circular Separable
The data is not linearly separable, but it is circularly separable by a circle of radius $\sqrt{0.6}$ centered at the origin:
  $h_{SEP}(x) = \text{sign}(-x_1^2 - x_2^2 + 0.6)$

Circular Separable and Linear Separable
  $h(x) = \text{sign}(\underbrace{0.6}_{\tilde{w}_0} \cdot \underbrace{1}_{z_0} + \underbrace{(-1)}_{\tilde{w}_1} \cdot \underbrace{x_1^2}_{z_1} + \underbrace{(-1)}_{\tilde{w}_2} \cdot \underbrace{x_2^2}_{z_2}) = \text{sign}(\tilde{w}^T z)$
$\{(x_n, y_n)\}$ circularly separable $\Rightarrow$ $\{(z_n, y_n)\}$ linearly separable
$x \in \mathcal{X} \to z \in \mathcal{Z}$ (using a nonlinear transformation $\Phi$)

Nonlinear Transformation
Define the nonlinear transformation $\Phi(x) = (1, x_1^2, x_2^2) = (z_0, z_1, z_2) = z$
Linear hypotheses in $\mathcal{Z}$-space: $\text{sign}(\tilde{h}(z)) = \text{sign}(\tilde{h}(\Phi(x))) = \text{sign}(\tilde{w}^T \Phi(x))$
Lines in $\mathcal{Z}$-space $\Leftrightarrow$ some quadratic curves in $\mathcal{X}$-space

General Quadratic Hypothesis Set
A "bigger" $\mathcal{Z}$-space: $\Phi_2(x) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2)$
Linear in $\mathcal{Z}$-space $\Leftrightarrow$ quadratic hypotheses in $\mathcal{X}$-space
The hypothesis space: $\mathcal{H}_{\Phi_2} = \{h(x) : h(x) = \tilde{w}^T \Phi_2(x) \text{ for some } \tilde{w}\}$ (quadratic hypotheses)
It also includes the linear model as a degenerate case

Learning a good quadratic function
Transform the original data $\{(x_n, y_n)\}$ to $\{(z_n = \Phi(x_n), y_n)\}$
Solve a linear problem on $\{(z_n, y_n)\}$ using your favorite algorithm $\mathcal{A}$ to get a good model $\tilde{w}$
Return the model $h(x) = \text{sign}(\tilde{w}^T \Phi(x))$

The price we pay: Computational complexity
$Q$-th order polynomial transform:
  $\Phi(x) = (1, x_1, x_2, \cdots, x_d, x_1^2, x_1 x_2, \cdots, x_d^2, \cdots, x_1^Q, x_1^{Q-1} x_2, \cdots, x_d^Q)$
$O(d^Q)$-dimensional vector $\Rightarrow$ high computational cost

The price we pay: Overfitting
Overfitting: the model has low training error but high prediction error.
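Before moving on to the theory, here is a minimal Python sketch of the recipe from "Learning a good quadratic function" above: it maps circularly separable toy data through $\Phi_2$ and runs a plain perceptron as the linear algorithm $\mathcal{A}$ in $\mathcal{Z}$-space. The toy data, the perceptron choice, and the function names are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def phi2(x):
    """Quadratic transform Phi_2(x) = (1, x1, x2, x1^2, x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x1 * x2, x2**2])

def perceptron(Z, y, epochs=100):
    """Plain perceptron run in Z-space; returns the learned weight vector w_tilde."""
    w = np.zeros(Z.shape[1])
    for _ in range(epochs):
        for z, target in zip(Z, y):
            if np.sign(w @ z) != target:  # misclassified (or on the boundary)
                w += target * z           # perceptron update
    return w

# Toy circularly separable data: label +1 inside the circle x1^2 + x2^2 = 0.6 (an assumption).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 0.6, 1, -1)

Z = np.array([phi2(x) for x in X])        # transform to Z-space
w_tilde = perceptron(Z, y)                # linear learning in Z-space

h = lambda x: np.sign(w_tilde @ phi2(x))  # returned model h(x) = sign(w_tilde^T Phi(x))
print("training error:", np.mean([h(x) != t for x, t in zip(X, y)]))
```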
Theory of Generalization
Material is from "Learning from data"

Training versus testing
Does low training error imply low test error?
They can be totally different if the training distribution $\neq$ the test distribution
Even under the same distribution, they can be very different, because $h$ is picked to minimize the training error, not the test error

Formal definition
Assume the training and test data are both sampled from $\mathcal{D}$
The ideal function (for generating labels) is $f$: $f(x) \to y$
Training error: sample $x_1, \cdots, x_N$ from $\mathcal{D}$ and $E_{tr}(h) = \frac{1}{N} \sum_{n=1}^{N} e(h(x_n), f(x_n))$
  $h$ is determined by $x_1, \cdots, x_N$
Test error: sample $x_1, \cdots, x_M$ from $\mathcal{D}$ and $E_{te}(h) = \frac{1}{M} \sum_{m=1}^{M} e(h(x_m), f(x_m))$
  $h$ is independent of $x_1, \cdots, x_M$
Generalization error = test error = expected performance on $\mathcal{D}$:
  $E(h) = \mathbb{E}_{x \sim \mathcal{D}}[e(h(x), f(x))] = E_{te}(h)$
How can we bound $|E(h) - E_{tr}(h)|$?

Inferring Something Unknown
Consider a bin with red and green marbles
  $P[\text{picking a red marble}] = \mu$
  $P[\text{picking a green marble}] = 1 - \mu$
The value of $\mu$ is unknown to us
How to infer $\mu$? Pick $N$ marbles independently; $\nu$: the fraction of red marbles

Inferring with probability
Do you know $\mu$? No: the sample can be mostly green while the bin is mostly red
Can you say something about $\mu$? Yes: $\nu$ is "probably" close to $\mu$

Connection to Learning
How to connect this to learning?
Each marble (uncolored) is a data point $x \in \mathcal{X}$
  red marble: $h(x) \neq f(x)$ ($h$ is wrong)
  green marble: $h(x) = f(x)$ ($h$ is correct)

Connection to Learning
Given a function $h$, if we randomly draw $x_1, \cdots, x_N$ (independent of $h$):
  $E(h) = \mathbb{E}_{x \sim \mathcal{D}}[h(x) \neq f(x)]$ (generalization error, unknown) $\Leftrightarrow \mu$
  $\frac{1}{N} \sum_{n=1}^{N} [h(x_n) \neq y_n]$ (error on sampled data, known) $\Leftrightarrow \nu$
Based on Hoeffding's inequality: $P[|\mu - \nu| > \epsilon] \leq 2e^{-2\epsilon^2 N}$
"$\mu = \nu$" is probably approximately correct (PAC)
However, this can only "verify" the error of a hypothesis: $h$ and $x_1, \cdots, x_N$ must be independent

Apply to multiple bins (hypotheses)
Can we apply this to multiple hypotheses?
The color in each bin depends on a different hypothesis
Bingo when we get all green balls?

Coin Game
If you have 150 fair coins, flip each coin 5 times, and one of them comes up heads all 5 times, is this coin ($g$) special?
No. The probability that at least one of the coins comes up heads 5 times is $1 - (\frac{31}{32})^{150} > 99\%$
The point: when $M$ is large, there can exist some $h$ for which $E$ and $E_{tr}$ are far apart

When is learning successful?
When our learning algorithm $\mathcal{A}$ picks the hypothesis $g$:
  $P[|E_{tr}(g) - E(g)| > \epsilon] \leq 2Me^{-2\epsilon^2 N}$
If $M$ is small and $N$ is large enough:
  if $\mathcal{A}$ finds $E_{tr}(g) \approx 0$ $\Rightarrow$ $E(g) \approx 0$ (learning is successful!)
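The figure $1 - (31/32)^{150} > 99\%$ from the coin game above is easy to check numerically. Below is a small Monte Carlo sketch added as an illustration (not from the slides); the trial count is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
num_coins, num_flips, num_trials = 150, 5, 50_000

# Each trial: flip 150 fair coins 5 times each and count heads per coin.
heads = rng.binomial(num_flips, 0.5, size=(num_trials, num_coins))
any_all_heads = (heads == num_flips).any(axis=1)   # some coin showed 5 heads

print("simulated  :", any_all_heads.mean())        # close to 0.99
print("closed form:", 1 - (31 / 32) ** num_coins)  # about 0.9915, i.e. > 99%
```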
Feasibility of Learning
  $P[|E_{tr}(g) - E(g)| > \epsilon] \leq 2Me^{-2\epsilon^2 N}$
Two questions:
  (1) Can we make sure $E(g) \approx E_{tr}(g)$?
  (2) Can we make sure $E(g) \approx 0$?
$M$: the complexity of the model
  Small $M$: (1) holds, but (2) may not hold (too few choices; under-fitting)
  Large $M$: (1) doesn't hold, but (2) may hold (over-fitting)

What the theory will achieve
Currently we only know $P[|E_{tr}(g) - E(g)| > \epsilon] \leq 2Me^{-2\epsilon^2 N}$
What if $M = \infty$? (e.g., linear hyperplanes)
To do: establish a finite quantity that replaces $M$:
  $P[|E_{tr}(g) - E(g)| > \epsilon] \overset{?}{\leq} 2m_{\mathcal{H}}(N)e^{-2\epsilon^2 N}$
Study $m_{\mathcal{H}}(N)$ to understand the trade-off for model complexity

Conclusions
Polynomial feature expansion: our first nonlinear model
Bounding the generalization (test) error
Questions?
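As a closing numerical illustration of the trade-off summarized above (an addition, not from the slides): setting the right-hand side of $P[|E_{tr}(g) - E(g)| > \epsilon] \leq 2Me^{-2\epsilon^2 N}$ to a tolerance $\delta$ and solving for $\epsilon$ gives $\epsilon = \sqrt{\ln(2M/\delta)/(2N)}$, so the guaranteed gap grows only logarithmically with the number of hypotheses $M$ but shrinks like $1/\sqrt{N}$. The sketch below tabulates this; the particular values of $M$, $N$, and $\delta$ are arbitrary choices.

```python
import math

def generalization_gap(N, M, delta=0.05):
    """Smallest eps with 2*M*exp(-2*eps**2*N) <= delta, i.e. with probability
    at least 1 - delta the gap |E_tr(g) - E(g)| is at most eps."""
    return math.sqrt(math.log(2 * M / delta) / (2 * N))

# A richer hypothesis set (larger M) needs more data (larger N) for the same guarantee.
for M in (1, 100, 10_000):
    for N in (100, 1_000, 10_000):
        print(f"M={M:>6}  N={N:>6}  eps={generalization_gap(N, M):.3f}")
```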