CS260: Machine Learning Algorithms, Lecture 6 (lecture notes)

These notes cover nonlinear mapping and model complexity in machine learning: circularly separable versus linearly separable data, nonlinear transformations, quadratic hypotheses, overfitting, and generalization error, with examples and formulas illustrating these concepts. They are intended as study material for university students in machine learning and related fields.

CS260: Machine Learning Algorithms
Lecture 6: Nonlinear Mapping, Model Complexity
Cho-Jui Hsieh, UCLA, Jan 28, 2019

Nonlinear Transformation

Circularly Separable Data
The data is not linearly separable, but it is circularly separable by a circle of radius √0.6 centered at the origin:
h_SEP(x) = sign(−x_1^2 − x_2^2 + 0.6)

Circularly Separable and Linearly Separable
Rewrite the hypothesis as
h(x) = sign(0.6 · 1 + (−1) · x_1^2 + (−1) · x_2^2) = sign(w̃_0 z_0 + w̃_1 z_1 + w̃_2 z_2) = sign(w̃^T z),
with w̃_0 = 0.6, w̃_1 = w̃_2 = −1 and z_0 = 1, z_1 = x_1^2, z_2 = x_2^2.
{(x_n, y_n)} circularly separable ⇒ {(z_n, y_n)} linearly separable.
x ∈ X → z ∈ Z (using a nonlinear transformation Φ).

Nonlinear Transformation
Define the nonlinear transformation Φ(x) = (1, x_1^2, x_2^2) = (z_0, z_1, z_2) = z.
Linear hypotheses in Z-space: sign(h̃(z)) = sign(h̃(Φ(x))) = sign(w̃^T Φ(x)).
Lines in Z-space ⇔ some quadratic curves in X-space.

General Quadratic Hypothesis Set
A "bigger" Z-space: Φ_2(x) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2).
Linear in Z-space ⇔ quadratic hypotheses in X-space.
The hypothesis space: H_Φ2 = {h(x) : h(x) = w̃^T Φ_2(x) for some w̃} (quadratic hypotheses).
It also includes the linear model as a degenerate case.

Learning a good quadratic function
Transform the original data {x_n, y_n} to {z_n = Φ(x_n), y_n}.
Solve a linear problem on {z_n, y_n} using your favorite algorithm A to get a good model w̃.
Return the model h(x) = sign(w̃^T Φ(x)).

The price we pay: computational complexity
Q-th order polynomial transform:
Φ(x) = (1, x_1, x_2, ..., x_d, x_1^2, x_1 x_2, ..., x_d^2, ..., x_1^Q, x_1^{Q−1} x_2, ..., x_d^Q)
This is an O(d^Q)-dimensional vector ⇒ high computational cost.

The price we pay: overfitting
Overfitting: the model has low training error but high prediction error.
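To make the transform-then-learn recipe concrete, here is a minimal Python sketch (not part of the original slides; the synthetic data, the helper phi2, and the choice of the perceptron as the linear algorithm A are illustrative assumptions). It labels points with the circular target h_SEP, maps them through Φ_2, and runs a linear learner in Z-space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Circularly separable toy data, labeled by the target h_SEP(x) = sign(0.6 - x1^2 - x2^2)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.where(0.6 - X[:, 0]**2 - X[:, 1]**2 > 0, 1, -1)

def phi2(X):
    """Quadratic transform Phi_2(x) = (1, x1, x2, x1^2, x1*x2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])

Z = phi2(X)  # work in Z-space

# Perceptron learning algorithm (PLA) as the linear algorithm "A" in Z-space
w = np.zeros(Z.shape[1])
for _ in range(10_000):
    preds = np.where(Z @ w > 0, 1, -1)
    mistakes = np.flatnonzero(preds != y)
    if mistakes.size == 0:        # data is linearly separable in Z-space, so PLA stops
        break
    n = mistakes[0]
    w += y[n] * Z[n]              # PLA update: w <- w + y_n * z_n

# Learned hypothesis in the original X-space: h(x) = sign(w^T Phi_2(x))
print("training accuracy:", np.mean(np.where(Z @ w > 0, 1, -1) == y))
```

Any linear algorithm could replace the PLA loop; the only lecture-specific ingredient is the feature map Φ_2.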
Theory of Generalization
(Material is from "Learning from Data".)

Training versus testing
Does low training error imply low test error?
They can be totally different if the training distribution ≠ the test distribution.
Even under the same distribution, they can be very different, because h is picked to minimize training error, not test error.

Formal definition
Assume training and test data are both sampled from D.
The ideal function (for generating labels) is f: f(x) → y.
Training error: sample x_1, ..., x_N from D and E_tr(h) = (1/N) Σ_{n=1}^N e(h(x_n), f(x_n)); h is determined by x_1, ..., x_N.
Test error: sample x_1, ..., x_M from D and E_te(h) = (1/M) Σ_{m=1}^M e(h(x_m), f(x_m)); h is independent of x_1, ..., x_M.
Generalization error = test error = expected performance on D:
E(h) = E_{x∼D}[e(h(x), f(x))] = E_te(h).
How can we bound |E(h) − E_tr(h)|?

Inferring Something Unknown
Consider a bin with red and green marbles.
P[picking a red marble] = µ, P[picking a green marble] = 1 − µ.
The value of µ is unknown to us. How do we infer µ?
Pick N marbles independently and let ν be the fraction of red marbles in the sample.

Inferring with probability
Do we know µ? No: the sample can be mostly green while the bin is mostly red.
Can we say something about µ? Yes: ν is "probably" close to µ.

Connection to Learning
How do we connect this to learning?
Each marble (uncolored) is a data point x ∈ X.
Red marble: h(x) ≠ f(x) (h is wrong on x).
Green marble: h(x) = f(x) (h is correct on x).

Connection to Learning
Given a function h, if we randomly draw x_1, ..., x_N (independent of h):
E(h) = E_{x∼D}[h(x) ≠ f(x)] (generalization error, unknown) ⇔ µ
(1/N) Σ_{n=1}^N [h(x_n) ≠ y_n] (error on the sampled data, known) ⇔ ν
Based on Hoeffding's inequality: P[|µ − ν| > ε] ≤ 2 e^{−2ε^2 N},
so "µ = ν" is probably approximately correct (PAC).
However, this can only "verify" the error of a hypothesis: h and x_1, ..., x_N must be independent.

Apply to multiple bins (hypotheses)
Can we apply this to multiple hypotheses?
The coloring in each bin depends on a different hypothesis.
Bingo when we get a bin with all green marbles?

Coin Game
If you have 150 fair coins, flip each coin 5 times, and one of them gets 5 heads, is this coin (g) special?
No. The probability that at least one of the coins comes up 5 heads is 1 − (31/32)^150 > 99%.
The lesson: if M is large, there can exist some h for which E and E_tr are far apart.
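The coin-game numbers are easy to check. Below is a short Python sketch (not from the slides; the simulation setup and trial count are illustrative assumptions) comparing the closed-form probability 1 − (31/32)^150 with a Monte Carlo estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Analytic: each fair coin shows 5 heads in 5 flips with probability 1/32,
# so P(at least one of 150 coins shows 5 heads) = 1 - (31/32)^150.
p_analytic = 1 - (31 / 32) ** 150
print(f"analytic : {p_analytic:.4f}")   # about 0.991, i.e. > 99%

# Monte Carlo check over many repetitions of the 150-coin experiment.
trials = 50_000
heads = rng.binomial(5, 0.5, size=(trials, 150))   # heads count for each coin
p_sim = (heads == 5).any(axis=1).mean()
print(f"simulated: {p_sim:.4f}")
```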
When is learning successful?
When our learning algorithm A picks the hypothesis g:
P[|E_tr(g) − E(g)| > ε] ≤ 2M e^{−2ε^2 N}.
If M is small and N is large enough, then whenever A finds E_tr(g) ≈ 0 we also get E(g) ≈ 0 (learning is successful!).

Feasibility of Learning
P[|E_tr(g) − E(g)| > ε] ≤ 2M e^{−2ε^2 N}.
Two questions:
(1) Can we make sure E(g) ≈ E_tr(g)?
(2) Can we make sure E(g) ≈ 0?
M reflects the complexity of the model.
Small M: (1) holds, but (2) may not hold (too few choices, under-fitting).
Large M: (1) does not hold, but (2) may hold (over-fitting).

What the theory will achieve
Currently we only know P[|E_tr(g) − E(g)| > ε] ≤ 2M e^{−2ε^2 N}.
What if M = ∞? (e.g., linear hyperplanes)
To do: establish a finite quantity m_H(N) to replace M,
P[|E_tr(g) − E(g)| > ε] ≤? 2 m_H(N) e^{−2ε^2 N},
and study m_H(N) to understand the trade-off for model complexity.

Conclusions
Polynomial feature expansion: our first nonlinear model.
Bounding the generalization (test) error.
Questions?
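As a small numerical footnote to the Feasibility of Learning discussion above (not part of the original slides; the specific values of M, N, and ε are illustrative assumptions), the sketch below evaluates the finite-hypothesis bound 2M e^{−2ε^2 N}.

```python
import numpy as np

def union_hoeffding_bound(M, N, eps):
    """Upper bound on P[|E_tr(g) - E(g)| > eps] for a finite hypothesis set of size M."""
    return 2 * M * np.exp(-2 * eps**2 * N)

eps = 0.1
for M in (10, 10_000):
    for N in (100, 10_000):
        b = union_hoeffding_bound(M, N, eps)
        # Bounds larger than 1 are vacuous: they say nothing about generalization.
        print(f"M = {M:>6}, N = {N:>6}: bound = {b:.3e}")
```

Increasing N tightens the bound exponentially, while increasing M (a richer model) loosens it only linearly, which is exactly the trade-off between questions (1) and (2) described above.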