Kernelization with Outer Product Instances | CMPS 290, Study notes of Computer Graphics

Material Type: Notes; Class: Advanced Topics in Computer Graphics; Subject: Computer Science; University: University of California-Santa Cruz; Term: Unknown 2009;

Typology: Study notes

Pre 2010

Uploaded on 08/19/2009

Kernelization with outer product instances
Manfred K. Warmuth
University of California - Santa Cruz
UC Berkeley, March 18, 2009

Outline

1 Kernel Paradigm

2 An iff condition for the kernel paradigm

3 The matrix case

4 Efficiency

Kernel Paradigm

Simple start - linear regression

Examples: $\langle S \rangle = \langle (x_1, y_1), \ldots, (x_t, y_t) \rangle$

Linear hypothesis: $w(\langle S \rangle) = w$

Predicts with $w \cdot x$ on instance $x$

What if the data is not close to linear?

[Figure: data that is not close to linear, plotted over the original features $z$]

Simply invent new features :-)

Does the expansion always work?

Can you always improve things by inventing new features?

Fitting the data may be

Is this learning?

Short answer

If good fit with non-sparse solution: maybe

If good fit with sparse solution: no

  • algorithms based on relative entropy regularization (multiplicative updates) are often better

The Kernel Trick [BGV92]

If $w$ is a linear combination of the expanded instances, then

$$\hat{y} = \underbrace{\sum_i a_i \phi(z_i)}_{w} \cdot \phi(z) = \sum_i a_i \underbrace{\phi(z_i) \cdot \phi(z)}_{K(z_i, z)}$$
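
As a small numeric check of the kernel trick, the sketch below predicts in two ways: explicitly (build $w$, then take $w \cdot \phi(z)$) and through the kernel alone. The degree-2 feature map `phi`, the quadratic kernel, and the instances and coefficients are illustrative assumptions, not taken from the slides.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(p * q for p, q in zip(u, v))

def phi(z):
    """Explicit degree-2 feature map for 2-d inputs (illustrative choice)."""
    z1, z2 = z
    return [z1 * z1, math.sqrt(2.0) * z1 * z2, z2 * z2]

def kernel(u, v):
    """Homogeneous quadratic kernel: K(u, v) = (u . v)^2 = phi(u) . phi(v)."""
    return dot(u, v) ** 2

# Hypothetical instances z_i, coefficients a_i, and a new instance z.
zs = [(1.0, 2.0), (-0.5, 0.3), (2.0, -1.0)]
a = [0.7, -1.2, 0.4]
z_new = (0.5, 1.5)

# Explicit route: w = sum_i a_i phi(z_i), then predict with w . phi(z).
w = [sum(a_i * phi(z_i)[k] for a_i, z_i in zip(a, zs)) for k in range(3)]
pred_explicit = dot(w, phi(z_new))

# Kernel route: sum_i a_i K(z_i, z); the expanded features are never formed.
pred_kernel = sum(a_i * kernel(z_i, z_new) for a_i, z_i in zip(a, zs))

print(pred_explicit, pred_kernel)
```

The two numbers agree up to floating-point rounding. The kernel route costs one dot product in the original space per instance, regardless of the dimension of the expansion.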

Visualizing the magic

[Figure: graph from source to sink with edge labels $1, z_1, \tilde{z}_1$; $1, z_2, \tilde{z}_2$; $1, z_3, \tilde{z}_3$]

One term per path

Good news

Many of our favorite algorithms can be “kernelized”: Linear Least Squares, Widrow-Hoff, Support Vector Machines, PCA, “Simplex Algorithm”, ...

Question:

What is the class of algorithms that can be kernelized?

Which algorithms are kernelizable? - The usual answer

Representer Theorem: [KW71]

$$w = \arg\inf_{\tilde{w}} \Big( \|\tilde{w}\|^2 + \eta \sum_i \mathrm{loss}_i(\tilde{w} \cdot x_i) \Big)$$

Solution $w$ is a linear combination of the $\phi(x_i)$

This is a sufficient condition for the parameter vector $w$ to be a linear combination of the instances
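
One way to see the linear-combination property concretely is to run gradient descent on the regularized objective from $w = 0$ and track the coefficients alongside the weight vector. The sketch below assumes squared loss, $\mathrm{loss}_i(p) = (p - y_i)^2$; the data, step size, and iteration count are hypothetical choices for illustration only.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(p * q for p, q in zip(u, v))

# Hypothetical data: t = 2 instances in R^3 with real-valued labels.
xs = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
ys = [1.0, -2.0]
eta, lr, steps = 0.5, 0.05, 500

# Kernel matrix of dot products between the instances.
K = [[dot(xi, xj) for xj in xs] for xi in xs]

w = [0.0, 0.0, 0.0]  # primal weight vector
a = [0.0, 0.0]       # dual coefficients, one per instance

for _ in range(steps):
    # Primal step on ||w||^2 + eta * sum_i (w . x_i - y_i)^2.
    res = [dot(w, xi) - yi for xi, yi in zip(xs, ys)]
    grad = [2.0 * w[k] + 2.0 * eta * sum(r * xi[k] for r, xi in zip(res, xs))
            for k in range(3)]
    w = [w[k] - lr * grad[k] for k in range(3)]

    # Dual step: the same update written on the coefficients,
    # touching the instances only through the kernel matrix K.
    res_dual = [sum(a[j] * K[j][i] for j in range(2)) - ys[i] for i in range(2)]
    a = [a[i] - lr * (2.0 * a[i] + 2.0 * eta * res_dual[i]) for i in range(2)]

# Rebuild the weight vector from the coefficients: w = sum_i a_i x_i.
w_from_a = [sum(a[i] * xs[i][k] for i in range(2)) for k in range(3)]
print(w, w_from_a)
```

Every primal iterate equals $\sum_i a_i x_i$ for the tracked coefficients, so the two printed vectors agree up to rounding, and the dual loop never reads an individual feature.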

An iff condition for the kernel paradigm

Kernel matrix

If $U$ is an orthogonal matrix, then $\langle US \rangle$ denotes the sequence

$\langle (Ux_1, y_1), (Ux_2, y_2), \ldots, (Ux_t, y_t) \rangle$

Transformation only affects the instances

$X'X = (UX)'\,UX$

Lemma

Two sequences of examples are orthogonal transformations of each other iff the kernel matrices associated with the sequences are the same
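
The easy direction of the lemma (an orthogonal transformation leaves the kernel matrix unchanged) can be checked numerically. This is a minimal sketch assuming a 2-d rotation for $U$ and made-up instances; it does not demonstrate the converse direction.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(p * q for p, q in zip(u, v))

def gram(xs):
    """Kernel matrix X'X: all pairwise dot products of the instances."""
    return [[dot(xi, xj) for xj in xs] for xi in xs]

def apply(U, x):
    """Matrix-vector product U x, with U given as a list of rows."""
    return [dot(row, x) for row in U]

# A rotation by theta is an orthogonal matrix (illustrative choice of U).
theta = 0.7
U = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]

# Hypothetical instances.
xs = [[1.0, 2.0], [-0.5, 0.3], [2.0, -1.0]]

K = gram(xs)
K_rot = gram([apply(U, x) for x in xs])

# Largest entrywise difference between the two kernel matrices.
max_diff = max(abs(K[i][j] - K_rot[i][j]) for i in range(3) for j in range(3))
print(max_diff)
```

The printed difference is at the level of floating-point rounding, matching $X'X = (UX)'\,UX$.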

Class of algorithms

Alg. produces weight vector

$w$ : example sequences $\to \mathbb{R}^n$

Actions of alg. depend on dot product

$w(\langle S \rangle) \cdot x$

$w$ is transformation invariant if

$w(\langle S \rangle) \cdot x = w(\langle US \rangle) \cdot Ux$,

for all sequences $\langle S \rangle$, orthogonal matrices $U$, and instances $x$
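
To make the definition concrete, the sketch below tests transformation invariance for one pass of the Widrow-Hoff update, one of the kernelizable algorithms named earlier. The learning rate, the zero start vector, the rotation $U$, and the example sequence are assumptions chosen for illustration.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(p * q for p, q in zip(u, v))

def apply(U, x):
    """Matrix-vector product U x, with U given as a list of rows."""
    return [dot(row, x) for row in U]

def widrow_hoff(examples, lr=0.1):
    """One pass of the Widrow-Hoff (LMS) update, starting from w = 0."""
    w = [0.0] * len(examples[0][0])
    for x, y in examples:
        err = dot(w, x) - y
        w = [wk - lr * err * xk for wk, xk in zip(w, x)]
    return w

# Orthogonal matrix U: a rotation (illustrative choice).
theta = 1.1
U = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]

# Hypothetical sequence <S>, its transform <US>, and a test instance x.
S = [([1.0, 2.0], 1.0), ([-0.5, 0.3], -1.0), ([2.0, -1.0], 0.5)]
US = [(apply(U, x), y) for x, y in S]
x_new = [0.4, -0.7]

pred = dot(widrow_hoff(S), x_new)                 # w(<S>) . x
pred_rot = dot(widrow_hoff(US), apply(U, x_new))  # w(<US>) . Ux
print(pred, pred_rot)
```

The two predictions coincide up to rounding, because each update only adds a multiple of the current instance and so $w(\langle US \rangle) = U\,w(\langle S \rangle)$.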

Accessing features violates formal def. of paradigm

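Consider alg.

$$w(\langle S \rangle) = \begin{cases} 0 & \text{if } x_{1,1} > 0 \\ x_1 & \text{otherwise} \end{cases}$$

where $x_{1,1}$ is the first feature of the first instance $x_1$.

Alg. satisfies

  2a: predicts with a linear combination of instances

but not

  2b: transformed sequence → transformed matrix

  1: transformation invariance

  3: coefficients only depend on kernel matrix

Use $U = -I$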
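
The failure of transformation invariance for this algorithm can be traced by hand with $U = -I$. The sketch below does so on a single made-up example; the function names are hypothetical.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(p * q for p, q in zip(u, v))

def feature_peeking_alg(examples):
    """w(<S>) = 0 if the first feature of x_1 is positive, else x_1."""
    x1 = examples[0][0]
    if x1[0] > 0:
        return [0.0] * len(x1)
    return list(x1)

def negate(x):
    """Apply the orthogonal matrix U = -I."""
    return [-v for v in x]

# Hypothetical sequence with x_{1,1} > 0, and a test instance x.
S = [([1.0, 2.0], 1.0)]
US = [(negate(x), y) for x, y in S]
x_new = [3.0, 1.0]

pred = dot(feature_peeking_alg(S), x_new)               # w = 0, so 0.0
pred_neg = dot(feature_peeking_alg(US), negate(x_new))  # w = -x_1, so 5.0
print(pred, pred_neg)
```

Both weight vectors are linear combinations of the instances (with coefficient 0 or 1), yet the predictions differ, so invariance fails even though the kernel matrix is the same for $\langle S \rangle$ and $\langle US \rangle$.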

Recall informal characterization of Kernel Paradigm

The parameter vector is a linear combination of instances

Too general?

  • By the above example, the alg. may not depend on dot products alone

“Algorithm only relies on dot products” (individual features never touched)

More general than the characterization of the above theorem