CS 188 Final Cheat Sheet, Study notes of Computer Science

CS 188 Final Cheat Sheet covering key concepts

Typology: Study notes

2024/2025

Uploaded on 12/10/2025

tamnhi-vu



HMM (Hidden Markov Model):
· Observe evidence at each timestep and incorporate it into our model.
· State variable: random variable encoding our belief at a timestep.
· Evidence variable: random variable encoding the observation at a timestep.
· Markov assumption of transition models: if we are trying to figure out the probability of $G_3$, all we need to know is $G_2$; also knowing $G_1$ and $G_0$ tells us nothing more.
· Represent an HMM using an initial distribution, a transition model, and a sensor model.
· Matrix multiplication (used with the transition model): $(m \times n)(n \times p)$; the inner dimensions must match, and the result is $(m \times p)$.
· ex: weather forecasting
  · State variable $W_i$ (the weather on day $i$)
  · Evidence variable $F_i$ (the forecast on day $i$)
  · Initial distribution $P(W_0)$
  · Sensor model $P(F_i \mid W_i)$
  · Transition model $P(W_{i+1} \mid W_i)$
· Chain: $W_0 \to W_1 \to W_2 \to W_3 \ldots$, with forecasts $F_1, F_2, F_3, \ldots$ attached to the states. Form the distribution of $W_0$, use that to calculate $W_1$ (time elapse update), then incorporate the forecast (observation update), and so on.
· 2 things needed to define an HMM:
  1) the basic probabilities $P(W_1 \mid W_0)$, $P(W_2 \mid W_1)$, ... (the transition model above)
  2) the evidence probability, aka the sensor model: the probability that you see some evidence at time $t$ given there's some state at time $t$
· Computation becomes exponential when the model is more complex.

Gibbs sampling:
· States = complete assignments to all variables.
· Init: fix the evidence variables (they can't change); randomly set the non-evidence variables.
· Generate subsequent states by looping through the non-evidence variables: each time, sample a new value for the chosen variable to generate a new sample, e.g.
  Sample: $S \sim P(S \mid c, r, w)$
  Sample: $C \sim P(C \mid s, r)$
  Sample: $W \sim P(W \mid s, r)$
· Repeating this gives us a bunch of samples.
· $P(X_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) = P(X_i \mid \text{markov-blanket}(X_i))$
· Considers both downstream and upstream evidence.
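The Gibbs sampling loop described in these notes can be sketched as follows. This is a minimal illustration, not the course's reference implementation: the four-variable net (C -> S, C -> R, (S, R) -> W), every CPT number, and the helper names `bern`, `joint`, and `gibbs` are assumptions made up for the example. Each resampling step computes the conditional from the full joint; the factors outside the variable's Markov blanket cancel in the ratio, so this matches sampling from $P(X_i \mid \text{markov-blanket}(X_i))$.

```python
import random

# Hypothetical CPTs for a small net C -> S, C -> R, (S, R) -> W.
# All numbers are made up for illustration only.
P_C = 0.5                                    # P(+c)
P_S_GIVEN_C = {True: 0.1, False: 0.5}        # P(+s | C)
P_R_GIVEN_C = {True: 0.8, False: 0.2}        # P(+r | C)
P_W_GIVEN_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.9, (False, False): 0.01}  # P(+w | S, R)


def bern(p_true, value):
    """Probability of a boolean value when P(True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(a):
    """Joint probability of a complete assignment (product of CPT entries)."""
    return (bern(P_C, a["C"])
            * bern(P_S_GIVEN_C[a["C"]], a["S"])
            * bern(P_R_GIVEN_C[a["C"]], a["R"])
            * bern(P_W_GIVEN_SR[(a["S"], a["R"])], a["W"]))


def gibbs(evidence, n_samples, rng):
    hidden = [v for v in ("C", "S", "R", "W") if v not in evidence]
    state = dict(evidence)                   # evidence is fixed and never changes
    for v in hidden:                         # random init of non-evidence variables
        state[v] = rng.random() < 0.5
    samples = []
    for _ in range(n_samples):
        for v in hidden:                     # resample one variable at a time
            p_true = joint({**state, v: True})
            p_false = joint({**state, v: False})
            state[v] = rng.random() < p_true / (p_true + p_false)
        samples.append(dict(state))
    return samples


samples = gibbs({"R": True}, 2000, random.Random(0))
estimate_c = sum(s["C"] for s in samples) / len(samples)  # estimate of P(+c | +r)
```

With these made-up numbers the exact answer is $P(+c \mid +r) = 0.8$, so `estimate_c` should land near that value.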

Forward Alg (for the HMM above; it combines the two updates below):
  $B(W_{i+1}) \propto P(f_{i+1} \mid W_{i+1}) \sum_{w_i} P(W_{i+1} \mid w_i) B(w_i)$
· Belief distribution at time $i$ given the observed evidence $f_1, \ldots, f_i$:
  $B(W_i) = P(W_i \mid f_1, \ldots, f_i)$
  $B'(W_i) = P(W_i \mid f_1, \ldots, f_{i-1})$
· Step 1, time elapse update: advance the model's state by one timestep.
  $B'(W_{i+1}) = \sum_{w_i} P(W_{i+1} \mid w_i) B(w_i)$
· Step 2, observation update: incorporate new evidence.
  $B(W_{i+1}) \propto P(f_{i+1} \mid W_{i+1}) B'(W_{i+1})$
· Step 2 is similar to exact inference with Bayes' rule.
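The two forward-algorithm updates can be sketched in a few lines. This is only an illustration under stated assumptions: the two weather states, the forecast values "good"/"bad", all probabilities, and the function names `time_elapse`, `observe`, and `forward` are hypothetical and not taken from the notes.

```python
STATES = ("sun", "rain")

# Hypothetical model parameters, chosen only for illustration.
PRIOR = {"sun": 0.5, "rain": 0.5}                         # P(W_0)
TRANSITION = {"sun": {"sun": 0.8, "rain": 0.2},           # TRANSITION[w_i][w_next]
              "rain": {"sun": 0.4, "rain": 0.6}}
SENSOR = {"sun": {"good": 0.9, "bad": 0.1},               # SENSOR[w][f] = P(f | w)
          "rain": {"good": 0.3, "bad": 0.7}}


def time_elapse(belief):
    """Step 1: B'(W_next) = sum over w_i of P(W_next | w_i) * B(w_i)."""
    return {nxt: sum(TRANSITION[cur][nxt] * belief[cur] for cur in STATES)
            for nxt in STATES}


def observe(predicted, forecast):
    """Step 2: B(W) is proportional to P(f | W) * B'(W); normalize at the end."""
    unnormalized = {w: SENSOR[w][forecast] * predicted[w] for w in STATES}
    total = sum(unnormalized.values())
    return {w: p / total for w, p in unnormalized.items()}


def forward(forecasts):
    """Run both updates once per observed forecast, starting from the prior."""
    belief = dict(PRIOR)
    for f in forecasts:
        belief = observe(time_elapse(belief), f)
    return belief


belief_after_one_day = forward(["good"])
```

For a single "good" forecast, the time elapse step gives $B'(\text{sun}) = 0.6$ and the observation step gives $B(\text{sun}) = 0.54 / 0.66$.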

Particle Filtering:
· Use a set of samples (particles) to represent the belief state of the HMM.
· Stores $n$ particles ($n \ll |X|$).
· Exact inference is infeasible when the domain of the variables grows too large.
· Prediction: sample the next state from the transition model, $x_{t+1} \sim P(X_{t+1} \mid x_t)$.
· Update: observe $e_{t+1}$ and weight each sample based on the evidence, $w = P(e_{t+1} \mid x_{t+1})$. Fitting the data better means getting a higher weight.
· Normalize the weights across all particles.
· Resample: $n$ times, sample with replacement to get new particles; this avoids tracking weighted samples.
· Repeat for the next time step.
· The update step is similar to likelihood weighting from Bayes nets: every iteration, use the transition model to get a new set of samples, then update the way we weigh each sample by multiplying by the probability of seeing the current evidence given the state of the sample.
· If your sample ends up in a state with 0 probability given the evidence (because maybe your sensor is never going to give that reading there), resampling can help you eliminate that sample, because its weight is $w = P(e_{t+1} \mid x_{t+1}) = 0$.
· The weighting is applied one time for each time step.
* Particle filtering is not perfectly accurate, but if you have enough particles it gets close to the exact probabilities.
· Better for complex HMMs with a bunch of state variables (5 or 6) and a bunch of evidence variables.
[Figure: example particle sets before and after an update, e.g. a particle at (3, 3) with weight w = .4]
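One predict / weight / resample cycle of particle filtering might look like the sketch below. The states, the "umbrella"/"none" evidence values, all probabilities, and the names `particle_filter_step` and `belief` are assumptions for illustration. The sensor model deliberately gives "sun" zero probability of producing "umbrella", to show how resampling removes zero-weight particles.

```python
import random

STATES = ("sun", "rain")

# Hypothetical model, chosen only to illustrate the steps.
TRANSITION = {"sun": {"sun": 0.8, "rain": 0.2},
              "rain": {"sun": 0.4, "rain": 0.6}}
SENSOR = {"sun": {"umbrella": 0.0, "none": 1.0},   # P(e | x)
          "rain": {"umbrella": 0.9, "none": 0.1}}


def particle_filter_step(particles, evidence, rng):
    # Prediction: move each particle by sampling from the transition model.
    moved = [rng.choices(STATES, weights=[TRANSITION[x][s] for s in STATES])[0]
             for x in particles]
    # Update: weight each particle by P(evidence | its state).
    weights = [SENSOR[x][evidence] for x in moved]
    total = sum(weights)
    if total == 0:
        raise ValueError("every particle has zero weight for this evidence")
    # Normalize the weights across all particles.
    normalized = [w / total for w in weights]
    # Resample n times with replacement; the result is unweighted again.
    return rng.choices(moved, weights=normalized, k=len(moved))


def belief(particles):
    """Approximate belief: fraction of particles in each state."""
    return {s: particles.count(s) / len(particles) for s in STATES}


rng = random.Random(0)
particles = [rng.choice(STATES) for _ in range(200)]
particles = particle_filter_step(particles, "umbrella", rng)
```

After observing "umbrella", no particle can remain in "sun", because every such particle had weight 0 before resampling.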

Machine Learning
· Core idea: we give machines access to data and they learn for themselves.
· Data is often split into training, validation, and test sets:
  Training Set: used to fit the model.
  Validation Set: used to tune hyperparameters (learning rate, model structure, etc.).
  Test Set: used to test the entire model.
· Some types of machine learning problems:
  * Regression: try to estimate some numerical value from data. Given a bunch of points on a plot, find the line of best fit.
    ex: takes features of houses and finds the line of best fit to find what the price of each house should be.
  * Classification: try to classify data into discrete classes.
    ex: pixels are the features in an image of a number, and the classes are what we're trying to predict (0-9).
  * Clustering: try to group similar data into clusters naturally.
· Types of learning:
  * Supervised: training data has labels, e.g. classification (ex: digits).
  * Unsupervised: training data has no labels, e.g. clustering. You don't know exactly what you're looking for, but want to see what structure naturally appears.

Perceptron Alg.
· $x$: set of features; $y$: set of possible classes, $y \in \{-1, +1\}$.
1) Initialize all weights.
2) For each training sample:
   (a) Classify the sample using the current weights. The class predicted by the weights $w$ is
       $y = \text{classify}(x) = +1$ if $\text{activation}_w(x) = w^T f(x) = \sum_i w_i f_i(x) > 0$
       $y = \text{classify}(x) = -1$ if $\text{activation}_w(x) = w^T f(x) < 0$
   (b) Compare the predicted label $y$ to the true label $y^*$. If they match, do nothing; otherwise update your weights: $w \leftarrow w + y^* f(x)$.
3) If you went through every training sample without having to update your weights (all samples predicted correctly), then terminate. Else, repeat step 2.
* The weights define the line drawn. ex: 2D data, where activation > 0 means the perceptron output is positive and < 0 means it is negative.
* If the result happens to be incorrect, then we update our weights by adding the true class * features.
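The perceptron training loop from these notes can be sketched as below. The toy dataset, the `max_passes` safety cap, the all-zero initial weights, and the choice to treat an activation of exactly 0 as class -1 are assumptions added for the example (the notes only define the > 0 and < 0 cases). Each feature vector already has a constant 1 appended as a bias feature.

```python
def activation(w, f):
    """Dot product w^T f(x)."""
    return sum(wi * fi for wi, fi in zip(w, f))


def classify(w, f):
    # Assumption: an activation of exactly 0 is treated as class -1.
    return 1 if activation(w, f) > 0 else -1


def train_perceptron(samples, max_passes=100):
    """samples: list of (feature_vector, true_label) with labels in {-1, +1}."""
    w = [0.0] * len(samples[0][0])          # assumed initialization: all zeros
    for _ in range(max_passes):             # cap added in case data is not separable
        updated = False
        for f, y_true in samples:
            if classify(w, f) != y_true:
                # Wrong prediction: add (true class * features) to the weights.
                w = [wi + y_true * fi for wi, fi in zip(w, f)]
                updated = True
        if not updated:                     # a full pass with no updates: terminate
            break
    return w


# Hypothetical linearly separable 2D data; the last entry of each vector is the bias.
DATA = [([2.0, 1.0, 1.0], 1), ([1.0, 3.0, 1.0], 1),
        ([-1.0, -2.0, 1.0], -1), ([-2.0, -1.0, 1.0], -1)]
weights = train_perceptron(DATA)
```

On this data a single update is enough: the first sample is misclassified with zero weights, and the resulting weights classify all four samples correctly.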

Neural Networks:
· Motivation: most problems are non-linear.
· A common type of neural network is the multilayer perceptron, where instead of having binary -1, 1 choices after each node, we use a non-linear activation function.
· Nodes still take the dot product of their inputs with their own weight vectors.
· Each input is a feature; each node multiplies its inputs/features by its weights. The 1st layer is made of a bunch of nodes.
· Without an activation function the network is still linear, because we're only multiplying numbers by features. The sigmoid function is one such activation.
· Use gradient descent to update all weights: backpropagation.
· Applying non-linearity allows us to predict more complex classes.
· If we come up with a neural network that is complex enough, we can come up with a line that includes all circles and no crosses.

Naive Bayes classification
· Goal: create a model that can predict a label $y$ given features, where we assume all features are independently affected by the label (label $Y$ -> features $F_1, F_2, \ldots, F_n$).
· ex: spam filter
  · $y$ is in {spam, ham}
  · $F_i$ in {0, 1} is whether word $i$ appears in the email
  · Label the email based on the higher of these two probabilities:
    $P(Y = \text{ham} \mid F_1 = f_1, \ldots, F_n = f_n)$
    $P(Y = \text{spam} \mid F_1 = f_1, \ldots, F_n = f_n)$
* We don't assume that there is any type of relationship between words.
· We can get the distribution for $y$ by multiplying the probability of $y$ by the probability of each feature variable given $y$.
· Generalized:
  $\text{pred}(f_1, \ldots, f_n) = \arg\max_y P(Y = y \mid F_1 = f_1, \ldots, F_n = f_n) = \arg\max_y P(Y = y) \prod_i P(F_i = f_i \mid Y = y)$

Maximum Likelihood Estimation
· How do we estimate the CPTs, since we don't actually know them? Parameter estimation with MLE.
· Find the probabilities (CPT values) $\theta$ such that we maximize the likelihood of observing our observations, $P(\text{observations} \mid \theta)$.
· The answer is actually fairly intuitive. Given data $(F, Y)$:
  $P(Y = y) = \text{MLE}(\theta \mid (F, Y)) = \dfrac{\#\text{ examples with } (Y = y)}{\text{total } \#\text{ examples}}$
  $P(F_i = f \mid Y = y) = \dfrac{\#\text{ examples with } (F_i = f, Y = y)}{\#\text{ examples with } (Y = y)}$
· Look at a ton of emails that have already been classified for us and find each edge (each edge is a CPT): what is the probability, given that an email is spam, that it has some word in it?
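The counting-based MLE estimates and the Naive Bayes prediction rule can be sketched together. The four toy emails, their labels, and the function names `fit_naive_bayes` and `predict` are hypothetical. This uses the plain count ratios from the notes with no smoothing, so a word never seen with a label gets probability 0 for that label.

```python
from collections import Counter


def fit_naive_bayes(emails, labels):
    """emails: list of sets of words; labels: parallel list of class names."""
    label_counts = Counter(labels)
    vocab = sorted({w for words in emails for w in words})
    # MLE prior: (# examples with Y = y) / (total # examples)
    prior = {y: c / len(labels) for y, c in label_counts.items()}
    word_counts = {y: Counter() for y in label_counts}
    for words, y in zip(emails, labels):
        word_counts[y].update(set(words))
    # MLE CPT: P(F_w = 1 | Y = y) = (# examples with word w and label y) / (# with label y)
    cond = {y: {w: word_counts[y][w] / label_counts[y] for w in vocab}
            for y in label_counts}
    return prior, cond, vocab


def predict(prior, cond, vocab, words):
    """Return (argmax label, unnormalized score per label)."""
    present = set(words)
    scores = {}
    for y in prior:
        score = prior[y]
        for w in vocab:
            p = cond[y][w]
            # Multiply by P(F_w = 1 | y) if the word appears, else by P(F_w = 0 | y).
            score *= p if w in present else 1.0 - p
        scores[y] = score
    return max(scores, key=scores.get), scores


# Hypothetical training data, for illustration only.
EMAILS = [{"win", "money"}, {"win", "prize"}, {"meeting", "notes"}, {"money", "meeting"}]
LABELS = ["spam", "spam", "ham", "ham"]
prior, cond, vocab = fit_naive_bayes(EMAILS, LABELS)
label, scores = predict(prior, cond, vocab, {"win", "money"})
```

Here "win" never appears in a ham email, so the ham score for an email containing "win" is 0 and the prediction is spam.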

Binary Perceptron
· Idea: linearly separates data into classes (the boundary is defined by a set of weights).
· If the data is linearly separable, the algorithm will perfectly classify the data.
· To find boundaries $w^T f(x) = 0$ that don't have to cross the origin, incorporate a "bias" feature that always has a value of 1.
· Creating a perceptron model: we have a bunch of data and features on some kind of graph, where each point is either positive or negative, so given just the features we want to build a model that can take these features and give us an estimate for the classes.
* The perceptron figures out what (linear) line you can draw to perfectly divide the 2 classes.
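A small sketch of why the bias feature matters: with a constant 1 appended to the features, the boundary $w^T f(x) = 0$ no longer has to pass through the origin. The helper names and the example weights are assumptions for illustration, and an activation of exactly 0 is treated as class -1 here.

```python
def with_bias(features):
    """Append the constant bias feature (always 1) to a feature vector."""
    return list(features) + [1.0]


def classify(w, features):
    f = with_bias(features)
    act = sum(wi * fi for wi, fi in zip(w, f))
    return 1 if act > 0 else -1


# Hypothetical 1D example: weights [1, -3] put the boundary at x = 3,
# because 1 * x + (-3) * 1 = 0 exactly when x = 3.
W_WITH_BIAS = [1.0, -3.0]

# With the bias weight set to 0 the boundary is forced back to the origin (x = 0).
W_NO_BIAS = [1.0, 0.0]

high = classify(W_WITH_BIAS, [5.0])    # 5 - 3 = 2 > 0, so class +1
low = classify(W_WITH_BIAS, [1.0])     # 1 - 3 = -2, so class -1
forced = classify(W_NO_BIAS, [1.0])    # 1 > 0, so class +1 (cannot separate 1 from 5)
```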

Data split order: train -> validation -> test

Prior sampling
· Alg: randomly generate samples, discard every sample state inconsistent with the combination of evidence, and calculate probabilities from the rest.
· May have to compute a large # of samples for unlikely scenarios.
· ex: generate samples such as $(c, \neg s, r, w)$, then compute $P(W = w)$ as the fraction of samples with $W = w$.
· For i = 1, 2, ..., n (in topological order):
    Sample $x_i$ from $P(X_i \mid \text{Parents}(X_i))$
  Return $(x_1, x_2, \ldots, x_n)$

Rejection Sampling
· Idea: we can immediately stop generating a sample as soon as it becomes inconsistent with our evidence; we will not accept any sample that has inconsistent evidence.
· We still discard most samples, but it takes less time to generate those.
· Alg:
  Input: evidence $e_1, \ldots, e_k$
  For i = 1, 2, ..., n:
    Sample $x_i$ from $P(X_i \mid \text{Parents}(X_i))$
    If $x_i$ is inconsistent with the evidence, reject: return, and no sample is generated in this cycle
  Return $(x_1, x_2, \ldots, x_n)$
· ex: if we want $P(C \mid r, w)$, we can throw away samples with $\neg r$ or $\neg w$.

Logistic Reg.
· Idea: instead of simply using $w^T x$ (binary perceptron) and classifying, apply the sigmoid function on $w^T x$.
· Results are always between 0 and 1: can predict probabilities, unlike the binary perceptron.
· If we can get a set of weights, the logistic regression form can give us a good classification probability for a set of data points.
· How do we compute the weights? Use gradient descent.
· See also: multiclass logistic regression, using the softmax function.

Gradient Ascent / Descent:
· Given an accuracy measure, we want to come up with a better network, i.e. decrease our loss.
· Goal: want to find parameters that maximize an objective function or minimize a loss function.
· If a closed-form formula for the global optimum does not exist, can use gradient ascent/descent.
· Observation: the gradient is the direction of steepest increase; by repeatedly following the gradient, we can chase maxima/minima.
· Gradient ascent:
    Randomly initialize w
    While w not converged do:
      $w \leftarrow w + \alpha \nabla_w f(w)$
    end
· Gradient descent:
    Randomly initialize w
    While w not converged do:
      $w \leftarrow w - \alpha \nabla_w f(w)$
    end

Training Neural Networks
· Set the weights to some initial values.
· Input training data, run a forward pass to generate values at all nodes, and calculate the loss function on the final output.
· Run a backwards pass: calculate the gradient of the loss with respect to each of the weights.
· Use gradient descent to update all weights.
· Repeat with more data.
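As a sketch of the gradient descent loop applied to logistic regression: the code below minimizes the mean negative log-likelihood of 0/1 labels, whose gradient is the average of (prediction - label) * features. The toy data, the step size `alpha`, the fixed number of steps (used in place of a convergence check), and the zero initialization (used instead of random initialization so the run is reproducible) are all assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_prob(w, x):
    """h_w(x) = sigmoid(w^T x)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def loss_gradient(w, data):
    """Gradient of the mean negative log-likelihood with respect to w."""
    grad = [0.0] * len(w)
    for x, y in data:
        err = predict_prob(w, x) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return [g / len(data) for g in grad]


def gradient_descent(data, alpha=0.5, steps=2000):
    w = [0.0] * len(data[0][0])                 # assumed init (notes: random init)
    for _ in range(steps):                      # assumed stop rule: fixed step count
        grad = loss_gradient(w, data)
        w = [wi - alpha * gi for wi, gi in zip(w, grad)]   # w <- w - alpha * gradient
    return w


# Hypothetical 1D data with a bias feature (second entry is always 1).
DATA = [([0.0, 1.0], 0), ([1.0, 1.0], 0), ([2.0, 1.0], 1), ([3.0, 1.0], 1)]
w = gradient_descent(DATA)
```

After training, points at 0 and 1 get predicted probabilities below 0.5 and points at 2 and 3 get probabilities above 0.5.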

Bayes Net:
· Acyclic graph.
· Use a conditional probability table (CPT) to encode relationships between variables.
· Calculate the probability of an assignment:
  $P(x_1, x_2, \ldots, x_n) = \prod_i P(x_i \mid \text{Parents}(X_i))$
· ex: Alarm (A) goes off if there's a burglary (B) or earthquake (E), resulting in John (J) or Mary (M) calling.
  · CPTs: $P(B)$, $P(E)$, $P(A \mid B, E)$, $P(J \mid A)$, $P(M \mid A)$
  · $P(+b, -e, +a, +j, +m) = P(+b) \cdot P(-e) \cdot P(+a \mid +b, -e) \cdot P(+j \mid +a) \cdot P(+m \mid +a)$

Independence in Bayes Nets:
1) Each node is conditionally independent of all its ancestor nodes (non-descendants) in the graph, given all of its parents.
2) Each node is conditionally independent of all other variables given its Markov blanket (consisting of its parents, children, and children's other parents).

Variable Elim
· Eliminate a hidden variable $X$ by joining (multiplying) all factors involving $X$ and summing out $X$, the variable you're trying to eliminate.
· Uses smaller factors than inference by enumeration.
· Factors: unnormalized probabilities, proportional to the actual probabilities, but they don't sum to 1 like a probability distribution should.
· Move unnecessary terms outside the summation.

Logistic function
· Takes in some input from $-\infty$ to $\infty$ and maps it to a value from 0 to 1.
· Uses: as an activation function, which you can apply immediately after a linear layer, and to predict the class of something (don't forget to include the bias value).
· ex: whatever our neural network outputs, from $-\infty$ to $\infty$, the sigmoid function will map it to a value from 0 to 1. If given something like $-\infty$, it will likely be class 0; if given +1000, it will likely be class 1 because it is very big; anything in the middle like 0 will be interpreted as a probability like 0.5, saying it's 50/50.

Logistic Regression:
  $h_w(x) = \dfrac{1}{1 + e^{-w^T x}}$
· We want to fine tune our weights such that when we pass $w^T x$, which is a single layer in a NN, into the sigmoid function, we get a value from 0 to 1, like a probability.
· Similar to a neural network with just 1 layer.

Likelihood weighting
· Idea: rejection sampling may reject a lot of samples if the evidence is unlikely -> let's fix the evidence variables while we sample, and weight each sample by the probability of the evidence given its parents.
· All samples are used!
· Downstream variables are influenced by upstream evidence, but upstream variables are not influenced by downstream evidence; we want samples to see all the evidence, not just that which influences downstream variables.
· Alg:
  Input: evidence $e_1, \ldots, e_k$
  $w = 1.0$
  For i = 1, 2, ..., n:
    If $X_i$ is an evidence variable:
      $x_i$ = observed value for $X_i$
      Set $w = w \cdot P(x_i \mid \text{Parents}(X_i))$
    Else:
      Sample $x_i$ from $P(X_i \mid \text{Parents}(X_i))$
  Return $(x_1, x_2, \ldots, x_n)$, $w$
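The likelihood weighting procedure can be sketched as follows. The network (C -> S, C -> R, (S, R) -> W), every CPT number, and the names `weighted_sample` and `estimate` are hypothetical choices for illustration only. Variables are stored in topological order so that parents are always assigned before their children.

```python
import random

# (variable, parents, P(variable = True | parent values)), in topological order.
# All numbers are made up for illustration.
NETWORK = [
    ("C", (), {(): 0.5}),
    ("S", ("C",), {(True,): 0.1, (False,): 0.5}),
    ("R", ("C",), {(True,): 0.8, (False,): 0.2}),
    ("W", ("S", "R"), {(True, True): 0.99, (True, False): 0.9,
                       (False, True): 0.9, (False, False): 0.01}),
]


def weighted_sample(evidence, rng):
    sample, weight = {}, 1.0
    for var, parents, cpt in NETWORK:
        p_true = cpt[tuple(sample[p] for p in parents)]
        if var in evidence:
            # Evidence variable: fix its value and multiply the weight
            # by P(observed value | parents).
            sample[var] = evidence[var]
            weight *= p_true if evidence[var] else 1.0 - p_true
        else:
            # Non-evidence variable: sample from P(var | parents).
            sample[var] = rng.random() < p_true
    return sample, weight


def estimate(query_var, evidence, n, rng):
    """Weighted estimate of P(query_var = True | evidence) from n samples."""
    numerator = denominator = 0.0
    for _ in range(n):
        sample, weight = weighted_sample(evidence, rng)
        denominator += weight
        if sample[query_var]:
            numerator += weight
    return numerator / denominator


p_c_given_r = estimate("C", {"R": True}, 5000, random.Random(1))
```

With these made-up numbers the exact value is $P(+c \mid +r) = 0.8$, and every generated sample contributes to the estimate, unlike in rejection sampling.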