ENCS5341 Machine Learning and Data Science
Linear Algebra and Probability Review
Yazan Abu Farha - Birzeit University
Slides are based on Stanford CS229 course

Linear Algebra

Diagonal Matrices
• A diagonal matrix is a matrix where all non-diagonal elements are 0. This is typically denoted D = diag(d_1, d_2, ..., d_n), with
  D_ij = d_i if i = j, and D_ij = 0 if i ≠ j
• For example, the identity matrix I = diag(1, 1, ..., 1)

Vector-Vector Product
• inner product or dot product: x^T y = Σ_{i=1}^{n} x_i y_i, a scalar
• outer product: x y^T, the matrix whose (i, j) entry is x_i y_j

Matrix-Vector Product
• If we write A by rows, then we can express Ax as a vector whose i-th entry is the inner product of the i-th row of A with x: (Ax)_i = a_i^T x

Matrix-Vector Product
• It is also possible to multiply on the left by a row vector: y^T = x^T A. Expressing A in terms of rows, we have that y^T is a linear combination of the rows of A.

Matrix-Matrix Multiplication (different views)
1. As a set of vector-vector products (dot products): C_ij is the inner product of the i-th row of A and the j-th column of B
2. As a sum of outer products of the columns of A with the rows of B

Matrix-Matrix Multiplication (different views)
3. As a set of matrix-vector products: the j-th column of C is A times the j-th column of B
4. As a set of vector-matrix products: the i-th row of C is the i-th row of A times B

Norms
• A norm of a vector ∥x∥ is informally a measure of the "length" of the vector.
• More formally, a norm is any function f : ℝ^n → ℝ that satisfies 4 properties: non-negativity (f(x) ≥ 0), definiteness (f(x) = 0 if and only if x = 0), homogeneity (f(tx) = |t| f(x)), and the triangle inequality (f(x + y) ≤ f(x) + f(y))

Examples of Norms
• The commonly-used Euclidean or ℓ2 norm,
  ∥x∥_2 = sqrt( Σ_{i=1}^{n} x_i^2 )
• The ℓ1 norm,
  ∥x∥_1 = Σ_{i=1}^{n} |x_i|
• The ℓ∞ norm,
  ∥x∥_∞ = max_i |x_i|
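As a quick numerical sketch of the three norms above (the example vector is arbitrary, chosen for illustration):

```python
import numpy as np

# An illustrative vector; its norms are easy to verify by hand.
x = np.array([3.0, -4.0, 0.0])

l2 = np.linalg.norm(x)            # Euclidean norm: sqrt(9 + 16 + 0) = 5
l1 = np.linalg.norm(x, 1)         # l1 norm: |3| + |-4| + |0| = 7
linf = np.linalg.norm(x, np.inf)  # l-infinity norm: max(3, 4, 0) = 4

print(l2, l1, linf)  # 5.0 7.0 4.0
```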
The Inverse of a Square Matrix
• The inverse of a square matrix A ∈ ℝ^{n×n} is denoted A^{-1}, and is the unique matrix such that
  A^{-1} A = I = A A^{-1}
• We say that A is invertible or non-singular if A^{-1} exists, and non-invertible or singular otherwise.
• In order for a square matrix A to have an inverse A^{-1}, A must be full rank.
• Properties (assuming A, B ∈ ℝ^{n×n} are non-singular):
  - (A^{-1})^{-1} = A
  - (AB)^{-1} = B^{-1} A^{-1}
  - (A^{-1})^T = (A^T)^{-1}. For this reason this matrix is often denoted A^{-T}.
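A small NumPy check of these properties (the matrices are arbitrary full-rank examples):

```python
import numpy as np

# An invertible (full-rank) 2x2 matrix, chosen for illustration.
A = np.array([[4.0, 7.0],
              [2.0, 6.0]])
A_inv = np.linalg.inv(A)

# A^{-1} A = I = A A^{-1}
I = np.eye(2)
print(np.allclose(A_inv @ A, I), np.allclose(A @ A_inv, I))  # True True

# (AB)^{-1} = B^{-1} A^{-1}
B = np.array([[1.0, 2.0],
              [3.0, 5.0]])
lhs = np.linalg.inv(A @ B)
rhs = np.linalg.inv(B) @ np.linalg.inv(A)
print(np.allclose(lhs, rhs))  # True
```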
Definitions, Axioms, and Corollaries
• Performing an experiment → outcome
• Sample Space (S): set of all possible outcomes of an experiment
• Event (E): a subset of S (E ⊆ S)
• Probability (Bayesian definition): a number between 0 and 1 to which we ascribe meaning, i.e. our belief that an event E occurs
• Frequentist definition of probability:
  P(E) = lim_{n→∞} n(E)/n
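The frequentist definition can be illustrated by simulation: the relative frequency n(E)/n approaches P(E) as the number of trials grows. (The event here, "a fair die shows an even number", is an assumed example, not from the slides.)

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

# Estimate P(E) for E = "a fair die shows an even number" via n(E)/n.
n = 100_000
n_E = sum(1 for _ in range(n) if random.randint(1, 6) % 2 == 0)
estimate = n_E / n
print(round(estimate, 2))  # close to the true probability 0.5
```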
Definitions, Axioms, and Corollaries
• Axiom 1: 0 ≤ P(E) ≤ 1
• Axiom 2: P(S) = 1
• Axiom 3: If E and F are mutually exclusive (E ∩ F = ∅), then P(E) + P(F) = P(E ∪ F)
• Corollary 1: P(E^c) = 1 − P(E) (= P(S) − P(E))
• Corollary 2: If E ⊆ F, then P(E) ≤ P(F)
• Corollary 3: P(E ∪ F) = P(E) + P(F) − P(E ∩ F) (Inclusion-Exclusion Principle)
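These corollaries can be checked exactly on a finite sample space. A minimal sketch, using a fair die (the events E and F are assumed examples):

```python
from fractions import Fraction

# Finite sample space of a fair die; each outcome has probability 1/6,
# so P(event) = |event| / |S|.
S = {1, 2, 3, 4, 5, 6}
P = lambda event: Fraction(len(event), len(S))

E = {2, 4, 6}   # "even"
F = {4, 5, 6}   # "greater than 3"

# Inclusion-Exclusion: P(E u F) = P(E) + P(F) - P(E n F)
assert P(E | F) == P(E) + P(F) - P(E & F)

# Complement: P(E^c) = 1 - P(E)
assert P(S - E) == 1 - P(E)

print(P(E | F))  # 2/3
```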
Conditional Probability and Bayes' Rule
For any events A, B such that P(B) ≠ 0, we define:
  P(A | B) := P(A ∩ B) / P(B)
Let's apply conditional probability to obtain Bayes' Rule!
  P(B | A) = P(B ∩ A) / P(A) = P(A ∩ B) / P(A) = P(B) P(A | B) / P(A)
Conditioned Bayes' Rule: given events A, B, C,
  P(A | B, C) = P(B | A, C) P(A | C) / P(B | C)
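A worked instance of Bayes' Rule, using the classic diagnostic-test setup. All the numbers (1% prevalence, 95% sensitivity, 10% false-positive rate) are hypothetical, chosen only to make the computation concrete:

```python
# Hypothetical numbers: B = "has disease", A = "test is positive".
P_B = 0.01              # P(B): prior probability of disease
P_A_given_B = 0.95      # P(A | B): positive test given disease
P_A_given_notB = 0.10   # P(A | B^c): false-positive rate

# Total probability: P(A) = P(A|B)P(B) + P(A|B^c)P(B^c)
P_A = P_A_given_B * P_B + P_A_given_notB * (1 - P_B)

# Bayes' Rule: P(B | A) = P(B) P(A | B) / P(A)
P_B_given_A = P_B * P_A_given_B / P_A
print(round(P_B_given_A, 3))  # 0.088
```

Note how the posterior (about 9%) is far below the test's 95% sensitivity because the prior P(B) is small.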
Chain Rule
For any n events A_1, ..., A_n, the joint probability can be expressed as a product of conditionals:
  P(A_1 ∩ A_2 ∩ ... ∩ A_n)
  = P(A_1) P(A_2 | A_1) P(A_3 | A_2 ∩ A_1) ... P(A_n | A_{n−1} ∩ A_{n−2} ∩ ... ∩ A_1)
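A standard illustration of the chain rule (the card-drawing scenario is an assumed example): the probability of drawing three aces in a row without replacement.

```python
from fractions import Fraction

# Chain rule with three events A1, A2, A3 = "i-th draw is an ace":
# P(A1 n A2 n A3) = P(A1) * P(A2 | A1) * P(A3 | A1 n A2)
# from a standard 52-card deck, drawing without replacement.
p = Fraction(4, 52) * Fraction(3, 51) * Fraction(2, 50)
print(p)  # 1/5525
```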
Independence
Events A, B are independent if
  P(A ∩ B) = P(A) P(B)
We denote this as A ⊥ B. From this, we know that if A ⊥ B,
  P(A | B) = P(A ∩ B) / P(B) = P(A) P(B) / P(B) = P(A)
Implication: If two events are independent, observing one event does not change the probability that the other event occurs.
In general: events A_1, ..., A_n are mutually independent if
  P(∩_{i∈S} A_i) = Π_{i∈S} P(A_i)
for any subset S ⊆ {1, ..., n}.
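Independence can be verified exactly by enumerating a product sample space. A sketch with two fair dice (the events are assumed examples):

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs of outcomes from two fair dice.
S = list(product(range(1, 7), repeat=2))
P = lambda event: Fraction(len(event), len(S))

A = {s for s in S if s[0] % 2 == 0}  # first die even
B = {s for s in S if s[1] > 3}       # second die greater than 3

# Independence: P(A n B) = P(A) P(B)
assert P(A & B) == P(A) * P(B)
# Equivalently, conditioning on B does not change P(A):
assert P(A & B) / P(B) == P(A)

print(P(A), P(B), P(A & B))  # 1/2 1/2 1/4
```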
Random Variables
• A random variable X is a variable that probabilistically takes on different values. It maps outcomes to real values.
• X takes on values in Val(X) ⊆ ℝ, also called the support Sup(X)
• X = k is the event that random variable X takes on value k
Discrete RVs:
• Val(X) is a set
• P(X = k) can be nonzero
Continuous RVs:
• Val(X) is a range
• P(X = k) = 0 for all k. P(a ≤ X ≤ b) can be nonzero.
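A concrete discrete RV, built as a map from outcomes to real values (the two-dice example is an assumed illustration): X = sum of two fair dice, with Val(X) = {2, ..., 12}.

```python
from fractions import Fraction
from itertools import product
from collections import Counter

# X maps each outcome (i, j) of two fair dice to the real value i + j.
counts = Counter(i + j for i, j in product(range(1, 7), repeat=2))
pmf = {k: Fraction(v, 36) for k, v in counts.items()}

print(sorted(pmf))  # Val(X) = [2, 3, ..., 12]
print(pmf[7])       # P(X = 7) = 1/6, a nonzero probability
assert sum(pmf.values()) == 1
```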
Probability Density Function (PDF)
The PDF of a continuous RV is simply the derivative of the CDF:
  f_X(x) = f(x) = dF_X(x)/dx
Thus,
  P(a ≤ X ≤ b) = F_X(b) − F_X(a) = ∫_a^b f(x) dx
A valid PDF must be such that
• for all real numbers x, f_X(x) ≥ 0
• ∫_{−∞}^{∞} f_X(x) dx = 1
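A numerical check of the relation P(a ≤ X ≤ b) = F_X(b) − F_X(a) = ∫_a^b f(x) dx, using an exponential distribution as the example RV (an assumed illustration, not from the slides):

```python
import math

# Exponential RV with rate lam: f(x) = lam * exp(-lam * x) for x >= 0,
# with CDF F(x) = 1 - exp(-lam * x), so f = dF/dx.
lam = 2.0
f = lambda x: lam * math.exp(-lam * x)
F = lambda x: 1 - math.exp(-lam * x)

# Compare F(b) - F(a) against a midpoint Riemann sum of the PDF.
a, b = 0.5, 1.5
n = 100_000
dx = (b - a) / n
integral = sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx
print(abs(integral - (F(b) - F(a))) < 1e-6)  # True
```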
Expectation
Let g be an arbitrary real-valued function.
• If X is a discrete RV with PMF p_X:
  E[g(X)] := Σ_{x∈Val(X)} g(x) p_X(x)
• If X is a continuous RV with PDF f_X:
  E[g(X)] := ∫_{−∞}^{∞} g(x) f_X(x) dx
Intuitively, expectation is a weighted average of the values of g(x), weighted by the probability of x.
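The discrete case as a one-line weighted sum, with a fair die and g(x) = x² as assumed examples:

```python
from fractions import Fraction

# E[g(X)] for a discrete RV: X is a fair die, g(x) = x**2.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}
g = lambda x: x ** 2

# Weighted average of g(x), weighted by p_X(x).
E_gX = sum(g(x) * p for x, p in pmf.items())
print(E_gX)  # 91/6
```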
Properties of Expectation
For any constant a ∈ ℝ and arbitrary real function f:
• E[a] = a
• E[a f(X)] = a E[f(X)]
Linearity of Expectation
Given n real-valued functions f_1(X), ..., f_n(X),
  E[ Σ_{i=1}^{n} f_i(X) ] = Σ_{i=1}^{n} E[f_i(X)]
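Linearity checked exactly for a fair die with the two assumed functions f_1(x) = x and f_2(x) = x²:

```python
from fractions import Fraction

# E[f1(X) + f2(X)] = E[f1(X)] + E[f2(X)] for X a fair die.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}
E = lambda g: sum(g(x) * p for x, p in pmf.items())

lhs = E(lambda x: x + x ** 2)
rhs = E(lambda x: x) + E(lambda x: x ** 2)
print(lhs, rhs)  # 56/3 56/3
```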
Joint and Marginal Distributions
• Joint PMF for discrete RVs X, Y:
  p_XY(x, y) = P(X = x, Y = y)
  Note that Σ_{x∈Val(X)} Σ_{y∈Val(Y)} p_XY(x, y) = 1
• Marginal PMF of X, given joint PMF of X, Y:
  p_X(x) = Σ_y p_XY(x, y)
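Marginalization in the discrete case, on a small hypothetical joint PMF (the table values are assumptions chosen to sum to 1):

```python
from fractions import Fraction

# A hypothetical joint PMF over X in {0, 1} and Y in {0, 1, 2}.
p_xy = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(1, 8), (0, 2): Fraction(1, 4),
    (1, 0): Fraction(1, 8), (1, 1): Fraction(1, 4), (1, 2): Fraction(1, 8),
}
assert sum(p_xy.values()) == 1  # a valid joint PMF sums to 1

# Marginal PMF of X: p_X(x) = sum over y of p_XY(x, y)
p_x = {x: sum(p for (xx, y), p in p_xy.items() if xx == x) for x in (0, 1)}
print(p_x[0], p_x[1])  # 1/2 1/2
```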
Joint and Marginal Distributions
• Joint PDF for continuous X, Y:
  f_XY(x, y) = ∂²F_XY(x, y) / ∂x∂y
  Note that ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_XY(x, y) dx dy = 1
• Marginal PDF of X, given joint PDF of X, Y:
  f_X(x) = ∫_{−∞}^{∞} f_XY(x, y) dy
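The continuous marginal can be checked numerically. As an assumed example, take f_XY(x, y) = exp(−x − y) on x, y ≥ 0 (two independent unit-rate exponentials), whose marginal is f_X(x) = exp(−x):

```python
import math

# Hypothetical joint PDF: f_XY(x, y) = exp(-x - y) for x, y >= 0.
f_xy = lambda x, y: math.exp(-x - y)

# Marginal PDF of X at a point: integrate f_XY over y, approximated
# by a midpoint Riemann sum over a truncated range [0, y_max].
def marginal_x(x, y_max=30.0, n=100_000):
    dy = y_max / n
    return sum(f_xy(x, (i + 0.5) * dy) for i in range(n)) * dy

x0 = 1.0
print(abs(marginal_x(x0) - math.exp(-x0)) < 1e-5)  # True: f_X(x) = exp(-x)
```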