Cheat sheet: Statistics for Network Science | Statistics summaries

I. FUNDAMENTALS OF STATISTICS

Examples of common distributions:

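1. Normal (Gaussian, bell curve): $X \sim \mathcal{N}(\mu, \sigma^2)$

A continuous random variable X with density $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$.

$\mathrm{E}[X] = \mu$, $\mathrm{V}[X] = \sigma^2$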

2. Uniform

a. Discrete:

$p_X(k) = \begin{cases} \frac{1}{b-a+1}, & \text{if } k = a, a+1, \dots, b, \\ 0, & \text{otherwise.} \end{cases}$

$\mathrm{E}[X] = \frac{a+b}{2}$

b. Continuous:

$f_X(x) = \begin{cases} \frac{1}{b-a}, & \text{if } a \le x \le b, \\ 0, & \text{otherwise.} \end{cases}$

$\mathrm{E}[X] = \frac{a+b}{2}$
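The normal and discrete uniform formulas can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names and the parameter values (mu = 1.5, sigma = 2, a = 3, b = 10) are illustrative choices, not part of the original notes.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def discrete_uniform_pmf(k, a, b):
    """Probability mass of the discrete uniform distribution on {a, ..., b}."""
    k = np.asarray(k)
    return np.where((k >= a) & (k <= b), 1.0 / (b - a + 1), 0.0)

# Normal: integrate numerically on a fine grid covering +/- 10 standard deviations.
mu, sigma = 1.5, 2.0
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200_001)
dx = x[1] - x[0]
f = normal_pdf(x, mu, sigma)
normal_total = np.sum(f) * dx                   # should be close to 1
normal_mean = np.sum(x * f) * dx                # should be close to mu
normal_var = np.sum((x - mu) ** 2 * f) * dx     # should be close to sigma^2

# Discrete uniform on {a, ..., b}: the mean is (a + b) / 2.
a, b = 3, 10
k = np.arange(a, b + 1)
pmf = discrete_uniform_pmf(k, a, b)
uniform_total = pmf.sum()                       # should be 1
uniform_mean = np.sum(k * pmf)                  # should be (a + b) / 2
```

The same grid-integration idea carries over to the continuous uniform density by replacing `normal_pdf` with a function that returns 1/(b-a) on [a, b].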

3. Exponential

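HEAVY TAIL: A random variable X with CCDF (complementary cumulative distribution function) $\bar{F}(x) = P(X > x)$ is heavy-tailed if the tail is not bounded by an exponential, i.e. $\bar{F}(x)$ decays more slowly than $e^{-\lambda x}$ for every $\lambda > 0$.

FAT TAIL: A random variable X with CCDF $\bar{F}(x)$ is fat-tailed if $\bar{F}(x) \sim x^{-\gamma}$ for some $\gamma > 0$.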
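To make the difference between a fat tail and an exponentially bounded tail concrete, here is a small sketch that compares simulated samples. It assumes NumPy; the tail exponent gamma = 1.5, the threshold 20, and the helper `empirical_ccdf` are hypothetical choices made for illustration only.

```python
import numpy as np

def empirical_ccdf(samples):
    """Return the sorted values and the empirical CCDF P(X >= x) at each of them."""
    x = np.sort(samples)
    n = len(x)
    ccdf = 1.0 - np.arange(n) / n   # (n - i) / n for the i-th smallest value (0-based)
    return x, ccdf

rng = np.random.default_rng(0)
n = 100_000
gamma = 1.5  # hypothetical tail exponent

# Fat-tailed sample: CCDF x^(-gamma) for x >= 1, drawn by inverse-transform sampling.
pareto = (1.0 - rng.random(n)) ** (-1.0 / gamma)
# Light-tailed comparison: exponential with rate 1, CCDF exp(-x).
expo = rng.exponential(scale=1.0, size=n)

# Fraction of each sample beyond a far-out threshold.
threshold = 20.0
pareto_tail = np.mean(pareto > threshold)   # theory: 20^(-1.5), about 0.011
expo_tail = np.mean(expo > threshold)       # theory: exp(-20), about 2e-9

# On log-log axes the fat-tailed CCDF is roughly a straight line with slope -gamma.
x, ccdf = empirical_ccdf(pareto)
mask = (x > 1.0) & (x < 50.0)               # stay away from the noisy extreme tail
slope = np.polyfit(np.log(x[mask]), np.log(ccdf[mask]), 1)[0]
```

A straight line on a log-log CCDF plot is only a visual hint of a fat tail; it does not replace a proper fit.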

II. DISTRIBUTION FITTING

POWER LAW.

Continuous case:

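$p(x) = \frac{\alpha-1}{x_{\min}} \left(\frac{x}{x_{\min}}\right)^{-\alpha}$

$\bar{F}(x) = \left(\frac{x}{x_{\min}}\right)^{-\alpha+1}$

Discrete case:

$p(x) = \frac{x^{-\alpha}}{\zeta(\alpha)}$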
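For reference, the continuous-case density and CCDF follow from normalizing $C x^{-\alpha}$ on $[x_{\min}, \infty)$. The short derivation below assumes $\alpha > 1$, which is needed for the integral to converge; the constant $C$ is introduced here only as an intermediate symbol.

```latex
\begin{align*}
1 &= \int_{x_{\min}}^{\infty} C\,x^{-\alpha}\,dx
   = \frac{C\,x_{\min}^{1-\alpha}}{\alpha-1}
   \;\Longrightarrow\; C = (\alpha-1)\,x_{\min}^{\alpha-1}, \\
p(x) &= C\,x^{-\alpha}
   = \frac{\alpha-1}{x_{\min}}\left(\frac{x}{x_{\min}}\right)^{-\alpha}, \\
\bar{F}(x) &= \int_{x}^{\infty} p(t)\,dt
   = \left(\frac{x}{x_{\min}}\right)^{-\alpha+1}.
\end{align*}
```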

How to fit a power law:

1. Estimate the exponent, $\alpha$, through Maximum Likelihood Estimation.

2. Estimate the lower bound, $x_{\min}$, through the Kolmogorov-Smirnov test statistic: (a) for each candidate $x_{\min}$, calculate the maximum distance, D, between the CDFs of the data and the fit; (b) take the $x_{\min}$ that minimizes the distance.

3. Test the goodness of fit of the power law by computing a p-value. The p-value is the probability that a data set of the same size that is truly drawn from the hypothesized distribution would have a goodness of fit D worse than the observed value.

a. Create many (at least 10,000) synthetic power-law data sets.

b. For each, fit a new power law and calculate the KS statistic.

c. p-value = fraction of synthetic data sets whose KS statistic exceeds that of the real data.

d. If the p-value is small, the power law can be ruled out.

4. Test the goodness of fit of competing distributions (power law with cutoff, lognormal, stretched exponential) using a likelihood ratio test.
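Steps 1 to 3 of the fitting recipe can be sketched for the continuous case as follows. This is an illustrative sketch, not a reference implementation: it assumes NumPy, uses the standard continuous-case estimator alpha = 1 + n / sum(ln(x_i / x_min)), and builds synthetic data sets by mixing power-law draws above x_min with resampled observations below it. All function names, the `min_tail` cutoff, and the small number of synthetic data sets (50 instead of the 10,000 recommended above, to keep the demo fast) are assumptions. Step 4, the likelihood ratio test, is not covered.

```python
import numpy as np

def sample_power_law(n, alpha, xmin, rng):
    """Inverse-transform sampling from the CCDF (x / xmin)^(1 - alpha)."""
    u = 1.0 - rng.random(n)                       # uniform on (0, 1]
    return xmin * u ** (-1.0 / (alpha - 1.0))

def fit_alpha(tail, xmin):
    """Step 1: continuous-case MLE of the exponent for data >= xmin."""
    return 1.0 + len(tail) / np.sum(np.log(tail / xmin))

def ks_distance(tail, alpha, xmin):
    """Maximum gap D between the empirical CDF and the fitted power-law CDF."""
    x = np.sort(tail)
    n = len(x)
    model = 1.0 - (x / xmin) ** (1.0 - alpha)
    upper = np.arange(1, n + 1) / n - model       # empirical CDF just after each point
    lower = model - np.arange(0, n) / n           # empirical CDF just before each point
    return max(upper.max(), lower.max())

def fit_power_law(data, min_tail=50):
    """Step 2: scan candidate xmin values and keep the one with the smallest D."""
    data = np.asarray(data, dtype=float)
    best = None
    for xmin in np.unique(data):
        tail = data[data >= xmin]
        if len(tail) < min_tail:                  # too few points left for a stable fit
            break
        alpha = fit_alpha(tail, xmin)
        d = ks_distance(tail, alpha, xmin)
        if best is None or d < best[2]:
            best = (alpha, xmin, d)
    return best                                   # (alpha, xmin, D)

def bootstrap_p_value(data, n_synthetic, rng, min_tail=50):
    """Step 3: fraction of synthetic data sets whose D is at least the observed D."""
    data = np.asarray(data, dtype=float)
    alpha, xmin, d_obs = fit_power_law(data, min_tail)
    body = data[data < xmin]
    p_tail = np.mean(data >= xmin)
    exceed = 0
    for _ in range(n_synthetic):
        n_tail = rng.binomial(len(data), p_tail)
        synth_tail = sample_power_law(n_tail, alpha, xmin, rng)
        n_body = len(data) - n_tail
        synth_body = rng.choice(body, size=n_body) if len(body) else np.empty(0)
        synth = np.concatenate([synth_tail, synth_body])
        d_synth = fit_power_law(synth, min_tail)[2]   # refit each synthetic set
        exceed += int(d_synth >= d_obs)
    return exceed / n_synthetic

# Demo on data that really is power-law distributed (hypothetical parameters).
rng = np.random.default_rng(42)
data = sample_power_law(400, alpha=2.5, xmin=1.0, rng=rng)
alpha_hat, xmin_hat, d_obs = fit_power_law(data)
p_value = bootstrap_p_value(data, n_synthetic=50, rng=rng)
```

Refitting both the exponent and x_min on every synthetic data set matters: comparing the synthetic data against the original fit would make the observed fit look worse than it is.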

III. INTRODUCTION TO STATISTICAL INFERENCE

Process of scientific understanding:

-1) Intuition/Observation

0) Hypothesis

1) Measure

2) + Perturb

3) + Control

4) Infer

5) + Reproduce

6) + Generalize

Statistical inference is the process of deducing properties of an underlying probability distribution by analysis of data.

We try to infer properties of a population by analyzing a sample.

Statistical inference approaches fall into 2 categories:

1) Estimation

Parametric estimation: Making an educated guess about the value of a parameter (point estimate) or about a range of plausible values (interval estimate).

Prediction: Estimating something other than a parameter, such as the next value $X_{n+1}$ in a sequence $X_1, \dots, X_n$.

2) Hypothesis testing

i. State the null hypothesis $H_0$, such as $H_0: \theta = \theta_0$, and an alternative hypothesis $H_1: \theta \neq \theta_0$. The simplest explanation is the null hypothesis or null model. It assumes that chance is responsible for an observation.

ii. Check the statistical assumptions: independence, confounding, normality, sample size.

iii. Compute the test statistic $T$ from the data and compare it to the reference distribution $F_T$ that describes $T$ if $H_0$ is true.

iv. If $T$ lies in the "extreme regions" of $F_T$, reject $H_0$.
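The testing steps can be illustrated with a simulated reference distribution. The sketch below assumes NumPy and a made-up scenario: 65 heads observed in 100 coin flips, tested against a fair coin. The observation, the choice of test statistic, and the 0.05 significance level are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# i. H0: theta = theta0 (fair coin) versus H1: theta != theta0.
theta0 = 0.5
n = 100
observed_heads = 65          # hypothetical observation

# iii. Test statistic T: distance of the head count from its expectation under H0.
t_obs = abs(observed_heads - n * theta0)

# Reference distribution of T, approximated by simulating many data sets under H0.
null_heads = rng.binomial(n, theta0, size=100_000)
t_null = np.abs(null_heads - n * theta0)

# iv. p-value: how often chance alone yields a T at least as extreme as the observed one.
p_value = np.mean(t_null >= t_obs)
alpha_level = 0.05           # conventional significance level, an assumption here
reject_h0 = p_value < alpha_level
```

Simulating the null model is also the usual route in network settings, where the reference distribution of a statistic rarely has a closed form.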

HYPOTHESIS TESTING IN NETWORKS:

IV. MULTIPLE HYPOTHESIS TESTING AND MAXIMUM

LIKELIHOOD ESTIMATION

CONFUSION MATRIX

V. CLUSTER ANALYSIS

2. PARTITIONAL CLUSTERING.

K-MEANS.

DENSITY-BASED CLUSTERING.

3. GRAPH PARTITIONING

KERNIGHAN-LIN ALGORITHM.

4. HIERARCHICAL CLUSTERING.