Cheat sheet: Statistics for Network Science, summaries for Statistics

Cheat sheet for statistics for network science.

Type: Summaries

2019/2020

Uploaded on 07.01.2020

b1b8

I. FUNDAMENTALS OF STATISTICS
Examples of common distributions:
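1. Normal (Gaussian, bell curve): $X \sim \mathcal{N}(\mu, \sigma^2)$
A continuous random variable X with density
$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$.
$E[X] = \mu$, $V[X] = \sigma^2$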
2. Uniform
a. Discrete:
$p_X(k) = \frac{1}{b-a+1}$ if $k = a, a+1, \dots, b$, and $0$ otherwise.
$E[X] = \frac{a+b}{2}$
b. Continuous:
$f_X(x) = \frac{1}{b-a}$ if $a \le x \le b$, and $0$ otherwise.
$E[X] = \frac{a+b}{2}$
3. Exponential
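HEAVY TAIL: A random variable X with CCDF (complementary
cumulative distribution function) $\bar{F}(x) = P(X > x)$ is
heavy-tailed if the tail is not bounded by an exponential, that is,
$\limsup_{x \to \infty} e^{\lambda x} \bar{F}(x) = \infty$ for every $\lambda > 0$.
FAT TAIL: A random variable X with CCDF $\bar{F}(x)$ is
fat-tailed if $\bar{F}(x) \sim x^{-\gamma}$ as $x \to \infty$ for some $\gamma > 0$.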
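
To make the two definitions concrete, the sketch below compares the CCDF of an exponential variable with that of a Pareto (power-law) variable. The rate `lam`, the exponent `gamma` and the lower bound `x_min` are illustrative values chosen here, not taken from the notes.

```python
import math

def ccdf_exponential(x, lam=1.0):
    """CCDF of an exponential variable: P(X > x) = exp(-lam * x)."""
    return math.exp(-lam * x)

def ccdf_pareto(x, gamma=1.5, x_min=1.0):
    """CCDF of a Pareto variable: P(X > x) = (x / x_min)^(-gamma) for x >= x_min."""
    return (x / x_min) ** (-gamma) if x >= x_min else 1.0

# The ratio pareto / exponential keeps growing with x: the power-law tail
# is not bounded by this exponential (illustrative values only).
ratios = []
for x in (1, 10, 100):
    ratio = ccdf_pareto(x) / ccdf_exponential(x)
    ratios.append(ratio)
    print(f"x={x:>3}  exp={ccdf_exponential(x):.3e}  pareto={ccdf_pareto(x):.3e}  ratio={ratio:.3e}")
```

The same comparison holds for any other positive `lam`; only the point where the power law overtakes the exponential moves.
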
II. DISTRIBUTION FITTING
POWER LAW.
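Continuous case:
$p(x) = \frac{\alpha - 1}{x_{\min}} \left(\frac{x}{x_{\min}}\right)^{-\alpha}$
$\bar{F}(x) = \left(\frac{x}{x_{\min}}\right)^{-\alpha+1}$
Discrete case:
$p(x) = \frac{x^{-\alpha}}{\zeta(\alpha)}$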
How to fit a power law:
1. Estimate the exponent, $\hat{\alpha}$, through Maximum
Likelihood Estimation.
2. Estimate the lower bound, $x_{\min}$, through the
Kolmogorov-Smirnov (KS) test statistic. (a) For each
candidate $x_{\min}$, calculate the maximum distance, D,
between the CDFs of the data and the fit; (b) take the
$x_{\min}$ that minimizes D.
3. Test the goodness of fit of the power law by getting the
p-value. The p-value is the probability that a data set
of the same size that is truly drawn from the
hypothesized distribution would have a goodness of fit
D worse (larger) than the observed value.
a. Create many (at least 10,000) synthetic
power-law data sets.
b. For each, fit a new power law and calculate
the KS statistic.
c. p-value = fraction of synthetic data sets whose KS
statistic exceeds that of the real data.
d. If the p-value is small, the power law can be ruled
out.
4. Compare the power law with competing distributions
(power law with cutoff, lognormal, stretched
exponential) using a likelihood ratio test.
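
Steps 1 to 3 can be sketched for the continuous case as follows. This is a minimal illustration, not a reference implementation: the function names are made up here, the closed-form MLE $\hat{\alpha} = 1 + n / \sum_i \ln(x_i / x_{\min})$ is the standard estimator for the continuous density given earlier, `min_tail` is an arbitrary safeguard, and the p-value routine is simplified (it resamples only the tail and uses far fewer synthetic sets than the 10,000 recommended above). It assumes continuous data without many tied values.

```python
import math
import random

def fit_alpha(data, x_min):
    """Step 1: MLE of alpha for the continuous power law, using x >= x_min only."""
    tail = [x for x in data if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)

def ks_distance(data, x_min, alpha):
    """Step 2a: maximum distance D between empirical and fitted CDF on the tail."""
    tail = sorted(x for x in data if x >= x_min)
    n = len(tail)
    d = 0.0
    for i, x in enumerate(tail):
        model_cdf = 1.0 - (x / x_min) ** (-alpha + 1.0)
        # the empirical CDF jumps from i/n to (i+1)/n at x; check both sides
        d = max(d, abs(model_cdf - i / n), abs(model_cdf - (i + 1) / n))
    return d

def fit_power_law(data, min_tail=10):
    """Step 2b: scan candidate x_min values, keep the one with the smallest D."""
    best = None
    for x_min in sorted(set(data)):
        if sum(1 for x in data if x >= x_min) < min_tail:
            break  # too few tail points left for a meaningful fit
        alpha = fit_alpha(data, x_min)
        d = ks_distance(data, x_min, alpha)
        if best is None or d < best[2]:
            best = (x_min, alpha, d)
    return best  # (x_min, alpha, D)

def sample_power_law(n, alpha, x_min, rng):
    """Inverse-transform sampling from the continuous power law."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]

def gof_p_value(data, n_synth=100, seed=0):
    """Step 3 (simplified): fraction of synthetic fits with D larger than observed."""
    rng = random.Random(seed)
    x_min, alpha, d_obs = fit_power_law(data)
    n_tail = sum(1 for x in data if x >= x_min)
    exceed = 0
    for _ in range(n_synth):
        synth = sample_power_law(n_tail, alpha, x_min, rng)
        if fit_power_law(synth)[2] > d_obs:
            exceed += 1
    return exceed / n_synth

# Synthetic test data with a known exponent (illustrative values).
rng = random.Random(42)
data = sample_power_law(300, alpha=2.5, x_min=1.0, rng=rng)
x_min_hat, alpha_hat, d = fit_power_law(data)
print(f"x_min = {x_min_hat:.2f}, alpha = {alpha_hat:.2f}, D = {d:.3f}")
print("p-value:", gof_p_value(data, n_synth=10))
```

The scan over every distinct data value is quadratic in the sample size, so it is only practical for small data sets.
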
III. INTRODUCTION TO STATISTICAL INFERENCE
Process of scientific understanding:
-1) Intuition/Observation
0) Hypothesis
1) Measure
2) + Perturb
3) + Control
4) Infer
5) + Reproduce
6) + Generalize
Statistical inference is the process of deducing properties of
an underlying probability distribution by analysis of data.
We try to infer properties of a population by analyzing a
sample.
Statistical inference approaches fall into 2 categories:
1) Estimation
Parametric estimation: Making an educated guess about
the value of a parameter or a value range of parameters.
Prediction: Estimating something other than a parameter, like
the next value $X_{n+1}$ of a sequence $X_1, \dots, X_n$.
2) Hypothesis testing
i. State the null hypothesis $H_0$, such as $\theta = \theta_0$, and an
alternative hypothesis $H_1: \theta \neq \theta_0$. The
simplest explanation is the null hypothesis
or null model. It assumes that chance is
responsible for an observation.
ii. Check the statistical assumptions: independence,
confounding, normality, sample size.
iii. Compute the test statistic $T$ from the data and
compare it to the reference distribution $F_T$ that
describes $T$ if $H_0$ is true.
iv. If $T$ lies in the "extreme regions" of $F_T$, reject
$H_0$.

The p-value is the probability of obtaining a test
statistic at least as extreme as the one that was
actually observed, assuming that the null hypothesis
is true. If the p-value is lower than a predefined
significance level $\alpha$, then $H_0$ is rejected. Equivalently,
if $T$ lies in the "extreme regions" of $F_T$, reject $H_0$.
Otherwise, retain $H_0$, but this does not prove $H_0$!
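
As a worked illustration of the p-value, the sketch below runs a two-sided z-test of a mean. Everything specific in it is an assumption made for the example: the measurements are made up, the standard deviation `sigma` is taken as known, and the reference distribution is the standard normal.

```python
import math
from statistics import NormalDist, mean

def z_test(sample, mu0, sigma):
    """Two-sided z-test of H0: mu = mu0 with known standard deviation sigma."""
    n = len(sample)
    t = (mean(sample) - mu0) / (sigma / math.sqrt(n))   # test statistic T
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t)))    # reference distribution: N(0, 1)
    return t, p_value

sample = [5.1, 4.9, 5.6, 5.8, 5.3, 5.7, 5.2, 5.5]  # made-up measurements
t, p = z_test(sample, mu0=5.0, sigma=0.5)
alpha = 0.05
print(f"T = {t:.2f}, p = {p:.4f}, reject H0: {p < alpha}")
```
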

HYPOTHESIS TESTING IN NETWORKS:
1. Take a null model, like Erdős-Rényi (ER) or the
configuration model.
2. Generate 1000+ realizations with parameters
calibrated from the data (for ER: choose p and N accordingly).
3. This gives you the mean and standard deviation of
measures: degree, clustering, characteristic path length, etc.
4. Calculate the z-score, $z = (x_{obs} - \mu) / \sigma$. If $|z| > 2$,
your network is probably not random.
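
The four steps can be sketched with an ER null model and average clustering as the measure. This is a standard-library illustration under stated assumptions: the "observed" network is a toy ring lattice invented for the example, the helper names are made up, and p is calibrated to the observed link density.

```python
import random
import statistics
from itertools import combinations

def average_clustering(adj):
    """Mean local clustering coefficient; nodes with degree < 2 count as 0."""
    total = 0.0
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

def erdos_renyi(n, p, rng):
    """One G(n, p) realization as an adjacency dict of sets."""
    adj = {i: set() for i in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def z_score(observed_adj, measure, n_realizations=1000, seed=0):
    """z-score of a measure on the observed network against the ER null model."""
    rng = random.Random(seed)
    n = len(observed_adj)
    m = sum(len(nbrs) for nbrs in observed_adj.values()) // 2
    p = 2.0 * m / (n * (n - 1))          # calibrate p to the observed density
    null_values = [measure(erdos_renyi(n, p, rng)) for _ in range(n_realizations)]
    mu = statistics.mean(null_values)
    sigma = statistics.stdev(null_values)
    return (measure(observed_adj) - mu) / sigma

# Toy "observed" network (an assumption for illustration): a ring of 30 nodes,
# each linked to its two nearest neighbours on either side.
n = 30
observed = {i: {(i + d) % n for d in (-2, -1, 1, 2)} for i in range(n)}
z = z_score(observed, average_clustering)
print(f"z = {z:.1f}; |z| > 2: {abs(z) > 2}")
```

Swapping `average_clustering` for another function of the adjacency dict tests a different measure against the same null model.
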

IV. MULTIPLE HYPOTHESIS TESTING AND MAXIMUM LIKELIHOOD ESTIMATION

Problem: If I test n true null hypotheses at level $\alpha$, then
on average I'll still falsely reject $\alpha n$ of them.
Example: 20 hypotheses to test, 0.05 level of significance
(independent tests).
P(at least one significant) = 1 - P(none significant) = $1 - (1 - 0.05)^{20} \approx 0.64$
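
The arithmetic can be checked in a few lines. The function names are made up for this sketch, and the first formula assumes that the tests are independent and that every null hypothesis is true.

```python
def prob_at_least_one_false_rejection(n_tests, alpha):
    """P(at least one significant result) for n_tests independent true nulls."""
    return 1.0 - (1.0 - alpha) ** n_tests

def expected_false_rejections(n_tests, alpha):
    """Average number of true nulls rejected at level alpha."""
    return alpha * n_tests

print(prob_at_least_one_false_rejection(20, 0.05))
print(expected_false_rejections(20, 0.05))
```
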

CONFUSION MATRIX

V. CLUSTER ANALYSIS

  • Cluster analysis is the division of data into groups
    that are meaningful, useful, or both. It is the study
    of techniques for automatically finding classes.
  • Partitional (a division into non-overlapping subsets;
    each data object is in exactly one subset) vs
    hierarchical (a set of nested clusters organized as a tree).
  • Exclusive (each object is assigned to a single cluster)
    vs overlapping (non-exclusive) vs fuzzy (cluster
    membership is a weight).
  • Agglomerative (bottom-up) vs divisive (top-down) methods.

2. PARTITIONAL CLUSTERING.
K-MEANS.
a. Select K points as initial centroids.
b. Repeat:
i. Form K clusters by assigning each point to
its closest centroid.
ii. Recompute the centroid of each cluster.
c. Until: centroids do not change.
Advantages: easy, fast, works well with globular
clusters of the same size and density.
Disadvantages: the number of clusters K must be known in advance,
sensitive to initial conditions, not effective under
several conditions (for example, it can get stuck in a local minimum).
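
A minimal sketch of steps a to c in plain Python, assuming points are tuples of numbers and Euclidean distance. The `seed` and `max_iter` parameters and the rule for empty clusters (keep the old centroid) are choices made for this example.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Plain k-means on a list of coordinate tuples; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # a. select K points as initial centroids
    for _ in range(max_iter):                      # b. repeat
        clusters = [[] for _ in range(k)]
        for p in points:                           # i. assign each point to its closest centroid
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        new_centroids = [                          # ii. recompute the centroid of each cluster
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[j]
            for j, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:             # c. until centroids do not change
            break
        centroids = new_centroids
    return centroids, clusters

# Two small, well-separated groups (made-up data).
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

Running it with several values of `seed` shows the sensitivity to initial conditions noted above.
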

K-MEDOIDS. The prototype is the medoid, which is a
data point itself.

DENSITY-BASED CLUSTERING.
a. Label all points as core, border, or noise points.
(Select a radius epsilon and a parameter MinPts. Count all
points within the radius of a point; this is the density of
the point. A core point has at least MinPts points within
epsilon, a border point is not core but lies within epsilon
of a core point, and every other point is noise.)
b. Eliminate noise points.
c. Put an edge between all core points that are
within epsilon of each other.
d. Make each group of connected core points into
a separate cluster.
e. Assign each border point to one of the clusters
of its associated core points.
Advantages: resistant to noise, can handle clusters
of arbitrary shape and size.
Disadvantages: cannot handle clusters with different
densities, can be computationally expensive.
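
Steps a to e can be sketched in plain Python as below. This is an illustration under assumptions, not a tuned implementation: a point counts itself when measuring density, distances are Euclidean, noise is reported as `None`, and the quadratic neighbour search is only suitable for small data sets.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; None marks noise."""
    n = len(points)
    # a. density: neighbours within eps (each point counts itself)
    nbrs = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps] for i in range(n)]
    core = [len(nbrs[i]) >= min_pts for i in range(n)]
    labels = [None] * n                  # b. noise points simply stay unlabelled
    cluster_id = 0
    for i in range(n):
        if not core[i] or labels[i] is not None:
            continue
        # c./d. flood-fill over core points connected through eps-edges
        labels[i] = cluster_id
        stack = [i]
        while stack:
            u = stack.pop()
            for v in nbrs[u]:
                if core[v] and labels[v] is None:
                    labels[v] = cluster_id
                    stack.append(v)
        cluster_id += 1
    # e. each border point joins the cluster of one neighbouring core point
    for i in range(n):
        if not core[i]:
            for v in nbrs[i]:
                if core[v]:
                    labels[i] = labels[v]
                    break
    return labels

# Made-up data: two dense squares, one border point, one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 1.0),
          (10, 10), (10, 11), (11, 10), (11, 11), (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=4)
print(labels)
```
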

3. GRAPH PARTITIONING
Divides a network into a pre-defined number of
smaller subgraphs. The cut is the set of links that
need to be removed to split a graph. To partition a
graph, we would like to minimize the cut, but also
get similarly sized sets.
Normalized cut = $\frac{\mathrm{Cut}(S,T)}{\mathrm{Vol}(S)} + \frac{\mathrm{Cut}(S,T)}{\mathrm{Vol}(T)}$
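
A small sketch of the normalized cut for an unweighted, undirected graph given as an edge list. The notes do not define Vol; the code assumes the usual convention that Vol(S) is the sum of the degrees of the nodes in S. The example graph (two triangles joined by one link) is made up.

```python
def normalized_cut(edges, s, t):
    """Ncut(S, T) = Cut(S, T) / Vol(S) + Cut(S, T) / Vol(T) for an unweighted graph."""
    s, t = set(s), set(t)
    # Cut(S, T): number of links with one end in S and the other in T
    cut = sum(1 for u, v in edges if (u in s and v in t) or (u in t and v in s))
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Vol(.): sum of node degrees (assumed definition, see the note above)
    vol_s = sum(degree.get(u, 0) for u in s)
    vol_t = sum(degree.get(u, 0) for u in t)
    return cut / vol_s + cut / vol_t

# Two triangles joined by the single link (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
balanced = normalized_cut(edges, {0, 1, 2}, {3, 4, 5})
lopsided = normalized_cut(edges, {0}, {1, 2, 3, 4, 5})
print(balanced, lopsided)
```

The balanced split scores lower than cutting off a single node, which is the behaviour the normalization is meant to produce.
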

KERNIGHAN-LIN ALGORITHM.
a. Divide the network into 2 arbitrary groups of
predefined size.
b. Inspect each pair of nodes between the groups.
Identify the pair whose swap results in the largest
reduction of the cut and swap them.
c. Repeat until no swap improves the cut.
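
The sketch below follows the simplified procedure exactly as described in steps a to c: greedy best-pair swaps until nothing improves. It is not the full Kernighan-Lin pass with node locking, and the function names and example graph are made up for the illustration.

```python
from itertools import product

def cut_size(edges, group_a):
    """Number of links with exactly one end in group_a."""
    return sum(1 for u, v in edges if (u in group_a) != (v in group_a))

def greedy_swap_partition(edges, group_a, group_b):
    """Swap the best cross-group pair repeatedly; returns (group_a, group_b, cut)."""
    a, b = set(group_a), set(group_b)                  # a. initial groups of fixed size
    current = cut_size(edges, a)
    while True:
        best_gain, best_pair = 0, None
        for u, v in product(sorted(a), sorted(b)):     # b. inspect every cross-group pair
            trial = (a - {u}) | {v}
            gain = current - cut_size(edges, trial)
            if gain > best_gain:
                best_gain, best_pair = gain, (u, v)
        if best_pair is None:                          # c. stop when no swap improves the cut
            return a, b, current
        u, v = best_pair
        a, b = (a - {u}) | {v}, (b - {v}) | {u}
        current -= best_gain

# Two triangles joined by the link (2, 3), started from a poor split.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
a, b, cut = greedy_swap_partition(edges, {0, 1, 3}, {2, 4, 5})
print(a, b, cut)
```

Every accepted swap strictly lowers the cut, so the loop always terminates, though possibly in a local minimum.
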

4. HIERARCHICAL CLUSTERING.
Produces a dendrogram that records the sequence
of merges or splits.
a. Compute the proximity matrix, if necessary.
b. Repeat:
i. Merge the closest two clusters (single link,
complete link, group average, centroid).
ii. Update the proximity matrix to reflect the
proximity between the new cluster and the
original clusters.
c. Until: only one cluster remains.
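
A minimal agglomerative sketch covering single link and complete link. For brevity it recomputes cluster proximities from the raw points in each round instead of maintaining and updating a proximity matrix, so it only suits small inputs. The function name, the merge-history format and the data are assumptions of this example.

```python
import math

def agglomerative(points, linkage="single"):
    """Return the merge history as a list of (cluster_a, cluster_b, distance)."""
    clusters = [(i,) for i in range(len(points))]   # start: every point is its own cluster
    agg = min if linkage == "single" else max       # single link vs complete link

    def proximity(c1, c2):
        return agg(math.dist(points[i], points[j]) for i in c1 for j in c2)

    merges = []
    while len(clusters) > 1:                        # until only one cluster remains
        # find and merge the closest pair of clusters
        d, a, b = min(
            (proximity(c1, c2), c1, c2)
            for idx, c1 in enumerate(clusters)
            for c2 in clusters[idx + 1:]
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges

# Four made-up points on a line; clusters are tuples of point indices.
points = [(0.0,), (1.0,), (5.0,), (6.5,)]
single = agglomerative(points, "single")
complete = agglomerative(points, "complete")
print(single)
print(complete)
```

The merge history is the information a dendrogram displays: which clusters were joined and at what distance.
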