










These lecture notes on statistical data analysis cover probability theory, Bayes' theorem, random variables, probability distributions, hypothesis testing, and the chi-square test. Topics include functions of random variables, expectation values, error propagation, the Monte Carlo method, p-values, and the significance of a peak.











1 Probability, Bayes’ theorem, random variables, pdfs
2 Functions of r.v.s, expectation values, error propagation
3 Catalogue of pdfs
4 The Monte Carlo method
5 Statistical tests: general concepts
6 Test statistics, multivariate methods
7 Significance tests
8 Parameter estimation, maximum likelihood
9 More maximum likelihood
10 Method of least squares
11 Interval estimation, setting limits
12 Nuisance parameters, systematic uncertainties
13 Examples of Bayesian approach
14 tba
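Suppose hypothesis H predicts the pdf $f(\vec{x} \mid H)$ for a set of observations $\vec{x} = (x_1, \ldots, x_n)$.
We observe a single point in this space: $\vec{x}_{\rm obs}$.
What can we say about the validity of H in light of the data?
Decide what part of the data space represents less compatibility with H than does the point $\vec{x}_{\rm obs}$. (Not unique!)
[Figure: data space split into a region less compatible with H and a region more compatible with H than the observed point.]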
Probability to observe n heads in N coin tosses is binomial:

$$P(n; p, N) = \frac{N!}{n!\,(N-n)!}\, p^n (1-p)^{N-n}$$

Hypothesis H: the coin is fair (p = 0.5).
Suppose we toss the coin N = 20 times and get n = 17 heads.
Region of data space with equal or lesser compatibility with H relative to n = 17 is: n = 17, 18, 19, 20, 0, 1, 2, 3.
Adding up the probabilities for these values gives p = 0.0026, i.e. p = 0.0026 is the probability of obtaining such a bizarre result (or more so) ‘by chance’, under the assumption of H.
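The coin example can be checked numerically. The sketch below is one possible way to do it, not the lecturer's own code: it orders outcomes by their probability under H, which is one choice of "equal or lesser compatibility" (the choice is not unique), and the function name `binomial_pmf` is an illustrative assumption.

```python
from math import comb

def binomial_pmf(n, N, p):
    """Binomial probability of n successes in N trials."""
    return comb(N, n) * p**n * (1 - p) ** (N - n)

N, n_obs, p_heads = 20, 17, 0.5
prob_obs = binomial_pmf(n_obs, N, p_heads)

# Assumption: "equal or lesser compatibility" means outcomes that are
# no more probable under H than the observed one.
region = [n for n in range(N + 1)
          if binomial_pmf(n, N, p_heads) <= prob_obs * (1 + 1e-12)]

p_value = sum(binomial_pmf(n, N, p_heads) for n in region)
print(region, round(p_value, 4))
```

With this ordering the region consists of the two tails n ≤ 3 and n ≥ 17, and the sum rounds to the 0.0026 quoted in the text.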
Suppose we observe n events; these can consist of:
n_b events from known processes (background)
n_s events from a new process (signal)
If n_s, n_b are Poisson r.v.s with means s, b, then n = n_s + n_b is also Poisson, with mean s + b.
Suppose we measure a value x for each event and find:
[Figure: histogram of x. Each bin (observed) is a Poisson r.v.; means are given by dashed lines.]
In the two bins with the peak, 11 entries found with b = 3.2. The p-value for the s = 0 hypothesis is:

$$P(n \ge 11;\, b = 3.2,\, s = 0) = 5.0 \times 10^{-4}$$
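The p-value for the peak is a Poisson tail probability and can be evaluated directly. A minimal sketch, assuming only the numbers given in the text (11 entries observed, expected background b = 3.2); the helper names are illustrative:

```python
import math

def poisson_pmf(n, mean):
    """Poisson probability of observing n events for a given mean."""
    return math.exp(-mean) * mean**n / math.factorial(n)

def poisson_p_value(n_obs, mean):
    """P(n >= n_obs): one minus the probability of fewer than n_obs events."""
    return 1.0 - sum(poisson_pmf(n, mean) for n in range(n_obs))

# Background-only hypothesis (s = 0): the mean is b alone.
p_peak = poisson_p_value(11, 3.2)
print(f"{p_peak:.2e}")
```

This number assumes the two bins were chosen before looking at the data.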
But... did we know where to look for the peak?
→ give P(n ≥ 11) in any 2 adjacent bins
Is the observed width consistent with the expected x resolution?
→ take x window several times the expected resolution
How many bins × distributions have we looked at?
→ look at a thousand of them, you’ll find a $10^{-3}$ effect
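The last point can be made quantitative with a short calculation. The sketch below assumes the looks are statistically independent, which is an idealization (adjacent bins and related distributions usually are not), and the function name is illustrative:

```python
def prob_at_least_one(p_single, n_looks):
    """Chance that at least one of n_looks independent tests has p <= p_single."""
    return 1.0 - (1.0 - p_single) ** n_looks

# A thousand independent looks, each with a 1e-3 chance of a fluctuation.
p_any = prob_at_least_one(1e-3, 1000)
print(round(p_any, 3))
```

Under this assumption a fluctuation at the $10^{-3}$ level somewhere is more likely than not.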
G. Cowan 10
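Lectures on Statistical Data Analysis

The p-value is a function of the data, and is thus itself a random variable with a given distribution. Suppose the p-value of H is found from a test statistic t(x) as

$$p_H = \int_t^\infty f(t' \mid H)\, dt'$$

The pdf of $p_H$ under assumption of H is

$$g(p_H \mid H) = \frac{f(t \mid H)}{|\partial p_H / \partial t|} = \frac{f(t \mid H)}{f(t \mid H)} = 1 \quad (0 \le p_H \le 1)$$

In general for continuous data, under assumption of H, $p_H$ ~ Uniform[0,1], and it is concentrated toward zero for some (broad) class of alternatives.
[Figure: $g(p_H)$ versus $p_H$, flat under H and peaked toward zero under an alternative.]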
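The uniformity of the p-value under H can be illustrated with a toy simulation. This sketch assumes a test statistic t that follows a standard Gaussian under H and a Gaussian shifted by 1.5 under a hypothetical alternative; both choices, and all names, are assumptions made for illustration only.

```python
import math
import random

def p_value_gauss(t):
    """p-value P(t' >= t) when t is standard Gaussian under H."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def p_value_fractions(shift, n_trials, n_bins, rng):
    """Fraction of simulated p-values falling in each of n_bins equal bins."""
    counts = [0] * n_bins
    for _ in range(n_trials):
        t = rng.gauss(shift, 1.0)              # true mean of t is `shift`
        p = p_value_gauss(t)                   # p-value always computed under H
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return [c / n_trials for c in counts]

rng = random.Random(42)
frac_h = p_value_fractions(0.0, 100_000, 10, rng)    # data generated under H
frac_alt = p_value_fractions(1.5, 100_000, 10, rng)  # hypothetical alternative
print([round(f, 3) for f in frac_h])
print([round(f, 3) for f in frac_alt])
```

Under H each bin holds close to a tenth of the p-values; under the shifted alternative they pile up in the lowest bin.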
So the probability to find the p-value of $H_0$, $p_0$, less than α is

$$P(p_0 \le \alpha \mid H_0) = \alpha$$

We started by defining critical region in the original data space (x), then reformulated this in terms of a scalar test statistic t(x). We can take this one step further and define the critical region of a test of $H_0$ with size α as the set of data space where $p_0 \le \alpha$.

Formally the p-value relates only to $H_0$, but the resulting test will have a given power with respect to a given alternative $H_1$.
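As a worked illustration of size and power, consider a test that rejects $H_0$ when the p-value is at most α. The sketch assumes a Gaussian test statistic with unit width, mean 0 under $H_0$ and a hypothetical mean of 2 under $H_1$; these numbers are not from the lecture.

```python
from statistics import NormalDist

def rejection_probability(alpha, true_mean):
    """P(p0 <= alpha) when t ~ Gaussian(true_mean, 1) and p0 = P(t' >= t | H0)."""
    # For this setup, p0 <= alpha is equivalent to t >= t_cut.
    t_cut = NormalDist().inv_cdf(1.0 - alpha)
    return 1.0 - NormalDist(mu=true_mean, sigma=1.0).cdf(t_cut)

size = rejection_probability(0.05, 0.0)   # under H0: equals alpha
power = rejection_probability(0.05, 2.0)  # under the hypothetical H1
print(round(size, 3), round(power, 3))
```

The size depends only on $H_0$, while the power changes with the alternative that is assumed.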
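Pearson’s $\chi^2$ statistic compares observed values $n_i$ (i = 1, ..., N) with predicted mean values $\nu_i$:

$$\chi^2 = \sum_{i=1}^{N} \frac{(n_i - \nu_i)^2}{\sigma_i^2}$$

If $n_i$ are Gaussian with mean $\nu_i$ and std. dev. $\sigma_i$, i.e., $n_i \sim N(\nu_i, \sigma_i^2)$, then Pearson’s $\chi^2$ will follow the $\chi^2$ pdf (here for $\chi^2 = z$):

$$f_{\chi^2}(z; N) = \frac{1}{2^{N/2}\,\Gamma(N/2)}\, z^{N/2 - 1} e^{-z/2}$$

If the $n_i$ are Poisson with $\nu_i \gg 1$, so that $\sigma_i^2 = \nu_i$, then the Poisson dist. becomes Gaussian and therefore Pearson’s $\chi^2$ statistic here as well follows the $\chi^2$ pdf.

The $\chi^2$ value obtained from the data then gives the p-value:

$$p = \int_{\chi^2}^{\infty} f_{\chi^2}(z; N)\, dz$$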
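A minimal sketch of the Pearson test for Poisson-distributed counts follows. The observed counts and predicted means are hypothetical, chosen only to exercise the code, and the tail integral of the chi-square pdf is evaluated with a standard series for the incomplete gamma function rather than any particular library routine.

```python
import math

def pearson_chi2(n, nu):
    """Pearson chi-square for Poisson counts n with predicted means nu."""
    return sum((ni - vi) ** 2 / vi for ni, vi in zip(n, nu))

def chi2_sf(chi2, ndof, tol=1e-12, max_terms=10_000):
    """Upper-tail integral of the chi-square pdf from chi2 to infinity."""
    a, x = 0.5 * ndof, 0.5 * chi2
    if x <= 0.0:
        return 1.0
    # Series for the regularized lower incomplete gamma function P(a, x).
    term = 1.0 / a
    total = term
    k = 0
    while abs(term) > tol * abs(total) and k < max_terms:
        k += 1
        term *= x / (a + k)
        total += term
    lower = total * math.exp(-x + a * math.log(x) - math.lgamma(a))
    return 1.0 - lower  # loses precision for extremely small p-values

# Hypothetical data: 5 bins, each with predicted mean 10.
n_obs = [12, 9, 15, 7, 11]
nu = [10.0] * 5
chi2_obs = pearson_chi2(n_obs, nu)
p_value = chi2_sf(chi2_obs, len(n_obs))
print(chi2_obs, round(p_value, 3))
```

The number of degrees of freedom is taken equal to the number of bins here, which assumes no parameters were fitted to the data.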
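Recall that for the chi-square pdf for N degrees of freedom,

$$E[z] = N, \qquad V[z] = 2N$$

This makes sense: if the hypothesized $\nu_i$ are right, the rms deviation of $n_i$ from $\nu_i$ is $\sigma_i$, so each term in the sum contributes ~ 1.

One often sees $\chi^2 / N$ reported as a measure of goodness-of-fit. But... better to give $\chi^2$ and N separately. Consider, e.g.,

$\chi^2 = 15$, N = 10 → p-value = 0.13
$\chi^2 = 150$, N = 100 → p-value = $9.0 \times 10^{-4}$

i.e. for N large, even a $\chi^2$ per dof only a bit greater than one can imply a small p-value, i.e., poor goodness-of-fit.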
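To see why the ratio alone is not enough, compare two cases with the same chi-square per degree of freedom of 1.5: a chi-square of 15 with N = 10 and a chi-square of 150 with N = 100. The sketch uses the closed-form tail probability that holds for an even number of degrees of freedom (a finite Poisson sum); the function name is an assumption.

```python
import math

def chi2_sf_even(chi2, ndof):
    """Chi-square tail probability for even ndof, as a finite Poisson sum."""
    if ndof <= 0 or ndof % 2 != 0:
        raise ValueError("this closed form needs a positive even ndof")
    x = 0.5 * chi2
    term = math.exp(-x)          # j = 0 term: exp(-x) * x**0 / 0!
    total = term
    for j in range(1, ndof // 2):
        term *= x / j            # builds exp(-x) * x**j / j!
        total += term
    return total

p_small_n = chi2_sf_even(15.0, 10)
p_large_n = chi2_sf_even(150.0, 100)
print(round(p_small_n, 3), f"{p_large_n:.1e}")
```

Both cases have the same ratio, yet the p-values differ by more than two orders of magnitude.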
Comparing the data with the prediction gives $\chi^2 = 29.8$ for N = 20 dof.
Now need to find p-value, but... many bins have few (or no) entries, so here we do not expect $\chi^2$ to follow the chi-square pdf.
The Pearson $\chi^2$ statistic still reflects the level of agreement between data and prediction, i.e., it is still a ‘valid’ test statistic. To find its sampling distribution, simulate the data with a Monte Carlo program. Here the data sample was simulated $10^6$ times. The fraction of times we find $\chi^2 > 29.8$ gives the p-value: p = 0.… If we had used the chi-square pdf we would find p = 0.073.
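The Monte Carlo procedure can be sketched as follows. The predicted bin means are not given in the text, so the 20 means of 1.0 used here are purely hypothetical, as are the function names; the number of simulated samples is also reduced from $10^6$ to keep the run short. The resulting p-value therefore illustrates the method and is not expected to reproduce the lecture's number.

```python
import math
import random

def poisson(mean, rng):
    """Draw one Poisson variate (Knuth's method, adequate for small means)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def pearson_chi2(n, nu):
    """Pearson chi-square for Poisson counts n with predicted means nu."""
    return sum((ni - vi) ** 2 / vi for ni, vi in zip(n, nu))

def mc_chi2_values(nu, n_trials, rng):
    """Simulate data sets from the prediction and return their chi-square values."""
    return [pearson_chi2([poisson(v, rng) for v in nu], nu)
            for _ in range(n_trials)]

rng = random.Random(1)
nu = [1.0] * 20            # hypothetical predicted means, few entries per bin
chi2_observed = 29.8       # value quoted in the text
chi2_values = mc_chi2_values(nu, 20_000, rng)

# p-value: fraction of simulated data sets with chi-square above the observed one.
p_mc = sum(1 for c in chi2_values if c > chi2_observed) / len(chi2_values)
mean_chi2 = sum(chi2_values) / len(chi2_values)
print(round(p_mc, 3), round(mean_chi2, 2))
```

The mean of the simulated values stays close to the number of bins even with few entries per bin; it is the shape of the tail that departs from the chi-square pdf, which is why the simulation is needed.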