




























































































This report discusses statistical analysis techniques for high energy particle physics data. It introduces new techniques and places familiar methods into the general framework of statistical data analysis. The report is concerned with discovering properties of particle interactions from the data and deducing statistically meaningful statements about those properties. It covers univariate and multivariate data analysis techniques. The emphasis is on ideas and concepts rather than specific details. The report is a lecture note presented at the CERN School of Computing in 1974.
Typology: Lecture notes





























































































Stanford, California 94305
September 1974
Printed in the United States of America. Available from National Technical Information Service, U.S. Department of Commerce, 5285 Port Royal Road, Springfield, Virginia 22151. Price: Printed Copy $5.45; Microfiche $1.. Lectures presented at the CERN School of Computing, Godoysund, Norway, August 11-24, 1974.
UC-34d (E/I)
Useful techniques for the statistical analysis and presentation of
high energy particle physics data are described and discussed.
6.3.2.1 The Mutual Information Measure for Pairwise Dependence
6.3.2.2 An Algorithm for the Direct Measure of Stochastic Independence
6.4 A Multivariate Goodness-of-fit Test
FOOTNOTES AND REFERENCES
The purpose of this report is to acquaint high energy physicists with a variety of techniques for presenting counted data and for making statistical inferences from them. The aim is to introduce new techniques that are not commonly used in high energy particle physics, and to place the methods that are familiar into the general framework of statistical data analysis.

This report will not deal with the equally important problem of data reduction, that is, reducing the raw digitizations from particle detectors to more useful quantities such as particle momenta and angles. Although these calculations are often quite complex, they seldom require statistical inference. (A notable exception is hypothesis discrimination in kinematic fitting.) The computer codes that perform these computations can usually be thought of as computing engines that transform the data from the raw experimental variables to those that are more convenient for further calculations. This report is concerned with these further calculations: how to discover properties of the particle interactions from the data, and how to deduce and present statistically meaningful statements about those properties.

The methods discussed are general in the sense that they can be applied to data from any science whose data have properties similar to those encountered in particle physics. In fact, many of the techniques discussed, although new to particle physics, are commonly used in other sciences, especially pattern recognition and artificial intelligence. The emphasis, however, will be on those methods that can be most profitably applied to the types of data usually encountered in high energy particle physics experiments.
and $p(\mathbf{x}) \ge 0$ for all $\mathbf{x} \in R$, where $R$ is the total region of measurement space. It is clear that $p(\mathbf{x})$ contains all of the information of the experiment. The purpose of experimentation is to infer properties of $p(\mathbf{x})$ from the observed distributions of the measured counts. Conversely, it is the purpose of theory to calculate $p(\mathbf{x})$ from mathematical models and infer from it the results of experiments.

Data analysis is divided into two types, parametric and non-parametric. In parametric (or model dependent) analysis, $p(\mathbf{x})$ is assumed to be a member of a parameterized family of distributions

$$p(\mathbf{x}) = p(\mathbf{a}; \mathbf{x}), \tag{2a}$$

where $\mathbf{a}$ is the set of parameters (discrete, continuous, or both) that specify the particular distribution from the family of possible distributions. The problem of determining the probability density function then becomes the problem of determining the appropriate values for the parameters $\mathbf{a}$. The particular parameterized family can come from the researcher's intuition, invariance principles (such as angular momentum conservation), or specific dynamical models. For example, the Lorentz invariant amplitude squared for a reaction is the probability density in the Lorentz invariant phase space.

In non-parametric (model independent) analysis no a priori information is assumed about the probability density function. In this case one infers the probability density function directly from the counted data, with very little or no information about what form it might take. Histogramming is an example of non-parametric (one-dimensional) density estimation.

There are relative advantages and disadvantages to both types of analysis. When it is properly applied, parametric analysis is usually statistically much more powerful than non-parametric analysis. This is due to the tremendous increase of information gained by restricting the set of all possible probability densities to those of a particular parameterized family.
The results of the analysis, however, crucially depend upon the correctness of this assumption. If the probability density function that gives rise to the data is not a member of the supposed parameterized family, then at best the statistical power is reduced compared to non-parametric techniques, and at worst (usually the case) the results are meaningless. Non-parametric techniques have the advantage of being applicable to a wide range of problems since they require few assumptions concerning the data. It should be kept in mind, however, that even though non-parametric techniques are usually formulated independently of specific probability densities, their statistical performance usually varies with the actual probability density of the data.

Statistical theory is far more developed for parametric analysis than non-parametric. This is especially true for the family of normal or Gaussian distributions
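$$p(\boldsymbol{\mu}, \Sigma; \mathbf{x}) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\left[-\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right] \tag{3}$$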
where the parameters are the location vector $\boldsymbol{\mu}$ and the covariance matrix $\Sigma$. A great many of the statistical techniques in common use were designed to be optimal for normal distributions and are referred to as normal theory techniques. These techniques can lose considerable statistical power when applied to data with non-normal density distributions.
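As a concrete reading of Eq. 3, the following minimal sketch evaluates the normal density for the special case $d = 2$, where the inverse covariance matrix can be written out by hand. The function name `normal_density_2d` and the tuple layout of its arguments are assumptions made for this illustration and are not part of the report.

```python
import math

def normal_density_2d(x, mu, cov):
    """Evaluate the normal density of Eq. 3 for d = 2 at the point x = (x1, x2).

    mu is the location vector (mu1, mu2); cov is the 2x2 covariance matrix
    given as ((s11, s12), (s21, s22)). Hypothetical helper for illustration.
    """
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21              # |Sigma|
    if det <= 0.0:
        raise ValueError("covariance matrix must be positive definite")
    d1, d2 = x[0] - mu[0], x[1] - mu[1]      # components of (x - mu)
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), using the explicit 2x2 inverse.
    quad = (s22 * d1 * d1 - (s12 + s21) * d1 * d2 + s11 * d2 * d2) / det
    norm = (2.0 * math.pi) ** (2 / 2.0) * math.sqrt(det)   # (2 pi)^{d/2} |Sigma|^{1/2}
    return math.exp(-0.5 * quad) / norm
```

With a unit covariance matrix the density at the location vector reduces to $1/(2\pi)$, which gives a quick check of the normalization.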
estimate, $\hat a$, for the value of the parameter, $a$,

$$\hat a = I^{-1}\left[P(x_1, x_2, \ldots, x_N)\right] = I^{-1}\left[\frac{1}{N} \sum_{i=1}^{N} f(x_i)\right].$$

Since the function $f(x)$ was somewhat arbitrary, it is clear that this procedure can be used to construct a variety of statistics for estimating the parameter $a$. However, some will be better than others. For example $\sigma_N$, which regulates the precision with which the parameter $a$ is estimated, depends on $f(x)$ (for $N < \infty$) through Eq. 8.

The field of Statistics is concerned to a great degree with finding good statistics for estimation and determining their properties. Statistics used for estimation (usually called estimators) are rated in terms of four basic properties of their probability density distributions $p_N(a; Y)$; these are consistency, efficiency, bias, and robustness.
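The construction above can be tried numerically. The sketch below assumes, purely for concreteness, an exponential density $p(a; x) = (1/a)\, e^{-x/a}$, which does not come from the text. It compares two choices of $f$: $f(x) = x$, for which $I(a) = a$, and $f(x) = x^2$, for which $I(a) = 2a^2$. The function names and the numbers of events and pseudo-experiments are hypothetical.

```python
import math
import random
import statistics

def estimate(sample, f, inverse_I):
    """a_hat = I^{-1}[(1/N) * sum_i f(x_i)] for a given choice of f."""
    mean_f = sum(f(x) for x in sample) / len(sample)
    return inverse_I(mean_f)

def compare(a_true=2.0, n_events=200, n_experiments=2000, seed=1):
    """Repeat the experiment many times and summarize both estimators.

    Assumes an exponential density with mean a_true (illustrative choice).
    Returns (mean_1, spread_1, mean_2, spread_2).
    """
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n_experiments):
        sample = [rng.expovariate(1.0 / a_true) for _ in range(n_events)]
        # f(x) = x      ->  I(a) = a,       so I^{-1}(m) = m
        first.append(estimate(sample, lambda x: x, lambda m: m))
        # f(x) = x**2   ->  I(a) = 2 a**2,  so I^{-1}(m) = sqrt(m / 2)
        second.append(estimate(sample, lambda x: x * x, lambda m: math.sqrt(m / 2.0)))
    return (statistics.mean(first), statistics.stdev(first),
            statistics.mean(second), statistics.stdev(second))
```

Both statistics estimate the same parameter, but for this assumed density the second has the wider sampling distribution, which is the sense in which some choices of $f$ are better than others.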
3.1 Consistency

An estimator, $Y = T(x_1, x_2, \ldots, x_N)$, is consistent if the following condition holds:

$$\lim_{N \to \infty} p_N(a; Y) = \delta(Y - a). \tag{11}$$

That is, as the number of samples gets arbitrarily large, $p_N(a; Y)$ becomes an arbitrarily narrow function of $Y$ about $a$, and the estimator provides an arbitrarily precise estimate of the parameter $a$. Note that Eqs. 7 and 8 show that the estimator defined by Eq. 6 is consistent. Consistency is nearly always required for an estimator to be considered useful.

3.2 Efficiency

Consistency is concerned with the precision of the estimator for infinite sample size. (In the field of Statistics, a result that holds in the limit of infinite sample size is called an asymptotic result.) Efficiency is concerned with the precision of the estimator for finite sample size $N$. An estimator is called efficient if the variance (mean squared error) of its probability density function

$$V_N = \int_R (Y - a)^2\, p_N(a; Y)\, dY \tag{12}$$

is as small as possible 3) for a given $N$. The square root of the variance, $\sigma_N = \sqrt{V_N}$, is characteristic of the width of $p_N(a; Y)$ about $a$, and thus is directly related to the precision of the estimator. Therefore, an efficient estimator for a given $N$ is one that (loosely speaking) has maximal precision.

The relative efficiency between two estimators is the inverse ratio of the variances of their probability densities for a given sample size, $N$. The efficiency of an estimator is its relative efficiency to an efficient estimator (i.e., efficient estimators are said to have 100% efficiency). This definition of efficiency can be related to the intuitive meaning of the word in the following manner. For large sample size, $N$, the variance of most estimators decreases as $V_N \sim 1/N$ for increasing $N$ (i.e., $\sigma_N \sim 1/\sqrt{N}$). Then the efficiency of an estimator is the inverse ratio of the number of samples (events) it requires to the number an efficient estimator requires for the same precision.

Clearly high efficiency is a desirable property for an estimator. However, an estimator with the highest efficiency is quite often not the most desirable. Sometimes the computational complexity of the most efficient estimator makes it more expensive for a given precision than a less efficient estimator, even though the less efficient estimator requires more events.

3.3 Bias

Like efficiency, bias refers to a property of estimators for finite sample size, $N$. Specifically, the bias of an estimator is defined as

$$b_N = \int_R Y\, p_N(a; Y)\, dY - a, \tag{13a}$$

i.e.,

$$b_N = E_N[Y] - a. \tag{13b}$$

A biased estimator is one with an expected value that is different from the true value of the parameter being estimated. The bias is just the difference between $E_N[Y]$ and the true value of the parameter. Note that, although it might appear to be contradictory, a biased estimator can also be consistent and, conversely, an unbiased estimator can be inconsistent. If a biased estimator is consistent, then from Eq. 11

$$\lim_{N \to \infty} b_N = 0.$$

It may at first seem that bias would be a very undesirable property for an estimator to have. This is generally not the case. It is only important that the bias be small compared to the standard deviation (the square root of the variance, Eq. 12) of the probability density function. Most of the commonly used estimators in particle physics are in fact biased. There are various techniques for reducing bias in estimators but they usually do this at the expense
weights each point by the square of its distance from the center. The percentile estimate on the other hand will be completely unaffected by the mismeasured point. For exploratory data analysis especially, robustness is essential. Robust estimators generally maintain from 60% to 90% efficiencies over wide ranges of data distributions, while non-robust estimators tend to have near 100% efficiency when the data distribution exactly follows the predicted probability density function, and low efficiency when it does not.
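The notions of bias, efficiency, and robustness can be checked with a small simulation. The sketch below is only an illustration under assumptions chosen here, not in the text: the data are unit-normal, the sample mean and the sample median (the 50th percentile) serve as the non-robust and robust location estimators, and the $1/N$-normalized sample variance serves as an example of a biased estimator. All function names and sample sizes are hypothetical.

```python
import random
import statistics

def simulate(n_events=25, n_experiments=4000, seed=7):
    """Pseudo-experiments on unit-normal data (true mean 0, true variance 1).

    Returns (bias of the 1/N variance estimator,
             relative efficiency of the median with respect to the mean).
    """
    rng = random.Random(seed)
    means, medians, var_biased = [], [], []
    for _ in range(n_experiments):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_events)]
        m = sum(x) / n_events
        means.append(m)
        medians.append(statistics.median(x))
        # Variance estimate normalized by N rather than N - 1: biased for finite N.
        var_biased.append(sum((xi - m) ** 2 for xi in x) / n_events)
    bias = sum(var_biased) / n_experiments - 1.0            # b_N = E_N[Y] - a
    v_mean = sum(m * m for m in means) / n_experiments      # V_N with a = 0
    v_median = sum(m * m for m in medians) / n_experiments
    return bias, v_mean / v_median

def with_outlier(n_events=25, bad_value=1000.0, seed=7):
    """Replace one measurement by a grossly mismeasured value."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n_events)]
    x[0] = bad_value
    return sum(x) / n_events, statistics.median(x)
```

In this setting the bias of the variance estimate is small compared to its standard deviation, the median is noticeably less efficient than the mean on clean normal data, and a single mismeasured point moves the mean by many standard deviations while leaving the median essentially where it was.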
$$p(x) = p_i \quad \text{if } x \in r_i, \quad i = 1, \ldots, M.$$

Let $g_i(x)$ be an indicator function for each region, i.e.,

$$g_i(x) = \begin{cases} 1 & \text{if } x \in r_i \\ 0 & \text{otherwise.} \end{cases}$$

Then we have for our estimator of $p(x)$,

$$\hat p_N(x) = \frac{1}{N} \sum_{i=1}^{M} \sum_{j=1}^{N} g_i(x_j)\, g_i(x). \tag{14}$$

From the central limit theorem one has

$$E[\hat p_i] = p_i = \int_{r_i} p(x)\, dx$$

and

$$p_N(\hat p_i) = \frac{1}{\sqrt{2\pi}\, \sigma_i} \exp\left[-\frac{(\hat p_i - p_i)^2}{2 \sigma_i^2}\right],$$

where $\sigma_i = \sigma_0 / \sqrt{N}$, when $\bar n_i = N p_i$ is large. A more careful analysis shows that for any $\bar n_i$, the $n_i = N \hat p_i$ are distributed according to a multinomial distribution

$$p_N(n_1, n_2, \ldots, n_M) = N! \prod_{i=1}^{M} \frac{p_i^{\,n_i}}{n_i!} \tag{17}$$

if the total number of events, $N$, is considered fixed. 4) Note from Eq. 17

$$E[n_i] = \bar n_i, \tag{18}$$

so that the estimator is unbiased. The variance of $n_i$ is

$$V[n_i] = N p_i (1 - p_i), \tag{19a}$$

so that

$$V[\hat p_i] = p_i (1 - p_i)/N. \tag{19b}$$

Equation 19 shows that $\hat p_i$ is a consistent estimator of $p_i$. For $p_i \ll 1$ (a large number of bins, for example) Eq. 19a can be approximated by

$$V[n_i] \simeq N p_i = \bar n_i. \tag{20}$$

Since $\bar n_i$ is usually not known, it seems reasonable to make the further approximation

$$\bar n_i \simeq n_i. \tag{21}$$
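The histogram estimator and its approximate errors can be sketched in a few lines. The code below is a minimal illustration assuming one-dimensional data and regions $r_i$ given as half-open intervals between consecutive bin edges; the function name `histogram_estimate` and its argument layout are hypothetical.

```python
import math

def histogram_estimate(data, edges):
    """Return (n_i, p_hat_i, sigma_i) for regions r_i = [edges[i], edges[i+1]).

    N is the total number of events in `data`; events falling outside all
    regions still count toward N but are not assigned to any region.
    """
    n_total = len(data)
    counts = [0] * (len(edges) - 1)
    for x in data:
        for i in range(len(counts)):
            if edges[i] <= x < edges[i + 1]:     # indicator g_i(x) = 1
                counts[i] += 1
                break
    p_hat = [n / n_total for n in counts]        # p_hat_i = n_i / N
    # Poisson-like approximation of Eqs. 20 and 21: V[n_i] ~ n_i,
    # so the standard deviation of p_hat_i is about sqrt(n_i) / N.
    sigma = [math.sqrt(n) / n_total for n in counts]
    return counts, p_hat, sigma
```

The error estimate $\sqrt{n_i}/N$ relies on both approximations in the text, $p_i \ll 1$ and $\bar n_i \simeq n_i$, so it should be treated with caution for sparsely populated regions.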
orthogonal functions defined on the real line

$$\int_R \phi_i(x)\, \phi_j(x)\, dx = \delta_{ij}$$

and we wish to estimate $p(x)$ from the data points $\{x_j\}_{j=1}^{N}$ with an estimator of the form

$$\hat p_N(x) = \sum_{i=1}^{M} \hat c_i(N)\, \phi_i(x). \tag{24}$$

If the actual probability density function, $p(x)$, were known, then it is easy to show that the variance of the density estimate, $V_N$ (Eq. 23), is minimal for

$$c_i = \int_R \phi_i(x)\, p(x)\, dx = E[\phi_i]. \tag{25}$$

For non-parametric estimation $p(x)$ is not known, so we estimate the integral from the data sample:

$$\hat c_i(N) = \frac{1}{N} \sum_{j=1}^{N} \phi_i(x_j). \tag{26}$$

From the central limit theorem one has (for large $N$)

$$p_N[\hat c_i(N)] = \frac{1}{\sqrt{2\pi}\, \sigma_N^{(i)}} \exp\left[-\frac{1}{2} \left(\frac{\hat c_i - c_i}{\sigma_N^{(i)}}\right)^2\right], \tag{27}$$

where

$$\left(\sigma_N^{(i)}\right)^2 = V(\phi_i)/N = E\left[(\phi_i - E[\phi_i])^2\right]/N.$$

Thus $E[\hat c_i(N)] = c_i$, so that the estimate is unbiased, and $\lim_{N \to \infty} \sigma_N^{(i)} = 0$, so that it is consistent. Combining Eqs. 24 and 26 we have for our density estimate

$$\hat p_N(x) = \frac{1}{N} \sum_{i=1}^{M} \sum_{j=1}^{N} \phi_i(x_j)\, \phi_i(x). \tag{28}$$

The average variance of the estimate, $V_N$ (Eq. 23), is
$$E\left\{\int_R (p - \hat p_N)^2\, dx\right\} = E\left[\int_R (p - \bar p)^2\, dx\right] + E\left[\int_R (\bar p - \hat p_N)^2\, dx\right], \tag{29}$$

where

$$\bar p(x) = \sum_{i=1}^{M} c_i\, \phi_i(x). \tag{30}$$

The first term on the right hand side of Eq. 29 is a constant independent of the data, so that

$$V_N = \int_R (p - \bar p)^2\, dx + \sum_{i=1}^{M} \frac{V(\phi_i)}{N},$$

or

$$V_N = \int_R p^2(x)\, dx + \sum_{i=1}^{M} \left[\frac{V(\phi_i)}{N} - c_i^2\right]. \tag{31}$$

Equation 31 shows that the variance of the estimate is composed of a constant systematic part and a statistical part that approaches zero as $N$ becomes infinite. Thus, like the histogramming approach, the orthogonal function density estimator is inconsistent (unless by some chance $\bar p(x) = p(x)$ for all $x$, i.e., either $M = \infty$ or, for finite $M$, $p(x)$ can be expressed exactly by Eq. 30).

It is no accident that the histogramming and orthogonal function estimators share this property of inconsistency. Inspecting Eq. 14, one sees that it is just a special case of Eq. 28 where the orthogonal functions are the indicator functions $g_i(x)$. Note that the indicator functions of distinct regions are mutually orthogonal, and

$$\sum_{j=1}^{N} g_i(x_j) = n_i.$$

The general orthogonal function approach suffers from generalized analogs of most of the problems discussed for histogramming. The problem of specific bin choice and number of bins becomes the problem of number and specific
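The orthogonal function estimator of Eqs. 24 and 26 can be sketched as follows. The choice of basis is an assumption made only for this illustration: data are taken to lie in $[0, 1]$ and the orthonormal cosine functions $\phi_0(x) = 1$, $\phi_i(x) = \sqrt{2} \cos(i \pi x)$ are used, with the index starting at zero rather than one. The function names are hypothetical.

```python
import math

def phi(i, x):
    """Orthonormal cosine basis on [0, 1]: phi_0 = 1, phi_i = sqrt(2) cos(i pi x)."""
    return 1.0 if i == 0 else math.sqrt(2.0) * math.cos(i * math.pi * x)

def fit_coefficients(data, n_terms):
    """c_hat_i = (1/N) * sum_j phi_i(x_j), the sample average of Eq. 26."""
    n = len(data)
    return [sum(phi(i, x) for x in data) / n for i in range(n_terms)]

def density_estimate(x, coeffs):
    """p_hat_N(x) = sum_i c_hat_i * phi_i(x), the expansion of Eq. 24."""
    return sum(c * phi(i, x) for i, c in enumerate(coeffs))
```

With this basis the coefficient of $\phi_0$ is always one, so the estimate integrates to one over $[0, 1]$, but nothing prevents it from dipping below zero at some points, particularly for small samples or few terms.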
is negative for some value of x, rendering it inadmissible as a probability density function (although it still may be quite useful).

4.1.3 The Rosenblatt Estimator

We will now begin to consider some consistent estimators of univariate probability density. The first is the Rosenblatt or "naive pdf" (probability density
This estimator can be made consistent so long as $h$ tends to zero while the product $hN$ approaches infinity. Parameterizing the window size as

$$h = C N^{\alpha} \tag{39}$$

and choosing $\alpha$ so as to minimize the dominant terms in Eq. 38, one obtains $\alpha = -1/5$ as the value that causes the variance to decrease most rapidly with increasing $N$. A careful analysis shows that the constant should be

$$C = \left[\frac{9\, p(x)}{2\, |p''(x)|^2}\right]^{1/5}. \tag{40}$$

The bias of this estimator (Eq. 37) is easy to understand. For finite window size the estimator $\hat p_N(x)$ (Eq. 32) is an unbiased estimator of the average of the probability density within the window,

$$\bar p(x) = \frac{1}{2h} \int_{x-h}^{x+h} p(x')\, dx'.$$
If $p(x')$ is nonlinear within the window region, then this average will be different from the value of the probability density at the center of the window, $p(x)$. As the window size approaches zero, or as the probability density approaches linearity, this effect will disappear, as reflected by Eq. 37.

The expression for the variance (Eq. 38) shows that, like histogramming, the variance of this estimate is proportional to the value of the probability density (standard deviation proportional to the square root of the probability density). Unlike histogramming, however, this probability density estimate is not piecewise constant over fixed intervals (bins) and does not suffer from the sharp discontinuities that histogramming produces at the boundaries of these intervals ("statistical fluctuations"). This estimator does, of course, suffer from statistical uncertainty as reflected by its variance (Eq. 38). However, the Rosenblatt estimator produces a relatively smooth probability density estimate which (at least in the limit of large sample size) can be shown to be more accurate than histogramming (see below for finite sample comparisons).

4.1.4 Parzen Estimators

The Rosenblatt estimator is a special case of a general class of density estimators known as Parzen estimators or Parzen windows. 6) Let $K(y)$ be a bounded, absolutely integrable function such that
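$$\int_R K(y)\, dy = 1 \quad \text{and} \quad \lim_{|y| \to \infty} |y K(y)| = 0. \tag{42}$$

Then the Parzen window estimators are defined as

$$\hat p_N(x) = \frac{1}{N h(N)} \sum_{i=1}^{N} K\left[\frac{x - x_i}{h(N)}\right].$$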
The function $K(y)$ is called the kernel or window function. The notation $h(N)$ is used to indicate explicitly that the scale parameter for the kernel function depends upon the sample size, $N$. For the Rosenblatt estimator one has

$$K\left[\frac{x - x_i}{h(N)}\right] = \frac{\phi(x; x_i)}{2}, \tag{44}$$

where $\phi(x; x_i)$ is defined in Eq. 34. Other possible kernels are: a) the double exponential function $e^{-|y|}$; b) the standard normal (Gaussian) function; c) the Cauchy function $1/(1 + y^2)$; and d) $\sin^2 y / y^2$. Using procedures analogous to those for the Rosenblatt estimator, one can show that these estimators are biased, with the bias tending to zero quadratically as the scale parameter $h(N)$ approaches zero. Also, the variance of the estimate tends to zero as $1/[N h(N)]$ for increasing sample size, $N$. Thus, these estimators are consistent provided that $h(N) \to 0$ while $N h(N) \to \infty$.
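A Parzen window estimate is short to write down in code. The sketch below assumes the conventional normalization $\hat p_N(x) = \frac{1}{N h} \sum_i K[(x - x_i)/h]$ with a kernel that integrates to one, and it offers two kernels: the rectangular kernel corresponding to the Rosenblatt estimator and the standard normal kernel. The function names are hypothetical, and the default constant `c = 1.0` in `window_size` is only a placeholder, since the constant of Eq. 40 depends on the unknown density and its second derivative.

```python
import math

def rectangular_kernel(y):
    """Rosenblatt ("naive") kernel: 1/2 inside |y| <= 1 and zero outside."""
    return 0.5 if abs(y) <= 1.0 else 0.0

def gaussian_kernel(y):
    """Standard normal kernel."""
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)

def window_size(n_events, c=1.0, alpha=-0.2):
    """h(N) = C * N**alpha (Eq. 39); alpha = -1/5 is the rate quoted in the text."""
    return c * n_events ** alpha

def parzen_estimate(x, data, h, kernel=gaussian_kernel):
    """p_hat_N(x) = 1/(N h) * sum_i K((x - x_i) / h)."""
    n = len(data)
    return sum(kernel((x - xi) / h) for xi in data) / (n * h)
```

A typical call would first choose the scale from the sample size, for example `h = window_size(len(data))`, and then evaluate `parzen_estimate(x, data, h)` on a grid of `x` values. Swapping in `rectangular_kernel` gives the piecewise-constant Rosenblatt form, while the Gaussian kernel gives a smooth curve.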