









































































Material Type: Paper; Class: Applied Multivariate Analysis; Subject: Statistics; University: West Virginia University; Term: Spring 2005;










































































Multivariate analysis is the process of finding patterns in high-dimensional data and formalizing this structure in a model. Although structure is sometimes specified a priori, guided explorations of multivariate space, both numerical and graphical, are more common activities.
The basic objects of statistical analyses are individuals and variables. An individual is often called a case, observation, respondent, subject, etc., depending on the field of investigation. The generic term individual will be used throughout this book except in examples in which the context suggests a more appropriate name. A variable¹ is an abstract object which assigns a unique value to each individual under study. Examples of individuals and variables in several research areas are given in the following table:
Field          Individual   Variables
Business       Company      Financial Characteristics
Ecology        Bird         Habitat Variables
Epidemiology   Individual   Risk Factors
Engineering    Process      Operating Characteristics
Genomics       Sample       Genes
Geology        Wells        Site and Production Variables
Psychology     Subjects     Personality Trait Variables
Sociology      City         Crime Variables
Variables defined on the same individuals define a data table. The data values, determined for each individual by each variable, are arranged in a table with rows representing individuals and columns representing variables. The ith
¹ The term random variable is often used in mathematical statistics to denote a rule for assigning values to the outcomes of an experiment.
according to their nature: names, grades, ranks, counts, counted fractions, amounts, and balances. This book classifies variables functionally according to the applicable type of statistical analysis: categorical, ordered categorical, discrete numerical, and continuous numerical. The “values” taken by a categorical variable are names or levels. An ordered categorical variable has categories which represent some order or relative position; its values could, for example, represent ranks. Discrete numerical variables take on distinct values in a given range. These values are generally counts. A continuous numerical variable can take on any value in a given range; its values are generally measured in some units.
A variable may also have a role, and possibly a parameterized distribution, in addition to its type. The role identifies whether the variable will be an outcome variable (Y), an explanatory variable (X), a conditioning or “nuisance” variable (Z), a weight variable (W), a frequency variable (F), or a label identifier (L). The distributions commonly used in applications include the binomial and multinomial for categorical variables; the Poisson, binomial, and negative binomial for discrete numerical variables; and the normal, log-normal, and gamma distributions for continuous numerical variables.
The type, role, and distribution constrain the class of plots and analyses which should be applied to a variable or a group of variables. For example, if the species count recorded on sites is assigned the role “Y”, the type “discrete”, and the distribution “Poisson”, and the habitat diversity score is assigned the role “X” and the type “continuous”, then a Poisson regression may be the most appropriate analysis.
The variable attributes described above are collectively called metadata, i.e., they are data about the data. In order to do guided explorations or modeling of the data, a rich metadata environment is essential.
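The idea that type, role, and distribution jointly narrow down the analysis can be sketched as a small metadata record. This is only an illustration: the class name VariableMeta, the function suggest_analysis, and the variable names are hypothetical, and the single rule encoded is the Poisson regression example from the text, not a general selection scheme.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VariableMeta:
    """Hypothetical metadata record for one variable."""
    name: str
    vtype: str                          # "categorical", "ordered categorical", "discrete", "continuous"
    role: str                           # "Y", "X", "Z", "W", "F", or "L"
    distribution: Optional[str] = None  # e.g. "Poisson", "normal"; None if unspecified


def suggest_analysis(variables):
    """Return a suggested analysis for the one case described in the text."""
    outcomes = [v for v in variables if v.role == "Y"]
    explanatory = [v for v in variables if v.role == "X"]
    if len(outcomes) == 1 and explanatory:
        y = outcomes[0]
        if y.vtype == "discrete" and y.distribution == "Poisson":
            return "Poisson regression"
    return "no suggestion"


species = VariableMeta("species_count", "discrete", "Y", "Poisson")
diversity = VariableMeta("habitat_diversity", "continuous", "X")
suggestion = suggest_analysis([species, diversity])
print(suggestion)
```

A fuller environment would presumably carry many more rules; the point here is only that the metadata, not the raw values, drives the choice.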
Other types of metadata are defined for each individual in the sample. These attributes are called state variables and their assigned “values” are useful within a statistical computing and graphics environment. State variables are not true variables since their values are determined by the investigator. The most common state characteristics are color and symbol, which determine the visual representation of the individual in plots; mask, which determines whether or not the individual is included in the analyses and plots; and label, which defines an identifier.
1.3 Individual and Variable Space
Assuming all variables are numeric, the rows of the data matrix consist of vectors of length p; the columns consist of vectors of length n. Individual space is a p-dimensional space in which the n individuals (rows) are represented by points. The scatter plot is a special case when p = 2. Variable space is an n-dimensional space in which the p variables are represented by vectors. This convention of representing individuals by points and variables by vectors is also used in biplots in which both individuals and variables are represented in the same space. Figure 1.1 conceptually illustrates individual and variable space.
Figure 1.1: Conceptual Views of Individual and Variable Spaces
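The two views of the same data table can be made concrete with a small sketch. The data values below are hypothetical. The last step relies on a standard identity that is not stated in the text: once each variable is centered at its mean, the cosine of the angle between two variable vectors equals their Pearson correlation, which is one reason vectors are a natural representation in variable space.

```python
import math

# Hypothetical data table: n = 4 individuals (rows), p = 2 variables (columns).
data = [
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
]
n, p = len(data), len(data[0])

# Individual space: n points, each with p coordinates (one per variable).
individuals = [tuple(row) for row in data]

# Variable space: p vectors, each with n coordinates (one per individual).
variables = [[row[j] for row in data] for j in range(p)]


def centered(v):
    """Subtract the mean from every coordinate of v."""
    m = sum(v) / len(v)
    return [x - m for x in v]


def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


cos_angle = cosine(centered(variables[0]), centered(variables[1]))
print(len(individuals), "points in", p, "dimensions")
print(len(variables), "vectors in", n, "dimensions")
print("cosine between centered variable vectors:", cos_angle)
```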
1.4 An Overview
This book presents several classes of multivariate models. Generally, these models are based on the multivariate normal distribution. However, extensions are developed when the normality assumption is not tenable or when alternative distributional assumptions are more appropriate.
Numerical Summaries (Chapter 2)
Arithmetic means, standard deviations, covariances, and correlation coefficients are the primary summary statistics for continuous numerical variables. Robust versions of these statistics are increasingly being used. Counts and cumulative counts are the principal summary statistics for categorical and ordered categorical variables, respectively. Counts also naturally arise from Poisson or binomial distributions for discrete numerical variables.
Graphical Techniques (Chapter 3)
Exploratory plots are used to find patterns and structure in the data. Trends, relationships, outliers, and unusual behavior are often revealed, which would otherwise go unnoticed. Plots should be interactive and dynamic; have the ability to respond to new or changed information; be able to activate other displays or analyses. Multiple plots (views) of the same data table should be linkable to show relationships among several variables even if they are of different types. The most useful plot for viewing and interacting with multivariate data is the 2-dProj, which dynamically projects high-dimensional data into a two-dimensional plane using orthogonality constraints [Hurley and Buja, 1990]. Other graph objects, including univariate and bivariate plots, are discussed along with the tools for controlling them.
related to certain variables.
Correspondence Analysis (Chapter 6)
Count data, whether in the form of “abundances” in the original data table or in terms of frequency summaries for categorical variables, are not amenable to principal components, or related techniques, which assume underlying continuous numerical scales. Furthermore, the number of variables may greatly exceed the number of individuals.
Correspondence analysis provides a graphical representation of the individuals and variables in a type of biplot. Canonical correspondence analysis, like redundancy analysis, constrains the axes of the biplot to be linear combinations of the explanatory variables $X_1, X_2, \ldots, X_q$. Partial canonical correspondence removes the linear effect of the nuisance variables $Z_1, Z_2, \ldots, Z_r$.
Factor Analysis (Chapter 7)
Factor analysis, like principal component analysis, aims to determine the effective dimensionality of the outcome space. It differs in that a model is specified to explain the variation in the Y's in terms of q underlying, but unmeasurable, factor variables. The factor variables are indeterminate unless restrictions are placed on their covariance structure. Different models are possible depending on the method of ‘extracting’ the factors.
Discriminant Analysis (Chapter 8)
The purpose of discriminant analysis is to differentiate among the g groups defined by the labels of a categorical variable Y using continuous numerical variables $X_1, X_2, \ldots, X_q$. The discriminant variables are the variables in the X-space which maximally separate the groups defined by Y. Y may or may not represent a design variable. The results are most meaningful when the X's are assumed to follow a multivariate normal distribution. Discriminant analysis is equivalent to a one-way multivariate analysis of variance. Classification uses the discriminant function to classify an individual into one of the g groups.
Multivariate General Linear Models (Chapter 9)
These models fit continuous numerical outcome variables $Y_1, Y_2, \ldots, Y_p$ in terms of linear combinations of the explanatory variables $X_1, X_2, \ldots, X_q$. Three cases, depending on the types of the X's, are considered. Multivariate regression, multivariate analysis of variance (MANOVA), and multivariate analysis of covariance (MANCOVA) result by considering the X's to be continuous numerical, categorical, and a mixture of continuous numerical and categorical, respectively. The X's are considered to be design variables. The results are most meaningful when the Y's are assumed to be multivariate normal. A common type of problem concerns repeated measurements made on the same individual over time (or space). Provisions must be made for the correlations induced by the repeated measures.
Exploratory Projection Pursuit (Chapter 10)
Projection pursuit analyzes high-dimensional individual space, i.e., p > 3, using projections into low-dimensional spaces. Meaningful projections are found by optimizing an objective function, often defined in terms of an index. Objective functions can be defined in terms of the covariance matrix, but these functions simply find classical multivariate solutions such as principal component or discriminant projections. The principal use of projection pursuit is to find nonlinearities in high-dimensional data which cannot be found by standard multivariate techniques.
Cluster Analysis (Chapter 11)
Distances can be computed between all possible pairs of individuals in a sample or similarities can be computed between pairs of variables. A collection of techniques has been developed to analyze data summarized in terms of the resulting dissimilarity or similarity matrices. Hierarchical cluster analysis groups individuals or variables together in such a way that the groupings become successively more diffuse as their size increases. The relationships among the individuals or variables are represented in a tree structure. Non-hierarchical forms of cluster analysis create a specified number of groupings (at least approximately) of individuals or variables in such a way that individuals within a group are more similar than individuals across groups.
Multidimensional Scaling (Chapter 12)
Multidimensional scaling attempts to find a representation of the individuals or variables in a low-dimensional space such that their interpoint distances correspond to their dissimilarities or similarities. Both metric and nonmetric methods are available and extensions allow a type of biplot to be generated.
2.2 Measures of Scale and Association
The sample covariance between two variables, $Y_j$ and $Y_k$, is a measure of their association and it is defined by:
\[ \operatorname{cov}(Y_j, Y_k) = s_{jk} = \frac{\sum_i (y_{ij} - \bar{y}_j)(y_{ik} - \bar{y}_k)}{n - 1} \]
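The sample covariance between $Y_j$ and $Y_k$ estimates the corresponding population covariance, denoted by $\sigma_{jk}$. The sample covariance matrix of $Y' = [\, Y_1 \;\; Y_2 \;\; \cdots \;\; Y_p \,]$ is given by:
\[ S = \begin{bmatrix} s_1^2 & s_{12} & \cdots & s_{1p} \\ s_{21} & s_2^2 & \cdots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{p1} & s_{p2} & \cdots & s_p^2 \end{bmatrix} = [\, s_{jk} \,] \]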
where $s_{jj} = s_j^2$ and $s_{jk} = s_{kj}$. The jth diagonal element of S, $s_j^2 = \operatorname{var}(Y_j)$, is called the sample variance of $Y_j$. The positive square root of the sample variance of $Y_j$, $s_j$, is called the sample standard deviation. The standard deviation is a measure of spread (or dispersion) of the variable $Y_j$. The direction of the relationship between $Y_j$ and $Y_k$ is determined by the sign of $s_{jk}$, i.e.,
$s_{jk} > 0 \Rightarrow$ $Y_j$ and $Y_k$ are positively linearly related,
$s_{jk} = 0 \Rightarrow$ $Y_j$ and $Y_k$ are not linearly related,
$s_{jk} < 0 \Rightarrow$ $Y_j$ and $Y_k$ are negatively linearly related.
A shortcoming of the covariance is that it does not quantify the strength of the linear relationship between two variables. The covariance can be made arbitrarily large in absolute value by changing the units of measurement. The sample correlation coefficient overcomes this objection. It is defined by:
\[ \operatorname{corr}(Y_j, Y_k) = r_{jk} = \frac{s_{jk}}{s_j s_k} \]
It can be shown that $-1 \le r_{jk} \le 1$. This quantity is also known as the Pearson product-moment correlation coefficient. The interpretation of $r_{jk}$ is similar to that of $s_{jk}$ except that $r_{jk}$ is bounded. A value of 0 implies no linear association. As $r_{jk} \to 1$ the strength of the positive linear relationship increases and as $r_{jk} \to -1$ the strength of the negative linear relationship increases. The sample correlation matrix of $Y' = [\, Y_1 \;\; Y_2 \;\; \cdots \;\; Y_p \,]$ is defined by:
\[ R = \begin{bmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{21} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{bmatrix} = [\, r_{jk} \,] \]
where $r_{jk} = r_{kj}$. The correlation matrix is the covariance matrix of the standardized variables, which are defined by:
\[ \frac{Y_j - \bar{y}_j}{s_j}, \quad j = 1, 2, \ldots, p \]
A standardized variable has a sample mean of 0 and sample standard deviation of 1. The correlation matrix is easily defined in terms of the sample covariance matrix. Let $D_{1/s_j}$ be a diagonal matrix with the reciprocals of the standard deviations on the diagonal. Then:
\[ R = D_{1/s_j} \, S \, D_{1/s_j} \]
defines the sample correlation matrix.
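The definitions in this section can be checked numerically with a minimal sketch. The data and the function names are hypothetical. The code computes S with the n - 1 divisor, forms R by scaling S on both sides with the reciprocal standard deviations (which is what pre- and post-multiplying by the diagonal matrix does), and then rescales one variable by 100 to show that the covariance changes with the units of measurement while the correlation does not.

```python
import math


def mean_vector(rows):
    """Sample mean of each column of an n x p table."""
    n, p = len(rows), len(rows[0])
    return [sum(r[j] for r in rows) / n for j in range(p)]


def covariance_matrix(rows):
    """Sample covariance matrix S with divisor n - 1."""
    n, p = len(rows), len(rows[0])
    m = mean_vector(rows)
    return [[sum((r[j] - m[j]) * (r[k] - m[k]) for r in rows) / (n - 1)
             for k in range(p)] for j in range(p)]


def correlation_matrix(S):
    """R = D S D, where D is diagonal with entries 1 / s_j."""
    p = len(S)
    d = [1.0 / math.sqrt(S[j][j]) for j in range(p)]
    return [[d[j] * S[j][k] * d[k] for k in range(p)] for j in range(p)]


# Hypothetical data: n = 4 individuals, p = 2 variables.
Y = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
S = covariance_matrix(Y)
R = correlation_matrix(S)

# Change the units of the first variable (e.g. metres to centimetres).
Y_scaled = [[100.0 * a, b] for a, b in Y]
S_scaled = covariance_matrix(Y_scaled)
R_scaled = correlation_matrix(S_scaled)

print("s12 before and after rescaling:", S[0][1], S_scaled[0][1])
print("r12 before and after rescaling:", R[0][1], R_scaled[0][1])
```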
2.3 Derived Variables
A central feature of multivariate analysis for continuous numerical variables is the determination of variables $V_1, V_2, \ldots, V_s$ in Y-space that satisfy an optimality criterion. These derived variables define a subspace V with dim(V) ≤ s. Since $V_j$ is a variable in Y-space, it is expressible as a linear combination of the basis variables, $Y_1, Y_2, \ldots, Y_p$. Specifically,
\[ V_j = a_{j1} Y_1 + a_{j2} Y_2 + \cdots + a_{jp} Y_p = a_j' Y \]
If the sample mean vector and covariance matrix of Y are known, it is easy to determine the sample mean and variance of $V_j$. They are given by:
\[ \bar{v}_j = a_j' \bar{y}; \qquad s_{v_j}^2 = a_j' S a_j. \]
Since the sample variance is always non-negative and $a_j$ is arbitrary, it is clear that S is a positive semi-definite matrix.
Likewise, the sample mean vector and sample covariance matrix of $V' = [\, V_1 \;\; V_2 \;\; \cdots \;\; V_s \,]$ are given by:
\[ \bar{v} = A' \bar{y}, \qquad S_V = A' S A, \]
where A is the p × s matrix of coefficients with the jth column given by $a_j$. As a special case, the covariance between two derived variables is:
\[ \operatorname{cov}(V_j, V_k) = a_j' S a_k. \]
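A short numerical check may help make these formulas concrete. The data and the coefficient matrix A below are hypothetical (the two derived variables are the sum and the difference of the two original variables). The sketch computes the mean vector and covariance matrix of V in two ways: directly from the derived values, and from the summaries of Y alone.

```python
def mean_vector(rows):
    """Sample mean of each column of an n x p table."""
    n, p = len(rows), len(rows[0])
    return [sum(r[j] for r in rows) / n for j in range(p)]


def covariance_matrix(rows):
    """Sample covariance matrix with divisor n - 1."""
    n, p = len(rows), len(rows[0])
    m = mean_vector(rows)
    return [[sum((r[j] - m[j]) * (r[k] - m[k]) for r in rows) / (n - 1)
             for k in range(p)] for j in range(p)]


# Hypothetical data: n = 4 individuals, p = 2 variables.
Y = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]

# Coefficient matrix A (p x s); column j holds a_j.
# Here V1 = Y1 + Y2 and V2 = Y1 - Y2.
A = [[1.0, 1.0],
     [1.0, -1.0]]
p, s = len(A), len(A[0])

# Route 1: evaluate the derived variables for each individual, then summarize.
V = [[sum(row[i] * A[i][j] for i in range(p)) for j in range(s)] for row in Y]
v_bar_direct = mean_vector(V)
S_V_direct = covariance_matrix(V)

# Route 2: use only the mean vector and covariance matrix of Y.
y_bar = mean_vector(Y)
S = covariance_matrix(Y)
v_bar_formula = [sum(A[i][j] * y_bar[i] for i in range(p)) for j in range(s)]
S_V_formula = [[sum(A[i][j] * S[i][l] * A[l][k]
                    for i in range(p) for l in range(p))
                for k in range(s)] for j in range(s)]

print("mean of V (direct, formula):", v_bar_direct, v_bar_formula)
print("cov of V (direct):", S_V_direct)
print("cov of V (formula):", S_V_formula)
```

The diagonal entries of the result are the variances a_j' S a_j, which are non-negative for any choice of coefficients, in line with the positive semi-definiteness of S.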
\[ \bar{y}^* = \frac{\sum_i w(d_i) \, y_i}{\sum_i w(d_i)} \]
\[ S^* = \frac{\sum_i w^2(d_i) \, (y_i - \bar{y})(y_i - \bar{y})'}{\sum_i w^2(d_i)} \]
The weights will be 1 or near 1 for well-behaved data. On the other hand, outliers are detected as those observations with low weights. The above procedure is computer intensive. An alternative is to stop the iteration after a specified number of steps. One-step and two-step estimators are commonly used. The robustness properties of these latter estimators are generally good.
These types of M-estimators have two undesirable properties. The first concerns breakdown: the breakdown of an estimator is the proportion of outliers it can handle, and the Huber M-estimator has a breakdown of $\le 1/p$. This generally is not a problem unless p is large relative to n. The second potential problem is that $S^*$ may become singular at some stage of the iteration. This can only happen with unusual patterns of outliers and when p is large relative to n.
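The one-step and two-step idea can be illustrated with a univariate sketch, so that the covariance matrix reduces to a variance. Several things here are assumptions rather than part of the text: the distance d_i is taken to be the absolute standardized deviation, the weight function is a Huber-type weight with a hypothetical cutoff d0 = 1.5, the iteration starts from the classical mean and standard deviation, and the deviations in the scale update are taken about the updated center. The divisors follow the weighted formulas displayed above.

```python
import math


def huber_weight(d, d0=1.5):
    """Assumed weight function: 1 inside the cutoff d0, d0 / d beyond it."""
    return 1.0 if d <= d0 else d0 / d


def m_step(y, center, scale, d0=1.5):
    """One reweighting step: returns updated center, updated scale, weights used."""
    d = [abs(v - center) / scale for v in y]          # assumed distance measure
    w = [huber_weight(di, d0) for di in d]
    new_center = sum(wi * v for wi, v in zip(w, y)) / sum(w)
    w2 = [wi * wi for wi in w]
    # Assumption: deviations are measured from the updated center.
    new_var = sum(q * (v - new_center) ** 2 for q, v in zip(w2, y)) / sum(w2)
    return new_center, math.sqrt(new_var), w


# Hypothetical sample with one gross outlier.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
n = len(y)
classical_mean = sum(y) / n
classical_sd = math.sqrt(sum((v - classical_mean) ** 2 for v in y) / (n - 1))

one_step = m_step(y, classical_mean, classical_sd)
two_step = m_step(y, one_step[0], one_step[1])

print("classical mean:", classical_mean)
print("one-step center and weights:", one_step[0], one_step[2])
print("two-step center and weights:", two_step[0], two_step[2])
```

In this example only the outlier receives a weight below 1, and its weight drops further at the second step, which is the behavior described in the paragraph above. The slow movement of the center also shows why the choice of starting values and cutoff matters in practice.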
Graphs are powerful tools for revealing the structures and patterns, or the idiosyncrasies, found in multivariate data. Some plots are designed to assess the underlying assumptions or the fit of a model, whereas others represent the structure of a multivariate dataset. This chapter examines both types of plots.
Most multivariate models assume an underlying normal distribution. This assumption will be assessed graphically in several ways. Although the objective of these techniques is to assess multivariate normality, the process will begin with univariate plots and will be extended naturally to the multivariate case.
Sample quantiles provide critical information about the distribution of the sample values of a variable. The qth sample quantile, denoted by $\tilde{y}_q$, is a value along the measurement scale with a proportion q or less of the data less than $\tilde{y}_q$ and a proportion 1 − q or less greater than $\tilde{y}_q$, i.e., for the random variable Y, $\tilde{y}_q$ is any value satisfying:
\[ \frac{\#\{y_i < \tilde{y}_q\}}{n} \le q \quad \text{and} \quad \frac{\#\{y_i > \tilde{y}_q\}}{n} \le 1 - q. \]
The value of $\tilde{y}_q$ may not be unique because a sample value may not satisfy the definition, e.g., the median when the number of observations is even. In this case, $\tilde{y}_q$ is determined by interpolating linearly between adjacent ordered values in the quantile plot described below. Quantile plots consist of graphing the $\tilde{y}_q$ on the y-axis versus q on the x-axis. Typically, equally spaced values of q between 0 and 1 are chosen. Unless n is large, the values of q are chosen to be in one-to-one correspondence with the sample values.
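The two inequalities in the definition of a sample quantile can be checked directly. The sketch below uses hypothetical data and a hypothetical helper name; it tests a handful of candidate values for the median (q = 0.5) in a sample with an even number of observations and in one with an odd number, to illustrate when the quantile is not unique.

```python
def satisfies_quantile_definition(y, q, value):
    """True if at most a proportion q of the data lies below value
    and at most a proportion 1 - q lies above it."""
    n = len(y)
    below = sum(1 for v in y if v < value)
    above = sum(1 for v in y if v > value)
    return below / n <= q and above / n <= 1 - q


# Even n: a range of values qualifies as the median.
y_even = [1.0, 2.0, 3.0, 4.0]
candidates_even = [1.9, 2.0, 2.5, 3.0, 3.1]
median_even = [c for c in candidates_even
               if satisfies_quantile_definition(y_even, 0.5, c)]

# Odd n: only the middle sample value qualifies.
y_odd = [1.0, 2.0, 3.0]
candidates_odd = [1.9, 2.0, 2.1]
median_odd = [c for c in candidates_odd
              if satisfies_quantile_definition(y_odd, 0.5, c)]

print("even n, qualifying candidates:", median_even)
print("odd n, qualifying candidates:", median_odd)
```

With an even number of observations every value between the two middle order statistics passes the check, which is why a convention such as linear interpolation is needed to pick a single value.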
the two-sample problem. It is constructed by plotting the quantiles of one sample against those of the other, using the $q_i$ corresponding to the smaller sample size. This approach can be used to assess the distributional assumptions underlying a sample. In this case, the quantiles of the sample are plotted against the corresponding quantiles of a theoretical distribution. This is called a probability quantile-quantile plot, or a probability plot for short. The most common case is the normal probability plot. Denote the cumulative distribution function (cdf) of the random variable Y by F, where F is defined by:
\[ F(y) = P(Y \le y) \quad \text{for } -\infty < y < \infty. \]
The cumulative distribution function is nondecreasing with a range of [0, 1]. The qth theoretical quantile is any value $y_q$ satisfying:
\[ F(y_q) = q. \]
If Y is discrete, $y_q$ may not be unique. In this case, choose $y_q$ as the middle value in the range of values satisfying the definition. Let $y_{q_i}$ be the theoretical quantile corresponding to $q_i$. If F is true, then:
\[ y_{q_i} \approx y_{(i)} \quad \text{for } i = 1, 2, \ldots, n. \]
The probability plot is the graph of the pairs $(y_{q_i}, y_{(i)})$. The points should fall approximately along a straight line with intercept 0 and slope 1 if F is true. Points systematically deviating from this line provide evidence that the data are not consistent with F. “Slight” deviations from linearity will occur even if F is true due to sampling variability, but no patterns or large deviations should occur.
The development so far assumes that a plausible underlying F is completely specified. This is not likely to be the case. Generally F is a member of a parametric family. If F is determined by location and scale parameters, such as the normal, the construction of a probability plot is simplified. Assume that $F(y) = G\left(\frac{y - \theta_1}{\theta_2}\right)$, i.e., F only depends on a location parameter $\theta_1$ and a scale parameter $\theta_2$. Then the qth quantile of F is given by:
\[ y_q^F = \theta_1 + \theta_2 \, y_q^G, \]
where $y_q^G$ is the qth quantile of G. This follows since:
\[ q = P(Y \le y_q^F) = F(y_q^F) = G\left(\frac{y_q^F - \theta_1}{\theta_2}\right) = G(y_q^G). \]
The cumulative distribution function G is called the standard or canonical distribution of the family of distributions F indexed by $\theta_1$ and $\theta_2$. G has a location parameter of 0 and a scale parameter of 1.
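The quantile relation for a location-scale family can be verified numerically for the normal case, using NormalDist from the Python standard library. The parameter values below are hypothetical; the location parameter plays the role of the mean and the scale parameter the role of the standard deviation.

```python
from statistics import NormalDist

theta1, theta2 = 10.0, 2.0               # hypothetical location and scale
F = NormalDist(mu=theta1, sigma=theta2)   # member of the family
G = NormalDist()                          # canonical distribution: location 0, scale 1

probabilities = [0.1, 0.25, 0.5, 0.75, 0.9]
# Largest discrepancy between the quantile of F and theta1 + theta2 * (quantile of G).
max_gap = max(abs(F.inv_cdf(q) - (theta1 + theta2 * G.inv_cdf(q)))
              for q in probabilities)

for q in probabilities:
    print(q, F.inv_cdf(q), theta1 + theta2 * G.inv_cdf(q))
print("largest discrepancy:", max_gap)
```

Because the quantiles of F are a linear function of the quantiles of G, plotting sample values against the canonical quantiles gives a straight line whenever the family fits, whatever the parameter values are.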
A probability plot based on the quantiles of F can be constructed without estimating $\theta_1$ or $\theta_2$. If Y ∼ F, then the points $(y_{q_i}^G, y_{(i)})$ will fall approximately along a line with an intercept of $\theta_1$ and a slope of $\theta_2$. The adequacy of F is based on linearity and it is independent of the parameter values. If linearity holds, it is possible to estimate $\theta_1$ and $\theta_2$ by the intercept and slope of the least squares regression fit, but this procedure generally is not optimal.
The normal probability plot is used for assessing normality. In this case the location parameter is the mean μ and the scale parameter is the standard deviation σ. The canonical distribution function is called the standard normal, which has a mean of 0 and a standard deviation of 1.
The overall pattern of the points often falls into one of several classes, which define specific types of deviations from normality. In order to classify the pattern, it is convenient to fit a line through the points. It would seem reasonable to fit a linear least squares line, but this is inappropriate if normality does not hold. A better alternative is to draw a line through the lower and upper quartiles, i.e., $(y_{0.25}^G, \tilde{y}_{0.25})$ and $(y_{0.75}^G, \tilde{y}_{0.75})$, respectively. In order to provide a contrast to this robust (quartile) linear fit, a locally linear or quadratic regression (loess) curve can be fitted to the points. The loess curve reveals nonlinearities without the limitations of a parametric model.
Five patterns are commonly observed. First, approximate linearity supports the assumed normal distribution. The curve is convex, or concave upwards, for positively skewed data, and concave downwards for negatively skewed data. If the data are positively skewed, the values in the upper tail of the distribution are spread out relative to the corresponding normal quantiles, whereas the values in the lower tail are less spread out than the corresponding normal quantiles. An opposite result holds for negatively skewed data. Suppose the points in the upper tail are above the line and those in the lower tail are below the line. This indicates that the values in both tails are more spread out than the corresponding normal quantiles. A distribution with this property is said to be thick-tailed or leptokurtic. Finally, consider the case in which the points in the upper tail are below the line and the points in the lower tail are above the line. This indicates that the values in the tails are too close relative to the corresponding normal quantiles. A distribution with this property is said to be short-tailed or platykurtic. The loess fit easily exposes these patterns if they are present.
Outliers can easily be identified from the normal quantile plot by drawing appropriate horizontal reference lines. The sample interquartile range is defined by:
\[ \operatorname{iqr}(Y) = \tilde{y}_{0.75} - \tilde{y}_{0.25}. \]
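The quartile line and the interquartile range can be sketched as follows. The plotting positions q_i = (i - 0.5)/n are an assumption, since the definition of the q_i is not part of this excerpt, and the function names are hypothetical. Sample quartiles are obtained by linear interpolation between adjacent ordered values, as described for the quantile plot. The example data are constructed to lie exactly on a line with intercept 5 and slope 2 against the standard normal quantiles, so the fitted quartile line should recover those values.

```python
from statistics import NormalDist


def plotting_positions(n):
    """Assumed plotting positions q_i = (i - 0.5) / n for i = 1, ..., n."""
    return [(i - 0.5) / n for i in range(1, n + 1)]


def sample_quantile(sorted_y, q):
    """Linear interpolation between adjacent ordered values at positions q_i."""
    pos = plotting_positions(len(sorted_y))
    if q <= pos[0]:
        return sorted_y[0]
    if q >= pos[-1]:
        return sorted_y[-1]
    for i in range(len(sorted_y) - 1):
        if pos[i] <= q <= pos[i + 1]:
            t = (q - pos[i]) / (pos[i + 1] - pos[i])
            return sorted_y[i] + t * (sorted_y[i + 1] - sorted_y[i])


def quartile_line(y):
    """Intercept and slope of the line through the lower and upper quartile
    points of the normal probability plot, plus the interquartile range."""
    ys = sorted(y)
    G = NormalDist()
    x_lo, x_hi = G.inv_cdf(0.25), G.inv_cdf(0.75)
    y_lo, y_hi = sample_quantile(ys, 0.25), sample_quantile(ys, 0.75)
    slope = (y_hi - y_lo) / (x_hi - x_lo)
    intercept = y_lo - slope * x_lo
    return intercept, slope, y_hi - y_lo


# Hypothetical sample lying exactly on a normal line with mean 5 and sd 2.
n = 10
G = NormalDist()
sample = [5.0 + 2.0 * G.inv_cdf(q) for q in plotting_positions(n)]

# Points of the normal probability plot: (canonical quantile, ordered value).
plot_points = list(zip((G.inv_cdf(q) for q in plotting_positions(n)),
                       sorted(sample)))
intercept, slope, iqr = quartile_line(sample)
print("intercept:", intercept, "slope:", slope, "iqr:", iqr)
```

For real data the points will scatter around the quartile line, and the direction in which the tails depart from it indicates skewness or tail thickness.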
The reference lines should be drawn at $\tilde{y}_{0.25} - 1.5(\mathrm{iqr})$, $\tilde{y}_{0.25}$, $\tilde{y}_{0.5}$, $\tilde{y}_{0.75}$, and $\tilde{y}_{0.75} + 1.5(\mathrm{iqr})$. Values above the upper reference line or below the lower reference line are classified as outliers relative to the normal distribution. Actually, it is only necessary to draw the outer reference lines, but the others are useful since they identify the quartiles. These reference lines essentially define the standard boxplot. It is sometimes useful to draw two more reference lines at $\tilde{y}_{0.25} - 3(\mathrm{iqr})$