Multivariate Data Analysis – Questions/Answers | Exams Nursing

Bootstrapping ✔ Ans - An approach to validating a multivariate model by drawing a large number of subsamples and estimating models for each subsample. Estimates from all the subsamples are then combined, providing not only the "best" estimated coefficients (e.g., means of each estimated coefficient across all the subsample models), but their expected variability and thus their likelihood of differing from zero; that is, are the estimated coefficients statistically different from zero or not? This approach does not rely on statistical assumptions about the population to assess statistical significance, but instead makes its assessment based solely on the sample data.
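The bootstrapping entry above can be illustrated with a small sketch. It assumes the usual case-resampling variant (drawing cases with replacement), a simple one-predictor regression, and made-up data; the function names `fit_slope` and `bootstrap_slope` are hypothetical and not from the source.

```python
import random
import statistics


def fit_slope(xs, ys):
    """Least-squares slope of y on x for a simple regression."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def bootstrap_slope(xs, ys, n_resamples=2000, seed=0):
    """Resample cases with replacement and refit the slope each time."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # skip a degenerate resample with no spread in x
        by = [ys[i] for i in idx]
        slopes.append(fit_slope(bx, by))
    slopes.sort()
    # Percentile interval: middle 95% of the resampled slopes.
    lower = slopes[int(0.025 * len(slopes))]
    upper = slopes[int(0.975 * len(slopes)) - 1]
    return statistics.fmean(slopes), statistics.stdev(slopes), (lower, upper)


# Hypothetical data: y = 2x plus random noise.
data_rng = random.Random(42)
xs = [float(i) for i in range(1, 31)]
ys = [2.0 * x + data_rng.gauss(0, 3) for x in xs]

mean_slope, spread, (lower, upper) = bootstrap_slope(xs, ys)
print(f"bootstrap mean slope: {mean_slope:.3f}")
print(f"bootstrap variability (SD): {spread:.3f}")
print(f"95% percentile interval: ({lower:.3f}, {upper:.3f})")
```

If the percentile interval excludes zero, the coefficient would be judged different from zero using only the sample data, which is the point the definition makes.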

Composite measure ✔ Ans - See summated scales.

Dependence technique ✔ Ans - Classification of statistical techniques distinguished by having a variable or set of variables identified as the dependent variable(s) and the remaining variables as independent. The objective is prediction of the dependent variable(s) by the independent variable(s). An example is regression analysis.

Dependent variable ✔ Ans - Presumed effect of, or response to, a change in the independent variable(s).

Dummy variable ✔ Ans - Nonmetrically measured variable transformed into a metric variable by assigning a 1 or a 0 to a subject, depending on whether it possesses a particular characteristic.

Effect size ✔ Ans - Estimate of the degree to which the phenomenon being studied (e.g., correlation or difference in means) exists in the population.

Independent variable ✔ Ans - Presumed cause of any change in the dependent variable.

Indicator ✔ Ans - Single variable used in conjunction with one or more other variables to form a composite measure.

Interdependence technique ✔ Ans - Classification of statistical techniques in which the variables are not divided into dependent and independent sets; rather, all variables are analyzed as a single set (e.g., factor analysis).


Measurement error ✔Ans - Inaccuracies of measuring the "true" variable values due to the fallibility of the measurement instrument (e.g., inappropriate response scales), data entry errors, or respondent errors.

Metric data ✔Ans - Also called quantitative data, interval data, or ratio data, these measurements identify or describe subjects (or objects) not only on the possession of an attribute but also by the amount or degree to which the subject may be characterized by the attribute. For example, a person's age and weight are metric data.

Multicollinearity ✔Ans - Extent to which a variable can be explained by the other variables in the analysis. As multicollinearity increases, it complicates the interpretation of the variate because it is more difficult to ascertain the effect of any single variable, owing to their interrelationships.

Multivariate analysis ✔Ans - Analysis of multiple variables in a single relationship or set of relationships.

Multivariate measurement ✔Ans - Use of two or more variables as indicators of a single composite measure. For example, a personality test may provide the answers to a series of individual questions (indicators), which are then combined to form a single score (summated scale) representing the personality trait.

Nonmetric data ✔Ans - Also called qualitative data, these are attributes, characteristics, or categorical properties that identify or describe a subject or object. They differ from metric data by indicating the presence of an attribute, but not the amount. Examples are occupation (physician, attorney, professor) or buyer status (buyer, nonbuyer). Also called nominal data or ordinal data.

Power ✔Ans - Probability of correctly rejecting the null hypothesis when it is false; that is, correctly finding a hypothesized relationship when it exists. Determined as a function of (1) the statistical significance level set by the researcher for a Type I error (α), (2) the sample size used in the analysis, and (3) the effect size being examined.

Practical significance ✔Ans - Means of assessing multivariate analysis results based on their substantive findings rather than their statistical significance.
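The three determinants of power listed above (significance level, sample size, effect size) can be seen in a short sketch. It assumes a two-sided one-sample z-test with a standardized effect size, which is only one of many possible tests; the function name `z_test_power` is hypothetical.

```python
import math
from statistics import NormalDist


def z_test_power(effect_size, n, alpha=0.05):
    """Power of a two-sided one-sample z-test.

    effect_size is the standardized mean difference (assumed known),
    n is the sample size, alpha is the Type I error rate.
    """
    z = NormalDist()  # standard normal distribution
    critical = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * math.sqrt(n)
    # Probability that the test statistic falls in either rejection region.
    return (1 - z.cdf(critical - shift)) + z.cdf(-critical - shift)


for n in (20, 50, 100):
    print(n, round(z_test_power(0.3, n), 3))
```

Holding the other two inputs fixed, power rises as the sample size grows, as the effect size grows, or as the significance level is relaxed.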

or nonrandom error. Validity is concerned with how well the concept is defined by the measure(s), whereas reliability relates to the consistency of the measure(s).

Variate ✔Ans - Linear combination of variables formed in the multivariate technique by deriving empirical weights applied to a set of variables specified by the researcher.

All-available approach ✔Ans - Imputation method for missing data that computes values based on all-available valid observations, also known as the pairwise approach.

Boxplot ✔Ans - Method of representing the distribution of a variable. A box represents the major portion of the distribution, and the extensions, called whiskers, reach to the extreme points of the distribution. This method is useful in making comparisons of one or more variables across groups.

Censored data ✔Ans - Observations that are incomplete in a systematic and known way. One example occurs in the study of causes of death in a sample in which some individuals are still living. Censored data are an example of ignorable missing data.

Comparison group ✔Ans - See reference category.

Complete case approach ✔Ans - Approach for handling missing data that computes values based on data from complete cases, that is, cases with no missing data. Also known as the listwise approach.

Data transformations ✔Ans - A variable may have an undesirable characteristic, such as nonnormality, that detracts from its use in a multivariate technique. A transformation, such as taking the logarithm or square root of the variable, creates a transformed variable that is more suited to portraying the relationship. Transformations may be applied to either the dependent or independent variables, or both. The need and specific type of transformation may be based on theoretical reasons (e.g., transforming a known nonlinear relationship) or empirical reasons (e.g., problems identified through graphical or statistical means).
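The difference between the complete case (listwise) and all-available (pairwise) approaches defined above is easiest to see on a tiny data set. This is a minimal sketch with made-up values, where `None` marks a missing value; the helper names are hypothetical.

```python
import math

# Hypothetical data: three variables per case, None marks a missing value.
rows = [
    (1.0, 2.0, 3.0),
    (2.0, None, 5.0),
    (3.0, 6.5, None),
    (4.0, 8.0, 9.0),
    (5.0, 9.5, 11.0),
]


def complete_cases(rows):
    """Listwise approach: keep only cases with no missing values."""
    return [r for r in rows if all(v is not None for v in r)]


def pairwise_values(rows, i, j):
    """Pairwise approach: keep every case that is valid on both variables."""
    pairs = [(r[i], r[j]) for r in rows if r[i] is not None and r[j] is not None]
    return [p[0] for p in pairs], [p[1] for p in pairs]


def pearson(xs, ys):
    """Pearson correlation between two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


listwise = complete_cases(rows)
xs_lw = [r[0] for r in listwise]
ys_lw = [r[1] for r in listwise]
xs_pw, ys_pw = pairwise_values(rows, 0, 1)

print("listwise n:", len(xs_lw), "r:", round(pearson(xs_lw, ys_lw), 4))
print("pairwise n:", len(xs_pw), "r:", round(pearson(xs_pw, ys_pw), 4))
```

The pairwise version uses more of the data for each correlation, at the cost of each correlation being based on a different subset of cases.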

Dummy variable ✔Ans - Special metric variable used to represent a single category of a nonmetric variable. To account for L levels of a nonmetric variable, L - 1 dummy variables are needed. For example, gender is measured as male or female and could be represented by two dummy variables (X1 and X2). When the respondent is male, X1 = 1 and X2 = 0. Likewise, when the respondent is female, X1 = 0 and X2 = 1. However, when X1 = 1, we know that X2 must equal 0. Thus, we need only one variable, either X1 or X2, to represent the variable gender. If a nonmetric variable has three levels, only two dummy variables are needed. We always have one dummy variable less than the number of levels for the nonmetric variable. The omitted category is termed the reference category.

Effects coding ✔Ans - Method for specifying the reference category for a set of dummy variables where the reference category receives a value of minus one (-1) across the set of dummy variables. With this type of coding, the dummy variable coefficients represent group deviations from the mean of all groups, which is in contrast to indicator coding.

Heteroscedasticity ✔Ans - See homoscedasticity.

Histogram ✔Ans - Graphical display of the distribution of a single variable. By forming frequency counts in categories, the shape of the variable's distribution can be shown. Used to make a visual comparison to the normal distribution.

Homoscedasticity ✔Ans - When the variance of the error terms (e) appears constant over a range of predictor variables, the data are said to be homoscedastic. The assumption of equal variance of the population error E (where E is estimated from e) is critical to the proper application of many multivariate techniques. When the error terms have increasing or modulating variance, the data are said to be heteroscedastic. Analysis of residuals best illustrates this point.

Ignorable missing data ✔Ans - Missing data process that is explicitly identifiable and/or is under the control of the researcher. Ignorable missing data do not require a remedy because the missing data are explicitly handled in the technique used.
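The dummy variable and effects coding entries above can be sketched in a few lines. The level names and the helper functions `indicator_coding` and `effects_coding` are hypothetical illustrations, not part of the source.

```python
def indicator_coding(levels, reference):
    """L - 1 dummy variables; the reference category is all zeros."""
    others = [lv for lv in levels if lv != reference]
    return {lv: [1 if lv == o else 0 for o in others] for lv in levels}


def effects_coding(levels, reference):
    """Same L - 1 variables, but the reference category is all -1."""
    others = [lv for lv in levels if lv != reference]
    coded = {}
    for lv in levels:
        if lv == reference:
            coded[lv] = [-1] * len(others)
        else:
            coded[lv] = [1 if lv == o else 0 for o in others]
    return coded


# Hypothetical three-level nonmetric variable.
levels = ["low", "medium", "high"]

print(indicator_coding(levels, reference="high"))
print(effects_coding(levels, reference="high"))
```

With three levels, both schemes produce two dummy variables; they differ only in the row assigned to the reference category.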
Imputation ✔Ans - Process of estimating the missing data of an observation based on valid values of the other variables. The objective is to employ known

Multivariate graphical display ✔Ans - Method of presenting a multivariate profile of an observation on three or more variables. The methods include approaches such as glyphs, mathematical transformations, and even iconic representations (e.g., faces).

Normal distribution ✔Ans - Purely theoretical continuous probability distribution in which the horizontal axis represents all possible values of a variable and the vertical axis represents the probability of those values occurring. The scores on the variable are clustered around the mean in a symmetrical, unimodal pattern known as the bell-shaped, or normal, curve.

Normal probability plot ✔Ans - Graphical comparison of the form of the distribution to the normal distribution. In the normal probability plot, the normal distribution is represented by a straight line angled at 45 degrees. The actual distribution is plotted against this line so that any differences are shown as deviations from the straight line, making identification of differences quite apparent and interpretable.

Normality ✔Ans - Degree to which the distribution of the sample data corresponds to a normal distribution.

Outlier ✔Ans - An observation that is substantially different from the other observations (i.e., has an extreme value) on one or more characteristics (variables). At issue is its representativeness of the population.

Reference category ✔Ans - The category of a nonmetric variable that is omitted when creating dummy variables and acts as a reference point in interpreting the dummy variables. In indicator coding, the reference category has values of zero (0) for all dummy variables. With effects coding, the reference category has values of minus one (-1) for all dummy variables.

Residual ✔Ans - Portion of a dependent variable not explained by a multivariate technique. Associated with dependence methods that attempt to predict the dependent variable, the residual represents the unexplained portion of the dependent variable. Residuals can be used in diagnostic procedures to identify problems in the estimation technique or to identify unspecified relationships.

Robustness ✔Ans - The ability of a statistical technique to perform reasonably well even when the underlying statistical assumptions have been violated in some manner.

Scatterplot ✔Ans - Representation of the relationship between two metric variables portraying the joint values of each observation in a two-dimensional graph.

Skewness ✔Ans - Measure of the symmetry of a distribution; in most instances the comparison is made to a normal distribution. A positively skewed distribution has relatively few large values and tails off to the right, and a negatively skewed distribution has relatively few small values and tails off to the left. Skewness values falling outside the range of -1 to +1 indicate a substantially skewed distribution.

Variate ✔Ans - Linear combination of variables formed in the multivariate technique by deriving empirical weights applied to a set of variables specified by the researcher.

Anti-image correlation matrix ✔Ans - Matrix of the partial correlations among variables after factor analysis, representing the degree to which the factors explain each other in the results. The diagonal contains the measures of sampling adequacy for each variable, and the off-diagonal values are partial correlations among variables.

Bartlett test of sphericity ✔Ans - Statistical test for the overall significance of all correlations within a correlation matrix.

Cluster analysis ✔Ans - Multivariate technique with the objective of grouping respondents or cases with similar profiles on a defined set of characteristics. Similar to Q factor analysis.

Common factor analysis ✔Ans - Factor model in which the factors are based on a reduced correlation matrix. That is, communalities are inserted in the diagonal of the correlation matrix, and the extracted factors are based only on the common variance, with specific and error variance excluded.

Common variance ✔Ans - Variance shared with other variables in the factor analysis.

Communality ✔Ans - Total amount of variance an original variable shares with all other variables included in the analysis.
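The skewness entry above gives a rule of thumb (values outside -1 to +1) that is easy to check numerically. This sketch assumes the moment-based coefficient (third central moment divided by the 1.5 power of the second); statistical packages often apply a small-sample correction, so their values may differ slightly. The data are made up.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_tailed = [1.0, 1.0, 2.0, 2.0, 3.0, 20.0]   # few large values
left_tailed = [-v for v in right_tailed]          # mirror image

for name, data in [("symmetric", symmetric),
                   ("right-tailed", right_tailed),
                   ("left-tailed", left_tailed)]:
    s = skewness(data)
    flag = "substantially skewed" if abs(s) > 1 else "within -1 to +1"
    print(f"{name}: {s:.3f} ({flag})")
```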

Factor indeterminacy ✔Ans - Characteristic of common factor analysis such that several different factor scores can be calculated for a respondent, each fitting the estimated factor model. It means the factor scores are not unique for each individual.

Factor loadings ✔Ans - Correlation between the original variables and the factors, and the key to understanding the nature of a particular factor. Squared factor loadings indicate what percentage of the variance in an original variable is explained by a factor.

Factor matrix ✔Ans - Table displaying the factor loadings of all variables on each factor.

Factor pattern matrix ✔Ans - One of two factor matrices found in an oblique rotation that is most comparable to the factor matrix in an orthogonal rotation.

Factor rotation ✔Ans - Process of manipulating or adjusting the factor axes to achieve a simpler and pragmatically more meaningful factor solution.

Factor score ✔Ans - Composite measure created for each observation on each factor extracted in the factor analysis. The factor weights are used in conjunction with the original variable values to calculate each observation's score. The factor score then can be used to represent the factor(s) in subsequent analyses. Factor scores are standardized to have a mean of 0 and a standard deviation of 1.

Factor structure matrix ✔Ans - A factor matrix found in an oblique rotation that represents the simple correlations between variables and factors, incorporating the unique variance and the correlations between factors. Most researchers prefer to use the factor pattern matrix when interpreting an oblique solution.

Indicator ✔Ans - Single variable used in conjunction with one or more other variables to form a composite measure.

Latent root ✔Ans - See eigenvalue.

Measure of sampling adequacy (MSA) ✔Ans - Measure calculated both for the entire correlation matrix and each individual variable evaluating the appropriateness of applying factor analysis. Values above .50 for either the entire matrix or an individual variable indicate appropriateness.
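The statement above that squared factor loadings give the share of a variable's variance explained by a factor can be worked through on a small factor matrix. The loadings below are invented for illustration, and the sketch assumes an orthogonal solution, where a variable's communality is the sum of its squared loadings across factors.

```python
# Hypothetical factor matrix: variable -> loadings on Factor 1 and Factor 2.
loadings = {
    "X1": [0.80, 0.10],
    "X2": [0.70, 0.20],
    "X3": [0.15, 0.85],
    "X4": [0.05, 0.75],
}


def communality(row):
    """Sum of squared loadings across factors (orthogonal solution assumed)."""
    return sum(value ** 2 for value in row)


def sum_squared_loadings(matrix, factor):
    """Column sum of squared loadings: variance accounted for by one factor."""
    return sum(row[factor] ** 2 for row in matrix.values())


for name, row in loadings.items():
    shares = [round(value ** 2, 4) for value in row]
    print(name, "squared loadings:", shares,
          "communality:", round(communality(row), 4))

for factor in range(2):
    print("factor", factor + 1, "sum of squared loadings:",
          round(sum_squared_loadings(loadings, factor), 4))
```

Here X1 has 64% of its variance explained by the first factor and 1% by the second, for a communality of 0.65.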

Measurement error ✔Ans - Inaccuracies in measuring the "true" variable values due to the fallibility of the measurement instrument (e.g., inappropriate response scales), data entry errors, or respondent errors.

Multicollinearity ✔Ans - Extent to which a variable can be explained by the other variables in the analysis.

Oblique factor rotation ✔Ans - Factor rotation computed so that the extracted factors are correlated. Rather than arbitrarily constraining the factor rotation to an orthogonal solution, the oblique rotation identifies the extent to which each of the factors is correlated.

Orthogonal ✔Ans - Mathematical independence (no correlation) of factor axes to each other (i.e., at right angles, or 90 degrees).

Orthogonal factor rotation ✔Ans - Factor rotation in which the factors are extracted so that their axes are maintained at 90 degrees. Each factor is independent of, or orthogonal to, all other factors. The correlation between the factors is determined to be 0.

Q factor analysis ✔Ans - Forms groups of respondents or cases based on their similarity on a set of characteristics.

QUARTIMAX ✔Ans - A type of orthogonal factor rotation method focusing on simplifying the rows of a factor matrix. Generally considered less effective than the VARIMAX rotation.

R factor analysis ✔Ans - Analyzes relationships among variables to identify groups of variables forming latent dimensions (factors).

Reliability ✔Ans - Extent to which a variable or set of variables is consistent in what it is intended to measure. If multiple measurements are taken, reliable measures will all be consistent in their values. It differs from validity in that it does not relate to what should be measured, but instead to how it is measured.

Reverse scoring ✔Ans - Process of reversing the scores of a variable, while retaining the distributional characteristics, to change the relationships (correlations) between two variables. Used in summated scale construction to

Although the addition of independent variables will always cause the coefficient of determination to rise, the adjusted coefficient of determination may fall if the added independent variables have little explanatory power or if the degrees of freedom become too small. This statistic is quite useful for comparison between equations with different numbers of independent variables, differing sample sizes, or both.

All-possible-subsets regression ✔Ans - Method of selecting the variables for inclusion in the regression model that considers all possible combinations of the independent variables. For example, if the researcher specifies four potential independent variables, this technique would estimate all possible regression models with one, two, three, and four variables. The technique would then identify the model(s) with the best predictive accuracy.

Backward elimination ✔Ans - Method of selecting variables for inclusion in the regression model that starts by including all independent variables in the model and then eliminating those variables not making a significant contribution to prediction.

Beta coefficient ✔Ans - Standardized regression coefficient (see standardization) that allows for a direct comparison between coefficients as to their relative explanatory power of the dependent variable. Whereas regression coefficients are expressed in terms of the units of the associated variable, thereby making comparisons inappropriate, beta coefficients use standardized data and can be directly compared.

Coefficient of determination (R²) ✔Ans - Measure of the proportion of the variance of the dependent variable about its mean that is explained by the independent, or predictor, variables. The coefficient can vary between 0 and 1. If the regression model is properly applied and estimated, the researcher can assume that the higher the value of R², the greater the explanatory power of the regression equation, and therefore the better the prediction of the dependent variable.

Collinearity ✔Ans - Expression of the relationship between two (collinearity) or more (multicollinearity) independent variables. Two independent variables are said to exhibit complete collinearity if their correlation coefficient is 1, and complete lack of collinearity if their correlation coefficient is 0. Multicollinearity occurs when any single independent variable is highly correlated with a set of other independent variables. An extreme case of collinearity/multicollinearity is singularity, in which an independent variable is perfectly predicted (i.e., correlation of 1.0) by another independent variable (or more than one).

Correlation coefficient (r) ✔Ans - Coefficient that indicates the strength of the association between any two metric variables. The sign (+ or -) indicates the direction of the relationship. The value can range from +1 to -1, with +1 indicating a perfect positive relationship, 0 indicating no relationship, and -1 indicating a perfect negative or reverse relationship (as one variable grows larger, the other variable grows smaller).

Criterion variable (Y) ✔Ans - See dependent variable.

Degrees of freedom (df) ✔Ans - Value calculated from the total number of observations minus the number of estimated parameters. These parameter estimates are restrictions on the data because, once made, they define the population from which the data are assumed to have been drawn. For example, in estimating a regression model with a single independent variable, we estimate two parameters, the intercept (b0) and a regression coefficient for the independent variable (b1). In estimating the random error, defined as the sum of the prediction errors (actual minus predicted dependent values) for all cases, we would find (n - 2) degrees of freedom. Degrees of freedom provide a measure of how restricted the data are to reach a certain level of prediction. If the number of degrees of freedom is small, the resulting prediction may be less generalizable because all but a few observations were incorporated in the prediction. Conversely, a large degrees-of-freedom value indicates the prediction is fairly robust with regard to being representative of the overall sample of respondents.

Dependent variable (Y) ✔Ans - Variable being predicted or explained by the set of independent variables.
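The relationship described above between the coefficient of determination, its adjusted version, and the degrees of freedom can be made concrete. This sketch uses the commonly taught adjustment 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the sample size and k the number of independent variables; the source does not give a formula, so treat that choice, the function names, and the numbers as assumptions.

```python
def r_squared(actual, predicted):
    """Proportion of variance about the mean explained by the predictions."""
    mean_y = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_y) ** 2 for a in actual)
    return 1 - ss_residual / ss_total


def adjusted_r_squared(r2, n, k):
    """Adjust R² for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical comparison: a second predictor raises R² only slightly.
n = 21
one_predictor = adjusted_r_squared(0.500, n, k=1)
two_predictors = adjusted_r_squared(0.505, n, k=2)
print(f"adjusted R² with 1 predictor:  {one_predictor:.4f}")
print(f"adjusted R² with 2 predictors: {two_predictors:.4f}")
```

In this made-up case R² rises from 0.500 to 0.505, yet the adjusted value falls, which is the behavior the entry describes for variables with little explanatory power.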
Dummy variable ✔Ans - Independent variable used to account for the effect that different levels of a nonmetric variable have in predicting the dependent variable. To account for L levels of a nonmetric independent variable, L - 1 dummy variables are needed. For example, gender is measured as male or female and could be represented by two dummy variables, X1 and X2. When the respondent is male, X1 = 1 and X2 = 0. Likewise, when the respondent is female, X1 = 0 and X2 = 1. However, when X1 = 1, we know that X2 must equal 0. Thus, we need only one variable, either X1 or X2, to represent gender. We need not include both variables because one is perfectly predicted by the other (a singularity) and the regression coefficients cannot be estimated. If a variable has three levels, only two

group deviations on the dependent variable from the overall mean of the dependent variable.

Influential observation ✔Ans - An observation that has a disproportionate influence on one or more aspects of the regression estimates. This influence may be based on extreme values of the independent or dependent variables, or both. Influential observations can either be "good," by reinforcing the pattern of the remaining data, or "bad," when a single or small set of cases unduly affects the regression estimates. It is not necessary for the observation to be an outlier, although many times outliers can be classified as influential observations as well.

Intercept (b0) ✔Ans - Value on the Y axis (dependent variable axis) where the line defined by the regression equation Y = b0 + b1X1 crosses the axis. It is described by the constant term b0 in the regression equation. In addition to its role in prediction, the intercept may have a managerial interpretation. If the complete absence of the independent variable has meaning, then the intercept represents that amount. For example, when estimating sales from past advertising expenditures, the intercept represents the level of sales expected if advertising is eliminated. But in many instances the constant has only predictive value because in no situation are all independent variables absent. An example is predicting product preference based on consumer attitudes. All individuals have some level of attitude, so the intercept has no managerial use, but it still aids in prediction.

Least squares ✔Ans - Estimation procedure used in simple and multiple regression whereby the regression coefficients are estimated so as to minimize the total sum of the squared residuals.

Leverage points ✔Ans - Type of influential observation defined by one aspect of influence termed leverage. These observations are substantially different on one or more independent variables, so that they affect the estimation of one or more regression coefficients.

Linearity ✔Ans - Term used to express the concept that the model possesses the properties of additivity and homogeneity. In a simple sense, linear models predict values that fall in a straight line by having a constant unit change (slope) of the dependent variable for a constant unit change of the independent variable. In the population model Y = b0 + b1X1 + ε, the effect of changing X1 by a value of 1.0 is to add b1 (a constant) units of Y.
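The intercept, least squares, and linearity entries above fit together in a short worked example. This is a sketch for the single-predictor case using the standard closed-form estimates; the data and the function names `fit_line` and `sum_squared_residuals` are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for Y = b0 + b1*X1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def sum_squared_residuals(xs, ys, b0, b1):
    """Total of squared (actual minus predicted) values."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data lying close to a straight line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [5.1, 6.9, 9.2, 10.8, 13.0]

b0, b1 = fit_line(xs, ys)
best = sum_squared_residuals(xs, ys, b0, b1)
print(f"intercept b0 = {b0:.3f}, slope b1 = {b1:.3f}")
print(f"sum of squared residuals = {best:.4f}")

# Linearity: a one-unit change in X1 always adds b1 units to the prediction.
change = (b0 + b1 * 4.0) - (b0 + b1 * 3.0)
print(f"prediction change for a one-unit change in X1 = {change:.3f}")
```

Any other pair of coefficients gives a larger sum of squared residuals on the same data, which is what "least squares" means.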

Measurement error ✔Ans - Degree to which the data values do not truly measure the characteristic being represented by the variable. For example, when asking about total family income, many sources of measurement error (e.g., reluctance to answer full amount, error in estimating total income) make the data values imprecise.

Moderator effect ✔Ans - Effect in which a third independent variable (the moderator variable) causes the relationship between a dependent/independent variable pair to change, depending on the value of the moderator variable. It is also known as an interactive effect and is similar to the interaction effect seen in analysis of variance methods.

Multicollinearity ✔Ans - See collinearity.

Multiple regression ✔Ans - Regression model with two or more independent variables.

Normal probability plot ✔Ans - Graphical comparison of the shape of the sample distribution to the normal distribution. In the graph, the normal distribution is represented by a straight line angled at 45 degrees. The actual distribution is plotted against this line, so any differences are shown as deviations from the straight line, making identification of differences quite simple.

Null plot ✔Ans - Plot of residuals versus the predicted values that exhibits a random pattern. A null plot is indicative of no identifiable violations of the assumptions underlying regression analysis.

Outlier ✔Ans - In strict terms, an observation that has a substantial difference between the actual value for the dependent variable and the predicted value. Cases that are substantially different with regard to either the dependent or independent variables are often termed outliers as well. In all instances, the objective is to identify observations that are inappropriate representations of the population from which the sample is drawn, so that they may be discounted or even eliminated from the analysis as unrepresentative.

Parameter ✔Ans - Quantity (measure) characteristic of the population. For example, μ and σ² are the symbols used for the population parameters mean (μ) and variance (σ²). They are typically estimated from sample data in which the

tion coefficient). This portrayal is particularly helpful in assessing the form of the relationship (linear versus nonlinear) and the identification of influential observations.

Polynomial ✔Ans - Transformation of an independent variable to represent a curvilinear relationship with the dependent variable. By including a squared term (X²), a single inflection point is estimated. A cubic term estimates a second inflection point. Additional terms of a higher power can also be estimated.

Power ✔Ans - Probability that a significant relationship will be found if it actually exists. Complements the more widely used significance level alpha (α).

Prediction error ✔Ans - Difference between the actual and predicted values of the dependent variable for each observation in the sample (see residual).

Predictor variable (Xn) ✔Ans - See independent variable.

PRESS statistic ✔Ans - Validation measure obtained by eliminating each observation one at a time and predicting this dependent value with the regression model estimated from the remaining observations.

Reference category ✔Ans - The omitted level of a nonmetric variable when a dummy variable is formed from the nonmetric variable.

Regression coefficient (bn) ✔Ans - Numerical value of the parameter estimate directly associated with an independent variable; for example, in the model Y = b0 + b1X1 the value b1 is the regression coefficient for the variable X1. The regression coefficient represents the amount of change in the dependent variable for a one-unit change in the independent variable. In the multiple predictor model (e.g., Y = b0 + b1X1 + b2X2), the regression coefficients are partial coefficients because each takes into account not only the relationships between Y and X1 and between Y and X2, but also between X1 and X2. The coefficient is not limited in range, because it is based on both the degree of association and the scale units of the independent variable. For instance, two variables with the same association to Y would have different coefficients if one independent variable was measured on a 7-point scale and another was based on a 100-point scale.

Regression variate ✔Ans - Linear combination of weighted independent variables used collectively to predict the dependent variable.
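The PRESS statistic entry above describes a leave-one-out procedure, sketched here for a single-predictor regression. The source does not say how the individual prediction errors are combined; this sketch assumes the usual definition (their squared values are summed). Data and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for Y = b0 + b1*X1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def press_statistic(xs, ys):
    """Sum of squared errors when each case is predicted without itself."""
    total = 0.0
    for i in range(len(xs)):
        # Drop observation i, refit, then predict the dropped case.
        b0, b1 = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (b0 + b1 * xs[i])) ** 2
    return total


def sum_squared_residuals(xs, ys):
    """Ordinary in-sample sum of squared residuals, for comparison."""
    b0, b1 = fit_line(xs, ys)
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

print(f"in-sample sum of squared residuals: {sum_squared_residuals(xs, ys):.4f}")
print(f"PRESS: {press_statistic(xs, ys):.4f}")
```

Because each case is predicted by a model that never saw it, PRESS is at least as large as the in-sample sum of squared residuals and gives a more cautious view of predictive accuracy.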

Residual (e or ε) ✔Ans - Error in predicting our sample data. Seldom will our predictions be perfect. We assume that random error will occur, but we assume that this error is an estimate of the true random error in the population (ε), not just the error in prediction for our sample (e). We assume that the error in the population we are estimating is distributed with a mean of 0 and a constant (homoscedastic) variance.

Sampling error ✔Ans - The expected variation in any estimated parameter (intercept or regression coefficient) that is due to the use of a sample rather than the population. Sampling error is reduced as the sample size is increased and is used to statistically test whether the estimated parameter differs from zero.

Significance level (alpha) ✔Ans - Commonly referred to as the level of statistical significance, the significance level represents the probability the researcher is willing to accept that the estimated coefficient is classified as different from zero when it actually is not. This is also known as Type I error. The most widely used level of significance is .05, although researchers use levels ranging from .01 (more demanding) to .10 (less conservative and easier to find significance).

Simple regression ✔Ans - Regression model with a single independent variable, also known as bivariate regression.

Singularity ✔Ans - The extreme case of collinearity or multicollinearity in which an independent variable is perfectly predicted (a correlation of ±1.0) by one or more independent variables. Regression models cannot be estimated when a singularity exists. The researcher must omit one or more of the independent variables involved to remove the singularity.

Specification error ✔Ans - Error in predicting the dependent variable caused by excluding one or more relevant independent variables. This omission can bias the estimated coefficients of the included variables as well as decrease the overall predictive power of the regression model.

Standard error ✔Ans - Expected distribution of an estimated regression coefficient. The standard error is similar to the standard deviation of any set of data values, but instead denotes the expected range of the coefficient across
