





Department of Clinical Pharmacology, Seth GS Medical College & KEM Hospital, Mumbai, Maharashtra Received: 04.12.2016; Accepted: 10.12.
The field of medicine often requires drawing inferences regarding the association or relationship between two or more variables. In an earlier article on “Measures of Association” we introduced the concept of finding associations [relationships] between two variables that were binary and categorical in nature. 1 Therein, we explored several possible relationships between these binary variables and understood metrics such as absolute risk, relative risk and odds ratio.
In the present article, we discuss how to establish a relationship or an association between two quantitative variables, i.e., variables that can be “measured”. 2 As an example, we could perhaps ask the question “Is there a relationship between the number of hours of work put in by a sales representative and the actual sales of a product?” Or “Is there a relationship between maternal age [measured in years] and parity [total number of pregnancies that a woman has carried past 20 weeks of pregnancy]?” Correlation analysis helps answer questions such as these.
Correlation, also called correlation analysis, is a term used to denote the association or relationship between two (or more) quantitative variables. This analysis is fundamentally based on the assumption of a straight-line [linear] relationship between the quantitative variables. Similar to the measures of association for binary variables, it measures the “strength” or the “extent” of an association between the variables and also its direction. The end result of a correlation analysis is a correlation coefficient whose values range from -1 to +1. A correlation coefficient of +1 indicates that the two variables are perfectly related in a positive [linear] manner, a correlation coefficient of -1 indicates that the two variables are perfectly related in a negative [linear] manner, while a correlation coefficient of zero indicates that there is no linear relationship between the two variables being studied. These are depicted in Figures 1 and 2.
A correlation analysis begins with the construction of a scatter plot or scatter diagram [a graphical representation of the data] with one variable on the X-axis and the other on the Y-axis. Let us understand this with an example. We had carried out a study 3 earlier that evaluated whether two modalities of the informed consent process, the written informed consent process and the audio-visual [AV] recording of this (in the same clinical trial), differed from each other in the extent of the participant's understanding of the study, as measured with a pre-validated questionnaire. This questionnaire gave a “total score” [a quantitative measure] at the end of administration. One of the study objectives was to see if there was a relationship between the time (in minutes) taken to administer the consent in the two groups [again a quantitative measure] and the total score. Table 1 gives data on individual participants in both groups for time taken to consent [measured in minutes] and the total score.
Fig. 1: Scatter Plot showing Correlation between two variables. Note: Fig. 1a shows a weak positive correlation, Fig. 1b shows no correlation and Fig. 1c shows a weak negative correlation
Panels: (1a) Positive correlation; (1b) No correlation, r = 0; (1c) Negative correlation
Fig. 2: The spectrum of the correlation coefficient (-1 to +1)
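The scale in Fig. 2 is the range of the correlation coefficient. This section does not state how the coefficient is calculated; the standard textbook definition of Pearson's product-moment coefficient is reproduced here for orientation. The symbols are chosen for illustration and are not taken from the article: x_i and y_i are the paired observations, x-bar and y-bar their means, and n the number of pairs.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad -1 \le r \le +1
```

The numerator is positive when above-average values of one variable tend to accompany above-average values of the other, and negative when they tend to accompany below-average values. The denominator rescales the result so that it reaches +1 or -1 only when every point lies exactly on a straight line.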
Scatter plot 1: Written informed consent [Total score vs. time to administer consent]
Scatter plot 2: AV consent group [Total score vs. time to administer consent]
Table 1

Participant   Group 1: Written informed consent [WIC]      Group 2: AV consent
Number        Total score [n=17]   Time [minutes]          Total score [n=21]   Time [minutes]
 1            30                    73                     44                    75
 2            29                    28                     37                    42
 3            42                    25                     44                    20
 4            40                    30                     42                    20
 5            40                    30                     46                    55
 6            43                    35                     38                    90
 7            29                    50                     43                    30
 8            36                    55                     38                    73
 9            38                    55                     46                   104
10            43                    60                     45                    81
11            46                    68                     44                   149
12            41                    55                     42                    60
13            34                    80                     41                    58
14            35                    85                     39                    54
15            27                    35                     44                   120
16            21                    35                     37                    60
17            19                    30                     38                    80
18            -                     -                      43                    60
19            -                     -                      23                    58
20            -                     -                      45                    80
21            -                     -                      27                    70
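As a rough check on the Table 1 data, the correlation coefficient for each group can be computed directly. The sketch below is not the authors' analysis: it assumes Pearson's coefficient is the one wanted, and the function and variable names (`pearson_r`, `wic_score`, `av_time` and so on) are made up for illustration. The numbers are copied from Table 1.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares about the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Group 1: written informed consent (participants 1 to 17)
wic_score = [30, 29, 42, 40, 40, 43, 29, 36, 38, 43, 46, 41, 34, 35, 27, 21, 19]
wic_time = [73, 28, 25, 30, 30, 35, 50, 55, 55, 60, 68, 55, 80, 85, 35, 35, 30]

# Group 2: AV consent (participants 1 to 21)
av_score = [44, 37, 44, 42, 46, 38, 43, 38, 46, 45, 44,
            42, 41, 39, 44, 37, 38, 43, 23, 45, 27]
av_time = [75, 42, 20, 20, 55, 90, 30, 73, 104, 81, 149,
           60, 58, 54, 120, 60, 80, 60, 58, 80, 70]

r_wic = pearson_r(wic_time, wic_score)
r_av = pearson_r(av_time, av_score)

print(f"Group 1 (WIC): r = {r_wic:.3f}")
print(f"Group 2 (AV):  r = {r_av:.3f}")
```

Pearson's r is symmetric, so it does not matter which variable is passed first. Both coefficients should come out small and positive, in line with the weak positive relationship the scatter plots suggest.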
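Fig. 2 panel title: Correlation Coefficient Shows Strength & Direction of Correlation. Scale marks: -1.0, -0.5, 0.0, +0.5, +1.0. Negative correlation lies below zero and positive correlation above zero; the correlation is strong towards either end of the scale and weak towards zero.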
Scatter Plot for Group 1 (Written Informed Consent). Axis labels: Total score; Time to consent (mins). Trend line: y = 0.4639x + 32. R^2 = 0.
Scatter Plot for Group 2 (AV Consent). Axis labels: Total score; Time to consent (mins). Trend line: y = 0.7852x + 36. R^2 = 0.
The total score obtained by each participant is presented as a number.
The scatter plot or scatter diagram of the total score on the Y axis with the time taken to administer consent on the X axis enables us to get a feel of the relationship (if any) between the two. Each point on the scatter plot represents the values of X and Y as a single coordinate. The closer the points are to a straight line, the stronger the linear relationship between the two variables. Two scatter plots, one for each group, can be easily constructed using Microsoft Excel; those for our example are Scatter plots 1 and 2. Both scatter plots from our study show a weak, positive, linear relationship between the total scores and the time taken to administer the consent. The advantage of the scatter plot is that it is simple to construct, is non-mathematical in nature and is unaffected by any extreme values that may be present in the data set. It also tells us immediately if there are outliers or if the relationship is actually non-linear or not entirely linear. A line is usually drawn through the points on a scatter plot to identify linearity in the relationship. This line is called the regression line or the least squares line, because it is determined such that the sum of the squared distances of all the data points from the line is the lowest possible. This will be discussed in greater detail in the next article on regression analysis. The disadvantage of a scatter plot is that it does not give us one single value that will help us to understand whether or not there is a correlation between the variables
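The least squares property described above can be illustrated with a short sketch. It assumes ordinary least squares, where "distance" means the vertical distance from each point to the line. The five data points are invented for the illustration and do not come from the study, and the function names are hypothetical.

```python
def least_squares_line(x, y):
    """Slope and intercept of the ordinary least squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_errors(x, y, slope, intercept):
    """Sum of squared vertical distances from the points to the line."""
    return sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))

# Invented example data (not from Table 1)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, intercept = least_squares_line(x, y)
best_sse = sum_squared_errors(x, y, slope, intercept)

# Nudging the slope or the intercept away from the fitted values
# should always give a larger sum of squared distances.
nudges = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.5), (0.0, -0.5)]
perturbed = [
    sum_squared_errors(x, y, slope + ds, intercept + di) for ds, di in nudges
]

print(f"fitted line: y = {slope:.2f}x + {intercept:.2f}, SSE = {best_sse:.2f}")
print("SSE after nudging the line:", [round(s, 2) for s in perturbed])
```

Any other straight line through these points gives a larger total than the fitted one, which is what "least squares" refers to.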
greater a country’s annual per capita chocolate consumption, more were the number of Nobel Laureates per 10 million population and thus established a “relationship” or “association” between chocolate consumption and getting a Nobel prize!
Several factors must be considered when a correlation analysis is planned. These include:
i. Correlation analysis should not be used when the data are repeated measures of the same variable from the same individual at the same or varied time points. For example, if you have measured pain scores in patients with rheumatoid arthritis at monthly intervals over 2 years in a study, it is inappropriate to calculate a correlation coefficient for these data.
ii. It is useful to draw a scatter plot as an important prerequisite to any correlation analysis, as it helps eyeball the data for outliers, non-linear relationships and heteroscedasticity.
iii. An outlier is essentially an infrequently occurring value in the data set. It is important to remember that even a single outlier can dramatically alter the correlation coefficient.
iv. If there is a non-linear relationship between the quantitative variables, correlation analysis should not be performed. For example, during the growth phase in adolescence, there would be a linear relationship between height and weight, as both increase. However, this relationship ceases once a person enters adulthood.
v. If the dataset has two distinct subgroups of individuals whose values for one or both variables differ considerably from each other, a false correlation may be found when none may exist. An example given by Aggarwal and Ranganathan 8 illustrates this point well. If you were to plot heights (on the X-axis) and hemoglobin levels (on the Y-axis) of a group of men (n=20) and women (n=20), most women may end up in the left lower corner (shorter and lower hemoglobin) and most men in the right upper corner (taller and higher hemoglobin). Analysis would suggest a relationship with a positive “r” value between height and hemoglobin levels!
vi. The sample size should be appropriately calculated a priori. 9 Small sample sizes may show a false positive relationship.
vii. If one data set forms part of the second data set, for example, height at age 12 (X-axis) and height at age 30 (Y-axis), we would expect to find a positive correlation between them because the second quantity “contains” the first quantity.
viii. Heteroscedasticity is a situation in which one variable has unequal variability across the range of values of the second variable. For instance, if one were to plot time on the X-axis and the Sensex on the Y-axis, one would find a great variability in the Sensex as compared to the relative stability in time.
Conclusion
In summary, correlation coefficients are used to assess the strength and direction of the linear relationships between pairs of continuous variables. When both variables are normally distributed we use Pearson’s correlation coefficient “r”. Otherwise, we use Spearman’s correlation coefficient rho (ρ), which is non-parametric in nature and is more robust to outliers than Pearson’s correlation coefficient “r”.
Correlation analysis is seldom used alone and is usually accompanied by regression analysis. The difference between correlation and regression lies in the fact that while a correlation analysis stops with the calculation of the correlation coefficient and perhaps a test of significance, a regression analysis goes on to express the relationship in the form of an equation and moves into the realm of prediction. The next article in the series will deal with regression analysis.
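The statements that a single outlier can dramatically alter the correlation coefficient, and that Spearman's rho is more robust to outliers than Pearson's r, can be seen in a small sketch. The data are invented for the illustration and are not from the study. The helper names are hypothetical, and Spearman's rho is computed here as Pearson's r applied to average ranks, which is one common way of handling ties.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho as Pearson's r of the average ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Invented data with essentially no linear relationship
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [5, 3, 6, 2, 7, 4, 6, 3, 5, 4]

# The same data plus one extreme point
x_out = x + [30]
y_out = y + [40]

pearson_without = pearson_r(x, y)
pearson_with = pearson_r(x_out, y_out)
spearman_without = spearman_rho(x, y)
spearman_with = spearman_rho(x_out, y_out)

print(f"Pearson r:    {pearson_without:.2f} without outlier, {pearson_with:.2f} with")
print(f"Spearman rho: {spearman_without:.2f} without outlier, {spearman_with:.2f} with")
```

With these numbers, one added point moves Pearson's r from close to zero to above 0.9, while Spearman's rho rises far less because the outlier counts only as the highest rank, not by its magnitude.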