Journal of The Association of Physicians of India Vol. 65 March 2017
Principles of Correlation Analysis
NJ Gogtay, UM Thatte
Department of Clinical Pharmacology, Seth GS Medical College & KEM Hospital, Mumbai, Maharashtra
Received: 04.12.2016; Accepted: 10.12.2016
Statistics for Researchers
Introduction
The field of medicine often requires drawing inferences regarding the association or relationship between two or more variables. In an earlier article on “Measures of Association” we introduced the concept of finding associations [relationships] between two variables that were binary and categorical in nature.1 Therein, we explored several possible relationships between these binary variables and understood metrics such as absolute risk, relative risk and odds ratio.

In the present article, we discuss how to establish a relationship or an association between two quantitative variables, i.e., variables that can be “measured”.2 As an example, we could perhaps ask the question “Is there a relationship between the number of hours of work put in by a sales representative and the actual sales of a product?” Or “Is there a relationship between maternal age [measured in years] and parity [total number of pregnancies that a woman has carried past 20 weeks of pregnancy]?” Correlation analysis helps answer questions such as these.

Definition of Correlation, its Assumptions and the Correlation Coefficient

Correlation, also called correlation analysis, is a term used to denote the association or relationship between two (or more) quantitative variables. This analysis is fundamentally based on the assumption of a straight-line [linear] relationship between the quantitative variables. Similar to the measures of association for binary variables, it measures the “strength” or the “extent” of an association between the variables and also its direction.

The end result of a correlation analysis is a correlation coefficient whose values range from -1 to +1. A correlation coefficient of +1 indicates that the two variables are perfectly related in a positive [linear] manner, a correlation coefficient of -1 indicates that the two variables are perfectly related in a negative [linear] manner, while a correlation coefficient of zero indicates that there is no linear relationship between the two variables being studied. These are depicted in Figures 1 and 2.

Fig. 1: Scatter plots showing correlation between two variables. Fig. 1a shows a weak positive correlation (r = 0.4), Fig. 1b shows no correlation (r = 0) and Fig. 1c shows a weak negative correlation (r = -0.4).

Eyeballing and Analyzing the Data for Correlation - Construction of the Scatter Plot/Scatter Diagram

A correlation analysis begins with the construction of a scatter plot or scatter diagram [a graphical representation of the data] with one variable on the X-axis and the other on the Y-axis. Let us understand this with an example.

We had carried out a study3 earlier that evaluated whether two modalities of the informed consent process, the written informed consent process and the audio visual [AV] recording of this (in the same clinical trial), were different from each other in terms of the extent of understanding of the study by the participant using a pre-validated questionnaire. This questionnaire gave a “total score” [a quantitative measure] at the end of administration. One of the study objectives was to see if there was a relationship between the time (in minutes) taken to administer the consent in the two groups [again a quantitative measure] and the total score. Table 1 gives data on individual participants in both groups for time taken to consent [measured in minutes] and the total

Fig. 2: The spectrum of the correlation coefficient (-1 to +1)

Scatter plot 1: Written informed consent [Total score vs. time to administer consent]

Scatter plot 2: AV consent group [Total score vs. time to administer consent]

Table 1

Participant  Group 1: Written informed consent [WIC] [n=17]   Group 2: AV consent [n=21]
Number       Total score   Time to administer WIC [minutes]   Total score   Time to administer AV consent [minutes]
1            30            73                                 44            75
2            29            28                                 37            42
3            42            25                                 44            20
4            40            30                                 42            20
5            40            30                                 46            55
6            43            35                                 38            90
7            29            50                                 43            30
8            36            55                                 38            73
9            38            55                                 46            104
10           43            60                                 45            81
11           46            68                                 44            149
12           41            55                                 42            60
13           34            80                                 41            58
14           35            85                                 39            54
15           27            35                                 44            120
16           21            35                                 37            60
17           19            30                                 38            80
18           -             -                                  43            60
19           -             -                                  23            58
20           -             -                                  45            80
21           -             -                                  27            70

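[Fig. 2 graphic: "Correlation Coefficient Shows Strength & Direction of Correlation". A scale running from -1.0 through -0.5, 0.0 and +0.5 to +1.0: negative correlation to the left of zero, positive correlation to the right, strong towards either end and weak towards zero.]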

[Scatter Plot for Group 1 (Written Informed Consent): axis labelled Time to consent (mins); trend line label: y = 0.4639x + 32. R^2 = 0.]

[Scatter Plot for Group 2 (AV Consent): axes labelled Total score and Time to consent (mins); trend line label: y = 0.7852x + 36. R 2 = 0.]

score obtained by the participant [presented as a number].
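As an illustration of how the Table 1 data lead to a correlation coefficient, the following is a minimal sketch that computes Pearson's r between time to administer consent and total score for each group. The function name `pearson_r` and the variable names are illustrative choices and not part of the original study; the data are transcribed from Table 1.

```python
import math

# Table 1, Group 1 (written informed consent, n=17)
wic_score = [30, 29, 42, 40, 40, 43, 29, 36, 38, 43, 46, 41, 34, 35, 27, 21, 19]
wic_time = [73, 28, 25, 30, 30, 35, 50, 55, 55, 60, 68, 55, 80, 85, 35, 35, 30]

# Table 1, Group 2 (AV consent, n=21)
av_score = [44, 37, 44, 42, 46, 38, 43, 38, 46, 45, 44,
            42, 41, 39, 44, 37, 38, 43, 23, 45, 27]
av_time = [75, 42, 20, 20, 55, 90, 30, 73, 104, 81, 149,
           60, 58, 54, 120, 60, 80, 60, 58, 80, 70]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and of squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r_wic = pearson_r(wic_time, wic_score)
r_av = pearson_r(av_time, av_score)
print(f"Group 1 (WIC): r = {r_wic:.2f}")
print(f"Group 2 (AV):  r = {r_av:.2f}")
```

The coefficient is symmetric, so swapping the two arguments gives the same value. For these data both coefficients are small and positive.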

The scatter plot or scatter diagram of the total score on the Y axis with the time taken to administer consent on the X axis enables us to get a feel of the relationship (if any) between the two. Each point on the scatter plot represents the values of X and Y as a single coordinate. The closer the points are to a straight line, the stronger is the linear relationship between the two variables. Two scatter plots, one for each group, can be easily constructed using Microsoft Excel, and those for our example are shown as Scatter plots 1 and 2. Both scatter plots from our study show a weak, positive, linear relationship between the total scores and the time taken to administer the consent.

The advantage of the scatter plot is that it is simple to construct, is non-mathematical in nature and is unaffected by any extreme values that may be present in the data set. It also tells us immediately if there are outliers or if the relationship is actually non-linear or not entirely linear. A line is usually drawn through the points on a scatter plot to identify linearity in the relationship. This line is called the regression line or the least squares line, because it is determined such that the sum of the squared distances of all the data points from the line is the lowest possible. This will be discussed in greater detail in the next article on regression analysis.

The disadvantage of a scatter plot is that it does not give us one single value that will help us to understand whether or not there is a correlation between the variables
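The least squares line described above can be sketched in a few lines of code. This is a minimal illustration, not the authors' procedure (they mention Microsoft Excel). It assumes vertical distances are minimised, and it takes total score as x and time as y for Group 1, an orientation chosen here only because it appears to reproduce the slope on the trend line label of the Group 1 plot (0.4639). The function names are hypothetical.

```python
# Table 1, Group 1 (written informed consent, n=17)
wic_score = [30, 29, 42, 40, 40, 43, 29, 36, 38, 43, 46, 41, 34, 35, 27, 21, 19]
wic_time = [73, 28, 25, 30, 30, 35, 50, 55, 55, 60, 68, 55, 80, 85, 35, 35, 30]


def least_squares_line(x, y):
    """Return (slope, intercept) of the line minimising the sum of
    squared vertical distances between the points and the line."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(x, y, slope, intercept):
    """Sum of squared vertical distances of the points from a given line."""
    return sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))


# Assumption: x = total score, y = time to administer consent (minutes)
slope, intercept = least_squares_line(wic_score, wic_time)
print(f"y = {slope:.4f}x + {intercept:.2f}")

# Any other line through the data has a larger sum of squared residuals
best = sum_squared_residuals(wic_score, wic_time, slope, intercept)
shifted = sum_squared_residuals(wic_score, wic_time, slope + 0.1, intercept)
print(best < shifted)
```

Moving the slope or intercept away from the fitted values can only increase the sum of squared residuals, which is what "lowest possible" means in the description of the least squares line.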

greater a country’s annual per capita chocolate consumption, the greater was the number of Nobel Laureates per 10 million population, and thus established a “relationship” or “association” between chocolate consumption and getting a Nobel prize!

Factors that Affect a Correlation Analysis

Several factors must be considered when a correlation analysis is planned. These include:

i. Correlation analysis should not be used when the data are repeated measures of the same variable from the same individual at the same or varied time points. For example, if you have measured pain scores in patients with rheumatoid arthritis at monthly intervals over 2 years in a study, it is inappropriate to calculate a correlation coefficient for these data.

ii. It is useful to draw a scatter plot as an important pre-requisite to any correlation analysis as it helps eyeball the data for outliers, non-linear relationships and heteroscedasticity.

iii. An outlier is essentially an infrequently occurring value in the data set. It is important to remember that even a single outlier can dramatically alter the correlation coefficient.

iv. If there is a non-linear relationship between the quantitative variables, correlation analysis should not be performed. For example, during the growth phase in adolescence, there would be a linear relationship between height and weight, as both increase. However, this relationship ceases once a person enters adulthood.

v. If the dataset has two distinct subgroups of individuals whose values for one or both variables differ considerably from each other, a false correlation may be found when none may exist. An example given by Aggarwal and Ranganathan 8 illustrates this point well. If you were to plot heights (on X-axis) and hemoglobin levels (on Y-axis) of a group of men (n=20) and women (n=20), most women may end up in the left lower corner (shorter and lower hemoglobin) and most men in the right upper corner (taller and higher hemoglobin). Analysis would suggest a relationship with a positive “r” value between height and hemoglobin levels!

vi. The sample size should be appropriately calculated a priori.9 Small sample sizes may show a false positive relationship.

vii. If one data set forms part of the second data set, for example, height at age 12 (X-axis) and height at age 30 (Y-axis), we would expect to find a positive correlation between them because the second quantity “contains” the first quantity.

viii. Heteroscedasticity is a situation in which one variable has unequal variability across the range of values of the second variable. For instance, if one were to plot time on the X-axis and the Sensex on the Y-axis, one would find a great variability in the Sensex as compared to the relative stability in time.

Conclusion

In summary, correlation coefficients are used to assess the strength and direction of the linear relationships between pairs of continuous variables. When both variables are normally distributed we use Pearson’s correlation coefficient “r”. Otherwise, we use Spearman’s correlation coefficient rho (ρ), which is non-parametric in nature and is more robust to outliers than Pearson’s correlation coefficient “r”.

Correlation analysis is seldom used alone and is usually accompanied by regression analysis. The difference between correlation and regression lies in the fact that while a correlation analysis stops with the calculation of the correlation coefficient and perhaps a test of significance, a regression analysis goes on to express the relationship in the form of an equation and moves into the realm of prediction. The next article in the series will deal with regression analysis.
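The sensitivity of Pearson's r to a single outlier (factor iii) and the greater robustness of Spearman's rho noted above can be illustrated with a small sketch. The six data points and the added outlier are invented purely for illustration and do not come from the study. Spearman's rho is computed here as Pearson's r on the ranks, with tied values given their average rank. All function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r applied to the ranks of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Invented data with essentially no linear relationship
x = [1, 2, 3, 4, 5, 6]
y = [4, 2, 6, 1, 5, 3]

# The same data with one extreme point added
x_out = x + [20]
y_out = y + [40]

print(f"Pearson  without outlier: {pearson_r(x, y):.2f}")
print(f"Pearson  with outlier:    {pearson_r(x_out, y_out):.2f}")
print(f"Spearman with outlier:    {spearman_rho(x_out, y_out):.2f}")
```

In this invented example, one extreme point moves Pearson's r from near zero to above 0.9, while Spearman's rho moves far less because it uses only the rank of the outlier and not its magnitude.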

References

  1. Gogtay NJ, Deshpande S, Thatte UM. Measures of Association. J Assoc Phy Ind 2016; [in press]
  2. Deshpande S, Gogtay NJ, Thatte UM. Data types. J Assoc Phy Ind 2016; 64:64-65.
  3. Figer BH, Chaturvedi M, Thaker SJ, Gogtay NJ, Thatte UM. A comparative study of the informed consent process with or without audio-visual recording. Nat Med J Ind 2017; in press.
  4. Deshpande S, Gogtay NJ, Thatte UM. Data types. J Assoc Phy Ind 2016; 64:64-5.
  5. Gogtay NJ, Deshpande S, Thatte UM. Normal distributions, p values and confidence intervals. J Assoc Phy Ind 2016; 64:74-6.
  6. Deshpande S, Gogtay NJ, Thatte UM. Which test where? J Assoc Phy Ind 2016; 64:64-66.
  7. Messerli FH. Chocolate consumption, cognitive function and Nobel Laureates. N Engl J Med 2012; 367:1562-4.
  8. Aggarwal R, Ranganathan P. Common pitfalls in statistical analysis: The use of correlation techniques. Perspect Clin Res 2016; 7:187-90.
  9. Gogtay NJ, Thatte UM. Samples and their sizes- the bane of researchers. J Assoc Phy Ind 2016; 64:68-71.