Regression and Correlation Analysis in Biostatistics: A Practical Guide with Examples, Exercises of Biostatistics

Uploaded on 03/01/2023

Regression and Correlation

This file is part of a program based on the Bio 4835 Biostatistics class taught at Kean University in Union, New Jersey. The course uses the following text: Daniel, W. W. 1999. Biostatistics: a foundation for analysis in the health sciences. New York: John Wiley and Sons. The file follows this text very closely and readers are encouraged to consult the text for further information.

REGRESSION AND CORRELATION

Introduction

Regression and correlation analysis procedures are used to study the relationships between variables. Regression is used to predict the value of one variable based on the value of a different variable. Correlation is a measure of the strength of a relationship between variables. The variables are data which are measured and/or counted in an experiment. In the case of the examples used here, the data were obtained by counting the breathing rate of goldfish in a laboratory experiment.

Nature of data

The data for regression and correlation consist of pairs in the form (x,y). The independent variable (x) is determined by the experimenter. This means that the experimenter has control over the variable during the experiment. In our experiment, the temperature was controlled. The dependent variable (y) is the effect that is observed during the experiment. It is assumed that the values obtained for the dependent variable result from the changes in the independent variable. Regression and correlation analyses will determine the nature of this relationship, if any, and the strength of the relationship. All of the (x,y) pairs can be considered to form a population. In some experiments, numerous observations of y are taken at each value of x. In these cases, each set of values of y taken at a particular value of x forms a subpopulation of the data.

Graphical representation

Data are represented using a plot called a scatter plot, scatter diagram, or x-y plot. During analysis we try to find the equation of a line that fits the data. This is called the regression line. From algebra, we recall that points which are (x,y) pairs can be plotted on the Cartesian coordinate system. We also recall that a straight line on the Cartesian coordinate system has the equation y = mx + b, where m is the slope of the line, and b is the y-intercept of the line. The slope is always the coefficient of the x term in the equation. Following this pattern, the slope of the regression line can be given using various forms of equations, for example, y = ax + b, y = a + bx, etc. By looking at the equation we can determine the slope and the y-intercept.

Scatter diagrams can show a direct relationship between x and y. Consider the equation y = a + bx, where b is the slope. A direct relationship exists when the slope of the line (b) is positive. An inverse relationship exists when the slope of the line is negative. When the slope b = 0, there is no linear relationship. The nature of the relationship is discussed as part of correlation.

The regression line

As noted above, a straight line plotted on the Cartesian coordinate system can have the equation y = mx + b. We remember that m is the slope and b is the y-intercept.

A regression line will have a general form

y = α + βx + ε

where:

α is the y-intercept

β is the slope of the line

ε is an error term

In practice, under ordinary circumstances, we do not know the value of the error term so we use the following form of the equation

y = a + bx

although alternative forms (such as y = ax + b) will also yield the same results.

From the study of correlation we learn that when the slope of the regression line is positive (meaning that the value of b is positive) the value of y increases as the value of x increases. This is called a positive correlation. When the slope of the regression line is negative (meaning that the value of b is negative) the value of y decreases as x increases. This is called a negative correlation. The strength of these relationships is given by the correlation coefficient (r), which can be calculated.

Calculations

Regression and correlation work depends on a set of calculations. These are done by taking the distance that a point is from the theoretical regression line and squaring it. By adding these squares you obtain the sum of squares. Sum of squares information can be determined by calculating basic statistics on the data of the dependent and independent variables. Refer to the section on calculations for regression and correlation.
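
The squaring and adding described above can be sketched in a few lines of Python. This is only an illustration: the data values and the two candidate lines are invented for the example and are not the goldfish data, and the function name is hypothetical.

```python
def sum_of_squared_residuals(xs, ys, a, b):
    """Sum of squared vertical distances from each point to the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical (x, y) pairs, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Two candidate lines through the same data; the smaller sum is the better fit.
sse_good = sum_of_squared_residuals(xs, ys, a=0.0, b=2.0)
sse_poor = sum_of_squared_residuals(xs, ys, a=0.0, b=1.0)
print(sse_good, sse_poor)
```

The regression line is the line for which this sum is as small as possible, which is why the method is often called least squares.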

Regression analysis

Regression analysis is used to predict the value of one variable based on the value of a second variable which is controlled by the experimenter. Results may be plotted on a scatter plot as noted earlier.

Data

Data for regression analysis are in the form of (x,y) pairs which can be listed in two columns to form a data table. In the following example, opercular breathing rates (in counts per minute) were measured in the biology laboratory. Counts were made at various temperatures ranging from 9°C to 27°C. The data are presented in the figure below.

Regression equation

A linear equation in the form of y = mx + b can be calculated for these data. Sometimes the equation is given as y = ax + b and other times it is given as y = a + bx. No matter which form is used, we are interested in the coefficient accompanying the variable (x).

In each case, the desired value was within the range of the x values. Finding intermediate values this way is called interpolation. Finding values outside the range of the x values is called extrapolation. Regression calculations have limitations. For example, we could calculate the breathing rate at 28°C, which lies just outside the measured range.

y = 4.54x - 1.

= 4.54 (28) - 1.

= 128.

There are limitations on using regression equations for prediction. In this example, one can put a temperature like 42 or 100 into the equation and get an answer, but one must remember that many enzymes stop functioning at 42°C and at a temperature of 100°C, water boils and the fish would be cooked.

Regression analysis theory indicates that interpolation is most reliable near the middle of the range of the x values. It is less reliable at the ends of the range. One should be cautious with extrapolation because the results become more and more unreliable very quickly as one goes further away from the range of the x values.
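
The difference between interpolation and extrapolation can be made explicit with a small Python sketch that labels each prediction. The slope of 4.54 and the temperature range of 9°C to 27°C come from the text. The intercept is only partly legible in the source, so the value used here is a placeholder, and the function name is hypothetical.

```python
def predict(x, a, b, x_min, x_max):
    """Return (predicted y, kind) for the line y = a + b*x.

    kind is "interpolation" when x lies inside the observed range
    [x_min, x_max] and "extrapolation" otherwise.
    """
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return a + b * x, kind

SLOPE = 4.54       # slope reported in the text
INTERCEPT = -1.0   # PLACEHOLDER: the exact intercept is not legible in the source

for temperature in (18, 28, 42):
    rate, kind = predict(temperature, INTERCEPT, SLOPE, x_min=9, x_max=27)
    print(temperature, round(rate, 2), kind)
```

A label like this does not make an extrapolated value any more trustworthy. It only flags which predictions fall outside the range where the line was fitted.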

Significance of regression analysis

It is possible to perform a linear regression t test. Testing the relationship involves a null hypothesis that there is no relationship. In stating the hypotheses, β is the population regression coefficient and ρ is the population correlation coefficient.

Hypotheses

Results

Calculation of the linear regression t test gives a t value of 9.62, with probability p = 2.06 × 10^-4. Because this probability is very low, the result is significant: it is very unlikely that the relationship occurred by chance. See the figure below.

Figure. Linear regression t test. A. Calculation setup. B. Results.
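
As a rough check on a result of this kind, the t statistic for a simple linear regression can be computed from the correlation coefficient and the sample size. The Python sketch below uses the standard relation t = r·sqrt(n − 2)/sqrt(1 − r^2), with n − 2 degrees of freedom. The value r^2 = 0.948 is taken from the correlation section of this file. The sample size is not stated in the text, so n = 7 is an assumption made purely for illustration.

```python
import math

def t_statistic(r, n):
    """t statistic for testing zero slope (or zero correlation), df = n - 2."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

r = math.sqrt(0.948)   # r^2 = 0.948 as reported in the text
n = 7                  # ASSUMPTION: the sample size is not given in the text
t = t_statistic(r, n)
print(round(t, 2))
```

Because r^2 is rounded in the text, a value computed this way will not match the calculator's t value exactly.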

Correlation

Correlation is used to give information about the relationship between x and y. When the regression equation is calculated, the correlation results indicate the nature and the strength of the relationship.

Correlation coefficient

The correlation coefficient, r, indicates the nature and strength of the relationship between x and y. Values of r range from -1 to +1. A correlation coefficient of 0 means that there is no linear relationship. A value of -1 indicates a perfect negative correlation and a value of +1 indicates a perfect positive correlation.

Figure. Examples of correlational relationships. A. Perfect negative correlation (r = -1). B. No correlation (r = 0). C. Perfect positive correlation (r = +1).
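
The three cases can be reproduced numerically. The Python sketch below computes r from deviations about the means, which is one standard way of writing the formula. The data sets are invented so that each case comes out exactly, and the function name is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
r_neg = pearson_r(xs, [10, 8, 6, 4, 2])   # y falls steadily as x rises
r_zero = pearson_r(xs, [2, 1, 3, 1, 2])   # no linear trend
r_pos = pearson_r(xs, [2, 4, 6, 8, 10])   # y rises steadily as x rises
print(r_neg, r_zero, r_pos)
```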

Coefficient of determination

Another value of use in correlation analysis is the coefficient of determination, which is represented as r^2. Because it is a square, it is never negative and varies between 0 and 1.

The coefficient of determination gives an indication of the contribution of the factor being studied in the regression analysis to the relationship between x and y. In the case of goldfish data, the regression analysis results in an equation of

y = 4.54x - 1.

We see that as the temperature increases, the breathing rate increases. For these data,

r = 0.97

r^2 = 0.948

The value of the correlation coefficient, r, is 0.97. This indicates a very strong positive correlation between temperature and breathing rate. The coefficient of determination, r^2, has a value of 0.948. This indicates that about 95% of the variation in breathing rate is accounted for by temperature, which is the factor being considered in this activity.

Calculations for regression and correlation

The general form for the regression line is

y = α + βx + ε

These items of information will be used to calculate the equation of the regression line, the correlation coefficient (which is also used to obtain the coefficient of determination) and the variance of the residuals.

Calculate the regression equation

The regression equation is in the form y = a + bx. It is found using the values in the table above. The value of b is calculated first, then the value of a is obtained using the value obtained for b.

Calculation of b.

Calculation of a.
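
The original worked calculations are not reproduced here. As a reminder, the standard least-squares formulas for the slope and intercept are given below. The notation (n for the number of pairs, sums taken over all pairs) is an assumption and may differ from the layout used in the course text.

```latex
b = \frac{n \sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}
         {n \sum x_i^2 - \left(\sum x_i\right)^2}
\qquad
a = \bar{y} - b\,\bar{x}
```

Here \bar{x} and \bar{y} are the means of the x and y values, which is why b must be calculated before a.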

These two results give us the equation of the regression line in figure B which is

y = 4.53x - 1.
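
For readers who want to check a hand or calculator result, the least-squares calculation can be sketched in Python. The data below are invented so that the answer is easy to verify by eye. They are not the goldfish data, and the function name is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for the line y = a + b*x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope first, then the intercept from the two means.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b

# Hypothetical data lying exactly on y = 1 + 2x.
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)
```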

Calculate the correlation coefficient

The correlation coefficient is calculated using the formula below.

The value of r obtained from this calculation is the correlation coefficient. The square of the correlation coefficient is the coefficient of determination.
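
The formula itself is not legible in this copy. One standard computational form of the correlation coefficient, written with the same sums used for the regression equation, is shown below. Whether the course text uses exactly this form is an assumption.

```latex
r = \frac{n \sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}
         {\sqrt{\left[n \sum x_i^2 - \left(\sum x_i\right)^2\right]
                \left[n \sum y_i^2 - \left(\sum y_i\right)^2\right]}}
```

The numerator is the same as the numerator of the slope b, so r always has the same sign as the slope of the regression line.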

Calculation of residual variance

For the calculation of the residual variance, some additional formulas are used. These are listed in the following table.

Information used in calculating residual variance.

The formula for residual variance is given below as is the calculation for the sample.
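
The formula is not legible in this copy. A standard definition of the residual variance for a simple linear regression is sketched below. The symbols s^2_{y|x} and \hat{y}_i are notational assumptions and may differ from those in the course text.

```latex
s^2_{y|x} = \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{n - 2},
\qquad \hat{y}_i = a + b x_i,
\qquad s_{y|x} = \sqrt{s^2_{y|x}}
```

The divisor is n − 2 rather than n because two quantities, a and b, have been estimated from the data.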

So the residual variance is 56.1365714. By taking the square root of this variance, a value of approximately 7.49 is obtained. This is the residual standard deviation that is found using the linear regression t test on the TI-83. The calculator will give all of the results obtained during the discussion above. The figure below gives the two screens resulting from the linear regression t test.

Figure. Results of linear regression t test (first screen and second screen).