Regression and Correlation Analysis in Biostatistics: A Practical Guide with Examples | Exercises Biostatistics

Regression and Correlation

This file is part of a program based on the Bio 4835 Biostatistics class taught at Kean University in Union, New Jersey. The course uses the following text:

Daniel, W. W. 1999. Biostatistics: a foundation for analysis in the health sciences. New York: John Wiley and Sons.

The file follows this text very closely and readers are encouraged to consult the text for further information.

REGRESSION AND CORRELATION

Introduction

Regression and correlation analysis procedures are used to study the relationships between variables. Regression is used to predict the value of one variable from the value of another. Correlation measures the strength of the relationship between variables. The variables are data that are measured or counted in an experiment. In the examples used here, the data were obtained by counting the breathing rate of goldfish in a laboratory experiment.
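
As a rough illustration of these two ideas, the Python sketch below fits a straight line to paired data and computes a correlation coefficient. It assumes ordinary least squares for the line and the Pearson coefficient for the correlation, neither of which has been introduced at this point in the notes. The temperature and breathing-rate values are invented for the example and are not the goldfish data; the names fit_line, pearson_r and predict are likewise hypothetical.

```python
from math import sqrt


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the pairs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of the pairs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def predict(intercept, slope, x):
    """Regression: predict y from x using the fitted line."""
    return intercept + slope * x


# Hypothetical values, invented for illustration only.
temperature = [10, 15, 20, 25, 30]   # x, set by the experimenter
breaths = [40, 55, 68, 85, 98]       # y, counted at each temperature

intercept, slope = fit_line(temperature, breaths)
r = pearson_r(temperature, breaths)
print("predicted y at x = 22:", predict(intercept, slope, 22))
print("correlation coefficient:", r)
```

The fitted line serves prediction, while the coefficient r summarizes how tightly the points cluster around a line.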

Nature of data

The data for regression and correlation consist of pairs in the form (x,y). The independent variable (x) is determined by the experimenter, who has control over it during the experiment. In our experiment, the temperature was controlled. The dependent variable (y) is the effect that is observed during the experiment. It is assumed that the values obtained for the dependent variable result from the changes in the independent variable. Regression and correlation analyses determine the nature of this relationship, if any, and its strength. All of the (x,y) pairs can be considered to form a population. In some experiments, numerous observations of y are taken at each value of x. In these cases, each set of values of y taken at a particular value of x forms a subpopulation of the data.
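
The idea of subpopulations can be sketched in Python by grouping the y values that share the same x. This is only one possible representation; the function names and the sample pairs are hypothetical and are not taken from the experiment.

```python
from collections import defaultdict


def subpopulations(pairs):
    """Group the y values by the x value at which they were observed."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    return dict(groups)


def subpopulation_means(pairs):
    """Return the mean of y within each subpopulation."""
    return {x: sum(ys) / len(ys) for x, ys in subpopulations(pairs).items()}


# Hypothetical (x, y) pairs: several observations of y at each value of x.
pairs = [(10, 40), (10, 44), (20, 70), (20, 66), (20, 68)]

groups = subpopulations(pairs)
means = subpopulation_means(pairs)
print(groups)
print(means)
```

Each key of the dictionary is one value of x, and the list stored under it is the subpopulation of y values observed at that x.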

Graphical representation

Data are represented using a plot called a scatter plot, scatter diagram, or x-y plot. During analysis we try to find the equation of a line that fits the data. This is called the regression line. From algebra, we recall that points which are (x,y) pairs can be plotted on the Cartesian coordinate system. We also recall that a straight line on the Cartesian coordinate system has the equation y = mx + b, where m is the slope of the line and b is the y-intercept. The slope is always the coefficient of the x term in the equation. Following this pattern, the equation of the regression line can be written in various forms, for example y = ax + b or y = a + bx. By looking at the equation we can determine the slope and the y-intercept: the slope is a in the first form and b in the second.
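
The algebra recalled above can be checked with a short Python sketch that recovers the slope and y-intercept of the line through two points and then evaluates y = mx + b. The function names and the two points are hypothetical, chosen only for illustration.

```python
def line_through(p1, p2):
    """Return (m, b) for the line y = m*x + b passing through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no finite slope")
    m = (y2 - y1) / (x2 - x1)   # slope: change in y per unit change in x
    b = y1 - m * x1             # y-intercept: value of y when x = 0
    return m, b


def evaluate(m, b, x):
    """Evaluate y = m*x + b at the given x."""
    return m * x + b


# Hypothetical points lying on the line y = 2x + 3.
m, b = line_through((0, 3), (2, 7))
print("slope:", m, "y-intercept:", b)
print("y at x = 5:", evaluate(m, b, 5))
```

Whatever letters an equation uses, the number multiplying x plays the role of m here and the constant term plays the role of b.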

Scatter diagrams can show a direct relationship between x and y. Consider the equation y = a + bx, where b is the slope. A direct relationship exists when the slope of the line (b) is positive. An inverse relationship exists when the slope is negative. When the slope b = 0, there is no linear relationship. The nature of the relationship is discussed as part of correlation.
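
The rule relating the sign of the slope to the kind of relationship can be written as a small Python function. This is a minimal sketch; the function name and the labels it returns are hypothetical.

```python
def relationship(slope):
    """Classify the relationship in y = a + b*x from the sign of the slope b."""
    if slope > 0:
        return "direct"    # y increases as x increases
    if slope < 0:
        return "inverse"   # y decreases as x increases
    return "none"          # slope is exactly zero: the line is horizontal


for b in (2.5, -0.8, 0.0):
    print(b, "->", relationship(b))
```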

The regression line

As noted above, a straight line plotted on the Cartesian coordinate system can have the equation y = mx + b, where m is the slope and b is the y-intercept.

A regression line will have the general form

y = α + βx + ε
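
To make the general form concrete, the Python sketch below assumes it consists of an intercept, a slope multiplied by x, and a random error term, and generates y values on that basis. The parameter names alpha, beta and error_sd, the choice of normally distributed errors, and all numeric values are assumptions made for illustration.

```python
import random


def simulate_observations(alpha, beta, xs, error_sd, seed=0):
    """Generate y = alpha + beta*x + error for each x.

    The error is drawn from a normal distribution with mean 0 and
    standard deviation error_sd (an assumption of this sketch).
    """
    rng = random.Random(seed)  # seeded so the run is reproducible
    return [alpha + beta * x + rng.gauss(0.0, error_sd) for x in xs]


xs = [0, 1, 2]

# With no error, every point falls exactly on the line.
exact = simulate_observations(10.0, 3.0, xs, error_sd=0.0)

# With error, the points scatter around the line.
noisy = simulate_observations(10.0, 3.0, xs, error_sd=2.0)

print(exact)
print(noisy)
```

When the error term is zero the points trace the straight line itself; a nonzero error produces the scatter seen in real data.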
