Linear Regression Exercise: One-Dimensional Arrays and Regression Analysis, Exercises of Computer Science

An exercise on linear regression analysis using one-dimensional arrays. It explains the concept of regression analysis, the equations for calculating the intercept (a) and slope (b), and introduces two goodness-of-fit measures: standard error and coefficient of variation. The document also includes instructions for completing the exercise, which involves preparing a data file, running the program, and analyzing the results.

Typology: Exercises

2013/2014

Uploaded on 02/01/2014

savitri_122
Programming Exercise Eight (and last)

Objective

This assignment provides an example in the use of one-dimensional arrays and introduces the concept of regression analysis, which is used to estimate a relationship between two variables.

Mathematical Background

If several measurements are made on pairs of experimental data $\{(x_i, y_i),\ i = 1, \ldots, N\}$, we can use a technique, known as regression analysis, to determine an approximate equation of a straight line that gives a best fit to the data. The equation of this best-fit line is written as follows.

$$\hat{y} = a + b x$$

In this equation, we use the symbol $\hat{y}$ instead of $y$ to indicate that the predicted value found from the equation $\hat{y} = a + b x$ is an approximate result. For a given data point, $(x_i, y_i)$, the value of $y_i$ represents the actual data and we would obtain the predicted value of $y$ at the point $x = x_i$ from the equation $\hat{y}_i = a + b x_i$. The difference between the measured and predicted value is $|y_i - \hat{y}_i|$.
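As a small illustration of predicted values and differences, the sketch below evaluates $\hat{y}_i = a + b x_i$ for a few points held in one-dimensional arrays. The function name `predict_and_residuals`, the coefficients, and the data values are hypothetical and chosen only for this example; Python is used for illustration and is not necessarily the language of the course program.

```python
def predict_and_residuals(x, y, a, b):
    """Return (y_hat, residuals) for the line y_hat = a + b*x.

    residuals[i] is |y[i] - y_hat[i]|, the difference between the
    measured and the predicted value at x[i].
    """
    y_hat = []
    residuals = []
    for xi, yi in zip(x, y):
        predicted = a + b * xi
        y_hat.append(predicted)
        residuals.append(abs(yi - predicted))
    return y_hat, residuals


# Hypothetical data and coefficients (not from the exercise).
x = [1.0, 2.0, 3.0]
y = [3.5, 4.5, 7.5]
y_hat, residuals = predict_and_residuals(x, y, a=1.0, b=2.0)
print(y_hat)      # predicted values on the line
print(residuals)  # |y_i - y_hat_i| for each point
```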

[Figure: Fitted Line. Axes: x (horizontal) and y (vertical). Legend: marker indicates data points. Labeled values: $x_i$, $y_i$, and $\hat{y}_i$.]

In the chart, the data points are indicated by the small ellipses. The coordinates of one typical data point are shown by the dotted lines indicating the coordinates $x_i$ and $y_i$. The solid line is the fitted regression line, $\hat{y} = a + b x$. The point where the dotted line at $x = x_i$ crosses the regression line has the coordinates $(x_i, \hat{y}_i)$. In this particular example the value of $\hat{y}_i$ is less than the value of $y_i$. There is a large scatter of data points about the regression line in this example.

The example plot above might represent calibration data on an instrument. The x values would denote the instrument reading and the y values would indicate the true value of the quantity being measured. Once the calibration tests were completed, it would be useful to have a simple equation to relate the instrument reading (x) to the actual quantity being measured (y).

In addition to finding the values of a and b that give the best-fit line, we would also like to have some measure of how well the line fits the data. Two different goodness-of-fit measures, the standard error and the coefficient of variation, are presented below in the equations section.

Equations used

The equations used to calculate $a$ and $b$ can be found by an analysis that minimizes the sum of the squared differences between the actual data points, $y_i$, and the fitted points, $\hat{y}_i = a + b x_i$. The results of this analysis are shown below. The equations to compute the intercept, $a$, and the slope, $b$, in terms of the entire set of data, $\{x_i, y_i\}$, use the following definitions of mean values:

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i \qquad \text{and} \qquad \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i$$
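For readers who want the intermediate step, the minimization can be sketched as follows. This is the standard least-squares argument rather than text from the original handout, and the symbol $S$ for the sum of squared differences is introduced here only for this sketch.

$$S(a, b) = \sum_{i=1}^{N} (y_i - a - b x_i)^2$$

$$\frac{\partial S}{\partial a} = -2 \sum_{i=1}^{N} (y_i - a - b x_i) = 0 \quad \Longrightarrow \quad a = \bar{y} - b\,\bar{x}$$

$$\frac{\partial S}{\partial b} = -2 \sum_{i=1}^{N} x_i (y_i - a - b x_i) = 0 \quad \Longrightarrow \quad \sum_{i=1}^{N} x_i y_i - a N \bar{x} - b \sum_{i=1}^{N} x_i^2 = 0$$

Substituting $a = \bar{y} - b\,\bar{x}$ into the second condition and collecting the terms in $b$ gives

$$b \left( \sum_{i=1}^{N} x_i^2 - N(\bar{x})^2 \right) = \sum_{i=1}^{N} x_i y_i - N(\bar{x})(\bar{y})$$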

With these definitions, the slope, b, and the intercept, a, are found as follows.

$$b = \frac{\sum_{i=1}^{N} x_i y_i - N(\bar{x})(\bar{y})}{\sum_{i=1}^{N} x_i^2 - N(\bar{x})^2} \qquad \text{and} \qquad a = \bar{y} - b\,\bar{x}$$
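The slope and intercept equations translate directly into loops over one-dimensional arrays. The following is a minimal sketch, not the course program: the function name `fit_line` and the sample data are hypothetical, and Python is used only for illustration.

```python
def fit_line(x, y):
    """Return (a, b), the least-squares intercept and slope of y_hat = a + b*x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, with at least two points")

    # Mean values of x and y.
    x_mean = sum(x) / n
    y_mean = sum(y) / n

    # Accumulate the sums of x_i*y_i and x_i^2 in a single pass.
    sum_xy = 0.0
    sum_xx = 0.0
    for xi, yi in zip(x, y):
        sum_xy += xi * yi
        sum_xx += xi * xi

    denominator = sum_xx - n * x_mean * x_mean
    if denominator == 0.0:
        raise ValueError("all x values are identical; the slope is undefined")

    b = (sum_xy - n * x_mean * y_mean) / denominator
    a = y_mean - b * x_mean
    return a, b


# Hypothetical data lying exactly on the line y = 1 + 2x.
a, b = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(a, b)
```

Data that lie exactly on a line are a convenient check, because the fitted a and b should reproduce that line's intercept and slope.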

A statistical estimate of the variability can be found from the difference between the actual data points $y_i$ and the estimated value $\hat{y}_i = a + b x_i$. This measure, which is called the standard error and has the symbol $s_{y|x}$, is defined as follows:

$$s_{y|x} = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N - 2}}$$
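A sketch of the standard error calculation is given below, assuming the definition is the square root of the sum of squared differences divided by N - 2. The function name `standard_error`, the coefficients, and the data are hypothetical.

```python
import math


def standard_error(x, y, a, b):
    """Return s_y|x for the line y_hat = a + b*x fitted to the points (x[i], y[i])."""
    n = len(x)
    if n != len(y) or n <= 2:
        raise ValueError("need matching arrays with more than two points (divisor is N - 2)")

    # Sum of squared differences between measured and predicted values.
    sum_sq = 0.0
    for xi, yi in zip(x, y):
        diff = yi - (a + b * xi)
        sum_sq += diff * diff

    return math.sqrt(sum_sq / (n - 2))


# Hypothetical data: each point is one unit above or below the line y = 1 + 2x,
# so the sum of squared differences is 4 and N - 2 is 2.
s = standard_error([0.0, 1.0, 2.0, 3.0], [2.0, 2.0, 6.0, 6.0], a=1.0, b=2.0)
print(s)
```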

Another measure, called the $R^2$ value or the coefficient of variation (more commonly known as the coefficient of determination), is considered to be a measure of the amount of variation in the data that is explained by the regression equation. An $R^2$ value of zero means that the regression cannot explain any of the variation in $y$; an $R^2$ value of one means that all the variation in $y$ can be explained by the regression equation. The value of $R^2$ is computed from the following equation:

$$R^2 = 1 - \frac{(N - 2)\, s_{y|x}^2}{\sum_{i=1}^{N} y_i^2 - N(\bar{y})^2}$$
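The $R^2$ equation can be sketched in code as shown below. Since $(N - 2)\, s_{y|x}^2$ is the sum of squared differences between $y_i$ and $\hat{y}_i$, and $\sum y_i^2 - N(\bar{y})^2$ equals $\sum (y_i - \bar{y})^2$, this is one minus the ratio of unexplained to total variation. The function name `r_squared` and the sample values are hypothetical.

```python
def r_squared(y, s_yx):
    """Return R^2 given the y data and the standard error s_y|x of the fitted line."""
    n = len(y)
    if n <= 2:
        raise ValueError("need more than two points")

    y_mean = sum(y) / n

    # Total variation of y about its mean: sum(y_i^2) - N * y_mean^2.
    total = 0.0
    for yi in y:
        total += yi * yi
    total -= n * y_mean * y_mean
    if total == 0.0:
        raise ValueError("all y values are identical; R^2 is undefined")

    return 1.0 - (n - 2) * s_yx * s_yx / total


# Hypothetical values: y has total variation 16, and a standard error of
# sqrt(2) with N = 4 corresponds to a sum of squared differences of 4.
r2 = r_squared([2.0, 2.0, 6.0, 6.0], s_yx=2 ** 0.5)
print(r2)
```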

Task One

You can use a previously written program for this task. Download the program file from the exercise page on the course web site. Review that program and see how the various functions are used to enter array data and do calculations with array data in loops. Note that the program determines the number of data points ($N$ in the equations above) by reading the data. The user is not required to count the data and input a value for $N$. The program has summary output to the screen and detailed output of $a$, $b$, $s_{y|x}$, $R^2$, and a table of $x_i$, $y_i$, and $\hat{y}_i$.
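The idea of letting the program count the data can be sketched as below. This is not the course program, which is not reproduced here. The function name `read_pairs` and the file layout (one whitespace-separated x y pair per line) are assumptions made for this illustration; the real input format should be taken from the input statements in the downloaded program.

```python
def read_pairs(path):
    """Read x y pairs from a text file and return them as two lists.

    Assumed layout (hypothetical): one pair per line, separated by whitespace.
    The number of data points is simply the length of the returned lists.
    """
    x = []
    y = []
    with open(path) as data_file:
        for line_number, line in enumerate(data_file, start=1):
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            if len(fields) != 2:
                raise ValueError(f"line {line_number}: expected two values, got {len(fields)}")
            x.append(float(fields[0]))
            y.append(float(fields[1]))
    return x, y
```

After a call such as `x, y = read_pairs("regression_data.txt")` (a hypothetical file name), the value of N is `len(x)`, so the user never has to count the data by hand.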

Prepare a data file for the test case below. Review the input statements to see how you should prepare this file. Run the program with your test data file to make sure you are using the program correctly by matching the results below.

Test Data and Results for Linear Regression

x_i:   510    533    603    670    750
y_i:   1.3    0.1    1.5    1.8    3.

Results: a = -5.77566; b = 0.0122238; R^2 = 0.

Copy the output file from the test data set in the table above to your submission file. Do not copy the code or the full output file from the downloaded data set to the submission file.

Submit a copy of the submission file with all the elements asked for in each task above:

• A copy of the output file from task one
• Your code for task two
• A copy of the output for task two that has the correct answers