Stata Commands for Multiple Regression in Sociology: UNC Chapel Hill, Soci708-001, Study notes of Statistics

Stata commands for conducting multiple regression analysis in sociology, from the University of North Carolina at Chapel Hill course Sociology 708-001. The handout covers statistical functions (normal, t, and F distributions), descriptive statistics, correlations, a scatterplot matrix, multiple regression output, and analysis of residuals.

Typology: Study notes

2011/2012

Uploaded on 12/29/2012

University of North Carolina
Chapel Hill
Soci708-001 Statistics for Sociologists
Fall 2009
Professor François Nielsen
Stata Commands for Module 11 – Multiple Regression
For further information on any command in this handout, simply type help
followed by the name of the command in Stata.
See also the Stata and SAS Guide pdf (click on Documents in side bar; guide
is linked under Software Documentation).
1 Statistical Functions in Stata
The following statistical functions in Stata are useful for regression work. The
regression printout itself usually comprises all necessary statistics.
1.1 Normal Distribution Functions
The function normal(z) returns P(Z ≤ z), the area under the standard normal
curve to the left of z. (Compare with Table A.)
. display normal(1.207)
.88628393
The function invnormal(p) returns z such that P(Z ≤ z) = p, i.e. z such that
the area under the standard normal curve to the left of z is p. (Compare with
Table A and Table D, bottom row.)
. display invnormal(0.975)
1.959964
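As a small illustration of how these two functions combine (a sketch, not part of the original handout), the lines below compute an upper-tail area, a central area, and a two-sided critical value. The significance level 0.05 is an assumption chosen for the example.

```stata
* Area to the right of z = 1.207 (complement of the left-tail area)
display 1 - normal(1.207)

* Area between -1.96 and 1.96 under the standard normal curve
display normal(1.96) - normal(-1.96)

* Two-sided critical value z* for significance level alpha = 0.05
display invnormal(1 - 0.05/2)
```

The last line is the same call as invnormal(0.975) above, written so that the significance level is visible.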
1.2 Student t Distribution Functions
The function ttail(df, t) returns P(T>t), the area under the Student’s t
distribution with df degrees of freedom to the right of t. (Compare with Table
D.)
. display ttail(7, 1.960)
.04540985
The function invttail(df, p) returns t such that P(T > t) = p, i.e. t such
that the area under Student’s t distribution with df degrees of freedom to the
right of t is p. (Compare with Table D.)
. display invttail(7, 0.025)
2.3646243
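Because ttail() returns a one-sided (right-tail) area, two-sided quantities need a small adjustment. The following sketch, which is an addition to the handout, reuses the values 7 and 1.960 from the examples above; the 95% confidence level is an assumption chosen for illustration.

```stata
* Two-sided p-value for an observed t = 1.960 with 7 degrees of freedom
display 2*ttail(7, abs(1.960))

* Critical value t* for a 95% confidence interval with 7 degrees of freedom
display invttail(7, (1 - 0.95)/2)
```

Taking abs() of the statistic makes the same line work for negative t values.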
1.3 F Distribution Functions

The function Ftail(n1, n2, f) returns P(F > f), the area under the F
distribution with n1 and n2 degrees of freedom to the right of f. (Compare
with Table E.)

. display Ftail(1, 14, 21.55)

The function invFtail(n1, n2, p) returns f such that P(F > f) = p, i.e. f
such that the area under the F distribution with n1 and n2 degrees of freedom
to the right of f is p. (Compare with Table E.)

. display invFtail(1, 14, .00038068)
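The tail functions are linked in two ways that are easy to check: an F statistic with 1 numerator degree of freedom is the square of a t statistic, and each inverse function undoes its tail function. A minimal sketch (an addition to the handout, reusing the values 1, 14 and 21.55 from above):

```stata
* With 1 numerator df, the right-tail F area equals the two-sided t area
display Ftail(1, 14, 21.55)
display 2*ttail(14, sqrt(21.55))

* invFtail() undoes Ftail(): this should return 21.55 up to rounding
display invFtail(1, 14, Ftail(1, 14, 21.55))
```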

2 Descriptive Statistics, Correlations, and Scatterplot Matrix

I am using as an example the CSDATA from IPS6e (see Appendix D-2 for
description). The units are 224 Computer Science majors at a large university.
To enter the data in Stata I retrieved the csdata.xls file from the CD-ROM,
selected the data and copied them to the clipboard (Ctrl-C). Then in Stata I
opened the Data Editor (Data -> Data Editor) and pasted the data (Ctrl-V).
Then I closed the Data Editor by clicking on ×. (You can save the data as a
*.dta file if desired with File -> Save ...) Then I listed the first 5 cases.

. list in 1/5

     +--------------------------------------------------+
     | obs    gpa   hsm   hss   hse   satm   satv   sex |
     |--------------------------------------------------|
  1. |   1   3.32    10    10    10    670    600     1 |
  2. |   2   2.26     6     8     5    700    640     1 |
  3. |   3   2.35     8     6     8    640    530     1 |
  4. |   4   2.08     9    10     7    670    600     1 |
  5. |   5   3.38     8     9     8    540    580     1 |
     +--------------------------------------------------+
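The copy-and-paste route described above can also be scripted. The sketch below is an alternative, not the handout's own method: it assumes a Stata version that has the import excel command (Stata 12 or later), that csdata.xls sits in the current working directory, and that its first row holds the variable names.

```stata
* Assumption: csdata.xls is in the working directory and its first row
* contains the variable names (obs gpa hsm hss hse satm satv sex)
import excel using "csdata.xls", firstrow clear

* Check variable names and storage types, then save as a Stata dataset
describe
save csdata, replace

* List the first 5 cases
list in 1/5
```

A scripted import leaves a record of how the data were read, which makes it easy to repeat the analysis later.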

The response variable of interest is grade point average after three semesters
(gpa). The explanatory variables are high school grades in mathematics (hsm),
science (hss) and English or language arts (hse); SAT score in math (satm)
and verbal (satv); and sex (1=male, 2=female).

First I produced descriptive statistics for all the variables I intend to put in
the regression with the command su (for summarize).

. su hsm hss hse satm satv gpa

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         hsm |       224    8.321429    1.638737          2         10
         hss |       224    8.089286    1.699663          3         10
         hse |       224     8.09375    1.507874          3         10
        satm |       224    595.2857    86.40144        300        800
        satv |       224    504.5491    92.61046        285        760
-------------+--------------------------------------------------------
         gpa |       224    2.635223    .7793949        .12          4
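The section title mentions correlations, but the corresponding command does not survive in this excerpt. As a hedged sketch of one way to obtain them (not necessarily the command the handout used), Stata's correlate and pwcorr commands can be applied to the same variable list, assuming the CSDATA variables are in memory:

```stata
* Correlation matrix, using only cases complete on all listed variables
correlate hsm hss hse satm satv gpa

* Pairwise correlations with significance levels and number of cases
pwcorr hsm hss hse satm satv gpa, sig obs
```

With no missing values, as in the 224 cases summarized above, the two commands report the same correlations.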

Figure 1: Scatterplot matrix of variables for the gpa regression (CSDATA).
Produced with Stata command graph matrix hsm hss hse satm satv gpa, half xsize(4) ysize(4)

. reg gpa hsm hss hse satm satv, level(99)
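The option level(99) asks for 99% confidence intervals in place of the default 95%. Each interval is the coefficient plus or minus t* times its standard error, with t* taken from the t distribution on the residual degrees of freedom (224 - 5 - 1 = 218 here). The sketch below, an addition to the handout, reproduces the limits for hsm by hand from the coefficient and standard error reported in the regression table in this handout.

```stata
* Critical value t* for 99% confidence with 218 residual degrees of freedom
display invttail(218, (1 - 0.99)/2)

* 99% confidence limits for the hsm coefficient: b minus/plus t* times SE
display .1459611 - invttail(218, 0.005)*.039261
display .1459611 + invttail(218, 0.005)*.039261
```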

To obtain standardized coefficients (in place of the confidence intervals
shown by default) use the option beta.

. reg gpa hsm hss hse satm satv, beta

      Source |       SS       df       MS              Number of obs =     224
-------------+------------------------------           F(  5,   218) =     11.
       Model |  28.6436439     5  5.72872878           Prob > F      =      0.
    Residual |  106.819145   218  .489996078           R-squared     =      0.
-------------+------------------------------           Adj R-squared =      0.
       Total |  135.462789   223  .607456452           Root MSE      =       .

------------------------------------------------------------------------------
         gpa |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
         hsm |   .1459611    .039261     3.72   0.000                        .
         hss |   .0359053   .0377984     0.95   0.343                        .
         hse |   .0552926   .0395687     1.40   0.164                        .
        satm |   .0009436   .0006857     1.38   0.170                        .
        satv |  -.0004078   .0005919    -0.69   0.492                       -.
       _cons |   .3267187   .3999964     0.82      0.                        .
------------------------------------------------------------------------------
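The summary statistics in the header of a regression printout follow arithmetically from the sums of squares and mean squares in the ANOVA table, and each t statistic and standardized coefficient follows from its row. The sketch below is an addition to the handout: it recomputes these quantities with display, using only the sums of squares, mean squares, coefficient and standard error printed above and the standard deviations from the su output.

```stata
* F statistic = MS(Model) / MS(Residual), and its p-value
display 5.72872878/.489996078
display Ftail(5, 218, 5.72872878/.489996078)

* R-squared = SS(Model) / SS(Total)
display 28.6436439/135.462789

* Adjusted R-squared = 1 - MS(Residual) / MS(Total)
display 1 - .489996078/.607456452

* Root MSE = square root of MS(Residual)
display sqrt(.489996078)

* t statistic for hsm = coefficient / standard error
display .1459611/.039261

* Standardized (beta) coefficient for hsm = b * sd(hsm) / sd(gpa)
display .1459611*1.638737/.7793949
```

After regress, Stata also stores the header statistics as e(F), e(r2), e(r2_a) and e(rmse), so they can be displayed directly without retyping numbers.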


Using the command vif after running a regression will calculate the variance
inflation factors (VIF). These are measures of collinearity, the degree to
which each explanatory variable is associated with all the other explanatory
variables. A VIF above 10 is considered bothersome, but there is no VIF above
10 in this particular example. (We are not going to use vif in this class.)

. vif

    Variable |       VIF       1/VIF
-------------+----------------------
         hsm |      1.88          0.
         hss |      1.88          0.
         hse |      1.62          0.
        satm |      1.60          0.
        satv |      1.37          0.
-------------+----------------------
    Mean VIF |        1.
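To see what a VIF measures, it can be reproduced by hand: regress one explanatory variable on all the others and compute 1/(1 - R-squared) from that auxiliary regression. The following is a sketch added to the handout and assumes the CSDATA variables are in memory.

```stata
* Auxiliary regression of hsm on the other explanatory variables
quietly regress hsm hss hse satm satv

* VIF for hsm = 1 / (1 - R-squared of the auxiliary regression)
display 1/(1 - e(r2))

* The 1/VIF column (the tolerance) is the reciprocal
display 1 - e(r2)
```

The auxiliary regression replaces the stored estimation results, so the gpa regression has to be run again before any further post-estimation commands. In current Stata versions the documented form of the command is estat vif.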

4 Analysis of Residuals

To calculate the predicted values of gpa and the gpa residuals I can use the
command predict, with the options xb and residuals, respectively, assigning
variable names of my choice. Then to check the distribution of residuals
I draw a histogram of the residuals (shown in Figure 2) and a normal quantile
plot of the residuals (shown in Figure 3). The only fancy option I use is
xsize(3.5) ysize(3.5) with the normal quantile plot, to make the plot
square. Together with the straight line that Stata draws automatically, the
square format shows deviations of the plot from linearity better than a
rectangular plot (compare with IPS6e Figure 11.5 p.620). We can see the
left skew in the distribution.
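The steps just described can be sketched as a short sequence of commands. The variable names gpapredict and gparesid are the ones that appear in the figure captions of this handout. The histogram command and its normal option are assumptions, since the command used for Figure 2 is not shown in this excerpt. The sketch assumes the CSDATA variables are in memory and that gpapredict and gparesid do not exist yet.

```stata
* Fit the model first; predict uses the most recent estimation results
regress gpa hsm hss hse satm satv

* Predicted values (linear prediction) and residuals
predict gpapredict, xb
predict gparesid, residuals

* Histogram of residuals; the normal option (an addition in this sketch)
* overlays a normal density curve
histogram gparesid, normal

* Normal quantile plot in square format
qnorm gparesid, xsize(3.5) ysize(3.5)

* Residuals against predicted values with a horizontal line at zero
twoway (scatter gparesid gpapredict), yline(0)
```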

Figure 3: Normal quantile plot of residuals for the gpa regression (CSDATA).
Produced by Stata command qnorm gparesid, xsize(3.5) ysize(3.5)

Figure 4: Residual plot for the gpa regression (CSDATA). Produced by Stata
command twoway (scatter gparesid gpapredict), yline(0)

for Stata; see IPS6e p.594 for formulas). I check the values for the first 5
observations. Note that the SE of forecast is always larger than the SE for
the mean response, as the SE of forecast contains individual variation in the
response variable in addition to uncertainty about the mean response.

. predict gpasepred, stdp
. predict gpaseforecast, stdf
. list gpapredict gpasepred gpaseforecast in 1/5

     +--------------------------------+
     | gpapre~t   gpasep~d   gpasef~t |
     |--------------------------------|
  1. | 3.085806   .0871256   .7053984 |
  2. | 2.165682   .1740038   .7212998 |
  3. | 2.539919   .0932556   .7061818 |
  4. | 2.773967   .1156441   .7094855 |
  5. | 2.532883   .0898973   .7057461 |
     +--------------------------------+
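The relation between the two standard errors can be checked numerically: the squared SE of forecast equals the squared SE of the mean response plus the residual mean square (.489996078 in the regression output). The sketch below, an addition to the handout, verifies this for observation 1 and then builds 95% prediction limits. The names gpalow and gpahigh are hypothetical, and the second part assumes the gpa regression is the most recent estimation and that gpapredict and gpaseforecast exist.

```stata
* SE of forecast for observation 1 from its SE of the mean response
* and MS(Residual); compare with .7053984 in the listing above
display sqrt(.0871256^2 + .489996078)

* 95% prediction limits for each case; e(df_r) is the residual df
generate gpalow  = gpapredict - invttail(e(df_r), 0.025)*gpaseforecast
generate gpahigh = gpapredict + invttail(e(df_r), 0.025)*gpaseforecast
list gpapredict gpalow gpahigh in 1/5
```

Replacing gpaseforecast with gpasepred in the same formulas gives confidence limits for the mean response, which are narrower.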