Categorical Variable, Tedious Command - Sociologists Statistics - Lecture Notes | Study notes Statistics


Stata Hand-out for Module 4

Helpful tips

1. To create a variable, use the “generate” command (“gen” for short). For example, “gen educ=.” tells Stata to create a numeric variable called “educ” whose values are all missing (“.” is Stata’s numeric missing value). See the extra material at the end of this module for more about generating variables.

2. If you don’t want to keep the variable you created, you can delete it by typing “drop” and then the variable name. For example, “drop educ” tells Stata to drop the variable above.

3. If you have a really long, tedious command, like a graph with 6 overlaid plots, you will probably NOT want to use the graphics interface, because typing the commands is faster once you know what they are. Even better, copy and paste commands so you don’t have to keep retyping the same ones. Sometimes it helps to put the command in a text file (using WordPad or Notepad) and use the search-and-replace function.

4. Some of the graphics commands in Stata are complicated. Don’t worry about this, because graphics are really not the point of the class!
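As a minimal sketch of tips 1 and 2, the commands below create a variable, fill it in, and then delete it. The dataset (Stata’s built-in “auto”) and the variable name “expensive” are assumptions chosen for illustration; they are not part of this module.

```stata
* Illustration only: uses Stata's built-in auto data, not the course data
sysuse auto, clear

* Create a numeric variable that is missing for every car
generate expensive = .

* Fill it in: 1 if price is above 6000, 0 otherwise
* (missing values count as larger than any number, so exclude them explicitly)
replace expensive = 1 if price > 6000 & !missing(price)
replace expensive = 0 if price <= 6000

* Check the result, then delete the variable
tabulate expensive
drop expensive
```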

Tables similar to those in the slides are easy to make in Stata. To take an example that is not in the lecture notes, if you open the dataset “Chile” you can make a table using two categorical variables, “educ” and “vote”. I’m choosing these simply because they are categorical variables; the choice is not theory-driven. Write “tab educ vote” to get a simple two-way table of frequencies. To add row percentages, column percentages, and/or cell percentages, write “row”, “col”, or “cell” (or all three, “row col cell”) after a comma. For example, to get row percentages:

From the dataset “Chile”

. tab educ vote, row

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

           |                          vote
 education |         A          N         NA          U          Y |     Total
-----------+-------------------------------------------------------+----------
         1 |         0          0          0          1          0 |         1
           |      0.00       0.00       0.00     100.00       0.00 |    100.00
-----------+-------------------------------------------------------+----------
        NA |         0          2          1          3          5 |        11
           |      0.00      18.18       9.09      27.27      45.45 |    100.00
-----------+-------------------------------------------------------+----------
         P |        52        266         71        295        422 |     1,106
           |      4.70      24.05       6.42      26.67      38.16 |    100.00
-----------+-------------------------------------------------------+----------
        PS |        32        224         24         52        130 |       462
           |      6.93      48.48       5.19      11.26      28.14 |    100.00
-----------+-------------------------------------------------------+----------
         S |       103        397         72        237        311 |     1,120
           |      9.20      35.45       6.43      21.16      27.77 |    100.00
-----------+-------------------------------------------------------+----------
     Total |       187        889        168        588        868 |     2,700
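
To see where a row percentage comes from, you can redo one by hand with Stata’s “display” command, which works as a calculator. This sketch takes the counts from the “P” row above: 266 “N” votes out of 1,106 people in that row.

```stata
* Row percentage = 100 * cell frequency / row total
display 100 * 266 / 1106

* The same table with column and cell percentages added
tab educ vote, row col cell
```

The first command should give about 24.05, matching the table. The second assumes the “Chile” dataset is still open.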


type_num==B |
 lue Collar |      Freq.     Percent        Cum.
      Total |        102      100.

-> tabulation of type

type_num==P |
rofessional |      Freq.     Percent        Cum.
      Total |        102      100.

-> tabulation of type

type_num==W |
hite Collar |      Freq.     Percent        Cum.
      Total |        102      100.
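
Variable labels of the form “type_num==Blue Collar” are what Stata attaches to dummy variables made with the “generate()” option of “tabulate”. As a sketch, assuming the categorical variable is called type_num (the name is inferred from the labels, not stated in this part of the hand-out), the dummies could be created and checked like this:

```stata
* Assumption: type_num holds the three occupation types
* Creates type1, type2, type3: one 0/1 dummy per category
tabulate type_num, generate(type)

* One-way table for each new dummy
tab1 type1 type2 type3
```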

To make it easier on ourselves, let’s rename the variables so we can see at a glance what they refer to. Do this with the “rename” command, whose syntax is “rename oldvar newvar” (“var” stands for “variable”):

. rename type1 type_BC

. rename type2 type_PROF

. rename type3 type_WC

Now you can regress prestige on income and type. Because “type” is a nominal (unordered categorical) variable rather than an ordinal, interval, or ratio variable, you have to enter it as separate dummy variables. You can’t just put in the original variable “type”, because its categories are not numbers; they are labels: blue collar, professional, white collar.

In this example, only type_PROF and type_WC are listed as independent variables. Leaving type_BC out makes it the “reference category”: the coefficients on type_PROF and type_WC are differences from blue collar at the same income.

. reg prestige income type_PROF type_WC

      Source |       SS       df       MS              Number of obs =     102
-------------+------------------------------           F(  3,    98) =  109.
       Model |  23016.1084     3  7672.03612           Prob > F      =  0.
    Residual |  6879.31774    98  70.1971198           R-squared     =  0.
-------------+------------------------------           Adj R-squared =  0.
       Total |  29895.4261   101  295.994318           Root MSE      =  8.

    prestige |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .0014871   .0002428     6.12   0.000     .0010052   .
   type_PROF |   24.42513   2.327585    10.49   0.000     19.80611   29.
     type_WC |   7.010142   2.125056     3.30   0.001     2.793037   11.
       _cons |   27.71983   1.749331    15.85   0.000     24.24834   31.
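
One way to read these coefficients is to work out the predicted prestige for each type at the same income. The sketch below uses “display” with the coefficients printed above; the income value of 5000 is an arbitrary choice for illustration.

```stata
* Blue collar (reference category): constant + income effect
display 27.71983 + .0014871*5000

* Professional: add the type_PROF coefficient
display 27.71983 + .0014871*5000 + 24.42513

* White collar: add the type_WC coefficient
display 27.71983 + .0014871*5000 + 7.010142
```

The results are roughly 35.2, 59.6, and 42.2. They differ only by the dummy coefficients, so the gaps between the types are the same at every income level: in this model the three fitted lines are parallel.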

The following method uses predictions from the regression equation just obtained. You can try this out if you’re feeling adventurous (but don’t worry too much about graphics at this stage).

Run the same regression (reg prestige income type_PROF type_WC), then run the commands below one at a time. The first obtains predicted values from the regression equation, the second creates a new variable for each category of “type”, and the graph command puts it all together.

. predict yhat

. separate yhat, by(type)

The “predict” command stores the predicted y-values in a new variable, here called yhat.

The “separate” command creates 3 new variables (yhat1, yhat2, yhat3), one for each of the 3 values of type.

This graph shows a separate line for each value of type by plotting the 3 yhats against income, and then lays a single regression line fitted to all the data over the whole thing (lfit). Type it as one line:

. graph twoway (scatter prestige yhat1 yhat2 yhat3 income, connect(i l l l) msymbol(o i i i) sort) (lfit prestige income)

WC_income | -.0020176 .000947 -2.13 0.036 -.0038974 -.

_cons | 15.31756 2.77366 5.52 0.000 9.811886 20.

To see this as three lines conditional on “type”, I had to predict yhat again. This time I named it “yhat_inter” to indicate that these predictions come from the regression equation with an interaction in it.
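
A sketch of how that could look, assuming the interaction terms are products of income and the type dummies. WC_income appears in the output above; the name PROF_income and the exact commands are assumptions, since this part of the hand-out does not show them. The “///” line continuation works in a do-file; in the Command window, type the graph command as one line.

```stata
* Assumed construction of the interaction terms
* (skip these two lines if the variables already exist)
generate PROF_income = type_PROF * income
generate WC_income   = type_WC * income

* Regression with the dummies and their interactions with income
regress prestige income type_PROF type_WC PROF_income WC_income

* Predicted values from the interaction model, split by type
* (new names follow the yhat1-yhat3 pattern used earlier)
predict yhat_inter
separate yhat_inter, by(type)

* Three fitted lines, now free to have different slopes
graph twoway (scatter prestige yhat_inter1 yhat_inter2 yhat_inter3 income, ///
    connect(i l l l) msymbol(o i i i) sort)
```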