Categorical Variable, Tedious Command - Sociologists Statistics - Lecture Notes, Study notes of Statistics

This is a Sociologists Statistics handout. Key points are: Categorical Variable, WordPad or Notepad, Commands, Graphics Commands, Simple Table, Row Col Cell, Value of Sex, Percentage, Bar Chart

Typology: Study notes

2011/2012

Uploaded on 12/29/2012

sankait
Stata Hand-out for Module 4
Helpful tips
1. Creating a variable involves the “generate” or “gen” command. For example, “gen educ=.”
tells Stata to create a variable called “educ” and to make it numeric (the “.”, Stata’s numeric
missing value, tells Stata this). See the extra material at the end of this module for more about
generating variables.
2. If you don’t want to keep the variable you created, you can delete it by typing “drop” and
then the variable name. For example, “drop educ” tells Stata to drop the variable above.
3. If you have a really long, tedious command, like a graphic with 6 overlaid plots, you will
probably NOT want to use the graphics interface because typing the commands is faster once
you know what they are. Even better, you will probably want to copy and paste commands
so you don’t have to keep retyping the same ones. Sometimes it will help to put the
command in a text file (using WordPad or Notepad) and use the search and replace
function.
4. Some of the graphics commands are complicated in Stata. Don’t worry about this because
graphics are really not the point of the class!
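Tips 1 and 2 can be tried out together. The lines below are only a minimal sketch: they assume a dataset is already open, and “educ2” is a made-up variable name used purely for illustration.

```stata
* Sketch only: assumes a dataset is open and has no variable called educ2
* Create a new numeric variable in which every observation is missing
generate educ2 = .
* Show the storage type of the new variable
describe educ2
* Remove the variable again
drop educ2
```

If lines like these are saved in a plain text file with the extension .do (for example a hypothetical file called module4.do in the working directory), typing “do module4.do” runs them all at once, which fits the advice in tip 3.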
Tables similar to those created in the slides can be made easily in Stata. To take an example that is
not in the lecture notes, if you open the dataset “Chile” you can make a table using two categorical
variables, “educ” and “vote”. I’m choosing these to show you simply because they are categorical
variables; this is not theory-driven. You can simply write “tab educ vote” to get a simple table. You
can choose to get the row percentages, column percentages, and/or cell percentages by writing
“row”, “col”, “cell” (or all three “row col cell”) after a comma. For example, to get row percentages:
From the dataset “Chile”
. tab educ vote, row
+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

           |                         vote
 education |         A          N         NA          U          Y |     Total
-----------+-------------------------------------------------------+----------
         1 |         0          0          0          1          0 |         1
           |      0.00       0.00       0.00     100.00       0.00 |    100.00
-----------+-------------------------------------------------------+----------
        NA |         0          2          1          3          5 |        11
           |      0.00      18.18       9.09      27.27      45.45 |    100.00
-----------+-------------------------------------------------------+----------
         P |        52        266         71        295        422 |     1,106
           |      4.70      24.05       6.42      26.67      38.16 |    100.00
-----------+-------------------------------------------------------+----------
        PS |        32        224         24         52        130 |       462
           |      6.93      48.48       5.19      11.26      28.14 |    100.00
-----------+-------------------------------------------------------+----------
         S |       103        397         72        237        311 |     1,120
           |      9.20      35.45       6.43      21.16      27.77 |    100.00
-----------+-------------------------------------------------------+----------
     Total |       187        889        168        588        868 |     2,700
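The other percentage options work the same way as “row”. The commands below are a short sketch, assuming the Chile dataset is still open; the last line checks one row percentage from the table above by hand.

```stata
* Sketch only: assumes the Chile dataset is already open
* Column percentages instead of row percentages
tab educ vote, col
* Frequencies with row, column and cell percentages together
tab educ vote, row col cell
* Row percentages only, without the frequencies
tab educ vote, row nofreq
* Check by hand: row P, column Y, using the frequencies 422 and 1,106
display 100*422/1106
```

The hand calculation comes to about 38.16, matching the row percentage shown for education P and vote Y.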

If there is another categorical variable, you may want to see what the same table looks like for each value of that variable. For example, “by sex:” tells Stata to make two tables, one for each value of sex. The data must first be sorted according to sex (sort sex).

. sort sex
. by sex: tab educ vote, row


-> sex = F

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

           |                         vote
 education |         A          N         NA          U          Y |     Total
-----------+-------------------------------------------------------+----------
        NA |         0          2          0          1          4 |         7
           |      0.00      28.57       0.00      14.29      57.14 |    100.00
-----------+-------------------------------------------------------+----------
         P |        32        112         36        177        250 |       607
           |      5.27      18.45       5.93      29.16      41.19 |    100.00
-----------+-------------------------------------------------------+----------
        PS |        15         86          7         31         60 |       199
           |      7.54      43.22       3.52      15.58      30.15 |    100.00
-----------+-------------------------------------------------------+----------
         S |        57        163         27        153        166 |       566
           |     10.07      28.80       4.77      27.03      29.33 |    100.00
-----------+-------------------------------------------------------+----------
     Total |       104        363         70        362        480 |     1,379
           |      7.54      26.32       5.08      26.25      34.81 |    100.00

-> sex = M

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

           |                         vote
 education |         A          N         NA          U          Y |     Total
-----------+-------------------------------------------------------+----------
         1 |         0          0          0          1          0 |         1
           |      0.00       0.00       0.00     100.00       0.00 |    100.00
-----------+-------------------------------------------------------+----------
        NA |         0          0          1          2          1 |         4
           |      0.00       0.00      25.00      50.00      25.00 |    100.00
-----------+-------------------------------------------------------+----------
         P |        20        154         35        118        172 |       499
           |      4.01      30.86       7.01      23.65      34.47 |    100.00
-----------+-------------------------------------------------------+----------
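The sorting step and the “by” prefix can also be written as one command. This is a small sketch, assuming the Chile dataset is open:

```stata
* Sketch only: assumes the Chile dataset is already open
* bysort sorts the data by sex, then repeats the table for each value of sex
bysort sex: tab educ vote, row
```

This should give the same pair of tables as “sort sex” followed by “by sex: tab educ vote, row”.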

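type_num==B |
 lue Collar |      Freq.     Percent        Cum.
------------+-----------------------------------
      Total |        102      100.00

-> tabulation of type

type_num==P |
rofessional |      Freq.     Percent        Cum.
------------+-----------------------------------
      Total |        102      100.00

-> tabulation of type

type_num==W |
hite Collar |      Freq.     Percent        Cum.
------------+-----------------------------------
      Total |        102      100.00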
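The tabulations above are of dummy (0/1) variables, one for each category of type. The command that created them is not shown here. Their labels (“type_num==Blue Collar” and so on) look like what “tabulate” produces with its “generate()” option, so one possibility, offered only as an assumption, is the following sketch:

```stata
* Sketch only. Assumption: the dummies were built from type_num, the labelled
* numeric version of type created in the EXTRA section at the end of this handout
* gen(type) creates one 0/1 variable per category: type1, type2, and so on
tab type_num, gen(type)
* One frequency table for each of the first three dummies
tab1 type1 type2 type3
```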

To make it easier on ourselves, let’s rename the variables so we can see at a glance what they refer to. Do this with the command “rename”: rename oldvar newvar [“var” stands for “variable”]

. rename type1 type_BC

. rename type2 type_PROF

. rename type3 type_WC

Now you can regress prestige on income and type. Because “type” is a categorical variable rather than an ordinal or ratio variable, you have to put in the separate dummy variables. You can’t just put in the original variable “type” because the categories are not numbers, they are categorical values: blue collar, professional, white collar.

In this example, only type_PROF and type_WC are listed as independent variables, which tells Stata that type_BC is to be understood as the “reference category.”

. reg prestige income type_PROF type_WC

      Source |       SS       df       MS              Number of obs =     102
-------------+------------------------------           F(  3,    98) =    109.
       Model |  23016.1084     3  7672.03612           Prob > F      =      0.
    Residual |  6879.31774    98  70.1971198           R-squared     =      0.
-------------+------------------------------           Adj R-squared =      0.
       Total |  29895.4261   101  295.994318           Root MSE      =      8.

------------------------------------------------------------------------------
    prestige |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .0014871   .0002428     6.12   0.000     .0010052          .
   type_PROF |   24.42513   2.327585    10.49   0.000     19.80611        29.
     type_WC |   7.010142   2.125056     3.30   0.001     2.793037        11.
       _cons |   27.71983   1.749331    15.85   0.000     24.24834        31.
------------------------------------------------------------------------------
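To see what the dummy coefficients mean, it can help to work out a predicted value by hand. The lines below are a minimal sketch to run straight after the regression above; the income of 10000 is an arbitrary illustrative value, not something taken from the handout.

```stata
* Sketch only: run immediately after reg prestige income type_PROF type_WC
* Predicted prestige for a professional with an income of 10000
display _b[_cons] + _b[income]*10000 + _b[type_PROF]
* Predicted prestige for a blue collar worker (the reference category)
display _b[_cons] + _b[income]*10000
```

With the coefficients shown above these come to roughly 67.0 and 42.6. The gap between them is the type_PROF coefficient, about 24.4, which is the estimated difference between professionals and the reference category at the same income.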

The following method is based on getting predictions from the regression equation just obtained. You can try this out if you’re feeling adventurous. (But don’t worry too much about graphics at this stage.)

First run the same regression (reg prestige income type_PROF type_WC). Then run the following as separate commands: the first command obtains predicted values based on the regression equation, the next command creates new variables for each category of “type”, and the graph command tries to put this all together.

. predict yhat

. separate yhat, by(type)

The “predict” command predicts the y-values, which are called yhat. The “separate” command creates 3 new yhat variables corresponding to the 3 values of type.

This graph lets you see the different lines for different values of type by graphing the 3 yhats, and then laying a regression line over the whole thing (lfit):

. graph twoway (scatter prestige yhat1 yhat2 yhat3 income, connect(i l l l) msymbol(o i i i) sort) (lfit prestige income)

   WC_income |  -.0020176    .000947    -2.13   0.036    -.0038974        -.
       _cons |   15.31756    2.77366     5.52   0.000     9.811886       20.
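The output fragment above comes from a regression that includes an interaction between type and income; the commands that produced it are not shown here. A minimal sketch of how such a model might be set up follows. The name WC_income appears in the output, while PROF_income is a hypothetical name for the matching professional term, and the exact specification is an assumption.

```stata
* Sketch only: the exact model behind the output above is an assumption
* Interaction terms: each type dummy multiplied by income
* (PROF_income is a hypothetical name; WC_income appears in the output)
gen PROF_income = type_PROF*income
gen WC_income = type_WC*income
* Regression with both dummies and both interaction terms
reg prestige income type_PROF type_WC PROF_income WC_income
```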

To see this as three lines conditional on “type” I had to predict yhat again. This time I labeled it “yhat_inter” to indicate that these yhats were predicted based on the regression equation with an interaction in it.

. predict yhat_inter

. separate yhat_inter, by(type)

. graph twoway scatter prestige yhat_inter1 yhat_inter2 yhat_inter3 income, connect(i l l l) msymbol(o i i i) sort

[Graph: prestige and the three predicted lines (yhat_inter for type == bc, type == prof and type == wc) plotted against income]

Dot plots… don’t seem that great in Stata. Maybe one of you can improve on this!

Using the Prestige data:

. graph dot (sum) type_BC type_PROF type_WC

This creates a (not very useful) graph, showing that there are about 23 white collar workers, 31 professionals, and 44 blue collar workers:

[Dot plot: sum of type_BC, sum of type_PROF and sum of type_WC, on a scale from 0 to 40]
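One possible variation, offered only as a sketch, is to let Stata count the observations in each category of the original string variable instead of summing the dummies. It assumes the Prestige data are open and that prestige has no missing values (all 102 observations were used in the regression earlier).

```stata
* Sketch only: assumes the Prestige dataset is open
* Count the non-missing values of prestige within each category of type,
* which gives the number of observations per category
graph dot (count) prestige, over(type)
```

Because this uses type directly, the NA category appears as a group of its own.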

EXTRA: Data Management Exercise: Turning a string variable into a numeric variable

As we saw earlier, having data in a string variable can sometimes make things a little harder in Stata. We encountered this with the variable “type” in the Prestige dataset. This variable has values “bc”, “prof” and “wc”. We can make this variable into a numeric variable (though it doesn’t actually solve the problems with making a bar chart or using it in regression; those things still appear to be easier in R). Here is one way to do it.

This is our old variable.

. tab type

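       type |      Freq.     Percent        Cum.
------------+-----------------------------------
         NA |          4        3.92        3.92
         bc |         44       43.14       47.06
       prof |         31       30.39       77.45
         wc |         23       22.55      100.00
------------+-----------------------------------
      Total |        102      100.00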

First, generate a new variable. Let’s call it “type_num”. We want to make sure Stata knows it’s numeric, so we tell Stata this as follows:

. gen type_num=.
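The steps between creating type_num and attaching its value label are not shown here. A minimal sketch of what they might look like follows. The numeric codes 1 to 4 are an assumption, chosen to match the order of the categories in the labelled table further down; the label text is taken from that table.

```stata
* Sketch only: continues from gen type_num=. above
* The codes 1-4 are an assumption; the handout does not show them
replace type_num = 1 if type == "bc"
replace type_num = 2 if type == "prof"
replace type_num = 3 if type == "wc"
replace type_num = 4 if type == "NA"
* Define a set of value labels, also called type_num, for those codes
label define type_num 1 "Blue Collar" 2 "Professional" 3 "White Collar" 4 "N/A"
```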

. lab values type_num type_num

. tab type_num

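    type_num |      Freq.     Percent        Cum.
-------------+-----------------------------------
 Blue Collar |         44       43.14       43.14
Professional |         31       30.39       73.53
White Collar |         23       22.55       96.08
         N/A |          4        3.92      100.00
-------------+-----------------------------------
       Total |        102      100.00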
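Stata also has a built-in command, “encode”, that turns a string variable into a labelled numeric variable in one step. The sketch below uses “type_code” as a made-up variable name. Unlike the manual method, the labels are the original strings (bc, prof, wc, NA), not descriptive names.

```stata
* Sketch only: type_code is a made-up variable name
* encode numbers the distinct strings in alphabetical order and
* attaches the strings themselves as value labels
encode type, gen(type_code)
tab type_code
* Show the underlying numbers instead of the labels
tab type_code, nolabel
```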

Alternative ways of graphing:

The graph below is actually 7 graphs, one on top of the other: 3 scatter plots for the different values of “type”, 3 plots with a fitted line for the different values of “type”, and 1 plot showing the fitted line for all values of “type”. I turned “legend” off (see end of command) because the legend was not informative for this graph. This graph is not as good as the one that uses “yhat”.

twoway (scatter prestige income if type=="bc", mcolor(red) msymbol(circle_hollow)) (lfit prestige income if type=="bc", lcolor(red)) (scatter prestige income if type=="prof", mcolor(midgreen) msymbol(triangle_hollow)) (lfit prestige income if type=="prof", lcolor(midgreen)) (scatter prestige income if type=="wc", mcolor(blue) msymbol(plus)) (lfit prestige income if type=="wc", lcolor(blue)) (lfit prestige income, lcolor(black)), legend(off)

[Graph of prestige against income]
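A command this long is easier to read and edit when it is kept in a do-file (a plain text file of commands), as suggested in the helpful tips. The sketch below is the same seven-layer command laid out one layer per line; it assumes the Prestige data are open.

```stata
* Sketch only: the same seven-layer graph, laid out for a do-file
* /// continues a command on the next line; it works in do-files,
* not when typed into the Command window
twoway (scatter prestige income if type=="bc", mcolor(red) msymbol(circle_hollow)) ///
       (lfit prestige income if type=="bc", lcolor(red)) ///
       (scatter prestige income if type=="prof", mcolor(midgreen) msymbol(triangle_hollow)) ///
       (lfit prestige income if type=="prof", lcolor(midgreen)) ///
       (scatter prestige income if type=="wc", mcolor(blue) msymbol(plus)) ///
       (lfit prestige income if type=="wc", lcolor(blue)) ///
       (lfit prestige income, lcolor(black)), legend(off)
```

With this layout, changing a colour or dropping one layer means editing a single line.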