

Basic commands in programs of Stata and R.
Typology: Study notes


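obj <- value (or obj = value): assign a number or character string to an object
vector <- c(number, "value"): create a vector
length(V): number of elements in V
min(V) / max(V) / range(V) / mean(V) / sum(V): basic statistics of V
setwd("location"): set the working directory
dataframe <- read.csv("file name"): load a CSV file into a data frame
summary(df): simple stats of each variable (min, quartiles, median, mean, max); length() gives N
head(df): first six observations
tail(df): last six observations
head(df$variable): first six observations of one variable; same pattern for mean(df$variable) and table(df$variable) (frequencies)
Save file:
write.csv(resume, file="resume1.csv")
save(resume, file="resume1.RData")
Logical operators: ! not, & and, | or, == equal, != not equal, >= greater than or equal to, <= less than or equal to
== tests equality (e.g. 5==6 is FALSE); != tests inequality (e.g. 5!=6 is TRUE)
newV <- V2>V1: logical vector that is TRUE where V2 is greater than V1
drop obs. V2>V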
Vector indexing with []:
df <- c(1,2,3,4)
df[3] = 3
df[-3] = 1 2 4 (drops the 3rd element)
df[c(1,3)] = 1 3
df[c(TRUE, FALSE, TRUE, FALSE)] = 1 3 (only the 1st and 3rd are TRUE)
school <- c("UCSD", "UCB", "UCLA", "UCR")
school=="UCSD"
[1] TRUE FALSE FALSE FALSE
Subset the graduationdates vector:
graduationdates[school=="UCSD"]
[1] 2010
Conditional vector: vec <- df$variable == "condition"
Frequency of conditional vector: sum(df$variable=="condition") = # of obs. meeting the condition
Mean of a variable where the logical variable2 is TRUE: mean(df$variable[df$variable2])
Subsetting data frames
students <- data.frame(school=c("UCSD", "UCB", "UCSD"), graduationdate=c(2010, 2019, 2015))
students
  school graduationdate
1 UCSD 2010
2 UCB 2019
3 UCSD 2015
Specify data frame[row, column]:
students[3,1] returns [1] "UCSD"
students[1,] returns row 1: UCSD 2010
students[,2] returns column 2: 2010 2019 2015
Extract rows 1 and 3 without row 2: students[c(1,3),] (or use students[-2,])
  school graduationdate
1 UCSD 2010
3 UCSD 2015
Extract rows where school=="UCSD" is TRUE:
students$school=="UCSD"
[1] TRUE FALSE TRUE (returns a logical vector, which is used as the row filter)
students[students$school=="UCSD",]
  school graduationdate
1 UCSD 2010
3 UCSD 2015
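A minimal sketch of logical subsetting, using the same toy students data frame as the notes. The names is_ucsd, n_ucsd, ucsd_rows and mean_grad_ucsd are made up for illustration.

```r
# Toy data frame matching the students example in the notes
students <- data.frame(
  school = c("UCSD", "UCB", "UCSD"),
  graduationdate = c(2010, 2019, 2015)
)

# Logical vector: TRUE where the condition holds
is_ucsd <- students$school == "UCSD"

# Count matching rows (TRUE counts as 1, FALSE as 0)
n_ucsd <- sum(is_ucsd)

# Keep matching rows and all columns (note the trailing comma)
ucsd_rows <- students[is_ucsd, ]

# Mean of one variable, conditional on another
mean_grad_ucsd <- mean(students$graduationdate[is_ucsd])

print(n_ucsd)          # 2
print(ucsd_rows)
print(mean_grad_ucsd)  # 2012.5
```

The same logical vector is reused for counting, for filtering rows, and for the conditional mean, so the condition is written only once.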
Base R (not tidyverse): subset(df, logical statement)
subset(students,students$school=="UCSD")
  school graduationdate
1 UCSD 2010
3 UCSD 2015
Create data frames for subsets:
resume_blacknames <- resume[resume$race=="black",]
resume_whitenames <- resume[resume$race=="white",]
Mean within a subset data frame: mean(resume_blacknames$variable)
[1] 0.
CONDITIONAL STATEMENTS
if (logical statement) {
  code to be executed if logical statement is TRUE
}
EG) door <- "locked"
if (door=="locked") {
  print("sorry, you need a key to enter")
}
[1] "sorry, you need a key to enter"
EG2) door <- "unlocked"
if (door=="locked") {
  print("sorry, you need a key to enter")
}
Not executed, because door=="locked" is not TRUE.
ELSE
EG) door <- "locked"
if (door=="locked") {
  print("Sorry, you need a key to enter")
} else {
  print("Please Come in!")
}
[1] "Sorry, you need a key to enter"
*the closing } and else must be on the same line
With door <- "unlocked" the else branch executes: [1] "Please Come in!"
Loops:
for (some set of things) {
  do some stuff
}
EG) i <- 2
print(2*i)
[1] 4
*the * is required: 2i on its own is a complex number in R, not 2 times i
EG2) for (i in c(3,10,99)) {
  print(2*i)
}
[1] 6
[1] 20
[1] 198
EG3) homework <- c("math", "reading", "writing")
for (i in homework) {
  cat("Do", i, "\n")
}
Do math
Do reading
Do writing
*cat(): concatenates & prints
*\n: continue on the next line
Same result by looping over index positions instead of the words:
for (i in 1:length(homework)) {
  cat("Do", homework[i], "\n")
}
Dimensions (rows, columns): dim(df)
Subset individuals (age>=25 & age<=34 & mother2==1):
mothers2534 <- subset(df, mother2==1 & age>=25 & age<=34)
Install and load tidyverse:
install.packages("tidyverse")
library(tidyverse)
Convert data frame to tibble: rr <- as_tibble(rr)
filter() retrieves the rows that meet a condition: filter(dataframe, some logical statement)
EG) mothers2534 <- filter(rr, age<=34 & age>=25 & mother2==1)
Or EG) filter(mothers2534, dataset %in% 2003:2008)
Pipe eg) mothers2534 <- rr %>% filter(age<=34 & age>=25 & mother2==1)
select() keeps only the wanted columns of the tibble:
Eg) mothers2534 <- select(mothers2534, dataset, mother2, age, childtot)
arrange() sorts data based on the values of a variable: mothers2534 <- arrange(mothers2534,age)
For descending order: mothers2534 <- arrange(mothers2534,desc(age))
head(mothers2534$age)
[1] 34 34 34 34 34 34
Pipe operator: f(x) = x %>% f()
For multiple functions: h(g(f(x))) = x %>% f() %>% g() %>% h()
First filter, then select the variables needed:
mothers2534 <- rr %>% filter(age<=34 & age>=25 & mother2==1) %>% select(dataset, mother2, age, childtot)
*for descending order add: %>% arrange(desc(age))
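The if/else and for-loop patterns above can be sketched together in base R. The function name door_message and the vector doubled are hypothetical names chosen for this illustration.

```r
# Hypothetical helper wrapping the door example in a function
door_message <- function(door) {
  if (door == "locked") {
    "Sorry, you need a key to enter"
  } else {                        # "} else {" kept on one line
    "Please come in!"
  }
}

# Loop over a set of values and collect the doubled results
doubled <- c()
for (i in c(3, 10, 99)) {
  doubled <- c(doubled, 2 * i)    # explicit * for multiplication
}

print(door_message("locked"))
print(door_message("unlocked"))
print(doubled)                    # 6 20 198
```

Storing the results in a vector instead of printing inside the loop makes them available for later steps.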
New variable: df$newvarname <- expression
eg) rr$childcollegeprep <- rr$childeduc + rr$childtravel
or: mutate(dataframe, newvarname = expression)
eg) rr <- mutate(rr, childcollegeprep=childtravel+childeduc)
Can create multiple new variables at the same time:
rr <- mutate(rr, childcollegeprep=childtravel+childeduc, childnotcollegeprep=childtot-childcollegeprep)
Create new variables and drop all prior variables with transmute():
collegeprepdat <- transmute(rr, childcollegeprep = childeduc + childtravel, childnotcollegeprep = childtot - childcollegeprep)
*Common mistake: writing childtot - collegeprep in the second argument fails, because no column named collegeprep exists:
Error in transmute(): ℹ In argument: childnotcollegeprep = childtot - collegeprep. Caused by error: ! object 'collegeprep' not found
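A small sketch of transmute() that avoids the "object not found" error by referring to the column created one argument earlier. It assumes the dplyr package is installed; the rr values here are invented toy data, not the course dataset.

```r
library(dplyr)

# Toy stand-in for rr (values invented for illustration)
rr <- tibble(
  childeduc   = c(1, 2, 0),
  childtravel = c(0.5, 1, 0),
  childtot    = c(4, 5, 3)
)

collegeprepdat <- transmute(
  rr,
  childcollegeprep = childeduc + childtravel,
  # use the name defined on the line above, not an undefined one
  childnotcollegeprep = childtot - childcollegeprep
)

print(names(collegeprepdat))  # only the two new columns remain
print(collegeprepdat)
```

Arguments of transmute() and mutate() are evaluated in order, so a later argument can use a column defined by an earlier one.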
summarize(): generate summary stats:
summarize(rr, meanchildtot = mean(childtot, na.rm=T))
  meanchildtot
1 4.
*rr = the dataset the summary stats are generated from
*meanchildtot = the name given to the summary stat
*na.rm=T ignores missing values
Can compute multiple summary stats at once:
summarize(rr, meanchildtot = mean(childtot, na.rm=T), medianchildtot=median(childtot, na.rm = T))
Mean of childtot taken for each value of dataset, stored under meanchildtot:
rr %>% group_by(dataset) %>% summarize(meanchildtot=mean(childtot,na.rm=T))
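A brief sketch of group_by() plus summarize(), showing what na.rm = TRUE changes. It assumes dplyr is installed; the rr values are toy data invented for this example, and by_year and with_na are hypothetical names.

```r
library(dplyr)

# Toy stand-in for rr with one missing value
rr <- tibble(
  dataset  = c(2003, 2003, 2004, 2004),
  childtot = c(2, 4, NA, 6)
)

# Group means, ignoring missing values
by_year <- rr %>%
  group_by(dataset) %>%
  summarize(meanchildtot = mean(childtot, na.rm = TRUE))

# Without na.rm, any group containing an NA returns NA
with_na <- rr %>%
  group_by(dataset) %>%
  summarize(meanchildtot = mean(childtot))

print(by_year)
print(with_na)
```

The result has one row per value of dataset, which is the same unit-of-observation change that the grouped summaries in the notes rely on.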
Group by unique combinations (A 2010, A 2015, B 2010, B 2015):
student.df %>% group_by(school,graduationdate) %>% summarize(mean.gpa=mean(gpa))
summarise() has grouped output by 'school'. You can override using the .groups argument.
  school graduationdate mean.gpa
1 A 2010 3.
2 A 2015 2.
3 B 2010 3.
4 B 2015 1.
Save output:
totchildbyyearcollege <- rr %>% group_by(dataset, college) %>% summarize(meanchildtot = mean(childtot, na.rm=T))
Combination:
collegeprep <- rr %>% filter(mother2==1, age>24, age<35) %>% mutate(collegeprep = childeduc + childtravel) %>% group_by(dataset, college) %>% summarize(meancollegeprep=mean(collegeprep, na.rm=T))
summarise() has grouped output by 'dataset'. You can override using the .groups argument.
*Filters the wanted mother data, creates a new variable, groups the data by dataset and college, and summarizes the mean
Create histogram:
hist(pm_bycity$meanpm10, xlab="Mean PM10", ylab="Frequency", main="Mean PM10 by City")
Create multiple plots:
par(mfrow=c(1,2)) (mfrow= sets the layout: single row, two columns)
#Smaller bins
hist(pm_bycity$meanpm10, xlab="Mean PM10", ylab="Frequency", main="Mean PM10 by City", breaks=20)
#Larger bins
hist(pm_bycity$meanpm10, xlab="Mean PM10", ylab="Frequency", main="Mean PM10 by City", breaks=4)
*New variable T = date relative to the automation date (T<0 before, T>0 after)
pm_bycitybefore <- pm %>% mutate(T=date-auto_date) %>% filter(T<0) %>% group_by(code_city) %>% summarize(meanpm10 = mean(pm10, na.rm=TRUE))
pm_bycityafter <- pm %>% mutate(T=date-auto_date) %>% filter(T > 0) %>% group_by(code_city) %>% summarize(meanpm10 = mean(pm10, na.rm=TRUE))
Comparing histograms:
par(mfrow=c(1,2))
#Plot histogram for before
hist(pm_bycitybefore$meanpm10, xlab="MeanPM10", ylab="Frequency", main="Before Automation")
#Plot histogram for after
hist(pm_bycityafter$meanpm10, xlab="MeanPM10", ylab="Frequency", main="After Automation")
Same scale for both histograms (set xlim and ylim):
#Create two panes for plots
par(mfrow=c(1,2))
#Plot histogram for before
hist(pm_bycitybefore$meanpm10, xlab="MeanPM10", ylab="Frequency", main="Before automation", xlim=c(0,250), ylim=c(0,50))
lines(c(meanbefore, meanbefore), c(-10, 100), lty=2, col="red")
#Plot histogram for after
hist(pm_bycityafter$meanpm10, xlab="MeanPM10", ylab="Frequency", main="After automation", xlim=c(0,250), ylim=c(0,50))
lines(c(meanbefore, meanbefore), c(-10, 100), lty=2, col="red")
Boxplot function: boxplot(pm_byday$meanpm10)
*lines that extend from the box are called whiskers; whiskers reach up to 1.5 x IQR from the box, where IQR = Q3 - Q1
Horizontal, blue box plot with labels:
boxplot(pm_byday$meanpm10, xlab="Mean PM10", main="Mean PM 10 by city", col="blue", border="darkblue", pch=16, horizontal=T)
Separate box plots (first add a new variable):
pm_byday$after <- pm_byday$T>
boxplot(pm_byday$meanpm10 ~ pm_byday$after, xlab="Mean PM10", ylab="After Automation", main="Mean PM 10 by Day Before and After Automation", col="blue", border="darkblue", pch=16, horizontal=T)
Scatter plot: plot(df$xvar, df$yvar)
*pch = plotting symbol (e.g. pch=16 is a filled circle)
*lines() adds a line; text() adds a label
EG) plot(pm_byday$T, pm_byday$meanpm10, xlab="Days Relative to Automation", ylab="Mean PM10", main="Automation and Mean PM10", pch=16, ylim=c(0,300))
lines(c(0,0), c(-10,250), col="red", lty=2)
text(0,260, "Automation", col="red")
Save plot as pdf:
pdf("Automation.pdf") opens the pdf file; run the plotting commands after it
dev.off() stops saving into the pdf
library(lubridate)
today(), now(): show the current date / date and time
*ymd("2012-01-22"): converts a string into a date object; mdy() and dmy() do the same for other date orders
*extract a portion, e.g. the year: year(ymd("2012-01-22")) #> [1] 2012
*add time with days()/months()/years(): ymd("1960-01-01")+days(18628) #> [1] "2011-01-01"
Plot mean amount of rain by month:
pm_bymonth <- pm %>% mutate(rdate = ymd("1960-01-01") + days(date), month = month(rdate)) %>% group_by(month) %>% summarize(meanrain = mean(rain, na.rm=TRUE))
plot(pm_bymonth$month, pm_bymonth$meanrain, col="blue", pch=16, xlab="Date", ylab="Mean Daily Rain (mm)")
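The day-count conversion above can also be checked without lubridate. This sketch swaps in base R's as.Date() as a stand-in for ymd() and days(); the names origin, d, yr and mon are made up for the example.

```r
# Base R stand-in for ymd("1960-01-01") + days(18628)
origin <- as.Date("1960-01-01")
d <- origin + 18628              # adding a number to a Date adds days

# Extract portions of the date (base R analogue of year() and month())
yr  <- as.integer(format(d, "%Y"))
mon <- as.integer(format(d, "%m"))

print(d)    # "2011-01-01"
print(yr)   # 2011
print(mon)  # 1

# Going the other way: number of days since the origin
print(as.numeric(d - origin))   # 18628
```

This is why a date variable stored as "days since 1960-01-01" can be turned into a calendar date by adding it to the origin.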
ggplot template: ggplot(data = <data>) + <geom function>(mapping = aes(<mappings>))
EG) ggplot(data = pm_byday) + geom_point(mapping = aes(x = T, y = meanpm10, color=meanrain)) + xlab("Time After Automation") + ylab("Mean PM10") + ggtitle("Time After Automation and PM10")
GGPLOT BOXPLOT
*for a single box plot, map only y = var
ggplot(data=pm_byday) + geom_boxplot(mapping=aes(x=after,y=meanpm10)) + xlab("After Automation?") + ylab("Pollution Levels") + ggtitle("Boxplot of Pollution Levels Before and After Automation")
ggplot(data = pm_byday) + geom_boxplot(mapping = aes(x = after, y = meanpm10)) + scale_x_discrete(labels=c("Before Automation","After Automation")) + xlab("")
*scale_x_discrete(labels=...) changes the x-axis labels
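Plot votes (points with text labels):
ggplot(df, mapping=aes(x=xvar,y=yvar)) + geom_point() + geom_text(vjust=1.5, size=3)
*geom_text() also needs a label aesthetic inside aes(), e.g. label=labelvar (labelvar is a placeholder name)
Linear Regression model
Linear Regression: lm.fit <- lm(yvar ~ xvar, data=df)
Plot regression line:
ggplot(df, mapping=aes(x=xvar, y=yvar)) + geom_point() + xlab("x axis label") + ylab("y axis label") + geom_smooth(method = "lm", se = FALSE, color="red", linetype="dashed")
Store fitted (predicted) values: EG) df$predXVar <- lm.fit$fitted.values
*Order by largest residuals: arrange(df, desc(abs(residuals))) (assumes df has a column named residuals, e.g. df$residuals <- lm.fit$residuals)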
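A minimal base R sketch of fitting a regression, storing fitted values and residuals, and ordering by the largest residuals. The data are invented toy values, and order() is used as a base R stand-in for arrange(); the column names pred and resid are made up for the example.

```r
# Toy data (values invented for illustration)
df <- data.frame(
  xvar = c(1, 2, 3, 4, 5),
  yvar = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

# Fit yvar = b0 + b1 * xvar
lm.fit <- lm(yvar ~ xvar, data = df)
print(coef(lm.fit))               # intercept first, then slope

# Store predictions and residuals next to the data
df$pred  <- fitted(lm.fit)
df$resid <- residuals(lm.fit)

# Largest absolute residuals first
df_sorted <- df[order(-abs(df$resid)), ]
print(df_sorted)
```

Sorting by absolute residual puts the observations the line fits worst at the top, regardless of whether they sit above or below it.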
myvec <- c(3,5,7)
mymean(myvec) = 5 (mymean is a user-written function, not part of base R)
Mean after experiment for treated and control:
mean(uct$asset_after[uct$treat == 1])
mean(uct$asset_after[uct$treat == 0])
Treated, by tidyverse: uct %>% filter(treat==1) %>% summarize(meanassetafter = mean(asset_after))
Function for mean asset after experiment:
treat.mean <- function(treatvalue){
  return(mean(uct$asset_after[uct$treat==treatvalue]))
}
treat.mean(0)
treat.mean(1)
Tidyverse version:
treat.mean <- function(treatvalue){
  output <- uct %>% filter(treat==treatvalue) %>% summarize(meanasset = mean(asset_after))
  return(output)
}
treat.mean(0)
*Compute averages of treated vs. control before/after the experiment:
variable <- "asset"
aftervar <- paste(variable,"_after",sep="")
beforevar <- paste(variable,"_before",sep="")
*sep="" tells R not to include a space between "asset" and "_after"
pull() grabs a column from a data frame: pull(uct_treat, aftervar)
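The function and paste() ideas above can be combined into one helper. This is a sketch under stated assumptions: group.mean is a hypothetical function name, the uct values are invented toy data, and base R's [[ ]] column lookup is used in place of pull().

```r
# Toy stand-in for uct (values invented for illustration)
uct <- data.frame(
  treat        = c(1, 1, 0, 0),
  asset_before = c(10, 20, 12, 18),
  asset_after  = c(30, 40, 14, 16)
)

# Hypothetical helper: mean of <variable>_<when> for one treatment group
group.mean <- function(data, variable, when, treatvalue) {
  colname <- paste(variable, "_", when, sep = "")  # e.g. "asset_after"
  mean(data[[colname]][data$treat == treatvalue])
}

print(group.mean(uct, "asset", "after", 1))   # 35
print(group.mean(uct, "asset", "after", 0))   # 15
print(group.mean(uct, "asset", "before", 1))  # 15
```

Building the column name from pieces means the same function covers every before/after and treated/control combination.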
Stata commands are shown in [brackets].
[display] prints the value of an expression
[browse] shows the data as a spreadsheet
[summarize newvar] retrieves summary stats (mean, std. dev., min, max)
[gen newvar = (condition)] the expression can use arithmetic (add, subtract, multiply, divide) and if statements
[gen mobility_rate = parq1kqcond]
[gen CA = (state == "CA")]
[label var varname "Given label"]
[sum newvar if condition >= requirement] *can be ==, >=, <=
[sum mobility if name == "UCSD"]
[histogram varname, frac] y-axis = percent/fraction, x-axis = intervals; intervals [a,b) include a and exclude b
[use dataA.dta, clear] then [append using dataB.dta]: append extends the master dataset by adding the observations of another dataset with the same variables
[use dataA.dta, clear] then [merge (type) var using dataB.dta] (type = 1:1, m:1, 1:m)
1:1: both datasets have unique keys
m:1: 1st dataset has duplicate keys, 2nd has unique keys (vice versa for 1:m)
[keep if _merge == 3] keeps only observations matched in both datasets
[merge m:1 var using dataB.dta, keep(3) nogen] then [save new_file.dta, replace]: merge, keep only matched observations, and save in the current directory
[di clock("18:30:00","hms")] clock() converts a time string into a Stata datetime value
[drop if var
[tab var] table of frequencies by variable
[tab var if var2 == (1 or 0)] table of frequencies by var, restricted to observations where var2 is true (1) or false (0)
[graph bar (stats) yvar, over(xvar)] bar graph; (stats) can be replaced with mean, count, sum, etc.
[title("Name of graph")] creates a title (same for xtitle and ytitle)
[collapse (stats) varlist1, by(varlist2)] *varlist1 and varlist2 can contain more than one variable *e.g. changes the unit of observation from individual to state
Find the fraction of observations in each group:
1. [gen obs_count = 1] variable to store the count
2. [collapse (count) obs_count, by(var)]
3. [bysort var: egen total_count = total(obs_count)] creates a new variable: the sum of obs_count within each value of var
[gen fraction_var = obs_count/total_count]
[graph bar fraction_var, over(var1) over(var2)]
β0 = intercept & β1 = slope coefficient
[reg yvar xvar] _cons = intercept; slope coefficient = the number next to the variable name
[twoway scatter yvar xvar ///
|| lfit yvar xvar, lw(0.4) lc(red) ///
(titles)] creates a scatter plot with a linear regression line
Conditional: [reg yvar xvar if (logical statement)] logical statement e.g. if var ==
Jitter: adds small amounts of noise so points don't overlap
2. [collapse (count) constant, by(airlines)] Value of constant for any row with "Alaska" = 2
3. Imagine we have loaded in the dataset above and want to merge to another dataset where the unit-of-observation is an airline, and the dataset contains a variable that captures the total number of employees of that airline. What type of merge would this be? (m:1)
4. If we wanted to understand the statistical relationship between par_q1 and kq5_cond_parq1, what figure would be most appropriate? (scatter)
5. What code would successfully retrieve the average of par_q1, conditional on the institution having a value of count that is greater than or equal to 3000? [sum par_q1 if count >= 3000]
6. The table above shows the UC Irvine graduates from low-income backgrounds are more likely to reach high incomes than low-income students from UC Santa Barbara. Can we conclude this is a causal effect of UC Irvine? (not enough info)
8. In week 4, we studied Mindspark, a technology to teach-at-the-right-level. Instead of an experiment, imagine we had observational data on Mindspark use across students in a different city in India. We find that these students who attend Mindspark have higher test scores than students that do not. What can we conclude? (correlates with higher scores, same conclusion as original)
9. In Week 3, we studied discrimination in traffic stops using the veil of darkness test. What was the unit-of-observation in the Stanford Open Policing Project? (Stop) - exhibits long right-tail
10. We find that the intercept in this plot is equal to 3. Then we find that someone made a mistake and added 1 to everyone's X-value. If we re-estimate the regression with the new variable, how will the intercept change? (It will be larger in magnitude)
What code would correctly generate a binary indicator variable that is equal to one if the number of sales is greater than or equal to 100 million units and zero otherwise?
[gen greater_than_100 = (sales >= 100)]
Compute the number of individuals in the dataset that are male vs. female:
table(resume$sex)
or sum(resume$sex=="male") and sum(resume$sex=="female")
Add a variable to resume that is equal to 1 if the individual is male and zero otherwise. Name this new variable male.
Tidyverse version: resume <- resume %>% mutate(male=as.numeric(sex=="male"))
Base R version: resume$male <- as.numeric(resume$sex=="male")
*without as.numeric() the new variable is TRUE/FALSE rather than 1/0
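A short base R sketch of the indicator-variable pattern, using an invented toy version of resume (the real dataset is not reproduced here). The name share_male is made up for the example.

```r
# Toy stand-in for resume (values invented for illustration)
resume <- data.frame(
  sex = c("male", "female", "female", "male", "female")
)

# 1 if male, 0 otherwise
resume$male <- as.numeric(resume$sex == "male")

# Counts by category
print(table(resume$sex))

# The mean of a 0/1 indicator is the share of observations equal to 1
share_male <- mean(resume$male)
print(share_male)   # 0.4
```

Because the indicator is coded 0/1, its mean is a proportion, which is also why a regression slope on such a variable reads as a difference in percentage points.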
If you estimate the regression above in the Bertrand and Mullainathan (2004) data, you will find 𝛽 = -0.01. Write a clear sentence interpreting this slope coefficient.
This implies that male applicants are 1 percentage point less likely to be called for an interview relative to female applicants.
Next create a dataframe that restricts only to Black applicants.
sub <- resume %>% filter(race == "black")
Write code below that computes the average of sat_avg_2013 only for colleges in CA.
mean(colleges$sat_avg_2013[colleges$state=="CA"])
You want to understand the distribution of sticker prices separately for public and private schools. Write a few sentences describing a figure that would accomplish this goal. In other words, describe what type of figure you would create, and what appears on the vertical and horizontal axis. In the question below, you will be asked to write code to construct this figure.
BOXPLOT: y = sticker price, x = private vs. public; the graph shows the distribution of prices across percentiles for each group
colleges %>% ggplot(aes(x=factor(public), y=price)) + geom_boxplot()
*factor() makes ggplot draw one box per group when public is stored as 0/1
Write code below that (1) restricts the data to private universities, and then (2) sorts the data from highest sticker price to the lowest sticker price. You should accomplish this using functions from tidyverse.
colleges %>% filter(public==0) %>% arrange(desc(price))
Data frame of state-level average price + histogram showing its distribution:
state_level <- colleges %>% group_by(state) %>% summarize(meanprice=mean(price))
state_level %>% ggplot(aes(x=meanprice)) + geom_histogram()
Base R alternative: hist(state_level$meanprice)