Cheat Sheet for Data Analytics, Study notes of Data Analysis & Statistical Methods

Basic commands in programs of Stata and R.

Typology: Study notes

2023/2024

Uploaded on 04/24/2025

kenma-kozume-30

obj <- value (or obj = value): store a number or character string
vec <- c(number, "value"): create a vector
length(v): number of elements
min/max/range/mean/sum(v)
setwd("location"): set the working directory
df <- read.csv("file name")
summary(df): simple stats of each variable
#(min, quartiles, median, mean, max); N from length(df$variable)
head(df): first six observations
tail(df): last six observations
head(df$variable): first 6 obs. for variable
Same for mean(df$variable) / table(df$variable) – table gives frequencies
Save file:
write.csv(resume, file="resume1.csv")
save(resume, file="resume1.RData")
! not, & and, | or, != not equal, >= greater than or equal to, <= less than or equal to
== tests equality (e.g. 5==6 is FALSE)
!= tests inequality (e.g. 5!=6 is TRUE)
newV <- V2>V1: logical vector, TRUE where V2>V1 (use it to keep or drop obs.)
df <- c(1,2,3,4)
df[3] = 3 / df[-3] = 1 2 4
[] indexing:
df[c(1,3)] = 1 3
df[c(TRUE, FALSE, TRUE, FALSE)]
= 1, 3 (only 1st & 3rd is true)
school <- c("UCSD", "UCB", "UCLA", "UCR")
school=="UCSD"
[1] TRUE FALSE FALSE FALSE
Subset graduationdate vector
graduationdates[school=="UCSD"]
[1] 2010
Conditional vector
vec <- df$variable == "condition"
Frequency of conditional vector
sum(df$variable == "condition")
= # of obs. that meet the condition
Mean of V1 with the condition of V2
mean(df$variable[df$variable2 == "condition"])
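A minimal sketch of the conditional-vector steps above. The data frame applicants and its columns race and call are made up for illustration:

```r
# Hypothetical toy data (made up for illustration)
applicants <- data.frame(
  race = c("black", "white", "black", "white", "black"),
  call = c(0, 1, 1, 0, 0)
)

# Conditional (logical) vector: TRUE where the condition holds
is_black <- applicants$race == "black"

# TRUE counts as 1, so sum() gives the number of matching observations
n_black <- sum(is_black)

# Mean of one variable restricted to rows where the condition is TRUE
mean_call_black <- mean(applicants$call[is_black])

print(n_black)
print(mean_call_black)
```

Storing the logical vector once and reusing it avoids retyping the condition for every statistic.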
Subsetting data frames
students <- data.frame(school=c("UCSD",
"UCB", "UCSD"),
graduationdate=c(2010, 2019, 2015))
students
school graduationdate
1 UCSD 2010
2 UCB 2019
3 UCSD 2015
Specify data frame, row, column:
students[3,1] / students[1,] / students[,2]
[1] "UCSD" / 1 UCSD 2010 / [1] 2010 2019 2015
Extract row 1&3 w/o 2
students[c(1,3),] (or use students[-2,])
school graduationdate
1 UCSD 2010
3 UCSD 2015
Extract where =="UCSD" is true
students$school=="UCSD"
[1] TRUE FALSE TRUE
^ returns a logical vector; put it in the row position to filter:
students[students$school=="UCSD",]
school graduationdate
1 UCSD 2010
3 UCSD 2015
Base R (not tidyverse): subset(df, logical statement)
subset(students, school=="UCSD")
school graduationdate
1 UCSD 2010
3 UCSD 2015
Create dataframe for subset
resume_blacknames <-resume[resume$race=="black",]
resume_whitenames <-
resume[resume$race=="white",]
mean of subset df
mean(resume_blacknames$variable)
[1] 0.06447639
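The row-selection variants above can be compared side by side. This sketch reuses the students example from the notes; the result names are illustrative:

```r
students <- data.frame(
  school = c("UCSD", "UCB", "UCSD"),
  graduationdate = c(2010, 2019, 2015)
)

cell         <- students[3, 1]                          # row 3, column 1
by_position  <- students[c(1, 3), ]                     # rows 1 and 3
by_exclusion <- students[-2, ]                          # everything except row 2
by_condition <- students[students$school == "UCSD", ]   # logical row filter
by_subset    <- subset(students, school == "UCSD")      # same filter via subset()

# Compare the four row selections
print(all.equal(by_position, by_exclusion))
print(all.equal(by_position, by_condition))
print(all.equal(by_position, by_subset))
```

The logical filter and subset() keep working when rows are added or reordered, while position-based indexing does not.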
CONDITIONAL STATEMENTS
if (logical statement) {
code to be executed if logical statement is TRUE
}
EG) door <- "locked"
if (door=="locked") {
print("sorry, you need a key to enter")
}
[1] "sorry, you need a key to enter"
EG2) door <- "unlocked"
if (door=="locked") {
print("sorry, you need a key to enter")
} # nothing printed: door=="locked" is FALSE
ELSE EG) door <- "locked"
if (door=="locked"){
print("Sorry, you need a key to enter")
} else { # the closing "}" and "else" must be on the same line
print("Please Come in!")
}
[1] "Sorry, you need a key to enter"
With door <- "unlocked" the else branch runs instead:
[1] "Please Come in!"
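A small sketch that wraps the if/else example above so both branches can be tried. The function name door_message is a hypothetical helper, not from the notes:

```r
# Hypothetical helper: returns the message for a given door state
door_message <- function(door) {
  if (door == "locked") {
    "Sorry, you need a key to enter"
  } else {            # "}" and "else" stay on the same line
    "Please Come in!"
  }
}

print(door_message("locked"))
print(door_message("unlocked"))
```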
Loops:
for (some set of things) {
do some stuff
}
EG) i <- 2
print(2*i)
[1] 4
EG2) for (i in c(3,10,99)){
print(2*i)
}
[1] 6 [1] 20 [1] 198
EG3) homework <- c("math", "reading", "writing")
for (i in homework) {
cat("Do", i, "\n")
}
Do math
Do reading
Do writing
*cat(): concatenates & prints
*\n: display in next line
Same result by looping over the index positions of homework:
for (i in 1:length(homework)) {
cat("Do", homework[i], "\n")
}
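A short sketch of two common loop patterns that build on the examples above: storing results by index and keeping a running total. The names todo and total are illustrative:

```r
homework <- c("math", "reading", "writing")

# Store one result per element; seq_along() gives 1, 2, 3
todo <- character(length(homework))
for (i in seq_along(homework)) {
  todo[i] <- paste("Do", homework[i])
}
print(todo)

# Running total with a for loop
total <- 0
for (x in c(3, 10, 99)) {
  total <- total + 2 * x
}
print(total)
```

seq_along(homework) is used instead of 1:length(homework) because 1:0 counts backwards when the vector is empty, so the loop body would still run.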
Dimensions (rows, columns): dim(df)
Subset individuals (age>=25 & age<=34 & mother2==1):
mothers2534 <- subset(df, mother2==1 & age>=25 &
age<=34)
install tidyverse
install.packages("tidyverse")
library(tidyverse)
convert data frame to tibble:
rr <- as_tibble(rr)
retrieves rows of data that meet certain condition:
filter(dataframe, some logical statement)
EG) mothers2534 <- filter(rr, age<=34 & age>=25 &
mother2==1)
Or EG) filter(mothers2534, dataset%in%2003:2008)
Pipe eg) mothers2534 <- rr %>% filter(age<=34 &
age>=25 & mother2==1)
select(): keep only the columns wanted:
Eg) mothers2534 <- select(mothers2534, dataset,
mother2, age, childtot)
Sort data based on values of variable:
mothers2534 <- arrange(mothers2534,age)
want descending order:
mothers2534 <- arrange(mothers2534,desc(age))
head(mothers2534$age)
[1] 34 34 34 34 34 34
Pipe operator:
f(x) is the same as x %>% f()
For multiple functions, h(g(f(x))) becomes:
x %>%
f() %>%
g() %>%
h()
First filter rows, then select the variables needed:
mothers2534 <- rr %>%
filter(age<=34 & age>=25 & mother2==1) %>%
select(dataset, mother2, age, childtot)
*if want descending order add:
%>%
arrange(desc(age))
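A runnable sketch of the filter, select, arrange pipeline above. It assumes the dplyr package is installed; the rows of rr are made up and only mimic the column names used in the notes:

```r
library(dplyr)

# Made-up rows that mimic the columns used above (values are illustrative only)
rr <- data.frame(
  dataset  = c(2003, 2004, 2005, 2006, 2007),
  mother2  = c(1, 1, 0, 1, 1),
  age      = c(24, 30, 28, 34, 26),
  childtot = c(3.5, 5.0, 4.0, 6.5, 2.0),
  other    = c("a", "b", "c", "d", "e")
)

mothers2534 <- rr %>%
  filter(age >= 25 & age <= 34 & mother2 == 1) %>%  # keep matching rows
  select(dataset, mother2, age, childtot) %>%       # keep wanted columns
  arrange(desc(age))                                # sort, oldest first

print(mothers2534)
```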
new variable
df$newvarname <- expression
eg) rr$childcollegeprep <- rr$childeduc + rr$childtravel
or: mutate(dataframe, newvarname = expression)
eg) rr <- mutate(rr,
childcollegeprep=childtravel+childeduc)
can create multiple new vars at the same time:
rr <- mutate(rr, childcollegeprep=childtravel+childeduc,
childnotcollegeprep=childtot-childcollegeprep)
create new var, drop all prior var:
collegeprepdat <- transmute(rr,
childcollegeprep = childeduc + childtravel,
childnotcollegeprep = childtot - childcollegeprep)
*a name that does not exist (collegeprep instead of childcollegeprep) fails:
Error in `transmute()`:
In argument: `childnotcollegeprep = childtot -
collegeprep`.
Caused by error:
! object 'collegeprep' not found
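A small sketch contrasting mutate() and transmute(), assuming dplyr is installed. The values in rr are made up; the column names follow the notes:

```r
library(dplyr)

# Made-up values; column names follow the notes above
rr <- data.frame(
  childeduc   = c(1.0, 0.5, 2.0),
  childtravel = c(0.5, 0.5, 1.0),
  childtot    = c(4.0, 3.0, 6.0)
)

# mutate() keeps the existing columns and adds the new ones;
# a column created earlier in the call can be used later in the same call
added <- mutate(rr,
  childcollegeprep    = childeduc + childtravel,
  childnotcollegeprep = childtot - childcollegeprep
)

# transmute() keeps only the columns created in the call
only_new <- transmute(rr,
  childcollegeprep    = childeduc + childtravel,
  childnotcollegeprep = childtot - childcollegeprep
)

print(names(added))
print(names(only_new))
```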
summarize() – generate summary stats:
summarize(rr, meanchildtot = mean(childtot, na.rm=T))
# A tibble: 1 × 1
meanchildtot
<dbl>
1 4.69
*1st argument (rr): dataset the summary stats are computed from
*2nd argument (meanchildtot = ...): name given to the summary stat
*na.rm=T ignores missing values
Can compute multiple var at once:
summarize(rr, meanchildtot = mean(childtot, na.rm=T),
medianchildtot=median(childtot, na.rm = T))
mean of childtot taken for each value of dataset, stored
under meanchildtot:
rr %>%
group_by(dataset) %>%
summarize(meanchildtot=mean(childtot,na.rm=T))
group by unique combinations (A 2010, A 2015, B 2010,
B 2015):
student.df %>%
group_by(school,graduationdate) %>%
summarize(mean.gpa=mean(gpa))
`summarise()` has grouped output by 'school'. You can
override using the `.groups` argument.
# A tibble: 4 × 3
# Groups: school [2]
school graduationdate mean.gpa
<chr> <dbl> <dbl>
1 A 2010 3.45
2 A 2015 2.9
3 B 2010 3.6
4 B 2015 1.8
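A sketch of grouping by two variables, assuming dplyr is installed. The GPA records are made up, and .groups = "drop" is one way to silence the grouping message shown above:

```r
library(dplyr)

# Made-up GPA records (illustrative values only)
student.df <- data.frame(
  school         = c("A", "A", "A", "B", "B"),
  graduationdate = c(2010, 2010, 2015, 2010, 2015),
  gpa            = c(3.0, 4.0, 2.5, 3.5, 2.0)
)

mean_gpa <- student.df %>%
  group_by(school, graduationdate) %>%
  summarize(mean.gpa = mean(gpa), .groups = "drop")  # "drop" returns an ungrouped result

print(mean_gpa)
```

Without .groups, the result stays grouped by school, which affects any mutate() or summarize() applied to it afterwards.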
Save output:
totchildbyyearcollege <- rr %>%
group_by(dataset, college) %>%
summarize(meanchildtot = mean(childtot, na.rm=T))
combination:
collegeprep <- rr %>%
filter(mother2==1, age>24, age<35) %>%
mutate(collegeprep = childeduc + childtravel) %>%
group_by(dataset, college) %>%
summarize(meancollegeprep=mean(collegeprep,
na.rm=T))
`summarise()` has grouped output by 'dataset'. You can
override using the `.groups` argument.
Filter the wanted mother data, create new var, group data by
dataset and college & summarize the mean
Create histogram:
hist(pm_bycity$meanpm10,
xlab="Mean PM10",
ylab="Frequency",
main="Mean PM10 by City")
create multiple plots:
par(mfrow=c(1,2))
SINGLE ROW TWO COLUMNS (indicate with mfrow=)
#Smaller bins
hist(pm_bycity$meanpm10, xlab="Mean PM10",
ylab="Frequency", main="Mean PM10 by City",
breaks=20)
#Larger bins
hist(pm_bycity$meanpm10, xlab="Mean PM10",
ylab="Frequency", main="Mean PM10 by City",
breaks=4)
new var T = date - auto_date: negative before auto_date, positive after:
pm_bycitybefore <- pm %>%
mutate(T=date-auto_date) %>%
filter(T<0) %>%
group_by(code_city) %>%
summarize(meanpm10 = mean(pm10, na.rm=TRUE))
pm_bycityafter <- pm %>%
mutate(T=date-auto_date) %>%
filter(T > 0) %>%
group_by(code_city) %>%
summarize(meanpm10 = mean(pm10, na.rm=TRUE))
comparing histograms:
par(mfrow=c(1,2))
#Plot histogram for before
hist(pm_bycitybefore$meanpm10, xlab="MeanPM10",
ylab="Frequency",
main="Before Automation")
#Plot histogram for after
hist(pm_bycityafter$meanpm10, xlab="MeanPM10",
ylab="Frequency",
main="After Automation")
same scale for both histograms:
#Create two panes for plots
par(mfrow=c(1,2))
#Plot histogram for before
hist(pm_bycitybefore$meanpm10, xlab="MeanPM10",
ylab="Frequency",
main="Before automation",
xlim=c(0,250), ylim=c(0,50))
lines(c(meanbefore, meanbefore), c(-10, 100),
lty=2, col="red")
#Plot histogram for after
hist(pm_bycityafter$meanpm10, xlab="MeanPM10",
ylab="Frequency",
main="After automation",
xlim=c(0,250), ylim=c(0,50))
lines(c(meanbefore, meanbefore), c(-10, 100),
lty=2, col="red")
boxplot function:
boxplot(pm_byday$meanpm10)
lines that extend from the box are referred to as
whiskers
whiskers reach the most extreme points within 1.5 x IQR of the box; points beyond are drawn separately
IQR = Q3-Q1
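A worked sketch of the 1.5 x IQR rule on a made-up sample. It uses quantile() for Q1 and Q3; boxplot() itself uses Tukey's hinges, which can differ slightly on small samples:

```r
# Made-up sample with one obvious outlier
x <- c(10, 12, 13, 15, 16, 18, 20, 60)

q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1                      # same value as IQR(x)

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences;
# anything beyond the fences is drawn as an individual point
whisker_low  <- min(x[x >= lower_fence])
whisker_high <- max(x[x <= upper_fence])
outliers     <- x[x < lower_fence | x > upper_fence]

print(c(whisker_low, whisker_high))
print(outliers)
```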
*Horizontal, blue, with label box plot:
boxplot(pm_byday$meanpm10, xlab="Mean PM10",
main="Mean PM 10 by city",
col="blue", border="darkblue",
pch=16, horizontal=T)
separate box plots:
(1st add new var)
pm_byday$after <- pm_byday$T>0
boxplot(pm_byday$meanpm10 ~ pm_byday$after,
xlab="Mean PM10", ylab="After Automation",
main="Mean PM 10 by Day Before and After Automation",
col="blue", border="darkblue",
pch=16, horizontal=T)
scatter plot:
plot(df$xvar, df$yvar)
*pch = plotting symbol (e.g. pch=16 is a filled circle); cex = symbol size
*lines() add line
EG) plot(pm_byday$T, pm_byday$meanpm10,
xlab="Days Relative to Automation",
ylab="Mean PM10",
main="Automation and Mean PM10",
pch=16, ylim=c(0,300))
lines(c(0,0), c(-10,250), col="red", lty=2)
text(0,260, "Automation", col="red")
save plot as pdf:
pdf("Automation.pdf"): open the pdf file; plot commands that follow are written to it
dev.off(): stop saving into pdf (closes the file)
library(lubridate)
today(), now(): show current date / current date-time
*ymd("2012-01-22") converts a string into a date obj.;
mdy()/dmy() handle other orderings
*can add days()/months()/years()
*extract a portion, e.g. year
year(ymd("2012-01-22"))
#> [1] 2012
*add
ymd("1960-01-01")+days(18628)
#> [1] "2011-01-01"
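The same date arithmetic can be cross-checked in base R without lubridate. The 1960-01-01 origin and the 18628 offset come from the notes above; the variable names are illustrative:

```r
origin <- as.Date("1960-01-01")

# Adding a number to a Date adds that many days
d <- origin + 18628
print(d)

# Extract the year and month as numbers
yr <- as.numeric(format(d, "%Y"))
mo <- as.numeric(format(d, "%m"))
print(c(yr, mo))

# Going the other way: days elapsed since the origin
days_since <- as.numeric(as.Date("2011-01-01") - origin)
print(days_since)
```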
Plot mean amount of rain by month:
pm_bymonth <- pm %>%
mutate(rdate = ymd("1960-01-01") + days(date),
month = month(rdate)) %>%
group_by(month) %>%
summarize(meanrain = mean(rain, na.rm=TRUE))
plot(pm_bymonth$month, pm_bymonth$meanrain,
col="blue",
pch=16,
xlab="Date",
ylab="Mean Daily Rain (mm)")

GGPLOT – SCATTER PLOT

Template: ggplot(data = DATA) + GEOM_FUNCTION(mapping = aes(MAPPINGS))
EG) ggplot(data = pm_byday) +
geom_point(mapping = aes(x = T, y = meanpm10, color = meanrain)) +
xlab("Time After Automation") +
ylab("Mean PM10") +
ggtitle("Time After Automation and PM10")

GGPLOT BOXPLOT

*for one plot (no groups), map just y = var
ggplot(data = pm_byday) +
geom_boxplot(mapping = aes(x = after, y = meanpm10)) +
xlab("After Automation?") +
ylab("Pollution Levels") +
ggtitle("Boxplot of Pollution Levels Before and After Automation")
*changing x-axis labels:
ggplot(data = pm_byday) +
geom_boxplot(mapping = aes(x = after, y = meanpm10)) +
scale_x_discrete(labels = c("Before Automation", "After Automation")) +
xlab("")

Plot votes (geom_text() needs a label aesthetic; labelvar is a placeholder):
ggplot(df, mapping = aes(x = xvar, y = yvar, label = labelvar)) +
geom_point() +
geom_text(vjust = 1.5, size = 3)

Linear Regression model

Linear regression:
lm.fit <- lm(yvar ~ xvar, data=df)
Plot regression line:
ggplot(df, mapping=aes(x=xvar, y=yvar)) +
geom_point() +
xlab("x axis label") +
ylab("y axis label") +
geom_smooth(method = "lm", se = FALSE, color="red", linetype="dashed")

Use names() to find saved outputs:
names(lm.fit)
EG) df$predXVar <- lm.fit$fitted.values
df$residuals <- lm.fit$residuals
*Order by largest residuals:
arrange(df, desc(abs(residuals)))
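A minimal base-R sketch of the regression steps above, on made-up data. The column names pred and residuals are illustrative, and order() stands in for arrange() so that no package is needed:

```r
# Made-up data (illustrative only)
df <- data.frame(xvar = c(1, 2, 3, 4, 5),
                 yvar = c(2, 4, 5, 4, 5))

lm.fit <- lm(yvar ~ xvar, data = df)

print(names(lm.fit))   # components stored in the fitted model
print(coef(lm.fit))    # intercept and slope

# Attach predictions and residuals, then sort by absolute residual
df$pred      <- lm.fit$fitted.values
df$residuals <- lm.fit$residuals
df_sorted <- df[order(abs(df$residuals), decreasing = TRUE), ]
print(df_sorted)
```

The first row of df_sorted is the observation the fitted line predicts worst.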

BY PIPE:
df %>%
mutate(predXVar = lm.fit$fitted.values,
residuals = lm.fit$residuals) %>%
select(var, predXVar, XVar, residuals) %>%
arrange(desc(abs(residuals)))

FUNCTION
functionname <- function(input){
output <- code that performs operation
return(output)
}
EG) take mean of vector
mymean <- function(num){
output <- sum(num)/length(num)
return(output)
}
myvec <- c(3,5,7)
mymean(myvec) = 5
Mean after exp. for treated and control:
mean(uct$asset_after[uct$treat == 1])
mean(uct$asset_after[uct$treat == 0])
treated – by tidyverse:
uct %>%
filter(treat==1) %>%
summarize(meanassetafter = mean(asset_after))
mean asset after exp., as a function:
treat.mean <- function(treatvalue){
return(mean(uct$asset_after[uct$treat==treatvalue]))
}
treat.mean(0)
treat.mean(1)
tidyverse version:
treat.mean <- function(treatvalue){
output <- uct %>%
filter(treat==treatvalue) %>%
summarize(meanasset = mean(asset_after))
return(output)
}
treat.mean(0)
*compute averages of treated vs control before/after experiment
variable <- "asset"
aftervar <- paste(variable, "_after", sep="")
beforevar <- paste(variable, "_before", sep="")
*sep="" tells R not to include a space between "asset" and "_after"
pull() function grabs a column from df as a vector:
pull(uct_treat, aftervar)
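A sketch that combines the function and paste() ideas above into one helper. The data in uct are made up, and group.mean is a hypothetical name, not from the notes:

```r
# Made-up experiment data; column names follow the pattern used in the notes
uct <- data.frame(
  treat        = c(1, 1, 0, 0),
  asset_before = c(10, 20, 10, 20),
  asset_after  = c(30, 40, 12, 18)
)

# Hypothetical helper: mean of <variable>_<period> for one treatment value
group.mean <- function(data, variable, period, treatvalue) {
  colname <- paste(variable, "_", period, sep = "")  # e.g. "asset_after"
  mean(data[[colname]][data$treat == treatvalue])    # [[ ]] selects a column by its name string
}

treated_after <- group.mean(uct, "asset", "after", 1)
control_after <- group.mean(uct, "asset", "after", 0)
print(c(treated_after, control_after))
```

Passing the data frame as an argument, rather than reading uct from the global environment, lets the same helper be reused on other datasets.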

STATA COMMANDS

[display] prints a string or the value of an expression
[browse] shows the data as a spreadsheet
[summarize var] retrieves summary stats (mean, std. dev., min, max)
[gen newvar = (expression)] expression can add, subtract, multiply, divide & use logical conditions
[gen mobility_rate = parq1kqcond]
[gen CA = (state == "CA")]
[label var varname "Given label"]
[sum var if condition] *condition can use ==, >=, <=
[sum mobility if name == "UCSD"]
[histogram varname, frac] y-axis = fraction (or percent), x-axis = intervals
intervals are [a,b): include a, exclude b
[use dataA.dta, clear
append using dataB.dta] append: extends the master dataset by adding the observations of another dataset with the same variables
[use dataA.dta, clear
merge (type) var using dataB.dta] (type = 1:1, m:1, 1:m)
1:1 – both datasets have unique keys
m:1 – 1st dataset has duplicate keys, 2nd unique (vice versa for 1:m)
[keep if _merge == 3] keep only obs. matched in both datasets
[merge m:1 var using dataB.dta, keep(3) nogen
save new_file.dta, replace] merge, keep only obs. matched in both, save in the working directory
clock: [di clock("18:30:00","hms")] converts a time string to a Stata datetime value
[drop if var
[tab var] table of frequencies by variable
[tab var if var2 == (1 or 0)] table by var, restricted to obs. where var2 is true or false
[graph bar (stat) yvar, over(xvar)] (stat) can be mean, count, sum, etc.
[title("Name of graph")] create title (same for xtitle and ytitle)
[collapse (stat) varlist1, by(varlist2)]
*varlist 1 & 2 can each hold more than one variable
*e.g. change unit of observation from individual to state

Fraction of observations by group:
1. [gen obs_count = 1] variable to store count
2. [collapse (count) obs_count, by(var)]
3. find fraction of number of obs.:
[bysort var: egen total_count = total(obs_count)] creates new var: sum total of obs_count within var
[gen fraction_var = obs_count/total_count]
[graph bar fraction_var, over(var1) over(var2)]

β0 = intercept & β1 = slope coefficient
[reg yvar xvar] _cons = intercept; slope coeff. is in the row labeled with the variable name
[twoway scatter yvar xvar ///
|| lfit yvar xvar, lw(0.4) lc(red) ///
(titles)] create scatter plot with linear regression line
Conditional: [reg yvar xvar if (logical statement)] logical statement e.g. if var ==
jitter() adds small amounts of noise so points don't overlap

2. collapse (count) constant, by(airlines): value of constant for any row with "Alaska" = 2

3. Imagine we have loaded in the dataset above and want to merge to another dataset where the unit of observation is an airline, and the dataset contains a variable that captures the total number of employees of that airline. What type of merge would this be? (m:1)

4. If we wanted to understand the statistical relationship between par_q1 and kq5_cond_parq1, what figure would be most appropriate? (scatter)

5. What code would successfully retrieve the average of par_q1, conditional on the institution having a value of count that is greater than or equal to 3000? [sum par_q1 if count >= 3000]

6. The table above shows that UC Irvine graduates from low-income backgrounds are more likely to reach high incomes than low-income students from UC Santa Barbara. Can we conclude this is a causal effect of UC Irvine? (not enough info)

8. In Week 4, we studied Mindspark, a technology to teach at the right level. Instead of an experiment, imagine we had observational data on Mindspark use across students in a different city in India. We find that the students who attend Mindspark have higher test scores than students who do not. What can we conclude? (correlated with higher scores, same conclusion as original)

9. In Week 3, we studied discrimination in traffic stops using the veil of darkness test. What was the unit of observation in the Stanford Open Policing Project? (Stop) - exhibits long right-tail

10. We find that the intercept in this plot is equal to 3. We then find that someone made a mistake and added 1 to everyone's X-value. If we re-estimate the regression with the new variable, how will the intercept change? (It will be larger in magnitude)

What code would correctly generate a binary indicator variable that is equal to one if the number of sales is greater than or equal to 100 million units and zero otherwise?
[gen greater_than_100 = (sales >= 100)]

Compute the number of individuals in the dataset that are male vs. female.
table(resume$sex)
or
sum(resume$sex=="male")
sum(resume$sex=="female")

Add a variable to resume that is equal to 1 if the individual is male and zero otherwise. Name this new variable male.
Tidyverse version:
resume <- resume %>%
mutate(male = as.numeric(sex=="male"))
Base R version:
resume$male <- as.numeric(resume$sex=="male")

If you estimate the regression above in the Bertrand and Mullainathan (2004) data, you will find 𝛽 = -0.01. Write a clear sentence interpreting this slope coefficient.
This implies that male applicants are 1 percentage point less likely to be called for an interview relative to female applicants.

Next create a dataframe that restricts only to Black applicants.
sub <- resume %>%
filter(race == "black")

Write code below that computes the average of sat_avg_2013 only for colleges in CA.
mean(colleges$sat_avg_2013[colleges$state=="CA"])

You want to understand the distribution of sticker prices separately for public and private schools. Write a few sentences describing a figure that would accomplish this goal. In other words, describe what type of figure you would create, and what appears on the vertical and horizontal axis. In the question below, you will be asked to write code to construct this figure.
BOXPLOT: y = sticker price, x = private vs public; the graph shows the distribution across percentiles for each group
colleges %>%
ggplot(aes(x=factor(public), y=price)) +
geom_boxplot()

Write code below that (1) restricts the data to private universities, and then (2) sorts the data from highest sticker price to the lowest sticker price. You should accomplish this using functions from tidyverse.
colleges %>%
filter(public==0) %>%
arrange(desc(price))

Data frame + histogram to show the distribution of average price by state:
state_level <- colleges %>%
group_by(state) %>%
summarize(meanprice=mean(price))
state_level %>%
ggplot(aes(x=meanprice)) +
geom_histogram()
or: hist(state_level$meanprice)