SUMMARY: RStudio and Radiant, Statistics Notes

Summary of the lectures and the book on RStudio and Radiant for the statistics exam at Bocconi

Type: Notes

2020/2021

On sale since 22/06/2022

rebeccacordioli

R STUDIO

INTRODUCTION

1) Console → Where you type the commands you want to execute and where the output they produce is printed.
2) Script → It allows you to write several lines of code (the console is useful for just a few lines).
  • To execute a single line, use Ctrl+Enter
  • To create a new script: File → New File → R Script
  • To save the script in a specific folder:
    o “More”
    o “Set as working directory”
    o Now you can save it (top left corner)
3) Environment → It shows the list of objects created so far, with a summary of each of them.
  • To clean up the environment use rm(list=ls()), where ls() is the function that lists the objects created so far.
4) History → You can search for a variable and find all the occurrences where you used that variable.
5) Files → It shows the contents of the current directory; you can change directory by clicking on the name of the current one.
6) Plots → It shows, in sequence, all the graphs produced during the working session.
7) Packages → It lists the packages installed on your computer and those currently loaded in memory (indicated by the check marks next to the package names).

Load data

To load the contents of an .RData file in RStudio, you can:
  • choose File → Open File..., select the file to open and confirm your choice
  • click on the file name in the Files tab and confirm
  • use the load("filename") function

If the file is in a directory other than the current one, you must first change the working directory: move to that directory in the Files tab, click on the gear icon (“More”) and choose “Set As Working Directory”.
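
The save-and-reload cycle described above can also be run from the console. The following is a minimal sketch: the objects x and y, the file name example.RData and the use of a temporary folder are assumptions made only for illustration.

```r
# Work in a temporary folder (stand-in for your own project folder);
# setwd() returns the previous working directory so it can be restored.
old_wd <- setwd(tempdir())

# Two hypothetical objects, created only for this example
x <- c(1, 2, 3)
y <- "hello"

# Store both objects in an .RData file inside the working directory
save(x, y, file = "example.RData")

# Remove them; rm(list = ls()) would clear the whole environment instead
rm(x, y)

# load() restores the objects and returns their names
loaded <- load("example.RData")
print(loaded)

# Go back to the original working directory
setwd(old_wd)
```

Using rm(x, y) rather than rm(list = ls()) keeps the example from wiping anything else you may have in the environment.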

Assign a variable → <- or =

x <- 7
x
[1] 7

Note! The == logical operator, instead, checks if the left side is equal to the right side and returns TRUE or FALSE.
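
A short sketch of the difference between assigning and comparing; the variable names are chosen only for illustration.

```r
x <- 7                  # assignment with <-
y = 7                   # assignment with =, equivalent at the top level

same      <- (x == y)   # comparison: TRUE, both hold 7
different <- (x == 8)   # comparison: FALSE, x itself is not modified

print(same)
print(different)
```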

VECTORS

Create a vector → c(elements)

vector_name <- c(4, 6, 8, 12)

Selecting elements from vectors → [position_of_the_element]

vector_name[3]
[1] 8

Creating a sequence: x = 0:6 (the integers from 0 to 6); a range inside the brackets, e.g. vector_name[2:3], slices the vector

Functions of vectors

  • mean(x)
  • median(x)
  • var(x)
  • sd(x)
  • cov(x, y) → covariance
  • cor(x, y) → correlation coefficient ρ
  • max(x)
  • min(x)
  • log(x)
  • summary(x)
  • quantile(x, probs = c(a, b…)) → it returns the quantiles of order a, b…
  • length(x) → it returns the size of the vector
  • sum(x)
  • cumsum(x) → cumulative sums
  • rbind(x1, x2 …) / cbind(x1, x2 …) → to combine two or more vectors by rows / by columns
  • seq(start, end, difference_between_values)
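
The functions listed above can be tried on a small example. This is only a sketch: the vector x reuses the values from the notes, while y is a made-up second vector.

```r
x <- c(4, 6, 8, 12)
y <- c(1, 3, 2, 5)           # hypothetical second vector

third  <- x[3]               # 8
middle <- x[2:3]             # 6 8

avg  <- mean(x)              # 7.5
med  <- median(x)            # 7
v    <- var(x)               # sample variance (divides by n - 1)
s    <- sd(x)                # square root of the sample variance
r    <- cor(x, y)            # correlation coefficient, between -1 and 1

q    <- quantile(x, probs = c(0.25, 0.75))
run  <- cumsum(x)            # 4 10 18 30

by_rows <- rbind(x, y)       # 2 rows, 4 columns
by_cols <- cbind(x, y)       # 4 rows, 2 columns

grid <- seq(0, 1, 0.25)      # 0.00 0.25 0.50 0.75 1.00
```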

DATASET

  • data("namefile") → it loads the dataset named “namefile” into the environment
  • head(namefile) → it shows the first rows of a dataset (useful if the dataset is too long)
  • namefile[1,3] → it finds the element in the 1st row and 3rd column
  • load("namefile") → it loads the dataset, only if the file is in the working directory
  • namefile$sub → to extract the variable (column) named “sub”

Saving the dataset

  • “…” to retrieve the folder
  • Open the folder
  • Set the folder as the working directory by:
    o “More”
    o “Set as working directory”
    o Now everything you upload/save will be taken from/put in that folder
  • Load the dataset by clicking on its name and confirming (or use the load() function)
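
As an illustration of the dataset commands, the sketch below uses mtcars, a dataset that ships with R. It is chosen here only as a stand-in for “namefile”.

```r
data("mtcars")                 # load the built-in dataset into the environment

first_rows <- head(mtcars)     # first 6 rows by default
first_cell <- mtcars[1, 3]     # 1st row, 3rd column (the disp variable)
mpg        <- mtcars$mpg       # extract one variable as a vector

print(first_rows)
print(first_cell)              # 160
```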

PACKAGES

To download a package:

  • Go in the “Packages” tab
  • Click “Install”
  • Type the name of the package and ✓ it

Note! If you click on the name, you can read all the functions of that package
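
The same steps can be done from the console. In this sketch the package name "stats" is only a stand-in (it is already installed with R), and the install line is left commented out so that nothing is downloaded.

```r
pkg <- "stats"   # stand-in; replace with the name of the package you need

# install.packages(pkg)   # downloads and installs; needed once per computer

# Load the package into memory (the equivalent of ticking it in the tab)
library(pkg, character.only = TRUE)

# Check that it is now loaded
is_loaded <- pkg %in% loadedNamespaces()
print(is_loaded)
```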

BINOMIAL DISTRIBUTION

→ PMF:

  • choose(n, k) * p^k * (1-p)^(n-k)
  • dbinom(k, size=n, prob=p)
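
A quick check that the two expressions give the same probability. The values of n, k and p are arbitrary.

```r
n <- 10    # number of trials (hypothetical)
k <- 3     # number of successes (hypothetical)
p <- 0.5   # probability of success (hypothetical)

by_formula <- choose(n, k) * p^k * (1 - p)^(n - k)
by_dbinom  <- dbinom(k, size = n, prob = p)

print(by_formula)   # 0.1171875
print(by_dbinom)    # 0.1171875
```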

NORMAL DISTRIBUTION

Density → PDF: f(x) = dnorm(x, mean, sd)
Distribution function → CDF: F(x) = P(X<x) = pnorm(x, mean, sd)
Upper tail → 1 - F(x) = P(X>x) = 1 - pnorm(x, μ, σ) = pnorm(x, μ, σ, lower.tail=FALSE)
Quantiles of order x → Qx = qnorm(x, mean, sd)

  • If I don’t specify them, μ=0 and σ=1
  • To find the z such that 5% of the values are more extreme than z, which means P(X<-z) + P(X>z) = 0.05 → z has to be the quantile of order 0.975, therefore z = qnorm(0.975)
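
The normal-distribution functions can be checked against each other. This is a sketch; the cut-off 1.96 and the second distribution (mean 100, sd 15) are arbitrary examples.

```r
# Standard normal: mean = 0 and sd = 1 are the defaults
density_at_0 <- dnorm(0)                          # height of the curve at 0
lower_tail   <- pnorm(1.96)                       # P(X < 1.96)
upper_tail   <- pnorm(1.96, lower.tail = FALSE)   # P(X > 1.96)

# z such that 5% of the values are more extreme than z
z <- qnorm(0.975)                                 # about 1.96
two_tails <- pnorm(-z) + pnorm(z, lower.tail = FALSE)

# A non-standard normal, with hypothetical mean 100 and sd 15
p_below_130 <- pnorm(130, mean = 100, sd = 15)    # same as pnorm(2)
```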

T-STUDENT DISTRIBUTION

Density → PDF: f(x) = dt(x, df)
Distribution function → CDF: F(x) = P(X<x) = pt(x, df)
Upper tail → 1 - F(x) = P(X>x) = 1 - pt(x, df) = pt(x, df, lower.tail=FALSE)
Quantiles of order x → Qx = qt(x, df)
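
A small sketch of the t functions, assuming 10 degrees of freedom and a cut-off of 1.5 purely as examples.

```r
df <- 10                                     # hypothetical degrees of freedom

upper_a <- 1 - pt(1.5, df)                   # P(X > 1.5)
upper_b <- pt(1.5, df, lower.tail = FALSE)   # same probability

# Quantile of order 0.975: larger than the normal one (about 1.96)
t_crit <- qt(0.975, df)                      # about 2.228
print(t_crit)
```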

CHI SQUARE DISTRIBUTION

Density → PDF: f(x) = dchisq(x, df)
Distribution function → CDF: F(x) = P(X<x) = pchisq(x, df)
Upper tail → 1 - F(x) = P(X>x) = 1 - pchisq(x, df) = pchisq(x, df, lower.tail=FALSE)
Quantiles of order x → Qx = qchisq(x, df)
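
A sketch of how the chi-square functions are typically combined in a test. The statistic 4.2 and the 2 degrees of freedom are made-up numbers.

```r
df        <- 2      # hypothetical degrees of freedom
statistic <- 4.2    # hypothetical chi-square score

crit    <- qchisq(0.95, df)                          # about 5.99
p_value <- pchisq(statistic, df, lower.tail = FALSE) # P(X > 4.2)

# Reject H0 at the 5% level only if the score exceeds the critical value
reject <- statistic > crit
print(c(crit = crit, p_value = p_value))
print(reject)
```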

LINEAR REGRESSION MODEL

Define the model

  • Simple linear regression → m = lm(Y ~ X)
  • Multiple linear regression → m = lm(Y ~ X1 + X2)

Note! Copy and paste the tilde ~ symbol from the help page opened with ?tilde

Graph

  • Scatterplot → plot(X, Y)
  • Visualize the regression line → abline(m)

Anova table

Get the variance decomposition SST = SSR + SSE → anova(m)

Summary

summary(m)
If we store it as a variable, info = summary(m), then we can recover the information we need with info$...

Confidence intervals for coefficients (b0, b1…)

confint(m, level = 1-α)
If I don’t specify level, level=0.95 and α=0.05.

Find Y

  1. We can plug X1=x1 and X2=x2 into the linear function to obtain the predicted Y: predict(m, data.frame(X1=x1, X2=x2))
  2. Or we can find an interval for the predicted Y, with level = 1-α (if I don’t specify it, level=0.95 and α=0.05):
     a) Confidence interval for the average value of Y → predict(m, data.frame(X1=x1, X2=x2), interval = "confidence", level = 1-α)
     b) Prediction interval for the actual value of Y → predict(m, data.frame(X1=x1, X2=x2), interval = "prediction", level = 1-α)

Note! Add X2=x2 only if it is a multiple linear regression model with more variables

Check if assumptions are met through residual graphs

  1. Linearity → plot(m, 1): residuals should be around 0
  2. Homoscedasticity → plot(m, 1): variance of residuals should be constant
  3. Normality → Q-Q Plot plot(m, 2): it should follow the straight line
  4. Multicollinearity → cor(X1, X2): it should be < 0.
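
The regression workflow above can be sketched end to end on the built-in mtcars data. The choice of variables (mpg as Y, wt as X1, hp as X2), the new observation and the data = argument are assumptions made for this example; the diagnostic plots are left out so that the code runs without a graphics window.

```r
# Multiple linear regression: Y = mpg, X1 = wt (weight), X2 = hp (horsepower)
m <- lm(mpg ~ wt + hp, data = mtcars)

# Variance decomposition: the "Sum Sq" column adds up to SST
tab <- anova(m)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

# Store the summary and pull out single pieces with $
info <- summary(m)
r2   <- info$r.squared

# 90% confidence intervals for the coefficients (level has to be named)
ci <- confint(m, level = 0.90)

# Predicted Y for a hypothetical car with wt = 3 and hp = 110
new_car  <- data.frame(wt = 3, hp = 110)
point    <- predict(m, new_car)
conf_int <- predict(m, new_car, interval = "confidence", level = 0.95)
pred_int <- predict(m, new_car, interval = "prediction", level = 0.95)

# Correlation between the two explanatory variables
r_x1x2 <- cor(mtcars$wt, mtcars$hp)
```

The prediction interval is always wider than the confidence interval at the same point, because it also accounts for the variability of a single observation around the average.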

“BASICS” TAB

PROBABILITY CALCULATOR

  • No dataset required
  • We can choose different distributions (normal, Student’s t…)
  • To calculate probabilities → input = values
  • To calculate quantiles → input = probabilities

SINGLE MEAN (Hypothesis test case 2)

  • You need a dataset (load it in the Data tab → Manage)
  • To compute the sample mean (mean) of a normal distribution
  • You need to select:
    • The numerical variable for which you want to calculate the mean
    • If H1 is one-sided or two-sided
    • The confidence level
    • The comparison value (= H0)

The summary shows:
  • mean → sample mean
  • n → sample size
  • n_missing → number of missing values
  • sd → sample standard deviation
  • se → standard error of the sample mean (= sd/√n)
  • me → margin of error, i.e. half the width of the confidence interval (= se × corresponding quantile)
  • diff → difference between the sample mean and the population mean under H0
  • t.value → t-score
  • p.value → p-value
  • df → degrees of freedom
  • [x% y%] → lower and upper bound of the confidence interval

COMPARE MEANS (Hypothesis test case 4 and 5)

  • You need a dataset (load it in the Data tab → Manage)
  • It compares different means
    Example: the mean wage (2nd variable, numerical) depending on ethnicity (1st variable, categorical)
  • You need to select:
    o At least two variables (1st: anything, 2nd: numeric variable)
    o If H1 is one-sided or two-sided
    o The confidence level
    o Sample type → independent (for case 5), paired (for case 4)
    o Test type → ALWAYS t-test
  • Assumptions:
    o Normally distributed populations
    o Population variances are unknown and possibly different
    o Independent samples

Note! Tick “Show additional statistics” to see se, t.value, df, and the confidence interval
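
The quantities reported for the single-mean and compare-means tests can be reproduced with base R. The sketch below uses t.test() on built-in datasets rather than Radiant itself; the comparison value 20 and the datasets are assumptions chosen for illustration.

```r
# Single mean: H0: mu = 20 against a two-sided H1, on the built-in mtcars data
mpg <- mtcars$mpg
one <- t.test(mpg, mu = 20, alternative = "two.sided", conf.level = 0.95)

# Rebuild the quantities listed in the summary by hand
n         <- length(mpg)
se        <- sd(mpg) / sqrt(n)            # standard error of the sample mean
me        <- qt(0.975, df = n - 1) * se   # margin of error at the 95% level
mean_diff <- mean(mpg) - 20               # sample mean minus comparison value
t_value   <- mean_diff / se

# Compare means on the built-in sleep data (two groups of 10 measurements)
g1 <- sleep$extra[sleep$group == "1"]
g2 <- sleep$extra[sleep$group == "2"]
independent <- t.test(g1, g2)                 # Welch test: variances may differ
paired      <- t.test(g1, g2, paired = TRUE)  # paired samples

print(one)
print(paired)
```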

SINGLE PROPORTION (Hypothesis test case 3)

  • You need a dataset (load it in the Data tab → Manage)
  • To compute the sample proportion (p)
  • You need to select:
    • The categorical variable for which you want to calculate the proportion
    • The “level” (= value of the categorical variable for which to calculate the proportion)
    • If H1 is one-sided or two-sided
    • The confidence level
    • The comparison value (= H0)
  • Test type → ALWAYS Z-test

GOODNESS OF FIT

  • You need a dataset (load it in the Data tab → Manage)
  • You need to select:
    o The categorical variable
    o The probabilities that this variable is expected to follow
  • To check if the distribution of the variable is consistent with the specified distribution

The summary shows:
  • “Observed” → to see the observed values
  • “Expected” → to see the expected values, following the specified probabilities
  • “Chi-squared” → contribution to the chi-square score
  • Chi-square score χ²
  • df → degrees of freedom (k-1)
  • p.value

Note! Check that 0% of cells have expected values <5!

CROSS-TABS

  • You need a dataset (load it in the Data tab → Manage)
  • You need to select the TWO categorical variables
  • To check if there is an association between these variables

The summary shows:
  • “Observed” → to see the contingency table of the observed values
  • “Expected” → to see the contingency table of the expected values
  • “Chi-squared” → contribution to the chi-square score
  • “Row percentages” → conditional frequency
  • “Column percentages” → conditional frequency
  • “Table percentages” → total relative frequency
  • Chi-square score χ²
  • df → degrees of freedom (r-1)(c-1)
  • p.value

Note! Check that 0% of cells have expected values <5!
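
Both chi-square tests can be reproduced with base R's chisq.test(). This is a sketch outside Radiant, and all counts and probabilities below are made up for illustration.

```r
# Goodness of fit: observed counts against specified probabilities
observed <- c(A = 30, B = 50, C = 20)      # hypothetical counts
probs    <- c(0.25, 0.50, 0.25)            # hypothetical expected shares
gof <- chisq.test(observed, p = probs)

gof_expected <- gof$expected               # 25 50 25
gof_score    <- unname(gof$statistic)      # (5^2 + 0^2 + 5^2) / 25 = 2
gof_df       <- unname(gof$parameter)      # k - 1 = 2

# Cross-tabs: association between two categorical variables
counts <- matrix(c(20, 30, 25, 25, 30, 20), nrow = 2,
                 dimnames = list(group = c("g1", "g2"),
                                 answer = c("low", "mid", "high")))
ct <- chisq.test(counts)

ct_expected <- ct$expected                 # every cell equals 25
ct_score    <- unname(ct$statistic)        # 4
ct_df       <- unname(ct$parameter)        # (r - 1)(c - 1) = 2

# Row percentages (conditional frequencies given the row)
row_pct <- prop.table(counts, margin = 1)

# Check that no cell has an expected value below 5
all_cells_ok <- all(gof_expected >= 5) && all(ct_expected >= 5)
```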