



Summary of the Data Analysis course (IPLE)
Type: Outlines and concept maps




Data analysis is the process of inspecting, cleansing, transforming and modelling data with the goal of
discovering useful information, informing conclusions and supporting decision-making. Data are empirical
material organized into a form that can be analyzed. Primary data analysis: you collect and categorize
(coding process) your own data through interviews, official documents, experiments, surveys, etc.
Secondary data analysis: you use data sets that have been gathered by others and subsequently
deposited in databases (existing archived collections of data). Units of analysis are the objects/subjects to
which the properties investigated pertain. A variable is an empirical measurement of a characteristic. Key
features of a variable: a name and at least two values (otherwise it would be a constant). Variation of a
variable may occur in two ways: over time, on the same cases; or among cases, at the same time.
There are three levels of measurement of variables: nominal, ordinal and interval. A nominal variable
has two or more categories with no intrinsic ordering (discrete and non-orderable). An ordinal variable
is similar to a nominal variable, but there is a clear ordering of the categories (discrete and orderable).
An interval variable is similar to an ordinal variable, except that the intervals between the values of
the variable are equally spaced.
A spreadsheet is an interactive computer application for the organization, analysis, and storage of data in
tabular form. A spreadsheet consists of a table of cells arranged into rows and columns and referred to by
their column (X) and row (Y) locations. A cell is a box for holding data. A single cell is usually referenced
by its column and row. A worksheet is a grid of cells with either raw data, called values, or formulas in the
cells. Values are raw data (general numbers, text, dates). Alternatively, the content of a cell can be based
on a formula, which performs a calculation. A formula is an equation that performs calculations, such as
addition, subtraction, multiplication, and division, on values in a worksheet. To enter a formula in a cell
you always need to start with the equal sign (=). Functions are built-in formulas, such as arithmetic
operations (for example, sums and averages), trigonometric functions, statistical functions, etc. Charts are
graphical displays of data.
A cell reference identifies a cell’s location in the worksheet, based on its column letter and row number,
such as A1 (column A, row 1) or E4 (column E, row 4). There are three types of cell references:
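Relative reference (for example A1): if you copy-paste the content of a cell with a relative reference
into another cell, the reference changes according to the distance in number of rows and columns
between the first and second cell.
Absolute reference (for example $A$1): the dollar signs fix both the column and the row, and if you
copy-paste the content of a cell with an absolute reference the reference does not change.
Mixed reference ($A1 to fix the column, A$1 to fix the row): if you copy-paste the content of a cell
with a mixed reference, only the part that is not fixed (the column or the row) changes.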
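The copy-paste behaviour of the three reference types can be sketched in Python. This is only an illustration under stated assumptions: `shift_reference`, `col_to_num` and `num_to_col` are hypothetical helpers written for this note (they are not part of Excel or of any spreadsheet library), they handle single-cell references only, and they do not guard against offsets that would move a reference off the sheet.

```python
import re


def col_to_num(col):
    """Convert a column label such as 'A' or 'AB' to a 1-based number."""
    num = 0
    for ch in col:
        num = num * 26 + (ord(ch) - ord("A") + 1)
    return num


def num_to_col(num):
    """Convert a 1-based column number back to a column label."""
    col = ""
    while num > 0:
        num, rem = divmod(num - 1, 26)
        col = chr(ord("A") + rem) + col
    return col


def shift_reference(ref, row_offset, col_offset):
    """Return ref as it would read after being pasted row_offset rows
    down and col_offset columns to the right (hypothetical helper)."""
    match = re.fullmatch(r"(\$?)([A-Z]+)(\$?)(\d+)", ref)
    if match is None:
        raise ValueError(f"not a single-cell reference: {ref!r}")
    col_fixed, col, row_fixed, row = match.groups()
    if not col_fixed:  # no $ before the column: the column is relative
        col = num_to_col(col_to_num(col) + col_offset)
    if not row_fixed:  # no $ before the row: the row is relative
        row = str(int(row) + row_offset)
    return f"{col_fixed}{col}{row_fixed}{row}"


# Paste a formula two rows down and one column to the right.
for ref in ["A1", "$A$1", "$A1", "A$1"]:
    print(ref, "->", shift_reference(ref, row_offset=2, col_offset=1))
```

With this offset the relative reference A1 becomes B3, the absolute reference $A$1 stays the same, and each mixed reference changes only in the part without a dollar sign.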
[...] rows, and data (values) in cells.
[...] recognizable names for variables.
[...] and yyyy is a 4-digit year.
[...] variable, a label or longer name that describes the variable, including the units it is
measured in, the codes for any categorical variables, how missing values are coded, and
any further notes.
Descriptive statistics summarize the information in a collection of data. Frequency distribution is a listing
of possible values for a variable, together with the number of observations at each value.
Measures of central tendency are statistics that show what a typical observation is like.
Measures of variability are statistics that show the amount of dispersion in a dataset:
Range: the difference between the largest and the smallest observations.
Standard deviation: the larger the standard deviation, the greater the spread of the data around the mean.
Interquartile range (IQR): this measure describes the spread of the middle half of the observations.
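As a rough illustration of these measures, the sketch below uses Python's standard `statistics` module on a small made-up sample. The data values are invented for the example, and the quartiles use the module's default "exclusive" method; other software may use a slightly different quartile convention and give slightly different results.

```python
import statistics

# Made-up sample of seven observations.
data = [2, 4, 4, 5, 7, 9, 11]

# Central tendency: what a typical observation is like.
mean = statistics.mean(data)
median = statistics.median(data)

# Variability: how spread out the observations are.
data_range = max(data) - min(data)          # largest minus smallest
std_dev = statistics.stdev(data)            # sample standard deviation
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles (exclusive method)
iqr = q3 - q1                               # spread of the middle half

print("mean:", mean, "median:", median)
print("range:", data_range)
print("standard deviation:", round(std_dev, 3))
print("interquartile range:", iqr)
```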
Frequency distribution can be absolute or relative. Absolute frequency is the number of observations per
category of a variable. Relative frequency for a category is the proportion or percentage of the observations
that fall in that category. The proportion equals the number of observations in a category divided by the
total number of observations. It is a number between 0 and 1 that expresses the share of the observations
in that category.
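A minimal sketch of absolute and relative frequencies, using made-up survey answers (the data and variable names are assumptions for illustration only):

```python
from collections import Counter

# Made-up answers to a survey question (a nominal variable).
answers = ["yes", "no", "yes", "yes", "undecided", "no", "yes", "no"]

# Absolute frequency: number of observations per category.
absolute = Counter(answers)

# Relative frequency: count in the category / total number of observations.
total = len(answers)
relative = {category: count / total for category, count in absolute.items()}

for category in absolute:
    print(category, absolute[category], relative[category])
```

The proportions always sum to 1; multiplying them by 100 gives percentages.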
Treemap: to portray hierarchical data. It requires a hierarchical data structure with parent
categories and child categories.
Map: to show values for each geographic unit (for instance, countries). A continuous value is
assigned to each geographic unit.
The analysis of the association between two variables is called bivariate analysis. An association exists
between two variables if particular values for one variable are more likely to occur with certain values of
the other variable. When doing bivariate analysis, our first step is to distinguish between the dependent/
response/output variable and the independent/explanatory/input variable.
The independent/explanatory/input variable is the variable that influences the dependent variable.
The dependent/response/output variable is the variable that is influenced by the independent variable. The
dependent variable is the outcome variable on which comparisons are made for different values of the
independent variable. When the explanatory variable is categorical, it defines the groups to be compared
with respect to the response variable. When the explanatory variable is quantitative, we examine how
different values of the explanatory variable relate to changes in the response variable.
[...] value of the independent variable
[...] values of the independent variable at the same value of the dependent variable
Two categorical variables are independent if the population conditional distributions for one of them are
identical at each category of the other. The variables are dependent if the conditional distributions aren’t
identical.
Steps for the chi-squared test of independence
Assumptions:
a. Two categorical variables
b. Randomization
c. Expected count ≥ 5 in all cells
Hypotheses:
a. H0: the two variables are independent
b. Ha: the two variables are dependent
Test statistic:
a. If H0 is true, then in a contingency table the relative frequencies of the dependent variable
should be equal across categories of the independent variable, and the observed counts should
equal the expected cell counts
b. Compute the expected count for each cell: Expected cell count = (Row total × Column total) / Total sample size
c. Compute X² = Σ (observed count − expected count)² / expected count
Conclusion: reject H0 when P-value ≤ significance level (generally 0.05)
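The test-statistic steps above can be sketched in plain Python on a made-up 2×2 contingency table. The counts are invented for illustration. Two things in the sketch are not stated in the notes and are added as standard facts: the degrees of freedom, df = (rows − 1) × (columns − 1), and a P-value shortcut that is valid only when df = 1. For larger tables the P-value would normally come from statistical software.

```python
import math

# Made-up 2x2 contingency table: rows are categories of the independent
# variable, columns are categories of the dependent variable.
observed = [[30, 20],
            [10, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)  # total sample size

# Expected cell count = (row total * column total) / total sample size
expected = [[r * c / n for c in col_totals] for r in row_totals]

# X^2 = sum over all cells of (observed - expected)^2 / expected
chi_squared = sum(
    (obs - exp) ** 2 / exp
    for obs_row, exp_row in zip(observed, expected)
    for obs, exp in zip(obs_row, exp_row)
)

# Degrees of freedom for an r x c table (standard result, not in the notes).
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Shortcut valid ONLY for df = 1: P(X^2 >= x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_squared / 2))

print("expected counts:", expected)
print("X^2 =", round(chi_squared, 3), "df =", df)
print("reject H0 at the 0.05 level:", p_value <= 0.05)
```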
If the association looks linear, you can compute the correlation coefficient, which summarizes the strength
and direction of the linear association between two quantitative variables. Denoted by r, it takes values
between −1 and +1. In Excel you can use the function =CORREL. The value of the correlation does not depend
on the variables' units.
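A minimal sketch of the correlation coefficient, written out by hand so the arithmetic is visible. The function name `correlation` and the data (hours studied and exam scores) are made up for illustration. The last lines check that r does not depend on the units, by converting hours to minutes.

```python
import math


def correlation(x, y):
    """Pearson correlation r between two equally long lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: hours studied and exam score for five students.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]

r = correlation(hours, score)

# Changing the unit of x (hours to minutes) leaves r unchanged.
r_minutes = correlation([h * 60 for h in hours], score)

print("r =", round(r, 4))
print("same r in minutes:", math.isclose(r, r_minutes))
```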
When the relationship has a straight-line pattern, we can analyze the data further by finding an equation
for the straight line that best describes that pattern. The regression line predicts the value for the response
variable y as a straight-line function of the value x of the explanatory variable. Let ŷ (y-hat) denote the
predicted value of y. The equation for the regression line has the form: ŷ = a + bx.
The slope b in the equation ŷ = a + bx equals the amount that ŷ changes when x increases by one unit. The
y-intercept a is the predicted value of y when x = 0; it equals a = ȳ − b·x̄. The regression line is the
optimal line to fit through the data points in the sense that it makes the residuals as small as possible:
the least squares method produces the line that has the smallest value for the residual sum of squares
(residual = y − ŷ). In Excel you can find the regression line using the function =LINEST or the Analysis
ToolPak add-in.
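The least squares line can also be computed by hand. The sketch below assumes made-up data and a hypothetical helper `least_squares_line`; it uses the standard formula for the slope, b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², which the notes do not spell out, together with the intercept formula a = ȳ − b·x̄.

```python
def least_squares_line(x, y):
    """Return (a, b) for the least squares line y_hat = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx             # slope
    a = mean_y - b * mean_x   # y-intercept
    return a, b


# Made-up data: hours studied (x) and exam score (y).
x = [1, 2, 3, 4, 5]
y = [52, 55, 61, 64, 70]

a, b = least_squares_line(x, y)

predicted = [a + b * xi for xi in x]
residuals = [yi - y_hat for yi, y_hat in zip(y, predicted)]
rss = sum(e ** 2 for e in residuals)  # residual sum of squares

print("y_hat =", round(a, 2), "+", round(b, 2), "* x")
print("residual sum of squares:", round(rss, 2))
```

For these data each extra hour of study is associated with a predicted increase of b points in the score.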
Steps of two-sided significance test about a population slope
Assumptions:
a. Relationship in population satisfies regression model μy = α + βx
b. Data gathered using randomization
c. Population y values at each x value have normal distribution, with same standard deviation
at each x value
Hypotheses:
a. H0: β = 0
b. Ha: β ≠ 0
Test statistic: t = (b − 0) / se, where software supplies sample slope b and its standard error se
P-value: two-tail probability from the t distribution with df = n − 2 (supplied by software)
Conclusion: reject H0 if P-value ≤ significance level (such as 0.05).
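The notes leave the slope and its standard error to software. As a sketch of what the software does, the hypothetical helper below computes them by hand on made-up data, using the standard formula se = sqrt((RSS / (n − 2)) / Σ(x − x̄)²), which is an addition not found in the notes. The P-value itself is not computed here, since it requires the t distribution.

```python
import math


def slope_t_statistic(x, y):
    """Return (b, se, t, df) for the test of H0: beta = 0."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Residual sum of squares around the fitted line.
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    # Standard error of the slope (standard formula, assumed here).
    se = math.sqrt(rss / (n - 2) / sxx)
    t = (b - 0) / se
    return b, se, t, n - 2


# Made-up data: hours studied (x) and exam score (y).
x = [1, 2, 3, 4, 5]
y = [52, 55, 61, 64, 70]

b, se, t, df = slope_t_statistic(x, y)
print("b =", round(b, 3), "se =", round(se, 3), "t =", round(t, 2), "df =", df)
```

A large absolute t value relative to the t distribution with df = n − 2 corresponds to a small P-value.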
R² (R-squared, or the coefficient of determination) is a percentage of the variability in the response
variable, not a percentage of the response variable itself. An R² of x% means that x% of the variability
in the dependent variable can be explained by the independent variable.
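One common way to compute R² is 1 − (residual sum of squares) / (total sum of squares). The sketch below assumes that definition, a hypothetical helper `r_squared`, made-up data, and an illustrative fitted line ŷ = 46.9 + 4.5x taken as the least squares line for those data.

```python
def r_squared(observed, predicted):
    """Share of the variability in the observed y values that the
    predictions account for: 1 - RSS / TSS."""
    mean_obs = sum(observed) / len(observed)
    rss = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    tss = sum((y - mean_obs) ** 2 for y in observed)
    return 1 - rss / tss


# Made-up data and an assumed fitted line y_hat = 46.9 + 4.5 * x.
x = [1, 2, 3, 4, 5]
y = [52, 55, 61, 64, 70]
y_hat = [46.9 + 4.5 * xi for xi in x]

r2 = r_squared(y, y_hat)
print("R^2 =", round(r2, 3))  # share of variability in y explained by x
```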
If your categorical independent variable has only two categories (binary), you can perform a t-test to
test formally whether there is a statistically significant difference in means between the two groups.
Steps of two-sided significance test for comparing two population means
Assumptions:
a. A quantitative response variable observed in each of two groups
b. Independent random samples, either from random sampling or a randomized experiment
c. Approximately normal population distribution for each group, or at least ni ≥ 30 in each group
(this is mainly important for small sample sizes, and even then the two-sided test is robust to
violations of this assumption)
Hypotheses:
a. H0: μ1 = μ2 (the population mean of group 1 is equal to the population mean of group 2)
b. Ha: μ1 ≠ μ2 (one-sided Ha: μ1 > μ2 or Ha: μ1 < μ2)
Test statistic: t = ((x̄1 − x̄2) − 0) / se
P-value: two-tail probability from the t distribution of values even more extreme than the observed t test
statistic, presuming the null hypothesis is true, with df given by software. Use the Excel
function =T.TEST
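The test statistic can be sketched by hand as follows. The helper `two_sample_t` and the data are made up for illustration, and the standard error uses the unequal-variance form se = sqrt(s1²/n1 + s2²/n2), which is an assumption since the notes leave se to software. The samples are deliberately tiny to keep the arithmetic readable, so they do not satisfy the sample-size guideline above. The P-value is not computed here because it requires the t distribution.

```python
import math
import statistics


def two_sample_t(group1, group2):
    """Return (t, se) for H0: mu1 = mu2, using the unequal-variance
    standard error (an assumption made for this sketch)."""
    n1, n2 = len(group1), len(group2)
    mean1, mean2 = statistics.mean(group1), statistics.mean(group2)
    var1, var2 = statistics.variance(group1), statistics.variance(group2)
    se = math.sqrt(var1 / n1 + var2 / n2)
    t = ((mean1 - mean2) - 0) / se
    return t, se


# Made-up response values for the two categories of a binary variable.
group1 = [5, 7, 8, 6, 9]
group2 = [3, 4, 6, 5, 2]

t, se = two_sample_t(group1, group2)
print("t =", round(t, 3), "se =", round(se, 3))
```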