Docsity
Docsity

Prepara i tuoi esami
Prepara i tuoi esami

Studia grazie alle numerose risorse presenti su Docsity


Ottieni i punti per scaricare
Ottieni i punti per scaricare

Guadagna punti aiutando altri studenti oppure acquistali con un piano Premium


Guide e consigli
Guide e consigli


Riassunto Data Analysis IPLE, Schemi e mappe concettuali di Analisi Dei Dati

Riassunto del corso di Data Analysis (IPLE)

Tipologia: Schemi e mappe concettuali

2022/2023

Caricato il 06/01/2023

bibi2708
bibi2708 🇮🇹

4.3

(19)

168 documenti

1 / 6

Toggle sidebar

Questa pagina non è visibile nell’anteprima

Non perderti parti importanti!

bg1
Prof. Visconti IPLE 2022-2023 first trimester
1
DATA ANALYSIS SUMMARY
BASICS
Data analysis is a process of inspecting, cleansing, transforming and modelling data with the goal of
discovering useful information, informing conclusions and supporting decision-making. Data are empirical
material organized into a form that can be analyzed. Primary data analysis: you collect and categorize
(coding process) your own data through interviews, official documents, experiments, surveys, etc.
Secondary data analysis: you use data sets that have been gathered by others and have subsequently been
deposited in databases (: existing archived collections of data). Units of analysis are the objects/subjects to
which the properties investigated pertain A variable is an empirical measurement of a characteristic. Key
features of a variable: a name and at least two values (otherwise it would be a constant). Variation of a
variable may occur in two ways: over time, on the same cases; or among cases, at the same time.
The levels of measurement of variables are three: nominal, ordinal and interval. Nominal variable is one
that has two or more categories, but there is no intrinsic ordering to the categories (discrete and non-
orderable). Ordinal variable is similar to a categorical variable, but there is a clear ordering of the
categories (discrete and orderable). Interval variable is similar to an ordinal variable, except that the
intervals between the values of the variable are equally spaced.
A spreadsheet is an interactive computer application for the organization, analysis, and storage of data in
tabular form. A spreadsheet consists of a table of cells arranged into rows and columns and referred to by
the X and Y locations. A cell is a box for holding data. A single cell is usually referenced by its column and
row. A worksheet! is a grid of cells with either raw data, called values, or formulas in the cells. Values! are
raw data (general numbers, text, dates). Alternatively, a value can be based on a formula, which might
perform a calculation. A formula is an equation that performs calculations, such as addition, subtraction,
multiplication, and division, on values in a worksheet. To enter a formula in a cell you always need to start
with the equal sign (=). Functions! can be built-in functions, such as arithmetic operations (for example,
summations, averages), trigonometric functions, statistical functions, etc. Charts are graphical display of
data.
A cell reference identifies a cell’s location in the worksheet, based on its column letter and row number,
such as A1 (column A, row 1) or E4 (column E, row 4). There are three types of cell references:
Relative, both row and column references are relative (for example D4). If you copy-paste the
content of a cell with a relative reference in another cell the reference changes according to the
distance in number of rows and columns between the first and second cell.
Absolute, the row and the column references are fixed using the $ (Dollar) sign (for example $C$3)
and if you copy-paste the content of a cell with an absolute reference the reference does not
change.
Mixed, either the row or the column reference is absolute (for example $A1 to fix the column or
A$1 to fix the row); if you copy-paste the content of a cell with a mixed reference the column or the
row changes.
pf3
pf4
pf5

Anteprima parziale del testo

Scarica Riassunto Data Analysis IPLE e più Schemi e mappe concettuali in PDF di Analisi Dei Dati solo su Docsity!

DATA ANALYSIS – SUMMARY

BASICS

Data analysis is a process of inspecting, cleansing, transforming and modelling data with the goal of

discovering useful information, informing conclusions and supporting decision-making. Data are empirical

material organized into a form that can be analyzed. Primary data analysis : you collect and categorize

(coding process) your own data through interviews, official documents, experiments, surveys, etc.

Secondary data analysis : you use data sets that have been gathered by others and have subsequently been

deposited in databases (: existing archived collections of data). Units of analysis are the objects/subjects to

which the properties investigated pertain A variable is an empirical measurement of a characteristic. Key

features of a variable: a name and at least two values (otherwise it would be a constant). Variation of a

variable may occur in two ways: over time, on the same cases; or among cases, at the same time.

The levels of measurement of variables are three: nominal, ordinal and interval. Nominal variable is one

that has two or more categories, but there is no intrinsic ordering to the categories (discrete and non-

orderable). Ordinal variable is similar to a categorical variable, but there is a clear ordering of the

categories (discrete and orderable). Interval variable is similar to an ordinal variable, except that the

intervals between the values of the variable are equally spaced.

A spreadsheet is an interactive computer application for the organization, analysis, and storage of data in

tabular form. A spreadsheet consists of a table of cells arranged into rows and columns and referred to by

the X and Y locations. A cell is a box for holding data. A single cell is usually referenced by its column and

row. A worksheet is a grid of cells with either raw data, called values, or formulas in the cells. Values are

raw data (general numbers, text, dates). Alternatively, a value can be based on a formula, which might

perform a calculation. A formula is an equation that performs calculations, such as addition, subtraction,

multiplication, and division, on values in a worksheet. To enter a formula in a cell you always need to start

with the equal sign (=). Functions can be built-in functions, such as arithmetic operations (for example,

summations, averages), trigonometric functions, statistical functions, etc. Charts are graphical display of

data.

A cell reference identifies a cell’s location in the worksheet, based on its column letter and row number,

such as A1 (column A, row 1) or E4 (column E, row 4). There are three types of cell references:

  • Relative , both row and column references are relative (for example D4). If you copy-paste the

content of a cell with a relative reference in another cell the reference changes according to the

distance in number of rows and columns between the first and second cell.

  • Absolute , the row and the column references are fixed using the $ (Dollar) sign (for example $C$3)

and if you copy-paste the content of a cell with an absolute reference the reference does not

change.

  • Mixed , either the row or the column reference is absolute (for example $A1 to fix the column or

A$1 to fix the row); if you copy-paste the content of a cell with a mixed reference the column or the

row changes.

HOW TO STRUCTURE A DATASET

  1. Put variables in columns (the thing you are measuring, like “age” or “country”), observation in

rows, and data (values) in cells.

  1. Include a unique identifying number for each observation (ID).
  2. Put variable names in the first row.
    • Labels should be: unique, without special characters, and starting with a letter. Choose

recognizable names for variables

  1. Use a separate column for each piece of information.
    • For example, don’t enter data in a column as “Democratic Party-USA”
  2. When entering dates include a 4 - digit year
    • The best format for dates is mm/dd/yyyy, where mm is a 2 - digit month, dd is a 2 - digit day

and yyyy is a 4 - digit year.

  1. Decide on "missingness" conventions.
    • Be sure that it cannot be confused with a ‘real' data value.
  2. Be consistent in your data entry
    • When entering data keep the same format throughout.
  3. Document your database with a codebook.
    • The codebook should include all of the variable names, data type that corresponds to the

variable, a label or longer name that describes the variable including the units it is

measured in, the codes for any categorical variables, how missing values are coded, and

any further note.

DATA MANAGEMEN WITH EXCEL

  • Importing data from texts file ( Data tab)
  • Autofill (drag down the little green square while entering number series)
  • Data validation ( Data tab)
  • Removing duplicates ( Data tab)
  • Filter ( Data tab)
  • Sorting data ( Data tab)
  • Lookup function (=XLOOKUP)

HOW TO DESCRIBE DATA?

Descriptive statistics summarize the information in a collection of data. Frequency distribution is a listing

of possible values for a variable, together with the number of observations at each value.

Measures of central tendency are statistics that show what a typical observation is like.

  • Mode , the most frequently occurring value
  • Median , the middle value, the score that has 50% of the values above and 50% of the values below
  • Mean , the sum of the observations divided by the number of observations

Measures of variability are statistics that show the amount of dispersion in a dataset

  • Range : for categorical variables is the count of categories; while for interval variables is the

difference between the largest and the smallest observations

  • Standard deviation ( s ) is a sort of typical distance of an observation from the mean. The larger the

standard deviation, the greater the spread of the data around the mean

  • Interquartile range is the difference between the upper and lower quartiles ( measures of position ).

This measure describes the spread of the middle half of the observations

Frequency distribution can be absolute or relative. Absolute frequency is the number of observations per

category of a variable. Relative frequency for a category is the proportion or percentage of the observations

that fall in that category. The proportion equals the number of observations in a category divided by the

total number of observations. It is a number between 0 and 1 that expresses the share of the observations

in that category.

Treemap To portray hierarchical data.

Hierarchical data structure with parents

categories and sons.

Map

To show values for each geographic unit (for instance

countries).

For each geographical unit a continuous value is

assigned.

BIVARIATE ANALYSIS

The analysis of the association between two variables is called bivariate analysis. An association exists

between two variables if particular values for one variable are more likely to occur with certain values of

the other variable. When doing bivariate analysis, our first step is to distinguish between the dependent/

response/output variable and the independent/explanatory/input variable.

Independent/Explanatory/Input variable is the variable that influences the dependent variable.

Dependent/Response/Output variable is the variable that gets influenced by the dependent variable. The

dependent variable is the outcome variable on which comparisons are made for different values of the

independent variable. When the explanatory variable is categorical, it defines the groups to be compared

with respect to the response variable. When the explanatory variable is quantitative, we examine how

different values of the explanatory variable relate to changes in the response variable.

BIVARIATE ANALYSIS: CATEGORICAL DATA

  • Create a contingency table
  • Calculate a column or bar chart
  • Conditional proportions/percentages: Distribution of the dependent variable conditional on the

value of the independent variable

  • Difference in proportions/percentages: You have to compare proportions/percentages across

values of the independent variable at the same value of the dependent variable

Two categorical variables are independent if the population conditional distributions for one of them are

identical at each category of the other. The variables are dependent if the conditional distributions aren’t

identical.

Steps for the chi-squared test of independence

  1. Assumptions

a. Two categorical variable

b. Randomization

c. Expected count ≥ in all cells

  1. Hypotheses

a. H 0

: the two variables are independent

b. H a

: the two variables are dependent

  1. Test statistic

a. If H 0

is true, the in a contingency table the frequencies of the dependent variable should be

equal across categories of the independent variable, and should equal the expected cell

count

b. Compute the expected counts for each cell à 𝐸𝑥𝑝𝑒𝑐𝑡𝑒𝑑 𝑐𝑒𝑙𝑙 𝑐𝑜𝑢𝑛𝑡 =

( "#$ &#&'(

)

  • (+#(,-. &#&'()

/#&'( 0'-1(2 0342

c. Compute the X

2

à 𝑋

!

(#$%&'(&) +,-./ 0 123&+/&) +,-./)

!

123&+/&) +,-./

  1. Compute the P-value for the test statistic (=CHISQ.TEST)
  2. Conclusion: report the chi-squared and P-value and interpret in context. If a decision is needed,

reject H 0

when P-value ≤ significance level (generally 0,05)

BIVARIATE ANALYSIS: CONTINOUS DATA

  • Create a scatter-plot

If the association looks linear you can compute the correlation coefficient that summarizes the strength

and direction of the linear association between two quantitative variables. Denoted by r it takes values

between - 1 and +1. In Excel you can use the function =CORREL. The value of the correlation doesn’t depend

on the variables’ unit.

When the relationship has a straight-line pattern, we can analyze the data further by finding an equation

for the straight line that best describes that pattern. The regression line predicts the value for the response

variable y as a straight-line function of the value x of the explanatory variable. Let ŷ (y-hat) denote the

predicted value of y. The equation for the regression line has the form: ŷ = a + bx.

The slope b in the equation ŷ = a + bx equals the amount that ŷ changes when x increases by one unit. The

y-intercept a is the predicted value of y when x = 0. equals a = y ̄− b ( x ̄). The regression line is the optimal

line to fit through the data points by making the residuals as small as possible. This method produces the

line that has the smallest value for the residual sum of squares (residual = y – ŷ). In Excel you can find the

regression line using the function =LINEST or using the Data Analysis Tool pack.

Steps of two-sided significance test about a population slope

  1. Assumptions

a. Relationship in population satisfies regression model μ

y

= α + βx

b. Data gathered using randomization

c. Population y values at each x value have normal distribution, with same standard deviation

at x value

  1. Hypotheses

a. H 0

: b = 0

b. H a

: b ≠ 0

  1. Test statistic: 𝑡 =

($ 05 )

%&

, where software supplies sample slope b and its standard error se

  1. P-value: Two-tail probability of t test statistic value more extreme than observed, using t

distribution with df = n - 2 (supplied by software)

  1. Conclusions: interpret P-value in context. If a decision is needed, reject H 0

if P-value ≤ significance

level (such as 0.05).

The R

(R-squared or coefficient of determination) is a percentage of the variability in the response

variable , not a percentage of the response variable. It means that x % of the variability in the dependent

variable can be explained by the independent variable.

BIVARIATE ANALYSIS: MIXED DATA

  • IV is categorical and DV is continuous
  • Compute the measures of central tendency and variability by group
  • Box-plot of quantitative variable by group
  • Compare the means

If your categorical independent variable has only two categories (binary) then you can perform a t-Test to

test formally whether there is a statistically significant difference in means

Steps of two-sided significance test for comparing two population means

  1. Assumptions

a. A quantitative response variable observed in each of two groups

b. Independent random samples, either from random sampling or a randomized experiment

c. Approximately normal population distribution for each group. Or at least ni ≥ 30 (This is

mainly important for small sample sizes, and even the two-sided test is robust to violations

of this assumptions)

  1. Hypotheses

a. H 0

: μ 1

= μ 2

(the population mean of group 1 is equal to population mean of group 2)

b. H a

: μ 1

≠ μ 2

(one-sided H a

: μ 1

μ 2

or H a

:μ 1

< μ 2

  1. Test statistic: 𝑡 =

( 2 "

0 2 !

) 05

%&

  1. P-value: Two-tail probability from t distribution of values even more extreme than observed t test

statistic, presuming the null hypothesis is true with difference given by software. Use the Excel

function =T.TEST