Scarica Dispense Data Analysis IPLE e più Dispense in PDF di Analisi Dei Dati solo su Docsity!
DATA ANALYSIS – SUMMARY AND NOTES
What is data analysis
Data are empirical material organized into a form that can be analyzed
- Data analysis is a process of gathering and modelling data to discover useful information
o Primary data analysis: you collect data (survey, interviews...)
o
Secondary data analysis: you use someone else's data
History of spreadsheets
Born in 1979 à VisiCalc by Dan Bricklin released for the Apple II
- 1983: Lotus 1- 2 - 3 by Lotus Software àadded the possibility to create graphs
1985: Excel was born for Macintosh Apple II
- 1987 Excel was born also for Windows
Spreadsheets basic concepts
Has many functions to sort and analyze data
- Excel is a spreadsheet à table of cells with rows (represented by numbers) and columns
(represented by letters)
- Cell: a box; holds data
- Worksheet: grid of cells
- Values: raw data; can be based on a formula which might perform a calculation
- Formulas: they say how to compute new values from existing values
- Functions: arithmetic operations (summation, averages…), trigonometric functions, statistical
functions
Charts: geographical display of data
The Ribbon
- It consists in a series of tabs containing command buttons arranged into groups
- Every tab contains different groups, in which one can find command buttons and dialog box
launcher (containing additional options)
- Home à to create, format and edit a spreadsheet
Insert à to add particular elements (tables, illustrations, charts…)
- Page Layout à to prepare a spreadsheet for printing
Formulas à to add formulas/functions or to check the worksheet for formula errors
- Data à to import, query, outline and subtotal the data placed into a list
- Review à to proof, protect and mark up a spreadsheet for review by others
View à to change the display of the worksheet area and its data
- If you're working with particular objects, Excel can display contextual tools
o if you deselect the object, the contextual tool will disappear from the ribbon
The formula bar
- It displays the cell address in the name box (ex: A1)
- It presents a formula bar button
- It shows you the contents of the cell (even if the worksheet does not)
o
The content can be modified in this section
o Excel displays only the result of a formula in the cell, while here you can deal with the
entire formula and calculation
Values
Numbers that represent quantities (ex: 14 apples)
- Numbers that represent dates (ex: September 19th, 2022) or times (ex: 2pm)
Values align at the right edge of the cells
- To enter a negative quantity, one must begin the entry with the minus sign (-)
- Numbers can be rounded after the decimal point through the Numbers group in the Home tab
Formulas
- A formula is an equation that performs calculations
- To enter a formula, one always must start with the equal sign (=)
- Order of operations in a formula:
- Negative number
- Percent
- Exponentiation
- Multiplication and division from left to right
- Addition and subtraction from left to right
- To override the standard order of operation, one can use parentheses
Relative cell references
- A relative cell reference is one whose references change "relative" to the location where it is copied
or moved
- If you copy/drag the formula relative to the first two cells, the operation is copied (but the data
changes depending on the other cells) à each result references to its two neighboring cells on the
left
Absolute cell references
- An absolute cell reference refers to a specific cell or range of cells regardless of where the formula
is located in the worksheet
- To make an absolute cell reference, you use the dollar sign ($) before the column and row of the
cell you want to reference (ex: $B$4)
- The absolute cell will always maintain its value (both the operation and the data of the absolute cell
is copied à only the other (independent) cell in the operation will change its value)
Mixed cell references
- In a mixed cell reference, a column or a row is absolute, and the other is relative (ex: A$4 or $A4)
Variables
- A variable is an empirical measurement of a characteristic (ex: age, occupation, gender…)
o It can vary among different categories à values
- Variation of a variable may occur in 2 ways
o Over time on the same case (ex: age)
o Among cases at the same time
- A constant is a variable that does not change for all units of analysis
- Variables are classified based on the operations that can be carried out on them
o 3 classes of variables
§ Nominal
§ Ordinal
§ Continuous
- Measures of variability: it is concerned with the spread of the score (range, standard deviation,
interquartile range)
Frequency distribution: categorical variables
è To describe categorical data (nominal + ordinal variables) one lists the answer categories and shows the
frequency for each category
- We can use the excel function =COUNTIF(RANGE; VALUE)
- Relative frequency distribution: use of percentages and proportions instead of absolute frequency
o Proportion à number of observations in a category divided by the total number of
observations
o Percentage à proportion multiplied by 100
Frequency distribution: continuous variables
- There are many different values the variable can take à compute frequency distributions in bins
- Bins è consecutive, non-overlapping intervals of a variable
o Intervals should be exhaustive and mutually exclusive
- We divide the measurement scale into a set of bins and then count the number of observations in
each bin
- We can use the excel function =FREQUENCY(DATA ARRAY; BINS ARRAY)
MEASURES OF CENTRAL TENDENCY
- Measures of central tendency are statistics that show how a typical observation is like
- Continuous variables: mean, mode, median
- Ordinal variables: median and mode
- Nominal variables: mode (exception: binary variables)
Mode
- It’s the value that occurs the most frequently
- It’s appropriate for all kinds of variables
- We can use the excel function =MODE / =MODE.SNGL /
=MODE.MULT
o Arguments must be numbers or ranges
- If no data occurred more than once, the mode function returns #NA
- To compute the mode of ordinal/nominal values, you need to assign a number to each category
- Bimodal distribution à if two distinct mounds occur in the distribution
o This often happens when populations are polarized, responses concentrate on the extreme
values
Median
- It’s the observation that falls in the middle of the ordered sample à data must be listed from
smallest to largest or vice-versa
o Sample of odd observations à median is the middle value
o Sample of even observations à median is the average of the two middle values
- It’s not appropriate for nominal variables
- We can use the excel function =MEDIAN
o Arguments must be numbers or ranges
Mean
- It’s the sum of the observations divided by the number of observations
- It’s appropriate only for quantitative variables
- It’s influenced by outliers à an observation that falls well above or well below the bulk of data
- We can use the excel function =AVERAGE
o Arguments must be numbers or ranges
o It ignores empty cells and cells with text values
- If the data set doesn’t exhibit an excessive skew à you can use the mean
- If the data set exhibits a skew à you should use the median (not affected by outliers)
MEASURES OF VARIABILITY
- Measures of variability describe the spread of the data of a variable
- Range à can be computed for nominal, ordinal and continuous variables
- Standard deviation à can be computed for continuous variables
- Interquartile range à can be computed for quantitative variables
Range
- For ordinal/nominal variables, the range is the count of choices/categories recorded
o For example, an ordinal variable with three responses (not willing, undecided, willing) or a nominal
variable with three answers (Catholic, Protestant, Jews) have both a range of 3 given that subjects
can take up to three different values
- For interval variables, the range is the difference between the largest and the smallest
observation
o If you have a set of data such as 4, 2, 5, 8, 12, 15, the range is the highest number (15) minus the
lowest number (2). The range is 15-2 = 13
- Excel doesn’t have a formula
o Ordinal/nominal variables à look for the
number of categories
o Quantitative variables à =MAX(range)-
MIN(range)
- We can use the excel function =QUARTILE (range; quart)
o quart must be 0, 1, 2, 3, 4 according to the quartile you want
- there is no function for IQR à IQR = upper quartile – lower quartile
CREATING AND IMPORTING DATA
Autofill
- Can help with repetitive tasks like numbering rows or titling columns
- Enter the starting value of a series and then drag the fill handle (+)
- It works also for a text followed by a number (ex: Product 1, Product 2…)
Removing duplicates
- In datasets, each record (each row) must be unique
- In the Data tab, Data Tools group, click on the command Remove Duplicates
- Leave “my data has headers” box checked if the data has the variables names in the first row
Sorting data
- You may want to order values of a dataset in a specific order (ex: alphabetical order, numerical
order…)
- Sort from smallest to largest (ascending order)
o Text is placed in alphabetical order from A to Z
o Values are placed in numerical order from smallest to largest
o Dates are placed in order from oldest to newest
- Sort from largest to smallest (descending order)
o Text is placed in alphabetical order from Z to A
o Values are placed in numerical order from largest to smallest
o Dates are placed in order from newest to oldest
- Sorting can be done on a single field or multiple fields à you add levels to the sorting
- If you select only a portion of the table (only one column), excel will sort only that portion and the
records will be mismatched à Excels always warns you to adjust the sorting
- Excel can sort based on special formats applied to certain cells
Filtering data
- You can filter out rows that don’t pertain to what you’re searching for à you hide them
temporarily
o AutoFiller à headings row becomes a set of control to select the data you want
o Custom AutoFiller à uses a rule you create
o Cell Attributes à applies to the format of the cells (font, color, shade…)
- In the Data tab, click on Filter à a small button will appear in the first row to apply filters
Importing text files
- You can import data from text files like: .txt, .csv, .tsv
- In the Data tab, Get & Transform Data group, select the option From Text/CSV
o Then select the downloaded file and click con Import
o Select the delimiter
o Then select Load
Exporting text files
- You can save your dataset as a .csv file when saving the document
Text to column
- You can separate the contents of one cell in separate columns
- In the Data tab, Data Tools group, click on Text to Columns
o Select the delimiter and then click finish
Concatenate
- You can merge the contents of two or more cells
- Use the Excel function =CONCATENATE (text1; text2)
TABLES AND PIVOT TABLES
Excel tables
- You can transform a dataset into a table à different formatted, top row is frozen, columns are
filterable
- In the Insert tab, click Table , click OK on the dialog box
- You can add a Total Row (check the box in the Table Style Options panel)
o Can compute functions very easily
- To remove the table formatting, click on the Convert to Range button in the Table tab
Excel structured reference
- It’s a special way for referencing tables and their parts that uses a combination of table and
column names instead of cell addresses
- To perform the same calculation in each table row, it’s enough to enter a formula in just one cell à
all other cells in that column are filled automatically
- To compute function using structured references, you type the usual function and then select the
cell à excel will pick up the column names and the whole column will be auto filled
Pivot tables
- It’s a table of statistics that summarizes the data of a more extensive table à it reveals
relationships and trends
- You can easily pivot fields from a row to a column and vice-versa
- In the Insert tab, in the Tables group, click on PivotTable
o A dialog box will appear, the location of the Pivot table is in a new worksheet
- You can select the variables to appear in the table and how to summarize the data
o You can populate the 4 sections by dragging and dropping the variables name
Pivot tables: filtering, sorting and summary calculations
- You can sort the columns: right click in Sort > sort largest to smallest
- You can filter the columns by clicking on the box in the cell
- Excel summarizes data by
o Summing the items for continuous variables
o Counting the items for nominal or ordinal variables
- To change the summary calculation, right click any cell > click on Value Field Setting > choose the
calculation you need
Pivot tables: grouping
- Items can be grouped into different categories
- To create a new group:
o Select the items you want to group
o Right click and click on Group
o Then select the other “left-out” items
o Right click and click on Group
- Excel will summarize the values based on the new groups
- To ungroup: right click > click on Ungroup
- You can group dates by quarters
Group 1
Group 2
color scale thumbnail. They are great for identifying the distribution of values
across a large range of data
Icon Sets opens a palette with different sets of icons that you can apply to the cell
selection to indicate their values relative to each other by clicking the icon set.
Icon sets are great for quickly identifying the different ranges of values in a range
of data
New Rule opens the New Formatting Rule dialog box where you define a custom
conditional formatting rule to apply to the cell selection. You need to specify the
rule that identifies which values or text entries to format, but also the colors and
other aspects included in the formatting
Clear Rules opens a continuation menu where you can remove conditional formatting rules
for the cell selection by clicking the Clear Rules from Selected Cells option, for the
entire worksheet by clicking the Clear Rules from Entire Sheet option, or for just
the current data table by clicking the Clear Rules from This Table option
Manage Rules opens the Conditional Formatting Rules Manager dialog box where you edit and
delete particular rules as well as adjust their rule precedence by moving them up
or down in the Rules list box
Data validation
- It helps to ensure that data gets entered correctly , before it gets processed incorrectly
- You can set up rules based on what you’re entering (ex: ZIP codes have 5 digits, the rules marks
codes with 4 or 6 digits as a mistake)
- Select the cells > in the Data tab, Data Tools group, click on Data Validation
o Choose text length and enter the rule
o Insert the Error message that will appear when a mistake is committed
EXCEL CHARTS
- A chart is a graphical representation of numeric data
Column chart
- Used to compare frequencies across categories
- To create: select cells (% of seats, # of votes…) > cmd > select cells (names
of parties…) > Insert tab > column symbol
- The separated bars emphasize that the variable is categorical and not
continuous
Formatting charts
- Design tab > Add Chart Elements , Quick Layout , Change Chart Type
- Double click on the chart à a pane on the right-hand side will open up
- You can switch the row/column à display data as 2 distinct variables
- You can change the position of the legend
o Right-click on the legend > Format legend
o A pane will open up and you can customize it
- You can add data labels to focus the reader’s attention
o Select the chart > Add chart element > Data Labels
- You can move the chart to a new worksheet
Bar chart
- It’ the horizontal version of a column chart
o When you have long category names
o When the number of categories is great
- To create: select the data > Insert tab > column symbol > bar chart
Pie chart
- Used to show distribution of few categorical variables à show clear
differences of the values
- To create: select data > cmd > select the % > Insert tab > pie symbol
- You can “extract” a value by dragging the portion out of the pie
Histograms
- Used to show the frequency distribution for a continuous variable
- You need to divide the values in intervals, every bar shows the number
of observation per interval
- You can easily spot the min, the max and eventual symmetry
- To create: select the data > Insert tab > histogram symbol
Box plot
- Used to display the central tendency of continuous variables
- To create: select the data > Insert tab > statistic chart symbol > box and whisker
- The x in the box represents the mean
- It shows the interquartile range and the median (horizontal line in the box)
- It shows the minimum and the maximum with the whiskers
- Outliers are represented by small dots
Line chart
- Used to show trends for continuous data
- To create: select data > Insert tab > line symbol
o Click on select data (in Chart design tab)
o Click on Horizontal axis labels and select the data
- You can add lines for other subject by clicking on Select data
Area chart
- Used to present accumulative value change over time
- To create: select data > Insert tab > line symbol > area
- If areas overlap, a Stacked area chart is to be used
Example
- Research question : Do less and more wealthy individuals differ in the proportion of happiness?
- Independent variable (IV) à family income
- Dependent variable (DV) à happiness
- Define how to measure the variables
- Happiness à very happy, pretty happy, not too happy (ordinal variable)
- Income à below average, average, above average (ordinal variable)
- Create a two-way or contingency table
- Rows à list the categories of independent variable
- Columns à list the categories of dependent variable
- The entries are the joint frequencies of the two categorical variables
- Convert frequencies to percentages (called conditional proportions because their formation is
conditional on the level of income)à right-click on one cell and select show values as > % of row
total
- The percentages on the row are called conditional percentages à refer to the distribution of
happiness, conditional on the family income
- The distribution is called the conditional distribution for happiness, given a particular income level
- Proportions are the estimated conditional probabilities
- Create column / bar charts
- Use charts for proportions or percentages
- Each column will show the conditional distribution of happiness for that income category
- You can also use the stacked column chart
- Independence and dependence of association
- Two categorical variables are independent if the conditional distribution are identical at each
category of the other
- The two categorical variables are dependent if the conditional distributions are not identical
- Testing categorical variables for independence
- Define the two hypothesis
o H 0
: the two variables are independent
o H a
: the two variables are dependent
is true, the frequencies of the dependent variable should be equal
across categories of the independent variable, and they should equal
the expected cell count
- The expected cell counts are values that satisfy the null hypothesis of
independence
- the Chi-squared test statistic
- it’s an overall measure of how far the observed cell counts in a contingency table fall from the
expected cell counts for a null hypothesis
2
value, the greater the evidence against H 0
and in support of the H a
- if the P-value is less or equal to the significance level (≤ 0.05) à reject the H 0
(dependent)
- if the P-value is greater than the significance level (> 0.05)à cannot reject H 0
(independent)
- to find the P-value, use the excel formula =CHISQ.TEST (actual_range, expected_range)
Use the pivot table to
create the cross-tabulation
of the data
Our data is different
from the expected cell
count à the variables
are dependent
o actual_range à observed count from pivot table
o expected_range à expected cell counts
- use the IF function
- use the excel formula =IF (logical_test, value_if_true, value_if_false) to have an automatic feedback
of the level of independent
o logical_test à select the cell with the P-value and add <=0.
o value_if_true à “dependent”
o value_if_false à “independent”
BIVARIATE ANALYSIS: QUANTITATIVE DATA
- We use the scatterplot , a graphical display for 2 quantitative variables
o Horizontal axis (x) for the independent variable
o Vertical variable (y) for the dependent variable
- Positive association à x goes up, y tends to go up // x goes down,
y tends to go down
- Negative association à x goes up, y tends to go down // x goes
down, y tends to go up
Correlation
- Linear relationship à data points follow a roughly straight-line trend
- The correlation summarizes the strength and direction of the linear
association between 2 quantitative variables
o It takes values between - 1 and +
o Positive r indicates a positive association, negative r indicates
a negative association
o The closer r is to ±1, the closer the data points fall to a straight
line and the stronger the linear association is
o The closer r is to 0, the weaker the linear association is
- You can use the excel function =CORREL(array1; array2)
- One must always check the correlation with the scatterplot à r could be 0, but the scatterplot
could indicate a different association (ex: U-shaped)
Regression
- We can find an equation for the straight line of the correlation that best describes the pattern
o it can be used to predict the value of the variable designated as the dependent variable
from the value designated as the independent variable
- The regression line predicts the value for the dependent variable y as a straight-line function of the
value x of the independent variable ( ŷ = a + bx )
o a is the y-intercept à predicted value of y when x = 0
o b is the slope
- You can use the excel function =LINEST(known_y’s; [known_x’s];
[const]; [stats])
o known_y’s à range of values of DV
o known_x’s à range of values of IV
o const à logical value specifying whether to force the constant b to equal 0 (If const is TRUE
or omitted, b is calculated normally // If const is FALSE, b is set equal to 0 and the m-values
are adjusted to fit y = mx)
o stats à logical value specifying whether to return additional regression statistics (If stats is
TRUE, LINEST returns the additional regression statistics // If stats is FALSE or omitted,
LINEST returns only the m-coefficients and the constant b)
BIVARIATE ANALYSIS: MIXED ASSOCIATION
- We have a dependent variable that is quantitative and an independent variable that is categorical
Comparing descriptive statistics by group
- Create a box-plot of the DV for each value of the IV
- Compute measures of variability and central tendency for DV across categories of IV
- Compare the statistics to see if there’s variation across the categories of the IV
Example
We are interested in testing whether there is a difference in support for the European Union between
Spanish and Italian citizens
- IV: country of origin; DV: support for Eu integration on a 1 to 11 range
- Create a box-plot
o In Spain, upper whisker (max) coincides with the 3
rd
quartile
o In terms of range, there’s no difference
o The mean is higher in Spain (8) than in Italy (7)
o The median is higher in Spain
t-Test
- If the IV has only 2 categories (binary), then you can perform a t-Test
- It formally tests whether the means of quantitative variables are different and if they are likely to
have come from the same population
- Related/paired test à same population rated the 2 policies
- Two sample equal/unequal variance test à two populations rated the 2 policies
- One-tailed t test à tests that the mean of group 1 is > than the mean of group 2, or that the mean
of group 1 is < than the mean of group 2
- Two-tailed t-test à tests that the mean of group 1 is ≠ from the mean of group 2
- You can use the excel function =T.TEST (array1; array2; tails; type)
o Array1: range of values for the first group of independent variable
o Array2: range of values for the second group of the independent variable
o Tails: if 1, then it runs a one-tailed test; if 2, then it runs a two-tailed test
o Type: if 1, it runs a paired test; if 2, it runs a two-sample equal variance test (assumes they
have equal variance); if 3, it runs a two-sample unequal variance test (assumes they have
unequal variance)
- A p-value <0,05 implies that the means are different (reject the H 0
Example – t-Test for EU integration
- Reorganize the data à in a new worksheet, add only the values for Eu integration for SPA/ITA
2. H
0
: the means are the same = same population
Ha: there is a significant difference in means in the support for EU integration
- Calculate =T.TEST
- The p-value: 4,12 x 10
< 0,05 à reject the null hypothesis (H 0
) and H a
was correct
Another way to perform the t-Test
- You can use the Data Analysis tool in the Data tab
- Select t-Test: Two-sample assuming unequal variances
o Variable 1: DV of the first group
o Variable 2: DV of the second group
o Hypothesized mean difference: 0
o Alpha: 0,05 (p-value)
- We get a two-tailed test à if the P-value is < 0,05 you reject the null hypothesis
Italy is less
supportive
IV with more than 2 categories
- Create a box-plot
- Create the pivot table for descriptive statistics
- Performs a One-Way Analysis Of Variance (ANOVA) à tests whether the average of DV is different
across at least two groups of IV
CAUTIONS IN ANALYZING ASSOCIATIONS
- Extrapolation à it refers to using a regression line to predict y values for x values outside the
observed range of data
o We essentially assume that the future will have the same trend of the past, which is very
risky
- Outliers à it’s an observation that is very high or very low compared to the rest of the values
o They can be very influential, and the regression line may be affected (inflate or deflate the
association)
- Probabilistic nature of statistical association à statistical association aren’t deterministic
o We cannot generalize and we should stick to averages
- Correlation doesn’t imply causation à an association between two variables isn’t enough to imply
a causal connection
o We need a theory to link dependent variables to independent variables
o There may be other variables influencing the correlation, called lurking variables
LOOKUP AND REFERENCES
- The XLOOKUP function allows to find data and then returns the item corresponding to the first
match it finds
- We use the excel function =XLOOKUP(lookup_value, lookup_array, return_array, [if_not_found],
[match_mode], [search_mode])
o Lookup_value: the value to look for
o Lookup_array: array or range to search
o Return_array: array or range to return
PRESENTING THE WORK
- Figures and tables can be used to present data, clarify interpretations and explain concepts
Figures
- They help the reader understand what has been written
- They must always be described
- Things to consider when entering a figure
o Graph format
o Colors à use them only if there’s a specific reason,
otherwise use a greyscale
o Labelling and titling à axes and data series must
always be labelled, and the graph must be titled
o Source à add a caption or note with the reference to
the source
- There are different pasting options when exporting it to Word
à choose the most appropriate to the situation
Tables
- They present a summary of the data
- It will include relative frequencies, the means, standard deviation, statistics…
- Use tables when
o You compare or look up individual values