Docsity
Docsity

Prepara i tuoi esami
Prepara i tuoi esami

Studia grazie alle numerose risorse presenti su Docsity


Ottieni i punti per scaricare
Ottieni i punti per scaricare

Guadagna punti aiutando altri studenti oppure acquistali con un piano Premium


Guide e consigli
Guide e consigli


Dispensa Data Analysis (IPLE - 1st year), Dispense di Analisi Dei Dati

Dispensa discorsiva di analisi dati: slide del docente integrate con appunti presi a lezioni.

Tipologia: Dispense

2023/2024

Caricato il 16/02/2025

mtldvlntn
mtldvlntn 🇮🇹

12 documenti

1 / 31

Toggle sidebar

Questa pagina non è visibile nell’anteprima

Non perderti parti importanti!

bg1
Matilda Valentino IPL
DATA ANALYSIS
data are empirical material organized into a form that can be analyzed. Units of analysis are the objects/
subjects to which the properties investigated pertain.
data analysis is a process of gathering inspecting, cleansing, transforming, and modeling data with the
goal of discovering useful information, informing conclusions and supporting decision-making
A. Primary data analysis: you collect and categorize (coding process) your own data through
interviews, official documents, experiments, surveys, etc.
B. Secondary data analysis: you use data sets that have been gathered by others and have subsequently
been deposited in databases (: existing archived collections of data).
VARIABLES
A variable is an empirical measurement of a characteristic (example: age, number of siblings, occupation,
education, gender, party in government, etc…)
Key features of a variable:
A. A name
B. At least two values
A variable can vary among different categories, values, which correspond to different states/categories of
the property
Variation of a variable may occur in two ways:
A. over time, on the same case
B. among cases, at the same time
(example think about age: it changes over time for you, and it changes across all of you)
A constant is a variable that does not change for all units of analysis.
Variables can be classified according to the operations that can be carried out on them based on the logical-
mathematical characteristics of a variable. There are 3 cases of variables:
A. continuous (interval/ratio)
B. categorical
1. ordinal
2. nominal
Nominal variables
A nominal variable is one that has two or more categories, but there is no intrinsic ordering to the
categories.
The characteristic recorded by the variable takes on non-orderable, discrete categories.
A. Discrete: the variable takes on a range of finite categories and there are no inter-mediate categories.
B. Non-orderable: no order or hierarchy can be established.
Classification is the process of assigning categories to each unit of analysis. A classification must be
exhaustive and mutually exclusive.
The only relationships that we can establish among the categories of a nominal variable are those of
‘equality’ and ‘inequality’.
(an example of nominal variable can be the religion. Some categories could be: catholic, jewish, muslim,
protestant, atheist, agnostic, other).
Ordinal variables
An ordinal variable is similar to a categorical variable, but there is a clear ordering of the categories and
communicates relative differences between units of analysis.
The property to be recorded assumes orderable discrete categories.
Orderable: it exists an order, which enables us to establish not only relationships of equality and inequality
among the categories, but also relationships of order - that is ‘greater than’ (>) and ‘less than’ (<).
In an ordinal variable, the distance between one category and the next is not known.
Categories have the ordinal, but not the cardinal properties of numbers
(an example of ordinal variable can be the evaluation of life satisfaction: Some categories could be: very
satisfied; fairly satisfied; not very satisfied; not at all satisfied).
Continuous variables
A continuous variable is similar to an ordinal variable, except that the intervals between the values of the
variable are equally spaced.
The distances between the categories can be determined because we have a reference unit (a measurement
unit or a counting unit).
1
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b
pf1c
pf1d
pf1e
pf1f

Anteprima parziale del testo

Scarica Dispensa Data Analysis (IPLE - 1st year) e più Dispense in PDF di Analisi Dei Dati solo su Docsity!

DATA ANALYSIS

  • data are empirical material organized into a form that can be analyzed. Units of analysis are the objects/ subjects to which the properties investigated pertain.
  • data analysis is a process of gathering inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions and supporting decision-making A. Primary data analysis: you collect and categorize (coding process) your own data through interviews, official documents, experiments, surveys, etc. B. Secondary data analysis: you use data sets that have been gathered by others and have subsequently been deposited in databases (: existing archived collections of data). VARIABLES
  • A variable is an empirical measurement of a characteristic (example: age, number of siblings, occupation, education, gender, party in government, etc…)
  • Key features of a variable: A. A name B. At least two values
  • A variable can vary among different categories, values, which correspond to different states/categories of the property
  • Variation of a variable may occur in two ways: A. over time, on the same case B. among cases, at the same time
  • (example think about age: it changes over time for you, and it changes across all of you)
  • A constant is a variable that does not change for all units of analysis.
  • Variables can be classified according to the operations that can be carried out on them based on the logical- mathematical characteristics of a variable. There are 3 cases of variables: A. continuous (interval/ratio) B. categorical 1. ordinal 2. nominal Nominal variables
  • A nominal variable is one that has two or more categories, but there is no intrinsic ordering to the categories.
  • The characteristic recorded by the variable takes on non-orderable, discrete categories. A. Discrete: the variable takes on a range of finite categories and there are no inter-mediate categories. B. Non-orderable: no order or hierarchy can be established.
  • Classification is the process of assigning categories to each unit of analysis. A classification must be exhaustive and mutually exclusive.
  • The only relationships that we can establish among the categories of a nominal variable are those of ‘equality’ and ‘inequality’.
  • (an example of nominal variable can be the religion. Some categories could be: catholic, jewish, muslim, protestant, atheist, agnostic, other). Ordinal variables
  • An ordinal variable is similar to a categorical variable, but there is a clear ordering of the categories and communicates relative differences between units of analysis.
  • The property to be recorded assumes orderable discrete categories.
  • Orderable: it exists an order, which enables us to establish not only relationships of equality and inequality among the categories, but also relationships of order - that is ‘greater than’ (>) and ‘less than’ (<).
  • In an ordinal variable, the distance between one category and the next is not known.
  • Categories have the ordinal, but not the cardinal properties of numbers
  • (an example of ordinal variable can be the evaluation of life satisfaction: Some categories could be: very satisfied; fairly satisfied; not very satisfied; not at all satisfied). Continuous variables
  • A continuous variable is similar to an ordinal variable, except that the intervals between the values of the variable are equally spaced.
  • The distances between the categories can be determined because we have a reference unit (a measurement unit or a counting unit).
  • The values of a continuous variable have full numerical meaning.
  • They possess not only ordinal but also the cardinal properties of numbers: A. We can carry out on these variables all mathematical operations and statistical procedures LIST OF BASIC MATH OPERATIONS SUMMARY OF TYPES OF VARIABLES HOW TO STRUCTURE A DATASET The cardinal rules to collect data
  1. Put:
    • variables in columns (the thing you are measuring, like “age” or “country”)
    • observations in rows, and
    • data (values) in cells
  2. Include a unique identifying number for each observation (ID).
  3. Put variable names in the first row.
    • Be sure that each variable name is unique
    • Variable names must start with a letter, and do not include special characters (#, !, ?, %, etc.) or spaces in your variable names.
    • Choose recognizable names for variables - but not too long (<= 16 characters best).
  4. Use a separate column for each piece of information.
    • For example, if you are collecting data on parties don't enter data as “Democratic Party-USA”. Include a separate column for each info: “Democratic Party” and “USA”.
  5. When entering dates include a 4-digit year.
    • The best format for dates is mm/dd/yyyy, where mm is a 2-digit month, dd is a 2-digit day and yyyy is a 4-digit year.
  6. Decide on "missingness" conventions.
    • To enter a missing data value either enter a blank or an easily recognizable character code or number (e.g., NA). Be sure that it cannot be confused with a ‘real' data value.
  7. Be consistent in your data entry
    • When entering data keep the same format throughout the process.
    • 2 types of examples:

A. we first divide the measurement scale for the variable into a set of bins (intervals). B. then we count the number of observations in each bin.

  • For example let’s compute the frequency distribution for the variable age
  • First, let’s define the bins by specifying the upper levels of each interval in the range G2:G
  • The upper levels in column G are equivalent to the intervals specified in column F
  • Then you need to select the range H2:H7 (extra cell), enter the FREQUENCY function =FREQUENCY(D2:D21,G2:G6) and finish by pressing CTRL+SHIFT+ENTER on a Windows computer and ENTER on a Mac Descriptive statistics:
  • Measures of central tendency
  • Measures of variability and position Central tendency
  • Measures of central tendency: they are statistics that show what a typical observation is like. A. Mode: the most frequently occurring value. B. Median: the middle value, the score that has 50% of the values above and 50% of the values below. If the sample size is even, the median represents the average score of the two middle values. C. Mean: the mean is the sum of the observations divided by the number of observations
  • According to the level of measurement of variables we choose different measures of central tendency A. Continuous variables: mean, mode, median are all informative. B. Ordinal variables: mean only loosely interpretable, median and mode have more precision. C. Nominal variables: mean and median have no meaning, mode remains useful. (exception: binary variables) Mode
  • The mode is the value that occurs most frequently.
  • The mode is most commonly used with highly discrete variables, such as with categorical data.
  • Properties of the mode A. It is appropriate for all types of data. B. example: you may measure the mode for religion in Australia (nominal scale), for the grades given by a teacher (ordinal scale), or for the number of years of education completed by Germans (interval scale). How to find the mode?
  1. You can look at a frequency distribution and search for the most frequent value.
  2. Alternatively, Excel can compute the mode using the =MODE(), =MODE.SNGL(), =MODE.MULT()
    • Syntax =MODE (number1, [number2], …) =MODE.SNGL(number1, [number2], …) =MODE.MULT(number1, [number2], …)
  • Arguments number1 - A number or cell reference that refers to numeric values. number2 - [optional] A number or cell reference that refers to numeric values.
  • Numbers can be supplied as numbers (0,1,2,3.5, etc.), ranges, (F2:F20) to compute the mode for all values included in the range F2 to F20. cell references that contain numeric values (F2,F3,F4,F10,etc.) Mode warnings!!!
  • If no data value occurred more than once, the MODE function returns #NA, this is often the case with continuous data.
  • If at least two data values occurred with the same frequency, Excel will decide which value to report based on the argument position: it will return the first value listed.
  • If you want to compute the mode of an ordinal or nominal variable you will need first to recode each category with a specific number otherwise Excel won’t be able to compute it and will return #NA. Properties of the mode
  • It is appropriate for all types of data. For example, you may measure the mode for religion in Australia (nominal scale), for the grades given by a teacher (ordinal scale), or for the number of years of education completed by Germans (interval scale).
  • A frequency distribution is called bimodal if two distinct mounds occur in the distribution. Bimodal distributions often occur with attitudinal variables when populations are polarized, with responses tending to concentrate on the extreme values.
  • The mean, median, and mode are identical for a unimodal, symmetric distribution, such as a bell-shaped distribution.
  • Bimodal distribution: A frequency distribution is called bimodal if two distinct mounds occur in the distribution. Bimodal distributions often occur with attitudinal variables when populations are polarized, with responses tending to concentrate on the extreme values. Median
  • The median is the observation that falls in the middle of the ordered sample (data listed from smallest to largest or from largest to smallest).
  • If a sample contains an odd number of observations, the median is the observation that has as many observations below it as above it. (example: if we have a variable with five values 3, 7, 11, 15, 17 the median will be 11)
  • If a sample includes an even number of observations, you can simply average the two middle observations. (example: if we have a variable with six values 3, 7, 11, 13, 15, 17 the median will be 12, the average between 11 and 13) Properties of the median
  • The median is appropriate for interval and ordinal*-scale data. It is not appropriate for nominal-scale data, since the observations cannot be ordered.
  • For symmetric distributions the mean and the median are identical. How to find the median?
  1. Excel can compute the median with the function =MEDIAN()
  2. The MEDIAN function returns the median (middle number) in a group of supplied numbers. For example, =MEDIAN(1,2,3,4,5) returns 3
  • Syntax =MEDIAN(number1, [number2], …)
  • Arguments number1 - A number or cell reference that refers to numeric values. number2 - [optional] A number or cell reference that refers to numeric values.
  • Numbers can be supplied as numbers, ranges, or cell references that contain numeric values.
  • Cells that contain text, logical values, or no value are ignored. Mean
  • The mean (or average) is the sum of the observations divided by the number of observations. (example, the arithmetic mean of five values: 4, 36, 45, 50, 75 is 210/5=42). Properties of the mean
  • The mean is appropriate only for quantitative variables
  • It is not sensible to compute the mean for observations on a nominal scale (for instance, for religion measured with categories such as Protestant, Catholic, Muslim, Jewish, Other, the mean religion does not make sense).
  • The mean can be highly influenced by an observation that falls well above or well below the bulk of the data, called an outlier.
  • The mean is pulled in the direction of the longer tail of a skewed distribution, relative to most of the data How to find the mean?
  1. Excel can compute the mean with the function =AVERAGE()
  2. The AVERAGE function calculates the average of numbers provided as arguments. To calculate the average, Excel adds the numbers together and divides by the total number of values supplied. For example, =AVERAGE(2,4,6) returns 4.
  • Syntax =AVERAGE(number1, [number2], …)
  • Arguments

number2 - [optional] A number or cell reference that refers to numeric values.

  • Numbers can be supplied as numbers, ranges, or cell references that contain numeric values. Standard deviation
  • example: the three distributions in the figure below all have the same mean ($25,000) and range ($50,000), but they differ in variability about the centre.
  • In terms of distances of observations from the mean, nation A has the most variability, and nation B the least. The incomes in nation A tend to be farthest from the mean, and the incomes in nation B tend to be closest.
  • This is why we also use another measure of variability for continuous variables: the standard deviation.
  • It’s one of the most important measures of spread (/dispersion/variability) of a variable from its mean together with the range, and the variance. The standard deviation s of n observations is:
  • Where:
  • yi is the of value of a single observation i.
  • ӯ is the mean
  • (yi - ӯ) is the deviation of observation i from the mean.
  • The expression Σ(yi - ӯ)2 in these formulas is called a sum of squares. It represents squaring each deviation and then adding those squares.
  • The larger the deviations, the larger the sum of squares and the larger s tends to be.
  • The formula might look complicated, the most basic interpretation of the standard deviation s is quite simple: s is a sort of typical distance of an observation from the mean. So the larger the standard deviation s , the greater the spread of the data around the mean. How to find SD?
  1. Excel can compute the standard deviation with the function "=STDEV( range )" to find the standard deviation of the data on versions of Excel before 2010, and either "=STDEV.S( range )" or "=STDEV.P( range )" on newer versions of Excel.
  2. The "S" version is for data that represents only a sample of the whole population, such as if you want to draw conclusions about the whole school's scores but only have data from one class.
  3. The "P" version is for data from the whole population, when you're interested only in your class, or if you have all the scores from the school.
  • Syntax =STDEV(number1, [number2], …) =STDEV.S(number1, [number2], …) =STDEV.P(number1, [number2], …)
  • Arguments number1 - A number or cell reference that refers to numeric values. number2 - [optional] A number or cell reference that refers to numeric values
  • Numbers can be supplied as numbers, ranges, or cell references that contain numeric values. —> example: consider these two samples of quiz scores for two small samples of students that have both mean of 5 and range of 10: Sample 1: 0, 4, 4, 5, 7, 10 Sample 2: 0, 0, 1, 9, 10, 10 By inspection sample 1 shows less variability about the mean than sample 2. Most scores in sample 1 are near the mean of 5, whereas all the scores in sample 2 are quite far from 5. The standard deviation for sample 1 is equal to 3.34, while for sample 2 to 5.1. Since 3.34 ≤ 5.14, the standard deviations tell us that sample 1 is less variable than sample 2. SD properties
  • s ≥ 0
  • s = 0 only when all observations have the same value (for instance if the ages for a sample of five students are 19, 19, 19, 19, 19, then the sample mean equals 19, each of the five deviations equals 0, and s = 0. This is the minimum possible variability).
  • The greater the variability about the mean, the larger is the value of s.
  • The reason for using (n - 1), rather than n , in the denominator of s is a technical one regarding inference about population parameters. When we have data for an entire population, we replace (n - 1) by the actual population size; the population variance is then precisely the mean of the squared deviations. In that case, the standard deviation can be no larger than half the range.
  • If the data are rescaled, the standard deviation is also rescaled. For instance, if we change annual incomes from dollars (such as 34,000) to thousands of dollars (such as 34.0), the standard deviation also changes by a factor of 1000 (such as from 11,800 to 11.8). Position
  • Measures of position tell us the point at which a given percentage of the data fall below (or above) that point.
  • The most typical measures of position used in statistics are percentiles and quartiles.
  • Some measures of position describe centre (the median) and some describe variability (the interquartile range). (example: the 62nd percentile is the data value for which 62% of the data fall below, and the third quartile is the data value for which 3/4 of the data fall below).
  • The median is a measure of position: splitting the values in two, with half the data falling below it and half above it. Position/ Percentile
  • The pth percentile is the point such that p% of the observations fall below or at that point and (100 - p)% fall above it.
  • Substituting p = 50 in this definition gives the 50th percentile. This is the median. The median is larger than 50% of the observations and smaller than the other (100—50)=50%.
  • Two other commonly used percentiles are the lower and the upper quartile.
    1. The 25th percentile is called the lower quartile. One quarter of the data fall below the lower quartile.
    2. The 75th percentile is called the upper quartile. One quarter of the data fall above the upper quartile. Position/ Interquartile Range/ Quartile and other percentage
  • The difference between the upper and lower quartiles is called the interquartile range, denoted by IQR.
  • This measure describes the spread of the middle half of the observations.
  • An advantage of the IQR over the ordinary range or the standard deviation is that it is not sensitive to outliers.
  • The quartiles together with the median split the distribution into four parts, each containing one-fourth of the observations. Percentiles
  • To compute the pth percentile of a distribution of a continuous variable you can use the built-in function: =PERCENTILE(range; pth)
  • The first argument of the function specifies the range of values on which the computation should be performed.
  • The second argument of the PERCENTILE function must be a decimal number between 0 and 1. This represents the proportion of your interest. (example: if you want to compute the 30th percentile of a set of data you should write in a cell: =PERCENTILE(range; 0.3) Quartiles
  • To compute the quartiles of a distribution of a continuous variable you can use the built-in function =QUARTILE(range, quart)
  • The first argument of the function specifies the range of values on which the computation should be performed.
  • The second argument of the QUARTILE function must be 0, 1, 2, 3 or 4 according to the quartile you want.

- 0 will return the minimum value of the distribution

- 1 the lower quartile (25%)

- 2 the median (50%)

- 3 the upper quartile (75%)

- 4 the maximum value of the distribution

Interquartile range

  1. Select the range with full names.
  2. On the Data tab, in the Data Tools group, click Text to Columns.
  3. The following dialog box appears. Choose Delimited and click Next.
  4. Clear all the check boxes under Delimiters except for the Comma and Space check box.
  5. Click Finish. NB This example has commas and spaces as delimiters. You may have other delimiters in your data. Experiment by checking and unchecking the different check boxes. You get a live preview of how your data will be separated. CONCATENATE
  • What instead if you need to merge the content of two or more cells?
  • You can use the function =CONCAT(text1, [text2], …) or =CONCATENATE(text1, [text2], …) REMOVING DUPLICATES How to remove duplicates?
  • On the Data tab, in the Data Tools group, click Remove Duplicates. The Remove Duplicates dialog box appears.
  • In the Columns list:
  1. check all the columns if you want Excel to find and remove records that contains the same value/text in each variable/field.
  2. alternatively, check only one or specific columns if you want Excel to keep only records with unique values in the specified variable(s)/field(s).
  • Leave the “My data has headers” box checked if your data has headers (variable names in the first row). This way, Excel won’t treat the row with variable names as though it contains data.
  • Click OK. Excel responds with a dialog box stating n duplicate values were removed.
  • Click OK to dismiss the dialog box. How to remove all duplicates from a dataset?
  • Click on Remove Duplicates button in the Data tab.
  • Tick all columns and click on Ok. How to find the unique values of one variable?
  • Click on Remove Duplicates button in the Data tab.
  • Tick only the column that refers to the variable you are interested in, and click Ok. FILTERING DATA
  • With Excel, there’s a way for you to filter out just those rows that don’t pertain to what you’re searching for.
  • Similarly with any database, when you make a request or a query for just the records that meet particular criteria, you expect to see only the relevant data, and for irrelevant or non-matching data to be filtered out.
  • This does not change the database, and you don’t delete any rows with a filter. You just hide them temporarily.
  • There are different ways of filtering data in Excel:

- Using AutoFilter: This converts the headings row of your table into a set of controls, which you then

use to choose your criteria and then select the data you want to see.

- Creating a Custom AutoFilter: A custom AutoFilter uses a rule that you create, instructing Excel how to

evaluate the entries in each row. The result of that evaluation determines whether rows are displayed or filtered out.

- Filtering Data Using Cell Attributes: If cell’s formatting is applied to a column of cells, that formatting

is something that Excel can recognize when filtering data. So you can easily have a filter hide rows where cells in a column don’t have a particular format, such as a shaded background or a font color, or have a particular icon. How to use the filter in Excel?

  • First, add the filter drop-down buttons to each column of your dataset after the field names by:
    1. Moving the cell cursor to a cell of the database;
    2. Go to the Data tab and click on the Filter icon;
  • Now click on the filter drop-down of the field you want to use to filter the data and select the records that are relevant to you, or define a custom filter, or filter based on cell formatting.
  • Excel then displays only those records that contain the value(s) or formatting you selected in that field, while all other records are hidden temporarily. How to use custom filters?
  • To create a custom filter for a field, you click the field’s AutoFilter button and then highlight Text Filters, Number Filters, or Date Filters (depending on the type of field) on the drop-down list and then click the Custom Filter option at the bottom of the continuation list. When you select the Custom Filter option, Excel displays a Custom AutoFilter dialog box.
  • In this dialog box, you select the operator that you want to use in the first drop-down list box.
  • See Table below for operator names and what they locate.
  • Then enter the value (text or numbers) that should be met, exceeded, fallen below, or not found in the records of the database in the text box to the right. How to filter data using cell attributes?
  • You can also filter rows where cells in a column have a particular format, such as a shaded background or a font color. Copy-pasting filtered data? After you filter a data list so that only the records you want to work with are displayed, you can copy those records for instance in another worksheet in the workbook.
  • Simply select the cells, then click the Copy button on the Home tab or press Ctrl+C/Cmd+C, move the cell cursor to the first cell where the copied records are to appear, and then press Enter.
  • After copying the filtered records, you can then redisplay all the records in the database by clearing the filter clicking on the Clear button. Applying multiple filters If you find that filtering the data list by selecting a single value in a field drop-down list box gives you more records than you really want to contend with, you can further filter the database by selecting another value in a second field’s drop-down list. SORTING DATA
  • When you enter records for a new data list, you generally enter them in the order in which you retrieve their records, that is you add new records at the bottom of the database by adding a new row.
  • What about those times when you need to see the records in another, special order?
  • Every data list you put together in Excel will have some kind of preferred order for maintaining and viewing the records.
  • Depending on the data, you may want to see the records: A. in alphabetical order (for example clients listed by company name); B. in numerical order (for example based on a specific date)
  • To have Excel correctly sort the records of a dataset, you must specify which field’s values determine the new order of the records. These fields are technically known as the sorting keys. Further, you must specify what type of order you want to create using the information in these fields. Choose from two possible orders: A. Sort smallest to largest (ascending order): text entries are placed in alphabetical order from A to Z, values are placed in numerical order from smallest to largest, and dates are placed in order from oldest to newest. B. Sort largest to smallest (descending order): text entries are placed in alphabetical order from Z to A, values are placed in numerical order from largest to smallest, and dates are placed in order from newest to oldest. Sorting on a single field/one criterion
  • When you need to sort the data on only one particular field (such as the Record Number, or the Last Name, or the Company field), you simply: A. Move the cursor to a cell of the field you want sort the data on. B. In the Ribbon you go to the Data tab, and in the Sort & Filter group, click the Sort Smallest to Largest button (with A on top of Z, and an arrow pointing down). The list is now sorted in ascending numerical order by the field/variable, which was the column you clicked in before performing the sort.
  1. Click the Insert tab.
  2. Click Table.
  3. Click OK in the Create Table dialog box. Notice how the range is automatically entered. Alternatively move the cursor to a populated cell and then top CTRL+T (Cmd+T on Mac).
  • After adding the row, I can do the additional math on the columns such as determining the average inauguration age. If you click the triangle, you’ll see a list of math options.
  • Apart from the usability features, Excel Tables make math easier. To start, you can add something called a Total Row by checking the box in the Table Style Options panel. There are a number of built-in math functions you can choose among.
  • To remove the Table formatting and go back to a simple spreadsheet, click on the Convert to Range button in the Table tab. Excel structured reference
  • A structured/table reference is a special way of referencing tables and their parts that uses a combination of table and column names instead of cell addresses.
  • This special syntax is required because Excel tables are very powerful and resilient, and normal cell references cannot adjust dynamically as data is added or removed from a table.
  • For example, to sum the values in cells B2:B5, you use the SUM function with a usual range reference: =SUM(B2:B5)
  • To add up the numbers in the "Sales" column of Table1, you use a structured reference: =SUM(Table1[Sales]) Key features of structured references
  • To add structured references to your formula, you simply select the table cells you want to refer to. The knowledge of a special syntax is not required.
  • When you rename a column, references are automatically updated with the new name, and a formula does not break. Moreover, as you add new rows to the table, they are immediately included in the existing references, and the formulas calculate the full set of data.
  • Structured references can be used in formulas both inside and outside an Excel table, which makes locating tables in large workbooks easier.
  • To perform the same calculation in each table row, it is enough to enter a formula in just one cell. All other cells in that column are filled automatically. How to create a structured reference in Excel
  • Start typing a formula as usual, beginning with the equality sign (=).
  • When it comes to the first reference, select the corresponding cell or range of cells in your table. Excel will pick up the column name(s) and create an appropriate structured reference for you automatically.
  • Type the closing parenthesis and press Enter. If the formula is created inside the table, Excel automatically fills the entire column with the same formula. PIVOT TABLES (examples and exercises are in class5.pdf)
  • Pivot table: a table of statistics that summarizes the data of a more extensive table.
  • This summary might include sums, averages, or other statistics, which the pivot table groups together in a meaningful way.
  • They enable you to quickly summarize large amounts of data revealing inherent relationships and trends.
  • Pivot tables are great for summarizing particular values (frequencies, percentages, descriptive statistics) in a data set because they do their magic without making you create formulas to perform the calculations.
  • You can also use pivot tables to cross-tabulate variables to assess their joint distribution (useful for nominal and ordinal variables).
  • Why is a PivotTable called a PivotTable? Because you can easily pivot fields from a row to a column and vice versa to create a different layout.
  • You might think of a pivot table as a custom created summary table of your spreadsheet where you can switch your columns and rows. But it also has elements of Excel Tables.
  • You create the table by defining which fields to view and how the information should display. Based on your field selections, Excel organizes the data so you see a different view of your data.
  • To insert a pivot table, execute the following steps:
    1. Click any single populated cell inside the data set
    2. On the Insert tab, in the Tables group, click PivotTable
    3. The dialog box will appear. Excel automatically selects the data for you. The default location for a new pivot table is New Worksheet, but if you want you can place the PivotTable in an existing worksheet by specifying the location of the first cell of the PivotTable.
    4. Click OK. On the right you have the PivotTable Fields where you can select the variables to appear in the table and how to summarize the data.
    5. You can populate the four sections of the PivotTable Fields pane (Filters, Columns, Rows, and Values) by dragging and dropping the variable names.
    6. By default, Excel summarizes your data by either summing (for continuous variables) or counting (for nominal or ordinal variables) the items.
  1. Available summary functions:
  • Sum (SOMMA), to show the sum of the records for a particular category
  • Count (CONTEGGIO), to show the count of the records for a particular category
  • Average (MEDIA), to calculate the average for the values in the field for the current category and page filter
  • Max (MASSIMO), to display the largest numeric value in that field for the current category and page filter
  • Min (MINIMO), to display the smallest numeric value in that field for the current category and page filter
  • Product (PRODOTTO), to display the product of the numeric values in that field for the current category and page filter
  • Count Numbers (CONTA NUMERI), to display the number of numeric values in that field for the current category and page filter
  • StdDev(DEVIAZIONE STANDARD), to display the standard deviation for the sample in that field for the current category and page filter
  • StdDevp (DEVIAZIONE STANDARD POPOLAZIONE), to display the standard deviation for the population in that field for the current category and page filter
  • Var (VARIANZA), to display the variance for the sample in that field for the current category and page filter
  • Varp (VARIANZA POPOLAZIONE), to display the variance for the population in that field for the current category and page filter
  • Exercises with pivot tables: create filters for pivot table and change them according to your needs, sort from smallest to largest and vice versa, change calculation formulas, how to remove a field/variable, group, group dates, fields with multiple rows, multiple value fields, report multiple filter fields. Frequency distribution (examples and exercises are in class5.pdf) you can use pivot tables to easily create a frequency distribution in Excel:
  1. Frequency distribution of a categorical variable
  2. Frequency distribution of a quantitative variable Two-dimensional (two-way) table (examples and exercises are in class5.pdf)
  • A two-way or contingency table is a statistical table that shows the observed number or frequency for two variables, the rows indicating the values of one variable and the columns indicating the values of the other variable.

Conditional formatting (examples and exercises are in class5.pdf)

  • In the Styles group of the Home tab you can find the Conditional Formatting button, a tool that may help you answering such questions by highlighting pertinent cells or ranges of cells.
  • It enables you to specify how cells that meet a given condition should be displayed. It applies a specific formatting according to rules.
  • You can apply a single rule or multiple rules, and you can use the preset rules provided by Excel or customize them for your particular needs.
  • From the Conditional Formatting drop-down menu you can select different options:
    1. Highlight Cells Rules, to highlight the cells in the cell selection that contain certain values, text, or dates; that have values greater or less than a particular value; or that fall within a certain ranges of values.
    2. Top/Bottom Rules, to highlight the top and bottom values, percentages, and above and below average values in the cell selection.
    3. Data Bars opens a palette with different color data bars that you can apply to the cell selection to indicate their values relative to each other by clicking the data bar thumbnail. They are great for helping you quickly spot the lower and higher values within a large range of data.
    4. Color Scales opens a palette with different two- and three-colored scales that you can apply to the cell selection to indicate their values relative to each other by clicking the color scale thumbnail. They are great for identifying the distribution of values across a large range of data.
    5. Icon Sets opens a palette with different sets of icons that you can apply to the cell selection to indicate their values relative to each other by clicking the icon set. Icon sets are great for quickly identifying the different ranges of values in a range of data.
    6. New Rule opens the New Formatting Rule dialog box where you define a custom conditional formatting rule to apply to the cell selection. You need to specify the rule that identifies which values or text entries to format, but also the colors and other aspects included in the formatting.
    7. Clear Rules opens a continuation menu where you can remove conditional formatting rules for the cell selection by clicking the Clear Rules from Selected Cells option, for the entire worksheet by clicking the Clear Rules from Entire Sheet option, or for just the current data table by clicking the Clear Rules from This Table option.
    8. Manage Rules opens the Conditional Formatting Rules Manager dialog box where you edit and delete particular rules as well as adjust their rule precedence by moving them up or down in the Rules list box. DATA VALIDATION How to ensure data integrity?
  • When typing new data in a workbook it’s easy for anyone to type the wrong digits or characters, especially in a field where a single character denotes a type, such as a senior citizen or a child, or such as a dog or a cat. For example, should “C” in a vet clinic stand for “cat” or “canine?”
  • Validation helps to ensure that data gets entered correctly before it gets processed incorrectly.
  • Excel’s data validation tools can help you set up rules that keep you or anyone else from entering invalid or unusable data, or from failing to enter data when it’s required. Restricting Cell Entries to Certain Data Types
  • For example, when collecting data on clients’ ZIP code, you know that it has five digits. You can set up a rule in Excel that pings the user whenever he/she types a four- or six-digit code by mistake.
  • How to do that?
    1. Select the cell range or a column where you want to validate data.
    2. Click the Data tab, and then in the Data Tools group, click Data Validation. The Data Validation dialog box opens.
    3. Click the Settings tab.
    4. In the Allow list box, choose Text length. This is the first step in the creation of a rule governing how many characters each new entry should contain.
    5. In the Data list box, choose equal to.
    6. Click the Length box and type 5. Input and error message
  • To insert an Input message (a message appearing when you select a cell in the validation range):
    1. Click the Input Message tab
    2. Click the Title box and type the rule
  • To insert and Error message (an error alert message appearing when a user attempts to enter data that is invalid): 1. Click the Error Alert tab. 2. Click the Title box and type Data Entry Error. 3. Click the Error message box and type “Only five-digit ZIP codes are recognized”. This message is displayed in a dialog box whenever an invalid entry is made. 4. Click OK. WARNING! Excel’s validation rules pertain only to new data that is entered into the workbook, not to data that existed in the workbook prior to creating the rules. Allowing Only Specific Values to Be Entered in Cells
  • A typical piece of information you’ll find in a database is often a single letter that represents a characteristic (such as gender or political party affiliation). In data entry, it’s easy for someone to slip and enter an invalid character. You can preempt events like this by building a rule that restricts entry to a handful of valid characters.
  • How to do that?
    1. Select the range or the column where you want to restrict entries to specific values (for instance M for males and F for females, Non-binary, and Other).
    2. Click on Data Validation on the Data tab and in the Data Validation dialog box, click the Settings tab.
    3. In the Allow list box, choose List. The Source box appears at the bottom of the dialog box.
    4. Click the Source box. Type M;F;Non-binary;Other (include the semi-colon or the comma depending on your locale settings)
    5. Click the Input Message tab. Click in the Input message box and type: Gender.
    6. Click OK. Now anyone entering a new record into the database must specify the gender from a drop-down list in the cell. Change or remove Data Validation rules How to do that?
    7. Click one of the cells containing the Data Validation rule you want to change/delete.
    8. Click on Data Validation on the Data tab.
    9. In the Settings tab of the Data Validation dialog box tick Apply these changes to al other cells with the same
    10. settings (if you want to apply changes to all other cells with the same settings.
    11. Click on Clear All to remove the Data Validation rule.
    12. Or change the Data Validation according to your new preferences and click OK. EXCEL CHARTS
  • A chart is a graphical representation of numeric data in a worksheet.
  • Data values are represented by graphs with combinations of lines, vertical or horizontal rectangles (columns and bars), points, and other shapes.
  • You can use charts to compare groups, show the composition, show a distribution, show a relationship between variables.
  • To determine which chart is best suited for you, first, you must answer a few questions: What type of variables do you want to graph? (categorical or continuous?) How many variables do you want to show in a single chart? One, two, three, many? How many items (data points) will you display for each variable? Only a few or many? What do you want to show? A comparison, a trend, a distribution, or an association?
  • Column and Bar charts are good for comparisons.
  • Line charts work better for trends.
  • Scatter plot charts are good for relationships and distributions.
  • Pie charts should be used only for simple compositions — never for comparisons or distributions.
  • In Excel you can choose from many chart types with numerous subtypes and
  • combo charts.
  • The Insert tab contains the command groups you’ll use to create charts in Excel.
  • To create a basic chart in Excel that you can modify and format later, start by
  • entering the data for the chart on a worksheet.
  • Then, you select that data and choose a chart type to graphically display the data.
  1. Select the chart;
  2. Right-click on the legend and click on Format legend.
  3. On the right-hand side pane that opens-up select Right as the legend position.
  • You can use data labels to focus your readers' attention on a single data series or data point. To add data labels:
  1. Select the chart;
  2. Click the Add chart element button in the Chart Design tab or the + button on the right side of the chart and click the check box next to Data Labels.
  • You can move a chart to a separate sheet that only contains a chart (a chart sheet) with the following steps:
  1. Select the chart.
  2. On the Design tab, in the Location group, click Move Chart.
  3. The Move Chart dialog box appears.
  4. Click New sheet and enter a name.
  5. Click OK. Bar chart
  • A bar chart is the horizontal version of a column chart.
  • If you have long category names, it is best to use bar charts because they give you more space for long text.
  • You should also use bar charts, instead of column charts, when the number of categories is great.
  • To create a bar chart, execute the following steps:
  1. Select the range A1:D9.
  2. On the Insert tab, in the Charts group, click the Column symbol.
  3. Click Clustered Bar. Pie chart
  • Pie charts charts are among the most frequently used and also misused charts.
  • Pie charts are used to display the contribution of each value (slice) to a total (pie). Pie charts always use one data series.
  • When using pie charts make sure: the total sum of all segments equals 100 percent; to use it only if you don’t have too many categories (ideally, there should be only two); don’t use a pie chart if the category values are almost identical; 3D or blow apart effects may reduce comprehension and show incorrect proportions. Therefore, evaluate whether it makes sense to use them.
  • To create a pie chart of the 2017 data series, execute the following steps:
  1. Select the range A1:A4, hold CTRL/Cmd down and select the range D1:D4.
  2. On the Insert tab, in the Charts group, click the Pie symbol.Click on the pie to select the whole pie.
  3. Click on a slice to drag it away from the centre. Histogram chart
  • A graph of a (relative) frequency distribution for a quantitative variable is called histogram.
  • For a continuous variable or a discrete variable with many possible values, you need to divide the all possible values into mutually exclusive intervals.
  • Each interval has a bar over it, with height representing the number of observations in that interval.
  • With histograms you can: evaluate a distribution of values, understand the range of values and identify where the most common values fall.
  • How to create an histogram chart:
  1. Select the range of data to plot from K1:K1197.
  2. On the Insert tab, in the Charts group, click the Histogram symbol.
  3. Click HistogramsEnter a title by clicking on Chart Title.
  4. For example, Party position on Left-Right.
  • Choosing intervals for frequency distributions and histograms is primarily a matter of common sense.
  • If too few intervals are used, they are so wide that the information presented is meaningless.
  • Choosing intervals for frequency distributions and histograms is primarily a matter of common sense.
  • If too many intervals are used, they are so narrow that the information presented is difficult to digest, and the histogram may be irregular and the overall pattern of the results may be obscured. Box plot
  • When you have continuous variables you can use a boxplot to display the central tendency and spread, also by group.
  • The box of a box plot contains the central 50% of the distribution, from the lower quartile to the upper quartile.
  • The length of the box is the interquartile range (IQR) = 75th percentile – 25th percentile.
  • The median is marked by a line drawn within the box.
  • The lines extending from the box are called whiskers. These extend to the maximum and minimum, except for outliers, which are marked separately.
  • In a box-plot: The middle line of the box represents the median, the x in the box represents the mean, the whiskers (vertical lines) extend from the ends of the box to the minimum value and maximum value.
  • Excel considers any data value to be an “outlier” if it is 1.5 times the IQR larger than the third quartile or 1.5 times the IQR smaller than the first quartile.
  • Outlier if: values < Q1 - (IQR1.5), values > Q3 + (IQR1.5)
  • Create a box plot and whisker in a very easy way:
    1. Select the range of data values to plot from J1:J398, go to the Insert tab, and then in the Charts group, click the Statistic Chart symbol.
    2. And from there click on Box and Whisker Excel lets you add another data series in a box-plot in a very easy way:
    3. Click on the chart and then in the Chart Design tab click on Select Data. And from there click on Add a new series.
    4. Then select as Series name cell I1 and as Series values the range I2:I398 and click on OK. Line chart
  • Line charts are among the most frequently used chart types when you have continuous (interval/ratio) data.
  • These are best suited for showing trends in data over a period of time.
  • With line charts, the emphasis is on the continuation or the flow of the values (a trend), but there is still some support for single value comparisons, using data markers.
  • To create a line chart with Excel:
    1. Select the range of data from B1:B122.
    2. On the Insert tab, in the Charts group, click the Line symbol.
    3. Click Line.
    4. Enter a title by clicking on Chart Title.
  • To add the horizontal axis labels:
    1. Click on Select data in the Chart Design tab
    2. Click on Edit to add Horizontal axs labels
    3. Select the range A2:A122 and click on Ok.
  • To add lines for other subjects
    1. Click on Select Data.
    2. Then click on Add series and insert each series name and values in turn.
    3. Once you have finished click on Ok.
  • By clicking on chart elements you can customize it from the Format Chart menu that opens on the right- hand side of the window. Area chart
  • An area chart is a line chart with the areas below the lines filled with colors.
  • Area charts are primarily used when the summation of quantitative data (DV) is to be communicated (rather than individual data values). The area underneath the line(s) helps in graphically depicting quantitative progression over time.
  • Area charts will fill up the area below the line, so the best use for this type of chart is for presenting accumulative value changes over time, like item stock, number of employees, or a savings account.
  • Do not use area charts to present fluctuating values, like the stock market or prices changes, because areas will cross each other.
  • To create an area chart, execute the following steps: (use the Line chart worksheet) 1. Select the range E1:M74. 2. On the Insert tab, in the Charts group, click the Line symbol. 3. Click Area.
  • Stacked area charts are best used to show changes in composition over time.
  • Stacked area charts might be colorful and fun, but you should use them with caution, because they can quickly become a mess.
  • Don’t use them if you need an exact comparison and don’t stack together too many categories. Change the chart's subtype to Stacked Area (the one next to Area).