Docsity
Docsity

Prepara i tuoi esami
Prepara i tuoi esami

Studia grazie alle numerose risorse presenti su Docsity


Ottieni i punti per scaricare
Ottieni i punti per scaricare

Guadagna punti aiutando altri studenti oppure acquistali con un piano Premium


Guide e consigli
Guide e consigli


DATA ANALYSIS - ANALISI DEI DATI - FRANCESCO VISCONTI - IPLE, UNIMI, Appunti di Analisi Dei Dati

COMPLETE, STEP BY STEP GUIDE FOR EXCEL

Tipologia: Appunti

2021/2022

In vendita dal 08/01/2023

valentinanovelli1
valentinanovelli1 🇮🇹

4.5

(2)

19 documenti

1 / 86

Toggle sidebar

Questa pagina non è visibile nell’anteprima

Non perderti parti importanti!

bg1
LESSON 1
Data: are empirical material organised into a form that can be analysed =!
in the worksheet!
Data analysis: is a process of gathering, inspecting, cleansing, transforming,
and modelling data with the goal of discovering useful information, informing
conclusions and supporting decision-making.!
Primary data analysis: you collect and categorise (coding process) your
own data through interviews, ocial documents, experiments, surveys..!
Secondary data analysis: you use data sets that have been gathered by
others and have subsequently been deposited in databases (existing!
archived collections of data). !
Units of analysis the objects/subjects to which the properties investigated!
pertain
Spreadsheet
foglio elettronico: is an interactive computer application for the organisation,!
analysis, and storage of data in tabular form it consists of!
a table of cells arranged into rows (Y= numbers 1,2,3..) !
and columns (X = letters, a, b, c..) What makes a!
spreadsheet software program unique is its ability to!
calculate values using mathematical formulas and the data!
in cells
Spreadsheet basic concepts
Cell: box for holding data. A single cell is referenced by its column and row !
Worksheet/Sheet: a grid of cells with either raw data (values), or formulas!
Values: raw data (numbers, text, dates) / formula, might be a calculation.!
Formulas: say how to mechanically compute new values from existing
values. !
Functions: built-in functions, such as arithmetic operations (for example,
summations, averages), trigonometric functions, statistical functions, etc.. !
Charts: graphical display of data. !
History of Spreadsheets$
!
VisiCalc by Dan Bricklin released in 1979 =
the first spreadsheet computer program for personal computers, originally!
released for the Apple II = the application that turned the microcomputer
Page of 1 86
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b
pf1c
pf1d
pf1e
pf1f
pf20
pf21
pf22
pf23
pf24
pf25
pf26
pf27
pf28
pf29
pf2a
pf2b
pf2c
pf2d
pf2e
pf2f
pf30
pf31
pf32
pf33
pf34
pf35
pf36
pf37
pf38
pf39
pf3a
pf3b
pf3c
pf3d
pf3e
pf3f
pf40
pf41
pf42
pf43
pf44
pf45
pf46
pf47
pf48
pf49
pf4a
pf4b
pf4c
pf4d
pf4e
pf4f
pf50
pf51
pf52
pf53
pf54
pf55
pf56

Anteprima parziale del testo

Scarica DATA ANALYSIS - ANALISI DEI DATI - FRANCESCO VISCONTI - IPLE, UNIMI e più Appunti in PDF di Analisi Dei Dati solo su Docsity!

LESSON 1

Data : are empirical material organised into a form that can be analysed = in the worksheet Data analysis: is a process of gathering, inspecting, cleansing, transforming, and modelling data with the goal of discovering useful information, informing conclusions and supporting decision-making.

  • Primary data analysis : you collect and categorise (coding process) your own data through interviews, official documents, experiments, surveys..
  • Secondary data analysis : you use data sets that have been gathered by others and have subsequently been deposited in databases (existing archived collections of data).
  • Units of analysis the objects/subjects to which the properties investigated pertain Spreadsheet foglio elettronico : is an interactive computer application for the organisation, analysis, and storage of data in tabular form it consists of a table of cells arranged into rows (Y= numbers 1,2,3..) and columns (X = letters, a, b, c..) What makes a spreadsheet software program unique is its ability to calculate values using mathematical formulas and the data in cells Spreadsheet basic concepts Cell : box for holding data. A single cell is referenced by its column and row Worksheet/Sheet: a grid of cells with either raw data (values), or formulas Values : raw data (numbers, text, dates) / formula, might be a calculation. Formulas: say how to mechanically compute new values from existing values. Functions: built-in functions, such as arithmetic operations (for example, summations, averages), trigonometric functions, statistical functions, etc.. Charts: graphical display of data. History of Spreadsheets
  • VisiCalc by Dan Bricklin released in 1979 = the first spreadsheet computer program for personal computers , originally released for the Apple II = the application that turned the microcomputer

from a hobby for computer enthusiasts into a serious business tool. VisiCalc is considered the Apple II's killer app. It sold over 700,000 copies in six years, and as many as 1 million copies over its history.

  • Lotus 1-2-3 by Lotus Software 1983 = added the possibility of creating graphs
  • Microsoft Excel in 1985 for Macintosh Apple II and in 1987 on Windows By the early 1990s, Excel had started to outsell Lotus 1-2-3 and helped Microsoft achieve its position as a leading PC software developer The last version of Excel is of 2019 Microsoft Excel/Start screen When you launch Excel, a Start screen opens. From here you can:
  • start a new BLANK WORKBOOK
  • select any of the other TEMPLATES
  • reopen recently opened files for further editing
  1. BLANK WORKBOOK: when you select the Blank workbook template from the Start screen, the program opens an initial worksheet (named Sheet1 ) in a new workbook file (named Book1 ) inside a program window. File button = a menu on the left that contains all the document and file related commands: Info, New, Open , Save, Save As, Print, Share, Export , and Close. Account option with User and Product information and an Options item that enables you to change default settings (e.g., language).
  • Info panel : shows stats about the workbook file, is divided into two panes.
    • left pane enables you to modify the workbook’s protection status, check the document before publishing, and manage its versions.
    • right pane contains a list of fields detailing the workbook’s properties.
  • Below the Info option, you find the commands ( New, Open, Save, Save As, Print, Share, Export, and Close ) you commonly need for working with Excel workbook files.
  • Account : shows user, connection, and Microsoft Office account info.
  • Options : change the program’s default settings.
  • Select the Open option to open an Excel workbook you’ve previously worked on for more editing.
  1. TEMPLATES separated into two panes
    • left pane , Excel lists the names of recently edited workbooks

Formulas : command buttons used to add formulas and functions to a spreadsheet or checking a worksheet for formula errors , arranged into the Function Library, Defined Names, Formula Auditing, and Calculation groups. Data : command buttons used when importing, querying, outlining , and subtotalling the data placed into a worksheet’s data list, arranged into the Get External Data, Connections, Sort & Filter, Data Tools, and Outline groups. Review : command buttons used when proofing, protecting, and marking up a spreadsheet for review by others , arranged into the Proofing, Language, Comments, and Changes groups. View : command buttons normally used when changing the display of the Worksheet area and the data it contains, arranged into the Workbook Views, Show, Zoom, Window, and Macros groups.

  • Excel can display contextual tools when you’re working with a particular object that you select in the worksheet , such as a graphic image or a chart or PivotTable.
  • Chart Tools contextual tab = the name of the contextual tool for the selected object appears immediately above the tab or tabs associated with the tools. When you deselect the object, the contextual tool disappears from the Ribbon Formula Bar it displays the cell address ( a column letter(s) followed by a row number ) and the contents of the current cell. It has three sections:
    • Name box = the left‐most section , displays the address of the current cell address.
  • Formula bar buttons = cancel (an X), enter (a checkmark) the Insert Function button (Fx).
  • Cell contents = section of the Formula bar it always shows you the contents of the cell even when the worksheet does not ex: when dealing with a formula, Excel displays only the calculated result in the cell in the worksheet and not the formula by which that result is derived. you can also edit the contents of the cell in this area

Worksheet Area

  • It contains the cells of the worksheet identified by column using letters along the top and row using numbers along the left edge;
  • Tabs for selecting new worksheets;
  • Horizontal and vertical scroll bars to move left/right up/down
  • To enter or edit data in a cell, that cell must be current. Excel indicates it in three ways:
    • the cell cursor — the dark green border surrounding the cell’s entire perimeter — appears in the cell.
    • the address of the cell appears in the Name box of the Formula bar.
    • the cell’s column letter(s) and row number are shaded Methods for moving the cell cursor around the worksheet:
  1. Click on the desired cell
  2. Click the Name box, then type the address of the desired cell and press the Enter key.
  3. Press F5 to open the Go To dialog box, type the address of the desired cell into its Reference text box, and then click OK.
  4. Use the cursor keys to move the cell cursor to the desired cell.
  • Excel keystrokes move the cell cursor to a new cell. When you use one of these keystrokes, the program automatically scrolls a new part of the worksheet into view, if this is required to move the cell pointer. IMPORTANT!! CTRL+FRECCE portano in alto e in basso alle ultime celle occupate e non
  • If Excel finds that the entry is a formula, the program automatically calculates the formula and displays the computed result in the worksheet cell (you continue to see the formula itself, however, on the Formula bar).
  • If Excel does not qualify the entry as a formula, the program the determines whether the entry is a text or a value, it aligns text entries with the left edge of the cell and values with the right Text = alight at the left of the cell
  • To re-enter long texts into certain cells, select the cells you want to cancel and then using “Wrap Text” and “Merge and Center” Values = align at the right of the cell
  • numbers that represent quantities (1,2,3..)
  • numbers that represent dates (September 14th, 2011) or times (2 p.m) =
    • dates and time entered as values can be used in formulas (November 6, 2012 = 6-Nov-12 or 11/6/ 3 AM/PM = 3 A/P = 15:00)
    • if you enter two dates as values, you can then set up a formula that subtracts the more recent date from the older date and returns the number of days between (=DATEDIF(A2,B2,”d”) years between (=DATEDIF(A2,B2,”y”) months between (=DATEDIF(A2,B2,”m”) weeks between (=DATEDIF(A2,B2,”d”) /7)
  • when entering numeric values with decimal places, use the period as the decimal point ( or the comma depending on the your computer settings).
  • you can automatically round numbers after the decimal point through the Numbers group in the Home tab.
  • When you enter a date, the date is converted into a serial number that represents the number of elapsed days starting with 1 for January 1, 1900. For example, July 5, 1998, converts the date to the serial number 35981.
  • Earliest date allowed for calculation: January 1, 1900
  • Latest date allowed for calculation: December 31, 9999
  • Largest amount of time that can be entered: 9999:59: Formulas
    • A formula is an equation that performs calculations, such as addition, subtraction, multiplication, and division, on values in a worksheet.
  • To enter a formula in a cell you always need to start with the equal sign (=) In a formula values are separated by the following mathematical operators:
  • + (plus sign) for addition
  • (minus sign or hyphen) for subtraction
  • ***** (asterisk-times) for multiplication
  • / (slash) for division
  • ^ (caret) for raising a number to an exponential power Cell References
  • A cell reference identifies a cell’s location in the worksheet, based on its column letter and row number, such as A1 (column A, row 1)
  • Formulas will change their calculated results automatically to match changes in the cells referenced by the formulas.
  • There are three types of cell references :
    • Relative : one whose references change “relative” to the location where it is copied or moved , you use it when you want the reference to automatically adjust when you copy or fill the formula across rows or down columns in ranges of cells example: - Create a formula in cell D2 that multiplies a value in B2 by a value in C2 , enter the following formula in cell D2: = B2C2.*
  • Select cell D2, click on the lower right corner of cell D2 and drag it down to cell D5 and so: D3 references cell B3C D4 references cell B4C4. D5 references cell B5*C5. in other words: each cell references its two neighbours on the left.
    • Absolute one whose refers to a specific cell or range of cells regardless of where formula is located in the worksheet. to make an absolute cell reference, use the dollar sign ($) before the column and row of the cell you want to reference, for example $B$3. example: - Create an absolute reference to cell H3,( $H$3 )
  • it changes across all of you (tutti compiono gli anni ogni giorno)
  • A constant is a variable that does not change for all units of analysis. Levels of measurement
  • Variables can be classified according to the operations that can be carried out on them based on the logical-mathematical characteristics There are three classes of variables: High level of measurement • Continuous (interval/ratio)
    • Ordinal ( clear ordering (<);(>)) Low level of measurement • Nominal ( no ordering) Nominal Variable = 2 or more categories = no intrinsic ordering
  • the characteristic are non-orderable discrete categories:
    • Discrete : the variable takes on a range of finite categories and there are no inter-mediate categories.
    • Non-orderable : no order or hierarchy can be established
  • the relationship that we can establish are of “ equality ” and “ inequalityexample : religion: catholic, muslim, jewish, atheist, protestant - the classification must be exaustive and exclusive = you should be able to categorize all units example: 1 → Catholic Religion Respondent nr. 1 = nr 3 2 → Muslim Religion Respondent nr. 1 nr 2 3 → Catholic 4 → Atheist other exemples : countries, preferred sports.. Ordinal Variable = clear ordering of categories
  • the characteristic to be recorded assumes orderable discrete categories
  • Orderable : it exists an order, which enables us to establish not only relationships of equality and inequality among the categories, but also relationships of order that is ‘ greater than ’ (>) and ‘ less than ’ (<).
  • the distance between one category and the next is not known categories have the ordinal properties of numbers. example: evaluation of life satisfaction : very satisfied; fairly satisfied; not very satisfied; not at all satisfied. example: 1 → Catholic → Not very Satisfied 2 → Muslim → Fairly Satisfied 3 → Catholic → Very Satisfied 4 → Atheist → Fairly Satisfied Respondent nr. 1 satisfaction nr. 2 satisfaction Respondent nr. 1 satisfaction < nr. 2 satisfaction other exemples : religious practice (never, often, sometimes) Continuous Variable = intervals between values are equally spaced
  • distance between categories determined because we have a reference unit (measurement unit or a counting unit)
  • the values have full numerical meaning, ordinal and cardinal properties = we can carry out on these variables all mathematical operations/statistical example: age example: 1→ Catholic → Not very Satisfied → 54 2 → Muslim → Fairly Satisfied → 36 3 → Catholic → Very Satisfied → 44 4 → Atheist → Fairly Satisfied → 41 Respondent nr. 1 age nr. 2 satisfaction Respondent nr. 1 age > nr. 2 satisfaction Respondent nr. 1 age - nr. 2 satisfaction = 18 other exemples: temperature, hours of study, years of education How to Structure a Dataset
  1. Put:
  • variables in columns (the thing you are measuring, “age” or “country”),
  1. Frequency distributions a listing of possible values for a variable , together with the number of observations at each value. (bread: 200 piecies, pasta:450 piecies..)
  2. Measures of central tendency Concern the variable’s typical score. Measures of central tendency unify the observations, providing them with a summary meaning.
  • Mode
  • Median
  • Mean
  1. Measures of variability/dispersion Concern the spread of the scores. They tell how different the observations are, one from the other.
  • **Range
  • Standard deviation
  • interquartile range (IQR)**
  1. Measures of position
  • percentile, quartile Frequency distribution The most common way of describing categorical data is to list the answer categories and show the frequency (the number of observations) for each category with a table
  • A frequency distribution is a listing of possible values for a variable, together with the number of observations at each value. Categorical Variables Frequency Distribution
  • To construct a frequency distribution for a categorical variable, list the categories and count the number of observations in each. For categorical variables we can use the Excel function =COUNTIF(RANGE;VALUE)
  • The sum of the counts should be equal to the number of observations.
  • When a table shows the proportions or percentages instead of the absolute frequency, it is called a relative frequency distribution.
  • proportions or percentages , also called relative frequencies , make it easier to compare different categories.
  • the proportion equals the number of observations in a category divided by the total number of observations = it is a number between 0 and 1 that expresses the share of the observations in that category. the sum of proportions equals 1 ; (=COUNT/TOTALCOUNT)
  • Then we can compute the relative frequency of each interval.

LESSON 3

Measures of Central Tendency = statistics that show what a typical observation is like.

  • Mode : the most frequently occurring value
    • Median : the middle value, the score that has 50% of the values above and 50% of the values below. If the sample size is even, the median represents the average score of the two middle values
    • Mean : is the sum of the observations divided by the number of observations.
  • Continuous variables : mean, mode, median are all informative.
  • Ordinal variables : mean only loosely interpretable, median and mode have more precision
  • Nominal variables : mean and median have no meaning, mode remains useful (exception: binary variables) Mode
  • The mode is the value that occurs most frequently It is appropriate for all types of data but it is most commonly used with highly discrete variables, such as with categorical data.
  • For example, you may measure the mode for religion in Australia (nominal scale), for the grades given by a teacher (ordinal scale), or for the number of years of education completed by Germans (interval scale).

How to find the mode?

  1. Look at a frequency distribution and search for the most frequent value.
  2. Alternatively, Excel can compute the mode using the =MODE(), =MODE.SNGL(), =MODE.MULT(). Mode warnings!
  • If no data value occurred more than once, the MODE function returns #NA, this is often the case with continuous data.
  • If at least two data values occurred with the same frequency, Excel will decide which value to report based on the position: it will return the first value listed.
  • If you want to compute the mode of an ordinal or nominal variable you will need first to recode each category with a specific number otherwise Excel won’t be able to compute it and will return #NA. =IFS(B2="Great", 4, B2="Good", 3, B2="OK", 2, B2="Bad", 1)
  • A frequency distribution is called Bimodal if two distinct mounds occur in the distribution The mean, median and mode are identical for a unimodal, symmetric distribution, such a bell-shaped distribution
    • Bimodal distribution
  • Example #1: Peak restaurant hours If you created a graph to visualize the distribution of customers at a certain restaurant by hour, you’d likely find that it follows a bimodal distribution with a peak during lunch hours and another peak during dinner hours Median
  • The median is the observation that falls in the middle of the ordered sample (data listed from smallest to largest or from largest to smallest).
  • If a sample contains an odd number of observations, the median is the observation that has as many observations below it as above it.

https://blog.datawrapper.de/weekly-chart-income/ Measures of Variability

  • While useful, measures of central tendency do not tell the full story about a variable. They describe a typical value, but not the spread of the data about that typical value. example: the citizens of nation A and the citizens of nation B have the same mean annual income ($25,000).
  • still, the distributions of those incomes differ fundamentally, however, nation B being much less variable. An income of $30, is extremely large for nation B, but not especially large for A. Range (can be computed for nominal, ordinal, and continuous variables)
  • For categorical variables (ordinal and nominal) the range is better understood as the count of the choices, or categories recorded. - For example an ordinal variable with three responses (not willing, undecided, willing) or a nominal variable with three answers (Catholic, Protestant, Jews) have both a range of 3 given that subjects can take up to three different values.
  • For continuous variables the difference between the largest and smallest observations is the simplest way to describe variability. maximum value – minimum value
  • For example in a set of data such as 4, 2, 5, 8, 12, 15 , the range is the highest number (15) minus the lowest number (2) in this case = 15-
  • For nation A from the figure the range of income values is about $50,000 — 0 = $50,.
  • For nation B , the range is about $30,000 — $20,000 = $10,. Nation A has therefore greater variability of incomes.
  • A wide range suggests that observations differ greatly in the variable of interest , and vice versa. How to find the range?
  • Excel doesn’t have a built-in range formula to compute it, but there are very simple workarounds.
  • For ordinal and nominal variables you can simply look for the number of categories a variable takes up.
  • For quantitative variables you can compute the range as the difference between the minimum and the maximum values of variable that you can find with the formulas =MAX(range)-MIN(range). Standard Deviation (meaningful for continuous variable) Standard Deviation is a sort of typical distance of an observation from the mean , so the larger the standard deviation s, the greater the spread of the data around the mean, yi is the of value of a single observation i. ӯ is the mean. (yi - ӯ) is the deviation of observation i from the mean.