Download Data Terminology, Types, and Sampling Methods: A Comprehensive Guide and more Exercises Business Statistics in PDF only on Docsity!
Chapter 2
2.1 VARIABLES AND DATA
Data Terminology
▪ An observation: a single member of a collection of items that we want to study, such as a person, firm, or region. Ex: an employee, or an invoice mailed last month ▪ A variable: a characteristic about the items that we want to study (e.g., student name, Gender, DOB). Ex: an employee’s income or an invoice amount. ▪ Data set: all the values of all of the variables for all of the observations we chose. Data usually are entered into a spreadsheet or database as an n X m matrix
Categorical and Numerical Data
A data set may contain a mixture of data types. Two broad categories:
- Categorical (qualitative) data : values that are described by words rather than numbers - nonnumerical values - Verbal label. Values of the categorical variable might be represented using numbers - Coded
- Numerical (quantitative) data: arise from counting, measuring something, or some kind of mathematical operation. Two types: Discrete (integers) , Continuous (physical measurements, financial variables)
Time Series Data and Cross-Sectional Data
- Time series Data: observation in the sample represents a different equally spaced point in time (years, months, days). The periodicity is the time between observations. → trends and patterns over time Ex: a firm’s sales, market share, debt/ equity ratio, employee absenteeism, inventory turnover, and product quality ratings
- Cross-sectional Data: observation represents a different individual unit (e.g., a person, firm, geographic area) at the same point in time. → variation among observations and relationships Ex: daily closing prices of a group of 20 stocks recorded on December 1, 2015. Combine the two data types to get pooled cross-sectional and time series data. Ex: monthly unemployment rates for the 13 Canadian provinces or territories for the last 6 0 months 2.2 LEVEL OF MEASUREMENT Four levels of measurement for data: nominal, ordinal, interval, and ratio.
Nominal Measurement
Nominal data: the weakest level of measurement and the easiest to recognize, identify a category. “Nominal” data are the same as “qualitative”, “categorical” or “classification” data. The only permissible mathematical operations are counting (e.g., frequencies). ➔ No ordering
Ordinal Measurement
Ordinal data codes connote – imply - a ranking of data values. There is no clear meaning to the distance between. Like nominal data, ordinary data lack the properties that are required to compute many statistics, such as the average. Ordinal data can be treated as nominal, but not vice versa. ➔ Ordering, but differences have no meaning.
Rule of Thumb: A population may be treated as infinite when the population size N is at least 20 times the sample size n (i.e., when N/n ≥ 20)
Parameters and Statistics
- A parameter is a specific characteristic of a population
- A statistic is a specific characteristic of a sample From a sample of n items, chosen from a population, we compute statistics that can be used as estimates of parameters found in the population. Population mean = μ Population proportion = π Sample mean = 𝐱̅ Sample proportion = p
Target Population
- The target population contains all the individuals in which we are interested
- The sampling frame is the group from which we take the sample
2.4 SAMPLING METHODS Two main categories: random sampling and non-random sampling
Random Sampling Methods
Simple Random Sample
We denote the population size by N and the sample size by n. In a simple random sample, every item in the population of N items has the same chance of being chosen in the sample of n items. Ex: select one student at random from a list of 15 students Sampling without replacement : once an item has been selected to be included in the sample, it cannot be considered for the sample again. Problem when our sample size n is
close to our population size N → bias/ tendency to overestimate/ underestimate
A finite population is effectively infinite if the sample is less than 5 percent of the population (if n/N < .05) Sampling with replacement: the same random number could show up more than once. Duplicates are unlikely when n is much smaller than N
Systematic Sample
Systematic sample: choose every k th item from a sequence or list, starting from a randomly chosen entry among the first k items on the list.
- Decide on sample size: n
- Divide frame of N individuals into n groups of k individuals: k=N/n
- Randomly select one xth^ individual from the first group
- Select every xth^ individual in other groups thereafter
2.6 SURVEYS
SURVEY
- Step 1: State the goals of the research
- Step 2: Develop the budget (time, money, staff)
- Step 3: Create a research design (target population, frame, sample size).
- Step 4: Choose a survey type and method.
- Step 5: Design a data collection instrument (questionnaire).
- Step 6: Pretest the survey instrument and revise as needed.
- Step 7: Conduct the survey.
- Step 8: Code the data and analyze the data
Questionnaire Design
- Begin with short, clear instructions.
- State the survey purpose.
- Assure anonymity.
- Instruct on how to submit the completed survey.
- Break survey into naturally occurring sections
- Let respondents bypass sections that are not applicable (e.g., “if you answered no to question 7, skip directly to Question 15”).
Chapter 3
Graphical Presentation of Data
- Data in raw form are usually not easy to use for decision making
- Some type of organization like graph or table is needed
- The type of graph to use depends on the variable being summarized
Tables and Charts for Categorical Data
Summary table
Column/Bar and Pie Chart
- Column/Bar charts and Pie charts are often used for qualitative/categorical data (categories or nominal scale)
- Pies or Bars/Columns represent categories
- Height of bar or size of pie slice shows the frequency or percentage for each category
Column and Bar chart
- A column chart is a vertical display of data
- A bar chart is a horizontal display of data A Column chart display easier to read, but a Bar chart can be useful when the axis labels are long or when there are many categories.
For two-digit or three-digit integer data, the stem is the tens digit of the data, and the leaf is the ones digit ❖ Completed stem-and-leaf diagram: Data in ordered array: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41 Using the 100’s digit as the stem: ❖ Completed stem-and-leaf display:
Dot Plots
A dot plot is another simple graphical display of n individual values of numerical data, The basic steps in making a dot plot are to
- Make a scale that covers the data range
- Mark axis demarcations and label them
- Plot each data value as a dot above the scale at its approximate location If more than one data value lies at approximately the same X-axis location , the dots are piled up vertically
- Easy to understand
- Show variability
- Show the center and where the midpoint lies
- Reveal some things about the shape of the distribution
- Not good for large samples (e.g., > 5,000).
Dot plots have limitations.
- Don’t reveal very much information about the data set’s shape when the sample is small
- Become awkward when the sample is large (what if you have 100 dots at the same point?)
- When have decimal data. Tabulating Numerical Data
Frequency and Cumulative Distributions
- A table
- Grouping n data values into k classes called bins (based on values of the data)
- The bin limits are cutoff points that define each bin.
- Bins have equal interval widths and their limits cannot overlap ❖ The basic steps for constructing a frequency distribution
- Sort the data in ascending order ➔ Find Smallest and Largest Data Values
- Choose the number of bins ➔ Sturges’ Rule: k = 1 + 3.3.log (n)
- Set the bin limit:
𝐱𝐦𝐚𝐱−𝐱𝐦𝐢𝐧 𝐤
- Put the data values in the appropriate bin ➔ Count the Data Values in Each Bin
- Create the table. ➔ Show only the absolute frequencies or include the relative frequencies and the cumulative frequencies
Multivariate Categorical Data
Contingency table
Side-by-side bar charts
Scatter Plots A scatter plot shows n pairs of observations (x1, y1), (x2, y2),.. ., (xn, yn) as dots (or some other symbol) on an X-Y graph
- Investigate the relationship between two variables → association between two variables
- Convey patterns in data pairs that would not be apparent from a table.
Time Series Plot
- Used to study patterns in the values of a variable over time.
- One variable is measured on the X axis
- The time period is measured on the Y axis.
- Can display several variables at once Log Scales Useful for time series data : be expected to grow at a compound annual percentage rate (e.g., GDP, the national debt, or your future income). Reveal whether the quantity is growing at an
- increasing percent (concave upward),
- constant percent (straight line)
- declining percent (concave downward). Deceptive Graphs Error 1: Nonzero Origin: A nonzero origin will exaggerate the trend
Error 8: Complex Graphs: Complicated visual displays make the reader work harder. Error 11: Area Trick : Simultaneously enlarging the width of the bars as their height increases → bar area misstates the true proportion
CHAPTER 4
4.1 NUMERICAL DESCRIPTION
Descriptive measures derived from:
- a sample (n items): statistics
- a population (N items or infinite): parameters Three key characteristics: center, variability, and shape.
4.2 MEASURES OF CENTER
- Used to measure the rate of change of a variable over time 𝐗𝐆 = ( 𝐱𝟏 𝐱𝟐 𝐱𝟑 … 𝐱𝐧) 𝟏 𝐧
- Geometric mean of rate of return o Measures the status of an investment over time 𝐑𝐆 = [(𝟏 + 𝐑𝟏)(𝟏 + 𝐑𝟐)(𝟏 + 𝐑𝟑) … (𝟏 + 𝐑𝐧)] 𝟏 𝐧 (^) − 𝟏 Where Ri is the rate of return in time period i SUMMARY TABLE Quartiles The quartiles ( denoted Q1, Q2, Q3 ): scale points that divide the ordered data into four groups of approximately equal size : the 25th, 50th, and 75th percentiles
- Q 1 is the value for which 25% of the observations are smaller and 75% are larger
- Q 2 is the same as the median (50% are smaller, 50% are larger)
- Only 25% of the observations are greater than Q 3 The first quartile Q1 is the median of the data values below Q2 , and the third quartile Q3 is the median of the data values above Q
Find a quartile by determining the value in the appropriate position (𝒙𝒏) in the ordered
data
First quartile position: Position of Q 1 = (N+1)/
Second quartile position: Position of Q 2 = (N+1)/2 (the median position)
Third quartile position: Position of Q 3 = 3(N+1)/
where N is the number of observed values The value of Q1, Q2, Q3 is the value between 𝒙𝒏−𝟏 and 𝒙𝒏+𝟏 : =
Box and Whisker Plot
Box-and-Whisker Plot: A Graphical display of data using 5-number summary 𝑥𝑚𝑖𝑛 , Q1, Q2, Q3, 𝑥𝑚𝑎𝑥 A box plot shows
- center (position of the median Q2)
- variability ( width of the “box” defined by Q1 and Q3 and the range between 𝑥𝑚𝑖𝑛 and 𝑥𝑚𝑎𝑥)
- shape ( skewness if the whiskers are of unequal length and/or if the median is not in the center of the box)