Data Terminology, Types, and Sampling Methods: A Comprehensive Guide, Exercises of Business Statistics

A comprehensive overview of data terminology, types, and sampling methods. It covers key concepts such as variables, data sets, categorical and numerical data, time series and cross-sectional data, levels of measurement, and different sampling techniques. The document also includes examples and explanations to illustrate these concepts.

Typology: Exercises

2023/2024

Uploaded on 10/09/2024

huong-trinh-nguyet

Chapter 2

2.1 VARIABLES AND DATA

Data Terminology

  • An observation: a single member of a collection of items that we want to study, such as a person, firm, or region. Ex: an employee, or an invoice mailed last month.
  • A variable: a characteristic of the items that we want to study (e.g., student name, gender, date of birth). Ex: an employee’s income or an invoice amount.
  • Data set: all the values of all of the variables for all of the observations we chose. Data are usually entered into a spreadsheet or database as an n × m matrix (n rows, one per observation; m columns, one per variable).

Categorical and Numerical Data

A data set may contain a mixture of data types. Two broad categories:

  • Categorical (qualitative) data: values that are described by words rather than numbers (nonnumerical values, i.e., verbal labels). The values of a categorical variable may still be represented by numbers, in which case the data are coded.
  • Numerical (quantitative) data: values that arise from counting, measuring something, or some kind of mathematical operation. Two types: discrete (integers) and continuous (physical measurements, financial variables).

Time Series Data and Cross-Sectional Data

  • Time series data: each observation in the sample represents a different, equally spaced point in time (years, months, days). The periodicity is the time between observations. The focus is on trends and patterns over time. Ex: a firm’s sales, market share, debt/equity ratio, employee absenteeism, inventory turnover, and product quality ratings.
  • Cross-sectional data: each observation represents a different individual unit (e.g., a person, firm, geographic area) at the same point in time. The focus is on variation among observations and on relationships. Ex: daily closing prices of a group of 20 stocks recorded on December 1, 2015.

Combine the two data types to get pooled cross-sectional and time series data. Ex: monthly unemployment rates for the 13 Canadian provinces and territories for the last 60 months.

2.2 LEVEL OF MEASUREMENT

Four levels of measurement for data: nominal, ordinal, interval, and ratio.

Nominal Measurement

Nominal data: the weakest level of measurement and the easiest to recognize. Nominal data merely identify a category. “Nominal” data are the same as “qualitative”, “categorical”, or “classification” data. The only permissible mathematical operation is counting (e.g., frequencies). ➔ No ordering

Ordinal Measurement

Ordinal data: codes connote (imply) a ranking of data values. There is no clear meaning to the distance between values. Like nominal data, ordinal data lack the properties that are required to compute many statistics, such as the average. Ordinal data can be treated as nominal, but not vice versa. ➔ Ordering, but differences have no meaning.

Rule of Thumb: A population may be treated as infinite when the population size N is at least 20 times the sample size n (i.e., when N/n ≥ 20)

Parameters and Statistics

  • A parameter is a specific characteristic of a population
  • A statistic is a specific characteristic of a sample

From a sample of n items, chosen from a population, we compute statistics that can be used as estimates of parameters found in the population.

  • Population mean = μ
  • Population proportion = π
  • Sample mean = x̄
  • Sample proportion = p
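As a rough illustration of the parameter/statistic distinction, the sketch below builds a hypothetical population of invoice amounts (made-up data, not from the notes), computes the parameters μ and π from the whole population, and then computes x̄ and p from a sample as estimates. The threshold of 250 used for the proportion is an arbitrary assumption.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population of N = 1,000 invoice amounts (illustrative data only).
population = [round(random.uniform(10, 500), 2) for _ in range(1000)]

# Parameters: computed from every item in the population.
mu = sum(population) / len(population)                          # population mean
pi = sum(1 for x in population if x > 250) / len(population)    # population proportion above 250

# Statistics: computed from a sample of n = 50 items drawn without replacement.
sample = random.sample(population, 50)
x_bar = sum(sample) / len(sample)                               # sample mean, estimates mu
p = sum(1 for x in sample if x > 250) / len(sample)             # sample proportion, estimates pi

print(f"mu = {mu:.2f}, x_bar = {x_bar:.2f}")
print(f"pi = {pi:.3f}, p = {p:.3f}")
```

The sample values will generally differ somewhat from the population values; that difference is what sampling methods try to keep small.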

Target Population

  • The target population contains all the individuals in which we are interested
  • The sampling frame is the group from which we take the sample

2.4 SAMPLING METHODS

Two main categories: random sampling and non-random sampling

Random Sampling Methods

Simple Random Sample

We denote the population size by N and the sample size by n. In a simple random sample, every item in the population of N items has the same chance of being chosen in the sample of n items. Ex: select one student at random from a list of 15 students.

  • Sampling without replacement: once an item has been selected to be included in the sample, it cannot be considered for the sample again. This becomes a problem when the sample size n is close to the population size N, because it introduces bias (a tendency to overestimate or underestimate). A finite population is effectively infinite if the sample is less than 5 percent of the population (if n/N < 0.05).
  • Sampling with replacement: the same random number could show up more than once, so the same item may be selected again. Duplicates are unlikely when n is much smaller than N.
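A minimal sketch of simple random sampling with and without replacement, using Python's standard `random` module. The function names `simple_random_sample` and `effectively_infinite` are hypothetical helpers introduced here for illustration; the 5 percent threshold is the rule stated above.

```python
import random

def simple_random_sample(items, n, replace=False, seed=None):
    """Draw n items so that every item has the same chance of selection."""
    rng = random.Random(seed)
    if replace:
        # With replacement: the same item may be drawn more than once.
        return rng.choices(items, k=n)
    # Without replacement: a selected item cannot be drawn again.
    return rng.sample(items, n)

def effectively_infinite(N, n):
    """Rule of thumb from the text: treat the population as infinite if n/N < 0.05."""
    return n / N < 0.05

students = list(range(1, 16))  # a list of 15 students, numbered 1..15
print(simple_random_sample(students, 1, seed=42))                # one student at random
print(simple_random_sample(students, 5, seed=42))                # 5 distinct students
print(simple_random_sample(students, 5, replace=True, seed=42))  # duplicates possible
print(effectively_infinite(N=1000, n=40))                        # 40/1000 = 0.04
```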

Systematic Sample

Systematic sample: choose every kth item from a sequence or list, starting from a randomly chosen entry among the first k items on the list.

  • Decide on sample size: n
  • Divide frame of N individuals into n groups of k individuals: k = N/n
  • Randomly select one individual from the first group
  • Select every kth individual thereafter
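The systematic sampling steps above can be sketched as follows. The helper name `systematic_sample` is hypothetical, and using integer division for k when N is not an exact multiple of n is an assumption of this sketch, not something the notes specify.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Pick a random start among the first k items, then take every kth item."""
    N = len(frame)
    k = N // n  # assumption: integer division when N is not a multiple of n
    if k < 1:
        raise ValueError("sample size n cannot exceed the frame size N")
    start = random.Random(seed).randrange(k)  # random position within the first group
    return [frame[start + i * k] for i in range(n)]

frame = list(range(1, 101))            # sampling frame of N = 100 individuals
sample = systematic_sample(frame, 10)  # n = 10, so k = 10
print(sample)
```

Each run returns 10 individuals spaced exactly k = 10 positions apart; only the starting position is random.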

2.6 SURVEYS

SURVEY

  • Step 1: State the goals of the research
  • Step 2: Develop the budget (time, money, staff)
  • Step 3: Create a research design (target population, frame, sample size).
  • Step 4: Choose a survey type and method.
  • Step 5: Design a data collection instrument (questionnaire).
  • Step 6: Pretest the survey instrument and revise as needed.
  • Step 7: Conduct the survey.
  • Step 8: Code the data and analyze the data

Questionnaire Design

  • Begin with short, clear instructions.
  • State the survey purpose.
  • Assure anonymity.
  • Instruct on how to submit the completed survey.
  • Break survey into naturally occurring sections
  • Let respondents bypass sections that are not applicable (e.g., “if you answered no to question 7, skip directly to Question 15”).

Chapter 3

Graphical Presentation of Data

  • Data in raw form are usually not easy to use for decision making
  • Some type of organization like graph or table is needed
  • The type of graph to use depends on the variable being summarized

Tables and Charts for Categorical Data

Summary table

Column/Bar and Pie Chart

  • Column/Bar charts and Pie charts are often used for qualitative/categorical data (categories or nominal scale)
  • Pies or Bars/Columns represent categories
  • Height of bar or size of pie slice shows the frequency or percentage for each category

Column and Bar chart

  • A column chart is a vertical display of data
  • A bar chart is a horizontal display of data

A column chart is easier to read, but a bar chart can be useful when the axis labels are long or when there are many categories.

For two-digit or three-digit integer data, the stem is the tens digit of the data, and the leaf is the ones digit.

Data in ordered array: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41

❖ Completed stem-and-leaf diagram:

  2 | 1 4 4 6 7 7
  3 | 0 2 8
  4 | 1

❖ [Completed stem-and-leaf display using the 100’s digit as the stem]
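As a sketch of how a stem-and-leaf display can be built for the ordered array above, the function below splits each value into a tens-digit stem and a ones-digit leaf. The name `stem_and_leaf` is hypothetical; the sketch assumes non-negative integers and simply skips stems that have no leaves.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return one text row per stem: the stem is value // 10, the leaf is value % 10."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)
    return [
        f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(groups.items())
    ]

data = [21, 24, 24, 26, 27, 27, 30, 32, 38, 41]
for row in stem_and_leaf(data):
    print(row)
```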

Dot Plots

A dot plot is another simple graphical display of n individual values of numerical data. The basic steps in making a dot plot are to

  1. Make a scale that covers the data range
  2. Mark axis demarcations and label them
  3. Plot each data value as a dot above the scale at its approximate location

If more than one data value lies at approximately the same X-axis location, the dots are piled up vertically.

  • Easy to understand
  • Show variability
  • Show the center and where the midpoint lies
  • Reveal some things about the shape of the distribution
  • Not good for large samples (e.g., > 5,000).

Dot plots have limitations.

  • Don’t reveal very much information about the data set’s shape when the sample is small
  • Become awkward when the sample is large (what if you have 100 dots at the same point?)
  • Become awkward when you have decimal data.

Tabulating Numerical Data

Frequency and Cumulative Distributions

  • A table
  • Grouping n data values into k classes called bins (based on values of the data)
  • The bin limits are cutoff points that define each bin.
  • Bins have equal interval widths and their limits cannot overlap

The basic steps for constructing a frequency distribution:

  1. Sort the data in ascending order ➔ Find the smallest and largest data values
  2. Choose the number of bins ➔ Sturges’ Rule: $k = 1 + 3.3 \log_{10}(n)$
  3. Set the bin limits ➔ Bin width ≈ $\dfrac{x_{\max} - x_{\min}}{k}$
  4. Put the data values in the appropriate bin ➔ Count the data values in each bin
  5. Create the table ➔ Show only the absolute frequencies or include the relative frequencies and the cumulative frequencies
Multivariate Categorical Data

Contingency table

Side-by-side bar charts

Scatter Plots

A scatter plot shows n pairs of observations (x1, y1), (x2, y2), ..., (xn, yn) as dots (or some other symbol) on an X-Y graph

  • Investigate the relationship between two variables → association between two variables
  • Convey patterns in data pairs that would not be apparent from a table.

Time Series Plot

  • Used to study patterns in the values of a variable over time.
  • The variable is measured on the Y axis
  • The time period is measured on the X axis.
  • Can display several variables at once

Log Scales

Useful for time series data that can be expected to grow at a compound annual percentage rate (e.g., GDP, the national debt, or your future income). A log scale reveals whether the quantity is growing at an

  • increasing percent (concave upward),
  • constant percent (straight line),
  • declining percent (concave downward).

Deceptive Graphs

Error 1: Nonzero Origin: A nonzero origin will exaggerate the trend.

Error 8: Complex Graphs: Complicated visual displays make the reader work harder.

Error 11: Area Trick: Simultaneously enlarging the width of the bars as their height increases, so that the bar area misstates the true proportion.

Chapter 4

4.1 NUMERICAL DESCRIPTION

Descriptive measures derived from:

  • a sample (n items): statistics
  • a population (N items or infinite): parameters

Three key characteristics: center, variability, and shape.

4.2 MEASURES OF CENTER

  • Geometric mean
    o Used to measure the rate of change of a variable over time

$$X_G = (x_1 \times x_2 \times x_3 \times \cdots \times x_n)^{1/n}$$

  • Geometric mean rate of return
    o Measures the status of an investment over time

$$R_G = [(1 + R_1)(1 + R_2)(1 + R_3) \cdots (1 + R_n)]^{1/n} - 1$$

where $R_i$ is the rate of return in time period i.

Summary Table

Quartiles

The quartiles (denoted Q1, Q2, Q3) are scale points that divide the ordered data into four groups of approximately equal size: the 25th, 50th, and 75th percentiles.

  • Q1 is the value for which 25% of the observations are smaller and 75% are larger
  • Q2 is the same as the median (50% are smaller, 50% are larger)
  • Only 25% of the observations are greater than Q3

The first quartile Q1 is the median of the data values below Q2, and the third quartile Q3 is the median of the data values above Q2.

Find a quartile by determining the value in the appropriate position ($x_n$) in the ordered data:

  • First quartile position: Position of Q1 = (N+1)/4
  • Second quartile position: Position of Q2 = (N+1)/2 (the median position)
  • Third quartile position: Position of Q3 = 3(N+1)/4

where N is the number of observed values. The value of Q1, Q2, or Q3 is the value between $x_{n-1}$ and $x_{n+1}$.
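Returning to the geometric mean and the geometric mean rate of return from the start of Section 4.2, here is a small worked sketch. The function names and the two-period returns (+50% then −50%) are illustrative assumptions, chosen because the arithmetic mean of the returns is 0% even though the investment loses value.

```python
import math

def geometric_mean(values):
    """nth root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))

def geometric_mean_return(returns):
    """[(1 + R1)(1 + R2)...(1 + Rn)]^(1/n) - 1, with returns given as decimals."""
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1

returns = [0.50, -0.50]                   # hypothetical: +50% in period 1, -50% in period 2
arithmetic = sum(returns) / len(returns)  # 0.0, which hides the overall loss
geometric = geometric_mean_return(returns)

print(f"arithmetic mean return: {arithmetic:.4f}")
print(f"geometric mean return:  {geometric:.4f}")
```

Here the total growth factor is 1.5 × 0.5 = 0.75, so the geometric mean rate of return is negative, which matches the fact that the investment ends below where it started.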

Box and Whisker Plot

Box-and-Whisker Plot: a graphical display of data using the 5-number summary: $x_{\min}$, Q1, Q2, Q3, $x_{\max}$

A box plot shows

  • center (position of the median Q2)
  • variability (width of the “box” defined by Q1 and Q3 and the range between $x_{\min}$ and $x_{\max}$)
  • shape (skewness if the whiskers are of unequal length and/or if the median is not in the center of the box)
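The quartile position rule and the 5-number summary can be sketched as below. The helper names are hypothetical. When a position such as (N+1)/4 is not a whole number, this sketch interpolates linearly between the two neighboring ordered values; that interpolation rule is an assumption, since the notes do not show the exact formula. The sketch also assumes at least three observations.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in sorted data.

    Assumption: linear interpolation between the two neighboring values.
    """
    lower = int(position)        # whole part of the position
    fraction = position - lower  # fractional part of the position
    if fraction == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

def five_number_summary(data):
    """Return (x_min, Q1, Q2, Q3, x_max) using positions (N+1)/4, (N+1)/2, 3(N+1)/4."""
    ordered = sorted(data)
    n = len(ordered)
    if n < 3:
        raise ValueError("this sketch assumes at least 3 observations")
    q1 = value_at_position(ordered, (n + 1) / 4)
    q2 = value_at_position(ordered, (n + 1) / 2)
    q3 = value_at_position(ordered, 3 * (n + 1) / 4)
    return ordered[0], q1, q2, q3, ordered[-1]

data = [21, 24, 24, 26, 27, 27, 30, 32, 38, 41]
print(five_number_summary(data))  # Q1 at position 2.75, Q2 at 5.5, Q3 at 8.25
```

Statistical software often uses slightly different quartile conventions, so results from other tools may not match this rule exactly.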