Descriptive Statistics: Measures of Location and Variability, Slides of Probability and Statistics

A chapter from a statistics textbook focusing on descriptive statistics. It covers the concepts of populations, samples, and processes, pictorial and tabular methods, measures of location (mean and median), and measures of variability (range and variance). The document also includes examples and exercises.

Typology: Slides

2023/2024

Uploaded on 01/25/2024

john-dow-2
john-dow-2 🇺🇸

1 document

1 / 57

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Chapter 1: Overview and Descriptive Statistics
Instructor: Shengwen Guo
Department of Mathematics and Statistics, UNC Charlotte
January 11, 2024
1 / 57
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b
pf1c
pf1d
pf1e
pf1f
pf20
pf21
pf22
pf23
pf24
pf25
pf26
pf27
pf28
pf29
pf2a
pf2b
pf2c
pf2d
pf2e
pf2f
pf30
pf31
pf32
pf33
pf34
pf35
pf36
pf37
pf38
pf39

Partial preview of the text

Download Descriptive Statistics: Measures of Location and Variability and more Slides Probability and Statistics in PDF only on Docsity!

Chapter 1: Overview and Descriptive Statistics

Instructor: Shengwen Guo

Department of Mathematics and Statistics, UNC Charlotte

January 11, 2024

Contents

(^1) Populations, Samples, and processes

(^2) Pictorial and Tabular Methods in Descriptive Statistics

(^3) Measures of Location

4 Measures of Variability

Sample: Constraints on time, money, and other scarce resources usually make a census impractical or infeasible. Instead, a subset of the population—a sample—is selected in some prescribed manner.

Example

approximately 5.3 million passenger cars were sold in the U.S. in 2018. We randomly select some cars (say 200), and find the average price of these 200 cars.

Characteristics: We are usually interested only in certain characteristics of the objects in a population:

salary of a software engineer. gender of an engineer graduate price of a car

A characteristic may be categorical e.g. gender, race, hair color. A characteristic may be numerical e.g. age = 23, height = 6 feet, weight = 160 lb.

Variable: A variable is any characteristic whose value may change from one object to another in the population. We shall initially denote variables by lowercase letters from the end of our alphabet. Examples include

x = brand of calculator owned by a student y = number of visits to a particular Website during a specified period z = braking distance of an automobile under specified conditions

Examples

  • (^) We might determine the type of transmission, automatic (A) or manual (M), on each of ten automobiles recently purchased at a certain dealership, resulting in the categorical data set:

M A A A M A A M A A

  • (^) Our data set might consist of a (height, weight) pair for each basketball player on a team, with the first observation as ( 72 , 168 ), the second as ( 75 , 212 ), and so on.
  • (^) A research physician might determine the systolic blood pressure, diastolic blood pressure, and serum cholesterol level for each patient participating in a study.

Branches of Statistics

Two branches: descriptive statistics and inferential statistics. Descriptive statistics: An investigator who has collected data may wish simply to summarize and describe important features of the data. Some of these methods are graphical in nature; histograms, boxplots, and scatter plots. Other descriptive methods involve calculation of numerical summary measures; means, standard deviations, and correlation coefficients.

Pictorial and Tabular Methods in Descriptive Statistics

Descriptive statistics can be divided into two general subject areas. Visual displays: stem-and-leaf display, histogram, dotplot. Numerical summary measures: mean, median, standard deviation.

Notations

  • (^) The number of observations in a single sample, that is, the sample size, will often be denoted by n.
  • (^) If two samples are simultaneously under consideration, either m and n or n 1 and n 2 can be used to denote the numbers of observations.
  • (^) Given a data set consisting of n observations on some variable x, the individual observations will be denoted by x 1 , x 2 , · · · , xn. The subscript bears no relation to the magnitude of a particular observation.

Stem-and-Leaf Displays

Consider a numerical data set x 1 , x 2 , x 3 , · · · , xn. A quick way to obtain an informative visual representation of the data set is to construct a stem-and-leaf display.

Interpretation

The first observation in the top row of the display is 5.0, corresponding to a stem of 5 and leaf of 0, and the last observation at the bottom of the display is 10.6. Note that in the absence of a context, without the identification of stem and leaf digits in the display, we wouldn’t know whether the observation with stem 7 and leaf 9 was .79, 7.9, or 79. The leaves in each row are ordered from smallest to largest; this is commonly done by software packages but is not necessary if a display is created by hand. The display suggests that a typical or representative sleep time is in the stem 8L row, perhaps 8.1 or 8.2.

Dotplots

  • (^) A dotplot is an attractive summary of numerical data when the data set is reasonably small or there are relatively few distinct data values.
  • (^) Each observation is represented by a dot above the corresponding location on a horizontal measurement scale.
  • (^) When a value occurs more than once, there is a dot for each occurrence, and these dots are stacked vertically. As with a stem-and-leaf display, a dotplot gives information about location, spread, extremes, and gaps.

Figure 1.6 shows a dotplot of the data. There is clearly a great deal of state-to-state variability.

The largest value, for D.C., is obviously an extreme outlier, and four other values on the upper end of the data are candidates for mild outliers (MA, MN, NY, and ND). There is also a cluster of states at the low end, primarily located in the South and Southwest.

Histograms

Definition (Discrete and Continuous Variable)

A numerical variable is discrete if its set of possible values either is finite or else can be listed in an infinite sequence (one in which there is a first number, a second number, and so on). A numerical variable is continuous if its possible values consist of an entire interval on the number line.

A discrete variable x almost always results from counting, in which case possible values are 0, 1, 2, 3,... or some subset of these integers. Continuous variables arise from making measurements. For example, if x is the pH of a chemical substance, then in theory x could be any number between 0 and 14: 7.0, 7.03, 7.032, and so on.

The relative frequency of a value is the fraction or proportion of times the value occurs:

relative frequency of a value = number of times the value occurs number of observations in the data set

Example

Suppose that our data set consists of 200 observations on x = the number of courses a college student is taking this term. If 70 of these x values are 3, then relative frequency of the x value 3 is 20070 = .35.

Multiplying a relative frequency by 100 gives a percentage; in the college-course example, 35% of the students in the sample are taking three courses.

A frequency distribution is a tabulation of the frequencies and/or relative frequencies.

Constructing a Histogram for Discrete Data

(^1) determine the frequency and relative frequency of each x value. (^2) mark possible x values on a horizontal scale. (^3) above each value, draw a rectangle whose height is the relative frequency of that value.

This construction ensures that the area of each rectangle is proportional to the relative frequency of the value. Thus if the relative frequencies of x = 1 and x = 5 are .35 and .07, respectively, then the area of the rectangle above 1 is five times the area of the rectangle above 5.