STAT 515 --- STATISTICAL METHODS

Statistics: The science of using data to make decisions and draw conclusions

Two branches:

Descriptive Statistics: The collection and presentation (through graphical and numerical methods) of data. Tries to look for patterns in data and summarize information.

Inferential Statistics: Drawing conclusions about a large set of individuals based on data gathered on a smaller set.

Example:

Some Definitions

Population: The complete collection of units (individuals or objects) of interest in a study

Variable: A characteristic of an individual that we can measure or observe

Examples:

Sample: A (smaller) subset of individuals chosen from the population

In statistical inference, we use sample data (the values of the variables for the sampled individuals) to make some conclusion (e.g., an estimate or prediction) about the population.

Example:

How reliable is this generalization to the population? For inference to be useful, we need some genuine measure of its reliability.

Types of Data:

Quantitative (Numerical) Data: Measurements recorded on a natural numerical scale (can perform mathematical operations on data).

Qualitative (Categorical) Data: Measurements classified into one of several categories.

Two ways to describe data:
  Graphs and Plots
  Numerical Statistics

Describing Qualitative Data

Data are categorized into classes.
The number of observations (data values) in a class is the class frequency.
Class Relative Frequency = class frequency / n
The CRF’s of all classes add up to 1.

Example:

Graphical Displays:

Bar graph: Height of bars indicates frequencies for each category (see p. 32)

Pie chart: Area of “pie slices” indicates relative frequency for each class (see p. 32)

Describing Quantitative Data

To detect and summarize patterns in a set of numerical data, we use:

Dot Plots: These represent each data value with a dot along a numerical scale. When data values repeat, the dots pile up vertically at that value.

Stem-and-leaf Display (good for small data sets): Separate each number in a data set into a stem and a leaf (usually the last digit). There is a column of all the stems in the data set, and at each stem, the corresponding leaf digits line up to the right.

Example:

See p. 42 for another example.

Histogram: Numerical data values are grouped into measurement classes, defined by equal-length intervals along the numerical scale. Like a bar graph, a histogram is a plot with the measurement classes on the horizontal axis and the class frequencies (or relative frequencies) on the vertical axis. For each measurement class, the height of the bar gives the frequency (or RF) of that class in the data.

Example:

Guidelines for Selecting Measurement Class Intervals:
  Use intervals of equal width
  Each data value must belong to exactly one class
  Commonly, between 5 and 12 classes are used

Example: Mark McGwire’s Home Run totals (1987-1998).

Ordered Data: 9, 9, 22, 32, 33, 39, 39, 42, 49, 52, 58, 70.

Two years with 9 home runs are outliers (unusual values) due to injury and a players’ strike. What if we delete these years? Which measure was more affected by the outliers?

Shapes of Distributions

When the pattern of data to the left of the center value looks the same as the pattern to the right of the center, we say the data have a symmetric distribution.

Picture:

If the distribution (pattern) of data is imbalanced to one side, we say the distribution is skewed.

Skewed to the Right (long right “tail”). Picture:

Skewed to the Left (long left “tail”). Picture:

Comparing the mean and the median can indicate the skewness of a data set.

Other measures of central tendency

Mode: Value that occurs most frequently in a data set.
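
The McGwire example above asks which measure of center is more affected by the two 9-home-run seasons. The following is a minimal Python sketch of that comparison, assuming the two measures in question are the mean and the median (the pair the notes compare when discussing skewness). It also reports the mode as just defined. The variable names are illustrative and not part of the course materials.

```python
from statistics import mean, median, multimode

# Ordered home run totals, 1987-1998, as listed in the notes.
home_runs = [9, 9, 22, 32, 33, 39, 39, 42, 49, 52, 58, 70]

# Delete the two 9-home-run seasons that the notes call outliers.
trimmed = [x for x in home_runs if x != 9]

full_mean, full_median = mean(home_runs), median(home_runs)
trim_mean, trim_median = mean(trimmed), median(trimmed)

# multimode returns every value tied for the highest frequency.
modes = multimode(home_runs)

print(f"All 12 years:   mean = {full_mean:.2f}, median = {full_median}")
print(f"Without the 9s: mean = {trim_mean:.2f}, median = {trim_median}")
print(f"Shift in mean:   {trim_mean - full_mean:.2f}")
print(f"Shift in median: {trim_median - full_median:.2f}")
print(f"Mode(s) of all 12 years: {modes}")
```

Deleting the two seasons moves the mean from about 37.83 to 43.6 but moves the median only from 39 to 40.5, so the mean is the measure more affected by the outliers. The full data set has two modes, 9 and 39, since each occurs twice.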
In a histogram, the modal class is the class with the most observations in it.

A bimodal distribution has two separated peaks:

The most appropriate measure of central tendency depends on the data set: Skewed? Symmetric? Categorical?

Numerical Measures of Variability

Knowing the center of a data set is only part of the information about a variable. We also want to know how “spread out” the data are.

Example:

“Mound-shaped” distributions: (roughly symmetric, peak in middle)

Special rule that applies to data having a mound-shaped distribution:

Empirical Rule: For data having a mound-shaped distribution,
  About 68% of the data fall within 1 standard deviation of the mean (between X̄ – s and X̄ + s for samples, or between μ – σ and μ + σ for populations)
  About 95% of the data fall within 2 standard deviations of the mean (between X̄ – 2s and X̄ + 2s for samples, or between μ – 2σ and μ + 2σ for populations)
  About 99.7% of the data fall within 3 standard deviations of the mean (between X̄ – 3s and X̄ + 3s for samples, or between μ – 3σ and μ + 3σ for populations)

Picture:

Example: Suppose IQ scores have mean 100 and standard deviation 15, and their distribution is mound-shaped.

Example: The rainfall data have a mean of 34.9 inches and a standard deviation of 13.7 inches.

What if the data may not have a mound-shaped distribution?

Chebyshev’s Rule: For any type of data, the proportion of data which are within k standard deviations of the mean is at least 1 – 1/k² (for any k > 1).

In the general case, at least what proportion of the data lie within 2 standard deviations of the mean?

What proportion would this be if the data were known to have a mound-shaped distribution?

Rainfall example revisited:

Numerical Measures of Relative Standing

These tell us how a value compares relative to the rest of the population or sample.

Percentiles are numbers that divide the ordered data into 100 equal parts.
The p-th percentile is a number such that at most p% of the data are less than that number and at most (100 – p)% of the data are greater than that number.

Well-known Percentiles:

Median is the 50th percentile.

Lower Quartile (QL) is the 25th percentile: At most 25% of the data are less than QL; at most 75% of the data are greater than QL.

Upper Quartile (QU) is the 75th percentile: At most 75% of the data are less than QU; at most 25% of the data are greater than QU.

Boxplots

The “box” extends from the lower quartile QL to the upper quartile QU. The length of this box is called the Interquartile Range (IQR) of the data.

IQR = QU – QL

The “whiskers” extend to the smallest and largest data values, except for outliers.

Defining an outlier: If a data value is less than QL – 1.5(IQR) or greater than QU + 1.5(IQR), then it is considered an outlier and given a separate mark on the boxplot.

We generally use software to create boxplots.

Interpreting boxplots

A long “box” indicates large variability in the data set.
If one of the whiskers is long, it indicates skewness in that direction.
A “balanced” boxplot indicates a symmetric distribution.
Outliers should be rechecked to determine their cause. Do not automatically delete outliers from the analysis --- they may indicate something important about the population.

Assessing the Shape of a Distribution --

A normal distribution is a special type of symmetric distribution characterized by its “bell” shape.

Picture:

How do we determine if a data set might have a normal distribution?

Check the histogram: Is it bell-shaped?

More precise: Normal Q-Q plot (a.k.a. Normal probability plot). (see p. 261-263)

Plots the ordered data against the z-scores we would expect to get if the population were really normal.

If the Q-Q plot resembles a straight line, it’s reasonable to assume the data come from a normal distribution. If the Q-Q plot is nonlinear, data are probably not normal.
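
The construction of a normal Q-Q plot described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The expected z-scores are taken at the plotting positions (i – 0.5)/n, which is one common convention and may differ from the one used in the textbook or in a given software package. The correlation between the two columns is a rough numerical stand-in for judging straightness by eye, not a rule from these notes. The function names are hypothetical, and the data are the home run totals listed earlier in the notes.

```python
from math import sqrt
from statistics import NormalDist, fmean

def normal_qq_points(data):
    """Return (expected z-scores, ordered data) for a normal Q-Q plot."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()  # standard normal: mean 0, sd 1
    # Assumed plotting positions: (i - 0.5) / n for i = 1, ..., n.
    expected_z = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return expected_z, ordered

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Home run totals from the earlier example in the notes.
home_runs = [9, 9, 22, 32, 33, 39, 39, 42, 49, 52, 58, 70]

expected_z, ordered = normal_qq_points(home_runs)
qq_correlation = pearson_r(expected_z, ordered)

# Each printed pair is one point on the Q-Q plot.
for z, x in zip(expected_z, ordered):
    print(f"{z:6.2f}  {x}")
print(f"Correlation of Q-Q points: {qq_correlation:.3f}")
```

Plotting the printed pairs, with expected z-scores on one axis and ordered data on the other, gives the Q-Q plot. A correlation close to 1 is consistent with a roughly straight line, but the plot itself should still be inspected for curvature.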