Computer Science Topic 1. Descriptive Statistics

1) The Mean

Overview of Mean
• The mean is a measure of central tendency in a population or sample.
• The population mean is denoted by the Greek letter mu (μ) and is calculated by summing all the observations and dividing by their total number.
• For samples, the lowercase n is used instead of an uppercase N.

Mean in Action
• To find the mean of a sample, add all the observations and divide by the total number of observations.
• For example, the mean of the sample 10, 28, 28, 33 and 54 is 30.6.
• As a convention, the mean is provided to one more decimal place than the original data set was given in.

Advanced Mean
• To calculate a weighted mean, multiply each observation by its frequency, sum the products, and divide by the total number of observations.
• This can also be expressed as summing the products of the observations and their frequencies, divided by the sum of the frequencies.
• The mean can also be calculated for categorical variables, once the categories are coded as numbers.
• Finally, a challenge question can be posed to test understanding of the mean.

Formula for calculating mean
• A formula for calculating the mean looks ugly but should be quite intuitive.
• It involves multiplying all the values of X by their frequencies and summing them up, then dividing by the sum of all frequencies.

Calculating mean of categorical data set
• It is possible to find the mean of a categorical data set by defining all the females as ones and males as zeros, making it a numerical data set.
• To calculate the mean, add up all of those values and divide by the total number of values.
• The mean of a binary variable gives you the proportion of the category that we defined as 1.

Georgia's challenge
• Georgia's challenge is to calculate the weighted average mark that she has achieved in her statistics degree thus far.
• She has done 4 subjects and has marks and credit points for each of them.
• To calculate her weighted average mark, she must incorporate the credit point information.

• Rates over a fixed distance are averaged by flipping the rates, so that they are no longer distance over time but time over distance, and then flipping the result back once the average is found.

Calculating the Harmonic Mean
• Using the formula, the harmonic mean of 2 km/h and 3 km/h is found to be 2.4 km/h, which is closer to 2 km/h than 3 km/h because the distance is fixed and more time is spent at the slower speed.

Understanding the Harmonic Mean
• The harmonic mean is named after the idea of harmonic overtones in music.
• The wavelengths of the overtones form a harmonic series (1, 1/2, 1/3, 1/4, ...), and the harmonic mean of the wavelengths 1/2 and 1/4 is 1/3.
• This is due to the inverting that happens between frequency and wavelength, which is why the harmonic average or mean is used.

3) The Median

What is median?
• The median is the middle number of a series when ordered, and is represented by the letter 'M'.
• In the example series of 10, 28, 28, 33 and 54, the median is 28.

How to calculate median?
• When the number of observations is odd, the median is the middle observation.
• When the number of observations is even, the median is the mean of the two middle observations.

Mean vs Median
• Symmetric distributions (e.g. uniform distributions and bell curves) have the same mean and median.
• However, when the distribution is skewed, the mean and median will differ. In this case, the median is often the more reliable guide to the typical value.

Symmetric Distribution
• When the distribution is symmetric, the mean and median are roughly equal.
• If the distribution is perfectly symmetric, the mean will equal the median precisely.

Non-Symmetric Distribution
• When the distribution has a long tail to the right, the mean will be higher than the median.
• This happens because the mean is calculated by incorporating the values of all the observations, so an outlier observation on the high side will pull the mean up.

Median vs. Mean
• The median is often a better measure of central tendency when the data set is skewed, because the median is more robust and doesn't budge as far.
• Examples of scenarios where the median is preferred are house prices and other scenarios where there are extreme values on one side.

4) The Mode

Definition of the Mode
• The mode is the observation with the highest frequency. The word comes from the Latin modus, which has a musical connotation.
• Mode is also the French word for fashion, meaning something is very fashionable.
• In small samples the mode can be fickle, and can change with the addition or removal of a single observation.
• In larger samples the mode becomes a more reliable measure of central tendency.

Mode in Practice
• The mode in this data set is 28, as it has the highest frequency.
• It could change if an observation is changed from 28 to 27.
• In this case there would be five modes with a frequency of one.
• The mode is more reliable in larger samples, such as a census.
• In this example, the mode is two children per family as it has the highest frequency.

Mean and Median
• The median of this distribution is two children per family, as that is the value of the 2.5 millionth observation when the data is ordered from smallest to largest.
• The mean of this distribution is 2.26, which is found by multiplying each number of children by the number of families (in millions), summing them up and dividing by five (the total number of families, in millions).
• The mode, median and mean can all be different from each other in certain distributions.

Relationship between measures of central tendency
• When dealing with a symmetric distribution with a single peak (e.g. a bell curve), the mean, median and mode are all equal and sit in the center of the distribution.

6) Interquartile Range (IQR) | Box and Whisker Plot

Definition of Range and IQR
• Range is a measure of the spread of a data set, and is the maximum value minus the minimum value.
• The interquartile range (IQR) is the third quartile minus the first quartile, or the number three quarters of the way into the ordered data set minus the number one quarter of the way in.

Advantages of IQR
• The interquartile range is largely unaffected by outliers, as it does not depend on the maximum and minimum values.
• This often makes it a better measure of the spread of data than the range, which is driven entirely by the two most extreme values.

Box and Whisker Plot
• The box and whisker plot is a visual way to inspect the range and IQR.
• The whiskers extend from the minimum to the maximum value, while the box in the middle is marked by three horizontal lines.
• The top line is the third quartile, the bottom line is the first quartile, and the median (or second quartile) lies between them.

Understanding Vertical Configuration of Box and Whisker Plots
• In the vertical configuration of a box and whisker plot, quartile 1 is at the bottom of the box, the median is the strip inside the box and quartile 3 is at the top of the box.
• The whiskers indicate the maximum and minimum, but some statistical packages plot outliers separately and exclude them from the whiskers.

Calculating the Range and Interquartile Range
• Range is the difference between the maximum and minimum values, which in this case is 13.9.
• Interquartile range is the difference between the third and first quartiles, which in this case is 4.4.

Knowing that 25% of Data Exists in Each Section
• 25% of the data exists in the top whisker, another 25% between the median and third quartile, another 25% between the first quartile and the median, and the final 25% in the bottom whisker.

Summarizing
• A box and whisker plot gives a useful representation of the spread of data, and can be used to read off the range and the interquartile range.

7) Variance and Standard Deviation: Why divide by n - 1?

Calculating the Mean
• The mean is the sum of all observations divided by the number of observations, which is 12 in this example.
Calculating the Variance
• The variance is calculated by taking each observation's deviation from the mean and squaring it.
• The squared deviations are summed, and the result is then divided by n minus 1.

Calculating the Standard Deviation
• The standard deviation is calculated by taking the square root of the variance.
• This gives a number that is more intelligible because it is on the same scale as the data set.

Why Bother with the Variance?
• The variance is used to describe the spread of a data set.
• It is difficult to assess whether it is high or low because it is in squared units, but it is more statistically advanced than measuring the range of the data set.

Why Divide by n-1?
• The variance, and therefore also the standard deviation, incorporates all observations in the data set.
• This gives a fuller measure of the spread of the data set than the range, which depends only on the two most extreme observations and is therefore more susceptible to outliers.

Estimating Population Mean
• The sample mean lies in the middle of the data set, so the negative and positive deviations from the mean cancel out and always sum to zero.
• The average squared deviation from the mean is used instead.
• Squaring a deviation from the mean makes it positive; all the squared deviations are added together and the average is taken.
• The absolute value of the deviations from the mean is not used because it does not carry over to higher order moments.

Why Divide by n-1?
• The variance is the average squared deviation from the population mean.
• The population mean is unknown, so the sample mean is used to estimate it.
• The sample mean makes the sum of squared deviations as small as it can possibly be, wherever the population mean happens to lie, so using it tends to underestimate the spread around the population mean.

Subtracting one from the denominator
• Dividing by n minus 1 instead of n makes the result slightly larger, which compensates for that underestimate.
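The n minus 1 reasoning above can be checked numerically. The following is only a minimal sketch: it uses the five-value sample 10, 28, 28, 33, 54 for illustration, and the helper name `sum_sq_dev` is made up for this example rather than taken from the notes.

```python
import math
import statistics

# Illustrative five-value sample (mean 30.6)
sample = [10, 28, 28, 33, 54]
n = len(sample)
mean = sum(sample) / n


def sum_sq_dev(values, centre):
    """Sum of squared deviations of the values from a chosen centre."""
    return sum((x - centre) ** 2 for x in values)


ss = sum_sq_dev(sample, mean)      # squared deviations about the sample mean
var_n = ss / n                     # dividing by n
var_n_minus_1 = ss / (n - 1)       # dividing by n - 1 (sample variance)
sd = math.sqrt(var_n_minus_1)      # sample standard deviation

# The sample mean gives the smallest possible sum of squares:
# any other candidate centre produces a sum at least as large.
others = [sum_sq_dev(sample, c) for c in (25, 30, 35)]

print(var_n, var_n_minus_1, sd)
print(all(o >= ss for o in others))
```

Python's standard `statistics.variance` also divides by n minus 1, so it should agree with `var_n_minus_1` here, while `statistics.pvariance` divides by n.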
Conclusion
• Zedstatistics wraps up the video and points viewers to his other videos on zedstatistics.com.

9) What is Skewness? A detailed explanation

Overview of Skewness
• Zedstatistics introduces the concept of skewness and discusses how skewness can be viewed in a distribution.
• Pearson's calculation investigates differences in the mean, median, and mode, while the moment based calculation is a more statistically rigorous approach.
• Visuals of different magnitudes of skewness in a distribution will be shown and the video will end with a challenge question.

Identifying Skewness
• To identify skewness, it is important to look at the tail of the distribution to see which direction it is pointing.
• If it is pointing to the left, it is negative (or left) skew, and if it is pointing to the right, it is positive (or right) skew.

Symmetrical Distribution
• A symmetrical distribution has no skew and the mean, median, and mode are equal and located at the center of the distribution.

Positive Skew
• In a positively skewed distribution, the mode is the highest peak of the distribution and the median is dragged up due to more observations on the right side of the mode.
• The mean is pulled up further due to extreme values on the positive side.

Negative Skew
• In a negatively skewed distribution, the mode is still the highest peak of the distribution and the median is dragged down due to more observations on the left side of the mode.
• The mean is pulled down further due to extreme values on the negative side.

Calculating Skewness
• Karl Pearson proposed a measure for skewness which is calculated by taking the mean minus the mode and dividing by the standard deviation.
• A higher absolute value indicates more skewness, and the further apart the mean, median, and mode are, the more skewed the dataset is.

Introduction
• Skewness is the measure of how much the distribution of a dataset deviates from symmetry.
• It indicates how unevenly the values are spread on either side of the mean.

Calculating skewness
• Pearson's first measure subtracts the mode from the mean and divides by the standard deviation.
• This calculation can be unreliable in smaller datasets, so a more robust calculation uses the median instead of the mode: three times the difference between the mean and the median, divided by the standard deviation.

Moment based calculation
• Moments are a way of describing a dataset by summing up (and averaging) powers of the values, such as the squared values (second moment) or the cubed values (third moment).
• The second centralized moment is based on the sum of squared distances from the mean, and the third centralized moment is based on the sum of cubed distances from the mean.
• The averaged second centralized moment is the population variance; to get the population skew, the averaged third centralized moment is divided by sigma cubed.

Adjusting for sample values
• When using the sample mean instead of the population mean for estimation, we need to account for the differing degrees of freedom.
• This adjustment is made by replacing the population mean and variance with the sample mean and variance, resulting in a cumbersome equation that takes into account the sample size.

Calculating skewness
• Most statistical packages use this adjusted equation for the skewness calculation.
• Microsoft Excel uses this formula to calculate skew in a dataset, via the =SKEW() function entered in one of the cells.
• This equation is a sampling adjustment, accounting for the fact that the population mean and variance are replaced with the sample mean and variance.

Visualizing skewness
• Skewness of 0 is approximately symmetric.
• Skewness of 0.5 to 1 is moderately skewed.
• Skewness of 1.5 or higher is highly skewed.
• Visualizing skewness shows the data set with a long tail to the right for positive skewness, and a long tail to the left for negative skewness.

• Westfall wrote a paper called "Kurtosis as Peakedness, 1905–2014. R.I.P." to explain why the peakedness doesn't matter, and that kurtosis is all about the tails.
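Returning to the skewness calculations described above, here is a small sketch of the two approaches. The function names and data sets are hypothetical, and the adjusted formula, n / ((n - 1)(n - 2)) times the sum of cubed standardized deviations, is assumed to be the sampling adjustment the notes refer to (it is the form documented for Excel's SKEW function).

```python
import statistics


def pearson_median_skew(values):
    """Pearson's median-based measure: 3 * (mean - median) / standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1)
    return 3 * (mean - median) / s


def adjusted_sample_skew(values):
    """Moment-based skewness with the sample-size adjustment out the front."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three observations")
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    cubed = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed


# Hypothetical data: one long right tail, one symmetric set
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]
symmetric = [1, 2, 3, 4, 5]

print(pearson_median_skew(right_skewed), adjusted_sample_skew(right_skewed))
print(pearson_median_skew(symmetric), adjusted_sample_skew(symmetric))
```

Both measures should come out positive for the right-tailed data and close to zero for the symmetric data; mirroring the data around zero flips the sign.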
Exploring the effect of outliers
• The numerator of the kurtosis formula is the sum of (x − μ)^4, which means that any observations far from the mean contribute greatly to the summation.
• The denominator (σ^4) doesn't fully compensate for the increase in the numerator, as the power of 4 is so strong.
• Peter Westfall's idea is that the kurtosis is formed by the outliers, as observations close to the mean have a minimal effect on the calculation of kurtosis.

No longer referring to peakedness
• Kurtosis is no longer related to peakedness, but to the thickness of the tails, or the presence of outliers and how far away they tend to be.
• While the mathematics might not be straightforward, it's clear from the formula that the power of 4 in the numerator is so strong that the denominator does not compensate for the increase.

Other topics
• The video series will look at other topics such as the moments and standard error of the sample mean, and boutique measures of central tendency and spread.

11) Standard Error (of the sample mean) | Sampling | Confidence Intervals | Proportions

Definition of Standard Error
• Standard Error of the Sample Mean (SE) is an output from Microsoft Excel's Descriptive Statistics package.
• It is not typically associated with Descriptive Statistics, but acts as a bridge between basic descriptive measures and advanced statistics.
• SE is calculated from two other measures: the standard deviation (s) and the sample size (n). The formula is SE = s/√n.

How does Standard Error relate to confidence?
• The standard error discussed here relates only to the sample mean.
• If a small sample is used to estimate the population mean, it is not possible to be confident about the estimate.
• As the sample size increases, the variation around the sample mean decreases, and the confidence in the estimate increases.
• The Standard Error is a measure of this uncertainty in the sample mean, and it decreases as n increases.
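The formula SE = s/√n can be sketched in a few lines. The standard deviation of 12.0 and the sample sizes below are assumptions chosen purely for illustration, and `standard_error` is a hypothetical helper name.

```python
import math


def standard_error(s, n):
    """Standard error of the sample mean: s divided by the square root of n."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return s / math.sqrt(n)


s = 12.0  # assumed sample standard deviation, for illustration only
for n in (5, 50, 500):
    # The standard error shrinks as the sample size grows
    print(n, round(standard_error(s, n), 2))
```

Holding s fixed, a hundredfold increase in n cuts the standard error by a factor of ten, which is why larger samples give more confidence in the estimate.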
An example
• Suppose the goal is to estimate the average IQ of statistics students.
• Five students take an IQ test, and the highest IQ is 127, the lowest is 94. The average IQ is 112.
• The uncertainty of this estimate is high, since the sample size is small.
• If 50 students take the IQ test, the average IQ is 115.3, and the uncertainty is lower.
• If 500 students take the IQ test, the average IQ is 114.7, and the confidence in the estimate is much higher.

Standard Error Calculations
• The standard error of the sample mean is calculated by taking the standard deviation of the dataset and dividing by the square root of the number of observations.
• In this example, the standard deviation of the dataset is 12.72 and the number of observations is 5, resulting in a standard error of 5.69.

Probability Distribution
• An interval is constructed where the true population mean is expected to lie, with the sample mean being the middle of the interval.
• The interval is based on a probability distribution, with the population mean being more likely to be closer to the sample mean than further away.
• This probability distribution is a t-distribution, which assumes the population is normally distributed.

Confidence Interval
• To find the 95% confidence interval, the standard error of the sample mean (5.69) is multiplied by the appropriate point on the t-distribution (the 0.975 point), and the product is added to and subtracted from the sample mean (112).
• This point is found using the Excel function TINV, with 4 degrees of freedom (n-1).

Calculating Standard Error
• The most basic of software can now make these calculations, and using Excel, the t-distribution point can be calculated as 2.78.
• Adding the product of this value and the standard error to the sample mean will give the upper limit of the 95% confidence interval, and subtracting the product will give the lower limit.
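The interval for the five-student example can be reproduced as a rough sketch. The critical value 2.776 is hard-coded here as the standard t-table value for the 0.975 point with 4 degrees of freedom, because Python's standard library has no t-distribution quantile function; the variable names are illustrative.

```python
import math

sample_mean = 112.0  # average IQ of the five students
s = 12.72            # sample standard deviation from the example
n = 5                # number of students

# 0.975 point of the t-distribution with n - 1 = 4 degrees of freedom,
# taken from a standard t-table (hard-coded assumption)
t_crit = 2.776

se = s / math.sqrt(n)          # standard error of the sample mean
margin = t_crit * se           # half-width of the 95% interval
lower = sample_mean - margin
upper = sample_mean + margin

print(round(se, 2), round(lower, 2), round(upper, 2))
```

With so few observations the interval is wide, roughly 96 to 128, which matches the point that a small sample leaves a lot of uncertainty about the population mean.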
Higher order moments
• The crude third and fourth moments can be calculated by taking all the distances from 0, raising them to the power of 3 or 4 respectively, and finding the average of those.
• By subtracting the mean each time, the centered second moment can be used to get an indication of the spread of the dataset.
• The skewness is related to the third moment, having netted out the effect of the first and second moments, and the kurtosis is related to the fourth moment.

Netting out the second and first moments
• We are getting four measures, which we can call the mean, variance, skewness and kurtosis.
• There are equations for kurtosis that adjust for skewness.
• The concept of kurtosis doesn't require skewness, as it can be calculated for a symmetrical distribution.

Calculating moments
• We need the population mean (mu) and the population standard deviation (sigma) for these calculations.
• When taking a sample, we only have the sample mean and the sample variance.
• The degrees of freedom change when estimating mu, so the denominator is n minus 1.
• For the third moment, we need to estimate two things (mu and sigma), so there is an n over (n minus 1)(n minus 2) out the front.
• For the fourth moment, we need to adjust for the degrees of freedom and use n terms on the front and back.

Summary
• The sample mean is given by the sum of x over n.
• The sample variance is given by the sum of (x minus x-bar) squared over n minus 1.
• The sample skewness involves cubed deviations, plus the sample adjustment.
• The kurtosis is the standardized fourth moment, adjusted for the fact that we only have a sample and do not have the population parameter values.

13) What is Covariance? What is Correlation?

What is covariance and correlation?
• Covariance and correlation both describe the relationship between numerical variables; for example, temperature and ice cream sales.
• As temperature increases, ice cream sales may also increase, indicating a positive correlation between the two variables.
• Pneumonia presentations may decrease as temperature increases, indicating a negative correlation between the two variables.
• Stock market movements are likely to have little to no correlation with temperature.

Calculating covariance and correlation from a sample
• Calculate the mean of each variable (x, y).
• Find the deviations from the respective means for each variable.
• Multiply the deviations together, observation by observation, to get a numerical measure of the relationship between the variables.
• Positive products point towards a positive correlation, while negative products point towards a negative correlation.

Calculating Covariance
• Sigma subscript XY is the sum of the final column divided by n minus 1, which is a kind of mean for the final column.
• If the numbers in the final column are mostly negative, the covariance will be negative.
• Dividing by n minus one is to do with degrees of freedom, and is explained further in a video linked in the description.

Comparing Covariance to Variance
• The covariance and variance formulas look almost the same, with the difference being that covariance is calculated with two variables instead of one.

Calculating Correlation
• Correlation is given the symbol rho XY and is the covariance divided by the product of the two standard deviations.
• A correlation of 1 means that the two variables are perfectly positively correlated, a correlation of -1 means they are perfectly negatively correlated, and a correlation of 0 means there is no linear relationship between them.
• A correlation of 0.8200 is quite high and suggests a strong positive relationship between the two variables.
• This can be calculated using Excel formulas.

Introduction
• In this example, worked through with some Excel formulas, X and Y each have their respective outcomes and probabilities, and there is a negative covariance or correlation between their values.

Calculating the Mean
• To calculate the expected value of X, the formula sums the product of X and the probability, which weighs each of the values by their probability of occurrence.
• Similarly, the expected value of Y is found by summing the product of Y and the probability.

Calculating the Deviations
• To find the deviations from the mean, each value of X or Y is compared to its expected value to determine how much it differs.
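The expected value and deviation steps above can be sketched as follows. The three scenarios, their probabilities and the X and Y outcomes are entirely hypothetical, chosen so that the covariance comes out negative as in the example described; carrying the deviations through to a probability-weighted covariance and correlation is an assumption about where the calculation goes next.

```python
import math

# Hypothetical scenarios: (probability, outcome of X, outcome of Y)
scenarios = [
    (0.2, 10, 4),
    (0.5, 20, 3),
    (0.3, 30, 1),
]

# Expected values: each outcome weighted by its probability of occurrence
ex = sum(p * x for p, x, _ in scenarios)
ey = sum(p * y for p, _, y in scenarios)

# Deviations of each outcome from its expected value
dev_x = [x - ex for _, x, _ in scenarios]
dev_y = [y - ey for _, _, y in scenarios]
probs = [p for p, _, _ in scenarios]

# Probability-weighted products of deviations give the covariance
cov = sum(p * dx * dy for p, dx, dy in zip(probs, dev_x, dev_y))

# Probability-weighted squared deviations give the variances
var_x = sum(p * dx ** 2 for p, dx in zip(probs, dev_x))
var_y = sum(p * dy ** 2 for p, dy in zip(probs, dev_y))

# Correlation: covariance divided by the product of the standard deviations
corr = cov / (math.sqrt(var_x) * math.sqrt(var_y))

print(ex, ey, cov, corr)
```

Because the outcomes are weighted by probabilities rather than treated as a sample, there is no n minus 1 adjustment in this version.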