




























































































Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Summary notes for Statistics and a reference guide.
Typology: Schemes and Mind Maps
1 / 108
This page cannot be seen from the preview
Don't miss anything!





























































































Statistics is a branch of mathematics that involves collecting, analysing, interpreting, and presenting data. It provides tools and methods to understand and make sense of large amounts of data and to draw conclusions and make decisions based on the data.
Descriptive Inferential Descriptive Statistics :Descriptive statistics deals with the collection, organization, analysis, interpretation, and presentation of data. It focuses on summarizing and describing the main features of a set of data, without making inferences or predictions about the larger population Inferential statistics : Inferential statistics deals with making conclusions and predictions about a population based on a sample. It involves the use of probability theory to estimate the likelihood of certain events occurring, hypothesis testing to determine if a certain claim about a population is supported by the data, and regression analysis to examine the relationships between variables
Population :Population refers to the entire group of individuals or objects that we are interested in studying. It is the complete set of observations that we want to make inferences about. For example, the population might be all the students in a particular school or all the cars in a particular city. Sample : A sample, on the other hand, is a subset of the population. It is a smaller group of individuals or objects that we select from the population to study. Samples are used to estimate characteristics of the population, such as the mean or the proportion with a certain attribute. For example, we might randomly select 100 students.
Measure of Central Tendency A measure of central tendency is a statistical measure that represents a typical or central value for a dataset. It provides a summary of the data by identifying a single value that is most representative of the dataset as a whole. Mean : The mean is the sum of all values in the dataset divided by the number of values
Measure of Dispersion : A measure of dispersion is a statistical measure that describes the spread or variability of a dataset. It provides information about how the data is distributed around the central tendency (mean, median or mode) of the dataset. Range : The range is the difference between the maximum and minimum values in the dataset. It is a simple measure of dispersion that is easy to calculate but can be affected by outliers. Variance :The variance is the average of the squared differences between each data point and the mean. It measures the average distance of each data point from the mean and is useful in comparing the dispersion of datasets with different means.
Standard Deviation :The standard deviation is the square root of the variance. It is a widely used measure of dispersion that is useful in describing the shape of a distribution Coefficient of Variation (CV) : The CV is the ratio of the standard deviation to the mean expressed as a percentage. It is used to compare the variability of datasets with different means and is commonly used in fields such as biology, chemistry, and engineering. The coefficient of variation (CV) is a statistical measure that expresses the amount of variability in a dataset relative to the mean. It is a dimensionless quantity that is expressed as a percentage. The formula for calculating the coefficient of variation is: CV = (standard deviation / mean) x 100 %
Quantiles are statistical measures used to divide a set of numerical data into equal-sized groups, with each group containing an equal number of observations. Quantiles are important measures of variability and can be used to: understand distribution of data, summarize and compare different datasets. They can also be used to identify outliers. There are several types of quantiles used in statistical analysis, including: Quartiles : Divide the data into four equal parts, Q 1 ( 25 th percentile), Q 2 ( 50 th percentile or median), and Q 3 ( 75 th percentile). Deciles : Divide the data into ten equal parts, D 1 ( 10 th percentile), D 2 ( 20 th percentile), ..., D 9 ( 90 th percentile).
interquartile range :The interquartile range (IQR) is a measure of variability that is based on the five-number summary of a dataset. Specifically, the IQR is defined as the difference between the third quartile (Q 3 ) and the first quartile (Q 1 ) of a dataset.
A box plot, also known as a box-and-whisker plot, is a graphical representation of a dataset that shows the distribution of the data. The box plot displays a summary of the data, including the minimum and maximum values, the first quartile (Q1), the median (Q2), and the third quartile (Q3).
Covariance measures the direction of the relationship between two variables. A positive covariance means that both variables tend to be high or low at the same time. A negative covariance means that when one variable is high, the other tends to be low
In [1]: In [ 2 ]: In [3]: In [ 4 ]: import pandas as pd import random import seaborn as sns import pandas as pd import matplotlib.pyplot as plt import numpy as np df = pd.DataFrame() from re import X x = pd.Series([ 12 , 25 , 68 , 42 , 113 ]) y = pd.Series([ 11 , 29 , 58 , 121 , 100 ]) df['x'] = x df['y'] = y
Correlation is a statistical measure that expresses the extent to which two variables are linearly related (meaning they change together at a constant rate). It's a common tool for describing simple relationships without making a statement about cause and effect.
In [8]: In [ 9 ]:
A probability distribution is an idealized frequency distribution. A frequency distribution describes a specific sample or dataset. It's the number of times each possible value of a variable occurs in the dataset. The number of times a value occurs in a sample is determined by its probability of occurrence. Out[ 8 ]: (^) Text(0.5, 1.0, 'Correlation - 0.6185423626205997') Out[ 9 ]: (^) Text(0.5, 1.0, 'Correlation - 0.6185423626205997') fig, (ax 1 , ax 2 ) = plt.subplots( 1 , 2 , figsize = ( 10 , 3 )) # Plot scatterplots on each axes ax 1 .scatter(df['x'], df['x']) ax 2 .scatter(df['x'], df['y'],color = 'orange') ax 1 .set_title("Correlation - " + str(df['x'].corr(df['x']))) ax 2 .set_title("Correlation - " + str((df['x']).corr(df['y']))) fig, (ax1, ax2) = plt.subplots( 1 , 2 , figsize = ( 10 , 3 )) # Plot scatterplots on each axes ax1.scatter(df['x'], df['y'] , color = 'red') ax2.scatter(df['x'] ***** 2 , df['y']) ax1.set_title("Correlation - " + str(df['x'].corr(df['y']))) ax2.set_title("Correlation - " + str((df['x'] ***** 2 ).corr(df['y'] ***** 2 )))
PMF stands for Probability Mass Function. It is a mathematical function that describes the probability distribution of categorical a discrete random variable. examples of discrete data Nominal (e.g., gender, ethnic background, religious or political affiliation) Ordinal (e.g., extent of agreement, school letter grades) Quantitative variables with relatively few values (e.g., number of times married) The PMF of a discrete random variable assigns a probability to each possible value of the random variable. The probabilities assigned by the PMF must satisfy two conditions: a.The probability assigned to each value must be non-negative (i.e., greater than or equal to zero). b. The sum of the probabilities assigned to all possible values must equal 1. In [ 10 ]: In [ 11 ]: In [ 12 ]: In [13]: Out[ 12 ]: (^10000) Out[ 13 ]: (^) [2, 4, 5, 1, 6] import pandas as pd import random l = [] for i in range( 10000 ): l.append(random.randint( 1 , 6 )) # Here we take values from 1 - 6 len(l) l[: 5 ]
In [14]: In [15]: In [16]: In [17]: In [18]: Out[ 14 ]: (^2 ) 5 1682 3 1677 4 1670 6 1657 1 1586 dtype: int Out[15]: (^2) 0. 5 0. 3 0. 4 0. 6 0. 1 0. dtype: float Out[17]: (^) pd.Series(l).value_counts() pd.Series(l).value_counts() / pd.Series(l).value_counts().sum() s = pd.Series(l).value_counts() / pd.Series(l).value_counts().sum() s.plot(kind = 'bar') # Now rollig two dices and Add the both a and b l = [] for i in range( 10000 ): a = (random.randint( 1 , 6 )) b = (random.randint( 1 , 6 )) l.append(a + b)
In [24]: In [25]:
The cumulative distribution function (CDF) F(x) describes the probability that a random variable X with a given probability distribution will be found at a value less than or equal to x F(x) = P(X <= x) In [26]: In [27]: Out[25]: (^) Out[27]: (^2) 0. 3 0. 4 0. 5 0. 6 0. 7 0. 8 0. 9 0. 10 0. 11 0. 12 1. dtype: float s = (pd.Series(l).value_counts() / pd.Series(l).value_counts().sum()).sort_index s.plot(kind = 'bar') import numpy as np np.cumsum(s) # Total = 1
In [28]: What is the difference between PMF AND CDF PMF : Gives the probability of the particular point (X) CDF : Gives the probability of the All the points up to (x)
PDF stands for Probability Density Function. It is a mathematical function that describes the probability distribution of a continuous random variable. Out[28]: np.cumsum(s).plot(kind = 'bar')
In [30]: In [31]: In [32]: In [33]: Out[30]: (^) array([61.23026396, 48.88218776, 53.37395358, 48.23859345, 38.82775714, 50.71547702, 43.55170549, 59.59037734, 44.07646131, 51.2291823 , 49.50129918, 41.87482911, 42.89516605, 49.83430018, 50.42984119, 48.02929317, 43.66796728, 47.26094984, 48.14397298, 45.30814864, 51.62220736, 44.8594848 , 41.84895438, 47.33320696, 53.85271785, 41.98071327, 51.7260448 , 52.96282035, 50.05249076, 50.19805956, 49.07434682, 53.07117544, 53.86269107, 56.04279745, 40.90454341, 51.77673487, 43.2966207 , 51.75137924, 40.90094009, 54.19609498, 55.26453117, 55.4609019 , 54.4989116 , 54.82385172, 43.10395288, 41.8218877 , 46.58957669, 51.41111365, 52.52098661, 49.19831397, 48.99275682, 62.61191008, 43.65576003, 55.21987601, 44.79056897, 51.78870203, 46.37564734, 51.40565093, 45.28410534, 45.75391457, 49.09427446, 53.86831982, 52.71599378, 44.36834434, 50.74665809, 53.72739067, 49.49842696, 47.32866333, 43.56592226, 46.80359987, 49.25437035, 56.58358384, 60.41122227, 48.2555872 , 52.34446159, 58.39865208, 49.96656899, 60.76236236, 47.51901699, 44.65567849, 49.36043701, 43.655939 , 56.18308073, 47.10868464, 47.8524103 , 44.90322859, 48.05274478, 46.16966608, 49.40622809, 39.38872195, 53.57422306, 50.30220825, 50.92156859, 45.35959403, 44.0693628 , Out[31]: (^) 50. Out[32]: (^) 4. Out[33]: (^) (array([ 3., 28., 92., 180., 282., 205., 139., 52., 15., 4.]), array([35.20388838, 38.31826811, 41.43264784, 44.54702757, 47.6614073 , 50.77578703, 53.89016676, 57.0045465 , 60.11892623, 63.23330596, 66.34768569]), ) sample sample.mean() # average sample.std() _# scale -- standard deviation
plt.hist(sample,bins = 10 )
In [34]: In [35]: In [36]: In [37]: In [38]: In [39]: In [40]: Out[37]: (^) 66. Out[38]: (^) 35. Out[40]: (^) [] # calculate sample mean and sample std dev sample_mean = sample.mean() sample_std = sample.std() # fit the distribution with the above parameters from scipy.stats import norm dist = norm(sample_mean, sample_std) values = np.linspace(sample.min(),sample.max(), 100 ) sample.max() sample.min() probability_density = [dist.pdf(value) for value in values] # plot the histogram and pdf plt.hist(sample,bins = 10 ,density =True ) plt.plot(values,probability_density)