




C797 Data Science and Analytics Study Guide
Typology: Exams





Column charts - ✔Use to compare data across categories (requires categorical or ordinal data; displays variables vertically)
Line graphs - ✔Use to display continuous data over time
Histogram - ✔Use to visually check normality (how the frequencies of continuous/quantitative data are distributed around the mean) for a set of data points
Pie charts - ✔Use when you have one row or column of data and you want to know how large one data point is in relation to the whole. Especially useful when displaying portions or percentages
bar chart - ✔Similar to column charts but better to use when your labels are long (categorical or ordinal data; there are spaces between bars, and it is displayed horizontally)
x-y scatter plot - ✔Shows the relationship among numeric values in several data series, or plots two groups of numbers as one series of XY coordinates (requires continuous data)
donut chart - ✔Shows the relationship of parts to a whole like a pie chart, but can contain more than one data series (the sections of the whole can be continuous or categorical)
Bubble charts - ✔Plot continuous data that are arranged in columns on a worksheet so that the X values are listed in the first column and the corresponding Y values and bubble size values are listed in adjacent columns
Geospatial maps - ✔Visually depict the prevalence and occurrence of a condition or disease geographically using polygons. There are 2 types of data used in mapping: vector data and raster data
Raster Data - ✔A grid-based format for storing location-based data in a geographic information system in which each equally sized cell or pixel contains a value that represents geographic data such as land
Vector Data - ✔A format for storing location-based data in a geographic information system that uses latitude and longitude coordinates to represent geographic features with points, lines, and other complex shapes
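To make the raster vs. vector distinction concrete, here is a minimal Python sketch. All names, cell values, and coordinates are made up for illustration; a real GIS would use dedicated file formats and libraries rather than plain lists and dictionaries.

```python
# Hypothetical toy data: every value and coordinate below is invented.

# Raster: a grid of equally sized cells, each holding one value
# (here 0 = water, 1 = land).
raster = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
]

# Vector: features described by (longitude, latitude) coordinates
# as points, lines, and polygons.
vector = {
    "clinic": {"type": "point", "coords": [(-111.89, 40.76)]},
    "road": {"type": "line", "coords": [(-111.90, 40.75), (-111.88, 40.77)]},
    "county": {
        "type": "polygon",
        "coords": [(-112.0, 40.7), (-111.8, 40.7), (-111.8, 40.8), (-112.0, 40.8)],
    },
}


def land_fraction(grid):
    """Return the share of raster cells whose value is 1 (land)."""
    cells = [value for row in grid for value in row]
    return sum(cells) / len(cells)


print(land_fraction(raster))  # 6 of the 12 cells are land
```

The raster answers "what value is at this cell?", while the vector features answer "where exactly is this object and what shape does it have?".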
(σ² is larger than the standard deviation when σ > 1)
standard deviation - ✔the square root of the variance (σ = √σ²)
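A quick check of the variance/standard deviation relationship, as a sketch using Python's built-in `statistics` module. The data set is made up, and the population versions of the formulas are assumed (use `statistics.variance` and `statistics.stdev` for a sample).

```python
import math
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up example data

variance = statistics.pvariance(scores)  # population variance
std_dev = statistics.pstdev(scores)      # population standard deviation

print(variance, std_dev)

# The standard deviation is the square root of the variance.
print(math.isclose(std_dev, math.sqrt(variance)))
```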
positively skewed distribution (right-skewed) - ✔A distribution where the scores pile up on the left side and taper off to the right. (Mode < median < mean)
negatively skewed distribution (left-skewed) - ✔A distribution in which most scores pile up at the right end of the scale. (Mean < median < mode)
Leptokurtic distribution - ✔a frequency distribution that has a tendency toward peakedness (leaping curve)
mesokurtic distribution - ✔normal distribution curve
Platykurtic distribution - ✔Flatter and more spread out than a normal curve. (Memory: 'Plat' sounds like 'flat')
sample statistic - ✔A measurable characteristic of a sample. (e.g., the sample mean, x̄)
population parameter - ✔A characteristic or measure of a population. (often can't be calculated directly due to population size) (e.g., the population mean, μ)
confidence interval - ✔a range of values, stated with a given probability (the confidence level), that takes random error into account
Lower Quartile (Q1) - ✔The median of the lower half of a set of data
Upper Quartile (Q3) - ✔The median of the upper half of a set of data
Confidence interval formula - ✔mean ± z × SE (z = critical value for the chosen confidence level, SE = standard error of the mean)
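The skew, quartile, and confidence interval cards can be tried out on a small made-up sample. This is only a sketch: the data are invented, the quartile helper uses the "median of each half" rule from the cards (statistical software may use other conventions), and z = 1.96 assumes a 95% confidence level with an approximately normal sampling distribution.

```python
import math
import statistics

# Made-up, right-skewed sample: most values are low, a few are high.
sample = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]

mode = statistics.mode(sample)
median = statistics.median(sample)
mean = statistics.mean(sample)
print(mode, median, mean)  # right skew: mode < median < mean


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Needs at least two values; for an odd count the middle value
    is left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]
    return statistics.median(lower), statistics.median(upper)


q1, q3 = quartiles(sample)
print(q1, q3)

# Confidence interval: mean ± z * SE, where SE = s / sqrt(n).
z = 1.96  # assumed 95% confidence level
standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
ci_lower = mean - z * standard_error
ci_upper = mean + z * standard_error
print(ci_lower, ci_upper)
```

With a sample this small, a t critical value would normally replace z; the z version is shown only because it matches the formula on the card.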
sampling distribution - ✔a distribution of statistics obtained by selecting all the possible samples of a specific size from a population
ANOVA (analysis of variance) - ✔an inferential statistical test for comparing the means of three or more groups
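The sampling distribution definition can be illustrated by listing every possible sample from a tiny made-up population. This sketch assumes sampling without replacement and uses the sample mean as the statistic.

```python
from itertools import combinations
import statistics

population = [2, 4, 6, 8]  # made-up tiny population
sample_size = 2

# Every possible sample of the given size (without replacement),
# and the mean of each one.
sample_means = [
    statistics.mean(sample)
    for sample in combinations(population, sample_size)
]

print(sorted(sample_means))           # the sampling distribution of the mean
print(statistics.mean(sample_means))  # its center
print(statistics.mean(population))    # the population mean
```

The mean of all the sample means matches the population mean, which is why the sample mean is used to estimate the population mean.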
t-test - ✔a statistical test used to evaluate the size and significance of the difference between two means
correlation test - ✔Test that measures the strength of the relationship between two variables (the rank-based Spearman version is non-parametric). The closer the coefficient is to 1 or -1, the stronger the correlation. The closer to 0, the weaker the correlation. A coefficient of 0 corresponds to a flat (horizontal) best-fit line, meaning there is no linear correlation
F statistic - ✔a ratio of two measures of variance that is compared with a critical value to determine the significance of results and draw conclusions about the hypothesis. Used with ANOVA and Levene's test
Non-parametric tests - ✔Do not assume a normal distribution. Examples: chi-squared, Fisher's exact probability, Mann-Whitney, Wilcoxon, and Kruskal-Wallis
parametric tests - ✔Sample is representative of the population and is normally distributed. Use interval or ratio data only. Have more statistical power. Examples: t-test and ANOVA
Linear regression - ✔Y = a + b(X) + e; finding the best-fitting line by estimating the intercept (a) and the slope (b)
multiple regression - ✔Y = a + b1(X1) + b2(X2) + e; regression model that estimates the relationship between the dependent variable and two or more independent variables
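As a worked illustration of Y = a + b(X) + e and the correlation coefficient, here is a minimal least-squares sketch in plain Python. The data and the `fit_line` helper are hypothetical, and Pearson's r is assumed as the correlation measure.

```python
import math


def fit_line(xs, ys):
    """Return (a, b, r): intercept, slope, and Pearson's r for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    b = sxy / sxx                   # slope of the best-fitting line
    a = mean_y - b * mean_x         # intercept
    r = sxy / math.sqrt(sxx * syy)  # strength of the linear relationship
    return a, b, r


# Made-up data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

a, b, r = fit_line(hours, scores)
print(a, b, r)

# Predicted score for 6 hours of study (the error term e is ignored).
print(a + b * 6)
```

Here r comes out close to 1, so the points lie almost exactly on the fitted line; an r near 0 would mean the line is nearly flat and explains little.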