



C797 data science and analytics Study Guide
Typology: Exams




Column charts - ✔Use to compare data across categories (requires categorical or ordinal data; displays values as vertical bars).
Line graphs - ✔Use to display continuous data over time.
Histogram - ✔Use to visually check normality (the frequency distribution of continuous/quantitative data around the mean) for a set of data points.
Pie charts - ✔Use when you have one row or column of data and want to know how much each data point contributes to the whole. Especially useful for displaying portions or percentages.
bar chart - ✔Similar to a column chart but better when your labels are long (categorical or ordinal data; there are spaces between the bars, and the bars are displayed horizontally).
x-y scatter plot - ✔Shows the relationship among numeric values in several data series, or plots two groups of numbers as one series of XY coordinates (requires continuous data).
donut chart - ✔Shows the relationship of parts to a whole like a pie chart but can contain more than one data series (the sections can represent continuous or categorical data).
Bubble charts - ✔Plot continuous data arranged in worksheet columns so that the X values are listed in the first column and the corresponding Y values and bubble-size values are listed in adjacent columns.
Geospatial maps - ✔Visually depict the prevalence and occurrence of a condition or disease geographically using polygons. There are two types of data used in mapping: vector data and raster data.
Raster Data - ✔A grid-based format for storing location-based data in a geographic information system in which each equally sized cell or pixel contains a value that represents geographic data such as land.
Vector Data - ✔A format for storing location-based data in a geographic information system that uses latitude and longitude coordinates to represent geographic features with points, lines, and other complex shapes.
Nominal Data - ✔Categorical data with no inherent order: gender, type of pet, hair color, eye color.
ordinal data - ✔A scale where the exact numerical value has no significance other than to rank a set of data points. Deals with the order or position of items such as words, letters, symbols, or numbers arranged in a hierarchy: 1st place, 2nd place; big to small.
Interval Data - ✔Data measured in consistent units or intervals but without a true zero. Higher numbers mean more of something and lower numbers mean less, but ratios between values are not meaningful: temperature in °C or °F, calendar dates.
Ratio Data - ✔Data that is similar to interval data, except that it has a meaningful zero point, so the ratio of two data points is meaningful: age (you can't be younger than age 0), height, weight.
Measures of Center (Central Tendency) - ✔Mean, median, mode.
interquartile range - ✔The difference between the upper and lower quartiles (find the median of the values below the overall median to get the first quartile, and the median of the values above it to get the third quartile).
Midrange - ✔The sum of the lowest and highest data values, divided by 2.
range - ✔The difference between the highest and lowest scores in a distribution.
standard deviation - ✔A measure of how far the values in a group typically spread from the mean.
proportion - ✔A part of something, expressed as a ratio or percentage.
Frequency - ✔The count of something, not the sum (10 + 10 + 15 = 35, but the count is 3).
Slope equation - ✔m = (y2 - y1) / (x2 - x1)
slope-intercept form - ✔y = mx + b, where m is the slope and b is the y-intercept of the line (where the line crosses the Y axis, x = 0, so y = b).
box and whisker plot - ✔A chart of the five-number summary: minimum, first quartile, median, third quartile, and maximum.
sample variance - ✔The standard deviation squared, s² (larger than the standard deviation whenever the standard deviation is greater than 1).
standard deviation - ✔The square root of the variance, s.
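
The spread and slope cards above can be checked on a small example. This is a minimal sketch using Python's standard library; the `scores` list and the helper names `median_of_halves_quartiles` and `slope` are made up for illustration, and the quartiles follow the median-of-halves method described in the interquartile range card (other quartile conventions give slightly different values).

```python
import statistics

# Hypothetical sample, invented for this illustration.
scores = [2, 4, 4, 5, 7, 9, 10, 12]

# Range: highest minus lowest. Midrange: (lowest + highest) / 2.
data_range = max(scores) - min(scores)
midrange = (min(scores) + max(scores)) / 2

def median_of_halves_quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    # With an odd count, the overall median is left out of both halves.
    upper = ordered[half:] if len(ordered) % 2 == 0 else ordered[half + 1:]
    return statistics.median(lower), statistics.median(upper)

q1, q3 = median_of_halves_quartiles(scores)
iqr = q3 - q1

# Sample variance and sample standard deviation (divide by n - 1).
sample_variance = statistics.variance(scores)
sample_sd = statistics.stdev(scores)

# Frequency is a count of values, not their sum.
frequency = len(scores)

def slope(x1, y1, x2, y2):
    """Slope between two points: m = (y2 - y1) / (x2 - x1)."""
    return (y2 - y1) / (x2 - x1)

print(data_range, midrange, q1, q3, iqr, frequency)
print(sample_variance, sample_sd)
print(slope(1, 2, 3, 8))
```

Squaring `sample_sd` gives back `sample_variance`, which is the relationship the last two cards describe.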
null hypothesis - ✔The hypothesis that there is no significant difference between specified populations, any observed difference being due to sampling or experimental error.
failing to reject the null hypothesis - ✔Indicates that the study found no statistically significant difference between the means of the groups (p value is greater than alpha). It does not prove that the means are equal.
reject the null hypothesis - ✔When you have enough statistical evidence to show a difference or an association (p value is less than or equal to alpha).
alternative hypothesis - ✔The hypothesis that states there is a difference between two or more sets of data.
Empirical Rule (68-95-99.7 Rule) - ✔Only works with a normal distribution. About 68% of data lands within 1 standard deviation on either side of the mean, about 95% within 2 SD on either side of the mean, and about 99.7% within 3 SD on either side of the mean.
Independent Variable (IV) - ✔The variable that a researcher actively manipulates and that, if the hypothesis is correct, will cause a change in the dependent variable.
Dependent Variable (DV) - ✔The measured outcome of a study; the responses of the subjects in a study.
Discrete data (variables) - ✔Non-continuous data that take distinct, countable values. For categorical (nominal) variables there is no ordering between the categories; they are analyzed with chi-square tests. A weak level of measurement: summaries are limited to counts, percentages, and the mode.
Chi-square test - ✔A statistical method of testing for an association between two categorical variables. Specifically, it tests for the equality of two frequencies or proportions.
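
The empirical rule, the reject / fail-to-reject decision, and the chi-square statistic can be sketched in a few lines of standard-library Python. The helper names (`share_within`, `decide`, `chi_square_statistic`) and the 2x2 table are illustrative assumptions. The decision helper uses the standard convention that the null hypothesis is rejected when the p value is at or below alpha, and the chi-square helper computes the plain Pearson statistic without a continuity correction.

```python
from statistics import NormalDist

def share_within(k, mean=0.0, sd=1.0):
    """Share of a normal distribution within k standard deviations of the mean."""
    dist = NormalDist(mean, sd)
    return dist.cdf(mean + k * sd) - dist.cdf(mean - k * sd)

def decide(p_value, alpha=0.05):
    """Reject the null when p <= alpha; otherwise fail to reject it."""
    if p_value <= alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

def chi_square_statistic(table):
    """Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))

# Hypothetical 2x2 table of counts: rows are groups, columns are outcomes.
observed_counts = [[30, 10], [20, 40]]
chi2 = chi_square_statistic(observed_counts)

# 3.841 is the chi-square critical value for 1 degree of freedom at alpha = 0.05.
print(chi2, chi2 > 3.841)
print(decide(0.03), "|", decide(0.20))
```

A statistic above the critical value corresponds to a p value below alpha, so both routes lead to the same decision.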
continuous data - ✔Data that can take any value (within a range); summarized with the mean, SD, and range.
Likert Scale - ✔A numerical scale used to assess attitudes; includes a set of possible answers with labeled anchors on each extreme (strongly disagree, disagree, ..., strongly agree). Best shown with bar graphs.
sampling distribution - ✔A distribution of statistics obtained by selecting all the possible samples of a specific size from a population.
ANOVA (analysis of variance) - ✔An inferential statistical test for comparing the means of three or more groups.
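
The sampling distribution and ANOVA cards can be illustrated with toy numbers. This is a rough sketch under stated assumptions: the population and groups are invented, samples are drawn without replacement, and `one_way_anova_f` is a hypothetical helper that computes the one-way ANOVA F statistic as the between-group mean square divided by the within-group mean square.

```python
from itertools import combinations
from statistics import mean

# Sampling distribution of the mean: every possible sample of size 2
# (without replacement) from a tiny, made-up population.
population = [2, 4, 6, 8]
sample_means = [mean(sample) for sample in combinations(population, 2)]
mean_of_sample_means = mean(sample_means)

def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Three hypothetical groups of three observations each.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_statistic = one_way_anova_f(groups)

print(sorted(sample_means), mean_of_sample_means, mean(population))
print(f_statistic)
```

The mean of the sample means matches the population mean, and a large F value signals that the group means differ by more than the spread inside the groups would suggest.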
t-test - ✔A statistical test used to evaluate the size and significance of the difference between two means.
correlation test - ✔Test that measures the strength of a relationship between two variables (Pearson's r is parametric; Spearman's rank correlation is the non-parametric version). The closer the coefficient is to 1 or -1, the stronger the correlation. The closer to 0, the weaker the correlation. A slope of 0 is flat, so when the best-fitting line is horizontal there is no correlation.
F statistic - ✔A ratio of two measures of variance that is compared against a critical value to determine the significance of results and draw conclusions about the hypothesis. Used with ANOVA and Levene's test.
Non-parametric tests - ✔Do not assume a normal distribution. Examples: chi-squared, Fisher's exact test, Mann-Whitney, Wilcoxon, and Kruskal-Wallis.
parametric tests - ✔Assume the sample is representative of the population and is normally distributed. Use interval or ratio data only. Have more statistical power. Examples: t-test and ANOVA.
Linear regression - ✔Y = a + b(X) + e. Finds the best-fitting line by estimating the slope (b) and the intercept (a).
multiple regression - ✔Y = a + b1(X1) + b2(X2) + e. Regression model that estimates the relationship between the dependent variable and two or more independent variables.
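
To make the regression and correlation cards concrete, here is a minimal sketch of an ordinary least-squares fit of Y = a + b(X) and a Pearson correlation coefficient, written with the standard library only. The data points and the helper names `fit_line` and `pearson_r` are illustrative assumptions, not part of the study guide.

```python
from math import sqrt
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for Y = a + b * X."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # slope of the best-fitting line
    a = y_bar - b * x_bar    # the line passes through (x_bar, y_bar)
    return a, b

def pearson_r(xs, ys):
    """Pearson correlation coefficient, between -1 and 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
r = pearson_r(xs, ys)

# The residuals are the e term in Y = a + b(X) + e.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(a, b, r)
print(residuals)
```

The slope and the correlation coefficient share the same numerator, which is why a horizontal best-fitting line (slope 0) goes with a correlation of 0.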