









Introduction to Data Science Final Exam
Typology: Exams










"normalize" data - ANSWER-Process of structuring a relational database in accordance with a series of normal forms in order to reduce data redundancy and improve data integrity Normalization makes sure that all of your data looks and reads the same way across all records Accuracy - ANSWER-Most intuitive performance measure and it is a ratio of correctly predicted observations to the total observation A great measure with symmetric datasets where values of false positive and false negatives are almost the same Algorithm - ANSWER-Set of instructions designed to perform a specific task Created as functions Finite sequence of well-defined, computer-implementable instructions, typically to solve a class of problems or to perform a computation All discrete probability distributions have - ANSWER-discrete random variable values that can have zero probability All sampling distributions require - ANSWER-random sampling All statistics are - ANSWER-functions of sample data Analytical Aspirations - ANSWER-Executives commit to analytics by aligning resources and setting a timetable to build a broad analytical capability Analytical company - ANSWER-Enterprise-wide analytics capability under development; top executives view analytic capability as a corporate priority Analyticall Impaired - ANSWER-Company has some data and management interest in analytics Analytically competitive organization - ANSWER-The company routinely reaps benefits of its enterprise-wide analytics capability and confuses on continuous analytics review ANOVA - ANSWER-Analysis of Variance Compares mean values of a contributes variable for multiple categories/groups Bag of words approach - ANSWER-A text is represented as the bag of its words Way of extracting features from text for us in modeling Simple and flexible Vocab of known words and measure of the presence of known words Has nothing to do with the order or structure of words
Word count Balanced scorecard approach - ANSWER-A top-down management system that organizations can use to clarify their vision and strategy and transform them into action Basic elements of a properly done survey sample - ANSWER-Randomly selected elements from a list A list of the units or elements of the population A method to assure that key elements of the population are represented in the sample Bayesian model - ANSWER-it is fundamentally all about modifying conditional probabilities - it uses prior distributions for unknown quantities which it then updates to posterior distributions using the laws of probability Big data technologies - ANSWER-Utilized software that incorporates data mining, data storage, data sharing, and data visualization, the comprehensive term embraces data, data framework including tools and techniques used to investigate and transform data. Binary file - ANSWER-A file containing data or instructions written in zeros and ones (computer language). Business rules helps set up the __ - ANSWER-Conceptual data model Business understanding phase in CRISP-DM - ANSWER-the data scientists first starts by identifying the business problem and business objectives Categorical data - ANSWER-Data that consists of names, labels, or other nonnumerical values Characteristics of analytical leaders? - ANSWER-Passionate and driven and their initiatives are aimed at substantial results Classical confidence intervals - ANSWER-vary from one sample to the next Classical model - ANSWER-uses techniques such as Ordinary Least Squares and Maximum Likelihood - this is the conventional type of statistics that you see in most textbooks covering estimation, regression, hypothesis testing, confidence intervals, etc. Classification/Decision Tree - ANSWER-flowchart structure that includes chance event outcomes Each internal node represents a "test" on an attribute (e.g. 
whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes). The paths from root to leaf represent classification rules. Example: Titanic success rate: can break down data to show the characteristics that lead to survival or death
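To make the decision tree card concrete, here is a minimal Python sketch of a hand-built tree in which each internal node tests one attribute, each branch is an outcome, and each leaf is a class label. The attributes (`outlook`, `windy`) and labels are hypothetical and chosen only for illustration; this is not a learned model.

```python
# A hand-built decision tree stored as nested dictionaries.
# Internal nodes hold a test on one attribute; leaves hold a class label.
# The attributes and labels below are hypothetical, for illustration only.
TREE = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "windy",
            "branches": {
                True: {"label": "stay in"},
                False: {"label": "go out"},
            },
        },
        "rainy": {"label": "stay in"},
    },
}


def classify(node, record):
    """Follow one root-to-leaf path and return the leaf's class label."""
    while "label" not in node:
        value = record[node["attribute"]]  # outcome of this node's test
        node = node["branches"][value]     # follow the matching branch
    return node["label"]


def rules(node, path=()):
    """List every root-to-leaf path as a (conditions, label) rule."""
    if "label" in node:
        return [(path, node["label"])]
    found = []
    for value, child in node["branches"].items():
        found.extend(rules(child, path + ((node["attribute"], value),)))
    return found


print(classify(TREE, {"outlook": "sunny", "windy": False}))
for conditions, label in rules(TREE):
    print(conditions, "->", label)
```

Each tuple returned by `rules` corresponds to one classification rule, which is what "the paths from root to leaf represent classification rules" means in practice.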
CRISP-DM process - ANSWER-Common approach used by data mining experts. A widely used analytics model that provides a uniform framework for planning and managing a project.
CRM based research - ANSWER-Customer relationship management: the combination of practices, strategies, and technologies that companies use to manage and analyze customer interactions and data throughout the customer lifecycle. The goal is to improve customer service relationships, assist in customer retention, and drive sales growth.
CSV - ANSWER-Comma-separated values. A file format for transferring data, which stores fields and records in a plain text file, separated by commas.
Customer centered management - ANSWER-Ideas and theories where customers are at the forefront by maximizing service and/or product offerings and building relationships
Customer related processes - ANSWER-Pursuing a range of tactics that enable companies to attract and retain customers more effectively, engage in "dynamic pricing", optimize their brand management, translate customer interactions into sales, manage customer life cycles, and differentiate their products by personalizing them across multiple channels
Dashboard? - ANSWER-A graphical user interface that organizes and summarizes information vital to the user's role and the decisions that user makes.
Data adaptive approach - ANSWER-Begin with data and search through those data to find useful predictors. Give little thought to theories or hypotheses prior to running the analysis. Adapt to the available data, representing the nonlinear relationships and interactions among variables. The data determine the model.
Data Driven Decision - ANSWER-The practice of basing decisions on the analysis of data rather than purely on intuition. It is a process that involves collecting data based on measurable goals or KPIs, analyzing patterns and facts from these insights, and utilizing them to develop strategies and activities that benefit the business in a number of areas.
Data Mining - ANSWER-The process of extracting usable data from a larger set of raw data in order to identify patterns
Data processing - ANSWER-Data discretization and cleaning are part of data preprocessing
Data science capability - ANSWER-Embeds and operationalizes data science across an enterprise such that it can deliver the next level of organizational performance and return on investment
Data warehousing - ANSWER-Type of data management system that is designed to enable and support business intelligence activities, especially analytics. Used to perform queries and analysis, and often contains large amounts of historical data. A system that stores data from a company's operational databases as well as external sources. Different from operational databases because it stores historical information, making it easier for business leaders to analyze data over a specific period of time. Sorts data based on different subject matter. Importance: ensure consistency; make better business decisions; improve the bottom line.
Davenport & Harris's levels of analytic competition - ANSWER-Analytically Impaired; Localized Analytics; Analytical Aspirations; Analytical Company; Analytical Competition
Declaration of Geneva? - ANSWER-An oath for physicians written in response to the participation of physicians in crimes against humanity in Nazi Germany. Safeguards the ethical principles of the medical profession.
Deliberate sample - ANSWER-Sample members from a larger population are selected according to a random starting point but with a fixed, periodic interval. This interval, called the sampling interval, is calculated by dividing the population size by the desired sample size. (This is the standard definition of a systematic sample.)
Discrete data - ANSWER-Numerical data values that can be counted, only take on certain values, and are countable in a finite amount of time; numeric and categorical
Enterprise-level approach - ANSWER-Which activities have the biggest impact on business performance? How do we know whether we are executing against our strategy? Large-scale implementation.
Error - ANSWER-Deviations in the sample or methods from the true measures of the population
Estimated Sample Statistics - ANSWER-Previous studies, a statistic from a pilot study, an educated best guess
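The sampling-interval rule in the card above (population size divided by desired sample size, then a random start) can be sketched in a few lines of Python. The function name `systematic_sample`, the population of 100 numbered units, and the use of integer division for the interval are assumptions made for this illustration.

```python
import random


def systematic_sample(population, sample_size, seed=None):
    """Pick every k-th unit after a random start, where k = N // n."""
    n_population = len(population)
    if not 0 < sample_size <= n_population:
        raise ValueError("sample_size must be between 1 and the population size")
    interval = n_population // sample_size           # sampling interval k
    start = random.Random(seed).randrange(interval)  # random start in [0, k)
    return [population[start + i * interval] for i in range(sample_size)]


units = list(range(1, 101))  # hypothetical population of 100 numbered units
sample = systematic_sample(units, 10, seed=0)
print(sample)
```

With N = 100 and n = 10 the interval is 10, so consecutive sampled units are always 10 apart; only the starting point is random.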
Human Subjects Review purpose - ANSWER-Set of basic regulations governing the protection of human subjects. Ethical guidelines intended to assist researchers and those charged with ensuring that research on human subjects follows both legal and ethical requirements.
If a number represents the geographic location of a business using the zip code, then the level of data represented by the number - ANSWER-Nominal
Imputation - ANSWER-A statistical method used by statisticians, survey researchers, and other scientists to replace missing data in a respondent's dataset to improve the accuracy of the dataset.
Informed Consent - ANSWER-Researchers must inform their respondents of the general purpose of the research and any effects the research may have on the respondent, and obtain written consent from the respondent
Internal Process - ANSWER-(i) Look into their own strategies and how they perform their internal analytics; (ii) lie completely in the organization's control. Examples: financial analytics, mergers and acquisitions analytics, operational analytics, research and development analytics, and human resources analytics. Techniques: activity-based costing, Bayesian inference, combinatorial optimization, constraint analysis, experimental design, future-value analysis, genetic algorithms, Monte Carlo simulation, multiple regression analysis, neural network analysis, simulation, textual analysis, yield analysis.
Interval - ANSWER-You know how far apart two levels are: order of values plus the ability to quantify the difference between each
JSON - ANSWER-JavaScript Object Notation (JSON): a popular data interchange format and technology standard, often used to format data being sent or received via APIs.
Localized Analytics - ANSWER-Functional management builds analytics momentum and executives' interest through applications of basic analytics
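As an illustration of the imputation card above, the following Python sketch fills missing values with the mean of the observed ones. Mean imputation is only one simple strategy and is an assumption here; the card does not name a specific method, and the `ages` values are hypothetical.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


ages = [34, None, 29, 41, None]  # hypothetical survey responses
print(impute_mean(ages))
```

Mean imputation keeps the sample mean unchanged but shrinks the variance, which is one reason other strategies are often preferred in practice.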
Machine learning - ANSWER-An application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed
Market basket analysis - ANSWER-Looking for combinations of items that occur together frequently in transactions. There are relationships between the items that people buy. Used by retailers to make purchase suggestions to their customers and to predict customers' future purchase decisions.
Market Basket Analysis Example - ANSWER-Results in a retailer deciding to locate 6-packs of beer at the end of the infant diapers aisle
Meaning of "N" - ANSWER-population size
Meaning of "n" - ANSWER-sample size
Measurement Error - ANSWER-The degree to which a survey statistic differs from the "true" value due to the way the statistic is collected
Method - ANSWER-A way, technique, or process for doing something
Model dependent research - ANSWER-Begins with the specification of a model and uses that model to generate data, predictions, or recommendations. Simulations and mathematical programming methods are primary tools of operations research. Models are improved by comparing generated data with real data.
Model dependent research - ANSWER-It begins with the specification of a model and uses that model to generate data, predictions, or recommendations. Simulations and mathematical programming methods, primary tools of operations research, are examples of model-dependent research.
Nearest neighbor - ANSWER-Algorithm from graph theory that can be used to find a pathway from one spot to another. For classification: cluster the data, then classify the new data by finding what it is closest to (the nearest annotated cells). The k designates how many nearest neighbors to look at, and the new point is then classified by the group with the most nearest neighbors.
New Deal on Data - ANSWER-Rebalancing of ownership of data in favor of the individual whose data is collected
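The classification use of nearest neighbor described above (look at the k closest labeled points and take the majority group) can be sketched as follows. This is a minimal illustration using Euclidean distance, which is an assumption; the training points and the labels "A" and "B" are hypothetical.

```python
from collections import Counter
from math import dist


def knn_classify(labeled_points, new_point, k=3):
    """Label new_point by majority vote among its k nearest labeled neighbors."""
    by_distance = sorted(labeled_points, key=lambda item: dist(item[0], new_point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional training data: (coordinates, group label).
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]
print(knn_classify(training, (2.0, 2.0), k=3))
```

Choosing an odd k for a two-group problem avoids tied votes.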
Socioeconomic status Quality of democracy
Peer reviewed article - ANSWER-Articles are written by experts and are reviewed by several other experts in the field before being published in the journal
Population distributions - ANSWER-Have distributions of random variables
Precision - ANSWER-Quantifies the number of positive class predictions that actually belong to the positive class. Ratio of correctly predicted positive observations to the total predicted positive observations.
Predictive analytics - ANSWER-Seeks to determine what is likely to happen in the future
Predictive model - ANSWER-Involves searching for meaningful relationships among variables and representing those relationships in models. Response variables: what we are trying to predict. Explanatory variables or predictors: what we observe, manipulate, or control. Focuses on a set of indicators that correlate in some way with a quantity of interest.
Probability Theory - ANSWER-Allows us to estimate the chance that our answer from a sample is not a good estimate of the true population statistics
Properly form survey questions - ANSWER-Keep it simple. Make questions exhaustive and mutually exclusive. Try using different ways to get data. Question order is important. Always pretest the questionnaire.
Purposive sample - ANSWER-A sample selected on the basis of the researcher's judgment regarding the "best" participants to select for research purposes
Qualitative data - ANSWER-Non-statistical. Typically unstructured or semi-structured. Isn't measured using hard numbers but is categorized based on properties, attributes, labels, and other identifiers. Answers: why? Theories, interpretations, developing hypotheses, etc. Open for exploration.
Quantitative data - ANSWER-Information about quantities and numbers. Statistical. Structured: rigid and defined. Numbers and values.
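The ratio definitions of accuracy and precision on the cards above, together with recall, can be worked through on a small example. The sketch below counts the four confusion-matrix cells and then forms the ratios; the `actual` and `predicted` label lists are hypothetical, and returning 0.0 for an empty denominator is a convention chosen for this sketch.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


actual = [1, 1, 1, 0, 0, 0, 1, 0]     # hypothetical true labels
predicted = [1, 1, 0, 0, 0, 1, 1, 1]  # hypothetical model output

tp, fp, tn, fn = confusion_counts(actual, predicted)
accuracy = ratio(tp + tn, tp + fp + tn + fn)  # correct / all observations
precision = ratio(tp, tp + fp)                # correct positives / predicted positives
recall = ratio(tp, tp + fn)                   # correct positives / actual positives
print(accuracy, precision, recall)
```

Here there are 3 true positives, 2 false positives, 2 true negatives, and 1 false negative, so accuracy is 5/8, precision is 3/5, and recall is 3/4.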
Rare population - ANSWER-Less than 3% of a total population
Ratio - ANSWER-The ultimate level: order, interval values, plus the ability to calculate ratios, since a "true zero" can be defined
Recall metrics - ANSWER-Quantifies the number of positive class predictions made out of all positive examples in the dataset. Ratio of correctly predicted positive observations to all observations in the actual class.
Regression (linear regression) - ANSWER-The technique that specifies the dependence of the response variable on the explanatory variable. It finds the line that best fits the pattern of the linear relationship.
Relational database - ANSWER-Type of database that stores and provides access to data points that are related to one another. The columns of the table hold attributes of the data, and each record has a value for each attribute, making it easy to establish the relationships among data points.
Reliability - ANSWER-Consistent. If one person takes the test several times, they always receive the same result. The same every time.
Response bias - ANSWER-When respondents choose to "opt out". If there is a consistent direction of the response deviations over trials.
Sample statistic (the point estimate) - ANSWER-Statistic taken from a sample that is used to estimate a population parameter. Only as good as the representativeness of its sample.
Sampling distribution of a statistic - ANSWER-The probability distribution for the statistic based on all possible random samples from a population. Necessary for making confidence statements about an unknown population parameter. Depends on the nature of the population being sampled. When sampling randomly from a normal population, the sample average follows a normal distribution.
Sampling Error - ANSWER-An estimate of how much a survey statistic differs from the "true" statistic because of the sample that was selected
Sampling error - ANSWER-Occurs when the sample is not representative of the population because of the selection process. Systematic failure to observe some elements because of the sample design.
Sampling Frame - ANSWER-A list of all units in a target population
Sampling methods for underrepresented groups - ANSWER-Probability sampling: targeted geographic areas, cluster, stratified
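For the linear regression card above, "the line that best fits" is usually taken to mean the least-squares line. The following Python sketch computes its slope and intercept from the standard closed-form formulas for one explanatory variable; the function name `fit_line` and the example data are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4]  # hypothetical explanatory variable
ys = [3, 5, 7, 9]  # hypothetical response variable (exactly y = 2x + 1)
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Because the example points lie exactly on y = 2x + 1, the fitted slope is 2 and the intercept is 1; with noisy data the same formulas return the line that minimizes the sum of squared residuals.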
Stratified sample - ANSWER-The population is divided into strata and a random sample is taken from each stratum. Can reduce sampling error, because the sample will more closely match the population. More costly than a simple random sample. Strata are usually chosen based on available information about the population. Within each group: homogeneity. Between groups: heterogeneity.
Stratified Sampling - ANSWER-Minimize design effects
Structured data - ANSWER-Data that data mining algorithms use; can be classified as categorical or numeric
Structured data - ANSWER-Data that (1) are typically numeric or categorical; (2) can be organized and formatted in a way that is easy for computers to read, organize, and understand; and (3) can be inserted into a database in a seamless fashion.
Supply chain? - ANSWER-The network of all the individuals, organizations, resources, activities, and technology involved in the creation and sale of a product, from the delivery of source materials from the supplier to the manufacturer, through to its eventual delivery to the end user
Symmetrical simulation - ANSWER-
Test set and training set - ANSWER-In a dataset, a training set is used to build a model, while a test (or validation) set is used to validate the model built. Data points in the training set are excluded from the test (validation) set.
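The training/test card above says the two sets must not share data points. A minimal Python sketch of such a split follows; the function name `train_test_split`, the 25% test fraction, and the integer rows are assumptions for illustration (libraries such as scikit-learn provide their own version of this utility).

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=None):
    """Shuffle a copy of rows and split it into disjoint training and test sets."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # random order, reproducible with a seed
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(20))  # hypothetical dataset of 20 observations
train, test = train_test_split(rows, test_fraction=0.25, seed=42)
print(len(train), len(test))
```

The model is fitted on `train` only, and `test` is held back until evaluation so that the performance estimate is not inflated by data the model has already seen.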
Text analysis - ANSWER-Analyzes unstructured data to find trends and patterns in words and sentences
Text files are easier to inspect than binary files by loading the file or viewing it with a text editor - ANSWER-True
The data scientist can use programming languages such as Python and R to visualize the data - ANSWER-True
The Kish Method - ANSWER-Method for randomly selecting which household member to sample in a household with multiple members. Starts with a single question about the number of eligible persons in the household so they can be listed.
The New Deal on Data - ANSWER-People deserve to have rights to their data. They should know when, where, and why their data is being used. The ownership of data belongs to the individual whose data is collected. Give people the ability to see what's being collected and to opt in and opt out, so there is transparency.
TXT files - ANSWER-A computer file that only contains text and has no special formatting such as bold text, italic text, images, etc. Microsoft Windows text files are identified with the .txt file extension.
Type of Machine Learning algorithms - ANSWER-Clustering, Classification, Regression, Association
Unstructured data - ANSWER-No pre-defined format or organization, stored in native form, requires more work to process and understand, schema on read
Unsupervised Learning - ANSWER-Using unlabeled data, allows a model to discover patterns and information that were previously undetected
Unsupervised segmentation - ANSWER-
Use of Machine Learning - ANSWER-A common task is the study and construction of algorithms that can learn from and make predictions on data. Such algorithms function by making data-driven predictions or decisions, through building a mathematical model from input data.
What data science technologies would you use in a promotional campaign? - ANSWER-Scatterplot and box plot first, to assess the data and see the obvious relationships. Build a linear model for predicting whatever the campaign is trying to cover. Employ a training-and-test regimen to provide an evaluation of the model's predictive performance, and ANOVA. Traditional regression models.
What is a "traditional" research model - ANSWER-An approach to research, statistical inference, and modeling that begins with the specification of a theory or model. Linear regression and logistic regression, which estimate parameters for linear predictions.
Which of these does Korzybski's statement that "The map is not the territory, and the name is not the thing named" relate to? - ANSWER-Construct Validity. Applies to the statistical survey because of the mismatch between a statistical construct (the elements of information sought by an analyst) and its associated measure; the gap is the validity.
Why sample? - ANSWER-Gather data from a population that can't be censused. Make statistical inferences. Deal with issues of bias and error. Deal with populations that shift and change, and that might have significant variations.
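As a small illustration of the text analysis and bag-of-words cards in this set, the sketch below turns text into word counts over a vocabulary of known words, ignoring word order. The tokenizing rule (lowercase runs of letters and apostrophes) and the two example sentences are assumptions made for this sketch.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercase word appears, ignoring word order."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def vectorize(text, vocabulary):
    """Represent text as word counts over a fixed vocabulary of known words."""
    counts = bag_of_words(text)
    return [counts[word] for word in vocabulary]


docs = ["The cat sat on the mat", "The dog chased the cat"]  # hypothetical texts
vocabulary = sorted(set().union(*(bag_of_words(d) for d in docs)))
print(vocabulary)
for doc in docs:
    print(vectorize(doc, vocabulary))
```

The resulting count vectors can be fed to a model as features; two texts containing the same words in a different order produce identical vectors.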