


















It contains questions and answers on data mining, databases, and machine learning fundamentals.
Typology: Exams



















Example of discrete, qualitative, ordinal attributes: Bronze, Silver, and Gold medals as awarded at the Olympics.
Text files are easier to inspect than binary files, by loading the file or viewing it with a text editor: True.
Structured data: The data that data mining algorithms use; it can be classified as categorical or numeric.
The data scientist can use programming languages such as Python and R to visualize the data: True.
Noise objects are always outliers: False. Random distortion can produce a value that looks like a normal one, so a noise object need not be an outlier.
Business understanding phase in CRISP-DM: The data scientist starts by identifying the business problem and business objectives.
Predictive analytics: Seeks to determine what is likely to happen in the future.
Data preprocessing: Data discretization and cleaning are part of data preprocessing.
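The discretization step mentioned in the last card can be illustrated with a small sketch. It assumes equal-width binning, which is only one of several discretization schemes; the `ages` values, the three labels, and the function name `equal_width_bins` are hypothetical and chosen for illustration.

```python
def equal_width_bins(values, n_bins):
    """Assign each numeric value to one of n_bins equal-width intervals (0-based)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything falls into the first bin.
        return [0 for _ in values]
    # The maximum value would otherwise land in bin n_bins; clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [18, 22, 25, 31, 40, 47, 58, 63]   # hypothetical numeric attribute
labels = ["young", "middle", "senior"]     # assumed ordinal labels
bins = equal_width_bins(ages, 3)
print([labels[b] for b in bins])
```

The result turns a numeric attribute into a discrete, ordinal one, the same kind of attribute as the medal example in the first card.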
Foundation of relational databases: A collection of Excel-style data tables with rows as data records and columns as attributes.
OLAP: Ideal for long-term decision making.
NoSQL: Document databases.
Business rules help set up the __: Conceptual data model.
Data mining: The process of extracting usable data from a larger set of raw data in order to identify patterns.
Unsupervised learning: Using unlabeled data, allows a model to discover patterns and information that were previously undetected.
Types of machine learning algorithms: Clustering, classification, regression, association.
Market basket analysis example: Results in a retailer deciding to locate 6-packs of beer at the end of the infant diapers aisle.
Sampling frame
Sampling error: An estimate of how much a survey statistic differs from the "true" statistic because of the sample that was selected.
Coverage error: The degree to which statistics are off because the sample doesn't properly represent the underlying population.
Basic elements of a properly done survey sample: Randomly selected elements from a list; a list of the units or elements of the population; a method to assure that key elements of the population are represented in the sample.
The Kish method: A method for randomly selecting which household member to sample in a household with multiple members. It starts with a single question about the number of eligible persons in the household so they can be listed.
Rare population: Less than 3% of a total population.
How to best find the members in a household: Ask for the initials of everyone who slept in the residence the night before the survey.
Statistical inference: Inferring from a sample to a population, based on probability theory.
Probability theory: Allows us to estimate the chance that our answer from a sample is not a good estimate of the true population statistic.
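The following sketch ties several of the survey cards together: it draws a simple random sample from a list of units and uses probability theory to attach a margin of error to the sample mean. The frame contents are simulated, and the 95% normal approximation (the 1.96 multiplier) is an assumption of this example, not something stated in the cards.

```python
import math
import random
import statistics

random.seed(0)

# Hypothetical list of population units: unit id -> some measured value.
frame = {unit_id: random.gauss(50, 10) for unit_id in range(1000)}

# Randomly select elements from the list, without replacement.
sample_ids = random.sample(sorted(frame), k=100)
sample = [frame[i] for i in sample_ids]

sample_mean = statistics.mean(sample)
std_error = statistics.stdev(sample) / math.sqrt(len(sample))
margin = 1.96 * std_error   # approximate 95% margin of error (assumed normal approximation)

# In a real survey the population mean is unknown; here it is available
# only because the whole population was simulated.
population_mean = statistics.mean(list(frame.values()))
sampling_error = sample_mean - population_mean

print(f"estimate = {sample_mean:.2f} +/- {margin:.2f}")
print(f"actual sampling error = {sampling_error:.2f}")
```

The margin of error is the inference step: it is computed from the sample alone, yet it bounds how far the estimate is likely to be from the population value.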
New Deal on Data: Rebalancing of ownership of data in favor of the individual whose data is collected.
Human subjects protection: Applies to individual rights on the Internet. A response to the Tuskegee syphilis experiment, influenced by the Nuremberg Code; centers on informed consent.
Informed consent: Researchers must inform their respondents of the general purpose of the research and any effects the research may have on the respondent, and obtain written consent from the respondent.
Human non-subjects research: Includes genetic data from a data bank in which individual names or identifiers are not included. Genetic data can lead to individuals who thought their data were being used anonymously being identified via their genetic code at a later date.
If a number represents the geographic location of a business using the zip code, then the level of data represented by the number is: Nominal.
Sampling distribution of a statistic: The probability distribution for the statistic based on all possible random samples from a population. Necessary for making confidence statements about an unknown population parameter. Depends on the nature of the population being sampled.
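For a very small population, the sampling distribution described in the last card can be written out exactly by listing every possible sample. The sketch below assumes simple random sampling without replacement, a sample size of 2, and a made-up population of four values; the statistic is the sample mean.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

population = [2, 4, 6, 8]   # hypothetical population values
n = 2                        # sample size (assumed)

# Every possible sample of size n, each equally likely under simple random sampling.
samples = list(combinations(population, n))
sample_means = [Fraction(sum(s), n) for s in samples]

# Sampling distribution: each possible value of the statistic and its probability.
distribution = {
    mean: Fraction(count, len(samples))
    for mean, count in Counter(sample_means).items()
}

for mean in sorted(distribution):
    print(mean, distribution[mean])
```

The average of the sample means equals the population mean here, which is what makes the sample mean usable for confidence statements about the unknown parameter.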
embeds and operationalizes data science across an enterprise such that it can deliver the next level of organizational performance and return on investment.
Big data technologies: Software that incorporates data mining, data storage, data sharing, and data visualization. The comprehensive term embraces data and data frameworks, including the tools and techniques used to investigate and transform data.
What is a "traditional" research model?: An approach to research in which statistical inference and modeling begin with the specification of a theory or model. Linear regression and logistic regression estimate parameters for linear predictions.
Classical model: Uses techniques such as Ordinary Least Squares and Maximum Likelihood. This is the conventional type of statistics found in most textbooks covering estimation, regression, hypothesis testing, confidence intervals, etc.
Bayesian model: Fundamentally about modifying conditional probabilities. It uses prior distributions for unknown quantities, which it then updates to posterior distributions using the laws of probability.
Predictive model: Involves searching for meaningful relationships among variables and representing those relationships in models. Response variables: what we are trying to predict. Explanatory variables or predictors: what we observe, manipulate, or control. Focuses on a set of indicators that correlate in some way with a quantity of interest.
Model-dependent research: Begins with the specification of a model and uses that model to generate data, predictions, or recommendations. Examples: simulations and mathematical programming methods, the primary tools of operations research. Models are improved by comparing generated data with real data.
Machine learning: An application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
Test set and training set: In a dataset, a training set is used to build a model, while a test (or validation) set is used to validate the model built. Data points in the training set are excluded from the test (validation) set.
Use of machine learning: A common task is the study and construction of algorithms that can learn from and make predictions on data. Such algorithms function by making data-driven predictions or decisions, through building a mathematical model from input data.
Data-adaptive approach: Begin with data and search through those data to find useful predictors. Give little thought to theories or hypotheses prior to running the analysis. Adapt to the available data, representing the nonlinear relationships.
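The training-and-test idea from the cards above can be sketched in a few lines: split the data into disjoint sets, fit a model on the training set only, and measure error on the held-out test set. The data, the 25% test fraction, and the function names are assumptions made for this example; the model is a one-variable ordinary least squares line.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and split them into disjoint train and test lists."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: explanatory variable x, response y = 2x + 1 with no noise.
data = [(x, 2 * x + 1) for x in range(20)]
train, test = train_test_split(data)

slope, intercept = fit_line(train)   # the model is built on the training set only
test_mse = sum((y - (slope * x + intercept)) ** 2 for x, y in test) / len(test)
print(slope, intercept, test_mse)
```

Because no test point is used during fitting, the test error is an honest estimate of how the model behaves on data it has not seen.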
Histogram: Contiguous rectangles that represent the frequency of data in given class intervals.
Scatterplot: Two-dimensional plot of pairs of points from two numerical variables. Useful for examining possible relationships between variables.
Regression (linear regression): The technique that specifies the dependence of the response variable on the explanatory variable. It finds the line that best fits the pattern of the linear relationship.
ANOVA (analysis of variance): Compares mean values of a continuous variable for multiple categories/groups.
Method: A way, technique, or process for doing something.
What data science technologies would you use in a promotional campaign?: Scatterplots and box plots first, to assess the data and see the obvious relationships. Build a linear model for predicting whatever the campaign is trying to cover. Employ a training-and-test regimen to provide an evaluation of the model's predictive performance. ANOVA. Traditional regression models.
Market basket analysis: Looking for combinations of items that occur together frequently in transactions. There are relationships between the items that people buy. Used to make purchase suggestions to customers and to predict their future purchase decisions.
Conjoint analysis: Used to evaluate the strength and direction of customer preference for a combination of product or service attributes. Might be used to determine which factors are most important to customers who are purchasing something.
Enterprise-level approach: Which activities have the biggest impact on business performance? How do we know whether we are executing against our strategy? Large-scale implementation.
Davenport & Harris's levels of analytic competition: Analytically Impaired, Localized Analytics, Analytical Aspirations, Analytical Company, Analytical Competition.
Analytically Impaired: Company has some data and management interest in analytics.
Localized Analytics: Functional management builds analytics momentum and executives' interest through applications of basic analytics.
Analytical Aspirations
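Market basket analysis, as described above, comes down to counting how often items occur together. The sketch below computes the three usual measures (support, confidence, lift) for one rule. The transactions are invented for illustration, and the measure names are the standard association-rule terms rather than anything defined in the cards.

```python
# Hypothetical transactions; each set holds the items bought together.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)


def lift(antecedent, consequent):
    """Confidence relative to the consequent's baseline frequency; above 1 suggests a positive association."""
    return confidence(antecedent, consequent) / support(consequent)


rule = ({"diapers"}, {"beer"})
print(support(rule[0] | rule[1]), confidence(*rule), lift(*rule))
```

A lift above 1 for "diapers implies beer" is the kind of finding that would motivate placing the two products near each other.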
Multiple regression analysis, neural network analysis, simulation, textual analysis, yield analysis.
External processes: Customer and supplier. Related to managing and responding to customer demand and supplier relationships. CRM and SCM. Require cooperation from outsiders. Use predictive modeling to identify the most profitable customers. Integrate data generated in-house with data acquired from outside sources. Optimize the supply chain; can determine the impact of unexpected glitches, simulate alternatives, and route shipments around problems. Analyze historical sales and pricing trends. Use sophisticated experiments to measure the overall impact or lift.
Supply chain: The network of all the individuals, organizations, resources, activities, and technology involved in the creation and sale of a product, from the delivery of source materials from the supplier to the manufacturer, through to its eventual delivery to the end user.
Customer-related processes: Pursuing a range of tactics that enable companies to attract and retain customers more effectively, engage in "dynamic pricing", optimize their brand management, translate customer interactions into sales, manage customer life cycles, and differentiate their products by personalizing them across multiple channels.
Customer-centered management: Ideas and theories where customers are at the forefront, by maximizing service and/or product offerings and building relationships.
Consumer choice model: The theory of consumer choice is the branch of microeconomics that relates preferences to consumption expenditures and to consumer demand curves. The models that make up consumer theory are used to represent prospectively observable demand patterns for an individual buyer on the hypothesis of constrained optimization.
Peer-reviewed article: Written by experts and reviewed by several other experts in the field before it is published in the journal.
Bag-of-words approach: A text is represented as the bag of its words. A way of extracting features from text for use in modeling. Simple and flexible. Uses a vocabulary of known words and a measure of the presence of known words. Has nothing to do with the order or structure of words. Word count.
Text analysis: Analyzes unstructured data to find trends and patterns in words and sentences.
NLP (Natural Language Processing)
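A minimal sketch of the bag-of-words approach described above: build a vocabulary of known words, then represent each text by its word counts. The two example sentences and the simple lowercase tokenizer are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]


docs = ["The cat sat on the mat.", "The dog sat on the log."]   # hypothetical texts
vocab = sorted({word for doc in docs for word in tokenize(doc)})
vectors = [bag_of_words(doc, vocab) for doc in docs]

print(vocab)
for vector in vectors:
    print(vector)
```

Shuffling the words of a document leaves its vector unchanged, which is what "nothing to do with the order or structure of words" means in practice.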
and understand; and (3) can be inserted into a database in a seamless fashion.
Unstructured data: No pre-defined format or organization; stored in native form; requires more work to process and understand; schema on read.
Nominal: Least precise and informative; only names a characteristic or identity. Examples: name, geographic location, zip code, partisanship.
Ordinal: The values stress the order or rank of the values; the size of the difference between them is not known. Examples: Likert scale, class standing, socioeconomic status, quality of democracy.
Interval: We know how far apart two levels are. Order of values plus the ability to quantify the difference between each.
Ratio: The highest level: ordered, interval values, plus the ability to calculate ratios, since a "true zero" can be defined.
Categorical data: Data that consists of names, labels, or other nonnumerical values.
Discrete data: Numerical data values that can be counted; take only certain values; countable in a finite amount of time; numeric and categorical.
Continuous data: Data that can take on any value; infinitely many possible values.
Quantitative data: Information about quantities and numbers. Statistical. Structured: rigid and defined. Numbers and values.
Qualitative data: Non-statistical. Typically unstructured or semi-structured. Isn't measured using hard numbers but is categorized based on properties, attributes, labels, and other identifiers. Answers "why?". Theories, interpretations, developing hypotheses, etc. Open for exploration.
CSV (comma-separated values): File format for transferring data, which stores fields and records in a plain text file, separated by commas.
JSON (JavaScript Object Notation): A popular data interchange format. JSON is a technology standard often used to format data when being sent or received via APIs.
TXT files
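To make the CSV and JSON cards concrete, the sketch below reads a small CSV text into records and re-serializes the same records as JSON, using only Python's standard `csv` and `json` modules. The column names and values are invented for illustration.

```python
import csv
import io
import json

# Hypothetical CSV content: one header line, then one record per line.
csv_text = "name,medal\nAda,Gold\nGrace,Silver\n"

# DictReader uses the header line as field names for every record.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# The same records serialized as JSON, the form commonly exchanged via APIs.
json_text = json.dumps(rows)
round_trip = json.loads(json_text)

print(rows)
print(json_text)
```

Both formats are plain text; CSV is limited to flat rows and columns, while JSON can also nest objects and lists.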
NoSQL: Non-relational or distributed. Document-based databases, graph databases, wide-column stores, key-value pairs. A non-relational DMS that does not require a fixed schema, avoids joins, and is easy to scale. Distributed data stores with very large data storage needs. Used for big data and real-time web apps.
Relational database: A type of database that stores and provides access to data points that are related to one another. The columns of the table hold attributes of the data, and each record has a value for each attribute, making it easy to establish the relationships among data points.
Data warehousing: A type of data management system designed to enable and support business intelligence activities, especially analytics. Used to perform queries and analysis, and often contains large amounts of historical data. A system that stores data from a company's operational databases as well as external sources. Different from operational databases because it stores historical information, making it easier for business leaders to analyze data over a specific period of time. Sorts data based on different subject matter. Important for ensuring consistency, making better business decisions, and improving the bottom line.
Algorithm: A set of instructions designed to perform a specific task. Created as functions. A finite sequence of well-defined, computer-implementable instructions, typically to solve a class of problems or to perform a computation.
"Normalize" data: The process of structuring a relational database in accordance with a series of normal forms in order to reduce data redundancy and improve data integrity. Normalization makes sure that all of your data looks and reads the same way across all records.
Accuracy: The most intuitive performance measure; the ratio of correctly predicted observations to the total observations. A great measure for symmetric datasets, where the numbers of false positives and false negatives are almost the same.
Precision: Quantifies the number of positive class predictions that actually belong to the positive class. The ratio of correctly predicted positive observations to the total predicted positive observations.
Recall: Quantifies the number of correct positive class predictions made out of all positive examples in the dataset.
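The three performance measures above can be computed directly from counts of true and false positives and negatives. The label vectors in this sketch are made up for illustration, with 1 marking the positive class.

```python
# Hypothetical ground-truth labels and model predictions (1 = positive class).
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)   # true positives
fp = sum(a == 0 and p == 1 for a, p in pairs)   # false positives
fn = sum(a == 1 and p == 0 for a, p in pairs)   # false negatives
tn = sum(a == 0 and p == 0 for a, p in pairs)   # true negatives

accuracy = (tp + tn) / len(pairs)   # correct predictions over all observations
precision = tp / (tp + fp)          # correct positives over predicted positives
recall = tp / (tp + fn)             # correct positives over actual positives

print(accuracy, precision, recall)
```

With these labels the three measures differ (0.7, about 0.67, and 0.5), which shows why accuracy alone can hide how many actual positives a model misses.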