Data Mining and Machine Learning Practice Questions, Exams of Business Accounting

A series of multiple-choice questions on data mining, machine learning, and simulation techniques. It covers topics such as the KDD process, collaborative filtering, time series analysis, decision tree algorithms (CHAID, SLIQ, SPRINT), Monte Carlo simulation, discrete event simulation, linear programming optimization, and SVM. Each question is followed by the correct answer and a brief explanation, making it a useful resource for students and professionals in the field. The questions test understanding of key concepts and their application in various analytical scenarios.

Typology: Exams

2024/2025

Available from 05/23/2025

locaz-turus-1

CAP Study Questions ALL
Q 1: Which of the following is not a part of the process of knowledge discovery from data (KDD)?
A) data cleaning
B) data mining.
C) knowledge presentation
D) knowledge transfer
Correct answer: D) knowledge transfer.
The KDD process includes seven steps: data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation. Knowledge transfer is not a part of KDD, but it is a part of a training program.
Q 2: Collaborative filtering is a type of
A) image editing process.
B) data mining based recommender system.
C) data cleaning tool.
D) team building activity.
Correct answer: B) data mining based recommender system.
Collaborative filtering is a type of data mining based recommender system in which the similarity between users is measured from their transactions, and an item is recommended based on that user similarity. It is generally used on online retail websites like Amazon.
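The idea in Q 2 can be sketched in a few lines. This is a minimal illustration, not a production recommender: it assumes transactions are stored as sets of purchased items, measures user similarity with the Jaccard index (one common choice among several), and recommends what the single most similar user bought. The user names and items are hypothetical.

```python
def jaccard(a, b):
    """Share of items two users have in common (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, purchases):
    """Recommend items bought by the user most similar to `target`."""
    others = {user: items for user, items in purchases.items() if user != target}
    nearest = max(others, key=lambda user: jaccard(purchases[target], others[user]))
    # Suggest only items the target user has not bought yet.
    return sorted(others[nearest] - purchases[target])


# Hypothetical transaction history: user -> set of purchased items
purchases = {
    "ann": {"book", "lamp", "pen"},
    "bob": {"book", "lamp", "mug"},
    "cat": {"shoes", "hat"},
}

print(recommend("ann", purchases))  # ['mug']
```

Here "bob" shares two of four distinct items with "ann", while "cat" shares none, so the recommendation comes from "bob".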
Q 3: Which component of the time series results in periodic above-trend and below-trend behavior of the time series lasting more than one year?
A) trend component


B) cyclical component.
C) seasonal component.
D) irregular component.
Correct answer: B) cyclical component.
The irregular component is a random pattern; the seasonal component shows a periodic pattern over one year or less; the trend is the long-run shift or movement in the time series observable over several periods of time.
Q 4: Which of the following is not a smoothing method for time series data?
A) logarithmic.
B) exponential.
C) moving averages.
D) weighted moving averages
Correct answer: A) logarithmic.
Three popular smoothing methods are moving averages, weighted moving averages, and exponential smoothing. The smoothing methods are appropriate for a stable time series, i.e. one that exhibits no significant trend, cyclical, or seasonal effects, because they adapt well to changes in the level of the time series.
Q 5: Which of the following decision tree algorithms is based on the chi-squared statistical test?
A) CART.
B) C4.
C) CHAID
D) ID3.
Correct answer: C) CHAID.
CHAID (chi-squared automatic interaction detector) is based on the chi-squared statistical test. It is popularly used in the marketing field.
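To make the chi-squared idea behind Q 5 concrete, here is a minimal sketch of the Pearson chi-squared statistic for a contingency table of a candidate predictor against the target. It only shows the test statistic; the category merging and p-value adjustment that a full CHAID implementation performs are omitted. The counts and predictor names are hypothetical.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if predictor and target were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows = predictor categories, columns = [responded, did not respond]
by_region = [[30, 70], [60, 40]]
by_gender = [[45, 55], [45, 55]]

chi_region = chi_squared_statistic(by_region)
chi_gender = chi_squared_statistic(by_gender)
print(round(chi_region, 2), round(chi_gender, 2))  # 18.18 0.0
```

With these counts, the region predictor is strongly associated with the response while the gender predictor is not, so a chi-squared-based tree would split on region first.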

B) necessary to assign the particular appropriate random numbers.
C) not necessary to develop a cumulative probability distribution.
D) necessary to develop a cumulative probability distribution.
Correct answer: D) necessary to develop a cumulative probability distribution.
The Monte Carlo simulation technique involves conducting repetitive experiments on the model of the system under study, using random numbers to draw random samples from some known probability distribution. It involves setting up a probability distribution for the variables; building a cumulative probability distribution for each random variable; generating random numbers; and conducting the simulation experiment using random sampling.
Q 9: Which of the following is an assumption of discrete event simulation?
A) no change in the system would occur between the events.
B) the changes in the system state are continuous.
C) time is not an important factor for simulation.
D) they run slower as compared to non-discrete event simulation
Correct answer: A) no change in the system would occur between the events.
The primary condition for discrete event simulation is that no change in the system occurs between events. The changes are not continuous, but discrete in nature. Time is always an important factor in all types of simulation. And since this approach does not need to simulate every time slice, it is assumed to be much faster than continuous simulation.
Q 10: A restaurant is running at full capacity and does not have the resources to accommodate the requests of new customers. An analytics consultant has proposed optimizing the current setup, as there is no scope for changing it and moving to a new setup. The consultant is given a data sheet for service patterns, inventory management, staff, working profiles, orders served per day, new target customers and their demands, and frequency of product selling. What should be the next step for the consultant?
A) forecast the future order requirements or predict the demand.

B) review the customer and staff satisfaction levels and compare them using statistical testing methods.
C) study the data, find the best optimization model to be used, and map the data to the problem.
D) use discrete event simulation to find an approximation of the current complex situation.
Correct answer: C) study the data, find the best optimization model to be used, and map the data to the problem.
Finding the best optimization model is the best way forward, as the need for optimization has already been proposed. There is no point in forecasting, as we already know that new customers are coming to the already packed system. There is no point in simulating the current process, as real-time data is available to manage the resources. Also, the satisfaction levels of both customers and staff seem to be reasonably high, as the services are in full-capacity demand.
Q 11: An online retail company wants to deliver products to its customers subject to the following constraints: (A) there is a limited number of courier staff; (B) each courier can carry a limited weight in a single day; (C) there is a priority associated with every order; (D) the maximum number of orders must be delivered on time. Which of the following techniques would be best suited for solving this problem?
A) linear programming optimization.
B) Monte Carlo simulation.
C) multiple linear regression.
D) CHAID classification tree model
Correct answer: A) linear programming optimization.
This is a clear case of optimization, where the delivery of orders needs to be maximized under the given constraints on the quantity that can be delivered and the given priorities. No simulation of the situation is required. No prediction model needs to be built to predict the orders delivered, and nothing needs to be classified to see whether an order has been delivered or not. It is a case of linear programming.
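A rough sketch of how the delivery problem in Q 11 might be posed as a linear program. It rests on simplifying assumptions that are not in the question: a single courier, one weight limit, and orders that may be fractionally selected (the LP relaxation). For that special case, sorting orders by priority per unit weight yields the optimum, so no solver library is needed. The order data are hypothetical; a realistic instance with several couriers and whole orders would normally be handed to an LP or integer-programming solver.

```python
def solve_relaxed_delivery(orders, capacity):
    """Maximize sum(priority * x) subject to sum(weight * x) <= capacity, 0 <= x <= 1."""
    plan = {}
    remaining = capacity
    # Highest priority per unit of weight first.
    ranked = sorted(orders, key=lambda o: o["priority"] / o["weight"], reverse=True)
    for order in ranked:
        fraction = min(1.0, remaining / order["weight"]) if remaining > 0 else 0.0
        plan[order["id"]] = fraction
        remaining -= fraction * order["weight"]
    return plan


# Hypothetical orders: priority acts as the objective coefficient, weight as the constraint coefficient
orders = [
    {"id": "A", "priority": 10, "weight": 5},
    {"id": "B", "priority": 6, "weight": 2},
    {"id": "C", "priority": 4, "weight": 4},
]

plan = solve_relaxed_delivery(orders, capacity=8)
objective = sum(o["priority"] * plan[o["id"]] for o in orders)
print(plan, objective)
```

Orders B and A fit entirely, and the remaining capacity covers a quarter of order C, which is the kind of fractional answer an LP relaxation produces.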

A) machine learning techniques.
B) simulation techniques.
C) data visualization techniques.
D) group communication techniques.
Correct answer: A) machine learning techniques.
Supervised and unsupervised learning are two categories of machine learning algorithms.
Q 15: Nonparametric statistics is most probably based on the following assumption.
A) the confidence interval is very narrow.
B) the population is distribution free.
C) the population is normally distributed.
D) the standard deviation of the sample is close to zero.
Correct answer: B) the population is distribution free.
Nonparametric statistics are based on fewer assumptions about the population and its parameters than parametric statistics are. They are sometimes referred to as distribution-free statistics, as the population is not assumed to have any particular distribution, which is not the case with parametric tests, where the population is generally assumed to have a normal distribution.
Q 16: Suppose five weeks of average prices for a stock are 57, 68, 64, 71, and 62, with a standard deviation of 4.84. What is the coefficient of variation (CV) for this stock?
A) 2.50%
B) 5.00%
C) 7.50%
D) 10.00%
Correct answer: C) 7.50%
CV is defined as (standard deviation / mean) × 100, so (4.84 / 64.4) × 100 = 7.5%.
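The arithmetic in Q 16 can be checked with the Python standard library. Note that the quoted standard deviation of 4.84 matches the population formula (`pstdev`), so that is what this sketch uses; the variable names are illustrative.

```python
from statistics import mean, pstdev

prices = [57, 68, 64, 71, 62]

average = mean(prices)        # 64.4
spread = pstdev(prices)       # population standard deviation, about 4.84
cv_percent = spread / average * 100

print(round(cv_percent, 1))   # 7.5
```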

Q 17: Which is a better prediction model based on performance lift?
A) model having constant lift curve running parallel to the X axis.
B) model having overlapping lift curve to the baseline.
C) model having small area between the lift curve and baseline.
D) model having large area between the lift curve and the baseline.
Correct answer: D) model having large area between the lift curve and the baseline.
Performance lift is a measure of the effectiveness of a predictive model, calculated as the ratio between the results obtained with and without the predictive model. Cumulative gains and lift charts are visual aids for measuring model performance. Both charts consist of a lift curve and a baseline. The greater the area between the lift curve and the baseline, the better the model.
Q 18: Which of the following is not a restriction of the performance lift chart of an analytics model?
A) lift charts require that the predictable attribute be a discrete value. In other words, you cannot use lift charts to measure the accuracy of models that predict continuous numeric values.
B) to see prediction accuracy lines for any individual value of the predictable attribute, you need not create a separate lift chart for each target value.
C) you cannot display time series models in a lift chart.
D) you can add multiple models to a lift chart, as long as the models all have the same predictable attribute
Correct answer: B) to see prediction accuracy lines for any individual value of the predictable attribute, you need not create a separate lift chart for each target value.
The prediction accuracy for all discrete values of the predictable attribute is shown in a single line. If you want to see prediction accuracy lines for any individual value of the predictable attribute, you must create a separate lift chart for each target value.
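The ratio described in Q 17 can be illustrated with a small sketch. It assumes a binary outcome and a model that assigns each case a score, ranks cases by score, and compares the response rate in the top fraction with the overall (no-model) rate. The scores and outcomes are hypothetical.

```python
def cumulative_lift(scores, outcomes, fraction):
    """Lift in the top `fraction` of cases when ranked by model score."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    top_rate = sum(outcome for _, outcome in ranked[:top_n]) / top_n
    base_rate = sum(outcomes) / len(outcomes)  # response rate without the model
    return top_rate / base_rate


# Hypothetical model scores and actual outcomes (1 = responded)
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

lift_top_20 = cumulative_lift(scores, outcomes, 0.2)
lift_all = cumulative_lift(scores, outcomes, 1.0)
print(round(lift_top_20, 2), lift_all)  # 3.33 1.0
```

The lift falls back to 1.0 when every case is contacted, which is why the baseline is a flat line and a larger area above it indicates a better model.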

A) panel data regression.
B) simulation.
C) logistic regression
D) MANOVA
Correct answer: C) logistic regression.
Since the dependent variable is binary in nature, it is most appropriate to use a logistic regression model.
Q 22: Which of the following is an analogous metric of adjusted R squared in logistic regression?
A) AIC
B) ROC curve
C) accuracy.
D) residual deviance.
Correct answer: A) AIC
The analogous metric of adjusted R squared in logistic regression is AIC (Akaike information criterion). AIC is a measure of fit that penalizes the model for the number of model coefficients. Therefore, the model with the minimum AIC value is always preferred.
Q 23: Given that the sum of squares total is 22.5, the sum of squares due to regression is 15.625, and the sum of squares error is 6.875 for a regression model, the coefficient of determination for this model would be:
A).
B).
C).
D) .56.
Correct answer: A).
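As a check on Q 23, the coefficient of determination can be computed from the given sums of squares in two equivalent ways, since SSR + SSE = SST. The variable names below are illustrative.

```python
sst = 22.5     # total sum of squares
ssr = 15.625   # sum of squares due to regression
sse = 6.875    # sum of squares error

r_squared_from_ssr = ssr / sst
r_squared_from_sse = 1 - sse / sst

print(round(r_squared_from_ssr, 4), round(r_squared_from_sse, 4))  # 0.6944 0.6944
```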

R squared, or the coefficient of determination, is calculated as SSR/SST or 1 - (SSE/SST).
Q 24: Considering all other parameters to be constant, which is the most appropriate explanation for a new predictor variable to remain in the regression model?
A) if the adjusted R squared of the model decreases
B) if the adjusted R squared of the model increases
C) if the adjusted R squared of the model remains constant.
D) a new variable's inclusion in the model does not depend on adjusted R squared.
Correct answer: B) if the adjusted R squared of the model increases
Generally, a new variable should increase the adjusted R squared of the model, which means that the strength of the prediction model has increased.
Q 25: Which of the following is not an assumption of regression modeling with respect to the error terms?
A) the errors are dependent.
B) the errors are normally distributed.
C) the errors have a mean of 0
D) the errors have a constant variance.
Correct answer: A) the errors are dependent.
One important assumption while performing regression is that the errors are independent.
Q 26: Which of the following is the most appropriate assumption of regression modeling with respect to predictor variables?
A) predictors should be perfectly correlated with each other.
B) predictors should have zero variance.

High pairwise correlations, a high R squared from auxiliary regressions, a very large condition index (generally greater than 30), and a large VIF are some of the indicators of multicollinearity in the model.
Q 29: A sure way of removing multicollinearity from the model is to
A) work with panel data.
B) drop the variables that cause multicollinearity in the first place.
C) transform the variables by first differencing them.
D) obtain additional sample data.
Correct answer: B) drop the variables that cause multicollinearity in the first place.
The other options may or may not reduce the multicollinearity, but dropping the variables certainly reduces the multicollinearity in the model.
Q 30: Equal variance of error terms in a regression model is termed:
A) multicollinearity.
B) homoscedasticity.
C) heteroscedasticity.
D) autocorrelation.
Correct answer: B) homoscedasticity.
Unequal variance in error terms is known as heteroscedasticity, while equal variance of error terms is known as homoscedasticity. If regressors are perfectly correlated with each other, it is known as multicollinearity.
Q 31: Which of the following is not a source of heteroscedasticity?
A) skewness in the regressors

B) outliers in the regressors
C) incorrect data transformations.
D) addition of significant regressors.
Correct answer: D) addition of significant regressors.
Adding a significant regressor never creates heteroscedasticity; in fact, a correctly specified model supports homoscedasticity. Outliers, skewness, and incorrect transformations certainly impact the model and increase heteroscedasticity.
Q 32: When error terms across time series data are inter-correlated, it is known as
A) cross correlation.
B) cross autocorrelation.
C) spatial autocorrelation.
D) serial autocorrelation.
Correct answer: D) serial autocorrelation.
When error terms across sections of data are correlated, it is spatial autocorrelation, while across time series it is known as serial autocorrelation.
Q 33: The regression coefficients estimated in the presence of autocorrelation in the sample data are not
A) unbiased estimators
B) consistent estimators.
C) efficient estimators.
D) linear estimators.
Correct answer: C) efficient estimators.
They are not efficient estimators, and must not be relied on for any decision-making.

A) independent t-test.
B) ANOVA
C) chi-squared.
D) F test
Correct answer: C) chi-squared.
Since the data is categorical in nature, chi-squared is the most suitable test to answer the query based on the given data set.
Q 37: Consider the following data set and suggest the most appropriate nature of the data. Data: [1, 2, 3, 2, 5, 6, 3, 1, 8, 9, 3, 2, 1, 7, 6, 8, 9, 10]
A) unimodal
B) bimodal.
C) multimodal.
D) no modal
Correct answer: C) multimodal.
Three values (1, 2, and 3) share the highest frequency in the data set; therefore, it is multimodal in nature.
Q 38: Which of the following is the most appropriate relation for data having a symmetric distribution?
A) mean = median = mode
B) mean > median > mode
C) mean < median < mode.
D) mean > median < mode
Correct answer: A) mean = median = mode

For a symmetric distribution, approximately mean = median = mode, and all are located at the center of the distribution.
Q 39: The amount of peakedness of a distribution is measured by
A) skewness.
B) kurtosis.
C) interquartile range.
D) standard deviation.
Correct answer: B) kurtosis
Kurtosis describes the amount of peakedness of a distribution. A distribution may be leptokurtic, platykurtic, or mesokurtic.
Q 40: If a set of data is normally distributed or bell-shaped, as per the empirical rule, approximately what percentage of the data values are within two standard deviations of the mean?
A) 68%
B) 95%
C) 99.70%
D) 99.90%.
Correct answer: B) 95%
As per the empirical rule, 68% of the data is within one standard deviation, 95% is within two standard deviations, and 99.7% is within three standard deviations of the mean.
Q 41: The covariance of two variables, normalized by the variance of each variable, is termed
A) correlation.

Normally distributed data assumes a bell-shaped curve, with the mean, median, and mode placed approximately at the center of the distribution.
Q 44: Which of these is not true of the difference between system dynamics (SD) and discrete event simulation (DES)?
A) SD models the behavior of systems using differential equations, while DES models it using a simulation clock that advances time in fixed increments.
B) SD models attempt to capture all of the aspects of a process within a closed system, while DES models more often reflect systems where entities are processed in a linear fashion.
C) DES models are used when the goal is a statistically valid estimate of system performance; SD is more often the tool of choice for a training vehicle.
D) a major part of the DES modeling effort is associated with capturing the mental models, while SD models are often built from a process map or flow chart
Correct answer: D) a major part of the DES modeling effort is associated with capturing the mental models, while SD models are often built from a process map or flow chart.
A major part of the SD modeling effort is associated with capturing the mental models, while DES models are often built from a process map or flow chart.
Q 45: Which of these is most relevant to the application of discrete event simulation?
A) solving a single-server queuing system.
B) managing an inventory model in which the stock is inspected only once a week.
C) predicting the house rent based on house characteristics like number of rooms, size of house, etc.
D) comparing the before-and-after impact of a TV advertisement on brand recall of a product for the consumers.
Correct answer: A) solving a single-server queuing system.

The inventory model is an example of discrete simulation only, not discrete event simulation; house rent prediction can be done using multiple linear regression; the TV ad impact can be compared using a paired t-test statistic.
Q 46: Which is the most suitable method to compute the risk value of a portfolio?
A) maximax criterion
B) logistic regression
C) Monte Carlo simulation
D) linear programming.
Correct answer: C) Monte Carlo simulation.
Monte Carlo simulation uses a probability distribution to model a stochastic or random variable. Different probability distributions, such as normal, lognormal, uniform, and triangular, are used for modeling input variables. From the probability distributions of the input variables, different paths of outcomes are generated. Compared to deterministic analysis, the Monte Carlo method provides a superior simulation of risk. It gives an idea of not only what outcome to expect, but also the probability of occurrence of that outcome. It is also possible to model correlated input variables. For instance, Monte Carlo simulation can be used to compute the value at risk of a portfolio. This method tries to predict the worst return expected from a portfolio, given a certain confidence interval for a specified time. Normally, stock prices are believed to follow a geometric Brownian motion (GBM), which is a Markov process: the price follows a random walk in which its future values depend on the current value.
Q 47: Annabelle wants to open a small apparel shop in Vienna. She has located a good mall that attracts Rhett customers. Her options are to open a small shop, a medium-size shop, or no shop at all. The market for an apparel shop can be good, average, or bad. The probabilities for these three possibilities are .2 for a good market, .5 for an average market, and .3 for a bad market. The net profit or loss for the different size shops is given in the table below. Which is the best decision based on the EMV criterion?
For good, average, bad respectively: Small shop = 75k, 25k, -40k