Data Mining and Machine Learning Practice Questions, Exams of Business Accounting

London Institute of Business and Technology (LIBT)Business Accounting

This document is a series of multiple-choice questions on data mining, machine learning, and simulation techniques. It covers topics such as the KDD process, collaborative filtering, time series analysis, decision tree algorithms (CHAID, SLIQ, SPRINT), Monte Carlo simulation, discrete event simulation, linear programming optimization, and SVM. Each question is followed by the correct answer and a brief explanation. The questions test understanding of key concepts and their application in various analytical scenarios.

Typology: Exams

2024/2025

Available from 05/23/2025

locaz-turus-1 🇺🇸



CAP Study Questions ALL

Q 1: Which of the following is not a part of the process of knowledge discovery from data (KDD)?
A) data cleaning
B) data mining
C) knowledge presentation
D) knowledge transfer
Correct answer: D) knowledge transfer.
The KDD process includes seven steps: data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation. Knowledge transfer is not a part of KDD; it is a part of a training program.

Q 2: Collaborative filtering is a type of
A) image editing process
B) data mining based recommender system
C) data cleaning tool
D) team building activity
Correct answer: B) data mining based recommender system.
Collaborative filtering is a type of data mining based recommender system in which user similarity is measured based on users' transactions, and an item is recommended based on that user similarity. It is generally used on online retail websites like Amazon.
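The idea in Q 2 can be sketched in a few lines of Python. This is only an illustrative toy, not how any particular retailer implements it: the purchase data, the user names, and the helper names `similarity` and `recommend` are all hypothetical, and cosine similarity on purchase sets is just one of several possible similarity measures.

```python
from math import sqrt

# Hypothetical purchase histories: user -> set of items bought (illustrative only).
purchases = {
    "ann": {"book", "lamp", "mug"},
    "bob": {"book", "lamp", "pen"},
    "cat": {"sofa", "rug"},
}

def similarity(a, b):
    """Cosine similarity between two users' purchase sets (treated as 0/1 vectors)."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def recommend(user, data):
    """Rank items the user has not bought by the similarity of the users who bought them."""
    scores = {}
    for other, items in data.items():
        if other == user:
            continue
        sim = similarity(data[user], items)
        for item in items - data[user]:
            scores[item] = scores.get(item, 0.0) + sim
    # Highest-scoring unseen items first; items backed only by dissimilar users are dropped.
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [item for item, score in ranked if score > 0]

print(recommend("ann", purchases))  # ['pen']
```

Here "ann" and "bob" share two purchases, so the one item "bob" bought that "ann" has not is recommended; "cat" shares nothing with "ann" and contributes no recommendations.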

Q 3: Which component of the time series results in periodic above-trend and below-trend behavior of the time series lasting more than one year?
A) trend component


B) cyclical component
C) seasonal component
D) irregular component
Correct answer: B) cyclical component.
The irregular component is a random pattern; the seasonal component shows a periodic pattern over one year or less; the trend is the long-run shift or movement in the time series observable over several periods of time.

Q 4: Which of the following is not a smoothing method for time series data?
A) logarithmic
B) exponential
C) moving averages
D) weighted moving averages
Correct answer: A) logarithmic.
Three popular smoothing methods are moving averages, weighted moving averages, and exponential smoothing. The smoothing methods are appropriate for a stable time series, i.e. one that exhibits no significant trend, cyclical, or seasonal effects, because they adapt well to changes in the level of the time series.

Q 5: Which of the following decision tree algorithms is based on the chi-squared statistical test?
A) CART
B) C4.
C) CHAID
D) ID3
Correct answer: C) CHAID.
CHAID (chi-squared automatic interaction detector) is based on the chi-squared statistical test. It is popularly used in the marketing field.
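The three smoothing methods named in Q 4 can be sketched as follows. The demand values, the window size, the weights, and the smoothing constant `alpha` are hypothetical choices for illustration, and initializing the exponential smoother with the first observation is one common convention, not the only one.

```python
def moving_average(series, window):
    """Simple moving average: mean of the last `window` observations at each step."""
    return [sum(series[i - window:i]) / window
            for i in range(window, len(series) + 1)]

def weighted_moving_average(series, weights):
    """Weighted moving average; `weights` run oldest-to-newest and should sum to 1."""
    w = len(weights)
    return [sum(x * wt for x, wt in zip(series[i - w:i], weights))
            for i in range(w, len(series) + 1)]

def exponential_smoothing(series, alpha):
    """Exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_(t-1), with s_0 = x_0."""
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

demand = [10, 12, 11, 13, 12]  # hypothetical observations

print(moving_average(demand, 3))                         # [11.0, 12.0, 12.0]
print(weighted_moving_average(demand, [0.2, 0.3, 0.5]))
print(exponential_smoothing(demand, 0.5))
```

All three produce a value near the current level of the series, which is why they suit a stable series and lag behind one with a strong trend.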

B) necessary to assign the particular appropriate random numbers
C) not necessary to develop a cumulative probability distribution
D) necessary to develop a cumulative probability distribution
Correct answer: D) necessary to develop a cumulative probability distribution.
The Monte Carlo simulation technique involves conducting repetitive experiments on a model of the system under study, using a known probability distribution to draw random samples with random numbers. It involves setting up a probability distribution for the variables; building a cumulative probability distribution for each random variable; generating random numbers; and conducting the simulation experiment using random sampling.

Q 9: Which of the following is an assumption of discrete event simulation?
A) no change in the system would occur between the events
B) the changes in the system state are continuous
C) time is not an important factor for simulation
D) they run slower as compared to non-discrete event simulation
Correct answer: A) no change in the system would occur between the events.
The primary condition for discrete event simulation is that no change in the system occurs between events. The changes are discrete, not continuous. Time is an important factor in all types of simulation. Because this approach does not need to simulate every time slice, it is assumed to be much faster than continuous simulation.

Q 10: A restaurant is running at full capacity and does not have the resources to accommodate requests from new customers. An analytics consultant has proposed to optimize the current setup, as there is no scope to change it or move to a new setup. The consultant is given a data sheet covering service patterns, inventory management, staff, working profiles, orders served per day, new target customers and their demands, and frequency of product selling. What should be the next step for the consultant?
A) forecast the future order requirements or predict the demand
B) review the customer and staff satisfaction levels and compare them using statistical testing methods
C) study the data, find the best optimization model to be used, and map the data to the problem
D) use discrete event simulation to find an approximation of the current complex situation
Correct answer: C) study the data, find the best optimization model to be used, and map the data to the problem.
Finding the best optimization model is the best way forward, as the need for optimization has already been proposed. There is no point in forecasting, as we already know that new customers are coming to an already packed system. There is no point in simulating the current process, as real-time data is available to manage the resources. Also, the satisfaction levels of both customers and staff seem to be reasonably high, as the services are in full-capacity demand.

Q 11: An online retail company wants to deliver products to its customers under the following constraints: (a) there is a limited number of courier staff; (b) each courier can carry a limited weight in a single day; (c) there is a priority associated with every order; (d) the maximum number of orders must be delivered on time. Which of the following techniques would be best suited for solving this problem?
A) linear programming optimization
B) Monte Carlo simulation
C) multiple linear regression
D) CHAID classification tree model
Correct answer: A) linear programming optimization.
This is a clear case of optimization, where the delivery of orders needs to be maximized subject to constraints on the quantity that can be delivered and the given priorities. No simulation of the situation is required. No prediction model needs to be built to predict the orders delivered, and nothing needs to be classified to see whether an order has been delivered or not. It is a case of linear programming.
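The Monte Carlo steps described in the explanation that ends with the cumulative-distribution answer (set up a probability distribution, build its cumulative distribution, generate random numbers, sample) can be sketched as below. The demand distribution, the seed, and the function name `sample_demand` are hypothetical and chosen only for illustration.

```python
import random
from bisect import bisect_left
from itertools import accumulate

# Step 1: hypothetical probability distribution for daily demand (value -> probability).
demand_probs = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}

# Step 2: cumulative probability distribution, approximately [0.1, 0.3, 0.7, 1.0].
values = list(demand_probs)
cumulative = list(accumulate(demand_probs.values()))

def sample_demand(rng):
    """Steps 3 and 4: draw a uniform random number and map it onto the cumulative distribution."""
    u = rng.random()
    # First interval whose upper bound is >= u; min() guards against float round-off at the top.
    idx = min(bisect_left(cumulative, u), len(values) - 1)
    return values[idx]

rng = random.Random(42)  # fixed seed so the experiment is repeatable
draws = [sample_demand(rng) for _ in range(10_000)]

# The sample mean should be close to the expected demand 0*0.1 + 1*0.2 + 2*0.4 + 3*0.3 = 1.9.
print(sum(draws) / len(draws))
```

The cumulative distribution is what turns a single uniform random number into a draw from the target distribution, which is why building it is a necessary step.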

A) machine learning techniques
B) simulation techniques
C) data visualization techniques
D) group communication techniques
Correct answer: A) machine learning techniques.
Supervised and unsupervised learning are two categories of machine learning algorithms.

Q 15: Nonparametric statistics is most probably based on the following assumption:
A) the confidence interval is very narrow
B) the population is distribution free
C) the population is normally distributed
D) the standard deviation of the sample is close to zero
Correct answer: B) the population is distribution free.
Nonparametric statistics are based on fewer assumptions about the population and its parameters than parametric statistics are. They are sometimes referred to as distribution-free statistics, as the population is not assumed to have any particular distribution. This is not the case with parametric tests, where the population is generally assumed to have a normal distribution.

Q 16: Suppose five weeks of average prices for a stock are 57, 68, 64, 71, and 62, with a standard deviation of 4.84. What is the coefficient of variation (CV) for this stock?
A) 2.50%
B) 5.00%
C) 7.50%
D) 10.00%
Correct answer: C) 7.50%
CV is defined as (standard deviation / mean) × 100, so (4.84 / 64.4) × 100 = 7.5%.
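The arithmetic in Q 16 can be checked with Python's standard library. One detail worth noting: the quoted 4.84 matches the population standard deviation (dividing by n), so `pstdev` is used here; the sample standard deviation (dividing by n - 1) would give a larger value.

```python
from statistics import mean, pstdev

prices = [57, 68, 64, 71, 62]  # weekly average prices from Q 16

avg = mean(prices)     # 64.4
sd = pstdev(prices)    # population standard deviation, about 4.84
cv = sd / avg * 100    # coefficient of variation, in percent

print(round(avg, 1), round(sd, 2), round(cv, 1))  # 64.4 4.84 7.5
```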

Q 17: Which is a better prediction model based on performance lift?
A) model having a constant lift curve running parallel to the X axis
B) model having a lift curve overlapping the baseline
C) model having a small area between the lift curve and the baseline
D) model having a large area between the lift curve and the baseline
Correct answer: D) model having a large area between the lift curve and the baseline.
Performance lift is a measure of the effectiveness of a predictive model, calculated as the ratio between the results obtained with and without the predictive model. Cumulative gains and lift charts are visual aids for measuring model performance. Both charts consist of a lift curve and a baseline. The greater the area between the lift curve and the baseline, the better the model.

Q 18: Which of the following is not a restriction of the performance lift chart of an analytics model?
A) Lift charts require that the predictable attribute be a discrete value. In other words, you cannot use lift charts to measure the accuracy of models that predict continuous numeric values.
B) To see prediction accuracy lines for any individual value of the predictable attribute, you need not create a separate lift chart for each target value.
C) You cannot display time series models in a lift chart.
D) You can add multiple models to a lift chart, as long as the models all have the same predictable attribute.
Correct answer: B) To see prediction accuracy lines for any individual value of the predictable attribute, you need not create a separate lift chart for each target value.
The prediction accuracy for all discrete values of the predictable attribute is shown in a single line. If you want to see prediction accuracy lines for any individual value of the predictable attribute, you must create a separate lift chart for each target value.
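The ratio described in Q 17 (results with the model versus without it) can be sketched as a cumulative lift calculation. The scores, the outcomes, and the function name `cumulative_lift` are hypothetical; real lift charts are usually computed per decile on a held-out data set.

```python
def cumulative_lift(scores, labels, fraction):
    """Lift among the top `fraction` of cases ranked by model score.

    Lift = (response rate in the targeted group) / (overall response rate).
    A lift of 1.0 means the model does no better than the baseline.
    """
    ranked = [label for _, label in sorted(zip(scores, labels), key=lambda p: -p[0])]
    n_top = max(1, round(len(ranked) * fraction))
    top_rate = sum(ranked[:n_top]) / n_top
    base_rate = sum(ranked) / len(ranked)
    return top_rate / base_rate

# Hypothetical model scores and actual outcomes (1 = responded, 0 = did not).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

print(cumulative_lift(scores, labels, 0.2))  # top 20% of cases: well above 1
print(cumulative_lift(scores, labels, 1.0))  # whole population: exactly the baseline
```

Plotting this lift for increasing fractions gives the lift curve; the further it sits above the baseline of 1.0, the larger the area between the two and the better the model.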

A) panel data regression
B) simulation
C) logistic regression
D) MANOVA
Correct answer: C) logistic regression.
Since the dependent variable is binary in nature, it is most appropriate to use a logistic regression model.

Q 22: Which of the following is an analogous metric of adjusted R squared in logistic regression?
A) AIC
B) ROC curve
C) accuracy
D) residual deviance
Correct answer: A) AIC
The analogous metric of adjusted R squared in logistic regression is AIC (Akaike information criterion). AIC is a measure of fit that penalizes the model for the number of model coefficients. Therefore, the model with the minimum AIC value is preferred.

Q 23: Given that the sum of squares total is 22.5, the sum of squares due to regression is 15.625, and the sum of squares error is 6.875 for a regression model, the coefficient of determination for this model would be:
A)
B)
C)
D) .56
Correct answer: A)
R squared, or the coefficient of determination, is calculated as SSR/SST or 1 - (SSE/SST). Here, 15.625/22.5 = 1 - (6.875/22.5) ≈ 0.694.

Q 24: Considering all other parameters to be constant, which is the most appropriate reason for a new predictor variable to remain in the regression model?
A) if the adjusted R squared of the model decreases
B) if the adjusted R squared of the model increases
C) if the adjusted R squared of the model remains constant
D) a new variable's inclusion in the model does not depend on adjusted R squared
Correct answer: B) if the adjusted R squared of the model increases
Generally, a new variable should increase the adjusted R squared of the model, which means that the strength of the prediction model has increased.

Q 25: Which of the following is not an assumption of regression modeling with respect to the error terms?
A) the errors are dependent
B) the errors are normally distributed
C) the errors have a mean of 0
D) the errors have a constant variance
Correct answer: A) the errors are dependent.
One important assumption in regression is that the errors are independent.

Q 26: Which of the following is the most appropriate assumption of regression modeling with respect to predictor variables?
A) predictors should be perfectly correlated with each other
B) predictors should have zero variance
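As a quick check of the formula quoted for Q 23, and of the adjusted R squared idea in Q 24, the following sketch computes both forms of R squared from the given sums of squares. The sample size `n` and number of predictors `k` used for the adjusted value are hypothetical, since the question does not state them.

```python
sst = 22.5     # total sum of squares (from Q 23)
ssr = 15.625   # sum of squares due to regression
sse = 6.875    # sum of squares error

r2_from_ssr = ssr / sst
r2_from_sse = 1 - sse / sst

def adjusted_r2(r2, n, k):
    """Adjusted R squared for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(r2_from_ssr, 4), round(r2_from_sse, 4))  # 0.6944 0.6944

# n and k are hypothetical; the adjusted value is always at most the plain R squared.
print(adjusted_r2(r2_from_ssr, n=20, k=2))
```

Because the adjustment penalizes every extra predictor, adjusted R squared rises only when a new variable improves the fit by more than the penalty, which is the reasoning behind the answer to Q 24.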

High pairwise correlations, a high R squared from the auxiliary regressions, a very large condition index (generally greater than 30), and a large VIF are some of the indicators of multicollinearity in the model.

Q 29: A sure way of removing multicollinearity from the model is to
A) work with panel data
B) drop the variables that cause multicollinearity in the first place
C) transform the variables by first differencing them
D) obtain additional sample data
Correct answer: B) drop the variables that cause multicollinearity in the first place.
The other options may or may not reduce the multicollinearity, but dropping the variables certainly reduces the multicollinearity in the model.

Q 30: Equal variance of error terms in a regression model is termed:
A) multicollinearity
B) homoscedasticity
C) heteroscedasticity
D) autocorrelation
Correct answer: B) homoscedasticity.
Unequal variance in error terms is known as heteroscedasticity, while equal variance of error terms is known as homoscedasticity. If regressors are perfectly correlated with each other, it is known as multicollinearity.

Q 31: Which of the following is not a source of heteroscedasticity?
A) skewness in the regressors
B) outliers in the regressors
C) incorrect data transformations
D) addition of significant regressors
Correct answer: D) addition of significant regressors.
Adding a significant regressor never creates heteroscedasticity; in fact, a correctly specified model supports homoscedasticity. Outliers, skewness, and incorrect transformations certainly affect the model and increase heteroscedasticity.

Q 32: When error terms across time series data are intercorrelated, it is known as
A) cross correlation
B) cross autocorrelation
C) spatial autocorrelation
D) serial autocorrelation
Correct answer: D) serial autocorrelation.
When error terms across sections of data are correlated, it is spatial autocorrelation; across time series, it is known as serial autocorrelation.

Q 33: The regression coefficients estimated in the presence of autocorrelation in the sample data are not
A) unbiased estimators
B) consistent estimators
C) efficient estimators
D) linear estimators
Correct answer: C) efficient estimators.
They are not efficient estimators and must not be relied on for any decision-making.

A) independent t-test
B) ANOVA
C) chi-squared
D) F test
Correct answer: C) chi-squared.
Since the data is categorical in nature, chi-squared is the most suitable test to answer the query based on the given data set.

Q 37: Consider the following data set and suggest the most appropriate nature of the data.
Data: [1, 2, 3, 2, 5, 6, 3, 1, 8, 9, 3, 2, 1, 7, 6, 8, 9, 10]
A) unimodal
B) bimodal
C) multimodal
D) no mode
Correct answer: C) multimodal.
Three values (1, 2, and 3) share the highest frequency in the data set, at three occurrences each; therefore, it is multimodal in nature.

Q 38: Which of the following is the most appropriate relation for data having a symmetric distribution?
A) mean = median = mode
B) mean > median > mode
C) mean < median < mode
D) mean > median < mode
Correct answer: A) mean = median = mode
For a symmetric distribution, approximately mean = median = mode, and all are located at the center of the distribution.

Q 39: The amount of peakedness of a distribution is measured by
A) skewness
B) kurtosis
C) interquartile range
D) standard deviation
Correct answer: B) kurtosis
Kurtosis describes the amount of peakedness of a distribution. A distribution may be leptokurtic, platykurtic, or mesokurtic.

Q 40: If a set of data is normally distributed (bell-shaped), as per the empirical rule, approximately what percentage of the data values are within two standard deviations of the mean?
A) 68%
B) 95%
C) 99.70%
D) 99.90%
Correct answer: B) 95%
As per the empirical rule, 68% of the data is within one standard deviation, 95% is within two standard deviations, and 99.7% is within three standard deviations of the mean.

Q 41: The covariance of two variables, normalized by the variance of each variable, is termed
A) correlation
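Two of the facts above, the three modes in the Q 37 data set and the empirical-rule percentages in Q 40, can be verified with Python's `statistics` module (`multimode` requires Python 3.8 or later). This is only a check of the numbers quoted in the explanations.

```python
from statistics import NormalDist, multimode

data = [1, 2, 3, 2, 5, 6, 3, 1, 8, 9, 3, 2, 1, 7, 6, 8, 9, 10]  # data set from Q 37

modes = multimode(data)  # every value tied for the highest frequency
print(modes)             # [1, 2, 3]

# Empirical rule: share of a normal distribution within k standard deviations of the mean.
z = NormalDist()         # standard normal: mean 0, standard deviation 1
for k in (1, 2, 3):
    share = z.cdf(k) - z.cdf(-k)
    print(k, round(share * 100, 1))  # roughly 68, 95, and 99.7 percent
```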

Normally distributed data assumes a bell-shaped curve, with the mean, median, and mode located approximately at the center of the distribution.

Q 44: Which of these is not true of the difference between system dynamics (SD) and discrete event simulation (DES)?
A) SD models the behavior of systems using differential equations, while DES models use a simulation clock that advances time in fixed increments.
B) An SD model attempts to capture all of the aspects of a process within a closed system, while DES models more often reflect systems where entities are processed in a linear fashion.
C) DES models are used when the goal is a statistically valid estimate of system performance; SD is more often the tool of choice for a training vehicle.
D) A major part of the DES modeling effort is associated with capturing the mental models, while SD models are often built from a process map or flow chart.
Correct answer: D) A major part of the DES modeling effort is associated with capturing the mental models, while SD models are often built from a process map or flow chart.
A major part of the SD modeling effort is associated with capturing the mental models, while DES models are often built from a process map or flow chart.

Q 45: Which of these is most relevant to the application of discrete event simulation?
A) solving a single-server queuing system
B) managing an inventory model in which the stock is inspected only once a week
C) predicting house rent based on house characteristics like number of rooms, size of house, etc.
D) comparing the before and after impact of a TV advertisement on consumers' brand recall of a product
Correct answer: A) solving a single-server queuing system.
The inventory model is an example of discrete simulation only, not discrete event simulation; house rent prediction can be done using multiple linear regression; TV ad impact can be compared using the paired t-test statistic.

Q 46: Which is the most suitable method to compute the risk value of a portfolio?
A) maximax criterion
B) logistic regression
C) Monte Carlo simulation
D) linear programming
Correct answer: C) Monte Carlo simulation.
Monte Carlo simulation uses a probability distribution to model a stochastic or random variable. Different probability distributions are used for modeling input variables, such as normal, lognormal, uniform, and triangular. From the probability distributions of the input variables, different paths of outcome are generated. Compared to deterministic analysis, the Monte Carlo method provides a superior simulation of risk. It gives an idea of not only what outcome to expect, but also the probability of occurrence of that outcome. It is also possible to model correlated input variables. For instance, Monte Carlo simulation can be used to compute the value at risk of a portfolio. This method tries to predict the worst return expected from a portfolio, given a certain confidence level, for a specified time. Normally, stock prices are believed to follow a geometric Brownian motion (GBM), which is a Markov process: the price follows a random walk, and its future values depend only on the current value.

Q 47: Annabelle wants to open a small apparel shop in Vienna. She has located a good mall that attracts Rhett customers. Her options are to open a small shop, a medium-size shop, or no shop at all. The market for an apparel shop can be good, average, or bad. The probabilities for these three possibilities are .2 for a good market, .5 for an average market, and .3 for a bad market. The net profit or loss for the different shop sizes is given in the table below. Which is the best decision based on the EMV criterion?
For good, average, bad respectively: Small shop = 75k, 25k, -40k
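The EMV (expected monetary value) criterion in Q 47 weights each payoff by the probability of its market state. The sketch below computes it only for the small-shop row, which is the only payoff row shown here; the medium-shop payoffs are not available in this text, and the zero payoff for "no shop" is an assumption made for illustration.

```python
probabilities = {"good": 0.2, "average": 0.5, "bad": 0.3}

# Payoffs in thousands. Only the small-shop row appears in the text;
# "no shop" is ASSUMED to pay 0 in every market state.
payoffs = {
    "small shop": {"good": 75, "average": 25, "bad": -40},
    "no shop": {"good": 0, "average": 0, "bad": 0},
}

def emv(payoff_row):
    """Expected monetary value: probability-weighted sum of the payoffs."""
    return sum(probabilities[state] * payoff_row[state] for state in probabilities)

for option, row in payoffs.items():
    print(option, round(emv(row), 2))
```

For the small shop this gives 0.2 × 75 + 0.5 × 25 + 0.3 × (-40) = 15.5 (thousand). Under the EMV criterion, the option with the highest such value is chosen once every row of the table is filled in.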
