


































































The MIS Data Mining Ultimate Exam focuses on extracting meaningful insights from large datasets using advanced analytical techniques. Topics include data preprocessing, clustering, classification, regression, association rules, and predictive modeling. Candidates will learn how to use data mining tools and algorithms to support decision-making. The exam includes practical scenarios that simulate real business intelligence applications, making it ideal for professionals in information systems and analytics.
Typology: Exams



































































Question 1. Which stage of the KDD process involves transforming raw data into a suitable format for mining?
A) Data selection
B) Data preprocessing
C) Pattern evaluation
D) Knowledge presentation
Answer: B
Explanation: Data preprocessing cleans, integrates, and transforms raw data, preparing it for mining.

Question 2. In a star schema, the central table is called a:
A) Dimension table
B) Fact table
C) Bridge table
D) Aggregate table
Answer: B
Explanation: The fact table stores quantitative data and is linked to dimension tables in a star schema.

Question 3. Which OLAP operation allows you to view data from a different perspective by rotating the axes?
A) Drill‑down
B) Roll‑up
C) Slice
D) Pivot
Answer: D
Explanation: Pivot (or rotate) changes the dimensional orientation of the data view.

Question 4. Which of the following is a common method for handling missing values in data cleaning?
A) Z‑score normalization
B) Imputation with mean
C) PCA reduction
D) Apriori pruning
Answer: B
Explanation: Imputing missing numeric values with the mean of the attribute is a standard technique.

Question 5. The Min‑Max normalization scales data to which range?
A) 0 to 1
B) –1 to 1
C) 0 to 100
D) –∞ to +∞
Answer: A
Explanation: Min‑Max rescales each attribute to the interval [0, 1].

Question 6. Which algorithm is specifically designed to mine frequent itemsets without candidate generation?
A) Apriori
B) FP‑Growth
C) k‑Means
D) CART
Answer: B
Explanation: FP‑Growth builds a compact FP‑tree and extracts frequent patterns directly.

Question 7. In association rule mining, the metric “lift” measures:
A) The probability of the consequent occurring alone.
B) The increase in confidence over random chance.
C) The number of transactions containing both antecedent and consequent.
D) The support of the antecedent.
Answer: B
Explanation: Lift = confidence / (support of consequent); values > 1 indicate a rule better than random.

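The lift formula from Question 7 can be checked on a small example. The sketch below is illustrative only: the transactions are invented, and the function names `support`, `confidence`, and `lift` are hypothetical helpers rather than calls from any particular library.

```python
# Hypothetical toy transactions, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """confidence / support(consequent), as in the explanation above."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

rule_lift = lift({"bread"}, {"milk"}, transactions)
print(round(rule_lift, 3))  # 0.889
```

Here the lift is below 1, so in this toy data buying bread makes milk slightly less likely than its baseline frequency would suggest.
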
C) Identity kernel
D) None – SVM is always linear
Answer: B
Explanation: Polynomial (and RBF) kernels map data into higher‑dimensional spaces, allowing non‑linear separation.

Question 12. In k‑Means clustering, the objective function minimizes:
A) Sum of squared distances between points and their cluster centroids.
B) Maximum distance between any two points in a cluster.
C) Number of clusters.
D) Silhouette coefficient.
Answer: A
Explanation: k‑Means iteratively reduces the within‑cluster sum of squares.

Question 13. Which clustering method builds a dendrogram by initially treating each object as a separate cluster?
A) Divisive hierarchical clustering
B) Agglomerative hierarchical clustering
C) k‑Medoids
D) DBSCAN
Answer: B
Explanation: Agglomerative clustering merges the closest pairs of clusters step‑by‑step, starting with singletons.

Question 14. DBSCAN defines a cluster based on:
A) Fixed number of points (k).
B) Density reachability and a minimum number of points within ε‑neighborhood.
C) Hierarchical merging.
D) Centroid distance only.
Answer: B
Explanation: DBSCAN groups points that are density‑connected, using ε and MinPts parameters.

Question 15. The Silhouette Coefficient for a data point ranges between:
A) 0 and 1
B) –1 and 1
C) –∞ and +∞
D) 0 and 100
Answer: B
Explanation: Values near +1 indicate well‑matched points, near 0 ambiguous, and negative values indicate possible mis‑assignment.

Question 16. TF‑IDF is used primarily in:
A) Time‑series forecasting
B) Text mining to weight term importance
C) Image classification
D) Association rule mining
Answer: B
Explanation: TF‑IDF combines term frequency with inverse document frequency to highlight discriminative words.

Question 17. In sentiment analysis, a lexicon‑based approach relies on:
A) Deep neural networks
B) Pre‑defined lists of positive and negative words
C) Clustering of documents
D) Decision trees
Answer: B
Explanation: Lexicon‑based methods assign sentiment scores based on known word polarity.

Question 18. Clickstream analysis is a type of:

D) Softmax
Answer: C
Explanation: ReLU introduces non‑linearity, accelerates training, and mitigates vanishing gradients.

Question 22. Backpropagation updates network weights based on:
A) Random search
B) Gradient of the loss with respect to weights
C) Decision tree splits
D) Association rule confidence
Answer: B
Explanation: Backprop computes the gradient of the error and adjusts weights via gradient descent.

Question 23. Convolutional Neural Networks (CNNs) are especially effective for:
A) Sequential text generation
B) Image classification and pattern recognition
C) Market basket analysis
D) Clustering categorical data
Answer: B
Explanation: CNNs exploit spatial hierarchies via convolutional filters, excelling in visual tasks.

Question 24. Recurrent Neural Networks (RNNs) are best suited for:
A) Static tabular data
B) Temporal or sequential data
C) Spatial raster data
D) Transactional association rules
Answer: B
Explanation: RNNs maintain hidden state across time steps, capturing dependencies in sequences.

Question 25. In a confusion matrix, the term “precision” is defined as:
C) (TP + TN) / Total
D) FP / (FP + TN)
Answer: B
Explanation: Precision measures the proportion of predicted positives that are truly positive.

Question 26. The ROC curve plots:
A) Precision vs. Recall
B) True Positive Rate vs. False Positive Rate
C) Accuracy vs. Error rate
D) Support vs. Confidence
Answer: B
Explanation: ROC visualizes the trade‑off between sensitivity (TPR) and 1‑specificity (FPR).

Question 27. Which cross‑validation technique uses each individual observation, in turn, as the test set?
A) 5‑fold CV
B) Leave‑one‑out CV (LOOCV)
C) Bootstrap
D) Hold‑out
Answer: B
Explanation: LOOCV iteratively holds out a single instance for testing while training on the rest.

Question 28. Overfitting is most directly caused by:
A) Too few features
B) High bias models
C) Excessively complex models that capture noise
D) Insufficient training data only
Answer: C

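The confusion‑matrix quantities in Questions 25 and 26 can be written out directly. This is a minimal sketch with invented counts; the function names are hypothetical, and the formulas are the standard definitions of precision, true positive rate, and false positive rate.

```python
def precision(tp, fp):
    """Proportion of predicted positives that are truly positive."""
    return tp / (tp + fp)

def true_positive_rate(tp, fn):
    """Sensitivity (recall): proportion of actual positives that were found."""
    return tp / (tp + fn)

def false_positive_rate(fp, tn):
    """1 - specificity: proportion of actual negatives flagged as positive."""
    return fp / (fp + tn)

# Hypothetical counts, invented for illustration only.
tp, fp, fn, tn = 40, 10, 20, 30

print(precision(tp, fp))            # 0.8
print(false_positive_rate(fp, tn))  # 0.25
```

Each point on a ROC curve is one (FPR, TPR) pair obtained at a particular classification threshold.
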
B. Denormalized into a single flat table
C. Not linked to fact tables
D. Used only for OLTP systems
Answer: A
Explanation: Snowflake schemas normalize dimensions to reduce redundancy.

Question 33. The “support” of an itemset in market‑basket analysis is defined as:
A. The proportion of transactions containing the itemset
B. The confidence of the rule containing the itemset
C. The lift value of the rule
D. The number of items in the set
Answer: A
Explanation: Support = count(itemset) / total transactions.

Question 34. Which pruning strategy is used in the Apriori algorithm?
A. Remove itemsets whose subsets are infrequent
B. Remove clusters with low silhouette scores
C. Delete decision tree branches with low information gain
D. Drop neurons with small weights
Answer: A
Explanation: Apriori eliminates candidate itemsets if any of their subsets are not frequent.

Question 35. In C4.5, the attribute selection measure is:
A. Gini index
B. Information gain ratio
C. Euclidean distance
D. Lift
Answer: B
Explanation: C4.5 uses gain ratio to compensate for bias toward many‑valued attributes.

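The pruning rule from Question 34 can be sketched in a few lines. This is an illustration under stated assumptions, not a full Apriori implementation: `prune_candidates` is a hypothetical helper, and the frequent 2‑itemsets below are invented.

```python
from itertools import combinations

def prune_candidates(candidates, frequent_prev):
    """Keep only k-itemset candidates whose (k-1)-subsets are all frequent."""
    kept = []
    for cand in candidates:
        k = len(cand)
        subsets = (frozenset(sub) for sub in combinations(cand, k - 1))
        if all(sub in frequent_prev for sub in subsets):
            kept.append(cand)
    return kept

# Hypothetical frequent 2-itemsets, invented for illustration only.
frequent_2 = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]}
candidates_3 = [frozenset("abc"), frozenset("abd")]

survivors = prune_candidates(candidates_3, frequent_2)
# {a, b, d} is pruned because its subset {a, d} is not frequent.
print(survivors == [frozenset("abc")])  # True
```

Pruning before counting means the algorithm never scans the transactions for a candidate that cannot be frequent.
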
Question 36. A “data mart” differs from an enterprise data warehouse primarily in:
A. Size and subject‑area focus
B. Use of star schema only
C. Lack of ETL processes
D. Real‑time update capability
Answer: A
Explanation: Data marts are smaller, department‑specific subsets of the enterprise warehouse.

Question 37. Which of the following is a density‑based clustering algorithm that can discover clusters of arbitrary shape?
A. k‑Means
B. DBSCAN
C. Agglomerative clustering
D. Spectral clustering
Answer: B
Explanation: DBSCAN groups points based on density, handling non‑convex clusters.

Question 38. In PCA, the first principal component captures:
A. The smallest variance in the data
B. The direction of maximum variance
C. The median variance
D. The categorical relationships
Answer: B
Explanation: PCA orders components by decreasing variance; the first captures the most.

Question 39. Attribute subset selection aims to:
A. Increase the number of attributes
B. Reduce dimensionality while retaining predictive power

Question 43. The term “concept drift” in streaming data mining refers to:
A. Changes in hardware performance
B. Evolution of the underlying data distribution over time
C. Increase in dataset size only
D. Model overfitting
Answer: B
Explanation: Concept drift occurs when the statistical properties of the target variable shift, requiring model updates.

Question 44. In fraud detection for financial services, which metric is often prioritized?
A. Accuracy
B. Recall (true positive rate) for fraudulent cases
C. Specificity
D. F1‑Score for legitimate transactions
Answer: B
Explanation: Missing a fraud (false negative) is costlier; high recall for fraud cases is essential.

Question 45. Recommendation engines that use “collaborative filtering” rely on:
A. Content attributes of items
B. User‑item interaction patterns
C. Geographic location data
D. Time‑series forecasting only
Answer: B
Explanation: Collaborative filtering predicts preferences based on similar users’ behavior.

Question 46. In healthcare, genomic data mining often employs:
A. Association rules only
B. Sequence alignment and pattern discovery techniques
C. Simple linear regression
D. K‑Nearest Neighbor for image classification only
Answer: B
Explanation: Genomic analysis searches for motifs, mutations, and expression patterns within DNA sequences.

Question 47. The “bias‑variance trade‑off” describes the relationship between:
A. Model complexity and training time
B. Training error and testing error
C. Systematic error (bias) and sensitivity to data fluctuations (variance)
D. Number of features and number of instances
Answer: C
Explanation: Reducing bias often increases variance and vice versa; optimal models balance both.

Question 48. In an ERP integration scenario, mined insights are typically delivered through:
A. Direct SQL queries only
B. Business intelligence dashboards and APIs
C. Manual printed reports
D. Raw data files only
Answer: B
Explanation: APIs and dashboards embed predictive insights into ERP workflows for decision makers.

Question 49. Which visualization is most appropriate for showing the distribution of a single numeric variable?
A. Scatter plot
B. Histogram
C. Bar chart of categories
D. Sankey diagram
Answer: B
Explanation: Histograms display frequency of numeric intervals, revealing shape and spread.

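The binning behind the histogram in Question 49 can be sketched without any plotting library. This assumes equal‑width bins; `histogram_counts` is a hypothetical helper and the sample values are invented.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    if width == 0:
        # All values are identical: put everything in the first bin.
        counts[0] = len(values)
        return counts
    for v in values:
        # The maximum value belongs to the last bin, not one past it.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical sample, invented for illustration only.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5]
print(histogram_counts(sample, 4))  # [1, 2, 3, 3]
```

The resulting counts are the bar heights; the shape and spread of the variable can be read from how they rise and fall across bins.
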
C. Decision tree induction
D. Rule‑based classifiers
Answer: B
Explanation: In high‑dimensional spaces, distances become less discriminative, degrading performance.

Question 54. Which of the following is a model‑based clustering method?
A. k‑Means
B. Gaussian Mixture Models (GMM)
C. DBSCAN
D. Hierarchical agglomerative clustering
Answer: B
Explanation: GMM assumes data are generated from a mixture of Gaussian distributions and estimates parameters.

Question 55. In text mining, “stemming” is performed to:
A. Remove stop words
B. Reduce words to their root form (e.g., “running” → “run”)
C. Convert text to uppercase
D. Encode words as binary vectors
Answer: B
Explanation: Stemming normalizes word variants, reducing dimensionality.

Question 56. The “support vector” in SVM refers to:
A. Any data point in the training set
B. Data points that lie closest to the decision boundary and define the margin
C. The weight vector of the model
D. The bias term only
Answer: B
Explanation: Support vectors are critical training instances that influence the optimal hyperplane.

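The stemming idea from Question 55 can be illustrated with a deliberately crude suffix stripper. This is a toy sketch, not the Porter algorithm or any library's stemmer: `naive_stem` is a hypothetical function, it handles only three suffixes, and it will over‑stem some words.

```python
def naive_stem(word):
    """Strip one common suffix; a toy rule set for illustration only."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Require at least three characters to remain after stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

for w in ["running", "jumped", "cats", "run"]:
    print(w, "->", naive_stem(w))
```

Mapping "running", "runs", and "run" to one token is what shrinks the vocabulary, and with it the dimensionality of a bag‑of‑words representation.
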
Question 57. A “pivot table” in BI tools is primarily used for:
A. Data cleaning
B. Summarizing and aggregating data across dimensions
C. Training machine‑learning models
D. Real‑time streaming ingestion
Answer: B
Explanation: Pivot tables rearrange data to compute aggregates like sums or averages across categories.

Question 58. Which evaluation metric is most appropriate for imbalanced binary classification where the positive class is rare?
A. Accuracy
B. Precision‑Recall curve (or AUC‑PR)
C. Mean Squared Error
D. R‑squared
Answer: B
Explanation: PR curves focus on performance for the minority class, avoiding misleading accuracy.

Question 59. In a decision tree built with ID3, the attribute selected for splitting is the one with the highest:
A. Gini impurity reduction
B. Information gain
C. Euclidean distance
D. Correlation coefficient
Answer: B
Explanation: ID3 uses information gain (based on entropy) to choose the best attribute.

Question 60. Which of the following best describes “concept hierarchy” in data warehousing?
A. A sequence of time‑stamped records
B. A multi‑level representation of data (e.g., day → month → year) used for roll‑up/drill‑down

Question 64. Which of the following is a primary advantage of using a “data lake” over a traditional data warehouse?
A. Strict schema enforcement before loading
B. Ability to store raw, unstructured data at scale
C. Faster OLAP query performance
D. No need for ETL processes
Answer: B
Explanation: Data lakes accept raw data in varied formats, supporting flexible analytics later.

Question 65. In a logistic regression model, the output is interpreted as:
A. A continuous numeric value
B. The probability of the positive class (between 0 and 1)
C. A categorical label directly
D. The distance to the decision boundary
Answer: B
Explanation: The logistic function maps a linear combination of the inputs to a probability.

Question 66. Which of the following is the most suitable algorithm for discovering frequent sequential patterns in clickstream data?
A. Apriori
B. PrefixSpan
C. k‑Means
D. Naïve Bayes
Answer: B
Explanation: PrefixSpan efficiently mines sequential patterns without candidate generation.

Question 67. In the context of model deployment, “A/B testing” is used to:
A. Compare two algorithms on the same dataset offline
B. Evaluate the impact of a new model on live traffic against the existing model
C. Split data into training and testing sets
D. Perform cross‑validation
Answer: B
Explanation: A/B testing runs both versions in production to measure performance differences.

Question 68. Which type of neural network architecture is specifically designed to handle grid‑like data such as images?
A. Recurrent Neural Network (RNN)
B. Convolutional Neural Network (CNN)
C. Feed‑forward Perceptron only
D. Autoencoder for tabular data only
Answer: B
Explanation: CNNs apply convolutional filters across spatial dimensions, capturing local patterns.

Question 69. In a multi‑class classification problem, the “one‑vs‑rest” strategy involves:
A. Building a single model that predicts all classes simultaneously
B. Training one binary classifier per class, treating it as positive vs. all others
C. Using clustering to define classes
D. Applying regression techniques instead
Answer: B
Explanation: One‑vs‑rest creates separate binary models for each class.

Question 70. Which of the following best describes “feature engineering”?
A. Automatically selecting the best algorithm
B. Creating, transforming, or selecting variables to improve model performance
C. Deploying a model into production
D. Visualizing model results
Answer: B
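
The one‑vs‑rest strategy from Question 69 comes down to relabeling the data once per class and then picking the class whose binary model is most confident. The sketch below shows only those two steps and assumes the binary classifiers themselves exist elsewhere; the function names, labels, and scores are all hypothetical.

```python
def one_vs_rest_targets(labels):
    """Build one binary target vector per class: 1 for that class, 0 for the rest."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}

def predict_class(scores_by_class):
    """Pick the class whose binary classifier gave the highest score."""
    return max(scores_by_class, key=scores_by_class.get)

# Hypothetical labels, invented for illustration only.
labels = ["cat", "dog", "bird", "dog"]
targets = one_vs_rest_targets(labels)
print(targets["dog"])  # [0, 1, 0, 1]

# Hypothetical per-class scores for one new example.
scores = {"bird": 0.2, "cat": 0.7, "dog": 0.4}
print(predict_class(scores))  # cat
```

With three classes this yields three binary training problems, one per entry in `targets`.
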