A set of practice questions and solutions for the BSAN (business analytics) final exam. The questions cover business intelligence, data preprocessing, data visualization, regression analysis, data mining techniques, and decision-making. Together they review the key concepts and methods in business analytics and test the understanding of fundamental principles, the ability to apply analytical techniques, and the interpretation of results, so students can see the types of questions they may encounter in the final exam.
T or F: Decision support systems are computer-based support systems that integrate individuals' expertise and computer capabilities, and they have precise definitions agreed to by practitioners. - correct answer: False
Business Intelligence (BI) - correct answer: an umbrella term that combines architectures, databases, analytical tools, applications, and methodologies
T or F: Data is a collection of observations, experiments, and experiences that do not necessarily represent absolute facts that are universally true. - correct answer: True
Descriptive Analytics - correct answer: helps managers understand current events in the organization, including causes, trends, and patterns.
What type of analytics seeks to recognize what is going on as well as the likely forecast and make decisions to achieve the best performance possible? - correct answer: Prescriptive
Which of the following is/are predictive analytics method(s)? A) Boxplot B) Text analysis C) Simulation D) Regression analysis E) Clustering. Options: B, D and E; B, C and E; D and E. - correct answer: B, D and E
Using characteristics of first-year undergraduate students, such as age, gender, major, location, and workout/sports activities, if we developed a model to forecast which students are at risk of dropping out after the first year of college, decided which students to reach out to, and offered them support services to reduce their risk of dropping out, what kind of analytics application would this work represent? - correct answer: Prescriptive analytics
Which chart type below would be most helpful to show the comparison between the worldwide turnover rate and the tech sector turnover rate? Options: Line chart; Histogram; Bar chart; Scatterplot. - correct answer: Bar chart
Which chart type below would be most helpful to show the relative proportions of turnover rate of different categories (e.g., computer games, Internet, computer software and other) within the tech sector that drive tech turnover the most? Options: Histogram; Pie chart; Bar chart; Scatterplot. - correct answer: Pie chart
T or F: Original (raw) data is usually collected from multiple data sources including various formats, and it is readily usable by analytics tools and algorithms. - correct answer: False
T or F: During data transformation, depending on the context and purpose of preprocessing, the data can be rescaled to a fixed range, and numeric variables can be converted to categorical variables. - correct answer: True
T or F: Data reduction can be applied to rows (observations) and/or columns (variables) in a given dataset. - correct answer: True
T or F: In the data preprocessing step that reduces the dimension of data prior to analysis, sampling the rows is more complex than selecting the columns (variables). - correct answer: False
T or F: The choice of a visualization method that meets the presentation requirements for given data depends on the data types available, the purpose of the visual, and the context. - correct answer: True
Which of the below is not a data preprocessing step? Options: data consolidation; data transformation; data separation; data reduction. - correct answer: data separation
Which of the below is a method to deal with filling out the missing values in data? Options: data cleaning; data reduction; data smoothing; data imputation. - correct answer: data imputation
Which of the below statement(s) is/are correct?
A: An important data transformation subtask is to select the relevant data using domain expert input, i.e., decide which sources and data to collect.
B: When merging two data source tables A and B, using the full outer join method eliminates all rows from the resulting table that do not have corresponding rows in both source tables A and B.
C: For numerical variables, normalizing the observed values between two values, such as 0 and 1, allows us to rescale the values and compare variables with different means and/or standard deviations on a single scale.
D: Identifying and reducing noise in the data is a subtask of data reduction.
- correct answer: C
When analyzing the original data of household income of a selected population, analysts notice that 5% of observations are missing and entered in the dataset as N/A (not available). Further, they notice that there are a few extremely low household income values. Which of the following method(s) would be well-suited to prepare the data before conducting descriptive analysis, such as calculating descriptive statistics and creating a histogram of household income?
A: Use the original dataset to avoid introducing additional noise to data prior to analysis
B: Identify the outliers in data with statistical techniques and remove the extremely low income values
C: Identify the outliers in data with statistical techniques and replace the extremely low values using the mean of the income values to smooth the values
D: Fill in missing values (imputations) with most appropriate values using zeros to indicate that these income valu
- correct answer: B and C
Which of the below statement(s) is/are correct?
A: Information dashboards provide interactive visual displays of important information so that the level of granularity of key insights can be modified by drilling in or moving out for more or less exploration.
B: Visual analytics combines data visualization and different analytics methods such as descriptive, predictive and prescriptive analytics.
C: Interactive information dashboards provide key insights as static information that focus on better understanding of what happened.
- correct answer: A and B
Which of the following data preprocessing activity/activities that Mia conducts would fall under data transformation?
A: Identify and replace extremely high and low selling price values using appropriate imputation methods
B: Convert number of bikes sold per month (numeric) into discrete categories using frequency-based bins
C: Filter the data to ensure that only key performance and price features needed for the analysis are included in the data
D: Reduce the range of values of quarterly market share (numeric) data to a standard range (e.g., 0 to 1 or -1 to +1) by using normalization or scaling techniques
E: Oversample the less represented financial performance measurements
- correct answer: B and D
Which chart should Mia use to visualize the relative proportion of market share of Peloton in 2020 compared to its competitors Nordic Track, Myx Fitness, and Echelon? - correct answer: Pie chart
Which chart should Mia use to visualize the number of new members joining the Peloton customer community every month from 2012 to 2020? - correct answer: Line chart
Which of the following data preprocessing activities that Mia conducts is not associated with data cleaning? - correct answer: Derive a new variable representing total time of class material from existing variables
Mia decides to use imputation methods as part of the data preprocessing. What is the main purpose of imputation methods? - correct answer: Fill in missing values with most appropriate values
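The preprocessing subtasks named in the answers above (rescaling to a fixed range, frequency-based binning, and imputation) can be illustrated with a minimal Python sketch. All sample values and function names here are hypothetical and are not taken from the exam; mean imputation is only one possible choice of "most appropriate value".

```python
from statistics import mean

def min_max_scale(values, lo=0.0, hi=1.0):
    """Rescale numeric values linearly into the range [lo, hi]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # All values identical: map everything to the lower bound.
        return [lo for _ in values]
    return [lo + (v - v_min) * (hi - lo) / (v_max - v_min) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index so that bins hold roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical sample data (assumptions for illustration only).
market_share_pct = [10, 25, 40, 55]
bikes_sold = [120, 300, 90, 450, 210, 380]
incomes = [52000, None, 60000, 50000]

scaled = min_max_scale(market_share_pct)      # transformation: normalization
binned = equal_frequency_bins(bikes_sold, 3)  # transformation: numeric to categorical
filled = impute_mean(incomes)                 # cleaning: fill in missing values
```

Scaling and binning change the representation of values that are already present, which is why they count as data transformation, while imputation repairs missing entries and belongs to data cleaning.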
T or F: "Linear regression models represent the mathematical relationship between one or more dependent variables to explain or predict a binary (i.e., a variable that takes values 0=no and 1=yes) independent variable." - correct answer: False
T or F: "Linear regression analysis can be used to predict an unknown value of a dependent variable using the values of a set of numeric and/or categorical independent variables." - correct answer: True
Comparing two regression models (Model 1 and Model 2) developed using the same dataset, assume Model 1 has an R-squared of 0.58 and Model 2 has an R-squared of 0.79. Which of the following statement(s) is/are correct?
A: Model 2 describes 79% of the variation in the given data
B: Comparing both models and how well they explain the variation in the given data, Model 1 is a better fit compared to Model 2
C: The independent variables used in Model 1 do not capture 42% of the variation in the given data
- correct answer: A and C
T or F: "Using the correlation between size and selling price, we can predict the selling price of a new house (that is not included in this dataset) if we know the size of that new house." - correct answer: False
T or F: "If the correlation between size and selling price of a house is 0.85, and we develop a simple regression model using size as the independent and selling price as the dependent variable, the slope coefficient associated with size in the regression equation would have a positive sign." - correct answer: True
Assume we develop a regression model to predict the final grade of a student using the following variables: midterm grade, time spent studying for the final exam, number of other classes the student is taking the same term, whether the student took a similar class before (yes or no) and whether the student is female (1=female or 0=male). Which of the following statement(s) is/are correct? - correct answer: This model is a multiple linear regression model
Using Figure 1, test the hypothesis that students with a higher high school GPA percentile have a higher SAT score compared to students with a lower high school GPA percentile. In other words, we want to test whether, if we increase the high school GPA percentile of a student by 1%, their SAT score will also increase. Which of the following methods would help to test this hypothesis? - correct answer: Simple linear regression with high school percentile as the independent and SAT as the dependent variable
Assume we fit a regression line to the scatterplot in Figure 1 from Question 1. Which of the following statement(s) is/are correct?
A: The intercept of the line would represent the high school GPA percentile of a student given his/her/their SAT score is 0.
B: The intercept of the line would represent the SAT score of a student given his/her/their high school GPA percentile is 0.
C: The slope of the regression line would represent how much the SAT score changes if the high school GPA percentile changes by 1%
D: The slope of the regression line will be positive
- correct answer: B, C and D
Using the dataset of 178 students from Question 1, your team develops a regression model to predict the Combined score of a student (i.e., the score that the school uses to rank applicants) using HSPercentile (i.e., high school GPA percentile, which takes values from 0 to 1 where 1 represents the 100th percentile, meaning maximum), Gender (where Female=0 and Male=1) and the student's SAT score. The regression equation is given by:
CombinedScore = 118.95 + 1.91*HSPercentile - 1.74*Gender + 0.06*SAT
This model is a ______________ regression model. - correct answer: Multiple Linear
Assume there are two students, Neal and Jimmy, with the same high school GPA percentile and gender, and Neal's SAT score is one point higher than Jimmy's. Using the regression model shown in Question 3, Neal's predicted combined score would be _______. - correct answer: 0.06 higher than Jimmy's combined score
A regression model is developed to predict which students will be retained in second year (i.e., still enrolled in second year). Using various characteristics of the 178 students in the given dataset, the variable we want to predict is "SecondFallRegistered", which is either yes (=1) if the student is still enrolled in second year (measure of retention), or no (=0) if the student dropped out between the beginning of first year and second year of college. In this case, a ______________ regression model would be the best suited model to predict the variable SecondFallRegistered. - correct answer: Logistic
Given the model predicted that a student would be retained, __________ is a measure that quantifies the ratio of the number of students who actually were retained compared to all students that were predicted to be retained. The value of this measure in this example is ___________. - correct answer: Precision, 2/
Given a student was retained, the number of times the model predicted a student's retention correctly is called ____________. The value of this measure in this example is ___________. - correct answer: Recall, 2/
T or F: "The relational data in a data warehouse are modified and analyzed using Online Analytical Processing (OLAP) tools. Commonly used OLAP tools are slice, dice, drill up and down, and pivot." - correct answer: True
When querying a dimensional database, a user goes from summarized data (e.g., quarters) to its underlying details (e.g., months). The OLAP function that serves this purpose is: - correct answer: Drill Down
When querying a dimensional database, a user transforms the data coming from rows of a table into data grouped on several columns. The OLAP function that serves this purpose is: - correct answer: Pivot
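The Neal and Jimmy answer follows directly from the SAT coefficient in the CombinedScore equation, and the "42% not captured" statement from the R-squared of Model 1. A short Python check, as a sketch: the coefficients and R-squared come from the questions, while the specific percentile, gender, and SAT inputs are hypothetical.

```python
def predict_combined_score(hs_percentile, gender, sat):
    """Evaluate the regression equation quoted in the question.

    gender is coded Female=0, Male=1; hs_percentile ranges from 0 to 1.
    """
    return 118.95 + 1.91 * hs_percentile - 1.74 * gender + 0.06 * sat

# Hypothetical inputs: same percentile and gender, SAT differs by one point.
jimmy = predict_combined_score(hs_percentile=0.80, gender=1, sat=1200)
neal = predict_combined_score(hs_percentile=0.80, gender=1, sat=1201)
difference = neal - jimmy  # equals the SAT coefficient, up to float rounding

# Share of variation NOT explained by Model 1 (R-squared = 0.58).
unexplained_model1 = 1 - 0.58
```

Because the other two inputs are held equal, their terms cancel in the subtraction and only the SAT coefficient remains, whatever values are chosen for percentile and gender.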
________ is an important concept to consider when developing a data warehouse because data warehouses can grow quickly and issues can arise regarding the amount of data, e.g., the pace at which the data warehouse is expected to grow and the complexity of user queries. - correct answer: Scalability
"One of the characteristics of a data warehouse is that it is non-volatile. That means, _______________________________________." - correct answer: After data are entered into a data warehouse, previous data is not erased when new data is added to it
T or F: "Classification learns a set of information on characteristics of the previously labeled items, objects, or events to place new instances (with unknown labels) into their respective groups." - correct answer: True
In data mining, finding an affinity of two products to be commonly purchased together is known as - correct answer: Association rule mining
T or F: "In Association Rule Mining, confidence is a metric that represents the probability of observing the items A and B together in a given dataset, P(A&B)." - correct answer: False
Consider the item set in Figure 1 (with 6 items: eggs, beer, juice which is represented by the red glass, milk, diapers, and bread) in a dataset with 5 transactions where each row in the dataset is a shopping purchase transaction. For example, transaction 1 is the first row in the dataset and includes the purchase of bread and milk, transaction 2 is the second row and includes the purchase of bread, diapers, beer, and eggs, etc.
Figure 1. Itemset and a dataset derived from shopping data.
Support for Diapers and Beer being purchased together, i.e., P(X&Y) where X = Diapers and Y = Beer, is ________ - correct answer: 60%
Confidence for Milk and Juice (that is, the probability P(Y|X) where X = Milk and Y = Juice) is ___________ - correct answer: 50%
Use the following description to answer Questions 1 - 5. According to the American Academy of Pediatrics, toddlers (i.e., children who are 12 to 36 months old) need plenty of sleep during the night. However, studies show that many toddlers struggle and resist sleeping through the night. Dr. Capan has a 20-month-old toddler whose night sleep is rather difficult to predict. To analyze her toddler's sleep, she collects daily data including:
• day of the week (categorical, where the categories are: Monday,..., Sunday),
• length of mid-day nap (numerical, in minutes),
• type of snacks he had before going to bed (categorical, where the categories are: fruit, yoghurt, crackers),
• length of night sleep (numerical, in hours),
• night sleep category (binary, takes values 0=short or 1=long, where short means night sleep < 12 hours and long means night sleep >= 12 hours)
Which data mining method would be best suited to predict the length - correct answer: Linear Regression
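The support and confidence answers above can be reproduced with a small Python sketch. Only transactions 1 and 2 are described in the question text; transactions 3 to 5 below are hypothetical fill-ins, chosen so that the results agree with the stated answers (60% and 50%). They are not the contents of the original Figure 1.

```python
# Transactions 1-2 follow the question text; 3-5 are ASSUMED for illustration.
transactions = [
    {"bread", "milk"},                      # transaction 1 (from the text)
    {"bread", "diapers", "beer", "eggs"},   # transaction 2 (from the text)
    {"milk", "diapers", "beer", "juice"},   # hypothetical
    {"bread", "milk", "diapers", "beer"},   # hypothetical
    {"bread", "milk", "diapers", "juice"},  # hypothetical
]

def count_containing(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset):
    """P(X&Y): share of all transactions containing the whole itemset."""
    return count_containing(itemset) / len(transactions)

def confidence(x, y):
    """P(Y|X): among transactions containing X, the share also containing Y."""
    return count_containing(x | y) / count_containing(x)

support_diapers_beer = support({"diapers", "beer"})       # 3 of 5 transactions
confidence_milk_juice = confidence({"milk"}, {"juice"})   # 2 of 4 milk baskets
```

The sketch also shows why the earlier true/false statement is false: P(A&B) is the support of the pair, whereas confidence divides that joint count by the count of the antecedent alone.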
Which data mining method would be best suited to predict the length of night sleep (numerical) using the previous nights' length of night sleep in the past 60 days? - correct answer: Time Series
Which data mining method would be best suited to find out which days are similar to each other with regard to length of mid-day nap and length of night sleep? - correct answer: Clustering
Which data mining method would be best suited to find out which days (categorical) and night sleep categories (binary) are frequently observed together? - correct answer: Association Rule Analysis
Which data mining method would be best suited to predict the night sleep category (binary) using the variables: day of the week, length of mid-day nap, and type of snacks he had before going to bed? - correct answer: Logistic Regression
Which of the following is a segmentation model that classifies the items in a dataset based on pairwise distances between these items until every observation is linked into one large group? - correct answer: Hierarchical clustering
Which of the following is not a linkage criterion used in clustering models? - correct answer: K-fold linkage
Assume there are 4 students (Sam, Joe, Max and Hannah), and we want to develop a hierarchical clustering model to group the students into clusters. The input variables used in this clustering model to calculate the distance values are: Midterm exam grade (measured as grade points on a scale from 0 to 100) and Final exam grade (measured as grade points on a scale from 0 to 100). Euclidean distance values (measured as grade points) between all pairs of students are given in Table 1 below. For example, the Euclidean distance between Sam and Max is 11 points.
T or F: "The input variables used in this model are not on the same scale, and this makes comparing the distance between students difficult. We need to convert the input variables to be on a similar scale by standardizing." - correct answer: False
T or F: "Based on the distance values in Table 1, the first cluster in a hierarchical clustering model will include {Joe and Hannah}." - correct answer: True
Using the single linkage criterion, after creating the first cluster, the next step would be: - correct answer: Add Sam to the cluster {Joe and Hannah}
Using the complete linkage criterion, after creating the first cluster, the next step would be: - correct answer: Create a cluster that includes {Sam and Max}
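The difference between single and complete linkage in the student example can be traced with a minimal Python sketch. Table 1 is not reproduced in this document, so every distance below except Sam to Max (11 points, stated in the question) is a hypothetical value, picked only so that the merges match the stated answers.

```python
from itertools import combinations

# Pairwise distances. Only Sam-Max = 11 comes from the question text;
# all other values are ASSUMED for illustration.
dist = {
    frozenset({"Joe", "Hannah"}): 5,
    frozenset({"Sam", "Max"}): 11,
    frozenset({"Sam", "Joe"}): 8,
    frozenset({"Sam", "Hannah"}): 13,
    frozenset({"Max", "Joe"}): 15,
    frozenset({"Max", "Hannah"}): 14,
}

def cluster_distance(a, b, linkage):
    """Single linkage uses the closest pair, complete linkage the farthest."""
    pair_dists = [dist[frozenset({p, q})] for p in a for q in b]
    return min(pair_dists) if linkage == "single" else max(pair_dists)

def next_merge(clusters, linkage):
    """Return the pair of clusters with the smallest linkage distance."""
    return min(
        combinations(clusters, 2),
        key=lambda pair: cluster_distance(pair[0], pair[1], linkage),
    )

singletons = [frozenset({name}) for name in ("Sam", "Joe", "Max", "Hannah")]
first = next_merge(singletons, "single")  # same under either linkage

after_first = [frozenset({"Joe", "Hannah"}), frozenset({"Sam"}), frozenset({"Max"})]
single_next = next_merge(after_first, "single")
complete_next = next_merge(after_first, "complete")
```

Under these assumed distances, single linkage joins Sam to {Joe, Hannah} because Sam's nearest member (Joe) is closer than Max is to Sam, while complete linkage pairs Sam with Max because Sam's farthest member (Hannah) is more than 11 points away.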
T or F: "When developing a data mining model, we split the original data into training data and testing data in order to evaluate the model performance in a dataset that was not used to develop the model." - correct answer: True
In evaluating a two-class classification model, the accuracy is __________________. - correct answer: the number of correctly classified positives plus correctly classified negatives, divided by the sum of all positive (true and false) and negative (true and false) counts.
In ________, the complete data set is randomly split into mutually exclusive subsets and tested multiple times on each left-out subset, using the others as a training set. - correct answer: k-fold cross-validation
Which of the following statements is incorrect? - correct answer: Perfect classification is represented by AUC = 0.5.
Assume there are 4 customers (Kristen, Dennis, Ben and Ross), and we want to develop a hierarchical clustering model to group the customers into clusters. The input variables used in this clustering model to calculate the distance values are: how much the individual spent for grocery shopping in winter 2020 (measured in $) and how much the individual spent for grocery shopping in summer 2020 (measured in $). Euclidean distance values (measured in $) between all pairs of customers are given in Table 1 below. For example, the Euclidean distance between Dennis and Ben is $360.
Based on the distance values in Table 1, the first cluster in a hierarchical clustering model will include: - correct answer: Dennis and Ross
Using the single linkage criterion, after creating the first cluster, the next step would be: - correct answer: Add Ben to the cluster {Dennis and Ross}
Using the complete linkage criterion, after creating the first cluster, the next step would be: - correct answer: Create a cluster that includes {Kristen and Ben} or add Kristen to the cluster {Dennis and Ross}
Table 2. Confusion matrix. Columns: True/Observed class, Positive (Patient is healthy) and Negative (Patient is sick). Rows: Predicted class, Positive (Patient is healthy) and Negative (Patient is sick). Surviving cell values: 40, 60.
QUESTION 4: What is the false positive (FP) count? - correct answer:
How many mistakes (misclassifications) did the model make? - correct answer:
T or F: "Based on the results shown in Table 2, the true positive rate is higher than the true negative rate." - correct answer: False
T or F: "In a clustering model with two numerical input variables used for clustering, if the input variables are not on the same scale, standardizing is used to convert the variables and compare them on a single scale." - correct answer: True
T or F: "Due to the potential risk of model overfitting, rather than using all available data, we split the data into training and testing data; we use the training data for model development and evaluate model performance using the testing data." - correct answer: True
In text mining, tokenizing is the process of _________________. - correct answer: breaking a text into simple units, like sentences or words
The Bag-of-Words method uses ____________ to extract features from textual data. - correct answer: Word frequencies in a text
In text mining, what is a lexicon? - correct answer: a catalog of words and scores (or categories) assigned to the words based on their meaning
Consider the following sentence: "I enjoy taking a walk in the rain." Which of the following statements is incorrect? - correct answer: After removing the stop words, the bigram method creates the vector: ["enjoy", "taking", "walk", "rain"]
T or F: "Web structure mining focuses on navigation through a website by analyzing the links in Web documents, and Web content mining is related to extraction of information from the content of Web pages using text mining." - correct answer: True
Consider the following decision problem. Lumos Company owns a yoga studio in Philadelphia. Because their current studio can only accommodate 20 people at a time, they experience challenges on days when more people show up to take classes.
Lumos realized that they can potentially generate more net profit if they expand (i.e., open another yoga studio). Right now, in March 2021, they have to decide whether to expand or stay in the current location.
• If they decide to stay in the current location, by the end of 2021 their revenue will be $120,000 with certainty and the cost (associated with business maintenance) will be $20,000 with certainty.
• If they decide to expand, the cost (associated with business maintenance plus expansion cost) will be $50,000 with certainty. In case of expansion, the demand for yoga classes could increase with probability 0.6 or decrease with probability 0.4. If they expand and demand increases
Using the Expected Monetary Value (EMV) criterion, which decision alternative should we choose? - correct answer: Indifferent between Decision alternative 1 and Decision alternative 2
A probability node of a decision tree for decision modeling represents __________________. - correct answer: a time when the result of an uncertain outcome becomes known
Let's assume we are confronted with a one-stage decision problem that has 3 decision alternatives: Decision Alternative 1, 2 and 3, where each Decision Alternative has possible outcomes and probabilities associated with possible outcomes. We use the Expected Monetary Value (EMV) and pick Decision Alternative 2. Assume we conduct sensitivity analysis for this decision model. Which of the following statements is incorrect? - correct answer: Using sensitivity analysis, the selected decision cannot change if we use the same decision criterion (e.g., EMV-maximizer)
Optimization is a _________ analytics method. - correct answer: prescriptive
T or F: "A feasible solution of a linear programming model is a solution that represents the values for all decision variables that satisfies all the constraints." - correct answer: True
Consider the following decision problem. You are preparing your backpack for a hike and need to decide how many items of different categories to pack. There are four item categories: snacks, drinks, sun protection items, and clothing items. Each snack, drink, sun protection item, and clothing item has a weight. For example, each snack weighs 60 grams, each water bottle weighs 25 grams, each sun protection item weighs 20 grams, and each clothing item weighs 400 grams. You can't carry more than 1600 grams in your backpack. You must have at least 3 snacks, at least 4 water bottles, and you can have at most 2 sunscreen items in the backpack for your hike. Your goal is to minimize the total weight of your backpack while satisfying all the constraints.
QUESTION 3: Is the following statement true or false? "This is a linear programming model where the objective function is a maximization." - correct answer: False
What are the decision variables in this model? - correct answer: Number of snacks, drinks, sun protection items, and clothing items
Which of the following is not a constraint of this model?
You must have at least 3 snacks in the backpack
You can have at most 2 sunscreen items in the backpack
The number of items in each category that you take with you in your backpack cannot be negative
Minimum weight of all items combined in the backpack must be at least 1600 grams
- correct answer: Minimum weight of all items combined in the backpack must be at least 1600 grams
T or F: "In this linear programming model, taking 5 snacks, 6 water bottles, 2 sunscreen items and 3 clothing items is a feasible solution." - correct answer: False
Consider the decision problem used in Questions 3-6. Assume there is one more constraint that you have to satisfy. The total weight of snacks in your backpack cannot exceed the total weight of water bottles. Let's define x1 = number of snacks in your backpack and x2 = number of water bottles in your backpack. Which formulation below represents this new constraint? - correct answer: 60x1 <= 25x
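The feasibility answer for the backpack model can be verified by checking a candidate plan against each constraint of the original problem statement. The following Python sketch does this; the dictionary keys and function names are hypothetical, and it treats "drinks" and "water bottles" (and "sun protection items" and "sunscreen items") as the same categories, as the question wording appears to do. The additional snack-versus-water constraint from the last question is not included.

```python
# Item weights in grams, as given in the problem statement.
WEIGHT_G = {
    "snacks": 60,
    "water_bottles": 25,
    "sunscreen_items": 20,
    "clothing_items": 400,
}
MAX_TOTAL_G = 1600  # the backpack cannot hold more than this

def total_weight(plan):
    """Objective function: total backpack weight in grams."""
    return sum(WEIGHT_G[item] * qty for item, qty in plan.items())

def violated_constraints(plan):
    """Return a list of the constraints that the plan breaks (empty if feasible)."""
    problems = []
    if any(qty < 0 for qty in plan.values()):
        problems.append("non-negativity")
    if plan["snacks"] < 3:
        problems.append("at least 3 snacks")
    if plan["water_bottles"] < 4:
        problems.append("at least 4 water bottles")
    if plan["sunscreen_items"] > 2:
        problems.append("at most 2 sunscreen items")
    if total_weight(plan) > MAX_TOTAL_G:
        problems.append("total weight at most 1600 g")
    return problems

# The plan from the true/false question.
candidate = {"snacks": 5, "water_bottles": 6, "sunscreen_items": 2, "clothing_items": 3}
candidate_weight = total_weight(candidate)      # 300 + 150 + 40 + 1200
candidate_issues = violated_constraints(candidate)

# A plan that meets only the minimum requirements, for comparison.
minimal = {"snacks": 3, "water_bottles": 4, "sunscreen_items": 0, "clothing_items": 0}
minimal_issues = violated_constraints(minimal)
```

The candidate plan satisfies every count requirement but exceeds the weight limit, which is why the statement about it being feasible is false.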