Data Analytics and Preprocessing Techniques, Exams of Advanced Education

Covers various data analytics and preprocessing techniques, including text analysis, simulation, regression analysis, and clustering. It discusses the application of these techniques to forecast student dropout risk, analyze tech sector turnover, and visualize market share data. The document also covers data transformation, reduction, and imputation methods, as well as the use of linear regression models, OLAP functions, and data mining techniques like classification and association rule mining. It offers insights into data preprocessing steps, the purpose of imputation methods, and the suitability of different data mining methods for predicting numerical and binary outcomes. It also discusses the importance of data scaling, splitting data into training and testing sets, and the use of web structure and content mining techniques.

Typology: Exams

2024/2025

Available from 10/14/2024

cate-mentor
BSAN 160 exam review
T or F: Decision support systems are computer-based support systems that integrate individuals' expertise and computer capabilities, and they have precise definitions agreed to by practitioners. - False
Business Intelligence (BI) - an umbrella term that combines architectures, databases, analytical tools, applications, and methodologies
T or F: Data is a collection of observations, experiments, and experiences that do not necessarily represent absolute facts that are universally true. - True
Descriptive Analytics - helps managers understand current events in the organization, including causes, trends, and patterns.
What type of analytics seeks to recognize what is going on as well as the likely forecast and make decisions to achieve the best performance possible? - Prescriptive
Which of the following is/are predictive analytics method(s)?
A) Boxplot B) Text analysis C) Simulation D) Regression analysis E) Clustering
B, D and E
B, C and E
D and E - B, D and E
Using characteristics of first-year undergraduate students, such as age, gender, major, location, and workout/sports activities, if we developed a model to forecast which students are at risk of dropping out after the first year of college, decided which students to reach out to, and offered them support services to reduce their risk of dropping out, what kind of analytics application would this work represent? - Prescriptive analytics
Which chart type below would be most helpful to show the comparison between the worldwide turnover rate and the tech sector turnover rate?
Line chart
Histogram
Bar chart

Scatterplot - Bar chart
Which chart type below would be most helpful to show the relative proportions of turnover rate of different categories (e.g., computer games, Internet, computer software and other) within the tech sector that drive tech turnover the most?
Histogram
Pie chart
Bar chart
Scatterplot - Pie chart
T or F: Original (raw) data is usually collected from multiple data sources in various formats, and it is readily usable by analytics tools and algorithms. - False
T or F: During data transformation, depending on the context and purpose of preprocessing, the data can be rescaled to a fixed range, and numeric variables can be converted to categorical variables. - True
T or F: Data reduction can be applied to rows (observations) and/or columns (variables) in a given dataset. - True
T or F: In the data preprocessing step to reduce the dimension of data prior to analysis, sampling the rows is more complex than selecting the columns (variables). - False
T or F: The choice of visualization method that meets the presentation requirements for given data depends on the data types available, the purpose of the visual, and the context. - True
Which of the below is not a data preprocessing step?
data consolidation
data transformation
data separation
data reduction - data separation

B: Visual analytics combines data visualization and different analytics methods such as descriptive, predictive and prescriptive analytics.
C: Interactive information dashboards provide key insights as static information that focus on better understanding of what happened. - A and B
Which of the following data preprocessing activity/activities that Mia conducts would fall under data transformation?
A: Identify and replace extremely high and low selling price values using appropriate imputation methods
B: Convert number of bikes sold per month (numeric) into discrete categories using frequency-based bins
C: Filter the data to ensure that only key performance and price features needed for the analysis are included in the data
D: Reduce the range of values of quarterly market share (numeric) data to a standard range (e.g., 0 to 1 or -1 to +1) by using normalization or scaling techniques
E: Oversample the less represented financial performance measurements - B and D
Which chart should Mia use to visualize the relative proportion of market share of Peloton in 2020 compared to its competitors Nordic Track, Myx Fitness, and Echelon? - Pie chart
Which chart should Mia use to visualize the number of new members joining the Peloton customer community every month from 2012 to 2020? - Line chart
Which of the following data preprocessing activities that Mia conducts is not associated with data cleaning? - Derive a new variable representing total time of class material from existing variables
Mia decides to use imputation methods as part of the data preprocessing. What is the main purpose of imputation methods? - Fill in missing values with the most appropriate values
T or F: "Linear regression models represent the mathematical relationship between one or more dependent variables to explain or predict a binary (i.e., a variable that takes values 0=no and 1=yes) independent variable." - False
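
The two transformation activities identified in Mia's question above (frequency-based binning and rescaling to a standard range) can be illustrated with a minimal sketch. The function names and the sample numbers are hypothetical and are not taken from the exam; this is one possible way to do it, not a prescribed method.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale numeric values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to the lower bound.
        return [new_min for _ in values]
    span = new_max - new_min
    return [new_min + (v - lo) / (hi - lo) * span for v in values]


def equal_frequency_bins(values, labels):
    """Assign each value a label so that bins hold (roughly) equal counts."""
    n, k = len(values), len(labels)
    order = sorted(range(n), key=lambda i: values[i])  # indices by ascending value
    result = [None] * n
    for rank, idx in enumerate(order):
        result[idx] = labels[rank * k // n]
    return result


# Hypothetical data, for illustration only.
market_share = [0.12, 0.30, 0.21, 0.48]
bikes_sold = [120, 340, 95, 410, 280, 150]

scaled = min_max_scale(market_share)
bins = equal_frequency_bins(bikes_sold, ["low", "medium", "high"])
print(scaled)
print(bins)
```

Passing `new_min=-1.0, new_max=1.0` gives the -1 to +1 range mentioned in option D.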

T or F: "Linear regression analysis can be used to predict an unknown value of a dependent variable using the values of a set of numeric and/or categorical independent variables." - True
Comparing two regression models (Model 1 and Model 2) developed using the same dataset, assume Model 1 has an R-squared of 0.58 and Model 2 has an R-squared of 0.79. Which of the following statement(s) is/are correct?
A: Model 2 describes 79% of the variation in the given data
B: Comparing both models and how well they explain the variation in the given data, Model 1 is a better fit compared to Model 2
C: The independent variables used in Model 1 do not capture 42% of the variation in the given data - A and C
T or F: "Using the correlation between size and selling price, we can predict the selling price of a new house (that is not included in this dataset) if we know the size of that new house." - False
T or F: "If the correlation between size and selling price of a house is 0.85, and we develop a simple regression model using size as independent and selling price as the dependent variable, the slope coefficient associated with size in the regression equation would have a positive sign." - True
Assume we develop a regression model to predict the final grade of a student using the following variables: midterm grade, time spent studying for the final exam, number of other classes the student is taking the same term, whether the student took a similar class before (yes or no) and whether the student is female (1=female or 0=male). Which of the following statement(s) is/are correct? - This model is a multiple linear regression model
Using Figure 1, test the hypothesis that students with higher high school GPA percentile have a higher SAT score compared to students with a lower high school GPA percentile. In other words, we want to test if we increase the high school GPA percentile of a student by 1% then their SAT score will also increase. Which of the following methods would help to test this hypothesis? - Simple linear regression with high school percentile as the independent and SAT as the dependent variable
Assume we fit a regression line to the scatterplot in Figure 1 from Question 1. Which of the following statement(s) is/are correct?
A: The intercept of the line would represent the high school GPA percentile of a student given his/her/their SAT score is 0.

When querying a dimensional database, a user goes from summarized data (e.g., quarters) to its underlying details (e.g., months). The OLAP function that serves this purpose is: - Drill down
When querying a dimensional database, a user transforms the data coming from rows of a table into data grouped on several columns. The OLAP function that serves this purpose is: - Pivot
________ is an important concept to consider when developing a data warehouse because data warehouses can grow quickly and issues can arise regarding the amount of data, e.g., the pace at which the data warehouse is expected to grow and the complexity of user queries. - Scalability
"One of the characteristics of a data warehouse is that it is non-volatile. That means, ________." - After data are entered into a data warehouse, previous data is not erased when new data is added to it
T or F: "Classification learns a set of information on characteristics of the previously labeled items, objects, or events to place new instances (with unknown labels) into their respective groups." - True
In data mining, finding an affinity of two products to be commonly purchased together is known as - Association rule mining
T or F: "In Association Rule Mining, confidence is a metric that represents the probability of observing the items A and B together in a given dataset, P(A&B)." - False
Consider the item set in Figure 1 (with 6 items: eggs, beer, juice which is represented by the red glass, milk, diapers, and bread) in a dataset with 5 transactions where each row in the dataset is a shopping purchase transaction. For example, transaction 1 is the first row in the dataset and includes the purchase of bread and milk, transaction 2 is the second row and includes the purchase of bread, diapers, beer, and eggs, etc.
Figure 1. Itemset and a dataset derived from shopping data.
Support for Diapers and Beer being purchased together, i.e., P(X&Y) where X = Diapers and Y = Beer, is ________ - 60%
Confidence for Milk and Juice (that is, the probability P(Y|X) where X = Milk and Y = Juice) is ________ - 50%
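
Support and confidence as used in the two questions above can be computed with a short sketch. Figure 1 is not reproduced in this text, so only transactions 1 and 2 below come from the description; transactions 3 to 5 are assumed values chosen to be consistent with the stated answers, and the function names are illustrative.

```python
transactions = [
    {"bread", "milk"},                      # transaction 1 (given in the text)
    {"bread", "diapers", "beer", "eggs"},   # transaction 2 (given in the text)
    {"milk", "diapers", "beer", "juice"},   # assumed, Figure 1 not available
    {"bread", "milk", "diapers", "beer"},   # assumed, Figure 1 not available
    {"bread", "milk", "diapers", "juice"},  # assumed, Figure 1 not available
]


def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset: P(X & Y)."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(x, y, transactions):
    """Conditional probability P(Y | X) = support(X and Y) / support(X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


print(support({"diapers", "beer"}, transactions))
print(confidence({"milk"}, {"juice"}, transactions))
```

The distinction tested in the T or F question is visible here: P(A&B) is what `support` returns, while `confidence` divides that joint support by the support of the antecedent.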

Use the following description to answer Questions 1 - 5.
According to the American Academy of Pediatrics, toddlers (i.e., children who are 12 to 36 months old) need plenty of sleep during the night. However, studies show that many toddlers struggle and resist sleeping through the night. Dr. Capan has a 20-month-old toddler whose night sleep is rather difficult to predict. To analyze her toddler's sleep, she collects daily data including:
• day of the week (categorical, where the categories are: Monday,..., Sunday),
• length of mid-day nap (numerical, in minutes),
• type of snacks he had before going to bed (categorical, where the categories are: fruit, yoghurt, crackers),
• length of night sleep (numerical, in hours),
• night sleep category (binary, takes values 0=short or 1=long --> short means night sleep < 12 hours, and long means night sleep >= 12 hours)
Which data mining method would be best suited to predict the length - Linear regression
Which data mining method would be best suited to predict the length of night sleep (numerical) using the previous nights' length of night sleep in the past 60 days? - Time series
Which data mining method would be best suited to find out which days are similar to each other with regards to length of mid-day nap and length of night sleep? - Clustering
Which data mining method would be best suited to find out which days (categorical) and night sleep categories (binary) are frequently observed together? - Association rule analysis
Which data mining method would be best suited to predict the night sleep category (binary) using the variables: day of the week, length of mid-day nap, and type of snacks he had before going to bed? - Logistic regression
Which of the following is a segmentation model that classifies the items in a dataset based on pairwise distances between these items until every observation is linked into one large group? - Hierarchical clustering
Which of the following is not a linkage criterion used in clustering models? - K-fold linkage
Assume there are 4 students (Sam, Joe, Max and Hannah), and we want to develop a hierarchical clustering model to group the students into clusters. The input variables used in this clustering model to calculate the distance values are: Midterm exam grade (measured as grade points on a scale from 0 to 100) and Final exam grade (measured as grade points on a scale from 0 to 100). Euclidean distance values (measured as grade points) between all pairs of students are given in Table 1 below. For example, the Euclidean distance between Sam and Max is 11 points.

Based on the distance values in Table 1, the first cluster in a hierarchical clustering model will include: - Dennis and Ross
Using the single linkage criteria, after creating the first cluster, the next step would be: - Add Ben to the cluster {Dennis and Ross}
Using the complete linkage criteria, after creating the first cluster, the next step would be: - Create a cluster that includes {Kristen and Ben} or add Kristen to the cluster {Dennis and Ross}

Table 2. Confusion matrix

                                   True/Observed class
Predicted class                    Positive (Patient is healthy)    Negative (Patient is sick)
Positive (Patient is healthy)      80                               20
Negative (Patient is sick)         40                               60

QUESTION 4: What is the false positive (FP) count? - 20
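
The quantities asked about for Table 2 can be checked with a small sketch. It assumes the layout shown in the table (rows are predicted classes, columns are observed classes, "healthy" is the positive class); the variable names are illustrative.

```python
# Counts read from Table 2 (rows = predicted class, columns = observed class).
tp = 80  # predicted healthy, observed healthy
fp = 20  # predicted healthy, observed sick
fn = 40  # predicted sick, observed healthy
tn = 60  # predicted sick, observed sick

misclassified = fp + fn            # every prediction that disagrees with the observed class
true_positive_rate = tp / (tp + fn)  # share of observed positives predicted positive
true_negative_rate = tn / (tn + fp)  # share of observed negatives predicted negative
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(misclassified, true_positive_rate, true_negative_rate, accuracy)
```

Under these assumptions the true negative rate (0.75) exceeds the true positive rate (about 0.67), which matches the answers recorded for this table.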

How many mistakes (misclassifications) did the model make? - 60
T or F: "Based on the results shown in Table 2, the true positive rate is higher than the true negative rate." - False
T or F: "In a clustering model with two numerical input variables used for clustering, if the input variables are not on the same scale, standardizing is used to convert the variables and compare them on a single scale." - True
T or F: "Due to potential risk of model overfitting, rather than using all available data we split the data into training and testing data, we use the training data for model development and evaluate model performance using the testing data." - True
In text mining, tokenizing is the process of ________. - breaking a text into simple units, like sentences or words
The Bag-of-Words method uses ________ to extract features from textual data. - Word frequencies in a text
In text mining, what is a lexicon? - a catalog of words and scores (or categories) assigned to the words based on their meaning
Consider the following sentence: "I enjoy taking a walk in the rain." Which of the following statements is incorrect? - After removing the stop words, the bigram method creates the vector: ["enjoy", "taking", "walk", "rain"]
T or F: "Web structure mining focuses on navigation through a website by analyzing the links in Web documents, and Web content mining is related to extraction of information from the content of Web pages using text mining." - True
Consider the following decision problem. Lumos Company owns a yoga studio in Philadelphia. Because their current studio can only accommodate 20 people at a time, they experience challenges on days when more people show up to take classes. Lumos realized that they can potentially generate more net profit if they expand (i.e., open another yoga studio). Right now, in March 2021, they have to decide

Use the following description for Questions 4-6.
Imagine we have a decision problem where we are asked to choose between two decision alternatives. Decision alternative 1 can result in a payoff of 10000 with probability 0.4 or a loss of 4000 with probability 0.6. Decision alternative 2 results in a payoff of 2000 with probability 0.5 or a payoff of 1200 with probability 0.5.
If we look at the worst possible outcome for each decision alternative and choose the decision that has the best "worst outcome", which decision alternative should we choose? - Decision alternative 2
If we look at the best possible outcome for each decision alternative and choose the decision that has the best "best outcome", which decision alternative should we choose? - Decision alternative 1
Using the Expected Monetary Value (EMV) criterion, which decision alternative should we choose? - Indifferent between Decision alternative 1 and Decision alternative 2
A probability node of a decision tree for decision modeling represents ________. - a time when the result of an uncertain outcome becomes known
Let's assume we are confronted with a one-stage decision problem that has 3 decision alternatives: Decision Alternative 1, 2 and 3, where each Decision Alternative has possible outcomes and probabilities associated with possible outcomes. We use the Expected Monetary Value (EMV) and pick Decision Alternative 2. Assume we conduct sensitivity analysis for this decision model. Which of the following statements is incorrect? - Using sensitivity analysis, selected decision cannot change if we use the same decision criterion (e.g., EMV-maximizer)
Optimization is a ________ analytics method. - prescriptive
T or F: "A feasible solution of a linear programming model is a solution that represents the values for all decision variables that satisfies all the constraints." - True
Consider the following decision problem.
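
For the two-alternative payoff problem in Questions 4-6 above, the three decision criteria (best worst outcome, best best outcome, and EMV) can be worked through with a short sketch. The payoffs and probabilities come from the question text; the dictionary layout and function names are illustrative.

```python
import math

# (payoff, probability) pairs taken from the problem statement; a loss is a negative payoff.
alternatives = {
    "Decision alternative 1": [(10000, 0.4), (-4000, 0.6)],
    "Decision alternative 2": [(2000, 0.5), (1200, 0.5)],
}


def emv(outcomes):
    """Expected Monetary Value: probability-weighted sum of payoffs."""
    return sum(payoff * prob for payoff, prob in outcomes)


def worst(outcomes):
    return min(payoff for payoff, _ in outcomes)


def best(outcomes):
    return max(payoff for payoff, _ in outcomes)


maximin_choice = max(alternatives, key=lambda a: worst(alternatives[a]))
maximax_choice = max(alternatives, key=lambda a: best(alternatives[a]))
emv_values = {name: emv(outcomes) for name, outcomes in alternatives.items()}
emv_tie = math.isclose(*emv_values.values())

print(maximin_choice)
print(maximax_choice)
print(emv_values, "tie" if emv_tie else "no tie")
```

Both alternatives have an EMV of 0.4 x 10000 - 0.6 x 4000 = 1600 and 0.5 x 2000 + 0.5 x 1200 = 1600, which is why the EMV criterion leaves the decision maker indifferent.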
You are preparing your backpack for a hike and need to decide how many items of different categories to pack. There are four item categories: snacks, drinks, sun protection items, and clothing items. Each snack, drink, sun protection item, and clothing item has a weight. For example, each snack weighs 60 grams, each water bottle weighs 25 grams, each sun protection item weighs 20 grams, and each clothing item weighs 400 grams. You can't carry more than 1600 grams in your backpack. You must have at least 3 snacks, at least 4 water bottles, and you can have at most 2 sunscreen items in the backpack for your hike. Your goal is to minimize the total weight of your backpack while satisfying all the constraints.
QUESTION 3: Is the following statement true or false? "This is a linear programming model where the objective function is a maximization." - False

What are the decision variables in this model? - Number of snacks, drinks, sun protection items, and clothing items
Which of the following is not a constraint of this model?
You must have at least 3 snacks in the backpack
You can have at most 2 sunscreen items in the backpack
The number of items in each category that you take with you in your backpack cannot be negative
Minimum weight of all items combined in the backpack must be at least 1600 grams - Minimum weight of all items combined in the backpack must be at least 1600 grams
T or F: "In this linear programming model, taking 5 snacks, 6 water bottles, 2 sunscreen items and 3 clothing items is a feasible solution." - False
Consider the decision problem used in Questions 3-6. Assume there is one more constraint that you have to satisfy. The total weight of snacks in your backpack cannot exceed the total weight of water bottles. Let's define x1 = number of snacks in your backpack and x2 = number of water bottles in your backpack. Which formulation below represents this new constraint? - 60x1 <= 25x
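
The feasibility question for the backpack model can be checked by evaluating each constraint directly. The sketch below uses the weights and limits from the problem statement and covers only the original constraints, not the additional snack-versus-water constraint; the function and variable names are illustrative.

```python
# Item weights in grams and the carrying limit, from the problem statement.
WEIGHTS = {"snacks": 60, "drinks": 25, "sun": 20, "clothing": 400}
MAX_WEIGHT = 1600


def total_weight(pack):
    """Objective function: total backpack weight in grams (to be minimized)."""
    return sum(WEIGHTS[item] * count for item, count in pack.items())


def is_feasible(pack):
    """True only if the pack satisfies every constraint of the model."""
    return (
        all(count >= 0 for count in pack.values())  # non-negativity
        and pack["snacks"] >= 3                     # at least 3 snacks
        and pack["drinks"] >= 4                     # at least 4 water bottles
        and pack["sun"] <= 2                        # at most 2 sunscreen items
        and total_weight(pack) <= MAX_WEIGHT        # weight limit
    )


candidate = {"snacks": 5, "drinks": 6, "sun": 2, "clothing": 3}
print(total_weight(candidate), is_feasible(candidate))
```

The candidate from the T or F question weighs 5 x 60 + 6 x 25 + 2 x 20 + 3 x 400 = 1690 grams, which exceeds the 1600-gram limit, so it is not feasible even though it meets the item-count constraints.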