
























This document covers a wide range of topics related to data preprocessing and modeling techniques in data science and machine learning: detecting unusual values, handling categorical variables, discovering causality and variable interactions, working with Python libraries such as NumPy and Matplotlib, data manipulation with pandas, linear regression modeling, handling categorical predictors, model evaluation metrics such as RMSE and MAE, and techniques such as principal component analysis (PCA) and principal component regression (PCR).
Typology: Exams

























The marketing department of ACME Corporation needs to identify potential high-value customers for their new Kitchen Robot. These robots are expensive, so we are looking to identify customers that can afford such a machine. It's been determined that households with a net income greater than $50, USD are of interest in the marketing campaign, so you will choose a(n) ___________ algorithm to model these customers.
- classification
- regression
- affinity analysis
- recommender system
Precise Answer ✔✔classification

Reducing the number of predictors to the smallest set that will still provide accurate predictions is a concept called ____________________.
- regeneration
- parsimony
- shrinkage
- Gillette's razor
Precise Answer ✔✔parsimony

You have been given a dataset with 15 predictors and a binary outcome that denotes whether a customer has left the company (yes or no). As an absolute minimum, you'll need _____________ samples to achieve a minimally accurate prediction.
- 150
- 180
- 190
- 200
Precise Answer ✔✔180
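The answer of 180 is consistent with a common rule of thumb, which is an assumption here because the question does not say which rule it relies on: at least 6 × m × p records for a classification task with m classes and p predictors, and at least 10 × p records for a numeric outcome. A minimal sketch of that arithmetic (the function names are hypothetical):

```python
def min_records_classification(n_classes: int, n_predictors: int) -> int:
    """Rule-of-thumb minimum for a classification task: 6 * classes * predictors."""
    return 6 * n_classes * n_predictors


def min_records_numeric(n_predictors: int) -> int:
    """Rule-of-thumb minimum for a numeric outcome: 10 * predictors."""
    return 10 * n_predictors


# Binary outcome (2 classes) with 15 predictors.
print(min_records_classification(2, 15))  # 180

# Numeric outcome with 15 predictors.
print(min_records_numeric(15))  # 150
```

These are lower bounds for a first attempt, not guarantees of accuracy; noisy data or rare classes usually call for far more records.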
You have been given a dataset with 15 predictors and a numeric outcome that denotes the income that a household has obtained. As an absolute minimum, you'll need _____________ samples to achieve a minimally accurate prediction.
- 200
- 180
- 300
- 150
Precise Answer ✔✔150

The process of identifying outliers is best performed by someone with domain knowledge as opposed to someone with statistical knowledge.
- True
- False
Precise Answer ✔✔True

If you impute a missing value with its column mean, then you will ___________________.
- maximize the variability of the dataset
- overweight the variability of the dataset
- understate the variability of the dataset
- normalize the variability of the dataset
Precise Answer ✔✔understate the variability of the dataset

Standardization uses the following formula: Using the rule-of-thumb method, one can assume that all extreme values (outliers) will be greater than ____________ or less than ____________.
- 0, 1
- 1, 0
- underweight, oversample
- subsample, oversample
- overweight, underweight
Precise Answer ✔✔overweight, underweight

You are given a dataset that has many duplicate entries, that is, customers who appear multiple times in the data because of address changes, marriages, and mis-entries (boulevard instead of blvd., etc.). Because of this situation, you will choose a(n) __________ algorithm to clean up the dataset.
- dimension reduction
- data reduction
- adaptive filtering
- collaborative filtering
Precise Answer ✔✔data reduction

When working with linear or logistic regression, categorical variables must have one subtype removed when one-hot encoding (dummy coding) or else the model will fail.
- True
- False
Precise Answer ✔✔True

Because machine learning is automated, there is not any human bias or discrimination in the results.
- True
- False
Precise Answer ✔✔False

An individual's behavior can be psychologically manipulated through the analysis of their Facebook data.
- True
- False
Precise Answer ✔✔True
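The dummy-coding question above can be illustrated with pandas. This is a minimal sketch on a made-up column (the `fuel` column and its values are hypothetical): with all categories kept, the dummy columns always sum to 1 in every row, which duplicates the information in a regression intercept, so one category is dropped.

```python
import pandas as pd

# Hypothetical categorical predictor with three subtypes.
df = pd.DataFrame({"fuel": ["petrol", "diesel", "cng", "petrol"]})

# Full one-hot encoding: one column per category.
full = pd.get_dummies(df["fuel"])

# Dummy coding for regression: drop the first category as the reference level.
reduced = pd.get_dummies(df["fuel"], drop_first=True)

print(list(full.columns))     # ['cng', 'diesel', 'petrol']
print(list(reduced.columns))  # ['diesel', 'petrol']

# Every row of the full encoding sums to 1, so the columns are
# perfectly collinear with a constant (intercept) term.
print(full.sum(axis=1).tolist())  # [1, 1, 1, 1]
```

The dropped category becomes the baseline: each remaining coefficient is read as the difference from that reference level.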
Fake Facebook and _____________ accounts, most notably under Russian control, have helped create and spread divisive and destabilizing messaging in Western democracies with a goal of affecting election outcomes.
- TikTok
- Instagram
- Twitter
- Etsy
Precise Answer ✔✔Twitter

Citizens of the European Union, like citizens of the United States, can opt out of automated decision-making algorithms.
- True
- False
Precise Answer ✔✔False

The Competitive Intelligence department of the Acme Corporation has gathered every possible variable to model their chief competitors. In addition to the number of employees by department, the sales and profit numbers, product satisfaction, warranty returns, and product pricing, they also gathered data such as the brand of wristwatches worn by the chief executives and the softness of the bath tissue in the executive washroom. You suspect that many of these variables are unnecessary, so you choose a(n) ____________ technique for data preprocessing, which should improve ____________________.
- dimension reduction, predictive accuracy
- data reduction, bias-variance tradeoff
- dimension reduction, bias-variance tradeoff
- data reduction, predictive accuracy
Precise Answer ✔✔dimension reduction, predictive accuracy

The primary purpose of data exploration is to understand the global landscape of the data and to _________________.
- detect unusual values
- detect categorical variables
- False
Precise Answer ✔✔False

A list can contain any Python type. But a list itself is also a Python type. That means that a list can also contain a list! Python is getting funkier by the minute, but fear not, just remember the list syntax: (Python List Assignment)
my_list = [el1, el2, el3]
Can you tell which ones of the following lines of Python code are valid ways to build a list? Please choose all correct answers.
- [1, 3, 4, 2]
- [[1, 2, 3], [4, 5, 7]]
- [1 + 2, "a" * 5, 3]
Precise Answer ✔✔ALL ARE CORRECT

Assume you are given two lists:
a = [1,2,3,4,5]
b = [6,7,8,9]
The task is to create a list which has all the elements of a and b in one dimension. Output:
a = [1,2,3,4,5,6,7,8,9]
Which of the following options would you choose? (Python List)
- a.extend(b)
- a.append(b)
- a "+" b
- a.join(b)
Precise Answer ✔✔a.extend(b)

Have a look at this line of code: (python basics)
np.array([True, 1, 2]) + np.array([3, 4, False])
Can you tell which code chunk builds the exact same Python object?
- np.array([True, 1, 2, 3, 4, False])
- np.array([1, 1, 2]) + np.array([3, 4, -1])
- np.array([0, 1, 2, 3, 4, 5])
- np.array([4, 3, 0]) + np.array([0, 2, 2])
Precise Answer ✔✔np.array([4, 3, 0]) + np.array([0, 2, 2])

Assuming matplotlib.pyplot is imported as plt. Which of the following commands plots a histogram of the variable named X? (Intro to matplotlib)
- plt.histogram(X)
- plt.plot.hist(X)
- plt.hist(X)
- plt.plot(X)
Precise Answer ✔✔plt.hist(X)

How would you create a Figure with 6 Axes objects organized in 3 rows and 2 columns? (matplotlib grammar)
- fig, ax = plt.subplots[3, 2]
- fig, ax = plt.subplots((2, 3))
- fig, ax = plt.axes((2, 3))
- fig, ax = plt.subplots(3, 2)
Precise Answer ✔✔fig, ax = plt.subplots(3, 2)

What is the correct way to calculate 2 to the power of 10? (python basics)
- 2pow10
- 2%%10
- 2^10
- 2**10
Precise Answer ✔✔2**10

Which one is NOT like the others? (python basics)
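The Python-basics answers above can be checked directly in an interpreter. The sketch below is illustrative only; the variable names `nested`, `left` and `right` are not from the exam.

```python
import numpy as np

a = [1, 2, 3, 4, 5]
b = [6, 7, 8, 9]

# append would add b as ONE nested element; shown on a copy for comparison.
nested = a.copy()
nested.append(b)
print(nested)  # [1, 2, 3, 4, 5, [6, 7, 8, 9]]

# extend adds each element of b, giving a flat list.
a.extend(b)
print(a)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]

# In an integer array, True becomes 1 and False becomes 0, then the
# two arrays are added element by element.
left = np.array([True, 1, 2]) + np.array([3, 4, False])
right = np.array([4, 3, 0]) + np.array([0, 2, 2])
print(left.tolist(), right.tolist())  # [4, 5, 2] [4, 5, 2]

# ** is exponentiation; ^ is bitwise XOR, not a power operator.
print(2 ** 10)  # 1024
print(2 ^ 10)   # 8
```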
Fill in the following blank to inspect a dataframe "df". (loading data in pandas)
print (df.______)
- info()
- inspect
- inspect()
- info
Precise Answer ✔✔info()

A cluster is always comprised of a group consisting of two or more members.
- True
- False
Precise Answer ✔✔False

Agglomerative clustering begins with n clusters, and n =
- the number of samples
- the number of variables
- one
- zero
Precise Answer ✔✔the number of samples

Ward's method that is used to form clusters considers the ___________________________ that occurs when records are clustered.
- loss of distance
- gain of entropy
- dissimilarity
- loss of information
Precise Answer ✔✔loss of information
Interpreting a dendrogram involves determining a _____________ on the _____ axis.
- mean threshold, X
- cutoff distance, X
- mean threshold, Y
- cutoff distance, Y
Precise Answer ✔✔cutoff distance, Y

Ward's method of creating clusters results in clusters that are roughly _______________.
- convex and equal sized
- concave and equal-sized
- concave and truly unique
- convex and lumpy in appearance
Precise Answer ✔✔convex and equal sized

The vertical length on a dendrogram represents the _____________________.
- distance between clusters
- measure of entropy
- uniqueness of the records
- distance between records
Precise Answer ✔✔distance between records

The primary means of grouping neighborhoods by lifestyle involves using _____________ to segment the demographics.
- counties
- cities
- GPS coordinates
- zip codes
Precise Answer ✔✔zip codes
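The clustering questions above (n starting clusters, single-member clusters, and a cutoff distance on the dendrogram's Y axis) can be tied together in a small sketch. It assumes Euclidean distance and single linkage, which is a simplification chosen for brevity and is not Ward's method; the function names and sample points are hypothetical.

```python
from itertools import combinations
from math import dist


def single_linkage(a, b):
    """Smallest pairwise Euclidean distance between members of clusters a and b."""
    return min(dist(p, q) for p in a for q in b)


def agglomerate(points, cutoff):
    """Merge the closest clusters until the next merge would exceed the cutoff."""
    # Start with n singleton clusters, one per sample.
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        # The cutoff plays the role of the horizontal line drawn across
        # a dendrogram: merges above that height are not performed.
        if single_linkage(clusters[i], clusters[j]) > cutoff:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
result = agglomerate(points, cutoff=2.0)
print(len(result))               # 3
print([len(c) for c in result])  # [2, 2, 1]
```

With a cutoff of 2.0 the isolated point stays in a cluster of its own, which is why a cluster does not need two or more members.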
- reduce the data dimension
Precise Answer ✔✔1. Determine the purpose
Complete this command:
df.______ ("_____")["______"].mean()
- groupby, income, gender
- groupby, age, gender
- groupby, age, income
- groupby, gender, income
Precise Answer ✔✔groupby, gender, income

You have the following dataframe df: (Data Manipulation with Pandas)
Select the code that returns the following output:
- print(df.iloc[3:]
- print(df.iloc([0:3])
- print(df.iloc[1:3])
- print(df.iloc([2:3])
Precise Answer ✔✔print(df.iloc[1:3])

Suppose you have the following data in a csv file named sales.csv: (Data Manipulation with Pandas)
What is the correct command to read it as a pandas dataframe?
- pd.read_excel("sales.xlsx")
- pd.readcsv("sales.csv")
- pd.read_csv("sales.csv")
- pd. read_excel("sales.csv)
Precise Answer ✔✔pd.read_csv("sales.csv")

Use the table and choose the correct code to generate the output shown. (Data Manipulation with Pandas)
- df.filter(income ($), no of family members)
- df.select(income,no of family members)
- df[['income ($)', 'no of family members']]
- df['income', 'no of family members']
Precise Answer ✔✔df[['income ($)', 'no of family members']]

The following formula for a cluster model denotes the _____________________ that is used to form clusters.
min(distance(Ai, Bj)), i = 1, 2, ..., m; j = 1, 2, ..., n.
- average linkage formula
- single linkage formula
- centroid linkage formula
- complete linkage formula
Precise Answer ✔✔single linkage formula

The following formula for a cluster model denotes the _____________________ that is used to form clusters.
max(distance(Ai, Bj)), i = 1, 2, ..., m; j = 1, 2, ..., n.
- average linkage formula
- complete linkage formula
- single linkage formula
- centroid linkage formula
Precise Answer ✔✔complete linkage formula

The following formula for a cluster model denotes the _____________________ that is used to form clusters.
Average(distance(Ai, Bj)), i = 1, 2, ..., m; j = 1, 2, ..., n.
- complete linkage formula
- average linkage formula
- centroid linkage formula
- single linkage formula
Precise Answer ✔✔average linkage formula

Based on the following profile plot, the greatest distinction between clusters is found with _______________.
Precise Answer ✔✔Sales

According to our textbook, heatmaps can be used to show ______________. (There is more than one correct answer.)
- Missing values.
- Correlation between variables.
- Whether mean and median of a variable are close to each other.
- Mean value of a variable.
Precise Answer ✔✔Missing values. Correlation between variables.

If the Pearson correlation coefficient between two variables is 0, then the two variables are totally independent from each other.
- True
- False
Precise Answer ✔✔False

What CANNOT be concluded from the following missing value chart?
- After removing all observations that have missing values, we will end up with a dataset that contains at least 804 observations.
- More than 10 variables have some missing values.
- Usually, we don't have to remove all variables that have missing values.
- There are more observations where CHAS is 1.
- The average value of MEDV across all observations should be between 20 and 30.
- When CHAS is 0, the average MEDV is higher compared to when CHAS is 1.
Precise Answer ✔✔The average value of MEDV across all observations should be between 20 and 30.

Data exploration must be followed by more formal analysis.
- True
- False
Precise Answer ✔✔False

What can we conclude from the following box plot?
- CHAS is a categorical variable.
- The maximum value for MEDV when CHAS is 0 is around 37.
- The median for MEDV when CHAS is 0 is around 25.
- MEDV is a categorical variable.
Precise Answer ✔✔CHAS is a categorical variable.

What is a correct statement according to the following figure? Please select the best choice.
- From this figure, we can conclude that the observations that have an LSTAT value above 35 should be determined as outliers.
- There is a positive relationship between LSTAT and MEDV.
- Plotting the two variables as a line plot will give us similar insights compared to a scatter plot.
- In this figure, each circle represents one observation.
Precise Answer ✔✔In this figure, each circle represents one observation.

In scatter plots, we mainly use different colors of points to _________________
- Represent different categorical variables.
- Represent different numeric variables.
- Make the figure look beautiful.
- Represent outliers.
Precise Answer ✔✔Represent different categorical variables.

What is the best response when we see the following figure?
- Variable CRIM is not suitable for plotting boxplots.
- We need to rescale CAT.MEDV.
- We need to rescale CRIM.
- The figure is already informative, we can take it as it is.
Precise Answer ✔✔We need to rescale CRIM.

Which of the following is the most important assumption for linear regression?
- The normality of the outcome variable
- The normality of the residuals
- The normality of the predictor variables
- None of these
Precise Answer ✔✔The normality of the residuals

Which of the options is correct with respect to RMSE and MAE?
- RMSE can be negative or positive but MAE will always be positive
- MAE is preferable to RMSE if the residuals have outliers
- Both RMSE and MAE can be either positive or negative
- RMSE is preferable to MAE if the residuals have outliers
Precise Answer ✔✔MAE is preferable to RMSE if the residuals have outliers

Y = B0 + B1x1 + B2x2 + ... + Bpxp + E
What does Y represent in the equation?