











A comprehensive set of questions and answers on data visualization, tailored for the CSE 578 course. It covers a wide range of topics, including the benefits of data visualization, differences between data processing and querying, data exploration and navigation, and various data challenges. The material also covers data schemas, database structures, the curse of dimensionality, data transformation, and scalable data exploratory systems. Visual variables, color schemes, and statistical concepts such as skewness and quantiles are also addressed.
Typology: Exams












Why is data visualization helpful? CORRECT ANSWER √√ >>>1. amplifies cognition
What does HMLE stand for and mean? CORRECT ANSWER √√ >>>HMLE is a list of data challenges: High-dimensional, Multi-modal, Inter-Linked, Evolving
What is a data schema? CORRECT ANSWER √√ >>>A set of constraints that...
Give an example of ordinal data CORRECT ANSWER √√ >>>Small, Medium, Large
What is ordinal data? CORRECT ANSWER √√ >>>Data that has a specified order, but no specified distance metric.
What is interval data? CORRECT ANSWER √√ >>>Data that has measurable distances between values but no true zero point (such as temperature in degrees Celsius).
What is ratio data? CORRECT ANSWER √√ >>>Same as interval data, but with a true zero point (such as lengths read off a measuring tape).
In which step of the visualization pipeline is data prepared for the visualization (smooth, interpolate, transform)? CORRECT ANSWER √√ >>>Data analysis
In which step of the visualization pipeline is a subset of the data selected for visualization? CORRECT ANSWER √√ >>>Filtering
In which step of the visualization pipeline are data mapped to geometric primitives and their attributes? CORRECT ANSWER √√ >>>Mapping
In which step of the visualization pipeline is geometric data transformed to image data? CORRECT ANSWER √√ >>>Rendering
According to Bertin's Visual Variables, what is the best way to represent a quantitative dimension visually? CORRECT ANSWER √√ >>>Position
True or False: Area and volume are among the best attributes to use for graphing data. CORRECT ANSWER √√ >>>False. They are among the WORST attributes.
True or False: Objects in the skyline are not dominated by any other objects in the database. CORRECT ANSWER √√ >>>True
Which data type is best suited for a rainbow color scheme? CORRECT ANSWER √√ >>>Nominal, because no ordering is implied.
Is a qualitative color scheme a univariate or multivariate color scheme? CORRECT ANSWER √√ >>>Univariate
Sequential color schemes are best suited for which type of data? CORRECT ANSWER √√ >>>Ordered data
A divergent color scheme is a univariate color scheme that is best suited for which type of data? CORRECT ANSWER √√ >>>Ratio data where there is some meaningful zero point
In a positive skew, the curve slopes down toward which direction? CORRECT ANSWER √√ >>>The curve slopes down from left to right (the long tail is on the right).
In a negative skew, the curve slopes down toward which direction? CORRECT ANSWER √√ >>>The curve slopes down from right to left (the long tail is on the left).
How many visual variables did Bertin identify? How many can you name? CORRECT ANSWER √√ >>>7, and they are: position, size, value, color, texture, orientation, and shape.
Can you describe "data deluge"? CORRECT ANSWER √√ >>>A vast increase in the amount of data generated by individuals and businesses.
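The skyline statement in this set (skyline objects are not dominated by any other object) can be made concrete with a small sketch. The dominance rule used here is the usual one and is an assumption about what the course intends: `a` dominates `b` if `a` is no worse in every dimension and strictly better in at least one, with smaller values treated as better. The `hotels` data is made up for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b everywhere and strictly better somewhere.

    Assumption: smaller is better in every dimension.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def skyline(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (price, distance_to_beach) pairs; lower is better for both.
hotels = [(50, 8), (80, 2), (60, 5), (90, 9), (70, 6)]
print(skyline(hotels))
```

This brute-force version compares every pair, so it costs O(n^2) dominance checks; it is meant to show the definition, not an efficient skyline algorithm.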
√√ >>>Solution is to drop labels which overlap or fall outside the data range. This leads to unevenly spaced labels or axes with only one label.
____________ are points taken at regular intervals from the cumulative distribution function of a random variable. CORRECT ANSWER √√ >>>Quantiles
What are the 5 main measures in a box-and-whisker plot? CORRECT ANSWER √√ >>>Lower extreme, lower quartile, median, upper quartile, upper extreme
When a histogram has a tail that goes to the right, which way is it skewed? CORRECT ANSWER √√ >>>Right skewed (AKA positive skew)
Using Sturges' formula, how many bins should there be for a dataset of 10 points? CORRECT ANSWER √√ >>>Sturges' formula is K = 1 + log2(N), where K is the number of class intervals (bins) and N is the number of observations. It works best when the data are roughly normally distributed. For N = 10, K = 1 + log2(10) ≈ 4.32, which rounds up to 5 bins.
The first step in finding quantiles of a dataset is to? CORRECT ANSWER √√ >>>Sort the data
19 23 26 30 33 35 38 38 40 42 45 45 47 56 What is the value of the first quartile in this dataset? CORRECT ANSWER √√ >>>
How do you calculate the maximum number of data points that fall below the third quantile? CORRECT ANSWER √√ >>>Divide the number of data points by the number of quantiles, and multiply by the quantile number. Example: Suppose that a dataset containing 36 data points is divided into 9 quantiles. 36 / 9 = 4, and 4 * 3 = 12.
On a box plot, which values can be considered outliers? CORRECT ANSWER √√ >>>Values more than 1.5 times the inter-quartile range (IQR) below the lower quartile or above the upper quartile.
What are the two main ways of presenting multivariate data sets? CORRECT ANSWER √√ >>>Directly (textually) - Tables and Symbolically (pictures) - Graphs
What is the term for the exploratory graphical technique that can help determine notable relationships between two variables? CORRECT ANSWER √√ >>>Scagnostics
What can small multiples be used for? CORRECT ANSWER √√ >>>1. Show snapshots of events that change over time
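The arithmetic behind several of the statistics answers in this part of the set (Sturges' formula, the points-below-a-quantile count, and box-plot outlier fences) can be checked with a short sketch. The function names are hypothetical, the `sample` data is made up, and two choices are assumptions rather than anything the source specifies: quartiles use the `inclusive` interpolation method of Python's `statistics.quantiles` (other conventions give slightly different quartiles), and the fence multiplier `k` defaults to the conventional 1.5.

```python
import math
import statistics


def sturges_bins(n: int) -> int:
    """Sturges' formula K = 1 + log2(N), rounded up to a whole bin count."""
    return math.ceil(1 + math.log2(n))


def points_below_quantile(n_points: int, n_quantiles: int, k: int) -> int:
    """Points per quantile group times the quantile number."""
    return (n_points // n_quantiles) * k


def five_number_summary(data):
    """Lower extreme, lower quartile, median, upper quartile, upper extreme."""
    xs = sorted(data)  # sorting is the first step in finding quantiles
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return xs[0], q1, median, q3, xs[-1]


def fence_outliers(data, k: float = 1.5):
    """Values more than k * IQR below Q1 or above Q3 (k is an assumption)."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up data with one extreme value
print(sturges_bins(10))                  # 5
print(points_below_quantile(36, 9, 3))   # 12
print(fence_outliers(sample))            # [100]
```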
Which node has the smallest entropy?
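The tree diagram this question refers to did not survive in the text, so it cannot be answered here. As a reminder of how node entropy is usually compared, the sketch below computes Shannon entropy from the class labels at each node; the node names and label counts are invented for illustration and are not the ones from the original figure.

```python
import math
from collections import Counter


def entropy(labels) -> float:
    """Shannon entropy (in bits) of the class labels at one node."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


# Hypothetical nodes: a pure node, an evenly mixed node, and a 3:1 node.
nodes = {
    "pure": ["yes"] * 4,
    "even": ["yes", "yes", "no", "no"],
    "skewed": ["yes", "yes", "yes", "no"],
}

for name, labels in nodes.items():
    print(name, round(entropy(labels), 3))

# The node with the smallest entropy is the most homogeneous one.
print(min(nodes, key=lambda name: entropy(nodes[name])))
```

A node whose instances all share one class has entropy 0, and a two-class node split evenly has entropy 1 bit.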
Regression can be solved by estimating what two things, using the provided data set and labels Y? CORRECT ANSWER √√ >>>The vector of regression coefficients (W) and epsilon.
True or False: In supervised learning, testing data is used for determining when a model is overfitted, and can be used to evaluate the model. CORRECT ANSWER √√ >>>True
K-fold cross validation is a resampling strategy used to evaluate the model. What does the parameter k refer to? CORRECT ANSWER √√ >>>k is the number of groups into which the given data is split.
Suppose you have a dataset of 100 points. If the leave-one-out validation technique is used, how many times does the model need to be fit? CORRECT ANSWER √√ >>>100. In the leave-one-out technique, we use all the instances but one for training. The one instance left is used for testing. If we have N instances, we use N-1 for training and 1 for testing.
Does k-NN learn? CORRECT ANSWER √√ >>>No, it is considered lazy learning.
What is the formula for accuracy given a confusion matrix? CORRECT ANSWER √√ >>>(TP + TN) / (TP + TN + FP + FN)
When using k-means, how are new centroids formed? CORRECT ANSWER √√ >>>New centroids are formed by taking the mean of all the points in each cluster.
Which of these is NOT a stopping criterion for k-means clustering?
What measurement compares the average distance between instances in the same cluster to the average distance between instances in different clusters? CORRECT ANSWER √√ >>>Silhouette index
The silhouette index lies between which values? CORRECT ANSWER √√ >>>[-1, 1]
What is the best case measurement for silhouette? CORRECT ANSWER √√ >>>Best measurement is 1, meaning the distance within the cluster is 0 and the distance between clusters is high.
What is the relationship between the number of clusters and cohesiveness in a clustering algorithm? CORRECT ANSWER √√ >>>As the number of clusters increases, the value of cohesiveness decreases.
True or False: A scatterplot matrix can be used to identify the relationship between different categories of data in a multivariate dataset. CORRECT ANSWER √√ >>>True, they enable the eye to efficiently and quickly identify variable pairings with strong or weak relationships.
What is the goal of the model evaluation process? CORRECT ANSWER √√ >>>To see how the model is performing on the unseen data.
In k-means clustering, a point is considered to be in a particular cluster if it is closer to that cluster's _________ than any other ________. CORRECT ANSWER √√ >>>Centroid
How is error calculated using accuracy in supervised learning? CORRECT ANSWER √√ >>>1 - accuracy
What does k-fold cross-validation training do? CORRECT ANSWER √√ >>>1. Divide training sets into some number of equally-sized sets.
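The k-fold answer here is cut off after its first step, so the sketch below follows the procedure as it is commonly described (hold out each fold once for testing and train on the rest) rather than the course's exact wording. It also shows the accuracy and error formulas from this set. The function names and the confusion-matrix counts are hypothetical.

```python
def k_fold_indices(n: int, k: int):
    """Yield (train_indices, test_indices) for each of k folds over range(n).

    Folds are as equal in size as possible; when n is not divisible by k,
    the first n % k folds get one extra item.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """(TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# 5-fold split of 10 instances: 5 model fits, 2 test instances each.
folds = list(k_fold_indices(10, 5))
print(len(folds), folds[0])

# Leave-one-out is the special case k = N: 100 fits for 100 instances.
loo = list(k_fold_indices(100, 100))
print(len(loo), len(loo[0][0]), len(loo[0][1]))

# Made-up confusion-matrix counts.
acc = accuracy(tp=40, tn=45, fp=5, fn=10)
print(acc, 1 - acc)  # accuracy and error = 1 - accuracy
```

In practice the indices are usually shuffled before splitting; that step is left out to keep the fold boundaries easy to read.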
Describe how you would complete the k-means algorithm. CORRECT ANSWER √√ >>>1. Calculate the distance between each point and each centroid.
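The answer above is truncated after its first step. As a hedged illustration, here is a minimal sketch of the textbook k-means loop (often called Lloyd's algorithm): assign each point to its nearest centroid, recompute each centroid as the mean of its points, and stop when the centroids no longer change. The steps after the first are the standard ones, not necessarily the course's wording, and the data, function names, and `max_iter` cap are assumptions.

```python
import math


def assign(points, centroids):
    """Step 1: label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]


def update(points, labels, centroids):
    """Step 2: move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            new_centroids.append(old)  # assumption: keep an empty cluster's centroid
        else:
            new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids


def k_means(points, centroids, max_iter=100):
    """Repeat assign/update until centroids stop changing or max_iter is hit."""
    centroids = [tuple(c) for c in centroids]
    labels = assign(points, centroids)
    for _ in range(max_iter):
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
        labels = assign(points, centroids)
    return centroids, labels


# Made-up 2-D points with two obvious groups and hand-picked starting centroids.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
final_centroids, final_labels = k_means(points, [(0, 0), (10, 10)])
print(final_centroids, final_labels)
```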
In which supervised learning method is linear approximation used? CORRECT ANSWER √√ >>>Regression
True or False: Ordered time domains consider things that happen one after another. CORRECT ANSWER √√ >>>True
__________ time considers multiple what-if scenarios, allowing comparison of alternate scenarios. CORRECT ANSWER √√ >>>Branching
True or False: For temporal analysis design principles, spatial position should not be used as a visual cue. CORRECT ANSWER √√ >>>False, it is the strongest visual cue.
True or False: For temporal analysis design principles, we should provide side-by-side comparisons of small multiple views. CORRECT ANSWER √√ >>>True
What is a control chart? CORRECT ANSWER √√ >>>A graph used to study how a process changes over time. It always has a central line for average, an upper line for upper control limit, and a lower line for lower control limit.
When should we use a control chart? CORRECT ANSWER √√ >>>1. Controlling ongoing processes by finding and correcting problems as they occur.
What is a moving average chart? CORRECT ANSWER √√ >>>1. Monitors process location over time.
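The control chart and moving average chart described in this part of the set can be sketched numerically. The source does not say how the control limits are chosen; placing them three standard deviations from the mean is a common convention and is an assumption here, as are the function names and the made-up measurements.

```python
import statistics


def moving_average(xs, window: int):
    """Simple moving average: mean of each run of `window` consecutive values."""
    return [
        sum(xs[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(xs))
    ]


def control_limits(xs, n_sigma: float = 3.0):
    """Return (lower control limit, center line, upper control limit).

    Assumption: limits sit n_sigma population standard deviations
    from the mean of the observed values.
    """
    center = statistics.mean(xs)
    sigma = statistics.pstdev(xs)
    return center - n_sigma * sigma, center, center + n_sigma * sigma


measurements = [9, 10, 11, 10]  # made-up process readings
lcl, center, ucl = control_limits(measurements)
print(lcl, center, ucl)
print(moving_average([1, 2, 3, 4, 5], window=3))  # [2.0, 3.0, 4.0]
```

A reading outside the interval from the lower to the upper limit is the usual signal that the process needs attention.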
With a brute-force approach, what is the worst case cost of finding a 2-character substring in a string of 10 characters? CORRECT ANSWER √√ >>>18. The maximum number of comparisons made to find a substring of length M in a string of length N is M*(N - M + 1), so here 2 * (10 - 2 + 1) = 18.
______ ______ is a trie representation of a string, with suffixes of given text as key and position in the text as value. CORRECT ANSWER √√ >>>Suffix tree
Suppose you are using the KMP algorithm to search a pattern of length M in a string of length N. Which statement does NOT accurately identify characteristics of the KMP algorithm?
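The brute-force cost of M*(N - M + 1) quoted in this part of the set can be checked by counting character comparisons directly. This is a minimal sketch of naive substring search, not KMP; the function name and the test strings are made up for illustration.

```python
def brute_force_search(text: str, pattern: str):
    """Naive substring search.

    Returns (index of first match or -1, number of character comparisons).
    """
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):       # try every alignment of the pattern
        j = 0
        while j < m:
            comparisons += 1
            if text[i + j] != pattern[j]:
                break                # mismatch: shift the pattern by one
            j += 1
        if j == m:
            return i, comparisons    # all m characters matched
    return -1, comparisons


# Worst case for M = 2, N = 10: every alignment matches the first character
# and fails on the second, so all 2 * (10 - 2 + 1) comparisons are made.
print(brute_force_search("a" * 10, "ab"))  # (-1, 18)
```

KMP avoids re-comparing characters after a mismatch by precomputing how far the pattern can shift, which is why its search phase is linear in the text length.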