












A series of questions and answers on data visualization, covering the benefits of data visualization, differences between data processing and querying, data challenges (ins, 3Vs, HMLE), data schemas, advantages of structured and semi-structured databases, the curse of dimensionality, data transformation, scalable data exploratory systems, prefix and subsequence searches, edit distance, skyline queries, data types (nominal, ordinal, interval, ratio), the visualization pipeline, Bertin's visual variables, color schemes, data skewness, data deluge, schema, dynamic visualization, vector representation, pie charts, line charts, Heckbert's labeling algorithm, quantiles, box-and-whisker plots, Sturges' formula, outliers, multivariate data sets, and scagnostics. It serves as a study guide or exam preparation material for students in data visualization courses, offering concise explanations and examples for key concepts.
Typology: Exams













Why is data visualization helpful? CORRECT ANSWER √√ >>>1. amplifies cognition
What does HMLE stand for and mean? CORRECT ANSWER √√ >>>HMLE is a list of data challenges: High-dimensional, Multi-modal, Inter-Linked, Evolving
What is a data schema? CORRECT ANSWER √√ >>>A set of constraints that...
True or False: For nominal data, order matters. CORRECT ANSWER √√ >>>False. Nominal data is data whose categories have no implied ordering.
Give an example of ordinal data. CORRECT ANSWER √√ >>>Small, Medium, Large
What is ordinal data? CORRECT ANSWER √√ >>>Data that has a specified order, but no specified distance metric.
What is interval data? CORRECT ANSWER √√ >>>Data that has measurable distances between values (such as temperature).
What is ratio data? CORRECT ANSWER √√ >>>Same as interval data, but with a meaningful zero point (for example, lengths on a measuring tape).
In which step of the visualization pipeline is data prepared for the visualization (smooth, interpolate, transform)? CORRECT ANSWER √√ >>>Data analysis
In which step of the visualization pipeline is a subset of the data selected for visualization? CORRECT ANSWER √√ >>>Filtering
In which step of the visualization pipeline are data mapped to geometric primitives and their attributes? CORRECT ANSWER √√ >>>Mapping
In which step of the visualization pipeline is geometric data transformed to image data? CORRECT ANSWER √√ >>>Rendering
According to Bertin's Visual Variables, what is the best way to represent a quantitative dimension visually? CORRECT ANSWER √√ >>>Position
True or False: Area and volume are among the best attributes to use for graphing data. CORRECT ANSWER √√ >>>False. They are among the WORST attributes.
True or False: Objects in the skyline are not dominated by any other objects in the database. CORRECT ANSWER √√ >>>True
Which data type is best suited for a rainbow color scheme? CORRECT ANSWER √√ >>>Nominal, because no ordering is implied.
Is a qualitative color scheme a univariate or multivariate color scheme? CORRECT ANSWER √√ >>>Univariate
Sequential color schemes are best suited for which type of data? CORRECT ANSWER √√ >>>Ordered data
A divergent color scheme is a univariate color scheme that is best suited for which type of data? CORRECT ANSWER √√ >>>Ratio data where there is some meaningful zero point
In a positive skew, the curve slopes down toward which direction? CORRECT ANSWER √√ >>>The curve slopes down from left to right (the long tail is on the right).
In a negative skew, the curve slopes down toward which direction? CORRECT ANSWER √√ >>>The curve slopes down from right to left (the long tail is on the left).
How many visual variables did Bertin identify? How many can you name? CORRECT ANSWER √√ >>>7, and they are: position, size, value, color, texture, orientation, and shape.
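The skyline statement above (skyline objects are not dominated by any other object) can be illustrated with a short sketch. It assumes that smaller values are better in every dimension; the hotel tuples and the function names are hypothetical and not taken from the course material.

```python
def dominates(a, b):
    """True if a is at least as good as b in every dimension and strictly
    better in at least one (assumption: smaller values are better)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def skyline(points):
    """Return the points that no other point dominates (naive O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (price, distance-to-beach) pairs.
hotels = [(50, 8), (60, 2), (80, 1), (70, 5), (90, 9)]
print(skyline(hotels))  # [(50, 8), (60, 2), (80, 1)]
```

Here (70, 5) is dominated by (60, 2) and (90, 9) is dominated by (50, 8), so neither belongs to the skyline.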
Which type of chart is best for changes over time? CORRECT ANSWER √√ >>>Line chart
Heckbert's Labeling Algorithm addresses the problem: for small numbers, the range of labels can be much larger than the data range. How is it addressed? CORRECT ANSWER √√ >>>The solution is to drop labels that overlap or fall outside the data range. This leads to unevenly spaced labels or axes with only one label.
____________ are points taken at regular intervals from the cumulative distribution function of a random variable. CORRECT ANSWER √√ >>>Quantiles
What are the 5 main measures in a box-and-whisker plot? CORRECT ANSWER √√ >>>Lower extreme, lower quartile, median, upper quartile, upper extreme
When a histogram has a tail that goes to the right, which way is it skewed? CORRECT ANSWER √√ >>>Right skewed (AKA positive skew)
Using Sturges' formula, how many bins should there be for a dataset of 10 points? CORRECT ANSWER √√ >>>Sturges' formula is K = 1 + log2(N), where K is the number of class intervals (bins) and N is the number of observations. It works best when the data are approximately normally distributed. For N = 10, K = 1 + log2(10) ≈ 4.32, which rounds up to 5 bins.
The first step in finding quantiles of a dataset is to? CORRECT ANSWER √√ >>>Sort the data
19 23 26 30 33 35 38 38 40 42 45 45 47 56 What is the value of the first quartile in this dataset? CORRECT ANSWER √√ >>>
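The Sturges' formula answer above (5 bins for 10 points) can be checked with a few lines of Python. This is a minimal sketch; rounding up with a ceiling is an assumption chosen to match the stated answer, and the function name is illustrative.

```python
import math


def sturges_bins(n):
    """Number of bins K = 1 + log2(N), rounded up (rounding is an assumption)."""
    return math.ceil(1 + math.log2(n))


print(sturges_bins(10))  # 1 + log2(10) is about 4.32, so this prints 5
```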
How do you calculate the maximum number of data points that fall below the third quartile? CORRECT ANSWER √√ >>>Divide the number of data points by the number of quantiles, and multiply by the quantile number. Example: Suppose that a dataset containing 36 data points is divided into 9 quantiles. 36 / 9 = 4, and 4 * 3 = 12.
On a box plot, which values can be considered outliers? CORRECT ANSWER √√ >>>Values more than 2 times the interquartile range (IQR) beyond the upper or lower quartile (many texts use 1.5 times the IQR instead).
What are the two main ways of presenting multivariate data sets? CORRECT ANSWER √√ >>>Directly (textually) - Tables and Symbolically (pictures) - Graphs
What is the term for the exploratory graphical technique that can help determine notable relationships between two variables? CORRECT ANSWER √√ >>>Scagnostics
What can small multiples be used for? CORRECT ANSWER √√ >>>1. Show snapshots of events that change over time
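The box-plot outlier rule from the questions above can be sketched as follows. The multiplier `k` is left as a parameter because the answer here uses 2 while 1.5 is also common; the sample values are hypothetical, and the quartiles come from Python's `statistics.quantiles`, whose default interpolation may differ from the method taught in the course.

```python
import statistics


def iqr_outliers(values, k=2.0):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Hypothetical sample with one obviously extreme value.
sample = [10, 12, 13, 14, 15, 16, 18, 60]
print(iqr_outliers(sample))         # k = 2, as in the answer above
print(iqr_outliers(sample, k=1.5))  # the other common convention
```

For this sample both multipliers flag only the value 60.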
The objective of a supervised learning model is what? CORRECT ANSWER √√ >>>To predict the correct label for newly presented input data
In supervised learning, for every observation of the feature measurement(s), x_i, i = 1, ..., n there is an associated _______ _______ y_i. CORRECT ANSWER √√ >>>response measurement
Which node has the smallest entropy?
Which measurement compares the average distance between instances in the same cluster to the average distance between instances in different clusters? CORRECT ANSWER √√ >>>Silhouette index
The silhouette index lies between which values? CORRECT ANSWER √√ >>>[-1, 1]
What is the best case measurement for silhouette? CORRECT ANSWER √√ >>>Best measurement is 1, meaning the distance within the cluster is 0 and the distance between clusters is high.
What is the relationship between the number of clusters and cohesiveness in a clustering algorithm? CORRECT ANSWER √√ >>>As the number of clusters increases, the value of cohesiveness decreases.
True or False: A scatterplot matrix can be used to identify the relationship between different categories of data in a multivariate dataset. CORRECT ANSWER √√ >>>True, they enable the eye to efficiently and quickly identify variable pairings with strong or weak relationships.
What is the goal of the model evaluation process? CORRECT ANSWER √√ >>>To see how the model is performing on the unseen data.
In k-means clustering, a point is considered to be in a particular cluster if it is closer to that cluster's _________ than any other ________. CORRECT ANSWER √√ >>>Centroid
How is error calculated using accuracy in supervised learning? CORRECT ANSWER √√ >>>1 - accuracy
What does k-fold cross-validation training do? CORRECT ANSWER √√ >>>1. Divide training sets into some number of equally-sized sets.
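The silhouette questions above can be made concrete with a small standard-library sketch. It assumes Euclidean distance, at least two clusters, and a score of 0 for points in single-member clusters; the points and labels are hypothetical.

```python
from math import dist


def silhouette_index(points, labels):
    """Mean silhouette over all points (assumes at least two clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention assumed for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette_index(pts, [0, 0, 1, 1]))  # tight, well-separated: near 1
print(silhouette_index(pts, [0, 1, 0, 1]))  # poor grouping: below 0
```

The score approaches 1 as within-cluster distances shrink relative to between-cluster distances, which matches the best-case answer above.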
________________ is an exploratory querying tool used to identify high level patterns in the data. CORRECT ANSWER √√ >>>Drill-down/rollup
What is a disadvantage of using Chernoff faces? CORRECT ANSWER √√ >>>A single Chernoff face is not sufficient to get an idea of the attributes belonging to the data. We need at least 2 faces to compare.
Suppose you have a set of ratio data with a meaningful zero point. Which color scheme is most suitable for visualizing this data? CORRECT ANSWER √√ >>>Divergent
What do we call points that are taken at regular intervals from the cumulative distribution function of a random variable? CORRECT ANSWER √√ >>>Quantiles
In which step of the supervised learning process do we give the model an unlabelled dataset to get predictions? CORRECT ANSWER √√ >>>Testing
What is the L1 norm between two points? CORRECT ANSWER √√ >>>The L1 norm distance is the sum of the absolute differences over all coordinates. For example, for (1, 1) and (2, 2) the L1 norm is |1 - 2| + |1 - 2| = 2.
What is the formula to calculate the entropy of a subset in a decision tree? CORRECT ANSWER √√ >>>-p log p - n log n, where p represents the probability of the positive class in the subset and n represents the probability of the negative class in the subset.
For the histogram width-based formula, how is the number of bins calculated? CORRECT ANSWER √√ >>>Ceiling of (max x - min x) / h, where h is the bin width
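Three of the formulas above (L1 distance, two-class entropy, and the width-based bin count) are easy to check numerically. In this minimal sketch the logarithm base 2 and the bin width of 10 are assumptions, since the answers above do not fix them; the function names are illustrative.

```python
import math


def l1_distance(p, q):
    """Sum of absolute coordinate differences between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def binary_entropy(p):
    """-p log p - n log n with n = 1 - p (log base 2 assumed; 0 log 0 = 0)."""
    n = 1 - p
    return sum(-x * math.log2(x) for x in (p, n) if x > 0)


def bins_from_width(values, h):
    """Ceiling of (max x - min x) / h for a chosen bin width h."""
    return math.ceil((max(values) - min(values)) / h)


print(l1_distance((1, 1), (2, 2)))  # 2, as in the example above
print(binary_entropy(0.5))          # 1.0 for an evenly split node
print(binary_entropy(1.0))          # 0.0 for a pure node
print(bins_from_width([19, 23, 26, 30, 33, 35, 38, 38, 40, 42, 45, 45, 47, 56], 10))  # 4
```

A pure node (all positive or all negative) has entropy 0, which is the smallest value the formula can take.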
What are some potential problems with multivariate analysis? CORRECT ANSWER √√ >>>1. Regular charts like parallel coordinate plots may become too congested.
Which technique would be best for extracting all addresses that have a "W" after the house number? CORRECT ANSWER √√ >>>Subsequence. A subsequence is a string formed by removing some symbols from the original string while keeping the remaining symbols in their original order. Subsequence search is useful for finding patterns whose symbols appear in order but not necessarily next to each other.
Consider this sample of street addresses: 1221 N CLARK ST; 2360 W ADDISON ST; 1239 W GRANVILLE AVE; 2712 N CLARK ST; 8902 N BROADWAY. Which technique would be best for extracting addresses where the street number starts with 12? CORRECT ANSWER √√ >>>Prefix search is the best technique for identifying and extracting the patterns that match the beginning of a string.
Let S be a set of s strings from alphabet Σ such that no string in S is a prefix of another string. If T is the trie for S, then how many leaves does T have? CORRECT ANSWER √√ >>>s
Which of the following is not a subsequence of 'GCFITQSPPN'? IST, CPN, GIT CORRECT ANSWER √√ >>>IST is not a subsequence of 'GCFITQSPPN' because its letters do not appear in that order: the only T comes before the S. A subsequence does not need to be contiguous, which is why CPN and GIT both qualify.
With a brute-force approach, what is the worst case cost of finding a 2-character substring in a string of 10 characters? CORRECT ANSWER √√ >>>18. The maximum number of comparisons made to find a substring of length M in a string of length N is M*(N - M + 1), here 2 * (10 - 2 + 1) = 18.
______ ______ is a trie representation of a string, with suffixes of given text as key and position in the text as value. CORRECT ANSWER √√ >>>Suffix tree
Suppose you are using the KMP algorithm to search a pattern of length M in a string of length N. Which statement does NOT accurately identify characteristics of the KMP algorithm?
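The subsequence and brute-force substring questions above can be verified with a short sketch. The function names and the all-"A" worst-case text are illustrative assumptions; the code only counts character comparisons for the naive search and does not implement KMP.

```python
def is_subsequence(candidate, text):
    """True if candidate's characters appear in text in order (gaps allowed)."""
    remaining = iter(text)
    # 'ch in remaining' consumes the iterator up to and including the match.
    return all(ch in remaining for ch in candidate)


def brute_force_find(pattern, text):
    """Naive search; returns (match index or -1, character comparisons made)."""
    comparisons = 0
    m, n = len(pattern), len(text)
    for start in range(n - m + 1):
        matched = 0
        while matched < m:
            comparisons += 1
            if text[start + matched] != pattern[matched]:
                break
            matched += 1
        if matched == m:
            return start, comparisons
    return -1, comparisons


for word in ("IST", "CPN", "GIT"):
    print(word, is_subsequence(word, "GCFITQSPPN"))  # False, True, True

# Worst case: every alignment matches until the last pattern character.
print(brute_force_find("AB", "A" * 10))  # (-1, 18) = M * (N - M + 1)
```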