Data visualization in Data warehouse and mining, Summaries of Data Warehousing

This document discusses the importance of data visualization and dimensionality reduction in the process of data analysis. It emphasizes the need to choose an appropriate representation for the data and an algorithm with which to continue the analysis. It also explains the challenges of visualizing high-dimensional data and the use of projection to reduce the number of dimensions, and it discusses the central limit theorem and its impact on data visualization.

Typology: Summaries

2020/2021

Available from 01/19/2022

SanketSalvi

Data Visualization
The process of data analysis does not just consist of picking an algorithm, fitting
it to the data and reporting the results. We have seen that we need to choose a
representation for the data, which in many cases necessitates data preprocessing.
Depending on the data representation and the task at hand, we then have to choose
an algorithm to continue our analysis. But even after we have run the algorithm
and studied the results we are interested in, we may realize that our initial choice of
algorithm or representation was not optimal. We may therefore decide
to try another representation or algorithm, compare the results and perhaps combine
them. This is an iterative process.
What may help us decide on the representation and algorithm for further
analysis? Consider the two datasets in Figure ??. In the left figure we see that the
data naturally forms clusters, while in the right figure we observe that the data is
approximately distributed along a line. The left figure suggests a clustering approach
while the right figure suggests a dimensionality reduction approach. This illustrates
the importance of looking at the data before you start your analysis instead
of (literally) blindly picking an algorithm. After your first peek, you may decide
to transform the data and then look again to see if the transformed data better suits
the assumptions of the algorithm you have in mind.
“Looking at the data” sounds easier than it really is. The reason is that
we are not equipped to think in more than 3 dimensions, while most data lives
in much higher dimensions. For instance, image patches of size 10 × 10 live in a
100-dimensional pixel space. How are we going to visualize such data? There are many
answers to this problem, but most involve projection: we determine a number of, say, 2- or
3-dimensional subspaces onto which we project the data. The simplest choice is the set of
subspaces aligned with the features, e.g. we can plot $X_{1n}$ versus $X_{2n}$,
etc. An example of such a scatter plot is given in Figure ??.
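A feature-aligned projection simply keeps two of the d coordinates of every data point. The following Python sketch illustrates this on synthetic data and also counts how many such feature pairs exist. The dataset `X`, its size, and the helper name `project_onto_features` are assumptions made for illustration and do not come from the text.

```python
import itertools
import math
import random

random.seed(0)

d = 100   # number of features, e.g. the pixels of a 10 x 10 image patch
N = 50    # number of data points (arbitrary choice for this sketch)

# Synthetic stand-in for a real dataset: N points with d features each.
X = [[random.random() for _ in range(d)] for _ in range(N)]


def project_onto_features(data, i, j):
    """Keep only features i and j of every point (a feature-aligned 2-D projection)."""
    return [(point[i], point[j]) for point in data]


# Every unordered pair of distinct features defines one candidate scatter plot.
feature_pairs = list(itertools.combinations(range(d), 2))
num_projections = len(feature_pairs)

# The pair (0, 1) corresponds to plotting the first feature against the second.
first_projection = project_onto_features(X, 0, 1)

print(num_projections)                      # prints 4950
print(num_projections == math.comb(d, 2))   # prints True
```

The pairs in `first_projection` can be handed to any plotting library to draw the corresponding scatter plot.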
Note that we have a total of $d(d-1)/2$ possible two-dimensional feature-aligned projections,
which amounts to 4950 projections for 100-dimensional data. This is usually too
many to inspect manually. How do we cut down on the number of dimensions?
Perhaps random projections may work? Unfortunately, that turns out not to be a
great idea in many cases. The reason is that data projected onto a random subspace
often looks as if it were distributed according to a Gaussian distribution (see
Figure ??). The deeper reason behind this phenomenon is the central limit theorem,
which states that the sum of a large number of independent random variables
is (under certain conditions) approximately Gaussian distributed. Hence, if we
denote by $w$ a vector in $\mathbb{R}^d$ and by $x$ the $d$-dimensional random variable, then
$y = w^T x$ is the value of the projection. This is clearly a weighted sum of
the random variables $x_i$, $i = 1, \dots, d$. If we assume that the $x_i$ are approximately
independent, then we can see that their sum will be governed by the central limit
theorem. Analogously, a dataset $\{X_{in}\}$ can be visualized in one dimension
by “histogramming” the values of $Y = w^T X$, see Figure ??. In this figure we
clearly recognize the characteristic bell shape of the Gaussian distribution in the
projected and histogrammed data.
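The effect described above can be checked numerically. The sketch below is only an illustration under stated assumptions: the features are drawn independently from a uniform distribution (clearly non-Gaussian), the direction w is drawn at random, and excess kurtosis (0 for a Gaussian, -1.2 for a uniform distribution) serves as a simple measure of how Gaussian a sample looks. None of these choices come from the text.

```python
import math
import random

random.seed(0)

d = 100    # dimensionality of the data
N = 5000   # number of data points


def excess_kurtosis(values):
    """Sample excess kurtosis: 0 for a Gaussian, -1.2 for a uniform distribution."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2) - 3.0


def histogram(values, num_bins=20):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Clamp so that the maximum value lands in the last bin.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts


# Assumed data model: d independent, uniformly distributed features per point.
X = [[random.random() for _ in range(d)] for _ in range(N)]

# Random projection direction w, normalized to unit length.
w = [random.gauss(0.0, 1.0) for _ in range(d)]
norm = math.sqrt(sum(wi * wi for wi in w))
w = [wi / norm for wi in w]

# y_n = w^T x_n: a weighted sum of the d features of point n.
Y = [sum(wi * xi for wi, xi in zip(w, x)) for x in X]

kurt_feature = excess_kurtosis([x[0] for x in X])   # single raw feature
kurt_projection = excess_kurtosis(Y)                # random projection

# Crude text histogram of the projected values.
for count in histogram(Y):
    print("#" * (count // 20))
print(kurt_feature, kurt_projection)
```

Under these assumptions the raw feature keeps its flat, uniform shape, while the projected values have an excess kurtosis close to zero and a bell-shaped histogram, even though no individual feature is Gaussian.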
In one sense the central limit theorem is a rather helpful quirk of nature. Many
variables follow Gaussian distributions and the Gaussian distribution is one of
the few distributions which have very nice analytic properties. Unfortunately, the
Gaussian distribution is also the most uninformative distribution. This notion of “uninformative” can be made precise using information theory: given a fixed mean and variance, the Gaussian density has the maximum entropy, that is, it represents the least amount of information, among all densities with the same mean and variance. This is rather unfortunate for our purposes, because Gaussian projections are the least revealing dimensions to look at.
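To make this statement concrete, the following sketch compares the differential entropy (in nats) of three familiar densities that share the same variance, using their standard closed-form expressions. The choice of the uniform and Laplace densities as comparison points is an assumption made for illustration; the text does not name specific alternatives.

```python
import math


def gaussian_entropy(sigma):
    """Differential entropy of a Gaussian with standard deviation sigma."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)


def uniform_entropy(sigma):
    """Differential entropy of a uniform density with standard deviation sigma.

    A uniform density of width L has variance L^2 / 12, so L = sigma * sqrt(12).
    """
    return math.log(sigma * math.sqrt(12.0))


def laplace_entropy(sigma):
    """Differential entropy of a Laplace density with standard deviation sigma.

    A Laplace density with scale b has variance 2 * b^2, so b = sigma / sqrt(2).
    """
    b = sigma / math.sqrt(2.0)
    return 1.0 + math.log(2.0 * b)


sigma = 1.0  # common standard deviation; the mean does not affect the entropy
entropies = {
    "gaussian": gaussian_entropy(sigma),
    "laplace": laplace_entropy(sigma),
    "uniform": uniform_entropy(sigma),
}

# List the densities from highest to lowest entropy.
for name, h in sorted(entropies.items(), key=lambda item: -item[1]):
    print(f"{name:9s} {h:.4f}")
```

Among these three, the Gaussian has the largest entropy, which is the precise sense in which it carries the least information once the mean and variance are fixed.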