Data visualization in Data warehouse and mining | Summaries Data Warehousing

Data Visualization

The process of data analysis does not just consist of picking an algorithm, fitting it to the data and reporting the results. We have seen that we need to choose a representation for the data, which in many cases requires data preprocessing. Depending on the data representation and the task at hand, we then have to choose an algorithm to continue our analysis. But even after we have run the algorithm and studied the results we are interested in, we may realize that our initial choice of algorithm or representation was not optimal. We may therefore decide to try another representation or algorithm, compare the results and perhaps combine them. This is an iterative process.

What may help us in deciding on the representation and algorithm for further analysis? Consider the two datasets in Figure ??. In the left figure we see that the data naturally forms clusters, while in the right figure we observe that the data is approximately distributed along a line. The left figure suggests a clustering approach, while the right figure suggests a dimensionality reduction approach. This illustrates the importance of looking at the data before you start your analysis instead of (literally) blindly picking an algorithm. After your first peek, you may decide to transform the data and then look again to see if the transformed data better suits the assumptions of the algorithm you have in mind.

“Looking at the data” sounds easier than it really is. The reason is that we are not equipped to think in more than 3 dimensions, while most data lives in much higher dimensions. For instance, image patches of size 10 × 10 live in a 100-dimensional pixel space. How are we going to visualize it? There are many answers to this problem, but most involve projection: we determine a number of, say, 2- or 3-dimensional subspaces onto which we project the data. The simplest choice of subspaces are the ones aligned with the features, e.g. we can plot $X_{1n}$ versus $X_{2n}$, etc. An example of such a scatter plot is given in Figure ??.
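
As a rough illustration of axis-aligned projections, the following minimal sketch enumerates all unordered feature pairs and extracts one such two-dimensional projection. The function names and the toy dataset are hypothetical and chosen only for illustration; for d features there are d(d − 1)/2 such pairs.

```python
from itertools import combinations


def axis_aligned_pairs(d):
    """Return all unordered feature-index pairs (i, j) with i < j."""
    return list(combinations(range(d), 2))


def project_onto_features(data, i, j):
    """Keep only features i and j of every data point.

    This is the axis-aligned 2-D projection that a scatter plot of
    feature i versus feature j would display.
    """
    return [(row[i], row[j]) for row in data]


# Toy dataset (made-up values): 3 data points with 4 features each.
data = [
    [0.1, 1.0, 5.0, -2.0],
    [0.4, 0.8, 4.0, -1.0],
    [0.9, 0.2, 6.0, -3.0],
]

pairs = axis_aligned_pairs(4)                  # every candidate scatter plot
points_01 = project_onto_features(data, 0, 1)  # feature 0 versus feature 1

print(len(pairs), "axis-aligned projections for 4 features")
print(points_01)
```

Each entry of `pairs` corresponds to one candidate scatter plot, so the list grows quadratically with the number of features.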

Note that we have a total of $d(d-1)/2$ possible two-dimensional projections, which amounts to 4950 projections for 100-dimensional data. This is usually too many to inspect manually. How do we cut down on the number of dimensions? Perhaps random projections may work? Unfortunately, that turns out not to be a great idea in many cases. The reason is that data projected on a random subspace often looks distributed according to what is known as a Gaussian distribution (see Figure ??). The deeper reason behind this phenomenon is the central limit theorem, which states that the sum of a large number of independent random variables is (under certain conditions) approximately distributed as a Gaussian distribution. Hence, if we denote by $w$ a vector in $\mathbb{R}^d$ and by $x$ the $d$-dimensional random variable, then $y = w^T x$ is the value of the projection. This is clearly a weighted sum of the random variables $x_i$, $i = 1, \ldots, d$. If we assume that the $x_i$ are approximately independent, then we can see that their sum will be governed by the central limit theorem. Analogously, a dataset $\{X_{in}\}$ can thus be visualized in one dimension by “histogramming” the values of $Y = w^T X$, see Figure ??. In this figure we clearly recognize the characteristic bell shape of the Gaussian distribution of projected and histogrammed data.
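
The effect described above can be reproduced with a small experiment. The sketch below is only an illustration under stated assumptions: the data is synthetic (independent uniform features, which are clearly non-Gaussian), the projection direction is a random unit vector, and excess kurtosis (0 for a Gaussian, about −1.2 for a uniform distribution) is used as a crude stand-in for "looks Gaussian". All function names are hypothetical.

```python
import math
import random


def random_unit_vector(d, rng):
    """Draw a random direction w in R^d and normalize it to unit length."""
    w = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in w))
    return [v / norm for v in w]


def project(data, w):
    """Compute y_n = w^T x_n for every data point x_n."""
    return [sum(wi * xi for wi, xi in zip(w, row)) for row in data]


def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2) - 3.0


def text_histogram(values, bins=15, width=40):
    """Return one bar of '#' characters per bin, scaled to the fullest bin."""
    lo, hi = min(values), max(values)
    counts = [0] * bins
    for v in values:
        k = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[k] += 1
    peak = max(counts)
    return ["#" * round(c / peak * width) for c in counts]


rng = random.Random(0)  # fixed seed so the run is repeatable
d, n = 100, 5000

# Synthetic, assumed data: n points with d independent uniform features.
data = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]

w = random_unit_vector(d, rng)
y = project(data, w)

kurt_single = excess_kurtosis([row[0] for row in data])  # one raw feature
kurt_projected = excess_kurtosis(y)                      # random projection

print("excess kurtosis, single feature:   ", round(kurt_single, 2))
print("excess kurtosis, random projection:", round(kurt_projected, 2))
for bar in text_histogram(y):
    print(bar)
```

The printed histogram of the projected values should show the familiar bell shape even though every individual feature is uniformly distributed.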

In one sense the central limit theorem is a rather helpful quirk of nature. Many variables follow Gaussian distributions, and the Gaussian distribution is one of the few distributions that have very nice analytic properties. Unfortunately, the Gaussian distribution is also the most uninformative distribution. This notion of “uninformative” can actually be made very precise using information theory: given a fixed mean and variance, the Gaussian density represents the least amount of information (it has the largest entropy) among all densities with the same mean and variance. This is rather unfortunate for our purposes because Gaussian projections are the least revealing dimensions to look at.
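
To make the "least informative" statement concrete, the following sketch compares the differential entropy (in nats) of three densities that share the same variance, using their standard closed-form expressions. The choice of the uniform and Laplace densities as comparison cases is an assumption made here for illustration; larger entropy corresponds to less information.

```python
import math


def gaussian_entropy(variance):
    """Differential entropy of a Gaussian: 0.5 * ln(2 * pi * e * variance)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)


def uniform_entropy(variance):
    """Uniform on an interval of width L: variance L^2 / 12, entropy ln(L)."""
    width = math.sqrt(12.0 * variance)
    return math.log(width)


def laplace_entropy(variance):
    """Laplace with scale b: variance 2 * b^2, entropy 1 + ln(2 * b)."""
    scale = math.sqrt(variance / 2.0)
    return 1.0 + math.log(2.0 * scale)


variance = 1.0
entropies = {
    "gaussian": gaussian_entropy(variance),
    "laplace": laplace_entropy(variance),
    "uniform": uniform_entropy(variance),
}

# List the densities from highest to lowest entropy.
for name, h in sorted(entropies.items(), key=lambda kv: -kv[1]):
    print(f"{name:8s} {h:.4f} nats")
```

For any common variance the Gaussian comes out on top, which is the sense in which a Gaussian-looking projection reveals the least about the data.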
