

This summary covers the importance of data visualization and dimensionality reduction in the process of data analysis. It emphasizes the need to choose an appropriate representation for the data and an algorithm with which to continue the analysis. It also explains the challenges of visualizing high-dimensional data and the use of projection to reduce the number of dimensions. It further discusses the central limit theorem and its impact on data visualization.
Typology: Summaries


The process of data analysis does not consist merely of picking an algorithm, fitting it to the data and reporting the results. We have seen that we need to choose a representation for the data, which in many cases necessitates data preprocessing. Depending on the data representation and the task at hand, we then have to choose an algorithm to continue our analysis. But even after we have run the algorithm and studied the results, we may realize that our initial choice of algorithm or representation was not optimal. We may therefore decide to try another representation or algorithm, compare the results and perhaps combine them. This is an iterative process.

What may help us in deciding the representation and algorithm for further analysis? Consider the two datasets in Figure ??. In the left figure we see that the data naturally forms clusters, while in the right figure we observe that the data is approximately distributed on a line. The left figure suggests a clustering approach while the right figure suggests a dimensionality reduction approach. This illustrates the importance of looking at the data before you start your analysis instead of (literally) blindly picking an algorithm. After your first peek, you may decide to transform the data and then look again to see if the transformed data better suits the assumptions of the algorithm you have in mind.

“Looking at the data” sounds easier than it really is. The reason is that we are not equipped to think in more than 3 dimensions, while most data lives in much higher dimensions. For instance, image patches of size 10 × 10 live in a 100-dimensional pixel space. How are we going to visualize it? There are many answers to this problem, but most involve projection: we determine a number of, say, 2- or 3-dimensional subspaces onto which we project the data. The simplest choice of subspaces are the ones aligned with the features, e.g. we can plot $X_{1n}$ versus $X_{2n}$, etc.
An example of such a scatter plot is given in Figure ??. Note that we have a total of $d(d-1)/2$ possible two-dimensional projections, which amounts to 4950 projections for 100-dimensional data. This is usually too many to inspect manually. How do we cut down on the number of dimensions? Perhaps random projections may work? Unfortunately, that turns out not to be a great idea in many cases. The reason is that data projected on a random subspace often looks as if it were distributed according to what is known as a Gaussian distribution (see Figure ??). The deeper reason behind this phenomenon is the central limit theorem, which states that the sum of a large number of independent random variables is (under certain conditions) distributed as a Gaussian distribution. Hence, if we denote by $w$ a vector in $\mathbb{R}^d$ and by $x$ the $d$-dimensional random variable, then $y = w^T x$ is the value of the projection. This is clearly a weighted sum of the random variables $x_i$, $i = 1, \dots, d$. If we assume that the $x_i$ are approximately independent, then we can see that their sum will be governed by the central limit theorem. Analogously, a dataset $\{X_{in}\}$ can thus be visualized in one dimension by “histogramming” the values of $Y = w^T X$, see Figure ??. In this figure we clearly recognize the characteristic “bell shape” of the Gaussian distribution of projected and histogrammed data.

In one sense the central limit theorem is a rather helpful quirk of nature. Many variables follow Gaussian distributions, and the Gaussian distribution is one of the few distributions that have very nice analytic properties. Unfortunately, the
Gaussian distribution is also the most uninformative distribution. This notion of “uninformative” can be made very precise using information theory: given a fixed mean and variance, the Gaussian density represents the least amount of information (equivalently, it has the largest entropy) among all densities with the same mean and variance. This is rather unfortunate for our purposes because Gaussian projections are the least revealing dimensions to look at.
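
The effect of the central limit theorem on random projections can be checked numerically. The following is a minimal sketch, not taken from the text: it assumes toy data with independent, uniformly distributed features (clearly non-Gaussian), and it uses excess kurtosis as one possible measure of how far a sample is from a Gaussian (it is zero for a Gaussian and about -1.2 for a uniform distribution). The names `X`, `w`, `y` and the helper `excess_kurtosis` are chosen here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 20000, 100  # number of data cases and number of dimensions (assumed values)

# Assumed toy data: independent, non-Gaussian (uniform) features, one row per data case.
X = rng.uniform(-1.0, 1.0, size=(n, d))


def excess_kurtosis(v):
    """Fourth standardized moment minus 3; this is zero for a Gaussian."""
    z = (v - v.mean()) / v.std()
    return float(np.mean(z**4) - 3.0)


# Random projection direction w in R^d, normalized to unit length.
w = rng.standard_normal(d)
w /= np.linalg.norm(w)

# Projected values y_n = w^T x_n, one per data case.
y = X @ w

# "Histogramming" the projected values: counts per bin and the bin edges.
counts, edges = np.histogram(y, bins=30)

kurt_feature = excess_kurtosis(X[:, 0])  # a single raw feature: far from Gaussian
kurt_projection = excess_kurtosis(y)     # the random projection: close to Gaussian

print("excess kurtosis of one feature:    ", round(kurt_feature, 3))
print("excess kurtosis of the projection: ", round(kurt_projection, 3))
```

Under these assumptions the single feature keeps its strongly non-Gaussian shape, while the projected values have excess kurtosis close to zero, which is consistent with the bell shape described above. Plotting `counts` against the bin centers would give the histogram itself; with real data, whose features are rarely fully independent, the effect may be weaker.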