
This document gives a clear review of what data representation is.

What does “data” look like? In other words, what do we download into our computer? Data comes in many shapes and forms; for instance, it could be words from a document or pixels from an image. But it is useful to convert data into a standard format so that the algorithms we will discuss can be applied to it. Most datasets can be represented as a matrix, $X = [X_{in}]$, with rows indexed by the “attribute-index” $i$ and columns indexed by the “data-index” $n$. The value $X_{in}$ for attribute $i$ and data-case $n$ can be binary, real, discrete, etc., depending on what we measure. For instance, if we measure the weight and color of 100 cars, the matrix $X$ is $2 \times 100$ dimensional, and $X_{1,20} = 20{,}684.57$ is the weight of car no. 20 in some units (a real value), while $X_{2,20} = 2$ is the color of car no. 20 (say, one of 6 predefined colors).

Most datasets can be cast in this form (but not all). For documents, we can give each distinct word of a prespecified vocabulary a number and simply count how often each word is present. Say the word “book” is defined to have number 10,568 in the vocabulary; then $X_{10568,5076} = 4$ would mean that the word “book” appeared 4 times in document 5076.

Sometimes the different data-cases do not have the same number of attributes. Consider searching the internet for images of rats. You’ll retrieve a large variety of images, most with a different number of pixels. We can either try to rescale the images to a common size or simply leave those entries in the matrix empty. It may also occur that a certain entry is supposed to be there but couldn’t be measured. For instance, if we run an optical character recognition system on a scanned document, some letters will not be recognized. We’ll use a question mark, “?”, to indicate that an entry wasn’t observed.

It is very important to realize that there are many ways to represent data and not all are equally suitable for analysis.
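As a rough illustration of the attribute-by-data-case layout described above, here is a minimal Python sketch. All values, the three-word vocabulary, and the variable names are made up for illustration; `None` plays the role of the “?” marker for an unobserved entry, and indices are 0-based rather than the 1-based indices used in the text.

```python
from collections import Counter

# Hypothetical toy data: 2 attributes (weight, color code) for 5 cars.
# Rows are attributes (index i), columns are data-cases (index n).
# None plays the role of "?" for an entry that could not be measured.
weights = [1520.0, 2684.5, 980.25, None, 3100.0]
colors = [2, 5, 1, 3, 6]  # one of 6 predefined color codes
X = [weights, colors]

n_attributes, n_cases = len(X), len(X[0])

# Positions (i, n) of unobserved entries, 0-based.
missing = [(i, n) for i, row in enumerate(X)
           for n, value in enumerate(row) if value is None]

# Word counts: rows are vocabulary words, columns are documents.
vocabulary = {"book": 0, "car": 1, "rat": 2}  # word -> row index (made up)
documents = ["book car book", "rat", "car rat rat book"]

counts = [[0] * len(documents) for _ in vocabulary]
for n, doc in enumerate(documents):
    for word, k in Counter(doc.split()).items():
        if word in vocabulary:
            counts[vocabulary[word]][n] = k
```

Here `X[0][1]` is the weight of the second car and `counts[0][0]` is the number of times “book” occurs in the first document. A real dataset would more likely be held in an array library, but the indexing convention is the same.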
In some representations the structure may be obvious, while in others it may become totally obscured. It is still there, just harder to find. The algorithms that we will discuss are based on certain assumptions, such as “Hummers and Ferraris can be separated by a line” (see figure ??). While this may be true if we measure weight in kilograms and height in meters, it is no longer true if we decide to recode these numbers into bit-strings. The structure is still in the data, but we would need a much more complex assumption to discover it. The lesson is thus to spend some time thinking about the representation in which the structure is as obvious as possible, and to transform the data if necessary before applying standard algorithms. In the next section we’ll discuss some standard preprocessing operations.

It is often advisable to visualize the data before preprocessing and analyzing it. This will often tell you whether the structure is a good match for the algorithm you had in mind for further analysis. Chapter ?? will discuss some elementary visualization techniques.
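The remark about bit-strings can be made concrete with a small sketch. The helper names `to_bits` and `hamming` and the example values are assumptions chosen for illustration. The sketch does not prove anything about separability; it only shows that two numbers that are close as magnitudes can look maximally different bit by bit, so closeness that is obvious in one representation is hidden in the other.

```python
def to_bits(value, width=8):
    """Return the unsigned binary form of a non-negative integer as a bit-string."""
    return format(value, f"0{width}b")


def hamming(a, b):
    """Count the positions at which two equal-length bit-strings differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical measurements in some integer unit.
near_a, near_b = 127, 128  # numerically adjacent
far_a, far_b = 0, 128      # numerically far apart

numeric_gap_near = abs(near_b - near_a)
numeric_gap_far = abs(far_b - far_a)

# Distances between the same pairs after recoding into bit-strings.
bit_gap_near = hamming(to_bits(near_a), to_bits(near_b))
bit_gap_far = hamming(to_bits(far_a), to_bits(far_b))
```

In this example the adjacent pair 127 and 128 differs in all 8 bits (`01111111` versus `10000000`), while the distant pair 0 and 128 differs in a single bit, so an algorithm that treats each bit as a separate attribute sees the opposite of the numeric picture.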