Data Representation in Data warehouse and mining | Summaries Data Warehousing

Data Representation

What does “data” look like? In other words, what do we download into our computer? Data comes in many shapes and forms; for instance, it could be words from a document or pixels from an image. It is useful, however, to convert data into a standard format so that the algorithms we will discuss can be applied to it. Most datasets can be represented as a matrix, X = [X_{in}], with rows indexed by the “attribute-index” i and columns indexed by the “data-index” n. The value X_{in} for attribute i and data-case n can be binary, real, discrete, etc., depending on what we measure. For instance, if we measure the weight and color of 100 cars, the matrix X is 2 × 100 dimensional, and X_{1,20} = 20,684.57 is the weight of car number 20 in some units (a real value), while X_{2,20} = 2 is the color of car number 20 (say, one of 6 predefined colors).
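The car example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `set_entry` and `get_entry` are hypothetical, the matrix is stored as a plain list of rows, and entries not mentioned in the text are left as `None` rather than invented. Note that the text counts from 1 while Python counts from 0, so the helpers translate between the two.

```python
# Sketch: a 2 x 100 attribute-by-data-case matrix stored as a list of rows.
N_ATTRIBUTES = 2   # row 1: weight, row 2: color code (1-based, as in the text)
N_CARS = 100       # one column per car

# None marks entries that this illustration does not fill in.
X = [[None] * N_CARS for _ in range(N_ATTRIBUTES)]


def set_entry(X, i, n, value):
    """Store X_{i,n} using the text's 1-based attribute and data indices."""
    X[i - 1][n - 1] = value


def get_entry(X, i, n):
    """Read X_{i,n} using the text's 1-based attribute and data indices."""
    return X[i - 1][n - 1]


set_entry(X, 1, 20, 20684.57)  # weight of car number 20 (a real value)
set_entry(X, 2, 20, 2)         # color of car number 20 (one of 6 codes)

shape = (len(X), len(X[0]))    # (number of attributes, number of data-cases)
print(shape, get_entry(X, 1, 20), get_entry(X, 2, 20))
```

Keeping attributes as rows and data-cases as columns follows the convention in the text; many libraries use the transposed layout, so it is worth checking which one a given tool expects.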

Most datasets can be cast in this form (but not all). For documents, we can give each distinct word of a prespecified vocabulary a number and simply count how often each word occurs. Say the word “book” is defined to have number 10,568 in the vocabulary; then X_{10568,5076} = 4 would mean that the word “book” appeared 4 times in document 5076. Sometimes the different data-cases do not have the same number of attributes. Consider searching the internet for images of rats. You’ll retrieve a large variety of images, most with a different number of pixels. We can either rescale the images to a common size or simply leave those entries in the matrix empty. It may also occur that a certain entry is supposed to be there but couldn’t be measured. For instance, if we run an optical character recognition system on a scanned document, some letters will not be recognized. We’ll use a question mark, “?”, to indicate that an entry wasn’t observed.
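A minimal sketch of the two ideas in this paragraph, word counts and empty or unobserved entries, might look as follows. Everything here is an assumption made for illustration: the function names `count_matrix` and `to_columns`, the three-word vocabulary, the two toy documents, and the toy data-cases are all invented. Empty entries are represented by `None`, and unobserved ones by the “?” marker from the text.

```python
from collections import Counter

MISSING = "?"  # marker the text uses for an entry that was not observed


def count_matrix(documents, vocabulary):
    """Return X with one row per vocabulary word and one column per document."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    # A Counter returns 0 for words that never occur in a document.
    return [[c[word] for c in counts] for word in vocabulary]


def to_columns(cases, n_attributes, fill=None):
    """Pad data-cases of unequal length with `fill`, then return the matrix
    with attributes as rows and data-cases as columns."""
    padded = [list(case) + [fill] * (n_attributes - len(case)) for case in cases]
    return [[case[i] for case in padded] for i in range(n_attributes)]


# Hypothetical toy inputs, invented for this sketch.
vocabulary = ["book", "car", "rat"]
documents = ["the book about a book", "a rat in a car"]
X = count_matrix(documents, vocabulary)

# The first case has one unobserved entry; the second has one attribute fewer.
cases = [[0.1, MISSING, 0.9], [0.3, 0.7]]
Y = to_columns(cases, n_attributes=3)

print(X)
print(Y)
```

Keeping “empty” (`None`) distinct from “unobserved” (`"?"`) mirrors the distinction the text draws; whether an algorithm should treat the two differently depends on the algorithm.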

It is very important to realize that there are many ways to represent data and not all are equally suitable for analysis. By this I mean that in some representations the structure may be obvious, while in others it may become totally obscure. It is still there, but just harder to find. The algorithms that we will discuss are based on certain assumptions, such as “Hummers and Ferraris can be separated by a line” (see figure ??). While this may be true if we measure weight in kilograms and height in meters, it is no longer true if we decide to recode these numbers into bit-strings. The structure is still in the data, but we would need a much more complex assumption to discover it. The lesson is thus to spend some time thinking about which representation makes the structure as obvious as possible, and to transform the data if necessary before applying standard algorithms. In the next section we’ll discuss some standard preprocessing operations.
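One way to get a feel for how recoding can hide structure is to compare numerical closeness with bit-wise closeness. The sketch below is not the figure the text refers to and does not test separability by a line; it only shows, for two hand-picked pairs of integers, that values that are close as numbers can be far apart as bit-strings and vice versa. The helper names `to_bits` and `hamming` and the 12-bit width are assumptions made for this illustration.

```python
def to_bits(value, width=12):
    """Encode a non-negative integer as a fixed-width bit-string."""
    return format(value, "0{}b".format(width))


def hamming(a, b):
    """Number of positions at which two equal-length bit-strings differ."""
    return sum(x != y for x, y in zip(a, b))


# Hand-picked pairs: the first is numerically close, the second far apart.
pairs = [(2047, 2048), (0, 2048)]

# Each entry: (a, b, numerical difference, number of differing bits).
report = [(a, b, abs(a - b), hamming(to_bits(a), to_bits(b))) for a, b in pairs]

for a, b, diff, bits in report:
    print(a, b, "numeric difference:", diff, "differing bits:", bits)
```

Here 2047 and 2048 differ by 1 as numbers but in all 12 bits, while 0 and 2048 differ by 2048 as numbers but in a single bit, so a method that treats each bit as an equally important feature would see a very different notion of “nearby” than one that works with the raw measurements.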

It is often advisable to visualize the data before preprocessing and analyzing it. This will often tell you whether the structure is a good match for the algorithm you had in mind for further analysis. Chapter ?? will discuss some elementary visualization techniques.
