Data Representation in Data warehouse and mining, Summaries of Data Warehousing

This document gives a clear review of what data representation is.

Typology: Summaries

2020/2021

Available from 01/19/2022

SanketSalvi

Data Representation
What does “data” look like? In other words, what do we download into our computer?
Data comes in many shapes and forms; for instance, it could be words from
a document or pixels from an image. But it will be useful to convert data into a
standard format so that the algorithms that we will discuss can be applied to it.
Most datasets can be represented as a matrix, X = [X_{in}], with rows indexed by
“attribute-index” i and columns indexed by “data-index” n. The value X_{in} for
attribute i and data-case n can be binary, real, discrete, etc., depending on what
we measure. For instance, if we measure the weight and color of 100 cars, the matrix
X is 2 × 100 dimensional, and X_{1,20} = 20,684.57 is the weight of car number 20 in
some units (a real value), while X_{2,20} = 2 is the color of car number 20 (say, one of 6
predefined colors).
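
The car example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the placeholder weight of 1500.0 and color code 1 are made up, the helper name entry is hypothetical, and only the two values for car number 20 come from the text.

```python
# Attributes index the rows, data-cases index the columns (as in the text).
N_CARS = 100
WEIGHT, COLOR = 0, 1  # 0-based row indices; the text counts from 1

# Hypothetical placeholder values: every car gets the same weight and color,
# except car number 20, which takes the values quoted in the text.
X = [
    [1500.0] * N_CARS,  # row 1: weight, a real value
    [1] * N_CARS,       # row 2: color, one of 6 predefined codes
]
X[WEIGHT][20 - 1] = 20684.57
X[COLOR][20 - 1] = 2


def entry(matrix, i, n):
    """Return X_{i,n} using the text's 1-based attribute and data-case indices."""
    return matrix[i - 1][n - 1]


shape = (len(X), len(X[0]))  # (number of attributes, number of data-cases)
```

Here entry(X, 1, 20) reads the weight of car number 20 and entry(X, 2, 20) its color code. Note that the two rows hold different kinds of values (real and discrete), which the matrix form permits.
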
Most datasets can be cast in this form (but not all). For documents, we can
give each distinct word of a prespecified vocabulary a number and simply count how
often a word was present. Say the word “book” is defined to have number 10,568 in the
vocabulary; then X_{10568,5076} = 4 would mean that the word “book” appeared 4 times in
document 5076. Sometimes the different data-cases do not have the same number
of attributes. Consider searching the internet for images about rats. You’ll retrieve
a large variety of images, most with a different number of pixels. We can either
try to rescale the images to a common size or we can simply leave those entries in
the matrix empty. It may also occur that a certain entry is supposed to be there but
it couldn’t be measured. For instance, if we run an optical character recognition
system on a scanned document, some letters will not be recognized. We’ll use a
question mark, “?”, to indicate that the entry wasn’t observed.
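
A minimal sketch of the word-count representation and of the “?” convention follows. The three-word vocabulary, the two toy documents, and the helper names count_matrix and mark_missing are all hypothetical; a real vocabulary would be far larger, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter

# Hypothetical toy vocabulary and documents, for illustration only.
vocabulary = ["book", "car", "rat"]
documents = [
    "book about a rat",
    "car book book car",
]


def count_matrix(vocab, docs):
    """Build X with one row per vocabulary word and one column per document."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    # Words outside the vocabulary are ignored; absent words count as 0.
    return [[c[word] for c in counts] for word in vocab]


X = count_matrix(vocabulary, documents)

MISSING = "?"  # marker for an entry that should exist but was not observed


def mark_missing(matrix, i, n):
    """Return a copy of the matrix with 0-based entry (i, n) marked unobserved."""
    copy = [row[:] for row in matrix]
    copy[i][n] = MISSING
    return copy


X_obs = mark_missing(X, 2, 1)
```

The distinction matters: a count of 0 says the word was seen not to occur, whereas “?” says nothing was observed for that entry, so the two should not be stored the same way.
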
It is very important to realize that there are many ways to represent data and
not all are equally suitable for analysis. By this I mean that in some representations
the structure may be obvious, while in others it may become
totally obscure. It is still there, but just harder to find. The algorithms that we will
discuss are based on certain assumptions, such as “Hummers and Ferraris can
be separated by a line” (see figure ??). While this may be true if we measure
weight in kilograms and height in meters, it is no longer true if we decide to recode
these numbers into bit-strings. The structure is still in the data, but we would
need a much more complex assumption to discover it. The lesson is
to spend some time thinking about which representation makes the structure as
obvious as possible, and to transform the data if necessary before applying standard
algorithms. In the next section we’ll discuss some standard preprocessing operations.
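
One small way to see how recoding can hide structure is to compare closeness between numbers with closeness between their bit-strings. The sketch below assumes a 4-bit unsigned binary encoding and uses the hypothetical helpers to_bits and hamming; it is an illustration of the general point, not the example from the figure.

```python
def to_bits(value, width=4):
    """Encode a non-negative integer as a fixed-width binary string."""
    return format(value, f"0{width}b")


def hamming(a, b):
    """Count the positions at which two equal-length bit-strings differ."""
    return sum(x != y for x, y in zip(a, b))


# 7 and 8 are neighbours as numbers, but their encodings differ in every bit.
near = hamming(to_bits(7), to_bits(8))

# 0 and 8 are far apart as numbers, but their encodings differ in one bit.
far = hamming(to_bits(0), to_bits(8))
```

An algorithm that compares data-cases bit by bit would therefore treat 7 and 8 as maximally different, even though the information about their closeness is still present in the encoding.
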
It is often advisable to visualize the data before preprocessing and analyzing
it. This will often tell you if the structure is a good match for the algorithm you
had in mind for further analysis. Chapter ?? will discuss some elementary visualization
techniques.
