Data Preprocessing: Techniques for Missing Values, Noisy Data, and Integration

An overview of data preprocessing techniques used in data mining to improve accuracy, simplify results, and reduce data volume. Topics include data cleaning for handling missing values and noisy data, data integration for schema and object matching, and data reduction through techniques such as data cube aggregation and attribute subset selection. The document also covers concepts like correlation analysis and discretization.

Typology: Essays (high school)

2015/2016

Uploaded on 11/05/2016

bhargav_vangara
Data Preprocessing
Week 2

Topics

• Data Types
• Data Repositories
• Data Preprocessing
• Present homework assignment

Team Homework Assignment #3

• Prepare for the one-page description of your group project topic
• Prepare for presentation using slides
• Due date: beginning of the lecture on Friday, February [day missing]

Figure 1.4 Data mining as a step in the process of knowledge discovery

Major Tasks in Data Preprocessing

Figure 2.1 Forms of data preprocessing

Why Is Data Preprocessing Beneficial to Data Mining?

• Less data: data mining methods can learn faster
• Higher accuracy: data mining methods can generalize better
• Simple results: they are easier to understand
• Fewer attributes: for the next round of data collection, savings can be made by removing redundant and irrelevant features

Remarks on Data Cleaning

• “Data cleaning is one of the biggest problems in data warehousing” (Ralph Kimball)
• “Data cleaning is the number one problem in data warehousing” (DCI survey)

Why Is Data “Dirty”?

• Incomplete, noisy, and inconsistent data are commonplace properties of large real-world databases (p. 48)
• There are many possible reasons for noisy data (p. [page number missing])

Methods for Missing Values (1)

• Ignore the tuple
• Fill in the missing value manually
• Use a global constant to fill in the missing value
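
The first and third methods on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names, and the constant "Unknown" are hypothetical and do not come from the lecture. Manual filling is not shown because it is not an automated step.

```python
# Hypothetical toy records; None marks a missing value.
records = [
    {"id": 1, "income": 52000, "city": "Austin"},
    {"id": 2, "income": None, "city": "Boston"},
    {"id": 3, "income": 61000, "city": None},
]


def ignore_incomplete(rows):
    """Ignore the tuple: keep only rows that have no missing value."""
    return [row for row in rows if all(v is not None for v in row.values())]


def fill_with_constant(rows, constant="Unknown"):
    """Fill every missing value with the same global constant."""
    return [
        {key: (constant if value is None else value) for key, value in row.items()}
        for row in rows
    ]


complete_only = ignore_incomplete(records)
filled = fill_with_constant(records)

print(len(complete_only))    # number of rows that survive
print(filled[1]["income"])   # value that replaced the gap
```

Ignoring tuples discards the observed values in those rows as well, while a global constant keeps every row but may be mistaken by a mining method for a meaningful value.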

Methods for Missing Values (2)

• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples belonging to the same class as the given tuple
• Use the most probable value to fill in the missing value
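
The two mean-based methods above might look like the following sketch. The class labels and income figures are hypothetical, and the code assumes every class has at least one observed value. The "most probable value" method usually needs a predictive model fitted on the other attributes, so it is left out here.

```python
from statistics import mean

# Hypothetical toy data: (class label, income); None marks a missing value.
rows = [
    ("low_risk", 60000.0),
    ("low_risk", 80000.0),
    ("low_risk", None),
    ("high_risk", 20000.0),
    ("high_risk", 40000.0),
    ("high_risk", None),
]


def fill_with_attribute_mean(data):
    """Fill gaps with the mean of all observed values of the attribute."""
    overall = mean(v for _, v in data if v is not None)
    return [(label, overall if v is None else v) for label, v in data]


def fill_with_class_mean(data):
    """Fill gaps with the mean of observed values from the same class."""
    observed = {}
    for label, v in data:
        if v is not None:
            observed.setdefault(label, []).append(v)
    class_means = {label: mean(vals) for label, vals in observed.items()}
    return [(label, class_means[label] if v is None else v) for label, v in data]


by_attribute = fill_with_attribute_mean(rows)
by_class = fill_with_class_mean(rows)
```

With this toy data the overall mean pulls both gaps to the same value, whereas the class mean fills each gap with a value typical of its own class.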

Binning
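
Only the title of this slide survives in the preview. One common form of binning for noisy data is smoothing by bin means: sort the values, split them into equal-frequency bins, and replace each value with the mean of its bin. The sketch below assumes that variant; the price list and the bin size of 3 are illustrative choices, not taken from the slide.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, cut them into equal-frequency bins of `bin_size`,
    and replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

The result is returned in sorted order, so this sketch suits a single attribute inspected on its own rather than values that must stay aligned with their original rows.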

Regression
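
Only the title of this slide survives in the preview. A typical use of regression in data cleaning is to fit a function to the data and replace noisy values with the fitted ones. The sketch below assumes simple linear regression with one predictor, fitted by ordinary least squares; the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical noisy measurements that roughly follow a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)

# Smoothed values: each noisy y is replaced by the point on the fitted line.
smoothed = [slope * x + intercept for x in xs]
print(slope, intercept)
```

The same fitted line could also be used to predict a missing y from a known x, which is one way to obtain a "most probable value".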

Data Integration

• Schema integration and object matching
  - Entity identification problem
• Redundant data (between attributes) often occur when multiple databases are integrated
  - Redundant attributes may be detected by correlation analysis and the chi-square method
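
The two redundancy checks named above can be sketched as follows, assuming the Pearson correlation coefficient for numeric attributes and the Pearson chi-square statistic for categorical attributes. The attribute values and the contingency table counts are illustrative, and the function names are hypothetical.

```python
from math import sqrt


def pearson_correlation(a, b):
    """Correlation coefficient between two numeric attributes (range -1 to 1)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def chi_square(table):
    """Chi-square statistic for a contingency table of observed counts
    (rows: values of one categorical attribute, columns: values of the other)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if the two attributes were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Illustrative numeric attributes: the second is exactly twice the first.
r = pearson_correlation([1, 2, 3, 4], [2, 4, 6, 8])

# Illustrative 2x2 table of observed counts for two categorical attributes.
chi2 = chi_square([[250, 200], [50, 1000]])
print(r, chi2)
```

A correlation close to 1 or -1, or a chi-square value far above the critical value for the table's degrees of freedom, suggests that one attribute can largely be derived from the other and is a candidate for removal.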