Cheat Sheet for Introduction to Data Mining | CS 412, Study notes of Computer Science

CHEAT SHEET Material Type: Notes; Professor: Han; Class: Introduction to Data Mining; Subject: Computer Science; University: University of Illinois - Urbana-Champaign; Term: Fall 2012;

Typology: Study notes

2012/2013


Uploaded on 01/22/2013

zhi-1
Mean: Sum / N; trimmed: w/o extreme values; Median: middle val; approx: group intervals
by freq, take median intv.
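median ≈ L_1 + ((N/2 - (\Sigma freq)_l) / freq_{median}) * width
L_1: lower bd of median intv, N: # of all values, (\Sigma freq)_l: sum of freq of intvs lower than
median intv, freq_{median}: freq of median intv, width: width of median intv.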
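Mode: Most freq value. Empirical (unimodal, moderately skewed data): mean - mode ≈ 3 * (mean - median).
Midrange: (max+min) / 2. Range: max-min.
k-th q-quantile: value x s.t. at most k/q of the data are less than x and at most (q-k)/q are
greater than x. IQR: Q_3 - Q_1. 5 num summary: Min, Q1, Median, Q3, Max. Boxplot: box ends at
the quartiles, len = IQR. Mark Median. Whiskers extend to min and max (or to the most extreme
values within 1.5*IQR of the quartiles). Outliers: more than 1.5*IQR above Q3 or below Q1.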
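A minimal sketch of the five-number summary and the 1.5*IQR outlier rule. The function names and the sample data are made up for illustration, and the quartiles use the "inclusive" interpolation of Python's `statistics.quantiles`; other quartile conventions give slightly different Q1/Q3.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartile interpolation."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

def iqr_outliers(values, k=1.5):
    """Values lying more than k*IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sample with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
summary = five_number_summary(sample)
outliers = iqr_outliers(sample)
print(summary, outliers)
```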
Variance: (STD deviation square)
\sigma^2 = (1/N) \Sigma (x_i - \bar{x})^2 = ((1/N) \Sigma x_i^2) - \bar{x}^2.
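Quantile plot: x against f. x: data sorted in increasing order. f_i: about f_i * 100% of data are
below x_i. f_i = (i-0.5)/N, from 1/(2N) to 1-1/(2N). Q-Q plot: quantiles of one set of observations
against the corresponding quantiles of another. M=N: simply plot sorted x against sorted y.
M<N (M pts in x, N in y): plot x_i against the (i-0.5)/M quantile of y.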
Visualization methods: Basic: Boxplot, Histogram, Quantile Plot, QQ plot, Scatter plot.
Pixel-oriented: m-dim m-windows (pixels), color reflecting values. Circle segment. Geometric
projection: Direct visualization, Scatterplot and scatterplot matrices, Landscapes, Projection
pursuit: Help users find meaningful projections of multidimensional data, Prosection views,
Hyperslice, Parallel coordinates. Icon-based: Chernoff Faces: 10-dim, Stick Figures: 5-piece.
Shape, color, tile bars. Hierarchical: Dimensional Stacking (n-dim in 2D. >9D hard),
Worlds-within-Worlds, Tree-Map, Cone Trees (up to 1k nodes, overlapping 2D), InfoCube. Visualizing
complex data and relations: tag cloud. Data Correlation: attrs imply each other. neg, pos,
or null. Similarity: How alike 2 objects are. [0,1]. Dissimilarity: How diff. min 0.
d(i,j) = (r+s) / (q+r+s+t); d(i,j) = (r+s) / (q+r+s); Sim_Jaccard(i,j)=q/(q+r+s);
Data matrix: n pts with p dim (n x p); Two modes. Dissimilarity matrix: n pts, but
registers only the distance (dis b/w pts); A triangular n x n matrix; Single
mode. Minkowski Dist:
d(i,j) = (|x_{i1} - x_{j1}|^h + ··· + |x_{ip} - x_{jp}|^h)^{1/h}
d(i,j) > 0 if i != j && d(i,i) = 0 (pos def); d(i,j) = d(j,i) (symmetry); d(i,j) <= d(i,k) + d(k,j)
(triangle inequality). (h=1 Manhattan, Hamming dis for binary vectors; h=2 Euclidean;
h -> \infty supremum: max diff b/w any attr of the vectors)
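A small sketch of the Minkowski distance for h = 1, h = 2 and the supremum limit. The function name and the two sample points are illustrative assumptions, not part of the notes.

```python
import math

def minkowski(x, y, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of attributes")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(h):
        return max(diffs)            # limit h -> infinity: largest attribute difference
    return sum(d ** h for d in diffs) ** (1.0 / h)

# Hypothetical 2-D points.
p1, p2 = (1, 2), (3, 5)
manhattan = minkowski(p1, p2, 1)
euclidean = minkowski(p1, p2, 2)
supremum = minkowski(p1, p2, math.inf)
print(manhattan, euclidean, supremum)
```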
Prox Measure: (Nominal Attr) 1. simple match: d(i,j) = (p - m) / p, m = # of matches, p = total
# of attrs; 2. create a new bin attr for each nominal state.
(Bin Attr) 1. Create a contingency table q(1,1), r(1,0), s(0,1), t(0,0). 2. Distance measure for
symmetric (gender) binary variables: (r+s) / (q+r+s+t); 3. Distance measure for asymmetric
(symptom) binary variables: (r+s) / (q+r+s); 4. Jaccard coefficient (similarity measure for
asymmetric binary variables): q / (q+r+s). (Mixed Type) Cos Similarity: (eg Doc, vector obj)
If d_1, d_2 are vectors:
cos(d_1, d_2) = (d_1 · d_2) / (||d_1|| ||d_2||).
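The binary-attribute measures and cosine similarity above can be sketched as follows. All function names and the example vectors are hypothetical; the counts q, r, s, t follow the contingency-table convention in the notes (q = both 1, r = 1/0, s = 0/1, t = both 0).

```python
import math

def contingency(i, j):
    """Return (q, r, s, t) for two equal-length 0/1 vectors."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)
    return q, r, s, t

def symmetric_dissim(i, j):
    q, r, s, t = contingency(i, j)
    return (r + s) / (q + r + s + t)

def asymmetric_dissim(i, j):
    q, r, s, _ = contingency(i, j)   # negative matches t are ignored
    return (r + s) / (q + r + s)

def jaccard_sim(i, j):
    q, r, s, _ = contingency(i, j)
    return q / (q + r + s)

def cosine_sim(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    return dot / (math.sqrt(sum(a * a for a in d1)) * math.sqrt(sum(b * b for b in d2)))

# Hypothetical symptom vectors and term-frequency vectors.
obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 0, 1, 0, 1, 0]
doc_1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
doc_2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
print(symmetric_dissim(obj_i, obj_j), asymmetric_dissim(obj_i, obj_j),
      jaccard_sim(obj_i, obj_j), cosine_sim(doc_1, doc_2))
```

Note that the asymmetric dissimilarity and the Jaccard coefficient sum to 1, since both ignore t.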
DATA QUALITY: Accuracy: correct/wrong, accurate/not. Completeness: not recorded,
unavailable. Consistency: some modified but some not, dangling. Timeliness: timely update?
Believability: trustable? Interpretability: easy to understand? Inconsistent: age/birthday;
rating in number / rating in letters
Data Cleaning: Missing attr values: ignore the tuple, fill in manually, use a global constant
(Unknown), (use global/class mean/median, use most probable value) -> may bias data.
Process: 1. discrepancy detection; 2. data transformation. Noisy Data: A random error or
variance in the measured variable. “Smooth.” Binning: smooth a sorted value by consulting the
values around it (local). Put data into bins and smooth by bin means, medians or boundaries
(replace each value with the closest bin boundary). Partitioning: equal-width or equal-depth.
Regression: Conforms data values to a function. Outlier analysis: Detect by clustering. Values
outside of all clusters are outliers. Data Integration: merging data from multiple stores.
Entity id prob: do entities (attrs) from diff src match up? Redundancy: An attr can be derived
from another. Detect by correlation analysis.
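As an illustration of the binning step described under Noisy Data, here is a short sketch of equal-depth partitioning followed by smoothing by bin means and by bin boundaries. The helper names and the price list are assumptions for the example; ties between the two boundaries go to the lower one, which is an arbitrary choice.

```python
def equal_depth_bins(values, depth):
    """Sort the data and cut it into consecutive bins of `depth` values each."""
    data = sorted(values)
    return [data[i:i + depth] for i in range(0, len(data), depth)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

# Hypothetical sorted prices.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_depth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_bounds = smooth_by_boundaries(bins)
print(bins, by_means, by_bounds)
```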
\chi^2 Correlation Test: Test the correlation relationship between two attrs, A and B. Does not
imply causality. Make a table, A’s c values on columns and B’s r values on rows.
\chi^2 = \Sigma_{i=1}^{c} \Sigma_{j=1}^{r} (o_{ij} - e_{ij})^2 / e_{ij}
o_{ij}: observed freq of (A_i, B_j). e_{ij}: expected frequency
= (count(A=a_i) * count(B=b_j)) / n. Higher -> more likely correlated.
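A minimal sketch of the chi-square statistic computed from a contingency table of observed counts. The function name and the 2x2 counts are made up for illustration; the expected frequency of each cell is (row total * column total) / n, matching the e_{ij} definition above.

```python
def chi_square(observed):
    """Chi-square statistic for a table of observed counts (list of rows)."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected frequency e_ij
            chi2 += (o - e) ** 2 / e
    return chi2

# Hypothetical counts: rows are values of B, columns are values of A.
table = [[250, 200],
         [50, 1000]]
chi2_value = chi_square(table)
print(round(chi2_value, 2))
```

The value still has to be compared against a chi-square critical value for (r-1)(c-1) degrees of freedom to decide whether the attributes are correlated.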
Correlation Coefficient (Pearson’s prod moment coefficient): For numeric data.
r_{A,B} = \Sigma (a_i - \bar{A})(b_i - \bar{B}) / (n \sigma_A \sigma_B) = (\Sigma (a_i b_i) - n \bar{A} \bar{B}) / (n \sigma_A \sigma_B)
(=0: no linear correlation; <0 neg correlated; >0 pos). Covariance: Compare to expected val (mean),
Cov_{A,B} = \Sigma (a_i - \bar{A})(b_i - \bar{B}) / n
(>0: A, B both tend to be > expected val; <0: A > expected val -> B likely < expected val).
Some pairs of random variables may have a covariance of 0 but
are not independent. Only under some additional assumptions (e.g., the data follow
multivariate normal distributions) does a covariance of 0 imply independence.
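The covariance and Pearson coefficient can be checked numerically with a sketch like the one below. The function names and the two short series are illustrative assumptions; both formulas divide by n (population form), as in the notes.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(a, b):
    """Population covariance: sum of (a_i - mean_a)(b_i - mean_b), divided by n."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def pearson(a, b):
    """r = Cov(A, B) / (sigma_A * sigma_B), with population standard deviations."""
    sigma_a = math.sqrt(covariance(a, a))
    sigma_b = math.sqrt(covariance(b, b))
    return covariance(a, b) / (sigma_a * sigma_b)

# Hypothetical paired observations.
series_a = [2, 3, 5, 4, 6]
series_b = [5, 8, 10, 11, 14]
cov_ab = covariance(series_a, series_b)
r_ab = pearson(series_a, series_b)
print(cov_ab, round(r_ab, 4))
```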
Data Reduction: Dimensionality: (eg remove unimportant attrs) Wavelet transforms:
Decompose a signal into different frequency subbands, preserve relative distance between objects
at different levels of resolution, Similar to discrete Fourier transform (DFT), but better lossy
compression, localized in space. Principal Components Analysis (PCA): Find a projection
that captures the largest amount of variation in data, The original data are projected onto
a much smaller space, resulting in dimensionality reduction. 1. Normalize input data, 2.
Compute k orthonormal (unit) vectors, 3. Each input data (vector) is a linear combination
of them, 4. Sort by strength, delete weak components. Works for numeric data only. Feature
subset selection: remove redundant and irrelevant attributes, Best single attribute, Best
step-wise selection, step-wise removal. Feature creation. Numerosity (Data Reduction):
Regression and Log-Linear Models,
[Histograms, clustering, sampling], Data cube aggregation; Data compression.
Wavelet: O(n). DWT (for linear signal processing, multi-resolution analysis): store only a
small fraction of the strongest wavelet coefficients. Similar to DFT, but better lossy
compression, localized in space. PCA: Find a projection that captures the largest amount of
variation in data. Project into smaller space. Attr Subset Selection: Reduce 1. Redundant
attrs (duplicate much info, eg price+tax); 2. Irrelevant attrs (no useful info, eg student_id
for predicting GPA). Heuristic Search: 2^d possible comb of d attrs. 1. Best single attr under
the attr indep assumption: choose by significance tests; 2. Best step-wise feature selection:
The best single-attr is picked first; Then next best attr conditioned on the first, …; 3.
Step-wise attr elimination: Repeatedly eliminate the worst attr; 4. Best combined attr
selection and elimination; 5. Optimal branch and bound: Use attr elimination and backtracking.
Attr/Feature Gen: Create new attr to capture info better: Attr extraction; Mapping to new
space (data reduction); Attr construction: combine, discretization.
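Data Transformation: Min-Max Norm:
v_i' = ((v_i - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A.
z-score Norm: v_i' = (v_i - \bar{A}) / \sigma_A.
Decimal Scaling Norm: v_i' = v_i / 10^j
(j: smallest integer that guarantees all |v_i'| < 1).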
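A brief sketch of the three normalizations. The function names and the numeric inputs are hypothetical; the min-max version includes the shift to the new minimum, and decimal scaling picks the smallest j for which every scaled magnitude is below 1.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Distance of v from the mean in units of standard deviation."""
    return (v - mean_a) / std_a

def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer such that max |v'| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

# Hypothetical income attribute: min 12,000, max 98,000, mean 54,000, std 16,000.
mm = min_max(73600, 12000, 98000)
zs = z_score(73600, 54000, 16000)
scaled, j = decimal_scaling([-986, 917])
print(round(mm, 3), zs, scaled, j)
```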
Data Discretization: Divide the range of a continuous attribute into intervals. Binning
(top-down, unsupervised), histogram (td, un), cluster (td/bu, un), decision tree (td, su),
correlation analyses (BU merge, un).
Classification vs Correlation Analysis: C: supervised, with class; top-down, using entropy to
split. CA: su; BU merge: find neighboring intervals (by \chi^2 values) to merge.
Concept Hierarchy Generation: specify by user; by explicit data grouping; spec of a set of
attr; spec of only a partial set.
Data Warehouse: subject-oriented, integrated, time-variant, and nonvolatile collection of
data in support of management’s decision-making process. DB (OLTP) vs Data Warehouses
(OLAP): Users and sys orientation: customer-oriented (clients, clerks) vs market-oriented
(knowledge workers); Data contents, db design, view, access patterns. 3-tier: warehouse db
server; OLAP server; front-end client. 3-models: Enterprise warehouse: collects all of the
info about subjects spanning the entire org; Data Mart: a subset of corporate-wide data that
is of value to a specific group of users. Its scope is confined to specific, selected
groups, such as a marketing data mart; Virtual warehouse: A set of views over operational db.
Only some of the possible summary views may be materialized. Metadata Repo: DW structure; op
metadata; algorithms for summarization; mapping from the op env to DW; perf; business.
Dim tables: such as item (item_name, brand, type). Fact table: contains measures (such as
dollars_sold) and keys to each of the related dim tables.
Measures: Distributive: sum(), count(), min(), max(). Algebraic: avg(), min_N(), max_N(),
stddev(). Holistic: median(), mode(), rank().
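To make the three measure categories concrete, the sketch below aggregates a data set that has been split into partitions. The partition contents are made up: sum() combines directly from per-partition sums (distributive), avg() combines from a bounded number of per-partition values, here (sum, count) (algebraic), while median() cannot be derived from per-partition medians (holistic).

```python
import statistics

# Hypothetical data split into three partitions (e.g. sub-cubes).
partitions = [[3, 1, 4], [1, 5], [9, 2, 6, 5]]

# Distributive: the total is the sum of the partition sums.
total = sum(sum(p) for p in partitions)

# Algebraic: avg is computed from two distributive measures per partition.
partials = [(sum(p), len(p)) for p in partitions]
avg = sum(s for s, _ in partials) / sum(c for _, c in partials)

# Holistic: the true median needs all the data, not per-partition medians.
all_values = [v for p in partitions for v in p]
true_median = statistics.median(all_values)
median_of_medians = statistics.median(statistics.median(p) for p in partitions)

print(total, avg, true_median, median_of_medians)
```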
OLAP Op: roll-up: climb up a concept hierarchy (or drop a dim); drill-down: more detail;
slice: select on 1 dim; dice: select on 2+ dim; pivot: rotate; drill across: involving
(across) >1 fact table. drill through: from the bottom lvl of the cube to its back-end
relational tables (using SQL). Typical DW design process: 1. Business
process: eg orders, invoices. 2. Grain (atomic level of data). 3. Dims for each fact table
record. 4. Measures. SDB: Privacy, hidden hierarchy. Indexing: Bitmap: Index on a particular
column; Each value in the column has a bit vector: bit-op is fast; The length of the bit
vector: # of records in the base table; The i-th bit is set if the i-th row of the base table
has the value for the indexed column; not suitable for high cardinality domains. Join
Indices: Join index: JI(R-id, S-id) where R (R-id, ...) S (S-id, ...). It materializes relational join
in JI file and speeds up relational join. E.g. fact table: Sales and two dimensions city and
product: A join index on city maintains for each distinct city a list of R-IDs of the tuples
recording the Sales in the city. Join indices can span multiple dimensions.
Cubes: If there is no concept hierarchy, n dim, p base cells, c common dim. Total Cuboid #
= 2^n, aggregated # = 2^n - 1. Total Cell = Total - overlapping cells * overlapping times.
Total = p * 2^n. Overlapping: count * 2^{common dim}. Closed cell: c is closed if there exists
no cell d such that d is a descendant of c and has the same measure value, i.e. c cannot be
specialized to obtain the same measure value. All base cells are closed; the apex cell usually
is NOT. Iceberg cells: having count >= min support. Cube Shell: Precompute only the cuboids
with a small # of dim.
Fragment cubes’ space req: O(T * ceil(D/F) * (2^F - 1)), for T tuples, D dims, frag size F.
Multi-way Array Aggr: Array-based BU algorithm; Using multi-dim chunks; No direct tuple
comparisons; Simultaneous aggr on multiple dim; Intermediate aggr values are re-used for
computing ancestor cuboids; Can’t do Apriori pruning -> No iceberg optimization. Best order:
min the mem and I/Os; Method: the planes should be sorted and computed according to their
size in ascending order. Idea: keep the smallest plane in the main memory, fetch and compute
only one chunk at a time for the largest plane. Limitation: works well only for a small # of
dim. For large: TD and iceberg cube computation. BUC: BU cube comp; Divides dims into
parts and facilitates iceberg pruning; If a part does not satisfy min_sup, its descendants
can be pruned; If min_sup = 1, comp full cube; No simultaneous aggregation. Star-Cubing:
Explore shared dimensions; eg dim A is the shared dim of ACD and AD. Allows for shared
computations; cuboid AB is comped simultaneously with ABD. Aggregate in a TD manner but with
the BU sub-layer underneath: allow Apriori pruning. Shared dim grow in BU. (Lossless
compression.) High-Dimensional OLAP: the ONLY one that handles high dim. Challenge: high-dim;
Iceberg cube and compressed cube: only delay the inevitable explosion; Full
materialization: still significant overhead in accessing results on disk. Step: 1. Part the
set of dim into shell fragments; 2. Comp data cubes for each shell frag while retaining
inverted indices or value-list indices; 3. Given the pre-computed frag cubes, dynamically
compute cube cells of the high-dimensional data cube online. Properties: Partitions the data
vertically. Reduces high-dimensional cube into a set of lower dimensional cubes. Online
re-construction of original high-dimensional space. Lossless reduction. Offers tradeoffs
between the amount of pre-processing and the speed of online computation. Frag-Shells:
Part set of dim (A1,...,An) into a set of k frags (P1,...,Pk).
Scan base table once and do the following
  insert <tid, measure> into ID_measure table.
  for each attribute value ai of each dimension Ai
    build inverted index entry <ai, tidlist>
For each frag partition Pi


  build local fragment cube Si by intersecting tid-lists in bottom-up fashion.

Process Query: 1. Divide the query into frags; 2. Fetch the corresponding TID list for each
frag; 3. Intersect the TID lists from each frag to construct the instantiated base table; 4.
Compute the data cube using the base table with any cubing algo.
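The inverted-index construction and the TID-list intersection used at query time can be sketched as below. The tiny base table, the function names, and the assumption that attribute values are unique across dimensions (a*, b*, c*) are all illustrative; the per-fragment cube precomputation is left out.

```python
from collections import defaultdict

# Hypothetical base table: tid -> (A, B, C) values.
base_table = {
    1: ("a1", "b1", "c1"),
    2: ("a1", "b2", "c1"),
    3: ("a1", "b2", "c1"),
    4: ("a2", "b1", "c1"),
    5: ("a2", "b1", "c1"),
}

def build_inverted_index(table):
    """One scan of the base table: map each attribute value to its TID list."""
    index = defaultdict(set)
    for tid, row in table.items():
        for value in row:
            index[value].add(tid)
    return index

def query_tids(index, values):
    """Intersect the TID lists of the instantiated query values."""
    tid_sets = [index.get(v, set()) for v in values]
    return set.intersection(*tid_sets) if tid_sets else set()

index = build_inverted_index(base_table)
matching = query_tids(index, ["a1", "b2"])
print(sorted(matching), len(matching))   # the count measure is the TID-list size
```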

OLAP on Survey Data: Semantics of query is unchanged, input data has changed.
Confidence Interval: \bar{x} ± t_c * \hat{\sigma}_{\bar{x}}, with \hat{\sigma}_{\bar{x}} = s / sqrt(l). x is a sample of
the data set (l values, std dev s); \bar{x} is the mean of the sample. t_c is the critical
t-value, found by a look-up.
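A minimal sketch of the confidence-interval computation. The sample values and function name are made up; t_c is passed in by the caller because it comes from a t-table look-up (2.776 is the two-sided 95% value for 4 degrees of freedom).

```python
import math
import statistics

def confidence_interval(sample, t_c):
    """Return (low, high) = mean -/+ t_c * s / sqrt(l) for a sample of size l."""
    l = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)          # sample standard deviation (n - 1)
    std_err = s / math.sqrt(l)            # estimated standard error of the mean
    return x_bar - t_c * std_err, x_bar + t_c * std_err

# Hypothetical sample of 5 values; t_c looked up for 95% confidence, df = 4.
low, high = confidence_interval([10, 12, 14, 16, 18], t_c=2.776)
print(round(low, 3), round(high, 3))
```

A cell with few samples yields a wide interval, which is the situation the expansion methods that follow try to improve.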

Intra-Cuboid Expansion: Combine other cells’ data into own to “boost” confidence. Cell
segment similarity: Two-sample t-test (confidence-based).
Inter-Cuboid Expansion: If a query dimension is not correlated with the cube value but is
causing a small sample size by drilling down too much, remove the dimension (i.e., generalize
to *) and move to a more general cuboid.
(top-k) ranking query: only returns the best k results according to a user-specified
preference, consisting of (1) a selection condition and (2) a ranking function. Build a
ranking cube on both selection dimensions and ranking dimensions.
Materialize Ranking-Cube: 1. Part Data on Ranking dims; 2. Group data by Selection dims;
3. Compute Measures for each group.
RC Execution Trace: 1. Retrieve High-level measure for LA {11, 15}. 2. Estimate lower bound
score for block 11, 15: f(block 11) = 40,000, f(block 15) = 160,000. 3. Retrieve block 11. 4.
Retrieve low-level measure for block 11. 5. f(t6) = 130,000, f(t7) = 97,600. 6. Output t.
Ranking Cube: Methodology: Push selection and ranking simultaneously; It works for many
sophisticated ranking functions. Support high-dim: 1. Materialize only those atomic
cuboids that contain single selection dimensions; 2. Uses an idea similar to high-dim
OLAP; 3. Achieves low space overhead and high performance in answering ranking queries
with a high number of selection dimensions.

Patterns: Closed: X is frequent and there exists no super-pattern Y \supset X with the same
support as X. Closed patterns are a lossless compr of freq. patterns, reducing the # of
patterns. Max-pattern: X is frequent and there exists no frequent super-pattern Y \supset X.
Absolute sup: Count(X). Sup: abs_sup / count_all. Freq: sup >= threshold. Confidence:
conditional P that a transaction containing X also contains Y.
Apriori: 1. Initially, scan DB once to get frequent 1-itemsets; 2. Generate length (k+1)
candidate itemsets from length k frequent itemsets; 3. Test the candidates against DB; 4.
Terminate when no frequent or candidate set can be generated.
Improve Apriori: Challenges: 1. Multiple scans of transaction database; 2. Huge number of
candidates; 3. Tedious workload of support counting for candidates. Improve: 1. Reduce passes
of transaction database scans. 2. Shrink number of candidates. 3. Facilitate support counting
of candidates.
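The four Apriori steps listed above can be sketched as follows. This is a plain, unoptimized illustration (none of the listed improvements are applied); the function name, the transaction database and the absolute min_sup of 2 are assumptions for the example.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset(itemset): absolute support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Step 1: one scan to count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_sup}
    result = dict(frequent)
    k = 1
    while frequent:                       # Step 4: stop when nothing is frequent
        # Step 2: join frequent k-itemsets; keep a (k+1)-candidate only if all
        # of its k-subsets are frequent (Apriori pruning).
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in frequent for sub in combinations(union, k)
            ):
                candidates.add(union)
        # Step 3: test the candidates against the database.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_sup}
        result.update(frequent)
        k += 1
    return result

# Hypothetical transaction database.
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
frequent_itemsets = apriori(db, min_sup=2)
for itemset, sup in sorted(frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```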

FP-Growth: 1) Scan the data and find the frequent length-1 items. 2) For each frequent
item, construct its conditional pattern-base, and then its conditional FP-tree; 3) Repeat
recursively on each new conditional FP-tree; 4) Until the resulting FP-tree is empty, or it
contains only one path.
FP Adv: Completeness: 1. Preserve complete information for frequent pattern mining. 2.
Never break a long pattern of any transaction. Compactness: 1. Reduce irrelevant info:
infrequent items are gone. 2. Items in frequency descending order: the more frequently
occurring, the more likely to be shared. 3. Never larger than the original database (not
counting node-links and the count field).
Adv of Pattern Growth Approach: Divide-and-conquer: 1. Decompose both the mining task
and DB according to the frequent patterns obtained so far. 2. Lead to focused search of
smaller databases. Other factors: 1. No candidate generation, no candidate test. 2.
Compressed database: FP-tree structure. 3. No repeated scan of entire database. 4. Basic
ops: counting local freq items and building sub FP-tree, no pattern search and matching.

Null invariant measure: the value is free from the influence of null transactions (those that
do not contain any of the examined items). Jaccard, cosine, all_conf and Kulczynski are four
null-invariant measures. Three categories of constraints: A M S
Antimonotonic: If an itemset does not satisfy the constraint, none of its supersets can
satisfy it. Monotonic: If an itemset satisfies the constraint, so do its supersets. Succinct:
Enumerate all and only those sets that are guaranteed to satisfy the constraint.
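To illustrate null-invariance, the sketch below compares Kulczynski with lift on the same item counts while only the number of null transactions changes. The counts and function names are made up. Kulczynski depends only on sup(AB), sup(A) and sup(B), so it cannot change; lift depends on the total transaction count n and flips from "negative" (< 1) to "positive" (> 1).

```python
def kulczynski(sup_ab, sup_a, sup_b):
    """Kulc(A, B) = (P(A|B) + P(B|A)) / 2, computed from absolute supports."""
    return 0.5 * (sup_ab / sup_a + sup_ab / sup_b)

def lift(sup_ab, sup_a, sup_b, n):
    """lift(A, B) = P(AB) / (P(A) P(B)); depends on the total count n."""
    return (sup_ab / n) / ((sup_a / n) * (sup_b / n))

# Hypothetical counts: 100 transactions with both A and B, 1,000 with only A,
# 1,000 with only B; the number of null transactions varies.
sup_ab, sup_a, sup_b = 100, 1100, 1100
n_few_nulls = 2100 + 100          # 100 null transactions
n_many_nulls = 2100 + 100000      # 100,000 null transactions

kulc = kulczynski(sup_ab, sup_a, sup_b)
lift_few = lift(sup_ab, sup_a, sup_b, n_few_nulls)
lift_many = lift(sup_ab, sup_a, sup_b, n_many_nulls)
print(round(kulc, 4), round(lift_few, 4), round(lift_many, 4))
```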

Pattern Fusion: Initialization (Initial pool): Use an existing algorithm to mine all frequent
patterns up to a small size, e.g., 3. Iterative Pattern Fusion: 1. At each iteration, k seed
patterns are randomly picked. 2. For each one picked, find all the patterns within a
bounding ball centered at the seed pattern. 3. All these patterns found are fused together
to generate a set of super-patterns. All the super-patterns thus generated form a new pool
for the next iteration. Termination: when the current pool contains no more than K
patterns at the beginning of an iteration.

Count Cells: (b1, b2, a3, a4, a5, ..., a9, a10):count=10; (b1, a2, b3, a4, a5, ..., a9,
a10):count=20; (a1, b2, b3, a4, a5, ..., a9, a10):count=50.
  1. Total non-empty cuboids: 2^10.
  2. Each base cell generates 2^10 - 1 non-base cells, so there are 3(2^10 - 1) non-base cells before merging duplicates. From (b1, *, *, a4, a5, ..., a9, a10), we get 2^7 cells that overlap once, from base cells 1 and 2. The measure of all such cells is count=10+20=30. Similarly, from (*, b2, *, a4, a5, ..., a9, a10), we get 2^7 cells that overlap once, from base cells 1 and 3. The measure is count=10+50=60. Similarly, from (*, *, b3, a4, a5, ..., a9, a10), we get 2^7 cells that overlap once, from base cells 2 and 3. The measure is count=20+50=70. So there are in total 3 * 2^7 cells that overlap once. From (*, *, *, a4, a5, ..., a9, a10), we get 2^7 cells that overlap twice. The measure is count=10+20+50=80. Since these overlap twice, we should remove them twice, i.e., 2 * 2^7. Finally, the number of non-base cells is 3(2^10 - 1) - 3 * 2^7 - 2 * 2^7 = 3 * 8 * 2^7 - 3 - 3 * 2^7 - 2 * 2^7 = 19 * 2^7 - 3. (Nonempty aggregate cells).
  3. If count >= 0, add up cells, don’t count overlap.
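The cell count in step 2 can be cross-checked by brute force: enumerate every generalization of the three base cells and merge duplicates. This sketch assumes the third base cell has count 50, which is what the sums 10+50 and 20+50 in the text imply; the variable and function names are illustrative.

```python
from itertools import product

# Dimensions 4..10 hold the shared values a4..a10 in all three base cells.
dims_tail = tuple(f"a{i}" for i in range(4, 11))
base_cells = {
    ("b1", "b2", "a3") + dims_tail: 10,
    ("b1", "a2", "b3") + dims_tail: 20,
    ("a1", "b2", "b3") + dims_tail: 50,   # count inferred from the text's sums
}

def generalizations(cell):
    """All 2^n cells obtained by replacing any subset of values with '*'."""
    for mask in product((False, True), repeat=len(cell)):
        yield tuple("*" if starred else v for v, starred in zip(cell, mask))

# Aggregate the count measure over every generalization of every base cell.
counts = {}
for cell, count in base_cells.items():
    for g in generalizations(cell):
        counts[g] = counts.get(g, 0) + count

# Non-base (aggregate) cells contain at least one '*'.
aggregate_cells = {c: n for c, n in counts.items() if "*" in c}
apex = ("*",) * 10
print(len(aggregate_cells), aggregate_cells[apex])
```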

Apriori vs FP: ECLAT works on the item:tidlist format (vertical) and derives frequent patterns based on vertical intersections, while Apriori and FP-growth work on the tid:itemlist format (horizontal); FP-growth compresses the data into an FP-tree structure. Apriori generates a lot of candidates and may need many repeated scans of the database, while FP-growth generates no candidates and needs no repeated scans of the entire database. Apriori is breadth-first search, while FP-growth is depth-first search. ECLAT uses diffsets to accelerate mining.
[ADV FP] Multi-level Association: Flexible min-support thresholds: Some items are more valuable but less frequent. Redundancy Filtering: Some rules may be redundant due to “ancestor” relationships between items. A rule is redundant if its support is close to the “expected” value, based on the rule’s ancestor.
Multi-Dimensional Association. Quantitative Associations: Techniques can be categorized by how numerical attributes, such as age or salary, are treated. Static Discretization of Quantitative Attributes: Discretized prior to mining using a concept hierarchy. Numeric values are replaced by ranges. In a relational db, finding all frequent k-predicate sets will require k or k+1 table scans. A data cube is well suited for mining. The cells of an n-dimensional cuboid correspond to the predicate sets. Mining from data cubes can be much faster.
Negative and Rare Patterns: Rare patterns: Very low support but interesting. Neg: Ford Expedition vs Toyota Prius. Negatively correlated patterns that are infrequent tend to be more interesting than those that are frequent. Def:

  1. sup(X ∪ Y) < sup(X) × sup(Y) (not null-invariant: affected by null transactions); 2. (negative-itemset-based) a. X = Ā ∪ B (B is a set of positive items, Ā is a set of negative items, |Ā| ≥ 1) and s(X) ≥ μ; b.

(suffers too); 3. (Kulczynski) If itemsets X and Y are frequent but (P(X|Y) + P(Y|X))/2 < ε, where ε is a negative pattern threshold, then X and Y are negatively correlated. Constraint-based (Query-Directed) mining: an interactive process. User flexibility: the user provides constraints on what is to be mined. Optimization: the system exploits such constraints for efficient mining (constraint-pushing, similar to pushing selections first in DB query processing). Note: it still finds all the answers satisfying the constraints, not just some answers as in “heuristic search”. Constraints: Knowledge type: classification, association, etc. Data (SQL-like): find product pairs sold together in stores in Chicago this year. Dimension/level: in relevance to region, price, brand, customer category. Rule (or pattern): small sales (price < $10) trigger big sales (sum > $200). Interestingness: strong rules, e.g., min_support 3%, min_confidence 60%. Meta-Rule Guided: a meta-rule can be in rule form with partially instantiated predicates and constants. Method: 1. Find frequent (l+r) predicates (based on the min-support threshold); 2. Push constants deeply into the mining process when possible; 3. Use confidence, correlation, and other measures when possible. Constraint-Based Frequent Pattern Mining: Pattern space pruning constraints: Anti-monotonic: if constraint c is violated, further mining of that pattern can be terminated. Monotonic: if c is satisfied, there is no need to check c again. Succinct: c must be satisfied, so one can start with the data sets satisfying c. Convertible: c is neither monotonic nor anti-monotonic, but it can be converted into one of them if the items in the transaction are properly ordered. Data space pruning constraints: Data succinct: the data space can be pruned at the initial pattern mining process. Data anti-monotonic: if a transaction t does not satisfy c, t can be pruned from further mining.
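As a rough illustration of pattern space pruning with an anti-monotonic constraint, the sketch below enumerates itemsets level by level and never extends an itemset that already violates the constraint. The item prices and the constraint sum(S.price) <= 15 are hypothetical, and support counting is left out to keep the example short.

```python
from itertools import combinations

# Hypothetical item prices and constraint: sum(S.price) <= MAX_TOTAL.
# With non-negative prices this constraint is anti-monotonic.
prices = {"a": 4, "b": 7, "c": 12, "d": 3}
MAX_TOTAL = 15

def satisfies(itemset):
    return sum(prices[item] for item in itemset) <= MAX_TOTAL

def pruned_search(items):
    """Level-wise enumeration that only extends itemsets satisfying the constraint."""
    level = {frozenset([item]) for item in items if satisfies([item])}
    result = set()
    while level:
        result |= level
        # Supersets of a violating itemset are never generated.
        level = {
            s | {item}
            for s in level
            for item in items
            if item not in s and satisfies(s | {item})
        }
    return result

def brute_force(items):
    """Check every non-empty subset, for comparison with the pruned search."""
    return {
        frozenset(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
        if satisfies(combo)
    }

found = pruned_search(sorted(prices))
print(sorted(sorted(s) for s in found))
```

Because the constraint is anti-monotonic, the pruned search returns the same itemsets as the exhaustive check while skipping every superset of a violating itemset.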
Pattern Space Pruning with Anti-Monotonicity Constraints: A constraint C is anti-monotone if, whenever a super-pattern satisfies C, all of its sub-patterns do so too. Equivalently: if an itemset S violates the constraint, so does every superset of S. Pattern Space Pruning with Monotonicity Constraints: A constraint C is monotone if, once a pattern satisfies C, we do not need to check C in subsequent mining. Equivalently: if an itemset S satisfies the constraint, so does every superset of S. Data Space Pruning with Data Anti-monotonicity: A constraint c is data anti-monotone if, when a pattern p cannot be satisfied by a transaction t under c, no superset of p can be satisfied by t under c either. The key for data anti-monotonicity is recursive data reduction. Pattern Space Pruning with Succinctness: Given A1, the set of items satisfying a succinctness constraint C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1. Idea: whether an itemset S satisfies constraint C can be determined from the selection of items alone, without looking at the transaction database. A constraint on min(S.Price) is succinct; one on sum(S.Price) is not. Optimization: if C is succinct, C is pre-counting pushable. [Classification Basic] Supervised learning: labels known. Unsupervised: labels unknown. Applications: credit/loan approval; medical diagnosis; fraud detection; web page categorization.

Measure | Definition | Range | NI (null-invariant)
χ²(a, b) | $\sum_{i,j \in \{0,1\}} \frac{(e(a_i, b_j) - o(a_i, b_j))^2}{e(a_i, b_j)}$ | [0, ∞] | N
Lift(a, b) | $\frac{P(ab)}{P(a)P(b)}$ | [0, ∞] | N
AllConf(a, b) | $\frac{sup(ab)}{\max(sup(a), sup(b))}$ | [0, 1] | Y
Coherence(a, b) | $\frac{sup(ab)}{sup(a) + sup(b) - sup(ab)}$ | [0, 1] | Y
Cosine(a, b) | $\frac{sup(ab)}{\sqrt{sup(a)\,sup(b)}}$ | [0, 1] | Y
Kulc(a, b) | $\frac{sup(ab)}{2}\left(\frac{1}{sup(a)} + \frac{1}{sup(b)}\right)$ | [0, 1] | Y
MaxConf(a, b) | $\max\left(\frac{sup(ab)}{sup(a)}, \frac{sup(ab)}{sup(b)}\right)$ | [0, 1] | Y
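To make the null-invariance column concrete, here is a small sketch of several of the measures as plain functions of support counts. The function names and the example counts are assumptions made for illustration; `n` is the total number of transactions, which only lift needs.

```python
from math import sqrt

def lift(sup_a, sup_b, sup_ab, n):
    # Depends on n, so adding null transactions changes the value.
    return (sup_ab / n) / ((sup_a / n) * (sup_b / n))

def all_conf(sup_a, sup_b, sup_ab):
    return sup_ab / max(sup_a, sup_b)

def coherence(sup_a, sup_b, sup_ab):
    return sup_ab / (sup_a + sup_b - sup_ab)

def cosine(sup_a, sup_b, sup_ab):
    return sup_ab / sqrt(sup_a * sup_b)

def kulc(sup_a, sup_b, sup_ab):
    return (sup_ab / 2) * (1 / sup_a + 1 / sup_b)

def max_conf(sup_a, sup_b, sup_ab):
    return max(sup_ab / sup_a, sup_ab / sup_b)

# Hypothetical counts: a and b each occur 1000 times, together 100 times.
sup_a, sup_b, sup_ab = 1000, 1000, 100
for n in (10_000, 1_000_000):  # same pattern counts, more null transactions
    print(n, lift(sup_a, sup_b, sup_ab, n), kulc(sup_a, sup_b, sup_ab))
```

Only the number of null transactions differs between the two runs: lift moves from about 1 to about 100, while Kulc and the other null-invariant measures stay the same because they never look at `n`.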

form of p1 ^ p2 … ^ pl : class = C” (conf, sup); 3. Organize the rules to form a rule-based classifier. Why effective: 1. Explores highly confident associations and may overcome some constraints introduced by decision-tree induction; 2. Associative classification is more accurate than some traditional classification methods, such as C4.5. Accuracy issue: increase the discriminative power; increase the expressive power of the feature space. Scalability issue: it is infeasible to generate all feature combinations and filter them with an info gain threshold. Feature Selection: using the full set of frequent patterns can cause overfitting; need to single out the discriminative patterns and remove redundant ones. Max Marginal Relevance (MMR) is borrowed: a doc has high marginal relevance if it is both relevant to the query and has minimal marginal similarity to previously selected documents. DDPMine: Branch-and-Bound Search: during the recursive FP-growth mining, use a global variable to record the most discriminative itemset. Before proceeding to construct a conditional FP-tree, first estimate the upper bound of the info gain. If the upper bound is not greater than the current best value, we can safely skip this conditional FP-tree as well as any FP-tree recursively constructed from it. This integrates the feature selection mechanism into the mining framework.

[CLUSTERING]

Cluster: a collection of data objs that are similar to one another within the same group and dissimilar to objs in other groups. Cluster analysis: unsupervised. Applications: Biology: taxonomy. Information retrieval: document clustering. Land use: identification of areas of similar land use; Marketing: discover distinct customer groups. City planning: identify groups of houses; Earthquake studies: clusters along continental faults; Climate: understanding earth climate; Economic science: market research.

Dendrogram: a tree-like diagram showing how clusters are merged. A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

Distance b/w 2 clusters: Single link: smallest dist b/w elems, min dist(t_ip, t_jq); Complete link: largest dist b/w elems, max dist(t_ip, t_jq); Average: avg dist b/w elems, avg dist(t_ip, t_jq); Centroid: dist b/w the centroids, dist(C_i, C_j); Medoid: dist b/w the medoids, dist(M_i, M_j), where a medoid is a chosen, centrally located obj in the cluster.

Centroid: the “middle” of a cluster; Radius: sqrt of the avg squared dist from any point of the cluster to its centroid; Diameter: sqrt of the avg squared dist b/w all pairs of pts in the cluster.
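The inter-cluster distances listed above can be written down directly. This is a minimal sketch that assumes clusters are lists of coordinate tuples and that dist is Euclidean (the notes do not fix a particular metric); the function names are illustrative.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+

def single_link(ci, cj):
    """Smallest distance between an element of ci and an element of cj."""
    return min(dist(p, q) for p, q in product(ci, cj))

def complete_link(ci, cj):
    """Largest distance between an element of ci and an element of cj."""
    return max(dist(p, q) for p, q in product(ci, cj))

def average_link(ci, cj):
    """Average distance over all cross-cluster pairs."""
    return sum(dist(p, q) for p, q in product(ci, cj)) / (len(ci) * len(cj))

def centroid(cluster):
    """Coordinate-wise mean of the cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def centroid_link(ci, cj):
    return dist(centroid(ci), centroid(cj))

ci = [(0, 0), (0, 2)]
cj = [(3, 0), (5, 0)]
print(single_link(ci, cj), complete_link(ci, cj),
      average_link(ci, cj), centroid_link(ci, cj))
```

For any pair of clusters the single-link value is never larger than the average-link value, which in turn is never larger than the complete-link value.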

1. PARTITIONING: Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors. k-means: 1. Partition objs into k nonempty subsets; 2. Compute seed pts as the centroids (means) of the clusters; 3. Assign each obj to the cluster w/ the nearest seed point; 4. Go to 2, or stop when assignments no longer change. Pros: efficient, O(tkn) for n objs, k clusters, t iterations. For comparison: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k)). Cons: applicable only to continuous data (k-medoids can be applied to a wider range of data); need to specify k in advance; sensitive to noisy data and outliers; not suitable for non-convex shapes. Variations: k-modes: uses the mode instead of the mean; k-medoids: uses the most centrally located obj instead. PAM: starts from an init set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total dist; CLARA: PAM on samples; CLARANS: randomized re-sampling.
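The k-means loop above can be sketched in a few lines. This version is a simplification under stated assumptions: it seeds the centroids with the first k points instead of an initial partition, uses Euclidean distance, and keeps a centroid unchanged if its cluster becomes empty.

```python
from math import dist

def kmeans(points, k, max_iter=100):
    # Assumption: the first k points serve as the initial seed points.
    centroids = [points[i] for i in range(k)]
    assignment = None
    for _ in range(max_iter):
        # Assign each object to the cluster with the nearest seed point.
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:  # stop when nothing changes
            break
        assignment = new_assignment
        # Recompute each seed point as the mean of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment

points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, k=2)
print(assignment)  # [0, 1, 0, 0, 1, 1]
```

Each iteration costs O(kn) distance computations, which is where the O(tkn) bound comes from.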

2. HIERARCHICAL: Create a hierarchical decomposition of the data. Uses a dist matrix as the clustering criterion. Needs a termination condition. AGNES: 1. Use the single-link method and the dissimilarity matrix; 2. Merge nodes that have the least dissimilarity; 3. Go on in a non-descending fashion; 4. Eventually all nodes belong to 1 cluster. DIANA: inverse order of AGNES; eventually each node forms its own cluster. Pros: does not require k. Cons: can never undo what was done previously; does not scale well: at least O(n^2); nontrivial to choose a good dist measure; hard to handle missing attr values; optimization goal not clear: heuristic, local search.
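A naive version of the AGNES merge loop, given only as a sketch: it recomputes single-link distances from scratch at every step (far slower than a real implementation), assumes Euclidean distance, and uses a target number of clusters as the termination condition. The function name and the sample points are illustrative.

```python
from math import dist

def agnes_single_link(points, n_clusters=1):
    """Repeatedly merge the two clusters with the smallest single-link distance."""
    clusters = [[p] for p in points]
    merge_distances = []
    while len(clusters) > n_clusters:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merge_distances.append(d)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is still valid
    return clusters, merge_distances

points = [(0,), (1,), (5,), (6,), (20,)]
clusters, merge_distances = agnes_single_link(points, n_clusters=2)
print(clusters, merge_distances)
```

The recorded merge distances come out in non-descending order, which is what lets a dendrogram be cut at a chosen level.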

3. INTEGRATION OF HIERARCHICAL & DIST-BASED CLUSTERING: BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters. Algo: Phase 1: scan the DB to build an init in-memory CF-tree; Phase 2: use an arbitrary clustering algo to cluster the leaf nodes of the CF-tree. Pros: scales linearly, O(n); finds a good clustering w/ a single scan and improves the quality w/ a few additional scans. Cons: only numeric data; sensitive to the order of the data records; the size of leaf nodes is fixed, so clusters may not be so natural; clusters tend to be spherical given the radius and diameter measures. A CF-tree is a height-balanced tree that stores CF = (N, LinearSum, SqSum), N = # of pts. CF-tree params: Branching factor: max # of children; Threshold: max diameter of sub-clusters stored at the leaf nodes. Cluster diameter: $D = \sqrt{\frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - x_j)^2}$. For each point in the input: 1. Find the closest leaf entry; 2. Add the point to the leaf entry and update its CF; 3. If the entry diameter > max_diameter, split the leaf, and possibly its parents.

CHAMELEON: hierarchical clustering using dynamic modeling. Measures similarity based on a dynamic model: two clusters are merged only if the interconnectivity and closeness (proximity) b/w the 2 clusters are high relative to the internal ... w/in the clusters. Algo: 1. Use a graph-partitioning algorithm to cluster objs into a large # of small sub-clusters; 2. Use an agglomerative hierarchical clustering algo to merge sub-clusters. Inter-connectivity: Relative closeness: Merge sub-clusters:
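For the BIRCH clustering feature described above, the sketch below shows why the triple (N, LinearSum, SqSum) is enough: two CFs merge by component-wise addition, and the diameter can be computed from the CF alone via the identity sum over i, j of (x_i - x_j)^2 = 2·N·SS - 2·|LS|^2. Function names are illustrative, and SqSum is taken here as the scalar sum of squared norms, which is an assumption about the notes' notation.

```python
from math import dist, sqrt

def cf_of(points):
    """Clustering feature (N, LinearSum, SqSum) of a list of coordinate tuples."""
    n = len(points)
    linear_sum = tuple(sum(coords) for coords in zip(*points))
    sq_sum = sum(v * v for p in points for v in p)
    return n, linear_sum, sq_sum

def cf_merge(a, b):
    """CFs are additive: merging two sub-clusters just adds the components."""
    return a[0] + b[0], tuple(x + y for x, y in zip(a[1], b[1])), a[2] + b[2]

def cf_diameter(cf):
    """Diameter computed from the CF only, without revisiting the points."""
    n, linear_sum, sq_sum = cf
    if n < 2:
        return 0.0
    ls_norm_sq = sum(v * v for v in linear_sum)
    return sqrt((2 * n * sq_sum - 2 * ls_norm_sq) / (n * (n - 1)))

def brute_diameter(points):
    """Direct evaluation of the diameter formula, for comparison."""
    n = len(points)
    total = sum(dist(p, q) ** 2 for p in points for q in points)
    return sqrt(total / (n * (n - 1)))

a, b = [(0, 0), (2, 0)], [(0, 2)]
merged = cf_merge(cf_of(a), cf_of(b))
print(merged, cf_diameter(merged), brute_diameter(a + b))
```

This additivity is what allows a leaf entry to absorb a new point and re-check the diameter threshold in constant time per dimension.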

4. PROBABILISTIC HIERARCHICAL CLUSTERING: (see cons of 2.) Uses probabilistic models to measure dists b/w clusters. Easy to understand; same efficiency as algorithmic agglomerative clustering methods; can handle partially observed data. Generative model: regard the set of data objs to be clustered as a sample of the underlying data generation mechanism to be analyzed. Adopt common distribution funcs, e.g., Gaussian, Bernoulli. The likelihood that X is generated by model N: L(N:X) = P(X|N). [dist(Ci, Cj) = -log (P(Ci ∪ Cj) / (P(Ci)P(Cj)))] Algo: [ Create a cluster for each obj: Ci = {oi}, 1 ≤ i ≤ n; For (i = 1 to n) { Find the pair of clusters Ci and Cj such that Ci, Cj = argmax(i≠j) (-dist(Ci, Cj)); If ((-dist(Ci, Cj)) > 0) then { merge Ci and Cj } } ].

5. DENSITY-BASED APPROACH: Based on connectivity and density functions. Pros: discovers clusters of arbitrary shape; handles noise; one scan. Cons: needs density parameters as termination condition. Two params: Eps: max radius of the neighborhood; MinPts: min # of pts in an Eps-neighborhood. N_Eps(q) = {p ∈ D | dist(p, q) ≤ Eps}. Directly density-reachable: p is directly density-reachable from q if 1. p ∈ N_Eps(q); 2. core point condition: |N_Eps(q)| ≥ MinPts. Density-reachable: p is density-reachable from q if there is a chain of pts p1 = q, ..., pn = p such that p(i+1) is directly density-reachable from pi. Density-connected: p and q are density-connected if there is a point o such that both p and q are density-reachable from o. DBSCAN: a cluster is defined as a maximal set of density-connected pts. Pros: supports arbitrary shapes w/ noise. Cons: sensitive to params. Algo: 1. Arbitrarily select a point p; 2. Retrieve all pts density-reachable from p; 3. If p is a core point, a cluster is formed; 4. If p is a border point, no pts are density-reachable from p, so visit the next pt; 5. Continue until all pts have been processed. Time: O(n log n) with a spatial index, otherwise O(n^2). OPTICS: Ordering Points To Identify the Clustering Structure. Produces a special order of the DB based on its density-based clustering structure; this ordering contains info equivalent to the density-based clusterings corresponding to a broad range of param settings. Pros: good for both automatic and interactive cluster analysis, including finding the intrinsic clustering structure; easy visualization. Time: O(n log n). Core distance of p: MinPts_dist(p), the dist from p to its MinPts-th nearest neighbour. Reachability dist of p from o: max(core-dist(o), dist(o, p)). DENCLUE: uses statistical density funcs. Pros: solid math foundation; good for data sets w/ large amounts of noise; allows a compact math description of arbitrarily shaped clusters in high-dim data; significantly faster than existing algos. Cons: needs a large # of parameters.
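The DBSCAN steps above can be sketched without any spatial index, which gives the O(n^2) variant. This is an illustrative implementation under assumptions: Euclidean distance, a neighborhood that includes the point itself, and -1 as the noise label.

```python
from math import dist

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # N_Eps(points[i]); linear scan, so the whole run is O(n^2).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        labels[i] = cluster_id  # i is a core point: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is also core: keep expanding
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))
```

Replacing the linear scan in `neighbors` with a spatial index query is what brings the running time down toward O(n log n).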

DENCLUE Gaussian influence function: $f_G(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}$ (influence of y on x); $f_G^D(x) = \sum_{i} f_G(x, x_i)$ (total influence on x); $\nabla f_G^D(x, x_i) = \sum_{i} (x_i - x)\, f_G(x, x_i)$ (gradient of x in the direction of x_i).

Uses grid cells but only keeps information about grid cells that actually contain data pts, and manages these cells in a tree-based access structure. Influence function: describes the impact of a data point w/in its neighborhood. Overall density of the data space: the sum of the influence functions of all data pts. Clusters can be determined mathematically by identifying density attractors, which are local maxima of the overall density function. Center-defined clusters: assign to each density attractor the pts density-attracted to it. Arbitrary-shaped clusters: merge density attractors that are connected through paths of high density (> threshold).
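The influence function, overall density, and gradient used by DENCLUE can be evaluated directly for small data sets. The sketch below assumes the Gaussian influence function with Euclidean distance; the function names and sample points are illustrative, and the grid and tree structures are omitted.

```python
from math import dist, exp

def influence(x, y, sigma):
    """Gaussian influence of data point y on location x."""
    return exp(-dist(x, y) ** 2 / (2 * sigma ** 2))

def density(x, data, sigma):
    """Overall density at x: the sum of the influences of all data points."""
    return sum(influence(x, xi, sigma) for xi in data)

def gradient(x, data, sigma):
    """Sum of (x_i - x) weighted by influence; points toward denser regions."""
    return tuple(
        sum((xi[d] - x[d]) * influence(x, xi, sigma) for xi in data)
        for d in range(len(x))
    )

data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(density((0.0, 0.0), data, sigma=1.0))
print(gradient((5.0, 5.0), data, sigma=1.0))
```

Following the gradient uphill from a point (hill climbing) leads to a density attractor, i.e., a local maximum of the overall density function.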

6. GRID-BASED APPROACH: based on a multiple-level granularity structure. STING: divide the space into rectangular cells; levels of cells correspond to levels of resolution; stats of each cell are pre-calculated; higher-level stats are easily computed from lower-level ones (LOW -> HIGH); top-down approach to answer queries; start from a pre-selected layer, typically w/ a small # of cells; for each cell in the current level compute the confidence interval. Algo: 1. Remove the irrelevant cells; 2. When finished examining the current layer, proceed to the next lower level; 3. Repeat this process until the bottom layer is reached. Pros: query-independent, easy to parallelize, incremental update; O(K), K: # of grid cells at the lowest level. Cons: all cluster boundaries are horizontal or vertical, none diagonal. WaveCluster: a multi-resolution clustering approach using the wavelet method. CLIQUE: density-based + grid-based. Algo: 1. Partition the data space and find the # of pts that lie inside each cell of the partition; 2. Identify the subspaces that contain clusters using the Apriori principle; 3. Identify clusters: determine dense units in all subspaces of interest; determine connected dense units in all subspaces of interest. 4. Generate a min description for the clusters: determine the max regions that cover a cluster of connected dense units for each cluster; determine the min cover for each cluster. Pros: automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces; insensitive to the order of records in the input and does not presume some canonical data distribution; scales linearly; has good scalability as the # of dims in the data increases. Cons: the accuracy of the clustering result may be degraded. 7. MODEL-BASED: a model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to that model. eg: EM, SOM, COBWEB. 8. FREQUENT PATTERN-BASED: based on the analysis of frequent patterns. eg: p-Cluster. 9.
USER-GUIDED OR CONSTRAINT-BASED: Clustering by considering user-specified or application-specific constraints. eg: COD (obstacles), constrained clustering. 10. LINK-BASED: Objects are often linked together in various ways. Massive links can be used to cluster objs: SimRank, LinkClus.

Assessing Clustering Tendency: Hopkins Statistic: 1. Given D, regarded as a sample of a random variable o, determine how far away o is from being uniformly distributed in the data space. 2. Sample n points p1, …, pn uniformly from the data space; for each pi, let xi be the dist to its nearest neighbor in D. 3. Sample n points q1, …, qn uniformly from D; for each qi, let yi be the dist to its nearest neighbor in D − {qi}. 4. Calculate the Hopkins Statistic: $H = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}$. 5. If D is uniformly distributed, Σxi and Σyi are close and H ≈ 0.5. If D is clustered, Σyi is much smaller than Σxi and H is close to 0.

Determine the Number of Clusters: Empirical: k ≈ sqrt(n/2). Elbow: use the turning point in the curve of the sum of within-cluster variance w.r.t. the # of clusters. Cross validation: 1. Divide the data into m parts; 2. Use m − 1 parts to cluster; 3. Use the remaining part to test the quality; 4. For any k > 0, repeat it m times and compare the overall quality measure across different k.

Measuring Clustering Quality: Extrinsic: supervised; compare the result to the ground truth. Homogeneity: the purer the clusters, the better; Completeness: objs of the same category should be assigned to the same cluster; Rag bag: putting a heterogeneous obj into a pure cluster should be penalized more than putting it into a rag bag (i.e., “misc” or “other” category); Small cluster preservation: splitting a small category is worse than splitting a large one. Eg: BCubed precision and recall metrics. Intrinsic: unsupervised; considers how well the clusters are separated and how compact they are. Eg: Silhouette coefficient.
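A small sketch of the Hopkins statistic using the formula H = Σy_i / (Σx_i + Σy_i). Several choices here are assumptions rather than part of the notes: the data space is taken to be the bounding box of D, distances are Euclidean, and the helper name and test data are made up. With Σy_i in the numerator, tightly clustered data drives H toward 0, because sampled data points have very close neighbors while random locations do not.

```python
import random
from math import dist

def hopkins(data, n, rng):
    """Hopkins statistic with the y-sum (data-to-data distances) in the numerator."""
    dims = len(data[0])
    lo = [min(p[d] for p in data) for d in range(dims)]
    hi = [max(p[d] for p in data) for d in range(dims)]

    # x_i: uniform random locations in the bounding box -> nearest point of D.
    xs = []
    for _ in range(n):
        p = tuple(rng.uniform(lo[d], hi[d]) for d in range(dims))
        xs.append(min(dist(p, v) for v in data))

    # y_i: sampled points of D -> nearest neighbor in D excluding the point itself.
    ys = []
    for idx in rng.sample(range(len(data)), n):
        q = data[idx]
        ys.append(min(dist(q, v) for j, v in enumerate(data) if j != idx))

    return sum(ys) / (sum(xs) + sum(ys))

rng = random.Random(0)
# Hypothetical data: two tight blobs around (0, 0) and (10, 10).
clustered = [(rng.gauss(0, 0.05), rng.gauss(0, 0.05)) for _ in range(50)] + \
            [(rng.gauss(10, 0.05), rng.gauss(10, 0.05)) for _ in range(50)]
print(hopkins(clustered, n=20, rng=rng))
```

Some references define the statistic with the x-sum in the numerator instead, in which case clustered data gives values near 1, so it is worth checking which convention a given tool uses.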