Mean: Sum / N; trimmed: w/o extreme values; Median: middle val; approx: group intervals

by freq, take median intv.

L1+(N/2(⌃freq)l

freqmedian )width

L_1 lower bd, N all, \Sigma sum of freq lower than median intv.

Mode: Most freq value. mean - mode = 3 * (mean - median).

Midrange: (max+min) / 2. Range: max-min.

k-th q-quantile: at most (k/q) less than x, (q-k)/q greater than x. IQR: Q_3 - Q_1. 5 num

summary: Min, Q1, Median, Q3, Max. Boxplot: End at the quartiles, len = IQR. Mark Median.

Whiskers extend to min and max. Outliers: more than 1.5*IQR.

Variance: (STD deviation square)

N⌃(xi¯x)2=(1

N⌃x2

i)¯x2

Quantilie plot: x against f. x: inc sorted data. f_i: f_i * 100% of data below x_i.

f_i = (i-0.5)/N, from 1/(2N) to 1-1/(2N). Q-Q plot: Between two sets of observations. M=N

then simply plot x against y. M<N, (i-0.5)/M quantile x against y.

Visualization methods: Basic: Boxplot, Histogram, Quantile Plot, QQ plot, Scatter plot. Pixel-

oriented: m-dim m-windows (pixels), color reflecting values. Circle segment. Geometric

projection: Direct visualization, Scatterplot and scatterplot matrices, Landscapes, Projection

pursuit: Help users find meaningful projections of multidimal data, Prosection views,

Hyperslice, Parallel coordinates. Icon-based: Chernoff Faces: 10-dim, Stick Figures: 5-piece.

Shape, color, tile bars. Hierarchical: Dimal Stacking (n-dim in 2D. >9D hard), Worlds-within-

Worlds, Tree-Map, Cone Trees (up to 1k nodes, overlapping 2D), InfoCube. Visualizing

complex data and relations: tag cloud. Data Correlation: attrs implies each other. neg, pos,

or null. Similarity: How alike 2 are. [0,1]. Dissimilarity: How diff. min 0.

d(i,j) = (r+s) / (q+r+s+t); d(i,j) = (r+s) / (q+r+s); Sim_Jaccard(i,j)=q/(q+r+s);

Data matrix: n pts with p dim; Two modes. Dissimilarity matrix: n pts, but

registers only the distance (dis b/w pts); A triangular matrix n; Single

mode. Minkowski Dist:

p|xi1xj1|h+···+|xip xjp|h

. >0 if i!=j

&& d(i,i)=0 (pos def); d(i,j)=d(j,i); d(i,j)=d(i,k)+d(k,j). (h=1 Manhattan, Hamming dis; h=2

Euclidean; h>2 supremum max diff b/w any attr of the vectors)

Prox Measure: (Nomial Attr) 1. simple match: #/match, #/total; 2. create a new bin attr.

(Bin Attr) 1. Create a contingency table q(1,1), r(1,0), s(0,1), t(0,0). 2. Distance measure for

symmetric (gender) binary variables: (r+s) / (q+r+s+t); 3. Distance measure for asymmetric

(symptom) binary variables (r+s) / (q+r+s); 4. Jaccard coefficient (similarity measure for

asymmetric binary variables): q / (q+r+s). (Mixed Type) Cos Similarity: (eg Doc, vector obj)

If d1, d2 are vectors.

cos(d1,d

2)=(d1·d2)/(||d1||||d2||)

DATA QUALITY: Accuracy: correct/wrong, accurate/not. Completeness: not recorded,

unavailable. Consistency: some modified, dangling. Timeliness: timely update? Believability:

trustable? Interpretability: easy to understand? Inconsistent: age/birthday; rating in

number / rating in letters

Data Cleaning: Missing attr tuple: ignore tuple, filling in manually, use a global constance

(Unknown), (use global/class mean/median, use most probable value) → bias data.

Process: 1. discrepancy detection; 2. data transformation. Noisy Data: A random error or

variance in the measured variable. “Smooth.” Binning: Consulting values around. Local. Put

data into bins and smoothing by bin means, medians or boundaries (replace data with

closest boundary). equal-width part, equal-depth part. Regression: Conforms data values to

a function. Outlier analysis: Detect by clustering. Outsider of a cluster are outliers. Data

Integration: merging data from multiple storage. Entity id prob: Entities (attrs) from diff src

match up. Rebundancy: An attr can be derived from another. Detect by correlation analysis.

x^2 Correlation Test: Test the correlation relationship between two attrs, A and B. Does not

imply causality. Make a table, A’s c values on columns to B’s r values on rows.

2=⌃c

i=1⌃r

j=1

(oijeij )2

eij

o_{ij}: freq of (Ai, Bj). e_{ij}: expected frequency

=(count(A=a_i) * count(B=b_j)) / n. Higher-> corre.

Correlation Coefficient (Pearson’s prod moment coefficient): For numeric data.

rA,B =⌃(ai¯

A)(bi¯

B)/(nAB)=(⌃(aibi)n¯

A¯

B)/(nAB)

(=0: indep,

<0 neg correlated; >0 pos). Covariance: Compare to expected val (mean),

CovA,B =⌃(ai¯

A)(bi¯

B)/n

(>0 A,B both tend to > expected val; <0 A > expected

val → b < expected val). Some pairs of random variables may have a covariance of 0 but

are not independent. Only under some additional assumptions (e.g., the data follow

multivariate normal distributions) does a covariance of 0 imply independence.

Data Reduction: Dimenionality: (eg remove unimportant attrs) Wavelet transforms:

Decomposes a signal into different frequency, preserve relative distance between objects

at different levels of resolution, Similar to discrete Fourier transform (DFT), but better lossy

compression, localized in space. Principal Components Analysis (PCA): Find a projection

that captures the largest amount of variation in data, The original data are projected onto

a much smaller space, resulting in dimensionality reduction. 1. Normalize input data, 2.

Compute k orthonormal (unit) vectors, 3 Each input data (vector) is a linear combination, 4

Sort, delete weak.. Works for numeric data only. Feature subset selection: remove

redundant and irrelevant attributes, Best single attribute, Best step-wise selection, remove.

feature creation. Numerosity (Data Reduction): n Regression and Log-Linear Models,

[Histograms, clustering, sampling], Data cube aggregation; Data compression.

Wavelet: O(n). DWT (for linear signal processing, multi-resolution analysis) store only a

small fraction of the strongest of the wavelet coefficients. Similar to DFT, but better lossy

compression, localized in space. PCA: Find a projection that captures the largest amount of

variation in data. Project into smaller space. Attr Subset Selection: Reduce 1. Redundant

attrs (duplicate much info, eg price+tax); 2. Irrelevant attrs (no useful info eg student_id

+GPA). Heuristic Search: 2^d comb of d attrs. 1. Best single attr under the attr indep

assumption: choose by significance tests; 2. Best step-wise feature selection: The best

single-attr is picked first; Then next best attr condition to the first, …; 3. Step-wise attr

elimination: Repeatedly eliminate the worst attr; 4. Best combined attr selection and

elimination; 5. Optimal branch and bound: Use attr elimination and backtracking.

Attr/Feature Gen: Create new attr to capture info better: Attr extraction; Mapping to new

space (data reduction); Attr construction: combine, discretization.

Data Transformation: Min-Max Norm:

viminA

maxAmin A(nmax Anmin A)

. z-score Norm:

(vi¯

A)/A

. Decimal Scaling Norm:

vi/10j

(j guarantees all norms < 1).

Data Discretization: Divide the range of a continuous attribute into intervals. Binning (top-

down, unsupervised), histogram (td, un), cluster (td/bu, un), decision tree (td, su),

correlation analyses (BU merge, un).

Classfication vs Correlation Analysis: C: supervised, with class; top-down, using entropy to

split. CA: su; BU merge: find neighbors (x^2 values) to merge.

Concept Hierarchy Generation: specify by user; by explicit data grouping; spec of a set of

attr; spec of only a partial set.

Data Warehouse: subject-oriented, integrated, time-variant, and nonvolatile collection of

data in support of management’s decision-making process. DB (OLTP) vs Data Warehouses

(OLAP): Users and sys orientation: customer-o by clients, clerks vs market-o by knowledge

workers; Data contents, db design, view, access patterns. 3-tier: warehouse db serv; OLAP;

front-end client. 3-models: Enterprise warehouse: collects all of the info about subjects

spanning the entire org; Data Mart: a subset of corporate-wide data that is of value to a

specific groups of users. Its scope is confined to specific, selected groups, such as

marketing data mart; Virtual warehouse: A set of views over operational db. Only some of

the possible summary views may be. Metadata Repo: DW structure; op metadata;

algorithms for sum; mapping from the op env to DW; perf; business.

Dim tables: such as item (item_name, brand, type). Fact table: contains measures (such as

dollars_sold) and keys to each of the related dim tables.

Measures: Distributive: sum(), count(), min(), max(). Algebraic: avg(), min_N(), max_N(),

stddev(). Holistic: median(), mode(), rank().

OLAP Op: roll-up: up on concept hierarchy; drill-down: more detail; slice: get 1 dim; dice:

get 2+ dim; pivot: rotate; drill across: involving (across) >1 fact table. drill through: bottom

lvl to its back-end relational tables (using SQL). Typical DW design process: 1. Business

process: eg orders, invoices. 2. Grain (atomic level of data). 3. Dim in each fact-t 4.

Measurements. SDB: Privacy, hidden hierarchy. Indexing: Bitmap: Index on a particular

column; Each value in the column has a bit vector: bit-op is fast; The length of the bit

vector: # of records in the base table; The i-th bit is set if the i-th row of the base table

has the value for the indexed column; not suitable for high cardinality domains. Join

Indices: Join index: JI(R-id, S-id) where R (R-id, ...) S (S-id, ...). It materializes relational join

in JI file and speeds up relational join. E.g. fact table: Sales and two dimensions city and

product: A join index on city maintains for each distinct city a list of R-IDs of the tuples

recording the Sales in the city. Join indices can span multiple dimensions

Cubes: If there is no concept hierarchy, n dim, p base cells, c common dim. Total Cuboid #

= 2^n, aggregated # = 2^n - 1. Total Cell = Total - overlapping cells * overlapping times.

Total = p * 2^n. Overlapping: count * 2^{common dim}. Closed cell: if there exists no d,

that d is a descendant of c and has the same measure value. Cannot be specialized to

obtain a same measure value. All base cells, apex cell is usually NOT. Iceberg cells: having

count >= min support. Cube Shell: Precompute only the cuboids with a small # of dim.

Fragment cubes’ space req:

O(T⇥ceil(D/F )(2F1))

. Multi-way Array Aggr: Array-

based BU algorithm; Using multi-dim chunks; No direct tuple comparisons; Simultaneous

aggr on multiple dim; Intermediate aggr values are re-used for computing ancestor

cuboids; Can’t do Apriori pruning -> No iceberg optimization. Best order: min the mem and

I/Os; Method: the planes should be sorted and computed according to their size in

ascending order. Idea: keep the smallest plane in the main memory, fetch and compute

only one chunk at a time for the largest plane. Limitation: computing well only for small #/

dim. For large: TD and iceberg cube computation. BUC: BU cube comp; Divides dims into

parts and facilitates iceberg pruning; If a part does not satisfy min_sup, its descendants

can be pruned; If minsup = 1, comp full cube; No simultaneous aggregation. Star-Cubing:

Explore shared dimensions; eg dim A is the shared dim of ACD and AD. Allows for shared

computations; cuboid AB is comped simly as ABD. Aggregate in a TD manner but with the

BU sub-layer underneath: allow Apriori pruning. Shared dim grow in BU. (Loseless

compression.) High-Dimensional OLAP: ONLY one handles high dim. Challenge: high-dim;

Iceberg cube and compressed cube: only delay the inevitable explosion; Full

materialization still significant overhead in accessing results on disk. Step: 1. Part the set

of dim into shell fragments; 2. Comp data cubes for each shell frag while retaining inverted

indices or value-list indices; 3. Given the pre-computed frag cubes, dynamically compute

cube cells of the high- dimensional data cube online. Properties: Partitions the data

vertically. Reduces high-dimensional cube into a set of lower dimensional cubes. Online re-

construction of original high-dimensional space. Lossless reduction. Offers tradeoffs

between the amount of pre-processing and the speed of online computation. Frag-Shells:

Part set of dim (A1,...,An) into a set of k frags (P1,...,Pk).

!Scan base table once and do the following

!!insert <tid, measure> into ID_measure table.

!!for each attribute value ai of each dimension Ai

!!!build inverted index entry <ai, tidlist>

!For each farg part Pi

Cheat Sheet for Introduction to Data Mining | CS 412, Study notes of Computer Science

Partial preview of the text

Download Cheat Sheet for Introduction to Data Mining | CS 412 and more Study notes Computer Science in PDF only on Docsity!

L 1 + (

freqmedian^ )width

N ⌃(xi^ ^ x¯)

N ⌃x

i )^ ^ ¯x

p

|xi 1 xj 1 |h^ + · · · + |xip xjp|h^. >0 if i!=j

If d1, d2 are vectors. cos(d^1 , d^2 ) = (d^1 ·^ d^2 )/(||d^1 ||||d^2 ||).

^2 = ⌃c i=1⌃r j=

rA,B = ⌃(ai A¯)(bi B¯)/(nAB ) = (⌃(aibi) n A¯ B¯)/(nAB ) (=0: indep,

Cov A,B = ⌃(ai A¯)(bi B¯)/n (>0 A,B both tend to > expected val; <0 A > expected

max AminA (n^ max^ A^ ^ n^ minA)^. z-score Norm:

(vi A¯)/A . Decimal Scaling Norm: vi/ 10

F 1))

Confidence Interval: ¯x^ ±^ tc^ ˆ¯x^. sigma=s/sqrt(l)x is a sample of data set; is the mean of

Measure Definition Range^ NI

(a, b) ⌃i,j=0, 1

e(ai,bj ) [0,^1 ]^

[0, 1 ] N

[0, 1] Y

sup(a)+sup(b)sup(ab) [0,^ 1]^ Y

p

[0, 1] Y

2 (^

sup(a) +^

sup(b) )^

[0, 1] Y

M axConf (a, b) max( sup(ab)

sup(a) ,^

sup(b) )^

[0, 1] Y

q

n(n1) ⌃(xi^ ^ xj^ )

fG(x, y) = e

(influence of y on x), f^

G (x, y) =^ ⌃fG(x, xi)^ (total influence

on x), rf^

G (x, xi) =^ ⌃(xi^ ^ x)fG(x, xi)^ (gradient of x in the direction of x_i)

H =