
Mean: Sum / N; trimmed: w/o extreme values; Median: middle val; approx: group intervals
by freq, take median intv.
L1+(N/2(⌃freq)l
freqmedian )width
L_1 lower bd, N all, \Sigma sum of freq lower than median intv.
Mode: Most freq value. mean - mode = 3 * (mean - median).
Midrange: (max+min) / 2. Range: max-min.
k-th q-quantile: at most (k/q) less than x, (q-k)/q greater than x. IQR: Q_3 - Q_1. 5 num
summary: Min, Q1, Median, Q3, Max. Boxplot: End at the quartiles, len = IQR. Mark Median.
Whiskers extend to min and max. Outliers: more than 1.5*IQR.
Variance: (STD deviation square)
.
Quantilie plot: x against f. x: inc sorted data. f_i: f_i * 100% of data below x_i.
f_i = (i-0.5)/N, from 1/(2N) to 1-1/(2N). Q-Q plot: Between two sets of observations. M=N
then simply plot x against y. M<N, (i-0.5)/M quantile x against y.
Visualization methods: Basic: Boxplot, Histogram, Quantile Plot, QQ plot, Scatter plot. Pixel-
oriented: m-dim m-windows (pixels), color reflecting values. Circle segment. Geometric
projection: Direct visualization, Scatterplot and scatterplot matrices, Landscapes, Projection
pursuit: Help users find meaningful projections of multidimal data, Prosection views,
Hyperslice, Parallel coordinates. Icon-based: Chernoff Faces: 10-dim, Stick Figures: 5-piece.
Shape, color, tile bars. Hierarchical: Dimal Stacking (n-dim in 2D. >9D hard), Worlds-within-
Worlds, Tree-Map, Cone Trees (up to 1k nodes, overlapping 2D), InfoCube. Visualizing
complex data and relations: tag cloud. Data Correlation: attrs implies each other. neg, pos,
or null. Similarity: How alike 2 are. [0,1]. Dissimilarity: How diff. min 0.
d(i,j) = (r+s) / (q+r+s+t); d(i,j) = (r+s) / (q+r+s); Sim_Jaccard(i,j)=q/(q+r+s);
Data matrix: n pts with p dim; Two modes. Dissimilarity matrix: n pts, but
registers only the distance (dis b/w pts); A triangular matrix n; Single
mode. Minkowski Dist:
h
p|xi1xj1|h+···+|xip xjp|h
. >0 if i!=j
&& d(i,i)=0 (pos def); d(i,j)=d(j,i); d(i,j)=d(i,k)+d(k,j). (h=1 Manhattan, Hamming dis; h=2
Euclidean; h>2 supremum max diff b/w any attr of the vectors)
Prox Measure: (Nomial Attr) 1. simple match: #/match, #/total; 2. create a new bin attr.
(Bin Attr) 1. Create a contingency table q(1,1), r(1,0), s(0,1), t(0,0). 2. Distance measure for
symmetric (gender) binary variables: (r+s) / (q+r+s+t); 3. Distance measure for asymmetric
(symptom) binary variables (r+s) / (q+r+s); 4. Jaccard coefficient (similarity measure for
asymmetric binary variables): q / (q+r+s). (Mixed Type) Cos Similarity: (eg Doc, vector obj)
If d1, d2 are vectors.
cos(d1,d
2)=(d1·d2)/(||d1||||d2||)
.
DATA QUALITY: Accuracy: correct/wrong, accurate/not. Completeness: not recorded,
unavailable. Consistency: some modified, dangling. Timeliness: timely update? Believability:
trustable? Interpretability: easy to understand? Inconsistent: age/birthday; rating in
number / rating in letters
Data Cleaning: Missing attr tuple: ignore tuple, filling in manually, use a global constance
(Unknown), (use global/class mean/median, use most probable value) → bias data.
Process: 1. discrepancy detection; 2. data transformation. Noisy Data: A random error or
variance in the measured variable. “Smooth.” Binning: Consulting values around. Local. Put
data into bins and smoothing by bin means, medians or boundaries (replace data with
closest boundary). equal-width part, equal-depth part. Regression: Conforms data values to
a function. Outlier analysis: Detect by clustering. Outsider of a cluster are outliers. Data
Integration: merging data from multiple storage. Entity id prob: Entities (attrs) from diff src
match up. Rebundancy: An attr can be derived from another. Detect by correlation analysis.
x^2 Correlation Test: Test the correlation relationship between two attrs, A and B. Does not
imply causality. Make a table, A’s c values on columns to B’s r values on rows.
2=⌃c
i=1⌃r
j=1
(oijeij )2
eij
o_{ij}: freq of (Ai, Bj). e_{ij}: expected frequency
=(count(A=a_i) * count(B=b_j)) / n. Higher-> corre.
Correlation Coefficient (Pearson’s prod moment coefficient): For numeric data.
rA,B =⌃(ai¯
A)(bi¯
B)/(nAB)=(⌃(aibi)n¯
A¯
B)/(nAB)
(=0: indep,
<0 neg correlated; >0 pos). Covariance: Compare to expected val (mean),
CovA,B =⌃(ai¯
A)(bi¯
B)/n
(>0 A,B both tend to > expected val; <0 A > expected
val → b < expected val). Some pairs of random variables may have a covariance of 0 but
are not independent. Only under some additional assumptions (e.g., the data follow
multivariate normal distributions) does a covariance of 0 imply independence.
Data Reduction: Dimenionality: (eg remove unimportant attrs) Wavelet transforms:
Decomposes a signal into different frequency, preserve relative distance between objects
at different levels of resolution, Similar to discrete Fourier transform (DFT), but better lossy
compression, localized in space. Principal Components Analysis (PCA): Find a projection
that captures the largest amount of variation in data, The original data are projected onto
a much smaller space, resulting in dimensionality reduction. 1. Normalize input data, 2.
Compute k orthonormal (unit) vectors, 3 Each input data (vector) is a linear combination, 4
Sort, delete weak.. Works for numeric data only. Feature subset selection: remove
redundant and irrelevant attributes, Best single attribute, Best step-wise selection, remove.
feature creation. Numerosity (Data Reduction): n Regression and Log-Linear Models,
[Histograms, clustering, sampling], Data cube aggregation; Data compression.
Wavelet: O(n). DWT (for linear signal processing, multi-resolution analysis) store only a
small fraction of the strongest of the wavelet coefficients. Similar to DFT, but better lossy
compression, localized in space. PCA: Find a projection that captures the largest amount of
variation in data. Project into smaller space. Attr Subset Selection: Reduce 1. Redundant
attrs (duplicate much info, eg price+tax); 2. Irrelevant attrs (no useful info eg student_id
+GPA). Heuristic Search: 2^d comb of d attrs. 1. Best single attr under the attr indep
assumption: choose by significance tests; 2. Best step-wise feature selection: The best
single-attr is picked first; Then next best attr condition to the first, …; 3. Step-wise attr
elimination: Repeatedly eliminate the worst attr; 4. Best combined attr selection and
elimination; 5. Optimal branch and bound: Use attr elimination and backtracking.
Attr/Feature Gen: Create new attr to capture info better: Attr extraction; Mapping to new
space (data reduction); Attr construction: combine, discretization.
Data Transformation: Min-Max Norm:
viminA
maxAmin A(nmax Anmin A)
. z-score Norm:
. Decimal Scaling Norm:
(j guarantees all norms < 1).
Data Discretization: Divide the range of a continuous attribute into intervals. Binning (top-
down, unsupervised), histogram (td, un), cluster (td/bu, un), decision tree (td, su),
correlation analyses (BU merge, un).
Classfication vs Correlation Analysis: C: supervised, with class; top-down, using entropy to
split. CA: su; BU merge: find neighbors (x^2 values) to merge.
Concept Hierarchy Generation: specify by user; by explicit data grouping; spec of a set of
attr; spec of only a partial set.
Data Warehouse: subject-oriented, integrated, time-variant, and nonvolatile collection of
data in support of management’s decision-making process. DB (OLTP) vs Data Warehouses
(OLAP): Users and sys orientation: customer-o by clients, clerks vs market-o by knowledge
workers; Data contents, db design, view, access patterns. 3-tier: warehouse db serv; OLAP;
front-end client. 3-models: Enterprise warehouse: collects all of the info about subjects
spanning the entire org; Data Mart: a subset of corporate-wide data that is of value to a
specific groups of users. Its scope is confined to specific, selected groups, such as
marketing data mart; Virtual warehouse: A set of views over operational db. Only some of
the possible summary views may be. Metadata Repo: DW structure; op metadata;
algorithms for sum; mapping from the op env to DW; perf; business.
Dim tables: such as item (item_name, brand, type). Fact table: contains measures (such as
dollars_sold) and keys to each of the related dim tables.
Measures: Distributive: sum(), count(), min(), max(). Algebraic: avg(), min_N(), max_N(),
stddev(). Holistic: median(), mode(), rank().
OLAP Op: roll-up: up on concept hierarchy; drill-down: more detail; slice: get 1 dim; dice:
get 2+ dim; pivot: rotate; drill across: involving (across) >1 fact table. drill through: bottom
lvl to its back-end relational tables (using SQL). Typical DW design process: 1. Business
process: eg orders, invoices. 2. Grain (atomic level of data). 3. Dim in each fact-t 4.
Measurements. SDB: Privacy, hidden hierarchy. Indexing: Bitmap: Index on a particular
column; Each value in the column has a bit vector: bit-op is fast; The length of the bit
vector: # of records in the base table; The i-th bit is set if the i-th row of the base table
has the value for the indexed column; not suitable for high cardinality domains. Join
Indices: Join index: JI(R-id, S-id) where R (R-id, ...) S (S-id, ...). It materializes relational join
in JI file and speeds up relational join. E.g. fact table: Sales and two dimensions city and
product: A join index on city maintains for each distinct city a list of R-IDs of the tuples
recording the Sales in the city. Join indices can span multiple dimensions
Cubes: If there is no concept hierarchy, n dim, p base cells, c common dim. Total Cuboid #
= 2^n, aggregated # = 2^n - 1. Total Cell = Total - overlapping cells * overlapping times.
Total = p * 2^n. Overlapping: count * 2^{common dim}. Closed cell: if there exists no d,
that d is a descendant of c and has the same measure value. Cannot be specialized to
obtain a same measure value. All base cells, apex cell is usually NOT. Iceberg cells: having
count >= min support. Cube Shell: Precompute only the cuboids with a small # of dim.
Fragment cubes’ space req:
. Multi-way Array Aggr: Array-
based BU algorithm; Using multi-dim chunks; No direct tuple comparisons; Simultaneous
aggr on multiple dim; Intermediate aggr values are re-used for computing ancestor
cuboids; Can’t do Apriori pruning -> No iceberg optimization. Best order: min the mem and
I/Os; Method: the planes should be sorted and computed according to their size in
ascending order. Idea: keep the smallest plane in the main memory, fetch and compute
only one chunk at a time for the largest plane. Limitation: computing well only for small #/
dim. For large: TD and iceberg cube computation. BUC: BU cube comp; Divides dims into
parts and facilitates iceberg pruning; If a part does not satisfy min_sup, its descendants
can be pruned; If minsup = 1, comp full cube; No simultaneous aggregation. Star-Cubing:
Explore shared dimensions; eg dim A is the shared dim of ACD and AD. Allows for shared
computations; cuboid AB is comped simly as ABD. Aggregate in a TD manner but with the
BU sub-layer underneath: allow Apriori pruning. Shared dim grow in BU. (Loseless
compression.) High-Dimensional OLAP: ONLY one handles high dim. Challenge: high-dim;
Iceberg cube and compressed cube: only delay the inevitable explosion; Full
materialization still significant overhead in accessing results on disk. Step: 1. Part the set
of dim into shell fragments; 2. Comp data cubes for each shell frag while retaining inverted
indices or value-list indices; 3. Given the pre-computed frag cubes, dynamically compute
cube cells of the high- dimensional data cube online. Properties: Partitions the data
vertically. Reduces high-dimensional cube into a set of lower dimensional cubes. Online re-
construction of original high-dimensional space. Lossless reduction. Offers tradeoffs
between the amount of pre-processing and the speed of online computation. Frag-Shells:
Part set of dim (A1,...,An) into a set of k frags (P1,...,Pk).
!Scan base table once and do the following
!!insert <tid, measure> into ID_measure table.
!!for each attribute value ai of each dimension Ai
!!!build inverted index entry <ai, tidlist>
!For each farg part Pi