












Lecture notes on association analysis and association rules in machine learning for data science. They cover the Apriori algorithm, which is used to find subsets of objects that often appear together, and the lattice representation of the problem. The notes also include examples and explanations of prevalence, confidence, and lift.
Typology: Lecture notes













Prof. John Paisley
Department of Electrical Engineering & Data Science Institute Columbia University
Association analysis is the task of understanding patterns of co-occurrence in data.
For example, consider the following “market baskets” of five customers.
Using such data, we want to analyze the patterns of co-occurrence within it. We can use these patterns to define association rules. For example,
{diapers} ⇒ {beer}
Imagine we have:
- p different objects indexed by {1, ..., p}
- A collection of subsets of these objects X_n ⊂ {1, ..., p}. Think of X_n as the index set of things purchased by customer n = 1, ..., N.
Association analysis: Find subsets of objects that often appear together. For example, if K ⊂ {1, ..., p} indexes objects that frequently co-occur, then the fraction
$$\frac{\#\{n \text{ such that } K \subseteq X_n\}}{N}$$
is large, relatively speaking.
Example: K = {peanut_butter, jelly, bread}
Association rules: Learn correlations. Let A and B be disjoint sets. Then A ⇒ B means that purchasing A increases the likelihood of also purchasing B.
Example: {peanut_butter, jelly} ⇒ {bread}
Want to find subsets that occur with probability above some threshold.
For example, does {bread, milk} occur relatively frequently?
- Go to each of the 5 baskets and count the number that contain both.
- Divide this number by 5 to get the frequency.
- Aside: Notice that the basket might have more items in it.
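The counting procedure above can be sketched in a few lines of Python. The five baskets from the lecture are not reproduced in these notes, so the `baskets` list below is a hypothetical stand-in with N = 5 baskets over p = 6 items, and `frequency` is an illustrative helper name.

```python
# Hypothetical baskets (an assumption): the lecture's own five baskets are not shown here.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def frequency(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    # `itemset <= basket` is a subset test, so baskets with extra items still count.
    count = sum(1 for basket in baskets if itemset <= basket)
    return count / len(baskets)


print(frequency({"bread", "milk"}, baskets))
```

The subset test is what implements the aside: a basket counts as long as it contains both items, regardless of what else is in it.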
When N = 5 and p = 6 as in this case, we can easily check every possible combination. However, real problems might have N ≈ 10^8 and p ≈ 10^4.
Some combinatorial analysis will show that brute-force search isn’t possible.
Q: How many different subsets K ⊆ {1, ..., p} are there?
A: Each subset can be represented by a binary indicator vector of length p. The total number of possible vectors is 2^p.
Q: Nobody will have a basket with every item in it, so we shouldn’t check every combination. How about if we only check up to k items?
A: The number of sets of size k picked from p items is
$$\binom{p}{k} = \frac{p!}{k!\,(p-k)!}.$$
For example, if p = 10^4 and k = 5, then $\binom{p}{k} \approx 8.3 \times 10^{17}$, i.e., on the order of 10^18.
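As a quick check of the combinatorics, the counts can be computed exactly with Python's arbitrary-precision integers. This is only an illustration of the orders of magnitude involved; the variable names are hypothetical.

```python
import math

p = 10**4  # number of items, as in the example above
k = 5      # largest itemset size considered

# Number of itemsets of size exactly k.
size_k = math.comb(p, k)

# Number of itemsets of size 1 through k ("check up to k items").
up_to_k = sum(math.comb(p, j) for j in range(1, k + 1))

# 2**p is too long to print usefully, so report its number of decimal digits.
digits_all = len(str(2**p))

print(f"C(p, k)            = {size_k:.3e}")
print(f"sizes 1 through k  = {up_to_k:.3e}")
print(f"2^p has {digits_all} decimal digits")
```

Even restricted to five items, the number of candidate sets is far too large to count one by one against N ≈ 10^8 baskets.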
Takeaway: Though the problem only requires counting, we need an algorithm that can tell us which K we should count and which we can ignore.
For example, let
K = {peanut_butter, jelly, bread},
A = {peanut_butter, jelly}, B = {bread}
- A prevalence of 0.03 means that peanut_butter, jelly and bread appeared together in 3% of baskets.
- A confidence of 0.82 means that when both peanut_butter and jelly were purchased, 82% of the time bread was also purchased.
- A lift of 1.95 means that bread is 1.95 times as likely to be purchased when peanut_butter and jelly were purchased as it is overall.
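The three quantities can be computed directly from basket counts. The sketch below assumes the usual definitions, which match the interpretations above: prevalence is P(K) with K = A ∪ B, confidence is P(B | A) = P(K) / P(A), and lift is P(B | A) / P(B). The baskets are hypothetical, and `rule_stats` is an illustrative name.

```python
# Hypothetical baskets (an assumption), not the lecture's data.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def count(itemset, baskets):
    """Number of baskets containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)


def rule_stats(A, B, baskets):
    """Prevalence, confidence and lift of the rule A => B (A and B disjoint).

    Assumes A and B each appear in at least one basket, so no division by zero.
    """
    A, B = set(A), set(B)
    n = len(baskets)
    n_k = count(A | B, baskets)        # baskets containing K = A ∪ B
    n_a = count(A, baskets)
    n_b = count(B, baskets)
    prevalence = n_k / n               # P(K)
    confidence = n_k / n_a             # P(B | A)
    lift = confidence / (n_b / n)      # P(B | A) / P(B)
    return prevalence, confidence, lift


prevalence, confidence, lift = rule_stats({"diapers"}, {"beer"}, baskets)
print(prevalence, confidence, lift)
```

A lift above 1 indicates that B is more likely in baskets containing A than in baskets overall; a lift near 1 suggests A carries little information about B.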
The goal of the Apriori algorithm is to quickly find all of the subsets K ⊂ {1, ..., p} that have probability greater than a predefined threshold t.
- Such a K will contain items that appear in at least N · t of the N baskets.
- A small fraction of such K should exist out of the 2^p possibilities.
Apriori uses properties about P(K) to reduce the number of subsets that need to be checked to a small fraction of all 2^p sets.
- It starts with K containing 1 item. It then moves to 2 items, etc.
- Sets of size k − 1 that “survive” help determine sets of size k to check.
- Important: Apriori finds every set K such that P(K) > t.
Next slide: The structure of the problem can be organized in a lattice.
We can use two properties to develop an algorithm for efficiently counting.
e.g., Let K = {a, b}. If these items appear together in x baskets, then the set of items K′ = {a, b, c} appears in ≤ x baskets since K ⊂ K′.
Mathematically, writing A = K′ \ K for the added items and supposing P(K) < t: P(K′) = P(K, A) = P(A|K)P(K) ≤ P(K) < t
Here is a basic version of the algorithm. It can be improved in clever ways.
Apriori algorithm
Set a threshold N · t, where 0 < t < 1 (but relatively small).
It should be clear that as k increases, we can hope that the number of sets that survive decreases. At a certain k < p, no sets will survive and we’re done.
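The individual steps of the algorithm are not reproduced in these notes. Based on the description above (count single items by brute force, extend each surviving set of size k − 1 by one surviving item, keep the sets that appear in more than N · t baskets), a basic version might be sketched as follows. The function name `apriori` and the baskets are hypothetical, and this is a plain illustration rather than an optimized implementation.

```python
def apriori(baskets, t):
    """Return {itemset: count} for every itemset found in more than N * t baskets."""
    baskets = [frozenset(b) for b in baskets]
    min_count = len(baskets) * t

    def count(itemset):
        # Brute-force pass over all baskets for one candidate set.
        return sum(1 for b in baskets if itemset <= b)

    # k = 1: count every single item.
    survivors = {}
    for item in set().union(*baskets):
        c = count(frozenset([item]))
        if c > min_count:
            survivors[frozenset([item])] = c
    frequent_items = [next(iter(s)) for s in survivors]

    result = dict(survivors)
    while survivors:
        # Candidates of size k: a surviving (k-1)-set plus one surviving item.
        # The set comprehension collapses duplicates such as {a,b}∪{c} and {a,c}∪{b}.
        candidates = {s | {i} for s in survivors for i in frequent_items if i not in s}
        survivors = {}
        for cand in candidates:
            c = count(cand)
            if c > min_count:
                survivors[cand] = c
        result.update(survivors)
    return result


# Hypothetical baskets (an assumption), not the lecture's data.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(baskets, t=0.5)
for itemset, c in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), c)
```

The loop stops at the first k for which no candidate survives, which is the termination condition described above.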
We can show that this algorithm returns every set K for which P(K) > t.
- Imagine we know every set of size k − 1 for which P(K) > t. Then every potential set of size k that could have P(K) > t will be checked.
e.g. Let k = 3: The set {a, b, c} appears in > N · t baskets. Will we check it?
Known: {a, b} and {c} must appear in > N · t baskets.
Assumption: We’ve found K = {a, b} as a set satisfying P(K) > t.
Apriori algorithm: We know P({c}) > t and so will check {a, b} ∪ {c}.
Induction: We have all |K| = 1 by brute-force search (start induction).
As written, this can lead to duplicate sets for checking, e.g., {a, b} ∪ {c} and {a, c} ∪ {b}. Indexing methods can ensure we create {a, b, c} once.
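One common indexing scheme, shown here as an assumption rather than as the method the lecture has in mind, stores each surviving set as a sorted tuple and joins two (k − 1)-tuples only when they share their first k − 2 entries. Each set of size k is then produced exactly once. The function name `generate_candidates` is hypothetical.

```python
def generate_candidates(frequent_prev):
    """Join sorted (k-1)-tuples that agree on their first k-2 items.

    `frequent_prev` is an iterable of sorted tuples, all of the same length.
    Returns a list of sorted k-tuples with no duplicates.
    """
    prev = sorted(frequent_prev)
    candidates = []
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            if a[:-1] != b[:-1]:
                # Tuples with a common prefix are adjacent in sorted order,
                # so no later tuple can share this prefix either.
                break
            # a < b with equal prefixes implies a[-1] < b[-1]: result stays sorted.
            candidates.append(a + (b[-1],))
    return candidates


print(generate_candidates([("a", "b"), ("a", "c"), ("b", "c")]))
```

In this scheme {a, b, c} arises only from joining (a, b) with (a, c), never from (a, c) with (b, c) or from (a, b) with (b, c). No frequent set is lost, because any set of size k with P(K) > t must have both of those (k − 1)-subsets among the survivors.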
For each proposed K, should we iterate through each basket for checking? There are tricks that take structure into account to make this faster.
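The notes do not say which tricks are meant. One possibility, given here purely as an assumed illustration, is to build an inverted index from each item to the set of basket indices containing it. The count for a proposed K is then the size of an intersection of index sets, with no pass over all N baskets. The names `build_index` and `count_with_index` are hypothetical.

```python
def build_index(baskets):
    """Map each item to the set of basket indices n in which it appears."""
    index = {}
    for n, basket in enumerate(baskets):
        for item in basket:
            index.setdefault(item, set()).add(n)
    return index


def count_with_index(itemset, index):
    """Number of baskets containing every item of a non-empty `itemset`."""
    id_sets = [index.get(item, set()) for item in itemset]
    # Intersect starting from the smallest set to keep intermediate results small.
    id_sets.sort(key=len)
    return len(set.intersection(*id_sets))


# Hypothetical baskets (an assumption), not the lecture's data.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
index = build_index(baskets)
print(count_with_index({"bread", "milk"}, index))
```

The index is built in a single pass over the data, after which each candidate costs time proportional to the sizes of the index sets involved rather than to N.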
Data
N = 6876 questionnaires
14 questions coded into p = 50 items
For example:
- ordinal (2 items): Pick the item based on value being ≶ median
- categorical: item = category; x categories → x items
- Based on the item encoding, it’s clear that no “basket” can have every item.
- We see that association analysis extends to more than consumer analysis.
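The item encoding described above can be sketched as follows. The question names and answers are made up for illustration, and the choice to place values equal to the median on the "<=" side is an assumption, since the notes only say that the item depends on the value being ≶ the median.

```python
from statistics import median


def encode_ordinal(name, values):
    """Two items per ordinal question: at or below the median, or above it."""
    m = median(values)
    return [f"{name}<=median" if v <= m else f"{name}>median" for v in values]


def encode_categorical(name, values):
    """One item per category of a categorical question."""
    return [f"{name}={v}" for v in values]


def to_baskets(ordinal_cols, categorical_cols):
    """Turn per-question answer lists into one 'basket' of items per respondent."""
    columns = [encode_ordinal(k, v) for k, v in ordinal_cols.items()]
    columns += [encode_categorical(k, v) for k, v in categorical_cols.items()]
    return [set(row) for row in zip(*columns)]


# Hypothetical questionnaire answers for three respondents.
baskets = to_baskets(
    ordinal_cols={"age": [25, 40, 60]},
    categorical_cols={"housing": ["rent", "own", "rent"]},
)
print(baskets)
```

Each respondent receives exactly one item per question, which is why no basket can contain every item: the items derived from the same question are mutually exclusive.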