Machine Learning for Data Science: Association Analysis and Rules, Lecture notes of Machine Learning

Columbia College Machine Learning

Association analysis and rules in machine learning for data science. It covers the Apriori algorithm, which is used to find subsets of objects that often appear together, and the lattice representation of the problem. The document also includes examples and explanations of prevalence, confidence, and lift. The content is relevant to university topics such as machine learning, data science, and electrical engineering. Columbia University is a likely university that offers courses related to this document.

Typology: Lecture notes

2020/2021

Uploaded on 05/11/2023


COMS 4721: Machine Learning for Data Science

Lecture 23, 4/20/2017

Prof. John Paisley

Department of Electrical Engineering

& Data Science Institute

Columbia University


ASSOCIATION ANALYSIS

MARKET BASKET ANALYSIS

Association analysis is the task of understanding patterns in transaction data, such as which products customers tend to buy together.

For example, consider the following “market baskets” of five customers.

Using such data, we want to analyze patterns of co-occurrence within it. We can use these patterns to define association rules. For example,

{diapers} ⇒ {beer}

ASSOCIATION ANALYSIS AND RULES

Imagine we have:
• p different objects indexed by {1, ..., p}
• A collection of subsets of these objects X_n ⊂ {1, ..., p}. Think of X_n as the index of things purchased by customer n = 1, ..., N.

Association analysis: Find subsets of objects that often appear together. For example, if K ⊂ {1, ..., p} indexes objects that frequently co-occur, then

$$P(K) = \frac{\#\{n \text{ such that } K \subseteq X_n\}}{N}$$

is large, relatively speaking.

Example: K = {peanut_butter, jelly, bread}

Association rules: Learn correlations. Let A and B be disjoint sets. Then A ⇒ B means that purchasing A increases the likelihood of also purchasing B.

Example: {peanut_butter, jelly} ⇒ {bread}

PROCESSING THE BASKET

Want to find subsets that occur with probability above some threshold.

For example, does {bread, milk} occur relatively frequently?
• Go to each of the 5 baskets and count the number that contain both.
• Divide this number by 5 to get the frequency.
• Aside: Notice that the basket might have more items in it.

When N = 5 and p = 6 as in this case, we can easily check every possible combination. However, real problems might have N ≈ 10^8 and p ≈ 10^4.
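The counting procedure described above can be sketched in a few lines of Python. The baskets below are hypothetical, invented for illustration with N = 5 and p = 6 (they are not the lecture's table), and the function name `prevalence` is likewise an assumption.

```python
def prevalence(baskets, itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    # A basket counts as a hit even if it holds extra items.
    hits = sum(1 for basket in baskets if itemset <= set(basket))
    return hits / len(baskets)


# Hypothetical baskets (illustrative only): N = 5 customers, p = 6 items.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

freq = prevalence(baskets, {"bread", "milk"})
print(freq)  # 3 of the 5 hypothetical baskets contain both items
```

Each call scans all N baskets, which is fine here but is exactly the cost that becomes prohibitive when the number of candidate sets is large.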

SOME COMBINATORICS

Some combinatorial analysis will show that brute-force search isn’t possible.

Q: How many different subsets K ⊆ { 1 ,... , p} are there?

A: Each subset can be represented by a binary indicator vector of length p. The total number of possible vectors is 2^p.

Q: Nobody will have a basket with every item in it, so we shouldn’t check every combination. How about if we only check up to k items?

A: The number of sets of size k picked from p items is

$$\binom{p}{k} = \frac{p!}{k!\,(p-k)!}.$$

For example, if p = 10^4 and k = 5, then

$$\binom{p}{k} \approx 8.3 \times 10^{17},$$

i.e. on the order of 10^18 sets.

Takeaway: Though the problem only requires counting, we need an algorithm that can tell us which K we should count and which we can ignore.
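The sizes quoted in this section can be checked directly with Python's arbitrary-precision integers. This is only a sanity check of the combinatorics, using the slide's values p = 10^4 and k = 5; the variable names are illustrative.

```python
import math

p = 10**4  # number of distinct items, as in the slide
k = 5      # largest set size we are willing to check

# 2^p is far too large to enumerate; report how many decimal digits it has.
digits_in_2_to_p = len(str(2**p))

# Number of distinct item sets of size exactly k.
sets_of_size_k = math.comb(p, k)

print(digits_in_2_to_p)
print(f"{sets_of_size_k:.3e}")
```

Even restricted to sets of five items, the count is many orders of magnitude beyond what could be checked one by one against N ≈ 10^8 baskets.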

EXAMPLE

For example, let

K = {peanut_butter, jelly, bread},

A = {peanut_butter, jelly}, B = {bread}

• A prevalence of 0.03 means that peanut_butter, jelly and bread appeared together in 3% of baskets.

• A confidence of 0.82 means that when both peanut_butter and jelly were purchased, 82% of the time bread was also purchased.

• A lift of 1.95 means that bread is 1.95 times more likely to be purchased given that peanut_butter and jelly were purchased than it is overall.
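The three quantities can be computed from raw baskets as in the sketch below. It assumes the standard definitions, which match the interpretations above: prevalence is P(A, B), confidence is P(B | A) = P(A, B) / P(A), and lift is P(B | A) / P(B). The baskets and the function names are hypothetical, chosen for illustration.

```python
def prevalence(baskets, itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for b in baskets if itemset <= set(b)) / len(baskets)


def rule_stats(baskets, A, B):
    """Return (prevalence, confidence, lift) for the rule A => B.

    Assumes A and B are disjoint and each appears in at least one basket,
    so neither division is by zero.
    """
    p_ab = prevalence(baskets, set(A) | set(B))  # P(A, B)
    p_a = prevalence(baskets, A)                 # P(A)
    p_b = prevalence(baskets, B)                 # P(B)
    confidence = p_ab / p_a                      # P(B | A)
    lift = confidence / p_b                      # P(B | A) / P(B)
    return p_ab, confidence, lift


# Hypothetical baskets (illustrative only).
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

prev, conf, lift = rule_stats(baskets, {"diapers"}, {"beer"})
print(prev, conf, lift)
```

On this toy data the rule {diapers} ⇒ {beer} has prevalence about 0.6, confidence about 0.75 and lift about 1.25. A lift above 1 indicates that buying A makes B more likely than its base rate.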

APRIORI ALGORITHM

The goal of the Apriori algorithm is to quickly find all of the subsets K ⊂ {1, ..., p} that have probability greater than a predefined threshold t.
• Such a K will contain items that appear in at least N · t of the N baskets.
• A small fraction of such K should exist out of the 2^p possibilities.

Apriori uses properties about P(K) to reduce the number of subsets that need to be checked to a small fraction of all 2^p sets.
• It starts with K containing 1 item. It then moves to 2 items, etc.
• Sets of size k − 1 that “survive” help determine sets of size k to check.
• Important: Apriori finds every set K such that P(K) > t.

Next slide: The structure of the problem can be organized in a lattice.

FREQUENCY DEPENDENCE

We can use two properties to develop an algorithm for efficiently counting.

1. If the set K is not frequent enough, then K′ = K ∪ A with A ⊂ {1, ..., p} is not frequent enough either. In other words: P(K) < t implies P(K′) < t.

e.g., Let K = {a, b}. If these items appear together in x baskets, then the set of items K′ = {a, b, c} appears in ≤ x baskets since K ⊂ K′.

Mathematically: P(K′) = P(K, A) = P(A|K)P(K) ≤ P(K) < t

2. By the contrapositive, if P(K) > t and A ⊂ K, then P(A) ≥ P(K) > t.

APRIORI ALGORITHM (ONE VERSION)

Here is a basic version of the algorithm. It can be improved in clever ways.

Apriori algorithm
Set a threshold N · t, where 0 < t < 1 (but relatively small).

1. |K| = 1: Check each object and keep those that appear in ≥ N · t baskets.
2. |K| = 2: Check all pairs of objects that survived Step 1 and keep the sets that appear in ≥ N · t baskets.
...
k. |K| = k: Using all sets of size k − 1 that appear in ≥ N · t baskets,
   • Increment each set with an object surviving Step 1 not already in the set.
   • Keep all sets that appear in ≥ N · t baskets.

As k increases, we can hope that the number of surviving sets decreases. At some k < p, no sets will survive and we’re done.
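The basic version above can be sketched in Python as follows. This is a minimal, unoptimized illustration rather than a reference implementation: the function name `apriori`, the use of `frozenset` to deduplicate candidates, and the toy baskets are all assumptions made for the example.

```python
def apriori(baskets, t):
    """Return {itemset: count} for every item set found in >= N * t baskets."""
    baskets = [frozenset(b) for b in baskets]
    min_count = len(baskets) * t

    def count(itemset):
        # Brute-force scan: number of baskets containing the whole item set.
        return sum(1 for b in baskets if itemset <= b)

    # Step 1: keep single items that appear often enough.
    all_items = sorted({item for b in baskets for item in b})
    survivors = {}
    for item in all_items:
        c = count(frozenset([item]))
        if c >= min_count:
            survivors[frozenset([item])] = c
    frequent_items = [next(iter(s)) for s in survivors]
    result = dict(survivors)

    # Step k: grow each surviving set of size k - 1 by one Step-1 survivor.
    while survivors:
        candidates = {
            s | {item}
            for s in survivors
            for item in frequent_items
            if item not in s
        }  # a set of frozensets, so each candidate is counted only once
        survivors = {}
        for cand in candidates:
            c = count(cand)
            if c >= min_count:
                survivors[cand] = c
        result.update(survivors)
    return result


# Hypothetical baskets (illustrative only).
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

frequent = apriori(baskets, t=0.5)
for itemset, c in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), c)
```

With t = 0.5 on this toy data, four single items and four pairs survive, no triple does, and the loop stops at k = 3.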

MORE CONSIDERATIONS

We can show that this algorithm returns every set K for which P(K) > t.
• Imagine we know every set of size k − 1 for which P(K) > t. Then every potential set of size k that could have P(K) > t will be checked.
  e.g. Let k = 3: The set {a, b, c} appears in > N · t baskets. Will we check it?
  Known: {a, b} and {c} must appear in > N · t baskets.
  Assumption: We’ve found K = {a, b} as a set satisfying P(K) > t.
  Apriori algorithm: We know P({c}) > t and so will check {a, b} ∪ {c}.
  Induction: We have all |K| = 1 by brute-force search (start induction).
• As written, this can lead to duplicate sets for checking, e.g., {a, b} ∪ {c} and {a, c} ∪ {b}. Indexing methods can ensure we create {a, b, c} once.
• For each proposed K, should we iterate through each basket for checking? There are tricks that take structure into account to make this faster.
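One possible indexing method for avoiding duplicate candidates is sketched below. It is an assumption for illustration, not necessarily the scheme the lecture has in mind: store each set as a sorted tuple and extend it only with items that sort after its last element. Every frequent set of size k is still generated, because its first k − 1 elements (in sorted order) form a frequent set of size k − 1. The function name `extend_once` is hypothetical.

```python
def extend_once(survivors, frequent_items):
    """Build size-k candidates from sorted tuples of size k - 1, each exactly once."""
    frequent_items = sorted(frequent_items)
    candidates = []
    for s in survivors:
        for item in frequent_items:
            # Only append items larger than the current last element, so
            # {a, b, c} is created from (a, b) + c and never from (a, c) + b.
            if item > s[-1]:
                candidates.append(s + (item,))
    return candidates


# Hypothetical survivors of size 2 over the frequent items a, b, c.
pairs = [("a", "b"), ("a", "c"), ("b", "c")]
triples = extend_once(pairs, ["a", "b", "c"])
print(triples)  # [('a', 'b', 'c')]
```

Because each candidate arises from exactly one parent, no deduplication step is needed before counting.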

EXAMPLE

Data
N = 6876 questionnaires
14 questions coded into p = 50 items
For example:
• ordinal (2 items): Pick the item based on the value being ≶ median
• categorical: item = category, so x categories → x items

• Based on the item encoding, it’s clear that no “basket” can have every item.

• We see that association analysis extends to more than consumer analysis.
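The item encoding described above can be sketched as follows. The questions, answers, and item labels are hypothetical, and sending values equal to the median to the upper item is an assumption, since the slide does not say how ties are handled.

```python
from statistics import median


def encode_ordinal(name, values):
    """Two items per ordinal question: below the median, or at/above it."""
    m = median(values)
    return [f"{name}<median" if v < m else f"{name}>=median" for v in values]


def encode_categorical(name, values):
    """One item per category: x categories give x possible items."""
    return [f"{name}={v}" for v in values]


# Hypothetical answers from five respondents to two questions.
ages = [23, 35, 47, 51, 62]
housing = ["rent", "own", "own", "rent", "family"]

# Each respondent's "basket" holds exactly one item per question.
baskets = [
    {age_item, housing_item}
    for age_item, housing_item in zip(
        encode_ordinal("age", ages), encode_categorical("housing", housing)
    )
]

for basket in baskets:
    print(sorted(basket))
```

Since every question contributes exactly one item to a respondent's basket, a basket can never contain both items of an ordinal question or two categories of the same question, which is why no basket can hold every item.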
