Data Mining Notes 6: Frequent Patterns, Associations, and Correlations, Cheat Sheet of Computer Science

Mahatma Gandhi University Computer Science

Database Management System, DBMS Study Materials, Engineering Class handwritten notes, exam notes, previous year questions, PDF free download

Typology: Cheat Sheet

2021/2022

Uploaded on 12/30/2021

marianinu-antony 🇮🇳


6/30/2021

[Data Mining Notes 6] Mining frequent patterns, associations and correlations: basic concepts and methods - Programmer Sought

https://www.programmersought.com/article/93814060695/

6. Mining frequent patterns, associations and correlations: basic concepts and methods

Frequent patterns are patterns that appear frequently in a data set.

6.1 Basic concepts

Frequent pattern mining searches for recurring relationships in a given data set in order to find interesting associations or correlations between items in large transactional or relational data sets. A typical example is shopping basket analysis. Shopping basket analysis treats the universe as the set of all products, and each product has a Boolean variable indicating whether the product appears in the shopping basket. Each shopping basket is then represented by a Boolean vector. By analyzing the Boolean vectors, you can get purchase patterns that reflect frequent associations or simultaneous purchases. These patterns can be expressed in the form of association rules. For example, if customers who buy a computer also tend to buy antivirus software at the same time, this is expressed by the following rule: computer=>antivirus_software[support=2%;confidence=60%].

Rule support and confidence are two measures of rule interestingness, which reflect the usefulness and certainty of the discovered rules, respectively. An association rule is considered interesting if it meets both the minimum support threshold and the minimum confidence threshold. Rules that meet the minimum support threshold (min_sup) and the minimum confidence threshold (min_conf) at the same time are called strong rules.

A set of items is called an itemset, and an itemset containing k items is called a k-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset, also called the frequency, support count, or count of the itemset. The support of an itemset is also called its relative support, and the occurrence frequency is called its absolute support. If the relative support of itemset I meets the predefined minimum support threshold (that is, the absolute support of I meets the corresponding minimum support count threshold), then I is a frequent itemset.
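The definitions of support count, relative support, and confidence can be checked on a handful of baskets. The following is a minimal sketch; the five transactions and the helper names (`support_count`, `support`, `confidence`) are made up for illustration and do not come from the notes.

```python
# Toy transactions (hypothetical data, chosen only to illustrate the definitions).
transactions = [
    {"computer", "antivirus_software"},
    {"computer", "printer"},
    {"computer", "antivirus_software", "printer"},
    {"printer"},
    {"computer"},
]

def support_count(itemset, transactions):
    """Absolute support: number of transactions that contain every item."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Relative support: fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """confidence(A => B) = support_count(A and B together) / support_count(A)."""
    both = support_count(antecedent | consequent, transactions)
    return both / support_count(antecedent, transactions)

A = frozenset({"computer"})
B = frozenset({"antivirus_software"})
rule_support = support(A | B, transactions)       # 2 of the 5 baskets
rule_confidence = confidence(A, B, transactions)  # 2 of the 4 computer baskets
print(rule_support, rule_confidence)              # prints: 0.4 0.5
```

With min_sup=30% and min_conf=50%, this rule would count as strong; raising min_conf to 60% would reject it.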

[...] no more frequent k-itemsets can be found. Finding each L_k requires one full scan of the database. Using the Apriori property, L_k is found from L_{k-1} in two steps, a join step and a prune step:

Join step: To find L_k, join L_{k-1} with itself to generate a set of candidate k-itemsets, denoted C_k. Here, the algorithm assumes that the items in a transaction or itemset are sorted in lexicographic order.

Prune step: C_k is a superset of L_k. Its members may or may not be frequent, but all frequent k-itemsets are included in C_k.

The Apriori algorithm for mining frequent itemsets for Boolean association rules is as follows: [...]
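The join and prune steps described above can be sketched in Python. This is an illustrative sketch under stated assumptions, not the listing from the original notes: the function name `apriori`, the use of an absolute `min_count` threshold, and the toy transactions are all choices made here.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset: support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # L_1: count single items with one scan of the database.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)
    k = 2
    while level:
        # Represent L_{k-1} as sorted tuples so that prefixes can be compared.
        prev = sorted(tuple(sorted(s)) for s in level)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                # Join step: merge two (k-1)-itemsets sharing their first k-2 items.
                if prev[i][:-1] == prev[j][:-1]:
                    cand = frozenset(prev[i]) | frozenset(prev[j])
                    # Prune step: every (k-1)-subset must itself be frequent.
                    if all(frozenset(sub) in level
                           for sub in combinations(cand, k - 1)):
                        candidates.add(cand)
        # One full scan of the database counts the surviving candidates C_k.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(level)
        k += 1
    return frequent

# Hypothetical toy database; items must be mutually comparable (here: strings).
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "d"},
    {"b", "c"},
    {"a", "b", "c"},
]
frequent = apriori(transactions, min_count=2)
for itemset, count in sorted(frequent.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The loop stops at the first level for which no candidate reaches `min_count`, which matches the termination condition stated in the text.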

Generating association rules from frequent itemsets

Once the frequent itemsets have been found from the transactions in database D, strong association rules can be generated directly from them (strong association rules satisfy both minimum support and minimum confidence).

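Rule generation can be sketched as follows: for each frequent itemset, every nonempty proper subset s yields a candidate rule s => (itemset minus s), kept when its confidence reaches min_conf. The support counts below are hypothetical, and `generate_rules` is a name chosen here. The sketch assumes the table of support counts contains every subset of each frequent itemset, which the Apriori property guarantees for the output of a frequent itemset miner.

```python
from itertools import combinations

# Support counts of frequent itemsets (hypothetical values for illustration).
support_counts = {
    frozenset({"a"}): 5,
    frozenset({"b"}): 3,
    frozenset({"a", "b"}): 3,
}

def generate_rules(support_counts, min_conf):
    """Return a list of (antecedent, consequent, confidence) for strong rules."""
    rules = []
    for itemset, count in support_counts.items():
        if len(itemset) < 2:
            continue  # a rule needs a nonempty antecedent and consequent
        for r in range(1, len(itemset)):
            for subset in combinations(sorted(itemset), r):
                antecedent = frozenset(subset)
                # confidence(s => l - s) = support_count(l) / support_count(s)
                conf = count / support_counts[antecedent]
                if conf >= min_conf:
                    rules.append((antecedent, itemset - antecedent, conf))
    return rules

rules = generate_rules(support_counts, min_conf=0.7)
for antecedent, consequent, conf in rules:
    print(sorted(antecedent), "=>", sorted(consequent), conf)
```

Because the rules come from itemsets that are already frequent, minimum support holds automatically and only confidence has to be checked.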
Improving the efficiency of Apriori

Ways to improve the efficiency of Apriori-based mining: [...]

FP tree mining: Start with a frequent pattern of length 1 (the initial suffix pattern) and construct its conditional pattern base (a sub-database consisting of the set of prefix paths in the FP tree that co-occur with this suffix pattern). Then construct its conditional FP tree and mine that tree recursively. Pattern growth is achieved by concatenating the suffix pattern with the frequent patterns generated from the conditional FP tree. The FP-growth method thus converts the problem of finding long frequent patterns into recursively searching for shorter patterns in smaller conditional databases and then concatenating the suffix. Using the least frequent items as suffixes provides good selectivity and significantly reduces search overhead.

Algorithm: FP-Growth. Mine frequent patterns by pattern growth, using an FP tree.
Input: D, a transaction database; min_sup, the minimum support threshold.
Output: the complete set of frequent patterns.
Method:

Construct the FP tree according to the following steps:

A. Scan the transaction database D once. Collect the set F of frequent items and their support counts, and sort F in descending order of support count; the result is the list of frequent items L.

B. Create the root node of the FP tree and label it null. For each transaction Trans in D: select the frequent items in Trans and sort them according to the order of L. Let the sorted list of frequent items in Trans be [p|P], where p is the first element and P is the list of remaining elements. Call insert_tree([p|P],T). The procedure works as follows: if T has a child N such that N.item_name=p.item_name, then the count of N is increased by 1; otherwise, a new node N is created with its count set to 1, linked to its parent node T, and linked through the node-link structure to the nodes with the same item_name. If P is not empty, insert_tree(P,N) is called recursively.
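Steps A and B can be sketched in Python as below. This is a minimal sketch under assumptions: the class `FPNode`, the functions `build_fp_tree` and `item_total`, the tie-break by item name, and the toy transactions are all introduced here for illustration, and the recursive insert_tree is written as an equivalent loop.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> FPNode
        self.link = None    # next node carrying the same item (node-link)

def build_fp_tree(transactions, min_count):
    # Step A: one scan to count items; keep frequent ones in descending count
    # order (ties broken by item name, an arbitrary choice made here).
    counts = Counter(item for t in transactions for item in set(t))
    order = sorted((i for i in counts if counts[i] >= min_count),
                   key=lambda i: (-counts[i], i))
    rank = {item: r for r, item in enumerate(order)}
    # Step B: root labelled null (None here); insert each sorted transaction.
    root = FPNode(None, None)
    header = {}  # item -> most recently created node for that item
    for t in transactions:
        items = sorted((i for i in set(t) if i in rank), key=rank.get)
        node = root
        for item in items:  # loop form of insert_tree([p|P], T)
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                child.link = header.get(item)  # prepend to the node-link chain
                header[item] = child
            child.count += 1
            node = child
    return root, header, order

def item_total(header, item):
    """Sum the counts along one item's node-link chain."""
    total, node = 0, header.get(item)
    while node is not None:
        total += node.count
        node = node.link
    return total

# Hypothetical toy database.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "d"},
    {"b", "c"},
    {"a", "b", "c"},
]
root, header, order = build_fp_tree(transactions, min_count=2)
print(order, root.children["a"].count, item_total(header, "c"))
```

Transactions that share a prefix in the sorted order share a path in the tree, which is where the compression comes from; the node-link chain makes it possible to reach every occurrence of an item without traversing the whole tree.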

The FP tree is mined by calling FP_growth(FP_tree, null). The procedure is implemented as follows:

Procedure FP_growth(Tree, a)
  if Tree contains a single path P then
    for each combination (denoted b) of the nodes in path P
      generate pattern b ∪ a, with support_count equal to the minimum support count of the nodes in b;
  else for each a_i in the header table of Tree {
    generate pattern b = a_i ∪ a, with support_count = a_i.support_count;
    construct the conditional pattern base of b, and then the conditional FP tree Tree_b of b;
    if Tree_b ≠ ∅ then
      call FP_growth(Tree_b, b);
  }

Performance studies of the FP-growth method show that it is effective and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm.
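The recursion over conditional pattern bases can be illustrated with a simplified sketch. Note the simplification: each conditional database is kept here as a plain list of (prefix, count) pairs instead of a compressed FP tree, and the single-path shortcut of the procedure is not implemented. The pattern-growth logic (pick a suffix item, collect the prefixes that precede it, recurse) is the same; the names `pattern_growth` and `fp_growth` and the toy data are assumptions made for this example.

```python
from collections import Counter

def pattern_growth(pattern_base, min_count, suffix=frozenset()):
    """pattern_base: list of (items, count), items being a tuple in global order.
    Returns {frozenset: support_count} for frequent patterns that extend suffix."""
    counts = Counter()
    for items, n in pattern_base:
        for item in items:
            counts[item] += n
    result = {}
    for item, total in counts.items():
        if total < min_count:
            continue
        new_suffix = suffix | {item}
        result[new_suffix] = total
        # Conditional pattern base of item: what precedes it on each path.
        conditional = [(items[:items.index(item)], n)
                       for items, n in pattern_base if item in items]
        result.update(pattern_growth(conditional, min_count, new_suffix))
    return result

def fp_growth(transactions, min_count):
    freq = Counter(i for t in transactions for i in set(t))
    # Global order: descending support count, ties broken by item name.
    ranked = sorted(freq, key=lambda i: (-freq[i], i))
    rank = {i: r for r, i in enumerate(ranked)}
    base = [(tuple(sorted((i for i in set(t) if freq[i] >= min_count),
                          key=rank.get)), 1)
            for t in transactions]
    return pattern_growth(base, min_count)

# Hypothetical toy database.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "d"},
    {"b", "c"},
    {"a", "b", "c"},
]
patterns = fp_growth(transactions, min_count=2)
for itemset, count in sorted(patterns.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Each pattern is produced exactly once, because its lowest-ranked item is always chosen as the first suffix and the remaining items are found in the prefixes. No candidate itemsets are generated and tested against the full database, which is the main contrast with Apriori.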

[...] whether the newly found itemset is a subset of an already-found closed itemset with the same support.

6.3 Which patterns are interesting: pattern evaluation methods

Most association rule mining algorithms use the support-confidence framework. However, when mining with low support thresholds or mining for long patterns, many uninteresting rules are still generated, which is one of the bottlenecks for applications of association rule mining.

Strong rules are not necessarily interesting

The support-confidence framework is not sufficient to filter out uninteresting strong association rules; the correlation and implication between the itemsets also need to be measured.

From association analysis to correlation analysis

To identify interesting rules, correlation measures are used to extend the support-confidence framework of association rules. Correlation rules are then measured not only by support and confidence, but also by the correlation between itemsets A and B. Correlation measures include:

Lift: The occurrence of itemset A is independent of the occurrence of itemset B if P(A∪B)=P(A)P(B); otherwise, itemsets A and B are dependent and correlated as events. The lift between the occurrences of A and B is defined as: lift(A,B)=P(A∪B)/(P(A)P(B)). If lift(A,B)<1, the occurrence of A is negatively correlated with the occurrence of B; if lift(A,B)>1, then A and B are positively correlated, meaning that the occurrence of one implies the occurrence of the other; if lift(A,B)=1, then A and B are independent and there is no correlation between them.

[...]

For the four measures all-confidence, max-confidence, Kulczynski, and cosine, the value is influenced only by the supports of A, B, and A∪B, or more exactly by the conditional probabilities P(A|B) and P(B|A), and not by the total number of transactions. Another common property of these four measures is that each ranges over [0,1], and the larger the value, the closer the relationship between A and B.

Summary: Using only support and confidence to mine associations may generate a large number of rules, many of which may be uninteresting to users. Pattern interestingness measures can therefore be used to extend the support-confidence framework, which helps focus the mining on rules with strong pattern relationships.
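A small calculation shows why a strong rule can still be misleading and how lift exposes this. The counts below are hypothetical (10,000 transactions, 6,000 containing A, 7,500 containing B, 4,000 containing both), and the function name `lift` is chosen here.

```python
def lift(n_a, n_b, n_ab, n_total):
    """lift(A,B) = P(A and B together) / (P(A) * P(B)), from support counts."""
    p_a = n_a / n_total
    p_b = n_b / n_total
    p_ab = n_ab / n_total
    return p_ab / (p_a * p_b)

# Hypothetical counts for two itemsets A and B.
n_total, n_a, n_b, n_both = 10_000, 6_000, 7_500, 4_000

rule_confidence = n_both / n_a  # confidence of A => B
lift_value = lift(n_a, n_b, n_both, n_total)
print(round(rule_confidence, 3), round(lift_value, 3))
```

Here A => B has 40% support and about 67% confidence, so it would pass thresholds such as min_sup=30% and min_conf=60%. Yet B appears in 75% of all transactions, so knowing that A occurred actually lowers the chance of B; the lift of about 0.89, below 1, reports this negative correlation.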

6.4 Summary

The discovery of frequent patterns, associations, and correlations in large amounts of data is useful in selective marketing, decision analysis, and business management. A popular application area is shopping basket analysis, which [...]

Mining frequent patterns using the vertical data format (ECLAT) transforms a given transaction data set in horizontal data format, in the form of TID-itemset, into vertical data format, in the form of item-TID_set. It then mines the transformed data set by intersecting TID_sets, based on the Apriori property and additional optimization techniques such as diffset.

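The vertical-format idea can be sketched as follows: invert the database into item-to-TID_set form, then grow itemsets by intersecting TID_sets, where the size of an intersection is the support count. This is a minimal sketch; the function name `eclat` and the toy data are assumptions, and the diffset optimization is not implemented.

```python
def eclat(transactions, min_count):
    """transactions: dict mapping TID -> iterable of items.
    Returns {frozenset: support_count} for all frequent itemsets."""
    # Horizontal format (TID -> itemset) to vertical format (item -> TID_set).
    vertical = {}
    for tid, items in transactions.items():
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    frequent = {}

    def extend(prefix, candidates):
        # candidates: (item, TID_set of prefix plus that item), in a fixed order
        for idx, (item, tids) in enumerate(candidates):
            if len(tids) < min_count:
                continue  # Apriori property: no superset can be frequent
            itemset = prefix | {item}
            frequent[itemset] = len(tids)
            # Intersect with the TID_sets of later items to add one more item.
            narrower = [(other, tids & other_tids)
                        for other, other_tids in candidates[idx + 1:]]
            extend(itemset, narrower)

    extend(frozenset(), sorted(vertical.items(), key=lambda kv: kv[0]))
    return frequent

# Hypothetical toy database in horizontal format.
transactions = {
    1: {"a", "b", "c"},
    2: {"a", "b"},
    3: {"a", "d"},
    4: {"b", "c"},
    5: {"a", "b", "c"},
}
frequent = eclat(transactions, min_count=2)
for itemset, count in sorted(frequent.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

After the initial inversion, the database is never scanned again: every support count comes from a set intersection.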
Not all strong association rules are interesting. Pattern evaluation measures can be used to extend the support-confidence framework and identify the rules that are actually interesting. A measure is null-invariant if its value is free from the influence of null-transactions (that is, transactions that do not contain any of the itemsets under consideration). Among the many pattern evaluation measures are lift, χ², all-confidence, max-confidence, cosine, and Kulczynski. The last four are null-invariant. The Kulczynski measure and the imbalance ratio can be used together to present the pattern relationships among itemsets.
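Null-invariance can be seen numerically by holding the counts of A, B, and A∪B fixed and changing only the number of null-transactions. The sketch below uses the standard definitions of the four measures in terms of P(A|B) and P(B|A), and the imbalance ratio |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A∪B)); the function names and all counts are hypothetical.

```python
from math import sqrt

def measures(n_a, n_b, n_ab, n_total):
    """Pattern evaluation measures computed from support counts."""
    p_b_given_a = n_ab / n_a
    p_a_given_b = n_ab / n_b
    return {
        "lift": (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total)),
        "all_confidence": min(p_a_given_b, p_b_given_a),
        "max_confidence": max(p_a_given_b, p_b_given_a),
        "kulczynski": 0.5 * (p_a_given_b + p_b_given_a),
        "cosine": sqrt(p_a_given_b * p_b_given_a),
    }

def imbalance_ratio(n_a, n_b, n_ab):
    """IR(A,B): 0 when A and B are equally frequent, near 1 when very skewed."""
    return abs(n_a - n_b) / (n_a + n_b - n_ab)

# Same counts for A, B, and both; only the number of null-transactions differs.
few_nulls = measures(1_000, 1_000, 500, n_total=2_000)
many_nulls = measures(1_000, 1_000, 500, n_total=100_000)

for name in few_nulls:
    print(name, round(few_nulls[name], 3), round(many_nulls[name], 3))
print("imbalance_ratio", imbalance_ratio(1_000, 1_000, 500))
```

In this example lift changes from 1 to about 50 as null-transactions are added, while the four null-invariant measures stay at 0.5 in both settings. The imbalance ratio of 0 indicates that A and B are equally frequent, so a Kulczynski value of 0.5 here reflects a genuinely neutral relationship, not a skewed one.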

6.5 Project topics

The DBLP data set (http://www.informatik.unitrier.de/~ley/db/) includes more than 1 million articles published in computer science conferences and journals. Among these articles, many authors have co-authorship relationships. Propose a method to mine the closely related co-authorship relationships (i.e., authors who often write articles together). Then, based on the mining results and the pattern evaluation measures, analyze which measure is more effective.
