






































intro Material Type: Notes; Professor: Han; Class: Introduction to Data Mining; Subject: Computer Science; University: University of Illinois - Urbana-Champaign; Term: Fall 2017;







































“We are living in the information age” is a popular saying; however, we are actually living in the data age. Terabytes or petabytes^1 of data pour into our computer networks, the World Wide Web (WWW), and various data storage devices every day from business, science and engineering, medicine, and almost every other aspect of our daily life. This explosive growth of available data volume is a result of the computerization of our society and the fast development of powerful data collection and storage tools. Businesses worldwide generate gigantic data sets, including sales transactions, stock trading records, product descriptions, sales promotions, company profiles and performance, and customer feedback. For example, large stores like Wal-Mart handle hundreds of millions of transactions per week at thousands of branches around the world. Scientific and engineering practices continuously generate high orders of petabytes of data from remote sensing, process measuring, scientific experiments, system performance, engineering observations, and environment surveillance. Global backbone telecommunication networks carry tens of petabytes of data traffic every day. The medical and health industry generates tremendous amounts of data from medical records, patient monitoring, and medical images. Web search engines support billions of Web searches and process tens of petabytes of data every day. Communities and social media have become increasingly important data sources, producing digital pictures and videos, Web blogs, Web communities, and various kinds of social networks. The list of sources generating huge amounts of data is endless. This explosively growing, widely available, and gigantic body of data makes our time truly the data age. Powerful and versatile tools are badly needed to automatically uncover valuable information from such tremendous amounts of data, and to transform such data into organized knowledge. This necessity led to the birth of data mining.
The field is young and promising. Data mining has made, and will continue to make, great strides in our journey from the data age towards the coming information age.
Example 1.1 Data mining turns a large collection of data into knowledge. A Web search engine like Google receives hundreds of millions of queries every day. Each query can be viewed as a transaction where the user describes her/his information need. What novel and useful knowledge can a Web search engine learn from such a huge collection of search queries collected from users over time? Interestingly, some patterns found in user search queries can disclose invaluable knowledge that cannot be obtained by reading individual data items alone. For example, Google Flu Trends uses specific search terms as indicators of flu activity. It found a close relationship between the number of people who search for flu-related information and the number of people who actually have flu symptoms. A pattern emerges when all of the search queries related to flu are aggregated. Using aggregated Google search data, Google Flu Trends can estimate flu activity up to two weeks faster than traditional systems^2. This example shows how data mining can turn a large collection of data into knowledge that can help meet a global challenge of our times.
(^1) A petabyte is a unit of information or computer storage equal to one quadrillion bytes, or 1000 terabytes, or 1,000,000 gigabytes.
Data mining can be viewed as a result of the natural evolution of information technology. The database and data management industry evolved in the development of several critical functionalities (Figure 1.1): data collection and database creation, data management (including data storage and retrieval, and database transaction processing), and advanced data analysis (involving data warehousing and data mining). The early development of data collection and database creation mechanisms served as a prerequisite for the later development of effective mechanisms for data storage and retrieval, as well as query and transaction processing. Nowadays numerous database systems offer query and transaction processing as common practice. Advanced data analysis has naturally become the next step.
Since the 1960s, database and information technology has evolved systematically from primitive file processing systems to sophisticated and powerful database systems. The research and development in database systems since the 1970s progressed from early hierarchical and network database systems to relational database systems (where data are stored in relational table structures; see Section 1.3.1), data modeling tools, and indexing and accessing methods. In addition, users gained convenient and flexible data access through query languages, user interfaces, query optimization, and transaction management. Efficient methods for online transaction processing (OLTP), where a query is viewed as a read-only transaction, contributed substantially to the evolution and wide acceptance of relational technology as a major tool for efficient storage, retrieval, and management of large amounts of data.
After the establishment of database management systems, database technology moved towards the development of advanced database systems, data warehousing and data mining for advanced data analysis, and Web-based databases.
Advanced database systems, for example, resulted from an upsurge of research from the mid-1980s onwards. These systems incorporate new and powerful data models such as extended-relational, object-oriented, object-relational, and deductive models. Application-oriented database systems have flourished, including spatial, temporal, multimedia, active, stream and sensor, scientific and engineering databases, knowledge bases, and office information bases. Issues related to the distribution, diversification, and sharing of data have been extensively studied.
Advanced data analysis sprang up from the late 1980s onwards. The steady and dazzling progress of computer hardware technology in the past three decades led to large supplies of powerful and affordable computers, data collection equipment, and storage media. This technology provides a great boost to the database and information industry, and makes a huge number of databases and information repositories available for transaction management, information retrieval, and data analysis. Data can now be stored in many different kinds of databases and information repositories. One emerging data repository architecture is the data warehouse (Section 1.3.2). A data warehouse is a repository of multiple heterogeneous data sources organized under a unified schema at a single site in order to facilitate management decision making. Data warehouse technology includes data cleaning, data integration, and online analytical processing (OLAP), that is, analysis techniques with functionalities such as summarization, consolidation, and aggregation as well as the ability to view information from different angles. Although OLAP tools support multidimensional analysis and decision making, additional data analysis tools are required for in-depth analysis, such as data mining tools that provide data classification, clustering, outlier/exception detection, and the characterization of changes in data over time.
Huge volumes of data have been accumulated beyond databases and data warehouses. During the 1990s, the World Wide Web and Web-based databases (such as XML databases) began to appear. Heterogeneous database systems and Internet-based global information bases such as the World Wide Web have emerged and play a vital role in the information industry. Analyzing such different forms of data effectively and efficiently, by integrating information retrieval, data mining, and information network analysis technologies, is a challenging task.
(^2) This is reported in [GMP+09].
Figure 1.2: We are data rich, but information poor. (Text in the figure: “How can I analyze this data?”)
In summary, the abundance of data, coupled with the need for powerful data analysis tools, has been described as a data rich but information poor situation (Figure 1.2). The fast-growing, tremendous amount of data, collected and stored in large and numerous data repositories, has far exceeded our human ability for comprehension without powerful tools. As a result, data collected in large data repositories become “data tombs”, that is, data archives that are seldom visited. Consequently, important decisions are often made based not on the information-rich data stored in data repositories, but rather on a decision maker’s intuition, simply because the decision maker does not have the tools to extract the valuable knowledge embedded in the vast amounts of data. Some efforts have been made to develop expert system and knowledge-base technologies, which typically rely on users or domain experts to manually input knowledge into knowledge bases. Unfortunately, this manual knowledge input procedure is prone to biases and errors, and is extremely costly and time-consuming. The widening gap between data and information calls for the systematic development of data mining tools that can turn data tombs into “golden nuggets” of knowledge.
1.2 What Is Data Mining?
It is no surprise that data mining, as a truly interdisciplinary subject, can be defined in many different ways. Even the term “data mining” itself does not really present all major components in the picture. To refer to the mining of gold from rocks or sand, we say gold mining instead of rock or sand mining. By analogy, data mining should have been more appropriately named “knowledge mining from data”, which is unfortunately somewhat long. However, the shorter term, “knowledge mining”, may not reflect the emphasis on mining from large amounts of data. Nevertheless, mining is a vivid term characterizing the process that finds a small set of precious nuggets from a great deal of raw material (Figure 1.3). Thus, such a misnomer carrying both “data” and “mining” became a popular choice. In addition, many other terms have a meaning similar to data mining, such as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD, while others view data mining as merely an essential step in the process of knowledge discovery. The knowledge discovery process is depicted in Figure 1.4 as an iterative sequence of the following steps:
Figure 1.4: Data mining as a step in the process of knowledge discovery.
from large amounts of data. The data sources can include databases, data warehouses, the Web, other information repositories, or data that are streamed into the system dynamically.
1.3 What Kinds of Data Can Be Mined?
As a general technology, data mining can be applied to any kind of data as long as the data are meaningful for a target application. The most basic forms of data for mining applications are database data (Section 1.3.1), data warehouse data (Section 1.3.2), and transactional data (Section 1.3.3). The concepts and techniques presented in this first volume of the book focus on such data. Data mining can also be applied to other forms of data, such as data streams, ordered/sequence data, graph or networked data, spatial data, text data, multimedia data, and the World Wide Web. We present an overview of such data in Section 1.3.4. Techniques for mining on these kinds of data are covered in volume two of this book. Data mining will certainly continue to embrace new data types as they emerge.
customer (cust_ID, name, address, age, occupation, annual_income, credit_information, category, ...)
item (item_ID, brand, category, type, price, place_made, supplier, cost, ...)
employee (empl_ID, name, category, group, salary, commission, ...)
branch (branch_ID, name, address, ...)
purchases (trans_ID, cust_ID, empl_ID, date, time, method_paid, amount)
items_sold (trans_ID, item_ID, qty)
works_at (empl_ID, branch_ID)
Figure 1.5: Relational schema for a relational database AllElectronics.
A database system, also called a database management system (DBMS for short), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data. The software programs provide mechanisms for defining database structures and data storage; for specifying and managing concurrent, shared, or distributed data access; and for ensuring consistency and security of the information stored despite system crashes or attempts at unauthorized access. A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows). Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values. A semantic data model, such as an entity-relationship (ER) data model, is often constructed for relational databases. An ER data model represents the database as a set of entities and their relationships.
Example 1.2 A relational database for AllElectronics. The fictitious AllElectronics store is used to illustrate concepts throughout the text. The company is described by the following relation tables: customer, item, employee, and branch. Headers of the tables described here are shown in Figure 1.5. (A header is also called the schema of a relation.)
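To make the idea of tables, tuples, and keys concrete, the following is a minimal sketch in Python using the standard sqlite3 module. It builds a reduced version of the Figure 1.5 schema in memory and runs one aggregate query. The column subset, the underscore spelling of the names, the helper units_sold_by_type, and every row of data are assumptions made for illustration; they are not taken from the text.

```python
import sqlite3

# In-memory database. Table and column names loosely follow Figure 1.5;
# all rows below are invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_ID INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    CREATE TABLE item (item_ID INTEGER PRIMARY KEY, brand TEXT, type TEXT, price REAL);
    CREATE TABLE purchases (trans_ID INTEGER PRIMARY KEY, cust_ID INTEGER, amount REAL);
    CREATE TABLE items_sold (trans_ID INTEGER, item_ID INTEGER, qty INTEGER);
""")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                 [(1, "Smith", 31), (2, "Lee", 45)])
conn.executemany("INSERT INTO item VALUES (?, ?, ?, ?)",
                 [(10, "BrandA", "computer", 900.0), (11, "BrandB", "phone", 300.0)])
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)",
                 [(100, 1, 1200.0), (101, 2, 300.0)])
conn.executemany("INSERT INTO items_sold VALUES (?, ?, ?)",
                 [(100, 10, 1), (100, 11, 1), (101, 11, 1)])


def units_sold_by_type(connection):
    """Join items_sold with item on the item_ID key and total qty per item type."""
    rows = connection.execute("""
        SELECT item.type, SUM(items_sold.qty)
        FROM items_sold JOIN item ON items_sold.item_ID = item.item_ID
        GROUP BY item.type
        ORDER BY item.type
    """).fetchall()
    return dict(rows)


print(units_sold_by_type(conn))
```

Each row inserted above is a tuple identified by its key (for example, cust_ID), and the join relies on item_ID appearing in both items_sold and item.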
Figure 1.6: Typical framework of a data warehouse for AllElectronics. (Figure not reproduced: data sources in Chicago, Toronto, Vancouver, and New York are cleaned, integrated, transformed, loaded, and refreshed into the data warehouse, which clients access through query and analysis tools.)
past 5–10 years, and is typically summarized. For example, rather than storing the details of each sales transaction, the data warehouse may store a summary of the transactions per item type for each store or, summarized to a higher level, for each sales region. A data warehouse is usually modeled by a multidimensional data structure, called a data cube, in which each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure, such as count or sum(sales amount). A data cube provides a multidimensional view of data and allows the precomputation and fast access of summarized data.
Example 1.3 A data cube for AllElectronics. A data cube for summarized sales data of AllElectronics is presented in Figure 1.7(a). The cube has three dimensions: address (with city values Chicago, New York, Toronto, Vancouver), time (with quarter values Q1, Q2, Q3, Q4), and item (with item type values home entertainment, computer, phone, security). The aggregate value stored in each cell of the cube is sales amount (in thousands). For example, the total sales for the first quarter, Q1, for the items related to security systems in Vancouver is $400,000, as stored in cell 〈Vancouver, Q1, security〉. Additional cubes may be used to store aggregate sums over each dimension, corresponding to the aggregate values obtained using different SQL group-bys (e.g., the total sales amount per city and quarter, or per city and item, or per quarter and item, or per each individual dimension).
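The group-by aggregates mentioned in Example 1.3 can be sketched with a plain Python dictionary standing in for the cube. This is only an illustration under stated assumptions: the names cells, DIMENSIONS, and group_by are hypothetical, and apart from the 〈Vancouver, Q1, security〉 value of 400 given in the example, all cell values are invented.

```python
from collections import defaultdict

# Base cells keyed by (city, quarter, item_type); values are sales in thousands.
# Only the Vancouver/Q1/security value (400) comes from Example 1.3; the other
# numbers are made up for illustration.
cells = {
    ("Vancouver", "Q1", "security"): 400,
    ("Vancouver", "Q1", "computer"): 825,
    ("Vancouver", "Q2", "security"): 310,
    ("Toronto", "Q1", "security"): 120,
    ("Toronto", "Q2", "computer"): 540,
}
DIMENSIONS = ("address", "time", "item")


def group_by(cube, keep):
    """Sum the measure over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for key, value in cube.items():
        totals[tuple(key[p] for p in positions)] += value
    return dict(totals)


per_city_quarter = group_by(cells, ("address", "time"))  # like GROUP BY city, quarter
per_city = group_by(cells, ("address",))                 # like GROUP BY city
grand_total = group_by(cells, ())                        # no grouping: one total
print(per_city_quarter, per_city, grand_total)
```

A real data warehouse would typically precompute and store such aggregates rather than recompute them on each request; the sketch recomputes them only to keep the example short.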
By providing multidimensional data views and the precomputation of summarized data, data warehouse systems can provide inherent support for online analytical processing (OLAP). OLAP operations make use of background knowledge regarding the domain of the data being studied in order to allow the presentation of data at different levels of abstraction. Such operations accommodate different user viewpoints. Examples of OLAP operations include drill-down and roll-up, which allow the user to view the data at differing degrees of summarization, as illustrated in Figure 1.7(b). For instance, we can drill down on sales data summarized by quarter to see the data summarized by month. Similarly, we can roll up on sales data summarized by city to view the data summarized by country.
Figure 1.7: A multidimensional data cube, commonly used for data warehousing, (a) showing summarized data for AllElectronics and (b) showing summarized data resulting from drill-down and roll-up operations on the cube in (a). For improved readability, only some of the cube cell values are shown. (Figure not reproduced. Axes in (a): address (cities), time (quarters), item (types). In (b), the drill-down on time shows months Jan, Feb, March, and the roll-up on address shows countries USA and Canada.)
Although data warehouse tools help support data analysis, additional tools for data mining are often needed for in-depth analysis. Multidimensional data mining (also called exploratory multidimensional data mining) performs data mining in multidimensional space in an OLAP style. That is, it allows the exploration of multiple combinations of dimensions at varying levels of granularity in data mining, and thus has greater potential for discovering interesting patterns representing knowledge. An overview of data warehouse and OLAP technology is provided in Chapter 4. Advanced issues regarding data
rather different semantic meanings. Such kinds of data can be seen in many applications, such as time-related or sequence data (e.g., historical records, stock exchange data, time-series and biological sequence data), data streams (e.g., video surveillance and sensor data, which are continuously transmitted), spatial data (such as maps), engineering design data (such as the design of buildings, system components, or integrated circuits), hypertext and multimedia data (including text, image, video, and audio data), graph and networked data (such as social and information networks), and the World Wide Web (a huge, widely distributed information repository made available by the Internet). These applications bring about new challenges, like how to handle data carrying special structures (such as sequences, trees, graphs and networks) and specific semantics (such as ordering, image, audio and video contents, and connectivity), and how to mine patterns that carry rich structures and semantics.
Various kinds of knowledge can be mined from such kinds of data. Here, we list just a few. Regarding temporal data, for instance, we can mine banking data for trends of changes, which may aid in the scheduling of bank tellers according to the volume of customer traffic. Stock exchange data can be mined to uncover trends that could help you plan investment strategies (e.g., when is the best time to purchase AllElectronics stock?). We could mine computer network data streams to detect intrusions based on the anomaly of message flows, which may be discovered by clustering, dynamic construction of stream models, or comparing the current frequent patterns with those at a previous time. With spatial data, we may look for patterns that describe changes in metropolitan poverty rates based on city distances from major highways. The relationships among a set of spatial objects can be examined in order to discover which subsets of objects are spatially auto-correlated or associated.
By mining text data, such as the literature on the subject “data mining” from the past ten years, we can identify the evolution of hot topics in the field. By mining user comments on products (which are often submitted as short text messages), we can assess customer sentiments and understand how well a product is embraced by a market. From multimedia data, we can mine images to identify objects and classify them by assigning semantic labels or tags. By mining video data of a hockey game, we can detect video sequences corresponding to goals. Web mining can help us learn about the distribution of information on the Web in general, characterize and classify Web pages, and uncover Web dynamics and the association and other relationships among different Web pages, users, communities, and Web-based activities.
It is important to keep in mind that, in many applications, multiple types of data are present. For example, in Web mining, there often exist text data on Web pages, multimedia data (e.g., pictures and videos) on Web pages, graph data like Web graphs, and map data on some Web sites. In bioinformatics, genomic sequences, biological networks, and 3-D spatial structures of genomes may co-exist for certain biological objects. Mining multiple data sources of complex data often leads to fruitful findings due to the mutual enhancement and consolidation of such multiple sources. On the other hand, it is also challenging due to the difficulties in data cleaning and data integration, as well as the complex interactions among the multiple sources of such data.
While such data require sophisticated facilities for efficient storage, retrieval, and updating, they also provide fertile grounds and raise challenging research and implementation issues for data mining. Data mining on advanced types of data is the topic of volume two of this book.
1.4 What Kinds of Patterns Can Be Mined?
We have observed various types of data and information repositories on which data mining can be performed. Let us now examine the kinds of patterns that can be mined.
There are a number of data mining functionalities. These include characterization and discrimination (Section 1.4.1); the mining of frequent patterns, associations, and correlations (Section 1.4.2); classification and regression (Section 1.4.3); clustering analysis (Section 1.4.4); and outlier analysis (Section 1.4.5). Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. In general, such tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize properties of the data in a target data set. Predictive mining tasks perform induction on the current data in order to make predictions.
Data mining functionalities, and the kinds of patterns they can discover, are described below. In addition, Section 1.4.6 looks at what makes a pattern interesting. Interesting patterns represent knowledge.
Data entries can be associated with classes or concepts. For example, in the AllElectronics store, classes of items for sale include computers and printers, and concepts of customers include bigSpenders and budgetSpenders. It can be useful to describe individual classes and concepts in summarized, concise, and yet precise terms. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived via (1) data characterization, by summarizing the data of the class under study (often called the target class) in general terms, or (2) data discrimination, by comparison of the target class with one or a set of comparative classes (often called the contrasting classes), or (3) both data characterization and discrimination.
Data characterization is a summarization of the general characteristics or features of a target class of data. The data corresponding to the user-specified class are typically collected by a query. For example, to study the characteristics of software products whose sales increased by 10% last year, the data related to such products can be collected by executing an SQL query on the sales database.
There are several methods for effective data summarization and characterization. Simple data summaries based on statistical measures and plots are
features between the two classes.
Concept description, including characterization and discrimination, is described in Chapter 4.
Frequent patterns, as the name suggests, are patterns that occur frequently in data. There are many kinds of frequent patterns, including frequent itemsets, frequent subsequences (also known as sequential patterns), and frequent substructures. A frequent itemset typically refers to a set of items that frequently appear together in a transactional data set, such as milk and bread, which are frequently bought together in grocery stores by many customers. A frequently occurring subsequence, such as the pattern that customers tend to purchase first a PC, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern. A substructure can refer to different structural forms, such as graphs, trees, or lattices, which may be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a (frequent) structured pattern. Mining frequent patterns leads to the discovery of interesting associations and correlations within data.
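As a rough illustration of what “frequent itemset” means, the sketch below counts every pair of items across a handful of made-up grocery transactions and keeps the pairs that appear in at least a chosen fraction of them. It is a brute-force count, not one of the efficient algorithms the book discusses later; the transactions, the function name frequent_itemsets, and the 40% threshold are all assumptions for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical grocery transactions, each a set of purchased items.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"eggs"},
]


def frequent_itemsets(baskets, size, min_support):
    """Return {itemset: support} for itemsets of `size` items whose support
    (fraction of baskets containing the whole itemset) is >= min_support."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each itemset one canonical tuple form.
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1
    n = len(baskets)
    return {itemset: c / n for itemset, c in counts.items() if c / n >= min_support}


pairs = frequent_itemsets(transactions, size=2, min_support=0.4)
print(pairs)
```

Enumerating all combinations grows quickly with basket size, which is one reason dedicated frequent itemset mining algorithms exist.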
Example 1.7 Association analysis. Suppose, as a marketing manager at AllElectronics, you would like to know which items are frequently purchased together (i.e., within the same transaction). An example of such a rule, mined from the AllElectronics transactional database, is
buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]
where X is a variable representing a customer. A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that she will buy software as well. A 1% support means that 1% of all of the transactions under analysis show that computer and software are purchased together. This association rule involves a single attribute or predicate (i.e., buys) that repeats. Association rules that contain a single predicate are referred to as single-dimensional association rules. Dropping the predicate notation, the above rule can be written simply as “computer ⇒ software [1%, 50%]”. Suppose, instead, that we are given the AllElectronics relational database related to purchases. A data mining system may find association rules like
age(X, “20..29”) ∧ income(X, “40K..49K”) ⇒ buys(X, “laptop”) [support = 2%, confidence = 60%]
The rule indicates that of the AllElectronics customers under study, 2% are 20 to 29 years of age with an income of 40,000 to 49,000 and have purchased a laptop (computer) at AllElectronics. There is a probability of 60% that a customer in this age and income group will purchase a laptop. Note that this is an association involving more than one attribute or predicate (i.e., age, income, and buys). Adopting the terminology used in multidimensional databases, where each attribute is referred to as a dimension, the above rule can be referred to as a multidimensional association rule.
Typically, association rules are discarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold. Additional analysis can be performed to uncover interesting statistical correlations between associated attribute-value pairs.
Frequent itemset mining is a fundamental form of frequent pattern mining. The mining of frequent patterns, associations, and correlations is discussed in Chapters 6 and 7, where particular emphasis is placed on efficient algorithms for frequent itemset mining. Sequential pattern mining and structured pattern mining are considered advanced topics, which will be discussed in the second volume of this book.
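The support and confidence of a rule such as computer ⇒ software, and the threshold test described above, can be computed directly from transaction counts. The sketch below assumes a tiny invented set of baskets; the function names rule_metrics and is_strong and the threshold values are hypothetical and chosen only to show the arithmetic.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent.

    support    = fraction of all baskets containing antecedent and consequent
    confidence = fraction of antecedent-containing baskets that also contain
                 the consequent
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_both = sum(1 for b in baskets if antecedent | consequent <= b)
    n_antecedent = sum(1 for b in baskets if antecedent <= b)
    support = n_both / len(baskets)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


def is_strong(baskets, antecedent, consequent, min_support, min_confidence):
    """Keep a rule only if it meets both minimum thresholds."""
    support, confidence = rule_metrics(baskets, antecedent, consequent)
    return support >= min_support and confidence >= min_confidence


# Invented transactions for illustration only.
baskets = [
    {"computer", "software"},
    {"computer"},
    {"computer", "software", "printer"},
    {"printer"},
    {"computer", "printer"},
    {"software"},
    {"phone"},
    {"phone", "printer"},
]
support, confidence = rule_metrics(baskets, {"computer"}, {"software"})
print(support, confidence)
```

Here 2 of 8 baskets contain both items (support 25%), and 2 of the 4 baskets with a computer also contain software (confidence 50%).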
Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class labels are known).
“How is the derived model presented?” The derived model may be represented in various forms, such as classification rules (i.e., IF-THEN rules), decision trees, mathematical formulae, or neural networks (Figure 1.9). A decision tree is a flowchart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions. Decision trees can easily be converted to classification rules. A neural network, when used for classification, is typically a collection of neuron-like processing units with weighted connections between the units. There are many other methods for constructing classification models, such as naïve Bayesian classification, support vector machines, and k-nearest neighbor classification.
Whereas classification predicts categorical (discrete, unordered) labels, regression models continuous-valued functions. That is, regression is used to predict missing or unavailable numerical data values rather than (discrete) class labels. The term prediction refers to both numeric prediction and class label prediction. Regression analysis is a statistical methodology that is most often used for numeric prediction, although other methods exist as well. Regression also encompasses the identification of distribution trends based on the available data.
Classification and regression may need to be preceded by relevance analysis, which attempts to identify attributes that are significantly relevant to the classification and regression process.
Such attributes will be selected for the classification and regression process. The other attributes, which are irrelevant, can then be excluded from consideration.
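The statement that decision trees can easily be converted to classification rules can be illustrated with a small sketch. The tree below is hand-written rather than learned from training data, and its attributes (age, income), branch values, and class labels (A, B, C) are hypothetical, as are the function names classify and to_rules.

```python
# A hypothetical decision tree. An internal node is a pair
# (attribute, {outcome: subtree}); a leaf is a class label string.
tree = ("age", {
    "youth": ("income", {"high": "A", "low": "B"}),
    "middle_aged": "C",
    "senior": "C",
})


def classify(node, record):
    """Follow the branch matching the record's attribute value until a leaf."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[record[attribute]]
    return node


def to_rules(node, conditions=()):
    """Emit one IF-THEN rule per root-to-leaf path of the tree."""
    if isinstance(node, str):
        premise = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {premise} THEN class = {node}"]
    attribute, branches = node
    rules = []
    for outcome, subtree in branches.items():
        rules.extend(to_rules(subtree, conditions + ((attribute, outcome),)))
    return rules


print(classify(tree, {"age": "youth", "income": "high"}))
for rule in to_rules(tree):
    print(rule)
```

Each rule is the conjunction of the tests along one path from the root to a leaf, so the rule set and the tree assign the same class to any record.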