Data Mining and Machine Learning: A Comprehensive Overview, Lecture notes of Machine Learning

This document introduces the field of data mining, which involves using historical data to discover regularities and improve future decisions. It explains how the falling cost of large data storage devices, the development of efficient machine learning algorithms, and the falling cost of computational power have contributed to the growth of interest in data mining. It also gives examples of practical applications of data mining and of the algorithms used to learn rules. It was written by Tom M. Mitchell of the Center for Automated Learning and Discovery at Carnegie Mellon University in 1999.

To appear in Communications of the ACM, Vol. 42, No. 11, November 1999.
Machine Learning and Data Mining
Tom M. Mitchell
Center for Automated Learning and Discovery
School of Computer Science
Carnegie Mellon University
1 Introduction
Over the past decade many organizations have begun to routinely capture huge volumes of historical data describing their operations, their products, and their customers. At the same time, scientists and engineers in many fields find themselves capturing increasingly complex experimental datasets, such as the gigabytes of functional MRI data that describe brain activity in humans. The field of data mining addresses the question of how best to use this historical data to discover general regularities and to improve future decisions.

Data Mining: using historical data to discover regularities and improve future decisions.

The rapid growth of interest in data mining follows from the confluence of several recent trends: (1) the falling cost of large data storage devices and the increasing ease of collecting data over networks, (2) the development of robust and efficient machine learning algorithms to process this data, and (3) the falling cost of computational power, enabling the use of computationally intensive methods for data analysis.

The field of data mining, sometimes referred to as knowledge discovery from databases, machine learning, or advanced data analysis, has already produced highly practical applications in areas such as credit card fraud detection, medical outcomes analysis, predicting customer purchase behavior, predicting the interests of web users, and optimizing manufacturing processes. It has also led to a set of fascinating scientific questions about how computers might automatically learn from experience.

2 Data Mining Examples

A prototypical example of a data mining problem is illustrated in Figure 1. Here we are provided a set of historical data and asked to use this data to make improved decisions in the future. In this example the data consists of a set of medical records describing 9,714 pregnant women. The decision we wish to improve is our ability to identify future high risk pregnancies (more specifically, pregnancies that are at high risk of requiring an emergency Cesarean section delivery). In this database, each pregnant woman is described in terms of 215 distinct features, such as her age, whether this is a first pregnancy, whether she is diabetic, and so on. As shown in the top portion of Figure 1, these features together describe the evolution of the pregnancy over time.

The bottom portion of Figure 1 illustrates a typical result of data mining. It shows one of the rules that has been learned automatically from this set of data. This particular rule predicts a 60 percent risk of emergency C-section for mothers that exhibit a particular combination of three features (e.g., "no previous vaginal delivery") out of the 215 possible features. Among women known to exhibit these three features, the data indicates that 60 percent have historically given birth by emergency C-section. As summarized at the bottom of the figure, this regularity holds both over the training data used to formulate the rule, and over a separate set of test data used to verify the reliability of the rule over new data. Physicians may wish to consider this rule as a useful factual statement about past patients when they consider treatment of similar new patients.

What algorithms are used to learn rules such as the one in Figure 1? This rule was learned by a symbolic rule learning algorithm similar to Clark and Niblett's CN2 [3]. Decision tree learning algorithms such as Quinlan's C4.5 [9] are also frequently used to formulate rules of this type. When rules must be learned from extremely large data sets, specialized algorithms that stress computational efficiency may be used [1, 4]. Other machine learning algorithms commonly applied to this kind of data mining problem include neural networks [2], inductive logic programming [8], and Bayesian learning algorithms [5]. Mitchell's textbook [7] provides a description of a broad range of machine learning algorithms, as well as the statistical principles on which they are based. Although machine learning algorithms such as these are central to the

Data:

Patient103, time=1:
  Age: 23, FirstPregnancy: no, Anemia: no, Diabetes: no, PreviousPrematureBirth: no, Ultrasound: ?, Elective C-Section: ?, Emergency C-Section: ?, ...

Patient103, time=2:
  Age: 23, FirstPregnancy: no, Anemia: no, Diabetes: YES, PreviousPrematureBirth: no, Ultrasound: abnormal, Elective C-Section: no, Emergency C-Section: ?, ...

...

Patient103, time=n:
  Age: 23, FirstPregnancy: no, Anemia: no, Diabetes: no, PreviousPrematureBirth: no, Ultrasound: ?, Elective C-Section: no, Emergency C-Section: Yes, ...

Learned rule:

If   No previous vaginal delivery, and
     Abnormal 2nd Trimester Ultrasound, and
     Malpresentation at admission
Then Probability of Emergency C-Section is 0.6

Training set accuracy: 26/41 = 0.63
Test set accuracy: 12/20 = 0.60

Figure 1: A typical data mining application. A historical set of 9,714 medical records describes pregnant women over time. The top portion of the figure illustrates a typical patient record, where "?" indicates that the feature value is unknown. The task here is to identify classes of patients at high risk of receiving an emergency Cesarean section. The bottom portion of the figure shows one of many rules discovered from this data. Whereas 7% of all pregnant women in this dataset received emergency C-sections, this rule identifies a subclass at 60% risk.
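As a rough illustration of how the accuracy figures for such a rule can be obtained, the sketch below counts the records a rule covers and the fraction of those with the predicted outcome. The field names and the synthetic records are assumptions made for the example; only the counts 26/41 and 12/20 come from Figure 1, and this is not the rule learner used in the article.

```python
# Hypothetical feature names; the article gives the rule conditions only in prose.
RULE = {
    "previous_vaginal_delivery": False,
    "second_trimester_ultrasound": "abnormal",
    "malpresentation_at_admission": True,
}


def matches(record, rule):
    """Return True if the record has every feature value the rule requires."""
    return all(record.get(name) == value for name, value in rule.items())


def rule_accuracy(records, rule, outcome="emergency_c_section"):
    """Return (positives, covered) for the records that the rule covers."""
    covered = [r for r in records if matches(r, rule)]
    positives = sum(1 for r in covered if r[outcome])
    return positives, len(covered)


def make_records(n_positive, n_negative, n_other):
    """Build synthetic records (an assumption) with the requested counts."""
    records = [dict(RULE, emergency_c_section=True) for _ in range(n_positive)]
    records += [dict(RULE, emergency_c_section=False) for _ in range(n_negative)]
    # Records the rule does not cover: the first condition is violated.
    records += [
        dict(RULE, previous_vaginal_delivery=True, emergency_c_section=False)
        for _ in range(n_other)
    ]
    return records


train = make_records(26, 15, 100)  # 41 covered records, 26 with the outcome
test = make_records(12, 8, 50)     # 20 covered records, 12 with the outcome

for name, data in (("training", train), ("test", test)):
    positives, covered = rule_accuracy(data, RULE)
    print(f"{name} set accuracy: {positives}/{covered} = {positives / covered:.2f}")
```

Evaluating the same rule on a held-out test set, as done here, is what lets the figure claim that the regularity is not an artifact of the training data.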

Data:

Customer103 (time=t0):
  Years of credit: 9
  Loan balance: $2,
  Income: $52k
  Own House: Yes
  Other delinquent accts: 2
  Max billing cycles late: 3
  Repay loan?: ?
  ...

Customer103 (time=t1):
  Years of credit: 9
  Loan balance: $3,
  Income: ?
  Own House: Yes
  Other delinquent accts: 2
  Max billing cycles late: 4
  Repay loan?: ?
  ...

...

Customer103 (time=tn):
  Years of credit: 9
  Loan balance: $4,
  Income: ?
  Own House: Yes
  Other delinquent accts: 3
  Max billing cycles late: 6
  Repay loan?: No
  ...

Rules learned from synthesized data:

If   Other-Delinquent-Accounts > 2, and
     Number-Delinquent-Billing-Cycles > 1
Then Repay-Loan? = No

If   Other-Delinquent-Accounts = 0, and
     (Income > $30k) OR (Years-of-Credit > 3)
Then Repay-Loan? = Yes

Figure 2: Typical data and rules for credit risk analysis.
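The two rules in Figure 2 can be read as a small decision procedure. The sketch below is one possible encoding, under several assumptions: the dictionary keys are invented for the example, the second rule is parsed as "no other delinquent accounts AND (income above $30k OR more than 3 years of credit)", unknown values ("?") are represented as None, and "Max billing cycles late" is taken to be the quantity the first rule calls Number-Delinquent-Billing-Cycles.

```python
def predict_repay(customer):
    """Apply the two Figure 2 rules in order.

    Returns "No" or "Yes" when a rule fires, and None when neither applies.
    """
    delinquent = customer["other_delinquent_accounts"]
    late_cycles = customer["delinquent_billing_cycles"]
    income = customer.get("income")          # None means unknown ("?")
    years = customer.get("years_of_credit")  # None means unknown ("?")

    # Rule 1: several delinquent accounts and repeated late billing cycles.
    if delinquent > 2 and late_cycles > 1:
        return "No"

    # Rule 2: no delinquent accounts, plus sufficient income or credit history.
    good_income = income is not None and income > 30_000
    long_history = years is not None and years > 3
    if delinquent == 0 and (good_income or long_history):
        return "Yes"

    return None


# Customer103 at time tn, using the values shown in Figure 2.
customer_tn = {
    "other_delinquent_accounts": 3,
    "delinquent_billing_cycles": 6,
    "income": None,
    "years_of_credit": 9,
}
print(predict_repay(customer_tn))
```

Returning None when no rule fires keeps the sketch honest about coverage: a learned rule set typically covers only part of the data, and a real system would need a default prediction for the rest.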

is described by numeric or symbolic features, where the data does not contain text and image features interleaved with these numeric and symbolic features, and where the data has been carefully and cleanly collected with a particular decision making task in mind.

While this first generation of data mining algorithms is already of significant practical value, data mining methods are still in their infancy. We might well expect the next decade to produce an order of magnitude advance in the state of the art, through development of new algorithms that will accommodate dramatically more diverse sources and types of data, that will automate a broader range of the steps involved in the data mining process, and that will support mixed-initiative data mining in which human experts collaborate with the computer to form hypotheses and test them against the data.

To illustrate one important research issue, consider again the problem of predicting risk of emergency C-section for pregnant women. One key limitation of current data mining methods is that in fact they cannot utilize the full patient record that is already routinely captured in hospital medical records! This is because current hospital records for pregnant women often contain sequences of images (e.g., the ultrasound images taken during pregnancy), other raw instrument data (e.g., fetal distress monitors), text (e.g., the notes made by physicians during periodic checkups during pregnancy), and even speech (e.g., recordings of phone calls), in addition to the numeric and symbolic features described in Figure 1. Although our first generation data mining algorithms work well with the numeric and symbolic features, and although some learning algorithms are available for learning to classify images, or to classify text, the fact is that we currently lack effective algorithms for learning from data that is represented by a combination of these various media. As a result, the current state of the art in medical outcomes analysis is to ignore the image, text, and raw sensor portion of the medical record, or at best to summarize these in some oversimplified form (e.g., labeling the complex ultrasound image as simply "normal" or "abnormal"). However, it is clear that if predictions could be based on the full medical record, we would expect much greater prediction accuracy. Therefore, a topic of considerable current research interest is the development of algorithms that can learn regularities over rich, mixed media data. This issue is important in many data mining applications, ranging from mining historical equipment maintenance records, to mining records at customer call centers, to analyzing fMRI data on brain activity during different tasks.

This issue of learning from mixed media data is just one of many current research issues in data mining. The left hand side of Figure 4 lists a number of additional research topics, while the right hand side of this figure indicates a variety of applications for which these research issues are important. Below we discuss these additional research issues in turn:

• Optimizing decisions rather than predictions. The goal here is to use historical data to improve the choice of actions in addition to the more usual goal of predicting outcomes. For example, consider again the birth data set mentioned earlier. Although it is clearly helpful to learn to predict which women suffer a high risk of birth complications, it would be even more useful to learn which pre-emptive actions could be taken to reduce this risk. Similarly, in modeling bank customers it is one thing to predict which customers may close their accounts and move to a new bank, but even more useful to learn which actions may be taken to retain the customer before they depart. This problem of learning which actions achieve a desired outcome, given only previously acquired data, is much more subtle than it may first appear. The difficult issue is that the available data often represents a biased sample; for instance, whereas the data may show that mothers giving birth at home suffer fewer complications than women who give birth in the hospital, one cannot necessarily conclude that sending a woman home will reduce her risk of complications. This empirical regularity might instead be due to the fact that a disproportionate number of high risk women choose to give birth in the hospital. Thus, the problem of learning to choose actions raises important and basic questions such as how to learn from biased samples of data, and how to incorporate conjectures by human experts about the effectiveness of various intervention actions. If successful, this research will allow applying historical data much more directly to the questions faced by decision-makers.

• Scaling to extremely large data sets. Whereas most learning algorithms perform acceptably on datasets with tens of thousands of training examples, data sets such as large retail customer databases and the Hubble telescope data can easily reach a terabyte or more. To provide reasonably efficient data mining methods for such large data sets requires additional research. Research during the past few years has already

the accuracy of predictions can be improved by inventing a more appropriate set of features to describe the available data. For example, consider the problem of detecting the imminent failure of a piece of equipment based on the time series of sensor data collected from the equipment. It is easy to generate millions of features that describe this time series by taking differences, sums, ratios, averages, etc. of primitive sensor readings and previously defined features. Our conjecture is that given a sufficiently large and long-duration data set it should be feasible to automatically explore this large space of possible defined features in order to identify the small fraction of these features most useful for future learning. If successful, this work would lead to increased accuracy in many prediction problems, such as predicting equipment failure, customer attrition, credit repayment, medical outcomes, etc.
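The idea of generating derived features and keeping the most useful ones can be sketched in a few lines. This is a deliberately small illustration, not the method conjectured above: it assumes each row is a single snapshot of two hypothetical sensors (`temp` and `load`) rather than a full time series, derives only pairwise differences, sums, and ratios, and ranks features by absolute Pearson correlation with a made-up failure label.

```python
from itertools import combinations
from statistics import mean


def derive_features(readings):
    """Expand primitive readings {name: value} with pairwise derived features."""
    features = dict(readings)
    for a, b in combinations(sorted(readings), 2):
        features[f"{a}-{b}"] = readings[a] - readings[b]
        features[f"{a}+{b}"] = readings[a] + readings[b]
        if readings[b] != 0:
            features[f"{a}/{b}"] = readings[a] / readings[b]
    return features


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / (vx * vy) ** 0.5


def rank_features(rows, labels, top_k=3):
    """Return the top_k feature names by absolute correlation with the labels."""
    derived = [derive_features(row) for row in rows]
    names = set.intersection(*(set(d) for d in derived))
    scores = {n: abs(pearson([d[n] for d in derived], labels)) for n in names}
    # Sort by score (descending), then by name, so the ranking is deterministic.
    return sorted(scores, key=lambda n: (-scores[n], n))[:top_k]


# Synthetic data (an assumption): failures occur when temp is high relative to load.
rows = [
    {"temp": 60, "load": 30}, {"temp": 90, "load": 30},
    {"temp": 40, "load": 20}, {"temp": 60, "load": 20},
    {"temp": 80, "load": 40}, {"temp": 120, "load": 40},
]
labels = [0, 1, 0, 1, 0, 1]

print(rank_features(rows, labels))
```

In this toy data neither primitive reading separates the failures cleanly, while the derived ratio does, which is the effect the paragraph describes. Applying `derive_features` repeatedly to its own output would grow the feature space quickly, which is why the selection step matters.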

There are many other directions of active research as well, including work on how to provide more useful data visualization tools, how to support mixed-initiative human-machine exploration of large data sets, and how to reduce the effort needed for data warehousing and for combining information from different legacy databases. Still, the interesting fact is that even current first-generation approaches to data mining are being put to routine use by many organizations, producing important gains in many applications.

We might speculate that as the future of this field unfolds, we will see several directions in which it will advance, including (1) new algorithms that learn more accurately, that are able to utilize data from dramatically more diverse data sources available over the internet and intranets, and that are able to incorporate more human input as they work, (2) integration of these data mining algorithms into standard database systems, and (3) an increasing effort within many organizations on capturing, warehousing and utilizing historical data to support evidence-based decision making.

We can also expect to see more universities react to the severe shortage of trained experts in this area by creating new academic programs for students wishing to specialize in data mining. In fact, several universities have recently announced graduate degree programs in data mining, machine learning, and computational statistics, including Carnegie Mellon University (see www.cs.cmu.edu/cald), University of California at Irvine (www.ics.uci.edu/gcounsel/masterreqs.html), George Mason University (van-

Scientific Issues, Basic Technologies:

• Learning from mixed media data, e.g., numeric, text, image, voice, sensor, ...
• Active experimentation, exploration
• Optimizing decisions rather than predictions
• Inventing new features to improve accuracy
• Learning from multiple databases and the world wide web

Applications:

• Medicine
• Manufacturing
• Marketing
• Public policy
• Intelligence analysis
• Financial

Figure 4: Research on basic scientific issues (left) will impact future data mining applications in many areas (right).

  8. Muggleton, S. (1995). Foundations of Inductive Logic Programming. Englewood Cliffs, NJ: Prentice Hall.
  9. Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.