







This document introduces the field of data mining, which uses historical data to discover regularities and improve future decisions. It explains how the falling cost of large data storage devices, the development of efficient machine learning algorithms, and the falling cost of computational power have contributed to the growth of interest in data mining. The document also provides examples of practical applications of data mining and of the algorithms used to learn rules. It was written by Tom M. Mitchell of the Center for Automated Learning and Discovery at Carnegie Mellon University in 1999.








To appear in Communications of the ACM, Vol. 42, No. 11, November 1999.
Tom M. Mitchell, Center for Automated Learning and Discovery, School of Computer Science, Carnegie Mellon University
Over the past decade many organizations have begun to routinely capture huge volumes of historical data describing their operations, their products, and their customers. At the same time, scientists and engineers in many fields find themselves capturing increasingly complex experimental datasets, such as the gigabytes of functional MRI data that describe brain activity in humans. The field of data mining addresses the question of how best to use this historical data to discover general regularities and to improve future decisions.
Data Mining: using historical data to discover regularities and improve future decisions.
The rapid growth of interest in data mining follows from the confluence of several recent trends: (1) the falling cost of large data storage devices and the increasing ease of collecting data over networks, (2) the development of robust and efficient machine learning algorithms to process this data, and (3) the falling cost of computational power, enabling the use of computationally intensive methods for data analysis. The field of data mining, sometimes referred to as knowledge discovery from databases, machine learning, or advanced data analysis, has already produced highly practical applications in areas such as credit card fraud detection, medical outcomes analysis, predicting customer purchase behavior, predicting the interests of web users, and optimizing manufacturing processes. It has also led to a set of fascinating scientific questions about how computers might automatically learn from experience.
A prototypical example of a data mining problem is illustrated in Figure 1. Here we are provided a set of historical data and asked to use this data to make improved decisions in the future. In this example the data consists of a set of medical records describing 9,714 pregnant women. The decision we wish to improve is our ability to identify future high risk pregnancies (more specifically, pregnancies that are at high risk of requiring an emergency Cesarean section delivery). In this database, each pregnant woman is described in terms of 215 distinct features, such as her age, whether this is a first pregnancy, whether she is diabetic, and so on. As shown in the top portion of Figure 1, these features together describe the evolution of the pregnancy over time. The bottom portion of Figure 1 illustrates a typical result of data mining. It shows one of the rules that has been learned automatically from this set of data. This particular rule predicts a 60 percent risk of emergency C-section for mothers who exhibit a particular combination of three features (e.g., "no previous vaginal delivery") out of the 215 possible features. Among women known to exhibit these three features, the data indicates that 60 percent have historically given birth by emergency C-section. As summarized at the bottom of the figure, this regularity holds both over the training data used to formulate the rule, and over a separate set of test data used to verify the reliability of the rule over new data. Physicians may wish to consider this rule as a useful factual statement about past patients when they consider treatment of similar new patients.
What algorithms are used to learn rules such as the one in Figure 1? This rule was learned by a symbolic rule learning algorithm similar to Clark and Niblett's CN2 [3]. Decision tree learning algorithms such as Quinlan's C4.5 [9] are also frequently used to formulate rules of this type.
When rules must be learned from extremely large data sets, specialized algorithms that stress computational efficiency may be used [1, 4]. Other machine learning algorithms commonly applied to this kind of data mining problem include neural networks [2], inductive logic programming [8], and Bayesian learning algorithms [5]. Mitchell's textbook [7] provides a description of a broad range of machine learning algorithms, as well as the statistical principles on which they are based. Although machine learning algorithms such as these are central to the
Data:
Patient103                 time=1    time=2     ...   time=n
Age                        23        23               23
FirstPregnancy             no        no               no
Anemia                     no        no               no
Diabetes                   no        YES              no
PreviousPrematureBirth     no        no               no
Ultrasound                 ?         abnormal         ?
...
Elective C-Section         ?         no               no
Emergency C-Section        ?         ?                Yes
Learned rule:
If No previous vaginal delivery, and
   Abnormal 2nd Trimester Ultrasound, and
   Malpresentation at admission
Then Probability of Emergency C-Section is 0.6
Training set accuracy: 26/41 = 0.63
Test set accuracy: 12/20 = 0.60
Figure 1: A typical data mining application. A historical set of 9,714 medical records describes pregnant women over time. The top portion of the figure illustrates a typical patient record, where "?" indicates that the feature value is unknown. The task here is to identify classes of patients at high risk of receiving an emergency Cesarean section. The bottom portion of the figure shows one of many rules discovered from this data. Whereas 7% of all pregnant women in this dataset received emergency C-sections, this rule identifies a subclass at 60% risk.
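To make concrete how the accuracy figures for a rule like the one in Figure 1 are obtained, the following is a minimal Python sketch. The field names, the `Patient` class, and the helper `make_records` are hypothetical, and the records are synthetic, built only so that the counts match those reported in the figure (26/41 on training data, 12/20 on test data). It is not the rule learner itself, only the check applied to a rule once it has been proposed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    # Hypothetical field names standing in for three of the 215 features.
    no_previous_vaginal_delivery: bool
    abnormal_second_trimester_ultrasound: bool
    malpresentation_at_admission: bool
    emergency_c_section: bool  # the outcome the rule tries to predict


def rule_fires(p: Patient) -> bool:
    """True when all three preconditions of the Figure 1 rule hold."""
    return (p.no_previous_vaginal_delivery
            and p.abnormal_second_trimester_ultrasound
            and p.malpresentation_at_admission)


def rule_accuracy(patients):
    """Return (hits, covered): covered counts records matching the rule,
    hits counts the covered records that had an emergency C-section."""
    covered = [p for p in patients if rule_fires(p)]
    hits = sum(p.emergency_c_section for p in covered)
    return hits, len(covered)


def make_records(hits, covered, uncovered):
    """Build synthetic records (an assumption, not the real data set)."""
    matching = [Patient(True, True, True, i < hits) for i in range(covered)]
    others = [Patient(False, True, False, False) for _ in range(uncovered)]
    return matching + others


train = make_records(hits=26, covered=41, uncovered=100)
test = make_records(hits=12, covered=20, uncovered=50)

for name, data in (("training", train), ("test", test)):
    hits, covered = rule_accuracy(data)
    print(f"{name}: {hits}/{covered} = {hits / covered:.2f}")
```

Evaluating the rule separately on held-out test records is what supports the claim that the regularity is not an accident of the training data.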
Data:
Customer103                time=t0   time=t1   ...   time=tn
Years of credit            9         9               9
Loan balance               $2,       $3,             $4,
Income                     $52k      ?               ?
Own House                  Yes       Yes             Yes
Other delinquent accts     2         2               3
Max billing cycles late    3         4               6
...
Repay loan?                ?         ?               No
Rules learned from synthesized data:
If Other-Delinquent-Accounts > 2, and Number-Delinquent-Billing-Cycles > 1 Then Repay-Loan? = No
If Other-Delinquent-Accounts = 0, and (Income > $30k) OR (Years-of-Credit > 3) Then Repay-Loan? = Yes
Figure 2: Typical data and rules for credit risk analysis.
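The two rules in Figure 2 can be read as simple predicates over a customer record. The sketch below is one possible encoding in Python, under several assumptions that are not in the original: the function name `predict_repay` is hypothetical, the second rule is parsed as "no other delinquent accounts AND (income above $30k OR more than 3 years of credit)", the record field "Max billing cycles late" is taken to correspond to Number-Delinquent-Billing-Cycles, and an unknown value ("?") is treated as failing its test.

```python
from typing import Optional


def predict_repay(other_delinquent_accounts: int,
                  delinquent_billing_cycles: int,
                  income: Optional[float],
                  years_of_credit: Optional[float]) -> Optional[str]:
    """Apply the two Figure 2 rules; return "No", "Yes", or None
    when neither rule covers the customer."""
    # Rule 1: several delinquent accounts and late billing cycles.
    if other_delinquent_accounts > 2 and delinquent_billing_cycles > 1:
        return "No"
    # Rule 2: unknown values (None) count as not satisfying the test.
    income_ok = income is not None and income > 30_000
    credit_ok = years_of_credit is not None and years_of_credit > 3
    if other_delinquent_accounts == 0 and (income_ok or credit_ok):
        return "Yes"
    return None


# Customer103 at time tn: 3 other delinquent accounts, 6 billing cycles
# late, income unknown, 9 years of credit.
print(predict_repay(3, 6, None, 9))
# Customer103 at time t0: 2 other delinquent accounts, so neither rule fires.
print(predict_repay(2, 3, 52_000, 9))
```

A rule set like this need not cover every customer, which is why the function returns `None` when no rule applies rather than forcing a prediction.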
is described by numeric or symbolic features, where the data does not contain text and image features interleaved with these numeric and symbolic features, and where the data has been carefully and cleanly collected with a particular decision making task in mind.
While this first generation of data mining algorithms is already of significant practical value, data mining methods are still in their infancy. We might well expect the next decade to produce an order of magnitude advance in the state of the art, through development of new algorithms that will accommodate dramatically more diverse sources and types of data, that will automate a broader range of the steps involved in the data mining process, and that will support mixed-initiative data mining in which human experts collaborate with the computer to form hypotheses and test them against the data.
To illustrate one important research issue, consider again the problem of predicting risk of emergency C-section for pregnant women. One key limitation of current data mining methods is that in fact they cannot utilize the full patient record that is already routinely captured in hospital medical records! This is because current hospital records for pregnant women often contain sequences of images (e.g., the ultrasound images taken during pregnancy), other raw instrument data (e.g., fetal distress monitors), text (e.g., the notes made by physicians during periodic checkups during pregnancy), and even speech (e.g., recordings of phone calls), in addition to the numeric and symbolic features described in Figure 1. Although our first generation data mining algorithms work well with the numeric and symbolic features, and although some learning algorithms are available for learning to classify images, or to classify text, the fact is that we currently lack effective algorithms for learning from data that is represented by a combination of these various media.
As a result, the current state of the art in medical outcomes analysis is to ignore the image, text, and raw sensor portion of the medical record, or at best to summarize these in some oversimplified form (e.g., labeling the complex ultrasound image as simply "normal" or "abnormal"). However, it is clear that if predictions could be based on the full medical record, we would expect much greater prediction accuracy. Therefore, a topic of considerable current research interest is the development of algorithms that can learn regularities over rich, mixed media data. This issue is important in many data mining applications, ranging from mining historical equipment maintenance records, to mining records at customer call centers, to analyzing fMRI data on brain activity during different tasks.
This issue of learning from mixed media data is just one of many current research issues in data mining. The left hand side of Figure 4 lists a number of additional research topics, while the right hand side of this figure indicates a variety of applications for which these research issues are important. Below we discuss these additional research issues in turn:
Optimizing decisions rather than predictions. The goal here is to use historical data to improve the choice of actions in addition to the more usual goal of predicting outcomes. For example, consider again the birth data set mentioned earlier. Although it is clearly helpful to learn to predict which women suffer a high risk of birth complications, it would be even more useful to learn which pre-emptive actions could be taken to reduce this risk. Similarly, in modeling bank customers it is one thing to predict which customers may close their accounts and move to a new bank, but even more useful to learn which actions may be taken to retain the customer before they depart. This problem of learning which actions achieve a desired outcome, given only previously acquired data, is much more subtle than it may first appear. The difficult issue is that the available data often represents a biased sample; for instance, whereas the data may show that mothers giving birth at home suffer fewer complications than women who give birth in the hospital, one cannot necessarily conclude that sending a woman home will reduce her risk of complications. This empirical regularity might instead be due to the fact that a disproportionate number of high risk women choose to give birth in the hospital. Thus, the problem of learning to choose actions raises important and basic questions such as how to learn from biased samples of data, and how to incorporate conjectures by human experts about the effectiveness of various intervention actions. If successful, this research will allow applying historical data much more directly to the questions faced by decision-makers.
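The home-birth example can be illustrated with a small numeric sketch in Python. All counts below are invented purely for illustration and are not taken from any data set. They are chosen so that high risk women disproportionately give birth in the hospital. The function name `rate` is likewise hypothetical.

```python
# (risk_group, location) -> (births, complications).
# All numbers are invented for illustration only.
counts = {
    ("low", "home"): (900, 18),
    ("low", "hospital"): (100, 1),
    ("high", "home"): (100, 20),
    ("high", "hospital"): (900, 90),
}


def rate(location, risk=None):
    """Complication rate at a location, optionally within one risk group."""
    births = complications = 0
    for (group, loc), (n, c) in counts.items():
        if loc == location and (risk is None or group == risk):
            births += n
            complications += c
    return complications / births


# Pooled over all women, home births look safer ...
print(f"overall: home={rate('home'):.3f} hospital={rate('hospital'):.3f}")
# ... yet within each risk group the hospital has the lower rate.
for risk in ("low", "high"):
    print(f"{risk} risk: home={rate('home', risk):.3f} "
          f"hospital={rate('hospital', risk):.3f}")
```

With these invented counts the pooled comparison and the per-group comparison point in opposite directions, which is why a regularity observed in a biased sample cannot be read directly as advice about which action to take.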
Scaling to extremely large data sets. Whereas most learning algorithms perform acceptably on datasets with tens of thousands of training examples, data sets such as large retail customer databases and the Hubble telescope data can easily reach a terabyte or more. To provide reasonably efficient data mining methods for such large data sets requires additional research. Research during the past few years has already
the accuracy of predictions can be improved by inventing a more appropriate set of features to describe the available data. For example, consider the problem of detecting the imminent failure of a piece of equipment based on the time series of sensor data collected from the equipment. It is easy to generate millions of features that describe this time series by taking differences, sums, ratios, averages, etc. of primitive sensor readings and previously defined features. Our conjecture is that given a sufficiently large and long-duration data set it should be feasible to automatically explore this large space of possible defined features in order to identify the small fraction of these features most useful for future learning. If successful, this work would lead to increased accuracy in many prediction problems, such as predicting equipment failure, customer attrition, credit repayment, medical outcomes, etc.
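As a rough illustration of how derived features multiply, the Python sketch below generates a few differences, sums, ratios, and averages from raw sensor time series. The function name `derive_features`, the sensor names, and the readings are all hypothetical. The sketch covers only the generation step, not the search for the small fraction of useful features that the text conjectures is feasible.

```python
from itertools import combinations
from statistics import mean


def derive_features(readings: dict[str, list[float]]) -> dict[str, float]:
    """Build simple derived features from per-sensor time series.
    Each series is assumed to hold at least two readings."""
    features = {}
    # Per-sensor features: average, sum, and most recent difference.
    for name, series in readings.items():
        features[f"mean({name})"] = mean(series)
        features[f"sum({name})"] = sum(series)
        features[f"last_diff({name})"] = series[-1] - series[-2]
    # Cross-sensor features: ratio of averages for every sensor pair.
    for a, b in combinations(sorted(readings), 2):
        mean_b = mean(readings[b])
        if mean_b != 0:
            features[f"ratio({a},{b})"] = mean(readings[a]) / mean_b
    return features


# Hypothetical readings from two sensors on one piece of equipment.
readings = {
    "temperature": [70.0, 72.0, 75.0],
    "vibration": [0.1, 0.1, 0.4],
}
features = derive_features(readings)
for name, value in features.items():
    print(f"{name} = {value:.3f}")
```

Even this small generator yields 7 features from 2 sensors. Applying the same operators again to the derived features, and over many sensors and time windows, is how the count quickly reaches the millions mentioned above.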
There are many other directions of active research as well, including work on how to provide more useful data visualization tools, how to support mixed-initiative human-machine exploration of large data sets, and how to reduce the effort needed for data warehousing and for combining information from different legacy databases. Still, the interesting fact is that even current first-generation approaches to data mining are being put to routine use by many organizations, producing important gains in many applications.
We might speculate that as the future of this field unfolds, we will see several directions in which it will advance, including (1) new algorithms that learn more accurately, that are able to utilize data from dramatically more diverse data sources available over the internet and intranets, and that are able to incorporate more human input as they work, (2) integration of these data mining algorithms into standard database systems, and (3) an increasing effort within many organizations on capturing, warehousing and utilizing historical data to support evidence-based decision making.
We can also expect to see more universities react to the severe shortage of trained experts in this area, by creating new academic programs for students wishing to specialize in data mining. In fact, several universities have recently announced graduate degree programs in data mining, machine learning, and computational statistics, including Carnegie Mellon University (see www.cs.cmu.edu/cald), University of California at Irvine (www.ics.uci.edu/gcounsel/masterreqs.html), George Mason University (van-
Figure 4: Research on basic scientific issues (left) will impact future data mining applications in many areas (right).