EXECUTIVE SUMMARY
The purpose of our study is to carry out text mining with the Weka tool, covering text pre-processing, stop-word removal, tokenization and stemming. In addition, we identified distinct keywords, generated a document matrix and compared the similarity between documents. For this tutorial, we used two datasets named diseases and lifestyle. The diseases dataset contains related topics such as Cancer and Diabetes, which all form part of the disease category. For the unrelated dataset, quality, the unrelated topics include health, fashion, transportation, education and entertainment. This report analyses the various concepts of text mining and presents the results produced by our methodology to discover patterns that indicate trends or connections between the datasets.
1. INTRODUCTION
1.1 DISEASES DATASET
We created a Diseases dataset that consists of related subtopics such as Cancer, Diabetes, Haemodialysis, Tuberculosis and Pneumonia. These subtopics all form part of the category diseases. The purpose of creating the diseases dataset is to improve public health, to prevent illness, and to eliminate threats to physical and mental health. It covers various factors such as symptoms and factors contributing to death. We created five different text files for the diseases and merged them into a single .arff file to be loaded into Weka. Our dataset is shown below. The relation is named diseases and it has 2 attributes: text, which stores the string content, and diseases, which links each text to the appropriate disease.
- Pre-Processing [Removing stop words] Stop words are words that are filtered out before or after processing of natural language data (text). For some search engines, these are some of the most common, short function words, such as the, is, at, which, and on. Below are some of the stop words we removed from our diseases dataset, such as is, of, or and not, using the StringToWordVector filter.
- Tokenizing Data Tokenization, in text mining, is the process of splitting a sentence into individual words (tokens). For our dataset, we chose the WordTokenizer with delimiters .,;:'"()?! so that the text is split at these punctuation characters.
- Stemming Stemming reduces inflected or derived words to a common stem. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. For example, stemmed words in our diseases dataset include histor (for history) and immun (for immune).
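The three pre-processing steps above can be sketched outside Weka in a few lines of Python. This is only an illustration under stated assumptions: the stop-word set contains just the words named in this report, and `naive_stem` is a hypothetical toy suffix stripper, not the stemmer Weka applies. The function names are ours, not Weka's.

```python
import re

# Delimiters quoted in the report for the word tokenizer.
DELIMITERS = ".,;:'\"()?!"

# Illustrative stop-word list (assumption): only words named in the report.
STOP_WORDS = {"the", "is", "at", "which", "on", "and", "of", "or", "not"}


def tokenize(text):
    """Split lower-cased text on whitespace and the delimiter characters."""
    pattern = "[" + re.escape(DELIMITERS) + r"\s]+"
    return [tok for tok in re.split(pattern, text.lower()) if tok]


def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word list."""
    return [tok for tok in tokens if tok not in STOP_WORDS]


def naive_stem(token):
    """Toy suffix stripper (assumption), NOT the stemmer used by Weka."""
    for suffix in ("ing", "e", "y"):
        # Only strip when at least three characters would remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem each remaining token."""
    return [naive_stem(tok) for tok in remove_stop_words(tokenize(text))]


print(preprocess("History of the immune system: is it teaching?"))
# ['histor', 'immun', 'system', 'it', 'teach']
```

The toy stemmer reproduces the stems quoted in this report (histor, immun, teach, becaus), but a real stemmer handles many more suffix rules.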
2. DISCUSSION
- Identifying Distinct Keywords Keyword extraction is a text analysis technique that consists of automatically extracting the most important words and expressions in a text. It helps summarize the content of a text and recognize the main topics which are being discussed. The distinct keywords related to our disease dataset are listed below:
- Checking for occurrences in the same document To check for occurrences, we used SimpleKMeans to produce a clustering output and look for similar numerical values, as applied in pattern recognition or data analysis. The clustering output for our Diseases dataset is shown below.
- Calculating TF(d, t) – Term Frequency
The Term Frequency measures how frequently a term occurs in a document. Each document is iterated over and the number of times each word appears is counted.
TF(d, t) = (Number of times term t appears in document d) / (Total number of terms in document d)
For our diseases dataset, the word occurrences in each document are as follows:
A noun related to diseases is symptom, and it is an important characteristic word for each text. We have 5 documents, and in a document of 100 words the word symptom appears 5 times.
The term frequency for symptom is 5/100 = 0.05.
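The term-frequency arithmetic above can be checked with a short Python sketch. The 100-token document here is a hypothetical stand-in (5 occurrences of "symptom" padded with a placeholder word), not the actual dataset text.

```python
def term_frequency(term, tokens):
    """TF = occurrences of `term` / total number of tokens in the document."""
    if not tokens:
        return 0.0
    return tokens.count(term) / len(tokens)


# Hypothetical 100-token document in which "symptom" occurs 5 times.
doc = ["symptom"] * 5 + ["filler"] * 95
tf = term_frequency("symptom", doc)
print(tf)  # 0.05
```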
- Calculate IDF(t) – Inverse Document Frequency
IDF (Inverse Document Frequency) measures how important a term is across the collection: the fewer documents a term appears in, the higher its weight. The formula to calculate it is:
IDF(t) = log10(Total number of documents / Number of documents with term t in it)
So the IDF for our diseases dataset is shown below:
Number of documents = 5
Number of documents with term t in it = 1
log10(5/1) = 0.69897 ≈ 0.699
- Calculate TF_IDF(d, t): TF_IDF can be calculated as: TF_IDF(d, t) = TF(d, t) * IDF(t). For our diseases dataset, TF(d, t) = 0.05 and IDF(t) = 0.699, so TF_IDF(d, t) = 0.05 * 0.699 ≈ 0.035. Thus, the TF_IDF weight is the product of these two quantities.
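- Generate TF-IDF Matrix. The TF-IDF matrix is shown below (each TF and TF * IDF cell lists two values, one per document column):
word | TF | IDF | TF * IDF
Death | 5/100, 5/100 | log(5/5) = 0 | 0, 0
Factor | 5/100, 0 | log(5/1) = 0.699 | 0.035, 0
symptom | 0, 5/100 | log(5/1) = 0.699 | 0, 0.035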
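The IDF and TF-IDF figures above can be reproduced with the following sketch. It assumes a base-10 logarithm, since that is the base that yields the 0.699 value quoted in this report (the natural logarithm of 5 would be about 1.609). The document counts per word are taken from the matrix; the helper names are illustrative.

```python
import math


def idf(n_docs, n_docs_with_term):
    """Inverse document frequency using a base-10 logarithm (assumption)."""
    return math.log10(n_docs / n_docs_with_term)


def tf_idf(tf, n_docs, n_docs_with_term):
    """TF-IDF weight: term frequency multiplied by inverse document frequency."""
    return tf * idf(n_docs, n_docs_with_term)


N_DOCS = 5
# word -> (term frequency in one document, number of documents containing it)
rows = {"death": (0.05, 5), "factor": (0.05, 1), "symptom": (0.05, 1)}

for word, (tf, df) in rows.items():
    print(word, round(idf(N_DOCS, df), 3), round(tf_idf(tf, N_DOCS, df), 3))
# death 0.0 0.0
# factor 0.699 0.035
# symptom 0.699 0.035
```

A word that occurs in every document, such as death here, gets an IDF of 0 and therefore a TF-IDF weight of 0 regardless of how often it appears.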
- Loading quality dataset into Weka
- Pre-Processing [special characters] This step removes noisy or unstructured data, such as special symbols like %, so that only text data remains for further processing. For this quality dataset, no special symbols were detected.
- Pre-Processing [Removing stop words] Stop words are words that are filtered out before or after processing of natural language data (text). For some search engines, these are some of the most common, short function words, such as the, is, at, which, and on. Below are some of the stop words we removed from our dataset, such as is, of, or and not, using the StringToWordVector filter.
- Tokenizing Data Tokenization, in text mining, is the process of splitting a sentence into individual words (tokens). For our dataset, we chose the WordTokenizer with delimiters .,;:'"()?! so that the text is split at these punctuation characters.
- Stemming Stemming reduces inflected or derived words to a common stem. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. For example, stemmed words in our quality dataset include teach (for teaching) and becaus (for because).
- Generating Document Representation When building a database of the terms that appear in a set of documents, the document-term matrix has rows corresponding to the documents and columns corresponding to the terms. Each entry is displayed as a relative frequency.
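As a rough illustration of this representation, the sketch below builds a document-term matrix of relative frequencies from already tokenized documents. The two sample documents are hypothetical, and the column ordering (alphabetical) is our choice, not necessarily the one Weka uses.

```python
from collections import Counter


def document_term_matrix(docs):
    """Build (vocabulary, matrix) where matrix[i][j] is the relative
    frequency of vocabulary[j] in document i."""
    vocab = sorted({tok for doc in docs for tok in doc})
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # Relative frequency: occurrences divided by document length.
        matrix.append([counts[t] / total if total else 0.0 for t in vocab])
    return vocab, matrix


# Hypothetical tokenized documents, for illustration only.
docs = [["teach", "perform", "teach", "perform"], ["perform"]]
vocab, matrix = document_term_matrix(docs)
print(vocab)   # ['perform', 'teach']
print(matrix)  # [[0.5, 0.5], [1.0, 0.0]]
```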
- Generating Document Matrix [SimpleKMeans] We used SimpleKMeans clustering, in which the n data objects are partitioned into k clusters and each observation belongs to the cluster with the nearest mean. For example, cluster 0 represents health.
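The nearest-mean idea behind k-means can be sketched as follows. This is a minimal version of the standard iterative procedure, not Weka's SimpleKMeans implementation: the initial centroids are supplied by the caller so that the result is deterministic, and the sample points are hypothetical two-dimensional vectors rather than real document vectors.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, centroids, max_iter=100):
    """Assign each point to its nearest centroid, recompute the means,
    and repeat until the assignments stop changing."""
    centroids = [list(c) for c in centroids]
    assignments = None
    for _ in range(max_iter):
        new_assignments = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged
        assignments = new_assignments
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Hypothetical 2-D points forming two obvious groups.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
assignments, centroids = k_means(points, [[0.0, 0.0], [10.0, 10.0]])
print(assignments)  # [0, 0, 1, 1]
print(centroids)    # [[0.0, 0.5], [10.0, 10.5]]
```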
- Calculating TF(d, t) – Term Frequency
The Term Frequency measures how frequently a term occurs in a document. Each document is iterated over and the number of times each word appears is counted.
TF(d, t) = (Number of times term t appears in document d) / (Total number of terms in document d)
For our quality dataset, the word occurrences in each document are as follows:
A word related to the quality dataset is perform, and it is an important characteristic word for each text. We have 5 documents, and in a document of 100 words the word perform appears 8 times.
The term frequency for perform is 8/100 = 0.08.
- Calculate IDF(t) – Inverse Document Frequency
IDF (Inverse Document Frequency) measures how important a term is across the collection: the fewer documents a term appears in, the higher its weight. The formula to calculate it is:
IDF(t) = log10(Total number of documents / Number of documents with term t in it)
So, the IDF for our quality dataset is shown below:
Number of documents = 5
Number of documents with term t in it = 2
log10(5/2) = 0.39794 ≈ 0.398
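The choice of logarithm base matters for the IDF value. The sketch below, using only the counts stated above, shows that the quoted figure of about 0.398 corresponds to a base-10 logarithm, while the natural logarithm gives a different number. The final TF-IDF line is our own arithmetic from the TF of 0.08 reported earlier; it is not a result stated in this report.

```python
import math

n_docs = 5
n_docs_with_term = 2

idf_base10 = math.log10(n_docs / n_docs_with_term)
idf_natural = math.log(n_docs / n_docs_with_term)

print(round(idf_base10, 3))   # 0.398
print(round(idf_natural, 3))  # 0.916

# Illustrative only: TF-IDF for "perform" with TF = 0.08 and base-10 IDF.
tf_perform = 0.08
print(round(tf_perform * idf_base10, 3))  # 0.032
```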
5. CONCLUSION
The diagnostic numeric-valued variable indicates whether the patient shows signs of different diseases according to World Health Organization (WHO) criteria. The result is a real-valued prediction between 0 and 1, which was transformed into a numeric decision using a cut-off of 0.699. For the unrelated subtopics, the result is a real-valued prediction between 0 and 1 with an accuracy of about 60 %, which was transformed into a numeric decision using a cut-off of 0.398. Overall, the analysis produced varying effect sizes: the five documents of the diseases dataset showed a statistically significant relationship, while the quality dataset of unrelated subtopics did not.
6. REFERENCES
• www.who.int
• www.time.com/tag/lifestyle