Weka Tool used for comparison, Lab Reports of Digital Marketing

University Nord (UN), Digital Marketing

Weka can be used to compare two datasets in order to determine whether they are similar. It also provides tools for clustering and for sampling large datasets.

Typology: Lab Reports

2019/2020

Uploaded on 04/28/2020

hashil-ramnial 🇲🇺

1 document


EXECUTIVE SUMMARY

The purpose of our study is to conduct text mining using the Weka tool: pre-processing text, removing stop words, tokenizing data and stemming data. In addition, we identified distinct keywords, generated a document matrix and compared the similarity between documents.

For this tutorial, we used two datasets named diseases and lifestyle. The diseases dataset contains related topics such as Cancer and Diabetes, which all form part of the disease category. For the unrelated dataset, quality, the unrelated topics include health, fashion, transportation, education and entertainment.

This report analyses the various concepts of text mining and presents results, based on the methodology used, that reveal patterns indicating trends or connections between the datasets.


1. INTRODUCTION

1.1 DISEASES DATASET

We created a Diseases dataset that consists of related subtopics such as Cancer, Diabetes, Haemodialysis, Tuberculosis and Pneumonia. These subtopics all form part of the category diseases. The purpose of creating the disease dataset is to improve public health, to prevent illness, and to eliminate threats to physical and mental health. It covers various factors such as symptoms and factors contributing to death. We created five different text files for the diseases and merged them into a single .arff file to be loaded in Weka. Our dataset is shown below. The relation is named diseases and has 2 attributes: text, which stores the string characters, and diseases, which links each text to the appropriate disease.

Diseases Dataset
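The merge into an .arff file can be sketched as follows. This is a minimal illustration, not the authors' actual procedure: the helper names `quote` and `build_arff` and the placeholder texts are assumptions, and only the relation name, the two attribute names and the five subtopics come from the report.

```python
def quote(value):
    """Quote a string for ARFF, escaping backslashes and single quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def build_arff(relation, class_attribute, documents):
    """Return ARFF text with a string attribute 'text' and a nominal class attribute."""
    labels = ",".join(documents)  # class labels in insertion order
    lines = [
        f"@relation {relation}",
        "",
        "@attribute text string",
        f"@attribute {class_attribute} {{{labels}}}",
        "",
        "@data",
    ]
    for label, text in documents.items():
        # Collapse newlines and repeated spaces so each document stays on one row.
        lines.append(f"{quote(' '.join(text.split()))},{label}")
    return "\n".join(lines) + "\n"


# Placeholder texts (assumption); the report used one text file per disease.
docs = {
    "Cancer": "placeholder text about cancer",
    "Diabetes": "placeholder text about diabetes",
    "Haemodialysis": "placeholder text about haemodialysis",
    "Tuberculosis": "placeholder text about tuberculosis",
    "Pneumonia": "placeholder text about pneumonia",
}
arff = build_arff("diseases", "diseases", docs)
print(arff)
```

Each data row holds one quoted document followed by its class label, which matches the two-attribute layout described above.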

Pre-Processing [Removing stop words]
Stop words are words that are filtered out before or after the processing of natural language text. They are typically the most common short function words, such as the, is, at, which and on. The stop words we removed from the diseases dataset include is, of, or and not; we used the StringToWordVector filter for this step.
Tokenizing Data
In text mining, tokenization splits a text into smaller units called tokens; here, each sentence is split into words. The delimiter characters mark the boundaries between tokens. For our dataset, we chose the WordTokenizer with the delimiters .,;:'"()?!
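The two steps so far, tokenizing on delimiters and dropping stop words, can be illustrated outside Weka. This is only a sketch: the example sentence is made up, whitespace is added to the delimiter set as an assumption, and the stop list holds just the four words the report names.

```python
import re

# Delimiters quoted in the report, plus whitespace (an assumption here).
DELIMITERS = " \r\n\t.,;:'\"()?!"
# The four stop words the report names; a real stop list is much longer.
STOP_WORDS = {"is", "of", "or", "not"}


def tokenize(text):
    """Split text on any run of delimiter characters and drop empty pieces."""
    pattern = "[" + re.escape(DELIMITERS) + "]+"
    return [token for token in re.split(pattern, text) if token]


def remove_stop_words(tokens):
    """Keep tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


sentence = "Fever is a symptom of pneumonia, or not?"  # made-up example
tokens = tokenize(sentence)
print(tokens)
print(remove_stop_words(tokens))
```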
Stemming
Stemming reduces related word forms to a common stem. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. For example, stemmed words in our diseases dataset include histor (for history) and immun (for immune).
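To show why a stem does not have to be a valid word, here is a toy suffix stripper. It is not one of Weka's stemmers, and the report does not say which stemmer was used; the suffix list and minimum stem length are assumptions chosen only so that the two quoted examples come out the same.

```python
# Suffixes tried in order; chosen only to reproduce the stems quoted in the report.
SUFFIXES = ("ing", "y", "e")
MIN_STEM_LENGTH = 3  # assumption: never shorten a word below three letters


def toy_stem(word):
    """Strip the first matching suffix if the remaining stem is long enough."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word


for word in ("history", "immune"):
    print(word, "->", toy_stem(word))
```

A real stemmer applies many more rules and exceptions, so its output will differ from this sketch on most other words.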

2. DISCUSSION

Identifying Distinct Keywords
Keyword extraction is a text analysis technique that automatically extracts the most important words and expressions in a text. It helps summarize the content of a text and recognize the main topics being discussed. The distinct keywords related to our disease dataset are listed below:

Checking for occurrences in same document
To check for occurrences, we used SimpleKMeans clustering to group documents with similar numerical values, as is done in pattern recognition and data analysis. Our clustering output for the Diseases dataset is shown below.

Clustering output (Diseases dataset): chronic

• Calculating TF(d, t) – Term Frequency

Term Frequency measures how frequently a term occurs in a document. Each document is iterated over and the number of times each word appears is counted.

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

For our disease dataset, the word symptom is related to diseases and is an important characteristic word for each text. We have 5 documents that contain 100 words, in which the word symptom appears 5 times.

The term frequency for symptom is 5/100 = 0.05.
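The term frequency figure can be checked with a few lines of code. The token list below is a stand-in built only to match the counts in the report (5 occurrences in 100 words); the function name `term_frequency` is an assumption.

```python
from collections import Counter


def term_frequency(term, tokens):
    """TF = occurrences of term / total number of tokens."""
    if not tokens:
        return 0.0
    return Counter(tokens)[term] / len(tokens)


# Hypothetical token list: 100 tokens, 5 of which are "symptom".
tokens = ["symptom"] * 5 + ["other"] * 95
print(term_frequency("symptom", tokens))
```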

Calculate IDF(d, t) – Inverse Document Frequency

IDF (Inverse Document Frequency) measures how important a term is by ranking the word for its relevance across the documents. The formula is:

IDF(t) = log10(Total number of documents / Number of documents with term t in it)

For our disease dataset:

Number of documents = 5
Number of documents with term t in it = 1
IDF(t) = log10(5/1) = 0.69897 ≈ 0.699
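The choice of logarithm base changes the IDF value, so it is worth checking which base reproduces the figure for log(5/1). The sketch below (function name `idf` is an assumption) shows that a value of about 0.699 corresponds to a base-10 logarithm, while the natural logarithm gives about 1.609.

```python
import math


def idf(total_documents, documents_with_term, base=10):
    """IDF = log_base(total documents / documents containing the term)."""
    if documents_with_term <= 0:
        raise ValueError("term must occur in at least one document")
    return math.log(total_documents / documents_with_term, base)


print(round(idf(5, 1), 3))          # base-10 logarithm
print(round(idf(5, 1, math.e), 3))  # natural logarithm
```

Either base can be used as long as it is applied consistently to every term, since changing the base only rescales all IDF values by the same factor.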

Calculate TF_IDF(d, t)

TF_IDF is calculated as:

TF_IDF(d, t) = TF(d, t) * IDF(t)

For our disease dataset, TF(d, t) = 0.05 and IDF(t) = 0.699, so TF_IDF(d, t) = 0.05 * 0.699 ≈ 0.035. Thus the TF_IDF weight is the product of these two quantities.
Generate TF-IDF Matrix

The TF-IDF matrix is shown below.

word      TF      TF      IDF                 TF * IDF   TF * IDF
Death     5/100   5/100   log(5/5) = 0        0          0
Factor    5/100   0       log(5/1) = 0.699    0.035      0
symptom   0       5/100   log(5/1) = 0.699    0          0.035
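The rows of the matrix can be recomputed with a short script. This sketch assumes that the two TF columns refer to two different documents and that the logarithm is base 10; the document-frequency counts (5, 1 and 1) are read off the log(5/5) and log(5/1) entries, and the names `rows` and `tf_idf_row` are hypothetical.

```python
import math

N_DOCS = 5  # number of documents in the report's disease dataset

# word: (TF in first document, TF in second document, documents containing the word)
rows = {
    "death": (5 / 100, 5 / 100, 5),
    "factor": (5 / 100, 0.0, 1),
    "symptom": (0.0, 5 / 100, 1),
}


def tf_idf_row(tf_values, doc_freq, n_docs=N_DOCS):
    """Return (IDF, [TF * IDF per document]) using a base-10 logarithm."""
    idf = math.log10(n_docs / doc_freq)
    return idf, [tf * idf for tf in tf_values]


for word, (tf_a, tf_b, doc_freq) in rows.items():
    idf, weights = tf_idf_row((tf_a, tf_b), doc_freq)
    print(f"{word:8s} idf={idf:.3f} tfidf={[round(w, 3) for w in weights]}")
```

A word that occurs in every document, such as death here, gets an IDF of 0 and therefore a TF-IDF weight of 0 regardless of how often it appears.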

Loading quality dataset into Weka

Pre-Processing [special characters]
This step ensures that noisy or unstructured data, such as special symbols like %, is removed so that only text data remains for further processing. For this quality dataset, no special symbols were detected.

Pre-Processing [Removing stop words]
Stop words were filtered out in the same way as for the diseases dataset: common function words such as is, of, or and not were removed using the StringToWordVector filter.

Tokenizing Data
The text was again tokenized into words using the WordTokenizer with the delimiters .,;:'"()?!

Stemming
As before, the stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem. For example, stemmed words in our quality dataset include teach (for teaching) and becaus (for because).

Generating Document Representation
When creating a database of terms that appear in a set of documents, the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. Each entry is displayed as a relative frequency.
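A document-term matrix with relative frequencies can be built as in the sketch below. The token lists are made up and stand in for preprocessed documents; the function name `document_term_matrix` is an assumption, and this is not how Weka stores the matrix internally.

```python
from collections import Counter


def document_term_matrix(documents):
    """Return (vocabulary, matrix) with one row per document, one column per term.

    Each cell is the term's count divided by the document's token count.
    """
    vocabulary = sorted({token for tokens in documents for token in tokens})
    matrix = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        matrix.append([counts[term] / total for term in vocabulary])
    return vocabulary, matrix


# Made-up token lists standing in for two preprocessed documents.
docs = [
    ["health", "diet", "health", "sleep"],
    ["fashion", "style", "fashion", "fashion"],
]
vocabulary, matrix = document_term_matrix(docs)
print(vocabulary)
for row in matrix:
    print(row)
```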
- Generating Document Matrix [SimpleKMeans]
We used SimpleKMeans clustering, in which the n data objects are partitioned into k clusters and each observation belongs to the cluster with the nearest mean. For example, cluster 0 represents health.
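The nearest-mean rule can be illustrated with a plain k-means loop. This is a simplified sketch rather than Weka's SimpleKMeans: the starting centroids are fixed by hand instead of being chosen from a random seed, and the four two-dimensional points are made-up stand-ins for document vectors.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, centroids, iterations=10):
    """Assign each point to the nearest centroid, then recompute the means."""
    centroids = [list(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        assignments = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Made-up relative-frequency vectors for four documents over two terms.
points = [[0.5, 0.0], [0.4, 0.1], [0.0, 0.6], [0.1, 0.5]]
assignments, centroids = k_means(points, centroids=[points[0], points[2]])
print(assignments)
print(centroids)
```

The first two points end up in cluster 0 and the last two in cluster 1, because each pair lies closest to its own mean.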

• Calculating TF(d, t) – Term Frequency

Term Frequency measures how frequently a term occurs in a document. Each document is iterated over and the number of times each word appears is counted.

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

For our quality dataset, the word perform is an important characteristic word for each text. We have 5 documents that contain 100 words, in which the word perform appears 8 times.

The term frequency for perform is 8/100 = 0.08.

Calculate IDF(d, t) – Inverse Document Frequency

IDF (Inverse Document Frequency) measures how important a term is by ranking the word for its relevance across the documents. The formula is:

IDF(t) = log10(Total number of documents / Number of documents with term t in it)

For our quality dataset:

Number of documents = 5
Number of documents with term t in it = 2
IDF(t) = log10(5/2) = 0.39794 ≈ 0.398

5. CONCLUSION

The diagnostic numeric-valued variable indicates whether the patient shows signs of different diseases according to World Health Organization (WHO) criteria. The result is a real-valued prediction between 0 and 1, which was transformed into a numeric decision using a cut-off of 0.699. For the unrelated subtopics, the result is a real-valued prediction of about 60% between 0 and 1, which was transformed into a numeric decision using a cut-off of 0.398. Overall, the analysis produced varying effect sizes: the five documents of the disease dataset showed a statistically significant relationship, while the quality dataset of unrelated subtopics did not.

6. REFERENCES

• www.who.int

• www.time.com/tag/lifestyle