Text Classification and Knowledge Extraction from Web Documents using Docsity.com, Slides of Fundamentals of E-Commerce

Biju Patnaik University of Technology Fundamentals of E-Commerce

These slides cover the use of text classification techniques to extract information from web documents and generate knowledge bases. A Web → KB system trains machine-learning subsystems to predict classes and relations, populates the knowledge base with data collected from the web, and takes an ontology and training examples as inputs. The slides also cover knowledge extraction, which consists of assigning a new web page to a class and filling in the class attributes by extracting the relevant information. Various classification methods, including Naïve Bayes, are applied to different tasks, such as categorizing news stories and filtering email.

Typology: Slides

2012/2013

Uploaded on 07/30/2013

post_box 🇮🇳


Applications
• Text categorization methods use
  – document vector or ‘bag of words’ representations
• Domain-specific aspects of the web
  – using them (e.g., for sports pages or AI-related citations) improves classification performance
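As a rough illustration of the ‘bag of words’ document vector named above, the following minimal sketch counts words while discarding their order. The function names, the tokenization rule, and the two sample sentences are assumptions made for this example and do not come from the slides.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map each word to its count; word order is discarded."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)

def to_vector(bag, vocabulary):
    """Turn a bag into a count vector aligned with a fixed vocabulary."""
    return [bag[word] for word in vocabulary]

# Hypothetical sample documents (not from the slides).
docs = ["The team won the match", "The paper cites the AI paper"]
bags = [bag_of_words(d) for d in docs]

# The vocabulary is the sorted union of all words seen in the collection.
vocabulary = sorted(set().union(*bags))
vectors = [to_vector(b, vocabulary) for b in bags]
```

Two texts that contain the same words in a different order map to the same vector, which is the order information this representation gives up.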


Classification of Web Pages
• Use of text classification to
  – extract information from web documents
  – automatically generate knowledge bases
• Web → KB systems (Craven et al.)
  – train machine-learning subsystems to
    • predict classes and relations
    • populate the KB from data collected from the web
  – take an ontology and training examples as inputs

Example

Experimental Results

Rows: predicted category. Columns: actual category.

Predicted \ Actual    cou    stu    fac    sta    pro    dep    oth   Precision
Cou                   202     17      …      0      1      0    552   26.
Stu                     0    421      …     17      2      0    519   43.
Fac                     5     56      …     16      3      0    264   17.
Sta                     0     15      …      4      0      0     45   6.
Pro                     8      9      …      5     62      0    384   13.
Dep                    10      8      …      1      5      4    209   1.
Oth                    19     32      …      3     12      0   1064   93.
Recall               82.8   75.4    77.    8.7   72.9  100.0    35.

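The Precision and Recall entries of the table above follow from the counts: precision divides a diagonal entry by its row total (everything predicted as that class), and recall divides it by its column total (everything that actually belongs to that class). The sketch below shows the arithmetic. The 3-class matrix `toy` is hypothetical, while `cou_column` copies the cou column of the table from top to bottom.

```python
def precision(matrix, i):
    """Share of items predicted as class i that really are class i (row-wise)."""
    predicted_total = sum(matrix[i])
    return matrix[i][i] / predicted_total if predicted_total else 0.0

def recall(matrix, i):
    """Share of items of actual class i that were predicted as class i (column-wise)."""
    actual_total = sum(row[i] for row in matrix)
    return matrix[i][i] / actual_total if actual_total else 0.0

# Hypothetical 3-class confusion matrix (rows: predicted, columns: actual).
toy = [
    [8, 1, 1],
    [2, 6, 0],
    [0, 3, 9],
]

# The 'cou' column of the slide's table, rows Cou, Stu, Fac, Sta, Pro, Dep, Oth.
cou_column = [202, 0, 5, 0, 8, 10, 19]
cou_recall = 100 * cou_column[0] / sum(cou_column)
```

Here `cou_recall` is 202 / 244, which rounds to the 82.8 shown in the Recall row.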
Experimental Results
• ModApte split (Joachims 1998)
  – 9603 training documents and 3299 test documents, 90 categories

Prediction method          Performance breakeven (%)
Naïve Bayes                73.4
Rocchio                    78.7
Decision tree              78.9
K-NN                       82.0
Rule induction             82.0
Support vector (RBF)       86.3
Multiple decision trees    87.
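
Assuming the ‘performance breakeven’ figures above are precision/recall breakeven points, the measure can be sketched as follows: rank the test documents by classifier score and cut the ranking after as many documents as there are true positives. At that cutoff precision and recall share the same denominator, so they are equal. The function name and the sample scores are illustrative assumptions, not taken from the slides.

```python
def breakeven_point(scores, labels):
    """Precision (= recall) at the rank cutoff equal to the number of positives.

    scores: classifier confidence per document, higher means more likely positive.
    labels: 1 for documents that truly belong to the category, 0 otherwise.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = sum(label for _, label in ranked[:n_pos])
    # precision = true_pos / cutoff and recall = true_pos / n_pos; cutoff == n_pos
    return true_pos / n_pos

# Hypothetical scores and labels for six test documents.
example = breakeven_point([0.9, 0.8, 0.7, 0.6, 0.5, 0.4], [1, 0, 1, 1, 0, 0])
```

In the sample, two of the three top-ranked documents are positive, so both precision and recall equal 2/3 at the breakeven cutoff.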

Email and News Filtering
• ‘Bag of words’ representation
  – removes important order information
  – need to hand-program terms, e.g., ‘confidential message’, ‘urgent and personal’
• Naïve Bayes classifier is applied for junk email filtering
• Feature selection is performed by
  – eliminating rare words
  – retaining important terms, determined by mutual information
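
The pipeline on this slide can be sketched end to end: add hand-coded phrases to the word features, drop rare words, keep the terms with the highest mutual information with the class label, and train a Naïve Bayes classifier on what remains. This is a minimal sketch under stated assumptions, not the filter used in the cited work: the Bernoulli event model, the Laplace smoothing, the thresholds `min_docs` and `top_k`, all function names, and the six training messages are invented for illustration.

```python
import math
from collections import Counter

# Hand-programmed phrase features, using the slide's two examples.
PHRASES = ["confidential message", "urgent and personal"]

def featurize(text):
    """Set of words in the message plus any hand-coded phrase it contains."""
    text = text.lower()
    features = set(text.split())
    features.update(p for p in PHRASES if p in text)
    return features

def mutual_information(term, docs, labels):
    """Mutual information (bits) between term presence and the class label."""
    n = len(docs)
    total = 0.0
    for present in (True, False):
        p_term = sum((term in d) == present for d in docs) / n
        for c in sorted(set(labels)):
            p_class = labels.count(c) / n
            p_joint = sum((term in d) == present and y == c
                          for d, y in zip(docs, labels)) / n
            if p_joint > 0:
                total += p_joint * math.log2(p_joint / (p_term * p_class))
    return total

def select_features(docs, labels, min_docs=2, top_k=4):
    """Eliminate rare terms, then keep the top_k terms by mutual information."""
    doc_freq = Counter(t for d in docs for t in d)
    frequent = [t for t, n in doc_freq.items() if n >= min_docs]
    frequent.sort(key=lambda t: mutual_information(t, docs, labels), reverse=True)
    return frequent[:top_k]

def train(docs, labels, features):
    """Bernoulli Naive Bayes with Laplace smoothing (an assumed model choice)."""
    classes = sorted(set(labels))
    prior = {c: labels.count(c) / len(labels) for c in classes}
    cond = {c: {t: (sum(t in d for d, y in zip(docs, labels) if y == c) + 1)
                   / (labels.count(c) + 2)
                for t in features}
            for c in classes}
    return prior, cond

def classify(doc, prior, cond):
    """Return the class with the highest log posterior for a feature set."""
    def log_posterior(c):
        score = math.log(prior[c])
        for t, p in cond[c].items():
            score += math.log(p if t in doc else 1.0 - p)
        return score
    return max(prior, key=log_posterior)

# Hypothetical training messages (not from the slides).
messages = [
    ("urgent money offer", "junk"), ("urgent offer now", "junk"),
    ("free money urgent", "junk"), ("meeting agenda monday", "ok"),
    ("project meeting notes", "ok"), ("lunch monday offer", "ok"),
]
docs = [featurize(text) for text, _ in messages]
labels = [label for _, label in messages]
features = select_features(docs, labels)
prior, cond = train(docs, labels, features)
```

With this toy data, words that occur in a single message are dropped as rare, and ‘offer’ is dropped by the mutual-information ranking because it appears in both classes. Calling `classify(featurize("urgent money transfer"), prior, cond)` then labels the message as junk.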

Supervised Learning with Unlabeled Data
• Assigning labels to the training set is
  – expensive
  – time consuming
• Abundance of unlabeled data
  – suggests possible use to improve learning