Text Classification and Knowledge Extraction from Web Documents using Docsity.com, Slides of Fundamentals of E-Commerce

These slides cover the use of text classification techniques to extract information from web documents and generate knowledge bases. Web→KB systems train machine-learning subsystems to predict classes and relations and populate the knowledge base with data collected from the web, taking an ontology and training examples as inputs. The slides also cover knowledge extraction, which consists of assigning a new web page to a class and filling in class attributes by extracting relevant information. Various classification methods, including naive Bayes, are compared on tasks such as news-story categorization and junk-email filtering.

Typology: Slides

2012/2013

Uploaded on 07/30/2013

post_box


Applications
• Text categorization methods use
  – document vector or ‘bag of words’
• Domain-specific aspects of the web
  – using them (e.g., for sports pages or AI-related citations) improves classification performance
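
As a rough illustration of the ‘bag of words’ document vector mentioned above, the sketch below counts word occurrences and projects them onto a fixed vocabulary. The tokenizer, the function names `bag_of_words` and `to_vector`, and the two sample sentences are assumptions made for this example; they do not come from the slides.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

def to_vector(bag, vocabulary):
    """Project a bag of words onto a fixed, ordered vocabulary."""
    return [bag[term] for term in vocabulary]

# Two invented documents, used only to show the representation.
docs = [
    "The team won the match",
    "The paper cites the earlier AI paper",
]
bags = [bag_of_words(d) for d in docs]

# The vocabulary is the sorted union of all terms seen in the collection.
vocabulary = sorted(set().union(*bags))
vectors = [to_vector(b, vocabulary) for b in bags]

print(vocabulary)
for v in vectors:
    print(v)
```

Word order is discarded: each document becomes a list of counts, one per vocabulary term, which is the input most of the classifiers compared later in these slides would typically consume.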

Classification of Web Pages
• Use of text classification to
  – extract information from web documents
  – automatically generate knowledge bases
• Web→KB systems (Craven et al.)
  – train machine-learning subsystems to
    • predict classes and relations
    • populate the KB from data collected from the web
  – take an ontology and training examples as inputs

Example

Experimental Results
Rows: predicted category. Columns: actual category. Entries lost in the source are shown as ….

Predicted   cou    stu    fac    sta    pro    dep     oth    Precision
Cou         202     17      …      0      1      0     552    26.…
Stu           0    421      …     17      2      0     519    43.…
Fac           5     56      …     16      3      0     264    17.…
Sta           0     15      …      4      0      0      45     6.…
Pro           8      9      …      5     62      0     384    13.…
Dep          10      8      …      1      5      4     209     1.…
Oth          19     32      …      3     12      0    1064    93.…
Recall     82.8   75.4   77.…    8.7   72.9  100.0    35.…
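
To make the recall row concrete, here is a small sketch that recomputes recall from the counts that survive in the table. It assumes rows are predicted categories and columns are actual categories, and it leaves out the fac column because its counts are missing from the source. That assignment of numbers to columns is an assumption, although the recomputed values agree with the recall figures that are legible. Precision is not recomputed, since each row total would need the missing fac count.

```python
# Columns (actual categories) whose counts are legible in the source table.
# The 'fac' column is missing from the source and is therefore omitted.
actual = ["cou", "stu", "sta", "pro", "dep", "oth"]

# Rows are predicted categories; each list follows the order of `actual`.
counts = {
    "cou": [202, 17, 0, 1, 0, 552],
    "stu": [0, 421, 17, 2, 0, 519],
    "fac": [5, 56, 16, 3, 0, 264],
    "sta": [0, 15, 4, 0, 0, 45],
    "pro": [8, 9, 5, 62, 0, 384],
    "dep": [10, 8, 1, 5, 4, 209],
    "oth": [19, 32, 3, 12, 0, 1064],
}

def recall(label):
    """Percentage of pages whose actual class is `label` that were predicted as `label`."""
    col = actual.index(label)
    column_total = sum(row[col] for row in counts.values())
    return 100.0 * counts[label][col] / column_total

recalls = {label: round(recall(label), 1) for label in actual}
print(recalls)
```

Recall only needs a full column, which is why it can be recomputed here even though one column of the table is lost.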
Experimental Results
• ModApte split (Joachims 1998)
  – 9603 training documents and 3299 test documents, 90 categories

Prediction method          Performance, breakeven (%)
Naïve Bayes                73.4
Rocchio                    78.7
Decision tree              78.9
K-NN                       82.0
Rule induction             82.0
Support vector (RBF)       86.3
Multiple decision trees    87.…

Email and News Filtering
• ‘Bag of words’ representation
  – removes important order information
  – need to hand-program terms, e.g., ‘confidential message’, ‘urgent and personal’
• Naïve Bayes classifier is applied for junk email filtering
• Feature selection is performed by
  – eliminating rare words
  – retaining important terms, determined by mutual information
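
The pipeline outlined above (drop rare words, keep the terms with the highest mutual information, then apply naïve Bayes) could be sketched roughly as follows. This is not the filter the slides refer to: the six training messages are invented, the Bernoulli (word present/absent) model with Laplace smoothing is one possible choice among several, and `min_df`, `top_k` and all function names are hypothetical.

```python
import math
from collections import Counter

# Invented training messages, for illustration only.
TRAIN = [
    ("urgent and personal offer win money now", "junk"),
    ("confidential message claim your free prize", "junk"),
    ("win a free prize urgent please reply now", "junk"),
    ("meeting agenda for project review", "ok"),
    ("please review attached project report", "ok"),
    ("lunch on friday after meeting", "ok"),
]

def tokenize(text):
    return text.lower().split()

# Each document becomes a set of tokens (presence only) plus its label.
docs = [(set(tokenize(text)), label) for text, label in TRAIN]

def mutual_information(term, docs):
    """Mutual information (bits) between presence of `term` and the class label."""
    n = len(docs)
    labels = sorted({lab for _, lab in docs})
    mi = 0.0
    for present in (True, False):
        p_term = sum((term in toks) == present for toks, _ in docs) / n
        for label in labels:
            p_label = sum(lab == label for _, lab in docs) / n
            joint = sum((term in toks) == present and lab == label
                        for toks, lab in docs) / n
            if joint > 0:
                mi += joint * math.log2(joint / (p_term * p_label))
    return mi

def select_features(docs, min_df=2, top_k=8):
    """Eliminate rare words (document frequency < min_df), keep top_k by MI."""
    df = Counter(t for toks, _ in docs for t in toks)
    candidates = [t for t, c in df.items() if c >= min_df]
    ranked = sorted(candidates,
                    key=lambda t: (-round(mutual_information(t, docs), 9), t))
    return ranked[:top_k]

def train(docs, features):
    """Estimate class priors and Laplace-smoothed P(term present | class)."""
    prior, cond = {}, {}
    for label in sorted({lab for _, lab in docs}):
        in_class = [toks for toks, lab in docs if lab == label]
        prior[label] = len(in_class) / len(docs)
        cond[label] = {t: (sum(t in toks for toks in in_class) + 1)
                          / (len(in_class) + 2)
                       for t in features}
    return prior, cond

def classify(text, prior, cond):
    """Return the class with the highest log posterior under naive Bayes."""
    toks = set(tokenize(text))
    scores = {}
    for label in prior:
        score = math.log(prior[label])
        for t, p in cond[label].items():
            score += math.log(p if t in toks else 1.0 - p)
        scores[label] = score
    return max(scores, key=scores.get)

features = select_features(docs)
prior, cond = train(docs, features)
print(features)
print(classify("urgent free prize now", prior, cond))
print(classify("project meeting review", prior, cond))
```

In this toy data, "please" occurs once in each class, so its mutual information with the label is zero and it is not selected, while words seen in only one message are removed by the rare-word cut. Multi-word cues such as ‘confidential message’ are invisible to single-word features, which is the limitation the slide points to when it says such terms must be hand-programmed.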

Supervised Learning with Unlabeled Data
• Assigning labels to the training set is
  – expensive
  – time consuming
• Abundance of unlabeled data
  – suggests possible use to improve learning