

Prof. Paul McNamee, Information Retrieval, Computer Science, Artificial Intelligence, Johns Hopkins University. Exercises: Text Classification, Binary classification using Reuters


There will be four more homework assignments (including this one). The other three will cover internet search, multilingual retrieval, and applications of NLP to IR. These assignments should involve less programming than the first three, giving you more time to devote to your independent project. Also, you only need to hand in three of the four remaining assignments; you may skip one of your choice.
(20 pts) Explain the following concepts and what they mean in terms of text classification: (1) bias-variance tradeoff; (2) kernel trick; and (3) cross-validation.

(20 pts) Compute how a naïve Bayes classifier would classify the following 'documents' using the binomial (or Bernoulli) model. The two classes are 'Good' and 'Spam'. Recall that in the binomial model, estimates of P(word|class) are based on the percentage of documents of the class that contain the word. Estimates of P(class) and P(word|class) are given in the tables below.

Document 1: "free drugs willy"
Document 2: "free willy baker"

Words    P(w|Good)    P(w|Spam)
baker    0.03         0.
drugs    0.03         0.
free     0.01         0.
willy    0.05         0.

P(Good) = 0.
P(Spam) = 0.
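As a rough illustration of the Bernoulli computation described above, the sketch below scores a document as log P(class) plus, for every vocabulary word, log P(w|class) if the word is present and log(1 - P(w|class)) if it is absent. The function names are hypothetical. The P(w|Good) values are the ones in the table; the P(w|Spam) values and both priors are incomplete in this copy of the assignment, so the numbers used for them here are arbitrary placeholders and should be replaced with the real estimates.

```python
import math

def bernoulli_nb_log_scores(doc_words, priors, cond_probs):
    """Return {class: log P(class) + log P(doc | class)} under the Bernoulli model.

    Every vocabulary word contributes a factor: P(w|c) if the word occurs
    in the document, 1 - P(w|c) if it does not. Words outside the
    vocabulary are ignored. All probabilities must lie strictly in (0, 1).
    """
    present = set(doc_words)
    scores = {}
    for cls, prior in priors.items():
        log_score = math.log(prior)
        for word, p in cond_probs[cls].items():
            log_score += math.log(p if word in present else 1.0 - p)
        scores[cls] = log_score
    return scores

def classify(doc_words, priors, cond_probs):
    """Pick the class with the highest log score."""
    scores = bernoulli_nb_log_scores(doc_words, priors, cond_probs)
    return max(scores, key=scores.get)

# P(w|Good) comes from the assignment's table.
P_GOOD = {"baker": 0.03, "drugs": 0.03, "free": 0.01, "willy": 0.05}
# PLACEHOLDERS ONLY: not the assignment's values, which are cut off in this copy.
P_SPAM_PLACEHOLDER = {"baker": 0.01, "drugs": 0.20, "free": 0.30, "willy": 0.02}
PRIORS_PLACEHOLDER = {"Good": 0.5, "Spam": 0.5}

COND_PROBS = {"Good": P_GOOD, "Spam": P_SPAM_PLACEHOLDER}

if __name__ == "__main__":
    for text in ("free drugs willy", "free willy baker"):
        print(text, bernoulli_nb_log_scores(text.split(), PRIORS_PLACEHOLDER, COND_PROBS))
```

Note that absent words matter in this model: "baker" is missing from Document 1, so each class score includes the factor 1 - P(baker|class). Working in log space is a design choice to avoid underflow with larger vocabularies; for a four-word table, multiplying the raw probabilities by hand gives the same ranking.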
(60 pts) Download from the course web site the training documents for three Reuters categories (coffee, ship, and wheat) and build a classifier for each. You can use any method (e.g., kNN, naïve Bayes, decision trees, or SVMs). I suggest using the SVMlight tool, which is available from http://svmlight.joachims.org/ (binaries are available for modern operating systems). To use SVMlight you should process the text files, which are formatted similarly to those in HWs 1-3, and write out files of vectors, one document vector per line. For example:

    +1 5:1 13:1 78:1 … 15008:
    +1 5:1 45:1 78:1 15000:
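The vector-writing step could be sketched roughly as follows. This is only an illustration under several assumptions: the documents have already been parsed into (label, text) pairs, terms are lowercased alphanumeric tokens, and features are binary (1 if the term occurs). The helper names are hypothetical. SVMlight expects each line to start with the target value, followed by feature:value pairs with integer feature numbers starting at 1 and listed in increasing order.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (an assumed scheme)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocab(texts):
    """Assign each distinct term a feature number, starting at 1."""
    vocab = {}
    for text in texts:
        for term in tokenize(text):
            if term not in vocab:
                vocab[term] = len(vocab) + 1
    return vocab

def to_svmlight_line(label, text, vocab):
    """Format one document as '<target> <id>:1 <id>:1 ...' with ids in increasing order.

    Terms missing from the vocabulary (e.g. unseen test-set words) are skipped.
    """
    ids = sorted({vocab[t] for t in tokenize(text) if t in vocab})
    target = "+1" if label > 0 else "-1"
    return " ".join([target] + [f"{i}:1" for i in ids])

def write_svmlight_file(path, labeled_docs, vocab):
    """Write one document vector per line."""
    with open(path, "w", encoding="utf-8") as out:
        for label, text in labeled_docs:
            out.write(to_svmlight_line(label, text, vocab) + "\n")
```

With one such file per category (documents in the category labeled +1, all others -1), training and prediction would then go through SVMlight's `svm_learn` and `svm_classify` programs. Building the vocabulary from the training set only, and reusing it for the test set, keeps feature numbers consistent between the two files.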
Extra credit (4 pts; 2 pts): Compute and report the macro-average for the three classes (i.e., the average of the three F1 scores). The student with the highest average gets +4 points; second place gets +2 points.
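One way to compute the macro-average is sketched below, assuming gold and predicted labels are available as parallel lists of +1 and -1 for each category. The function names and the convention of returning 0.0 when there are no true positives are choices made for this illustration.

```python
def f1_score(gold, predicted):
    """F1 for the positive (+1) class, from parallel lists of +1 / -1 labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g != 1 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p != 1)
    if tp == 0:
        # Precision or recall is zero (or undefined); report F1 as 0.0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(results):
    """Unweighted mean of per-class F1 scores.

    `results` maps a class name to a (gold, predicted) pair of label lists.
    """
    scores = [f1_score(gold, predicted) for gold, predicted in results.values()]
    return sum(scores) / len(scores)
```

For the three Reuters categories, `results` would hold one entry each for coffee, ship, and wheat. Because the macro-average weights each class equally, a poor score on a small category lowers it as much as a poor score on a large one.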