An Introduction to Named Entity Recognition and Learning Based Java | CS 446, Assignments of Computer Science

Material Type: Assignment; Professor: Roth; Class: Machine Learning; Subject: Computer Science; University: University of Illinois - Urbana-Champaign; Term: Fall 2008;

CS446: Pattern Recognition and Machine Learning Fall 2008

Problem Set 3

Handed Out: September 25, 2008 Due: October 9, 2008

  • Feel free to talk to other members of the class in doing the homework. I am more concerned that you learn how to solve the problem than that you demonstrate that you solved it entirely on your own. You should, however, write down your solution yourself. Please try to keep the solution brief and clear.
  • Feel free to send me email or come to ask questions.
  • Please, no handwritten solutions. Be sure your name appears on the top of each page.
  • Please present your algorithms in both pseudocode and English. That is, give a precise formulation of your algorithm as pseudocode and also explain in one or two concise paragraphs what your algorithm does. Be aware that pseudocode is much simpler and more abstract than real code. Take a look at the textbook pseudocode (e.g. Table 2.5 on page 33) to get an idea about the appropriate level of abstraction.
  • The homework is due at 4:00 pm on the due date. Email BOTH your write-up and your code to the TA. Please do NOT hand in a hard copy of your write-up. Please put “<userid> CS446 hw3 submission” as the subject line of the email when you submit your homework to ([email protected]).

Introduction

In this problem set we will study the Named Entity Recognition (NER) problem using Learning Based Java (LBJ). The goal of this problem set is to allow you to experience the process of developing a classifier for a real world problem and appreciate the importance of feature engineering as a significant component in the development process as well as a crucial factor in the eventual performance of the classifier.

Training data, some Java sources, and a Makefile that will come in handy are provided on the CS446 website. Javadoc for those sources will also be available there. The data is given in a file containing tab-delimited lines of text. Each line describes a single word, and sentences are separated by newlines. For this problem set, we will only be interested in the information in the first and sixth fields of each line. You can get the data from: http://l2r.cs.uiuc.edu/~danr/Teaching/CS446-08/Hw/hw3/Reuters2003.tgz
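The field selection described above can be sketched in plain Java. This is only an illustration, not part of the provided cs446.ne sources: the class name ColumnReader and its methods are hypothetical, and it assumes that a blank line marks a sentence boundary. It deliberately does not say which of the two fields holds the word and which holds the tag; check that against the data.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of the provided cs446.ne sources. */
final class ColumnReader {
    /** Returns {field 1, field 6} of a tab-delimited line, or null if it has fewer than six fields. */
    static String[] firstAndSixth(String line) {
        String[] fields = line.split("\t");
        if (fields.length < 6) {
            return null;
        }
        return new String[] { fields[0], fields[5] };
    }

    /** Groups lines into sentences; a blank line is assumed to end the current sentence. */
    static List<List<String[]>> sentences(List<String> lines) {
        List<List<String[]>> result = new ArrayList<>();
        List<String[]> current = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                if (!current.isEmpty()) {
                    result.add(current);
                    current = new ArrayList<>();
                }
                continue;
            }
            String[] pair = firstAndSixth(line);
            if (pair != null) {
                current.add(pair); // lines with too few fields are skipped
            }
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }
}
```

The lines themselves could be read with java.nio.file.Files.readAllLines once the archive has been unpacked.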

Named Entity Recognition

The NER problem is the problem of identifying which phrases in natural language text refer to named entities. Named entities can be identified at various levels of coarseness. In this problem set, we will label each entity as either a person, a location, an organization, or miscellaneous. For example:

[PER Bob Denver] went to [LOC Denver] for a meeting of the [ORG Denver Musicians Association].

Annotated data for this task is typically presented with separate annotations for each word. Thus, a tag (such as PER) is split into B-tag and I-tag to indicate the first word (or beginning) of a tag and an inside word of a tag respectively. Our approach will be to learn a classifier named NETagger that predicts such a B- or I- tag when given a word as input. Of course, NETagger must also be capable of predicting that a given word is not part of any named entity. We will use the tag O for this purpose (you’ll see it in the data).
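To make the per-word encoding concrete, the following sketch turns entity spans into B-, I- and O tags. The class BioEncoder is hypothetical and exists only for illustration; it assumes spans are given as word indices with an exclusive end and do not overlap.

```java
import java.util.Arrays;

/** Hypothetical illustration of the B-/I-/O encoding; not part of the provided sources. */
final class BioEncoder {
    /**
     * Tags a sentence of the given length. spans[k] = {start, end} covers words
     * start..end-1 and has entity type types[k]; all other words receive O.
     */
    static String[] encode(int length, int[][] spans, String[] types) {
        String[] tags = new String[length];
        Arrays.fill(tags, "O");
        for (int k = 0; k < spans.length; k++) {
            int start = spans[k][0];
            int end = spans[k][1];
            tags[start] = "B-" + types[k];      // first word of the entity
            for (int i = start + 1; i < end; i++) {
                tags[i] = "I-" + types[k];      // remaining words of the entity
            }
        }
        return tags;
    }
}
```

For the example sentence, assuming the final period is tokenized as its own word (14 words in total), the spans {0, 2} PER, {4, 5} LOC and {10, 13} ORG yield:

B-PER I-PER O O B-LOC O O O O O B-ORG I-ORG I-ORG O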

Given labeled testing data, there are different ways in which one might want to test the performance of our NETagger. We could test word by word, counting the prediction of NETagger as correct if and only if it produces exactly the same tag as appears in the testing data. Thus, if the true label of a given word is B-LOC and NETagger produces I-LOC, we would count the prediction as incorrect. Then, even though the first word of that named entity was mispredicted, NETagger would still have the opportunity to get the rest of the words in the named entity correct.

Alternatively, we could test phrase by phrase. To do so, we look collectively at all the predictions made by NETagger on the words in a given sentence. Any time we see a prediction of B-tag followed by zero or more I-tag predictions, we consider the entire group of words involved as one entity predicted as tag. We then say that a single correct prediction has been made whenever an entity predicted by NETagger matches an entity in the testing data. Furthermore, a single incorrect prediction occurs whenever an entity predicted by NETagger does not match any entity in the testing data and whenever an entity in the testing data does not match any predicted entity. (To “match”, the entities must have the same type and involve the same set of words.)
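A minimal sketch of this phrase-by-phrase counting for one sentence follows. The class PhraseScorer is hypothetical, and the provided cs446.ne.NETester may differ in its details. One assumption goes beyond the text: an I- tag that does not continue a preceding B- tag of the same type is ignored rather than treated as the start of an entity.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of phrase-level scoring; cs446.ne.NETester may behave differently. */
final class PhraseScorer {
    /** Collects entities as "type:start-end" strings, with end exclusive. */
    static Set<String> entities(String[] tags) {
        Set<String> found = new HashSet<>();
        int i = 0;
        while (i < tags.length) {
            if (tags[i].startsWith("B-")) {
                String type = tags[i].substring(2);
                int end = i + 1;
                // A B-tag followed by zero or more I-tags of the same type forms one entity.
                while (end < tags.length && tags[end].equals("I-" + type)) {
                    end++;
                }
                found.add(type + ":" + i + "-" + end);
                i = end;
            } else {
                i++; // O, or an I- tag without a matching B- tag (ignored by assumption)
            }
        }
        return found;
    }

    /** Returns {correct, incorrect} phrase-level counts for one sentence. */
    static int[] score(String[] gold, String[] predicted) {
        Set<String> goldEntities = entities(gold);
        Set<String> predictedEntities = entities(predicted);
        Set<String> matched = new HashSet<>(goldEntities);
        matched.retainAll(predictedEntities); // same type and same set of words
        int correct = matched.size();
        int incorrect = (predictedEntities.size() - correct) + (goldEntities.size() - correct);
        return new int[] { correct, incorrect };
    }
}
```

With gold tags B-PER I-PER O B-LOC and predictions B-PER I-PER O B-ORG, the person entity counts as one correct prediction, while the predicted ORG and the missed LOC each count as one incorrect prediction.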

In either testing scenario, it is possible to simply compute the accuracy of NETagger by dividing the number of correct predictions by the total number of predictions. However, the vast majority of words in our data will be labeled O, so a classifier that predicts O for every word could have a relatively high accuracy while not helping us identify any named entities. Alternatively, we can compute the precision and recall associated with each possible tag. The precision associated with tag is the percentage of words (or phrases, depending on how we are testing) predicted as tag that have been correctly predicted. The recall associated with tag is the percentage of words (or phrases) whose true label is tag that have been correctly predicted. Similarly, we can compute overall precision and recall numbers for the classifier. Finally, it is also possible to combine precision and recall into a single statistic that summarizes the effectiveness of the classifier by taking their harmonic mean. This statistic is referred to as $F_1$ and is computed as $F_1 = \frac{2pr}{p + r}$, where $p$ is the precision and $r$ is the recall.
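As a small sketch of these statistics, the hypothetical class below computes precision, recall, and their harmonic mean 2pr/(p + r) from raw counts. Returning 0 when a denominator is 0 is a convention chosen here for illustration, not something the problem set specifies.

```java
/** Hypothetical helper for precision, recall and F1; not part of the provided sources. */
final class PrfMetrics {
    /** Fraction of predicted items that are correct (0 if nothing was predicted). */
    static double precision(int correct, int predicted) {
        return predicted == 0 ? 0.0 : (double) correct / predicted;
    }

    /** Fraction of true items that were correctly predicted (0 if there are none). */
    static double recall(int correct, int actual) {
        return actual == 0 ? 0.0 : (double) correct / actual;
    }

    /** Harmonic mean of precision p and recall r (0 if both are 0). */
    static double f1(double p, double r) {
        return (p + r) == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }
}
```

With made-up counts of 8 correct predictions out of 10 predicted and 16 true entities, p = 0.8 and r = 0.5, so F1 = 0.8 / 1.3, or about 0.615. A classifier that predicts O for every word predicts no entities at all, so its recall on every entity tag, and hence its F1, is 0 no matter how high its word-level accuracy is.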

Learning Based Java

LBJ is a new programming framework for the design of software systems that learn from experience and perform inference. The LBJ software and manual can be obtained from the Cognitive Computations Group website (http://l2r.cs.uiuc.edu/~cogcomp). When downloading the software, you will be given two options: either (a) download the source code in a single gzipped tar file or (b) download two jar files. The second option will take less effort and will suffice for the purposes of this problem set.

Read Chapters 1 and 2, and also Section 3.1 of the LBJ manual. This should be enough to get you started programming in the language. It will also be use-

Your Task

For each learning algorithm you try, report the name of the learning algorithm and the parameter settings you chose. Then, show a table detailing your classifier’s performance when using that learning algorithm. The table should have the same rows and columns as the table produced by cs446.ne.NETester except for the O and Accuracy rows and the LCount and PCount columns, all of which should be omitted.

  2. Next, we will improve NETagger by allowing it to consider more information when making its decision. For all experiments in this problem, replace the with clause in your LBJ definition of NETagger with exactly this:

         with new SparseNetworkLearner(new SparseAveragedPerceptron(.1, 0, 2))

     Each item below describes a new set of features to incorporate in the classifier. They should be added cumulatively. For example, your classifier for problem 2.i should also use the features from problem 1, your classifier for problem 2.ii should also use the features from problems 1 and 2.i, and so on.

i. The LBJ.nlp package already has some useful classifiers implemented that are easy to include and experiment with. These features include:

     1. Capitalization of the target and surrounding words, which may be a useful indicator of the presence of a named entity;
     2. Affixes, which indicates the target word’s 3 and 4 letter prefixes as well as its 1 through 4 letter suffixes; and
     3. WordTypeInformation, which includes Boolean features indicating whether the word contains only capital letters, contains only digits, or contains only characters that aren’t letters.

   You should experiment with adding these features independently to find which helps the most, and how they perform all together. See the LBJ documentation for more information regarding these features.

ii. Next, we would like to use the NE tags of the previous two words as features. These are the very labels we are using to train NETagger. During training, we have this information available to us. After training, we continue to use labeled data, but we must pretend that the labels are unavailable to our classifier as it makes its decision, in order to simulate plain text input, which surely will not be labeled. So, you will create a classifier that detects whether NETagger is currently being trained. If so, it will simply return the previous two NE tags for NETagger to use as features. Otherwise, it will evaluate NETagger itself on the previous two words and return the predictions as features. In order to achieve this effect, you will need to make use of the static, Boolean isTraining field that LBJ builds into every learning classifier, as well as the cachedin keyword in the left hand side of a classifier assignment. See the LBJPOS code listed on the LBJ web page (a link to which is given in the instructions).

iii. A powerful source of extra information for Named Entity Recognition is world knowledge regarding which words one expects to be names, places, and so on. You have been provided with lists of NE words that indicate a possible PER, ORG, LOC or MISC in the Reuters2003 data package (lists directory). You should create a new feature (or set of features) that loads these lists and indicates whether the target word exists in one of these lists, and in which ones (PER, ORG, LOC, or MISC). For example, the word “Denver” appears in both ned.list.LOC and ned.list.PER, so it should have both the LOC list and PER list features active. The type of data in each file is identified by its name, but each line of the respective files also specifies the tag of the word followed by the word.

iv. Create your own feature(s). Try to think of what other clues may exist in the text that will indicate a named entity, or whether the existing features can be combined (or further divided) in an interesting way. Note that the training and test data provide part of speech and chunking information in columns that have so far been unused by your classifiers. It is OK if the new feature does not improve the final NETagger’s performance.
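The list lookup in item iii can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the required LBJ feature: the class ListFeatures is hypothetical, and it assumes each list line consists of the tag, then whitespace, then the word. Verify that layout against the actual files in the lists directory. How the result is exposed as LBJ features is left to the LBJ manual.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of NE word-list lookup; not part of the provided sources. */
final class ListFeatures {
    // Maps a word to the set of list tags (e.g. LOC, PER) it appears under.
    private final Map<String, Set<String>> tagsByWord = new HashMap<>();

    /** Adds entries from lines assumed to look like "TAG word". */
    void addLines(List<String> lines) {
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+", 2);
            if (parts.length < 2) {
                continue; // blank or malformed line
            }
            tagsByWord.computeIfAbsent(parts[1], w -> new TreeSet<>()).add(parts[0]);
        }
    }

    /** Returns the tags of every list containing the word; empty if it is in none. */
    Set<String> listsContaining(String word) {
        return tagsByWord.getOrDefault(word, Collections.<String>emptySet());
    }
}
```

After loading lines such as "LOC Denver" and "PER Denver", listsContaining("Denver") returns both LOC and PER, matching the example in item iii. Each file could be loaded by passing the result of java.nio.file.Files.readAllLines to addLines.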

Your Task

(a) Create a single graph containing five learning curves, one for each cumulative feature set described above and one for the baseline feature set of problem 1. A learning curve plots the performance (overall phrase by phrase $F_1$) of the learned classifier on the testing data as a function of the number of rounds of training (i.e., the number of times NETagger has processed the training data) up to a maximum of 50 rounds.

(b) Create a table in which each row corresponds to a feature set and in which the first column identifies the feature set, the second indicates the best testing performance achieved during the 50 rounds of training, and the third indicates the number of training rounds at which that best performance was achieved.

(c) Compare the individual features of 2.i to see which improves the most over the baseline feature set. Create the same table as in (b), but for the individual features used in feature set 2.i (Capitalization, Affixes, and WordTypeInformation), as well as the baseline performance of word forms without these features (feature set 1) and their performance all together (complete 2.i). Which feature seems to be the most useful on its own? Why do you think that is the case for this task?

(d) Explain why use of the cachedin keyword is crucial to making feature set 2.ii work.

(e) Explain the features you invented for part 2.iv and why you believed they would improve NETagger’s performance.

The program cs446.ne.LearningCurve has been provided to assist you. It assumes that your classifier is named cs446.ne.NETagger and that it has not been trained at all yet. It will train your classifier for the number of rounds you specify, testing after each round and reporting the results. In addition, when the program has finished, NETagger will be saved with the parameter settings that resulted in the best performance. This will be useful for the next problem, but keep in mind that if you want to run the learning curve experiment again, you must first manually remove any previously saved parameter settings so that you will be starting fresh.
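For the table of best performance per feature set, the best round can be read off a list of per-round scores as in the sketch below. The class BestRound is hypothetical, and it assumes you have copied the overall phrase by phrase F1 reported after each round into an array yourself; the output format of cs446.ne.LearningCurve is not assumed here. Ties are resolved in favor of the earliest round, which is a choice made for this sketch.

```java
/** Hypothetical helper for summarizing a learning curve; not part of the provided sources. */
final class BestRound {
    /** Returns the 1-based round with the highest score; ties go to the earliest round. */
    static int bestRound(double[] f1ByRound) {
        if (f1ByRound.length == 0) {
            throw new IllegalArgumentException("no rounds");
        }
        int best = 0;
        for (int i = 1; i < f1ByRound.length; i++) {
            if (f1ByRound[i] > f1ByRound[best]) {
                best = i;
            }
        }
        return best + 1; // rounds are counted from 1
    }
}
```

For the made-up scores {0.5, 0.7, 0.7, 0.6}, the best performance is 0.7 and it is first reached at round 2.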