An Introduction to Named Entity Recognition and Learning Based Java | CS 446 | Assignments Computer Science

CS446: Pattern Recognition and Machine Learning Fall 2008

Problem Set 3

Handed Out: September 25, 2008 Due: October 9, 2008

• Feel free to talk to other members of the class in doing the homework. I am more concerned that you learn how to solve the problem than that you demonstrate that you solved it entirely on your own. You should, however, write down your solution yourself. Please try to keep the solution brief and clear.

• Feel free to send me email or come to ask questions.

• Please, no handwritten solutions. Be sure your name appears on the top of each page.

• Please present your algorithms in both pseudocode and English. That is, give a precise formulation of your algorithm as pseudocode and also explain in one or two concise paragraphs what your algorithm does. Be aware that pseudocode is much simpler and more abstract than real code. Take a look at the textbook pseudocode (e.g. Table 2.5 on page 33) to get an idea about the appropriate level of abstraction.

• The homework is due at 4:00 pm on the due date. Email BOTH your write-up and your code to the TA. Please do NOT hand in a hard copy of your write-up. Please put “<userid>CS446 hw3 submission” as the subject line of the email when you submit your homework to ([email protected]).

Introduction

In this problem set we will study the Named Entity Recognition (NER) problem using Learning Based Java (LBJ). The goal of this problem set is to let you experience the process of developing a classifier for a real-world problem and appreciate the importance of feature engineering, both as a significant component of the development process and as a crucial factor in the eventual performance of the classifier.

Training data, some Java sources, and a Makefile that will come in handy are provided on the CS446 website. Javadoc for those sources will also be available there. The data is given in a file containing tab-delimited lines of text. Each line describes a single word, and sentences are separated by newlines. For this problem set, we will only be interested in the information in the first and sixth fields of each line. You can get the data from:

http://l2r.cs.uiuc.edu/~danr/Teaching/CS446-08/Hw/hw3/Reuters2003.tgz
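As a rough illustration of the file layout described above, the following sketch reads tab-delimited text and keeps only the first and sixth fields of each line. It is not part of the provided sources: the class name `ColumnReader`, the nested `Row` type, and the `parse` method are hypothetical, and the code assumes that a blank line marks a sentence boundary and that lines with fewer than six fields can be skipped. It deliberately does not say which of the two fields holds the word and which holds the label.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the provided course sources.
class ColumnReader {
    /** The two fields of interest from one line, named by position only. */
    static final class Row {
        final String first;
        final String sixth;

        Row(String first, String sixth) {
            this.first = first;
            this.sixth = sixth;
        }
    }

    /** Groups the lines of the given text into sentences of rows. */
    static List<List<Row>> parse(String text) {
        List<List<Row>> sentences = new ArrayList<>();
        List<Row> current = new ArrayList<>();
        // "\\R" matches any line break, so Unix and Windows files both work.
        for (String line : text.split("\\R", -1)) {
            if (line.trim().isEmpty()) {
                // Assumption: a blank line ends the current sentence.
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
                continue;
            }
            String[] fields = line.split("\t");
            if (fields.length < 6) {
                continue; // Assumption: malformed lines are ignored.
            }
            current.add(new Row(fields[0], fields[5]));
        }
        if (!current.isEmpty()) {
            sentences.add(current); // The last sentence may lack a trailing blank line.
        }
        return sentences;
    }
}
```

Calling `ColumnReader.parse` on the contents of a data file would yield one list per sentence, which is a convenient shape for computing features that look at neighboring words.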

Named Entity Recognition

The NER problem is the problem of identifying which phrases in natural language text refer to named entities. Named entities can be identified at various levels of coarseness. In this problem set, we will label each entity as either a person, a location, an organization, or miscellaneous. For example:

[PER Bob Denver] went to [LOC Denver] for a meeting of the [ORG Denver Musicians Association].
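To make the bracketed notation in the example concrete, here is a minimal sketch that pulls the (label, phrase) pairs out of such a sentence with a regular expression. The class `BracketedEntities` and its `extract` method are hypothetical and are not part of LBJ or the provided sources. The pattern assumes labels are written in uppercase letters and that brackets are never nested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class BracketedEntities {
    // "[LABEL words ...]": group 1 is the label, group 2 is the phrase.
    private static final Pattern ENTITY =
            Pattern.compile("\\[([A-Z]+)\\s+([^\\]]+?)\\s*\\]");

    /** Returns one {label, phrase} pair per bracketed entity, in order. */
    static List<String[]> extract(String sentence) {
        List<String[]> entities = new ArrayList<>();
        Matcher m = ENTITY.matcher(sentence);
        while (m.find()) {
            entities.add(new String[] { m.group(1), m.group(2) });
        }
        return entities;
    }
}
```

Applied to the example sentence, this returns three pairs: PER with "Bob Denver", LOC with "Denver", and ORG with "Denver Musicians Association". Note that the same surface word, "Denver", appears under all three labels, which is why context matters for this task.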

Annotated data for this task is typically presented with separate annotations for each word. Thus, a tag (such as PER) is split into B-tag and I-tag to indicate the first word (or beginning)
