






Artificial Intelligence. Programming Assignment of Natural Language Processing, MaxEnt Sequence Classifiers and Treebank Parsing. Prof Manning - Stanford University







This assignment looks at named entity recognition and parsing. The aim is to examine whether pre-chunking of named entities can improve the performance of a statistical parser trained on a single or hybrid biomedical corpus when applied to the task of parsing another set of biomedical research articles. You will build a maximum entropy classifier, which will be incorporated into a maximum entropy Markov model for doing named entity recognition on biomedical text. You will also implement the parsing algorithm for a broad coverage statistical treebank parser. We have included in the support code the ability to chunk entities into a single word, and then to pass this chunked sentence to the parser, so that you can then informally compare the performance of the parser on chunked and unchunked input.
We’ve put everything for both portions of this assignment (both Java starter code and data files) in the directory /afs/ir/class/cs224n/pa3/. The Java starter code is in pa3/java/, and the data is in pa3/data/. Copy the starter code to your local directory, and make sure you can compile it. Spend some time looking through the principal source files for this assignment. For the maximum entropy portion, these are:
src/cs224n/assignments/MaximumEntropyClassifierTester.java
You can do a quick test run for the first one using the command
java cs224n.assignments.MaximumEntropyClassifierTester -mini
For the parsing portion, the principal file is:
java/src/cs224n/assignments/PCFGParserTester.java
Make sure you can run the main method of the PCFGParserTester class. You can either run it from the data directory, or pass that directory in as the -path value. Use the -data value to specify which dataset to use. Running:
java -server -mx500m cs224n.assignments.PCFGParserTester -path /afs/ir/class/cs224n/pa3/data/ -data miniTest
will train and test your parser on a few sentences from a toy grammar. Running:
java -server -mx500m cs224n.assignments.PCFGParserTester -path /afs/ir/class/cs224n/pa3/data/ -data genia
or,
java -server -mx500m cs224n.assignments.PCFGParserTester -path /afs/ir/class/cs224n/pa3/data/ -data bioie
will train and test your parser on data from the GENIA or BioIE datasets, respectively. GENIA is about 4,000 sentences while BioIE is around 6,000. Training on a combination of both datasets simultaneously is possible with the -data combo flag.
The backbone of a maximum entropy Markov model (MEMM) is the maximum entropy classifier, which you will be responsible for building. As you learned in class, an MEMM is trained in exactly the same way as a regular maximum entropy model, where each word corresponds to one datum, and the correct label for the previous word can be used in features for the current word. The primary difference comes at test time, when a Viterbi decoder (which we have provided) must be used to find the best possible sequence of labels, instead of greedily finding the best label at each point.
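To make the contrast with greedy labeling concrete, here is a minimal sketch of Viterbi decoding. It is not the decoder provided in the support code: the LocalModel interface, the use of -1 as a start-of-sentence label, and the toy probabilities in main are all assumptions made for illustration.

```java
public class ViterbiSketch {
    /** Hypothetical local model: log p(label | previous label) at a position; prevLabel is -1 at position 0. */
    interface LocalModel {
        double logProb(int position, int prevLabel, int label);
    }

    /** Returns the highest-scoring label sequence for a sentence of the given length (length >= 1). */
    static int[] decode(int length, int numLabels, LocalModel model) {
        double[][] best = new double[length][numLabels]; // best score of any path ending in label y at t
        int[][] back = new int[length][numLabels];       // previous label on that best path
        for (int y = 0; y < numLabels; y++) {
            best[0][y] = model.logProb(0, -1, y);
        }
        for (int t = 1; t < length; t++) {
            for (int y = 0; y < numLabels; y++) {
                best[t][y] = Double.NEGATIVE_INFINITY;
                for (int prev = 0; prev < numLabels; prev++) {
                    double score = best[t - 1][prev] + model.logProb(t, prev, y);
                    if (score > best[t][y]) {
                        best[t][y] = score;
                        back[t][y] = prev;
                    }
                }
            }
        }
        // Pick the best final label, then follow the back pointers.
        int last = 0;
        for (int y = 1; y < numLabels; y++) {
            if (best[length - 1][y] > best[length - 1][last]) last = y;
        }
        int[] labels = new int[length];
        labels[length - 1] = last;
        for (int t = length - 1; t > 0; t--) {
            labels[t - 1] = back[t][labels[t]];
        }
        return labels;
    }

    public static void main(String[] args) {
        // Toy model in which the greedy first choice (label 0) leads to a worse sequence overall.
        LocalModel model = (t, prev, y) -> {
            if (t == 0) return Math.log(y == 0 ? 0.6 : 0.4);
            if (prev == 0) return Math.log(0.5);
            return Math.log(y == 0 ? 0.1 : 0.9);
        };
        int[] labels = decode(2, 2, model);
        System.out.println(labels[0] + " " + labels[1]);
    }
}
```

In the toy model, a greedy labeler would commit to label 0 at the first position, while the best full sequence starts with label 1.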
Look through the code in MaximumEntropyClassifierTester.java. This class contains several subclasses. The most important two are:
MaximumEntropyClassifier (implements classify.ProbabilisticClassifier)
MaximumEntropyClassifierFactory (creates the former)
Look at the main method to see the overall program flow. There are two modes you can run this main method in. If you supply -mini as the first command line argument, you’ll get a miniature classification problem from the miniTest() method. I recommend working with this branch first, since it’s easier to debug. miniTest() creates several training datums and a test datum. Each datum represents either a cat or a bear and has several features (which are just strings). The training data are passed to a MaximumEntropyClassifierFactory, which uses them to learn a MaximumEntropyClassifier. This classifier is then applied to the test data, and an accuracy (and distribution over labels) is printed out.
While the starter code contains a fully-functioning pipeline for training and testing a classifier, the classifier it builds is as dumb as a rock, and does not use maximum entropy. Your job is to turn the placeholder code into a maximum entropy classifier by filling in the three chunks of code marked by “// TODO” lines. First, look at
MaximumEntropyClassifier.getLogProbabilities()
This method takes a datum, and produces the (log) distribution, according to the model, over the possible labels for the datum. There will be some interface shock here, because you’re looking
$$p(c \mid d, \lambda) = \frac{\exp\left[\sum_i \lambda_i f_i(c, d)\right]}{\sum_{c'} \exp\left[\sum_i \lambda_i f_i(c', d)\right]}$$
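As a minimal sketch of the formula above, the following turns per-label activations (the inner sums over i of λ_i f_i(c, d)) into log probabilities. The class and method names are hypothetical and the array-based interface is an assumption; the starter code's own data structures will differ. Subtracting the maximum activation before exponentiating is a common way to avoid overflow and does not change the result.

```java
public class SoftmaxSketch {
    /** Converts one activation per label into log p(c | d, lambda) for each label c. */
    static double[] logProbabilities(double[] activations) {
        // Shift by the maximum so that Math.exp never sees a large positive argument.
        double max = Double.NEGATIVE_INFINITY;
        for (double a : activations) {
            max = Math.max(max, a);
        }
        double sum = 0.0;
        for (double a : activations) {
            sum += Math.exp(a - max);
        }
        // Log of the denominator in the formula.
        double logNormalizer = max + Math.log(sum);
        double[] logProbs = new double[activations.length];
        for (int c = 0; c < activations.length; c++) {
            logProbs[c] = activations[c] - logNormalizer;
        }
        return logProbs;
    }

    public static void main(String[] args) {
        for (double lp : logProbabilities(new double[] {2.0, 1.0, 0.0})) {
            System.out.println(Math.exp(lp));
        }
    }
}
```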
(Note that the i’s here correspond to the indexes generated by the IndexLinearizer.) The derivatives of the log likelihood are therefore:
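$$\frac{\partial F(\lambda)}{\partial \lambda_i} = \sum_{\langle c,d\rangle \in \langle C,D\rangle} f_i(c, d) - \sum_{\langle c,d\rangle \in \langle C,D\rangle} \sum_{c'} p(c' \mid d, \lambda) f_i(c', d)$$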
Recall that the left summation is the number of times the feature i actually occurs in examples with true class c in the training, while the right sum is the expectation of the same quantity using the label distributions the model predicts.
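The two sums can be sketched in code as follows. This is an illustration under stated assumptions, not the starter code's calculate() method: features are binary and given as arrays of active feature indices, the index() helper is a hypothetical stand-in for the IndexLinearizer, and the function returns the log likelihood itself. If the optimizer you are handed minimizes, both the value and the derivatives would need to be negated; that sign convention is an assumption to check against the support code.

```java
import java.util.Arrays;

public class LogLikelihoodSketch {
    /** Hypothetical stand-in for the IndexLinearizer: one weight index per (feature, label) pair. */
    static int index(int feature, int label, int numLabels) {
        return feature * numLabels + label;
    }

    /** Returns log p(c | d, lambda) for every label c, given the datum's active features. */
    static double[] logProbs(int[] features, double[] lambda, int numLabels) {
        double[] act = new double[numLabels];
        for (int c = 0; c < numLabels; c++) {
            for (int f : features) act[c] += lambda[index(f, c, numLabels)];
        }
        double max = Double.NEGATIVE_INFINITY;
        for (double a : act) max = Math.max(max, a);
        double sum = 0.0;
        for (double a : act) sum += Math.exp(a - max);
        double logZ = max + Math.log(sum);
        for (int c = 0; c < numLabels; c++) act[c] -= logZ;
        return act;
    }

    /** Fills gradient with the derivatives and returns the log likelihood of the data. */
    static double logLikelihood(int[][] data, int[] labels, double[] lambda,
                                int numLabels, double[] gradient) {
        Arrays.fill(gradient, 0.0);
        double value = 0.0;
        for (int d = 0; d < data.length; d++) {
            double[] lp = logProbs(data[d], lambda, numLabels);
            value += lp[labels[d]];
            for (int f : data[d]) {
                // Left sum: the feature fires together with the true label.
                gradient[index(f, labels[d], numLabels)] += 1.0;
                // Right sum: expected firing under the model's label distribution.
                for (int c = 0; c < numLabels; c++) {
                    gradient[index(f, c, numLabels)] -= Math.exp(lp[c]);
                }
            }
        }
        return value;
    }
}
```

A quick sanity check for code like this is to compare each derivative against a finite difference of the objective at a few random weight vectors.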
The current code just says that the objective is 42 and the derivatives are flat. Note that you don’t have to guess at x — that’s the job of the optimization code. All you have to do in the calculate() method is evaluate proposed x vectors. You have available as method arguments the data, the string-to-index encoding, and the linearizer we discussed before:
EncodedDatum[] data;
Encoding encoding;
IndexLinearizer indexLinearizer;
Write code to calculate the objective function and its derivatives, and return the Pair of those two quantities.
Now run miniTest() again. This time, the optimization should find a good solution, one that puts all of the mass onto the correct answer “cat.”
Almost done! Remember that putting probability 1.0 on “cat” is probably the wrong behavior here. To smooth, or regularize, our model, we’re going to modify the objective function to penalize large weights. In the calculate() method, you should now add code which modifies the objective function as follows:
$$G(\lambda) = F(\lambda) + \sum_i \frac{\lambda_i^2}{2\sigma^2}$$
The derivatives change correspondingly:
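$$\frac{\partial G(\lambda)}{\partial \lambda_i} = \frac{\partial F(\lambda)}{\partial \lambda_i} + \frac{\lambda_i}{\sigma^2}$$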
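A minimal sketch of this modification, assuming the unregularized objective value and its derivatives have already been computed into a double and a double[]. The helper name addPenalty is hypothetical. Adding the penalty, as the formula does, makes sense when the objective is being minimized; whether the starter code's F is a negated log likelihood is an assumption to verify.

```java
public class GaussianPriorSketch {
    /**
     * Adds sum_i lambda_i^2 / (2 sigma^2) to the objective and
     * lambda_i / sigma^2 to each derivative (in place). Returns the new objective.
     */
    static double addPenalty(double objective, double[] derivatives, double[] lambda, double sigma) {
        double sigmaSquared = sigma * sigma;
        double penalty = 0.0;
        for (int i = 0; i < lambda.length; i++) {
            penalty += lambda[i] * lambda[i] / (2.0 * sigmaSquared);
            derivatives[i] += lambda[i] / sigmaSquared;
        }
        return objective + penalty;
    }

    public static void main(String[] args) {
        double[] derivatives = {0.0, 0.0};
        double value = addPenalty(3.0, derivatives, new double[] {1.0, -2.0}, 1.0);
        System.out.println(value + " " + derivatives[0] + " " + derivatives[1]);
    }
}
```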
Run miniTest() one last time. You should now get less than 1.0 on “cat” (0.73 with the default sigma).
You will be doing named entity recognition on biomedical data, and the task is to identify the following types of entities: cell line, cell type, DNA, RNA, and protein. Don’t worry if you don’t know much (or anything) about biology; you’ll find that it really shouldn’t matter. We have also included newswire data from the MUC shared task, which contains entities of the types PERSON, LOCATION, ORGANIZATION, MONEY, PERCENT, DATE, and TIME. You are absolutely not required to feature engineer for the newswire data, or even run your system on it; we have merely included it for those who are interested and feel like playing around with it.
Now that your classifier works, goodbye miniTest()! Run the main method with a single argument of
/afs/ir/class/cs224n/pa3/data/ner/genia
or wherever you may have copied the data. It now loads the named entity recognition data from the data directory and converts each data instance into a list of String features (replacing genia with muc7 will train on a corpus of newswire data instead). Currently, it adds only two features: one for the current word and one for the previous label. This should train relatively quickly (within a few minutes). It will print output to standard output, using a format very similar to the input data. There will be one word per line, with three columns. The first column will contain the word, the second column the correct answer, and the third column the answer guessed by the model. We have provided a script for computing the per-entity F-score. This script is located at /afs/ir/class/cs224n/pa3/bin/nerEval. The model you just built won’t work well - that’s where you come in!
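For a sense of what richer string features might look like, here is a small sketch. Everything in it is hypothetical: the method signature, the feature name prefixes, and the particular features (word shape, suffix, neighboring words) are illustrative choices, not the starter code's interface or a recommended feature set.

```java
import java.util.ArrayList;
import java.util.List;

public class NerFeatureSketch {
    /** Builds string features for the word at the given position (hypothetical signature). */
    static List<String> extract(List<String> sentence, int position, String previousLabel) {
        String word = sentence.get(position);
        List<String> features = new ArrayList<>();
        // The two kinds of feature the handout says are already present.
        features.add("WORD-" + word);
        features.add("PREV_LABEL-" + previousLabel);
        // Illustrative additions.
        features.add("SHAPE-" + shape(word));
        features.add("SUFFIX3-" + word.substring(Math.max(0, word.length() - 3)));
        features.add("PREV_WORD-" + (position > 0 ? sentence.get(position - 1) : "<S>"));
        features.add("NEXT_WORD-"
                + (position + 1 < sentence.size() ? sentence.get(position + 1) : "</S>"));
        return features;
    }

    /** Collapsed word shape: uppercase -> X, lowercase -> x, digit -> d, runs merged. */
    static String shape(String word) {
        StringBuilder sb = new StringBuilder();
        for (char ch : word.toCharArray()) {
            char s = Character.isUpperCase(ch) ? 'X'
                    : Character.isLowerCase(ch) ? 'x'
                    : Character.isDigit(ch) ? 'd' : ch;
            if (sb.length() == 0 || sb.charAt(sb.length() - 1) != s) {
                sb.append(s);
            }
        }
        return sb.toString();
    }
}
```

Shape features of this kind tend to help with tokens that mix case, digits, and hyphens, which are common in gene and protein names.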
Your job here is to flesh out the feature extraction code in
MaximumEntropyClassifierTester.extractFeatures(List
This List
In this portion of the project, you will build a broad-coverage probabilistic parser. You’re free to build a beam-decoded shift-reduce parser, or implement any other general (P)CFG parsing solution you find interesting and manageable. However, the bulk of the support code assumes that you will be building a parser where the grammar rules have been pre-transformed into exclusively unary and binary grammar rules. The easiest thing to build would be a probabilistic generalized CKY parser. Another good thing to attempt to build is an agenda-driven PCFG parser.
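The core of probabilistic CKY can be sketched as follows. This is a deliberately reduced illustration: it handles only binary rules and a lexicon, omits the unary-rule handling that a generalized CKY parser needs, and returns only the best score rather than the tree (which would require back pointers). The BinaryRule class, the toy grammar, and the map-based chart are assumptions, not the support code's Grammar or Lexicon classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CkySketch {
    /** Hypothetical binary rule: parent -> left right, with a log probability. */
    static final class BinaryRule {
        final String parent, left, right;
        final double logProb;
        BinaryRule(String parent, String left, String right, double logProb) {
            this.parent = parent; this.left = left; this.right = right; this.logProb = logProb;
        }
    }

    /** Best log probability of deriving the words from goal; negative infinity if there is no parse. */
    @SuppressWarnings({"unchecked", "rawtypes"})
    static double bestLogProb(List<String> words, List<BinaryRule> rules,
                              Map<String, Map<String, Double>> lexicon, String goal) {
        int n = words.size();
        if (n == 0) return Double.NEGATIVE_INFINITY;
        // chart[i][j] maps each symbol to its best score over words i .. j-1.
        Map<String, Double>[][] chart = new HashMap[n + 1][n + 1];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j <= n; j++) chart[i][j] = new HashMap<>();
        }
        // Width-1 spans come straight from the lexicon.
        for (int i = 0; i < n; i++) {
            Map<String, Double> tags = lexicon.get(words.get(i));
            if (tags != null) chart[i][i + 1].putAll(tags);
        }
        // Build wider spans from pairs of adjacent narrower spans.
        for (int span = 2; span <= n; span++) {
            for (int i = 0; i + span <= n; i++) {
                int j = i + span;
                for (int k = i + 1; k < j; k++) {
                    for (BinaryRule r : rules) {
                        Double left = chart[i][k].get(r.left);
                        Double right = chart[k][j].get(r.right);
                        if (left == null || right == null) continue;
                        chart[i][j].merge(r.parent, r.logProb + left + right, Double::max);
                    }
                }
            }
        }
        return chart[0][n].getOrDefault(goal, Double.NEGATIVE_INFINITY);
    }

    /** Toy grammar: S -> NP VP, VP -> V NP, both with probability 1. */
    static List<BinaryRule> toyRules() {
        List<BinaryRule> rules = new ArrayList<>();
        rules.add(new BinaryRule("S", "NP", "VP", 0.0));
        rules.add(new BinaryRule("VP", "V", "NP", 0.0));
        return rules;
    }

    /** Toy lexicon: word -> (tag -> log probability). */
    static Map<String, Map<String, Double>> toyLexicon() {
        Map<String, Map<String, Double>> lexicon = new HashMap<>();
        lexicon.computeIfAbsent("cats", w -> new HashMap<>()).put("NP", Math.log(0.5));
        lexicon.computeIfAbsent("fish", w -> new HashMap<>()).put("NP", Math.log(0.5));
        lexicon.computeIfAbsent("eat", w -> new HashMap<>()).put("V", 0.0);
        return lexicon;
    }
}
```

Iterating over every rule at every split point is the simplest formulation but also the slowest; indexing rules by their left or right child is a common first optimization.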
At the beginning of the main method in PCFGParserTester some training and test trees are read in. Currently, the training trees are used to construct a BaselineParser that implements the Parser interface (which only has one method: getBestParse()). The parser is then used
mentation we give you binarizes the trees in a way that doesn’t generalize the grammar at all. You should run some trees through the binarization process to get the idea of what’s going on. If you annotate/binarize the training trees, you should be able to construct a Grammar out of them using the constructor provided.
Your first job is to build a parser using this grammar. For the miniTest dataset your parser should match the given parse of the test sentence exactly. Once you’ve got this working you can move on to the genia, bioie, or ultimately combo datasets.
Scan through a few of the training trees in the genia or bioie dataset to get a sense of the range of inputs. Something you’ll notice is that the grammar has relatively few non-terminal symbols (27 plus part-of-speech tags) but thousands of rules, many ternary-branching or longer. Currently there are 500 GENIA files and 642 BioIE files, and the default settings are roughly to use 90% of each as training data with the remaining 10% split between a validation and a test set. (You can look in the data directory if you’re curious about the native format of these files.) At the moment, only the training and test set are used, but you are welcome to use the validation set too if you see a use for it. The static integer MAX_LENGTH determines the maximum length of sentences to test on (it does not affect the training set). You can lower MAX_LENGTH for preliminary experiments, but your final parser should work on sentences of at least length 20 in a reasonable time (5 seconds per 20-word sentence is achievable with some optimization).
Once you have a parser working on the treebanks, your next task is to improve upon the supplied grammar by adding 2nd-order vertical markovization. This means using parent annotation symbols like NP^S to indicate a subject noun phrase instead of just NP. You can test your new grammar on the miniTest data set if you want, though the results won’t be very interesting. When you test it on the treebank the results should be pretty impressive: a 3–4% improvement over a parser using the original grammar.
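Parent annotation itself is a short recursive transformation. The sketch below uses a minimal hypothetical Node class rather than the support code's tree class, and it makes one design choice that is only an assumption: part-of-speech tags (nodes directly above a word) are left unannotated. Remember that any annotation added for training has to be stripped from the parser's output before evaluation.

```java
import java.util.ArrayList;
import java.util.List;

public class ParentAnnotationSketch {
    /** Minimal hypothetical tree node; the assignment's own tree class will differ. */
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        Node(String label, Node... kids) {
            this.label = label;
            for (Node kid : kids) children.add(kid);
        }
        boolean isLeaf() { return children.isEmpty(); }
        boolean isPreterminal() { return children.size() == 1 && children.get(0).isLeaf(); }
        @Override public String toString() {
            if (isLeaf()) return label;
            StringBuilder sb = new StringBuilder("(").append(label);
            for (Node child : children) sb.append(' ').append(child);
            return sb.append(')').toString();
        }
    }

    /** Returns a copy of the tree in which each phrasal label X under parent P becomes X^P. */
    static Node annotate(Node node, String parentLabel) {
        if (node.isLeaf()) return new Node(node.label);
        // The root has no parent; tags are left alone in this sketch.
        boolean keepPlain = parentLabel == null || node.isPreterminal();
        Node copy = new Node(keepPlain ? node.label : node.label + "^" + parentLabel);
        for (Node child : node.children) {
            // Pass down the original label so annotations do not stack up.
            copy.children.add(annotate(child, node.label));
        }
        return copy;
    }

    public static void main(String[] args) {
        Node tree = new Node("S",
                new Node("NP", new Node("N", new Node("cats"))),
                new Node("VP", new Node("V", new Node("sleep"))));
        System.out.println(annotate(tree, null));
    }
}
```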
At this point an F1 performance of 80% is probably achievable (but don’t worry too much about this exact figure—it’s just a ballpark).
Whenever you run the Java VM, you should invoke it with as much memory as you need, and in server mode:
java -server -mx500m package.ClassName
On many machines, you’ll get much faster performance than just running with no options.
If your parser is running very slowly, run the VM with the -Xprof command line option. This will result in a flat profile being output after your process completes. If you see that your program is spending a lot of time in hash map, hash code, or equals methods, you might be able to speed up your computation substantially by backing your sets, maps, and counters with IdentityHashMap instead of HashMap. This requires the use of something like an Interner for canonicalization. Ask around if you’re not sure what that would entail; some people in the class already know this trick.
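For those who have not seen the trick, here is a minimal sketch. The Interner class below is a hypothetical illustration, not a class from the support code. The idea is that an IdentityHashMap compares keys with == rather than equals(), so it only behaves correctly if every equal value is first replaced by one canonical object.

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

public class InternerSketch {
    /** Hypothetical interner: returns one canonical instance for each distinct value. */
    static final class Interner<T> {
        private final Map<T, T> canonical = new HashMap<>();

        T intern(T value) {
            // putIfAbsent returns the existing canonical object, or null if value is new.
            T existing = canonical.putIfAbsent(value, value);
            return existing == null ? value : existing;
        }
    }

    public static void main(String[] args) {
        Interner<String> interner = new Interner<>();
        // Two equal but distinct String objects collapse to the same instance.
        String a = interner.intern(new String("NP"));
        String b = interner.intern(new String("NP"));

        Map<String, Double> scores = new IdentityHashMap<>();
        scores.put(a, -1.5);

        System.out.println(a == b);
        System.out.println(scores.get(b));
        // A key that was never interned is not found, even though it equals "NP".
        System.out.println(scores.get(new String("NP")));
    }
}
```

The cost is one ordinary hash lookup when each object is created; after that, every map operation on the interned objects avoids hashCode() and equals() calls entirely.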
Luckily for you, this last section requires no coding because the glue for plugging the two systems together is provided. Your parser was trained on possibly two sets of biomedical data but will be tested on a third set of an additional 500 sentences. In this portion of the assignment, you are to test your parser on this data and then examine the output. You will need to specify on the command line where the new test data is located:
java -server -mx500m cs224n.assignments.PCFGParserTester -path /afs/ir/class/cs224n/pa3/data/ -data {genia|bioie|combo} -testData gtb
You should then run the system again, but with the entities in the biomedical data first chunked together into single words. You will do this by running the parser with an additional command line argument that tells it where the NER training data is:
java -server -mx500m cs224n.assignments.PCFGParserTester -path /afs/ir/class/cs224n/pa3/data/ -data {genia|bioie|combo} -testData gtb -nerTrainFile /afs/ir/class/cs224n/pa3/data/ner/genia
It will then train your NER system after it trains the parser, and use that NER model for chunking. Do not pay attention to the scores that the parser gives you on this data, as they will be meaningless on account of the fact that the actual words in the sentence are now different than those in the gold-standard tree. Your task here is to examine both sets of outputs, and see what types of errors the chunking fixes, what types it introduces, and generally how it affects the output.
For the MEMM portion, the following is expected:
For the parsing portion, the following is expected:
And, finally, for the combination portion:
Anything beyond this is optional.
For the write-up, you should describe what you built, what choices you had to make, why you made the choices you did, how well they worked out, and what you might do to improve things further. In particular, for the maximum entropy model:
We will compile and run your program on the Leland systems, using ant and our standard build.xml to compile, and using java to run. So, please make sure your program compiles and runs without difficulty on the Leland machines. If there’s anything special we need to know about compiling or running your program, please include a README file with your submission. Your code doesn’t have to be beautiful but we should be able to scan it and figure out what you did without too much pain.
You should turn in a write-up of the work you’ve done, as well as the code. Your write-up must be submitted as a hard copy. There is no set length for write-ups, but a ballpark length might be 6 pages, including your evaluation results, a graph or two, and some interesting examples. They are to be turned in to a box outside of Professor Manning’s office.