



Artificial Intelligence. Programming Assignment of Natural Language Processing, Language modeling. Prof Manning - Stanford University
Typology: Exercises




This assignment may be done individually or in groups of two. We strongly encourage collaboration; however, your submission must include a statement describing the contributions of each collaborator. See the collaboration policy on the website (http://cs224n.stanford.edu/assignments.html#collab). Please read this assignment soon and go through the Setup section to ensure that you are able to access the relevant files and compile the code. Especially if your programming experience is limited, start working early so that you will have ample time to discover stumbling blocks and ask questions.
On the Leland machines (such as bramble.stanford.edu) (^1), make sure you can access the following directories:
/afs/ir/class/cs224n/pa1/java/ : the Java code provided for this course
/afs/ir/class/cs224n/pa1/data/ : the data sets used in this assignment
Copy the pa1/java/ directory to your local directory and make sure you can compile the code without errors. The code compiles under JDK 1.5, which is the version installed on the Leland machines. To ease compilation, we’ve installed ant in the class bin/ directory. ant is similar in function to the Unix make command, but ant is smarter, is tailored to Java, and uses XML configuration files. When you invoke ant, it looks in the current directory for a file called build.xml which contains project-specific compilation instructions. The java/ directory contains a build.xml file suitable for this assignment (and a symlink to the ant executable). Thus, to copy the source files and compile them with ant, you can use the following sequence of commands:
cd ~
mkdir -p cs224n/pa1
cd cs224n/pa1
cp -r /afs/ir/class/cs224n/pa1/java .
cd java
./ant
If you don’t want to use ant, you are welcome to write a Makefile, or, for a simple project like this one, you can just run:
cd ~/cs224n/pa1/java/
mkdir classes/
javac -source 5 -d classes src/*/*/*.java
(^1) See http://www.stanford.edu/services/unixcomputing/environments.html for a list of Leland machines.
Once you’ve compiled the code successfully, you need to make sure you can run it. In order to execute the compiled code, Java needs to know where to find your compiled class files. As should be familiar to every Java programmer, this is normally achieved by setting the CLASSPATH environment variable. If you have compiled with ant, your class files are in java/classes, and the following commands will do the trick. Type printenv CLASSPATH. If nothing is printed, your CLASSPATH is empty and you can set it as follows:
setenv CLASSPATH ./classes
Otherwise, if something was printed out, enter the following to append to the variable:
setenv CLASSPATH ${CLASSPATH}:./classes
Now you’re ready to run the test. From the directory ~/cs224n/pa1/java/, enter:
java cs224n.assignments.LanguageModelTester
If everything’s working, you’ll get some output describing the construction and testing of a (pretty bad) language model. The next section will help you make sense of what you’re seeing.
Take a look at the main() method of LanguageModelTester.java, and examine its output. This class has the job of managing data files and constructing and testing a language model. Its behavior is controlled via command-line options. Each command-line option has a default value, and the effective values are printed at the beginning of each run. You can use shell scripts to easily configure options for a run—we’ve supplied a shell script called run that will give you the idea. The -model option specifies the fully qualified class name of a language model to be tested. Its default value is cs224n.langmodel.EmpiricalUnigramLanguageModel, a bare-bones language model implementation we’ve provided. Although this is a very poor language model, it illustrates the interface (cs224n.langmodel.LanguageModel) that you’ll need to follow in implementing your own language models. A LanguageModel should implement a no-argument constructor, and must implement four other methods:
The -data option to LanguageModelTester specifies the directory in which to find data. By default, this is /afs/ir/class/cs224n/pa1/data/; if you copy data to your own machine, you’ll want to override this option.
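To make the idea of a bare-bones unigram model concrete, here is a minimal, self-contained sketch. It is not the provided EmpiricalUnigramLanguageModel and does not implement the course's cs224n.langmodel.LanguageModel interface: the class name UnigramSketch, its method names, and the STOP and UNK tokens are all assumptions chosen for illustration. It also assumes Java 8 or later rather than the JDK 1.5 mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical sketch of an empirical unigram model; names are illustrative only. */
final class UnigramSketch {
    static final String STOP = "</S>";   // assumed end-of-sentence marker
    static final String UNK = "*UNK*";   // assumed token standing in for all unseen words
    private static final int MAX_DRAWS = 1000;

    private final Map<String, Integer> counts = new HashMap<>();
    private int total = 0;
    private final Random random = new Random(0);

    /** Counts every token, one STOP per sentence, and one reserved UNK event. */
    void train(List<List<String>> sentences) {
        counts.clear();
        total = 0;
        for (List<String> sentence : sentences) {
            for (String word : sentence) {
                add(word);
            }
            add(STOP);
        }
        add(UNK);
    }

    private void add(String word) {
        counts.merge(word, 1, Integer::sum);
        total++;
    }

    /** Relative frequency of the word; any unseen word is scored as UNK. */
    double wordProbability(String word) {
        requireTrained();
        int count = counts.containsKey(word) ? counts.get(word) : counts.get(UNK);
        return (double) count / total;
    }

    /** Product of the word probabilities, times the probability of stopping. */
    double sentenceProbability(List<String> sentence) {
        double probability = wordProbability(STOP);
        for (String word : sentence) {
            probability *= wordProbability(word);
        }
        return probability;
    }

    /** Samples words in proportion to their counts until STOP is drawn. */
    List<String> generateSentence() {
        requireTrained();
        List<String> sentence = new ArrayList<>();
        for (int i = 0; i < MAX_DRAWS; i++) {
            String word = sample();
            if (word.equals(STOP)) {
                break;
            }
            if (!word.equals(UNK)) {
                sentence.add(word);
            }
        }
        return sentence;
    }

    private String sample() {
        int target = random.nextInt(total);   // uniform over all counted events
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            target -= entry.getValue();
            if (target < 0) {
                return entry.getKey();
            }
        }
        throw new IllegalStateException("counts do not sum to total");
    }

    private void requireTrained() {
        if (total == 0) {
            throw new IllegalStateException("call train() first");
        }
    }

    public static void main(String[] args) {
        UnigramSketch model = new UnigramSketch();
        model.train(Arrays.asList(Arrays.asList("a", "b"), Arrays.asList("a")));
        System.out.println("P(a) = " + model.wordProbability("a"));
        System.out.println("sample: " + model.generateSentence());
    }
}
```

Because the sketch collapses every unseen word into a single UNK event, its probabilities sum to one over the training vocabulary plus STOP and UNK, not over all possible strings. Whether that is an acceptable treatment of unknown words is a design decision for your own models.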
(However, there are some anomalies in running the HUB problems for this assignment, so you should concentrate on improving your perplexity. If your HUB WER behaves contrary to your expectations, we would like you to explore why.) Here’s what you should minimally build:
Some other things you might try:
A few programming tips:
You will submit your program code using a Unix script that we’ve prepared. To submit your program, first put all the files to be submitted in one directory on a Leland machine (or any machine from which you can access the Leland AFS filesystem). This should include all source code files, but should not include compiled class files or large data files. Normally, your submission directory will have a subdirectory named src which contains all your source code. When you’re ready to submit, type:
/afs/ir/class/cs224n/bin/submit-pa1
This will (recursively) copy everything in your submission directory into the official submission directory for the class. If you need to resubmit, type:
/afs/ir/class/cs224n/bin/submit-pa1 -replace
We will compile and run your program on the Leland systems, using ant and our standard build.xml to compile, and using java to run. So, please make sure your program compiles and runs without difficulty on the Leland machines. If there’s anything special we need to know about compiling or running your program, please include a README file with your submission. Your code doesn’t have to be beautiful but we should be able to scan it and figure out what you did without too much pain.
You should turn in a write-up of the work you’ve done, as well as the code. The write-up should specify what you built and what choices you made, and should include the perplexities, accuracies, etc., of your systems. Your write-up must be submitted as a hard copy. There is no set length for write-ups, but a ballpark length might be 4 pages, including your evaluation results, a graph or two, and some interesting examples. It would be useful to show a learning curve indicating how your system performs when trained on different amounts of data.
Embedded written exercise: An important expectation is that for each language model you build (and in particular, the three minimally required models), you should show that it defines a proper probability distribution. That is, you should show (with equations and argument, as appropriate) that the model satisfies $\sum_{w} P(w \mid h) = 1$, where $w$ is a word and $h$ (for ‘history’) represents the words appearing before $w$.
Error analysis: The most important part of your report will be error analysis on your final and intermediate results. This means examining the outputs from your language model and looking for systematic errors. For the speech recognition task, this might involve identifying classes of words, such as proper nouns, which are consistently misrecognized; for the sentence generation task, it might include identifying consistent anomalies, such as lack of agreement between subjects and verbs, in the randomly generated output. Note that these are just examples and we want you to look for all kinds of systematic errors. It may be tempting to wait until you have built all of your language models before you do your error analysis, but that would be a mistake. The point of error analysis is to find ways to improve your system, so it should be done frequently when developing your models. We would love to hear about classes of errors that you identified and fixed (and how you fixed them), as well as classes of errors still present in your models along with ideas for how they might be fixed.
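A numerical sanity check can complement, but not replace, the written argument that a model is normalized. The sketch below is one possible way to do it, under stated assumptions: it uses add-one (Laplace) smoothing as a stand-in example model, and the class ModelChecks and its method names are hypothetical, not part of the provided code. It also includes a perplexity helper using the standard definition, 2 raised to the negative average log2 probability per token, since the write-up asks for perplexities. It assumes Java 8 or later.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helpers for checking normalization and computing perplexity. */
final class ModelChecks {

    /**
     * Add-one smoothed estimate of P(w | h) for one fixed history h.
     * counts holds how often each word followed h; vocabulary must list
     * distinct words. P(w | h) = (c(h, w) + 1) / (c(h) + |V|).
     */
    static Map<String, Double> addOne(Map<String, Integer> counts, List<String> vocabulary) {
        int total = 0;
        for (String word : vocabulary) {
            total += counts.getOrDefault(word, 0);
        }
        Map<String, Double> distribution = new HashMap<>();
        for (String word : vocabulary) {
            double numerator = counts.getOrDefault(word, 0) + 1.0;
            distribution.put(word, numerator / (total + vocabulary.size()));
        }
        return distribution;
    }

    /** Sum of P(w | h) over the vocabulary; should equal 1 up to rounding. */
    static double totalMass(Map<String, Double> distribution) {
        double sum = 0.0;
        for (double probability : distribution.values()) {
            sum += probability;
        }
        return sum;
    }

    /** Perplexity of a token sequence given the model probability of each token. */
    static double perplexity(double[] tokenProbabilities) {
        if (tokenProbabilities.length == 0) {
            throw new IllegalArgumentException("need at least one token");
        }
        double logSum = 0.0;
        for (double probability : tokenProbabilities) {
            logSum += Math.log(probability) / Math.log(2.0);   // log base 2
        }
        // A zero probability makes logSum negative infinity and perplexity infinite.
        return Math.pow(2.0, -logSum / tokenProbabilities.length);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("cat", 2);
        counts.put("dog", 1);
        Map<String, Double> distribution =
                addOne(counts, Arrays.asList("cat", "dog", "fish"));
        System.out.println("total mass = " + totalMass(distribution));
        System.out.println("perplexity = " + perplexity(new double[] {0.25, 0.25}));
    }
}
```

A check like this only covers the histories you actually test, so a total mass near 1 for a few histories is evidence of a correct implementation, not a proof that the model is a proper distribution.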
While you are building your language models, hopefully lower perplexity will translate into better WER, but don’t be surprised if it doesn’t. A best language-modeler title and a lowest WER title are up for grabs, but the actual performance of your systems is not the major determinant of your grade on this assignment (though good performance may suggest that you are doing something right). What will impact your grade is: