





Artificial Intelligence. Programming Assignment of Natural Language Processing, Final Programming Project. Prof Manning - Stanford University






This project is an opportunity for you to work on an NLP system in an area of your choice! The projects will be judged on creativity in defining the problem to be investigated, the methods used, thoroughness in considering and justifying your design decisions, and quality of your write-up, including your testing of the system, error and success analysis, and reporting of results. You will not be penalized if your system performs poorly, provided your initial design decisions weren’t obviously unjustifiable, and you have made reasonable attempts to analyze why it failed and to examine how the system might be improved.
The final project can be a group project. Indeed, we’d strongly encourage you to work as a group, so you can attempt something larger and more interesting. The amount of work should be appropriately scaled to the size of the group (though the expected scaling is sublinear), and you should include a brief statement on the responsibilities of different members of the team. Team members will normally get the same grade, but we reserve the right to differentiate in egregious cases. In general we would like group sizes of 2 or 3 – if you’re considering a bigger group, you must talk to us and convince us that a group of greater than 3 is manageable given the inherent parallelizability of the task, and the time available to organize and implement the system. Solo projects are allowed. For the final project, group size is considered in the grading, but even someone working alone has to complete a good project to get a good grade.
You are free (and, where appropriate, encouraged) to make use of existing code and systems as part of your project, but you should make sure their use is properly acknowledged, and make clear what additional value your project is adding.
The first deadline is to submit a project abstract. This will not be graded, but is there to encourage you to get organized early, and work out what focused project you are working on. It is also a chance for dialog between the instructors and the team. You can tell us what you plan to do, anything you have achieved so far, and what you hope to achieve in the rest of the quarter. We can give you extra references, and also information on whether we think the scope of the project is too small or too big. So please do think about where you are and have something focused and concrete ready to submit.
This milestone is to put some uniform structure into the process, but beyond that, we really encourage you to stop by one of our office hours to discuss projects in person. That allows longer and more productive discussion of projects. Talking about project plans is a particularly good place to get useful feedback and information from the course staff. Just due to greater experience or a different viewpoint, this can often help a lot.
The project progress report should fit on one page and should be organized around these 4 sections:
It can either be handed in on paper or sent via plain text email to cs224n-spr0708-staff@lists. Please put your email address(es) on the final project proposal – this will make it easier for us to send feedback.
Quite a large amount of natural language data of various sorts is available at Stanford. This includes collections from major publishers such as the Linguistic Data Consortium (http://www.ldc.upenn.edu/), and some smaller collections, such as text categorization and information extraction training and test sets. Most of this data is in English, but there is also some data in major foreign languages (Chinese, German, French, Arabic, ...), and some parallel text. You can access much of this data under AFS at /afs/ir/data/linguistic-data/, but there are other collections, such as most speech data, which are not online, so do ask or look around at catalogs, such as the LDC’s, to see what else exists. The site that discusses corpora available at Stanford is at:
http://www.stanford.edu/dept/linguistics/corpora/
Again, please note that nearly all of this data is licensed to Stanford: do not copy it to other machines or give it to other people. There is also a lot of data on the web (free corpora and bake-off data, books, blogs, and web pages). If you could use some resources such as tagged or parsed text, or aligned multilingual materials, and you are not sure what there is, let us know, preferably as soon as possible. You can find some links to corpora and existing tools at:
http://nlp.stanford.edu/links/statnlp.html
http://nlp.stanford.edu/fsnlp/
Always a difficult one to define! But roughly you should be aiming for each member of the team to do at least as much work as on one of the homeworks. You should aim to do something that is small but interesting (i.e., not just an exercise in programming). This may only be a fairly modest extension of an existing technique, but there should be a clear focus in terms of what you hope to achieve, or hope to show. It’s perfectly okay to extend something you did in an earlier homework. Your project write-up should be adequate, but doesn’t need to scale linearly in size. One person might want to write 6 pages. A three-person project may well find that a 10-page write-up is quite sufficient. Think of the write-up as something like a conference paper, focussed on research questions and achievements, though you may want to include a bit more detail on methods used, examples, etc. You could even look at example computational linguistics conference papers: see the site at http://aclweb.org/anthology-new/. As usual, the quality of your write-up is very important. It’s hard to define exactly what the write-up should cover, because it depends on the project, but generally, we’re looking for:
Information Extraction
Machine Translation
http://www.isi.edu/natural-language/projects/rewrite/decoder.pdf
But you can find much more detail on how to build a uniform cost decoder in Mike Jahr’s thesis (he was a symsys student at Stanford, who worked a couple of summers at ISI, and is now doing MT at Google):
http://dbpubs.stanford.edu:8090/pub/2001-
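As a rough illustration of what a uniform cost decoder does, here is a minimal sketch in Python. It is not taken from the thesis or from any course code: the toy translation table, the bigram costs, and the `decode` function are all hypothetical, and the search is restricted to monotone word-by-word translation (no reordering, no phrases). Costs are treated as negative log probabilities, so they are non-negative and the first complete hypothesis popped from the priority queue is the cheapest one.

```python
import heapq
import math

# Hypothetical toy models (assumptions for illustration only):
# translation options per source word, as -log probabilities.
TRANSLATIONS = {
    "la": {"the": 0.2, "it": 1.8},
    "casa": {"house": 0.3, "home": 1.2},
    "verde": {"green": 0.1},
}
# Toy bigram language-model costs; unseen pairs get a flat penalty.
BIGRAM_COST = {
    ("<s>", "the"): 0.5,
    ("the", "house"): 0.4,
    ("the", "home"): 1.5,
    ("house", "green"): 2.0,
    ("home", "green"): 2.5,
}
UNSEEN_COST = 4.0


def decode(source):
    """Monotone word-by-word decoding with uniform-cost search.

    A search state is (number of source words covered, last target word).
    Returns (list of target words, total cost).
    """
    # Queue entries: (cost so far, words covered, last word, output so far)
    frontier = [(0.0, 0, "<s>", ())]
    best = {}  # cheapest cost at which each state has been expanded
    while frontier:
        cost, pos, last, output = heapq.heappop(frontier)
        if pos == len(source):
            return list(output), cost  # cheapest complete hypothesis
        if best.get((pos, last), math.inf) <= cost:
            continue  # this state was already expanded at least as cheaply
        best[(pos, last)] = cost
        for word, tm_cost in TRANSLATIONS[source[pos]].items():
            lm_cost = BIGRAM_COST.get((last, word), UNSEEN_COST)
            heapq.heappush(
                frontier,
                (cost + tm_cost + lm_cost, pos + 1, word, output + (word,)),
            )
    return None, math.inf


words, total = decode(["la", "casa", "verde"])
print(words, round(total, 2))
```

A real decoder would also have to handle reordering, phrase-level translation options, and pruning, all of which this sketch leaves out.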
Rather than building your own decoder, an alternative would be to use an existing decoder. Recently, Marian Olteanu at UT Dallas released an open source Java decoder:
http://www.phramer.org
You could work out how to hook our language models and word alignment models with it.
Franz Josef Och, Hermann Ney: Improved Statistical Alignment Models. ACL 2000. http://acl.ldc.upenn.edu//P/P00/P00-1056.pdf
Kristina Toutanova, H. Tolga Ilhan and Christopher D. Manning, Extensions to HMM-based Statistical Word Alignment Models. http://www.stanford.edu/~krist/papers/hmmalign.pdf
http://www.statmt.org/wpt05/mt-shared-task/
http://www.statmt.org/wmt06/shared-task/
http://www.statmt.org/wmt07/shared-task.html
As noted at http://www.statmt.org/wmt06/shared-task/baseline.html, there is pretty much a “standard tool chain” for baseline statistical MT, using the Giza++ alignment model toolkit, a language model builder (such as Carmel, or the SRI language modeling toolkit), and a decoder such as Pharaoh or Phramer. You could also work from this baseline.
http://www.cs.unt.edu/~rada/wpt05/
See also the earlier workshop at:
http://www.cs.unt.edu/~rada/wpt/
Parsing and POS tagging
http://nlp.stanford.edu/downloads/lex-parser.shtml
to a new language. An example of trying to do this for Chinese is (Levy and Manning 2003).
http://nlp.stanford.edu/downloads/lex-parser.shtml
It’s not bad, but not the world’s best. It’d be good if it was better! Some ways that seem hopeful for bettering it include:
et al. 1994, Goldstein et al. 1999). Some interesting recent work (Jing and McKeown 1999) attempts to actually take parts of sentences in a sensible fashion. There’s a bibliography of work at:
http://www.ics.mq.edu.au/~swan/summarization/bibliography.htm
The List goes ever on and on, down from the door where it began. Now far ahead the List has gone, and I must follow, if I can
http://www.senseval.org/
http://www.cs.unt.edu/~rada/downloads.html
http://www.cs.brown.edu/people/ec/
http://www.cs.brown.edu/people/sc/
Or (Riloff and Jones 1999) or http://ai.stanford.edu/~rion/papers/hypernym_nips04.pdf.
http://ai.stanford.edu/~rion/papers/semtax_acl06.pdf
or from other groups:
http://www.patrickpantel.com/cgi-bin/Web/Tools/getfile.pl?type=paper&id=2007/naacl07-01.pdf
http://www.patrickpantel.com/cgi-bin/Web/Tools/getfile.pl?type=paper&id=2002/kdd02.pdf
Such things can make good projects (but beware of trying to do something at a larger scale than you can handle computationally!).
(a) CoNLL Shared Tasks: named entity recognition, semantic predicate argument structure, phrasal chunking etc. See the list at: http://cnts.uia.ac.be/signll/shared.html
(b) Chinese Word Segmentation: http://www.sighan.org/bakeoff2003/
(c) DUC: There have been competitions in document summarization, but it’s much harder to get the data for those (a whole bunch of forms need signing...): http://www-nlpir.nist.gov/projects/duc/index.html
(d) BioMedical Named Entity Recognition and Information extraction tasks
http://nlp.stanford.edu/courses/cs224n/
Briscoe, T., and N. Waegner. 1992. Robust stochastic parsing using the inside-outside algorithm. In Proceedings of the AAAI ’92 Workshop on Probabilistically-Based Natural Language Processing Techniques, 39–53. AAAI. Revised version at http://www.cl.cam.ac.uk/Research/Papers/.
Charniak, E. 1993. Statistical Language Learning. Cambridge, MA: MIT Press.
Charniak, E. 1996. Tree-bank grammars. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI ’96), 1031–1036.
Charniak, E. 1997a. Statistical parsing with a context-free grammar and word statistics. In Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI ’97), 598–603.
Charniak, E. 1997b. Statistical techniques for natural language parsing. AI Magazine 33–43.
Clark, A. 2003. Combining distributional and morphological information for part of speech induction. In EACL 2003.
Collins, M. J. 1996. A new statistical parser based on bigram lexical dependencies. In ACL 34, 184–191.
Collins, M. J. 1997. Three generative, lexicalised models for statistical parsing. In ACL 35/EACL 8, 16–23.