





A lecture script from the Foundations of Natural Language Processing (FNLP) course, focusing on text corpora and sentiment analysis. It covers the definition and importance of corpora in NLP, examples of corpora, markup formats, sentiment analysis goals and methods, and the use of NLTK for sentiment analysis and tokenization.






Alex Lascarides (slides based on those of Nathan Schneider, Alex Lascarides)
17 January 2020
Alex Lascarides FNLP Lecture 2 17 January 2020
This lecture:
Illustrative application: sentiment analysis
... and a bit about tokenisation
corpus noun, plural corpora or, sometimes, corpuses.
Dictionary.com
Focusing on English; most released by the Linguistic Data Consortium (LDC):
Brown: 500 texts, 1M words in 15 genres. POS-tagged. SemCor subset (234K words) labelled with WordNet word senses.
WSJ: 6 years of Wall Street Journal; subsequently used to create Penn Treebank, PropBank, and more! Translated into Czech for the Prague Czech-English Dependency Treebank.
ECI: European Corpus Initiative, multilingual.
BNC: 100M words; balanced selection of written and spoken genres.
Redwoods: Treebank aligned to wide-coverage grammar; several genres.
Gigaword: 1B words of news text.
AMI: Multimedia (video, audio, synchronised transcripts).
Google Books N-grams: 5M books, 500B words (361B English).
Flickr 8K: images with NL captions
English Visual Genome: Images, bounding boxes ⇒ NL descriptions
Goal: Predict the opinion expressed in a piece of text. E.g., + or −. (Or a rating on a scale.)
Example reviews (excerpts from RottenTomatoes.com; reviews that are cut off in the source end with "..."):
Audience reviews for Star Wars Episode I - The Phantom Menace:
"This movie is terrible"
"Phantom is a frustrating watch, however there are elements worth admiring: its ambition plot, Williams' score, the art direction, and the iconic duel with Darth Maul."
"Filled with horrific dialogue, laughable characters, a laughable plot, ad really no interesting stakes during this film, "Star Wars Episode I: The Phantom Menace" is not at all what I wanted from a film that is supposed to be the huge opening to the segue into the fantastic Original Trilogy. The positives include the score, the sound effects, and most of the ..."
"I've had a saying that I've used for almost 20 years now in relation to The Phantom Menace. I compare the film to waking up Christmas morning expecting some great present only to receive socks. Nothing against socks. They have a place and are quite needed, but there's no flash with it. The same goes for The Phantom Menace, a film that really doesn't live up to the ..."
Audience reviews for Star Wars: The Force Awakens:
"The Force Awakens is an exciting, nostalgic, powerful and moving film, that is capable of generating accelerated ..."
"Extraordinarily faithful to the tone and style of the originals, The Force Awakens brings back the Old Trilogy's heart, humor, mystery, and fun. Since it is only the first piece in a new three-part journey it can't help but feel incomplete. But everything that's already there, from the stunning visuals, to the thrilling action sequences, to the charismatic new characters, ..."
"JJ Abrams is very good and knowing what his audience wants and giving just that to them. He is not great, however, because he rarely shows us something we didn't know we wanted. This film derives a lot from the first Star Wars, and just goes along as you might expect, yet it is still very enjoyable because it's Star Wars. The old faces were cool to see, and the new ones do ..."
Source: RottenTomatoes.com
To evaluate and compare sentiment analyzers, we need reviews with gold labels (+ or −) attached. These can be
More on these issues later in the course!
Once we have a dataset with gold (correct) labels, we can give the text of each review as input to our system and measure how often its output matches the gold label.
Simplest measure:
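$\text{accuracy} = \frac{\#\,\text{correct}}{\#\,\text{total}}$, where "# correct" is the number of reviews for which the system's output matches the gold label and "# total" is the number of reviews in the dataset.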
More measures later in the course!
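As a minimal sketch of the accuracy measure described above: the function name `accuracy` and the example labels are invented for illustration and are not part of the course materials. The code is written to run under both Python 2.7 and Python 3.

```python
def accuracy(predicted, gold):
    """Fraction of items whose predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    # float() avoids integer division under Python 2.7
    return float(correct) / len(gold)


# Hypothetical gold labels and system outputs for five reviews
gold_labels = ["+", "-", "+", "+", "-"]
system_output = ["+", "-", "-", "+", "+"]

print(accuracy(system_output, gold_labels))  # 3 of the 5 labels match
```

Accuracy treats every review equally, so on a dataset where most reviews are positive, a system that always predicts + can still score well.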
We now have:
✓ a definition of the sentiment analysis task (inputs and outputs)
✓ a way to measure a sentiment analyzer (accuracy on gold data)
So we need:
Use a sentiment lexicon to count positive and negative words:
Positive:
absolutely adorable accepted acclaimed accomplish achieve action active admire adventure affirm
beaming beautiful believe beneficial bliss bountiful bounty brave bravo brilliant bubbly
calm celebrated certain champ champion charming cheery choice classic classical clean
...
Negative:
abysmal adverse alarming angry annoy anxious apathy appalling atrocious awful
bad banal barbed belligerent bemoan beneath boring broken
callous can’t clumsy coarse cold collapse confused contradictory contrary corrosive corrupt
...
From http://www.enchantedlearning.com/wordlist/
Simplest rule: Count positive and negative words in the text. Predict whichever is greater.
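The counting rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sets hold a handful of words taken from the lists above, the name `predict_sentiment` is invented, tokens are obtained by naive whitespace splitting, and returning `None` on a tie is a choice made here because the rule itself does not say what to do in that case.

```python
# Small excerpts of the positive and negative word lists; a real lexicon is much larger.
POSITIVE = {"absolutely", "adorable", "beautiful", "believe",
            "brilliant", "charming", "classic"}
NEGATIVE = {"abysmal", "awful", "bad", "boring",
            "can't", "clumsy", "confused"}


def predict_sentiment(text):
    """Return '+' or '-' by comparing lexicon counts, or None on a tie."""
    tokens = text.lower().split()  # naive whitespace tokenisation (an assumption)
    positive_count = sum(1 for token in tokens if token in POSITIVE)
    negative_count = sum(1 for token in tokens if token in NEGATIVE)
    if positive_count > negative_count:
        return "+"
    if negative_count > positive_count:
        return "-"
    return None  # equal counts: the simple rule gives no answer


print(predict_sentiment("a brilliant and charming classic"))
print(predict_sentiment("boring plot and awful dialogue"))
print(predict_sentiment("I can't believe how great this movie is"))
```

The last call returns `None`: "can't" counts as negative and "believe" as positive, so the counts cancel even though the sentence is clearly positive.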
I can’t stand this movie vs. I can’t believe how great this movie is
In this course, we will be using Python 2.7 and NLTK, the Natural Language Toolkit (http://nltk.org). NLTK
It will help if you familiarise yourself with Python strings and methods/libraries for manipulating them. Last year’s co-lecturer Nathan Schneider has produced a number of useful reference guides for NLP using Python: http://people.cs.georgetown.edu/nschneid/howtos.html
Perhaps corpora can help address the first objection:
A data-driven method: Use frequency counts from a training corpus to ascertain which words tend to be positive or negative.
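A rough sketch of the data-driven idea, assuming a training corpus of (text, gold label) pairs. The four toy reviews, the function names and the tie handling are all invented for illustration; a real system would train on a corpus such as the NLTK movie reviews and would normalise the counts rather than compare them raw.

```python
from collections import Counter

# Hypothetical toy training corpus of (review text, gold label) pairs
training_reviews = [
    ("a brilliant and moving film", "+"),
    ("brilliant visuals and a thrilling plot", "+"),
    ("terrible dialogue and a boring plot", "-"),
    ("boring and terrible", "-"),
]


def count_words_by_label(reviews):
    """Count how often each word occurs in positive and in negative reviews."""
    counts = {"+": Counter(), "-": Counter()}
    for text, label in reviews:
        counts[label].update(text.lower().split())
    return counts


def word_polarity(word, counts):
    """Label a word by the class in which it is more frequent; None on a tie."""
    positive_count = counts["+"][word]
    negative_count = counts["-"][word]
    if positive_count > negative_count:
        return "+"
    if negative_count > positive_count:
        return "-"
    return None


counts = count_words_by_label(training_reviews)
for word in ["brilliant", "boring", "plot"]:
    print(word, word_polarity(word, counts))
```

With so little data, function words can end up with a polarity purely by chance (here "a" occurs more often in the positive reviews), which is one reason raw counts need care.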
Let’s take another look at the movie reviews corpus:
>>> from nltk.corpus import movie_reviews
>>> print('\n'.join(' '.join(sent) for sent in movie_reviews.sents()[:5]))
plot : two teen couples go to a church party , drink and then drive .
they get into an accident .
one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares .
what ' s the deal ?
watch the movie and " sorta " find out .
What do you notice about spelling conventions? Spacing?
Normal written conventions sometimes do not reflect the “logical” organisation of textual symbols. For example, some punctuation marks are written adjacent to the previous or following word, even though they are not part of it. (The details vary according to language and style guide!)
Given a string of raw text, a tokeniser adds logical boundaries between separate word/punctuation tokens (occurrences) not already separated by spaces:
Daniels made several appearances as C-3PO on numerous TV shows and commercials, notably on a Star Wars-themed episode of The Donny and Marie Show in 1977, Disneyland’s 35th Anniversary.
⇒
Daniels made several appearances as C-3PO on numerous TV shows and commercials , notably on a Star Wars - themed episode of The Donny and Marie Show in 1977 , Disneyland ’s 35th Anniversary .
To a large extent, this can be automated by rules. But there are always difficult cases.
>>> import nltk
>>> nltk.word_tokenize("Daniels made several appearances as C-3PO on numerous TV shows and commercials, notably on a Star Wars-themed episode of The Donny and Marie Show in 1977, Disneyland's 35th Anniversary.")
['Daniels', 'made', 'several', 'appearances', 'as', 'C-3PO', 'on', 'numerous',
 'TV', 'shows', 'and', 'commercials', ',', 'notably', 'on', 'a', 'Star',
 'Wars-themed', 'episode', 'of', 'The', 'Donny', 'and', 'Marie', 'Show', 'in',
 '1977', ',', 'Disneyland', "'s", '35th', 'Anniversary', '.']
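To illustrate how far a few rules can go, here is a minimal regular-expression tokeniser. It is a sketch and not NLTK's algorithm: the pattern and the name `tokenise` are assumptions made for this example, and like NLTK's output (but unlike the hand-tokenised example) it keeps hyphenated words such as "Wars-themed" together.

```python
import re

TOKEN_PATTERN = re.compile(r"""
    \w+(?:-\w+)*   # a word, keeping internal hyphens together (C-3PO, Wars-themed)
  | '\w+           # an apostrophe plus letters, split off as a clitic ('s)
  | [^\w\s]        # any other single non-space character (punctuation)
""", re.VERBOSE)


def tokenise(text):
    """Split raw text into word and punctuation tokens using three simple rules."""
    return TOKEN_PATTERN.findall(text)


print(tokenise("a Star Wars-themed episode in 1977, Disneyland's 35th Anniversary."))
# A difficult case: the rules break the abbreviation into four tokens
print(tokenise("U.S."))
```

The second call shows one of the difficult cases: the same full stop that ends a sentence also appears inside abbreviations, and these three rules cannot tell the two apart.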
English tokenisation conventions vary somewhat—e.g., with respect to:
(Word-level) tokenisation is just part of the larger process of preprocessing or normalisation, which may include
NLTK provides nltk.sent_tokenize() for sentence tokenisation, but it is far from perfect (and indeed the fact of the matter is not always clear).
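To see why sentence tokenisation is hard, consider a deliberately naive splitter. This sketch does not use NLTK: the rule (split after ".", "!" or "?" when followed by whitespace and a capital letter), the name `naive_sent_tokenise` and the example text are all invented for illustration.

```python
import re


def naive_sent_tokenise(text):
    """Split after '.', '!' or '?' when followed by whitespace and a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)


# Invented example text containing two abbreviations
text = "Dr. Smith arrived at 5 p.m. He left early."
for sentence in naive_sent_tokenise(text):
    print(sentence)
```

The rule wrongly splits after "Dr." but happens to split correctly after "p.m."; in a sentence like "He arrived at 5 p.m. Monday", the same rule would split where no sentence ends.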
Consider the following Wikipedia extract (from https://en.wikipedia.org/wiki/The_U.S.Air_Force%28song%29):
In April 1938, Bernarr A. Macfadden, publisher of Liberty magazine stepped in, offering a prize of $1,000 to the winning composer, stipulating that the song must be of simple “harmonic structure”, “within the limits of [an] untrained voice”, and its beat in “march tempo of military pattern”.
The contest rules required the winner to submit his entry in written form, and Crawford immediately complied. However his original title, What Do You think of the Air Corps Now?, was soon officially changed to The Army Air Corps.
The marked-up original for the latter part of the second paragraph above is actually the following (without the line breaks):
However his original title, What Do You think of the Air Corps Now?, was soon officially changed to The Army Air Corps.
It should be evident that a large number of decisions have to be made, many of them dependent on the eventual intended use of the output, before a satisfactory preprocessor for such data can be produced. Documenting those decisions and their implementation is then a key step in establishing the credibility of any subsequent experiments. Such documentation is especially important if some preprocessing has been done on a corpus before it is distributed publicly. You may have noted, for example, that the movie review corpus we looked at earlier has already had case conversion (in this case, lower-casing) performed, as well as some separation of punctuation...
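One lightweight way to document such decisions is to make each one an explicit, recorded option of the preprocessor. The sketch below is an assumption-laden illustration, not the procedure used to build the movie review corpus: the function name `preprocess`, its two options and the punctuation rule are all invented here.

```python
import re


def preprocess(text, lowercase=True, separate_punctuation=True):
    """Return (tokens, settings) so the choices made travel with the output."""
    if lowercase:
        text = text.lower()
    if separate_punctuation:
        # Put spaces around every character that is neither a word character nor a space
        text = re.sub(r"([^\w\s])", r" \1 ", text)
    tokens = text.split()
    settings = {"lowercase": lowercase,
                "separate_punctuation": separate_punctuation}
    return tokens, settings


tokens, settings = preprocess("What's the deal?")
print(" ".join(tokens))
print(settings)
```

With both options on, the example comes out in the same lower-cased, space-separated style as the movie review sentences shown earlier, and the returned settings record which choices produced it.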