NLP Foundations: Lecture 2 - Text Corpora & Sentiment Analysis

A lecture script from the Foundations of Natural Language Processing (FNLP) course, focusing on text corpora and sentiment analysis. It covers the definition and importance of corpora in NLP, examples of corpora, markup formats, sentiment analysis goals and methods, and the use of NLTK for sentiment analysis and tokenization.

Foundations of Natural Language Processing

Lecture 2

Text Corpora

Alex Lascarides (slides based on those of Nathan Schneider, Alex Lascarides)

17 January 2020

Alex Lascarides FNLP Lecture 2 17 January 2020

Corpora in NLP

This lecture:

  • What is a corpus?
  • Why do we need text corpora for NLP? (learning, evaluation)
  • How can we access corpora with NLTK?

  • Illustrative application: sentiment analysis
  • ... and a bit about tokenisation

Corpora in NLP

corpus noun, plural corpora or, sometimes, corpuses.

  1. a large or complete collection of writings: the entire corpus of Old English poetry.
  2. the body of a person or animal, especially when dead.
  3. Anatomy. a body, mass, or part having a special character or function.
  4. Linguistics. a body of utterances, as words or sentences, assumed to be representative of and used for lexical, grammatical, or other linguistic analysis.
  5. a principal or capital sum, as opposed to interest or income.

Dictionary.com

Corpora in NLP

  • To understand and model how language works, we need empirical evidence. Ideally, naturally-occurring corpora serve as realistic samples of a language.
  • Aside from linguistic utterances, corpus datasets include metadata—side information about where the language comes from, such as author, date, topic, publication.
  • Of particular interest for core NLP, and therefore this course, are corpora with linguistic annotations—where humans have read the text and marked categories or structures describing their syntax and/or meaning.

Examples of corpora (in chronological order)

Focusing on English; most released by the Linguistic Data Consortium (LDC):

Brown: 500 texts, 1M words in 15 genres. POS-tagged. SemCor subset (234K words) labelled with WordNet word senses.

WSJ: 6 years of Wall Street Journal; subsequently used to create Penn Treebank, PropBank, and more! Translated into Czech for the Prague Czech-English Dependency Treebank.

ECI: European Corpus Initiative, multilingual.

BNC: 100M words; balanced selection of written and spoken genres.

Redwoods: Treebank aligned to wide-coverage grammar; several genres.

Gigaword: 1B words of news text.

AMI: Multimedia (video, audio, synchronised transcripts).

Google Books N-grams: 5M books, 500B words (361B English).

Flickr 8K: images with NL captions

English Visual Genome: Images, bounding boxes ⇒ NL descriptions

Markup

  • There are several common markup formats for structuring linguistic data, including XML, JSON, and CoNLL-style (one token per line, annotations in tab-separated columns).
  • Some datasets, such as WordNet and PropBank, use custom file formats. NLTK provides friendly Python APIs for reading many corpora so you don’t have to worry about this.
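
To make the CoNLL-style layout concrete, here is a minimal sketch of reading such data by hand, written for Python 3. The two-column layout (token, POS tag), the sample text, and the function name `read_conll` are illustrative assumptions; real CoNLL-style datasets define their own columns, and NLTK's corpus readers normally take care of this.

```python
def read_conll(text):
    """Parse CoNLL-style text: one token per line, tab-separated columns,
    and a blank line between sentences.
    Returns a list of sentences, each a list of column tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(tuple(line.split("\t")))
    if current:  # the last sentence may lack a trailing blank line
        sentences.append(current)
    return sentences


# Hypothetical sample: token<TAB>POS tag, blank line between sentences.
SAMPLE = "The\tDT\nmovie\tNN\nbored\tVBD\nme\tPRP\n.\t.\n\nAwful\tJJ\n!\t.\n"

sentences = read_conll(SAMPLE)
for sent in sentences:
    print(" ".join(f"{token}/{tag}" for token, tag in sent))
```

Each sentence comes back as a list of tuples, so further annotation columns (for example, dependency heads) would simply appear as extra tuple fields.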

Sentiment Analysis

Goal: Predict the opinion expressed in a piece of text. E.g., + or −. (Or a rating on a scale.)

Sentiment Analysis

[Screenshot from RottenTomatoes.com: critic and audience reviews, with star ratings, of Star Wars Episode I - The Phantom Menace and Star Wars: The Force Awakens. Legible excerpts:]

Star Wars Episode I - The Phantom Menace:

  • [...] to the original trilogy that this new film lacks.
  • This movie is terrible
  • Phantom is a frustrating watch, however there are elements worth admiring: its ambition plot, Williams' score, the art direction, and the iconic duel with Darth Maul.
  • Filled with horrific dialogue, laughable characters, a laughable plot, ad really no interesting stakes during this film, "Star Wars Episode I: The Phantom Menace" is not at all what I wanted from a film that is supposed to be the huge opening to the segue into the fantastic Original Trilogy. The positives include the score, the sound effects, and most of the [...]
  • I've had a saying that I've used for almost 20 years now in relation to The Phantom Menace. I compare the film to waking up Christmas morning expecting some great present only to receive socks. Nothing against socks. They have a place and are quite needed, but there's no flash with it. The same goes for The Phantom Menace, a film that really doesn't live up to the [...]

Star Wars: The Force Awakens:

  • The Force Awakens is an exciting, nostalgic, powerful and moving film, that is capable of generating accelerated [...]
  • Extraordinarily faithful to the tone and style of the originals, The Force Awakens brings back the Old Trilogy's heart, humor, mystery, and fun. Since it is only the first piece in a new three-part journey it can't help but feel incomplete. But everything that's already there, from the stunning visuals, to the thrilling action sequences, to the charismatic new characters, [...]
  • JJ Abrams is very good and knowing what his audience wants and giving just that to them. He is not great, however, because he rarely shows us something we didn't know we wanted. This film derives a lot from the first Star Wars, and just goes along as you might expect, yet it is still very enjoyable because it's Star Wars. The old faces were cool to see, and the new ones do [...]

RottenTomatoes.com

Where can you get a corpus?

  • Many corpora are prepared specifically for linguistic/NLP research with text from content providers (e.g., newspapers). In fact, there is an entire subfield devoted to developing new language resources.
  • You may instead want to collect a new one, e.g., by scraping websites. (There are tools to help you do this.)

Annotations

To evaluate and compare sentiment analyzers, we need reviews with gold labels (+ or −) attached. These can be

  • derived automatically from the original data artifact (metadata such as star ratings), or
  • added by a human annotator who reads the text
    • Issue to consider/measure: How consistent are human annotators? If they often have trouble deciding or agreeing, how can this be addressed?

More on these issues later in the course!
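
One way to put a number on annotator consistency is sketched below in Python 3. Raw agreement is the fraction of items on which two annotators give the same label. Cohen's kappa, a common chance-corrected measure that the slides do not name, is brought in here as an assumption about what one might use. The label sequences are invented for illustration.

```python
from collections import Counter


def raw_agreement(labels_a, labels_b):
    """Fraction of items on which both annotators chose the same label."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(labels_a)
    p_observed = raw_agreement(labels_a, labels_b)
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both pick the same label at random,
    # given each annotator's own label distribution.
    p_expected = sum(freq_a[label] * freq_b[label]
                     for label in set(freq_a) | set(freq_b)) / (n * n)
    if p_expected == 1:
        return 1.0  # both annotators always use the same single label
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical gold-label candidates from two annotators for six reviews.
annotator_1 = ["+", "+", "-", "-", "+", "-"]
annotator_2 = ["+", "-", "-", "-", "+", "+"]

agreement = raw_agreement(annotator_1, annotator_2)
kappa = cohen_kappa(annotator_1, annotator_2)
print(agreement, kappa)
```

Here the annotators agree on four of six items, but since each uses both labels half the time, half of that agreement would be expected by chance, which is why kappa is much lower than raw agreement.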

An evaluation measure

Once we have a dataset with gold (correct) labels, we can give the text of each review as input to our system and measure how often its output matches the gold label.

Simplest measure:

$$\text{accuracy} = \frac{\text{correct}}{\text{total}}$$

More measures later in the course!
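
As a minimal sketch, accuracy can be computed as below. The code is written for Python 3 (under Python 2, `/` on two integers would truncate), and the function name and the label lists are illustrative assumptions.

```python
def accuracy(predicted, gold):
    """Fraction of system outputs that match the gold labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical gold labels and system outputs for five reviews.
gold = ["+", "-", "-", "+", "-"]
predicted = ["+", "-", "+", "+", "+"]

score = accuracy(predicted, gold)
print(score)  # 3 of the 5 outputs match the gold label: 0.6
```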

Catching our breath

We now have:

  ✓ a definition of the sentiment analysis task (inputs and outputs)
  ✓ a way to measure a sentiment analyzer (accuracy on gold data)

So we need:

  • an algorithm for predicting sentiment

A simple sentiment classification algorithm

Use a sentiment lexicon to count positive and negative words:

Positive: absolutely beaming calm adorable beautiful celebrated accepted believe certain acclaimed beneficial champ accomplish bliss champion achieve bountiful charming action bounty cheery active brave choice admire bravo classic adventure brilliant classical affirm bubbly clean

......

Negative: abysmal bad callous adverse banal can’t alarming barbed clumsy angry belligerent coarse annoy bemoan cold anxious beneath collapse apathy boring confused appalling broken contradictory atrocious contrary awful corrosive corrupt

...

From http://www.enchantedlearning.com/wordlist/

Simplest rule: Count positive and negative words in the text. Predict whichever is greater.
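
The counting rule can be sketched in a few lines of Python 3. The two sets below hold only a handful of words taken from the lists above, splitting on whitespace is a deliberately crude stand-in for real tokenisation, and returning "?" on a tie is an assumption, since the rule as stated does not say what to do when the counts are equal.

```python
# Small excerpts of the positive and negative word lists shown above.
POSITIVE = {"adorable", "beautiful", "brilliant", "charming", "classic"}
NEGATIVE = {"abysmal", "awful", "bad", "boring", "clumsy", "confused"}


def classify(text):
    """Predict '+' or '-' by counting lexicon words; '?' on a tie (assumption)."""
    # Crude tokenisation: lower-case, split on whitespace, trim punctuation.
    tokens = [t.strip(".,!?\"'") for t in text.lower().split()]
    positive = sum(t in POSITIVE for t in tokens)
    negative = sum(t in NEGATIVE for t in tokens)
    if positive > negative:
        return "+"
    if negative > positive:
        return "-"
    return "?"


print(classify("A brilliant, charming classic."))
print(classify("Boring plot and awful dialogue."))
print(classify("The plot exists."))
```

The third call has no lexicon words at all, which shows how often a small lexicon leaves the rule with nothing to count.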

Some possible problems with simple counting

  1. Hard to know whether words that seem positive or negative tend to actually be used that way.
    • sense ambiguity
    • sarcasm/irony
    • text could mention expectations or opposing viewpoints, in contrast to author’s actual opinion
  2. Opinion words may be describing (e.g.) a character’s attitude rather than an evaluation of the film.
  3. Some words act as semantic modifiers of other opinion-bearing words/phrases, so interpreting the full meaning requires sophistication:

I can’t stand this movie vs. I can’t believe how great this movie is

NLTK

In this course, we will be using Python 2.7 and NLTK, the Natural Language Toolkit (http://nltk.org).

NLTK

  • is open-source, community-built software
  • was designed for teaching NLP: simple access to datasets, reference implementations of important algorithms
  • contains wrappers for using (some) state-of-the-art NLP tools in Python

It will help if you familiarise yourself with Python strings and methods/libraries for manipulating them. Last year’s co-lecturer Nathan Schneider has produced a number of useful reference guides for NLP using Python: http://people.cs.georgetown.edu/nschneid/howtos.html

What if we have more data?

Perhaps corpora can help address the first objection:

  1. Hard to know whether words that seem positive or negative tend to actually be used that way.

A data-driven method: Use frequency counts from a training corpus to ascertain which words tend to be positive or negative.

  • Why separate the training and test data (held-out test set)? Because otherwise, it’s just data analysis; no way to estimate how well the system will do on new data in the future.
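
A minimal sketch of the data-driven idea, in Python 3: count how often each word occurs in positive and in negative training reviews, score a new review by summing the differences, and measure accuracy only on held-out test reviews. The tiny training and test sets are invented for illustration, and the scoring rule (difference of raw counts, ties going to "+") is one simple assumption among many possible choices.

```python
from collections import Counter

# Hypothetical labelled reviews, already tokenised and lower-cased.
TRAIN = [
    ("a brilliant and moving film", "+"),
    ("brilliant score , great duel", "+"),
    ("great fun from start to finish", "+"),
    ("a dull and boring plot", "-"),
    ("boring dialogue , terrible pacing", "-"),
    ("terrible film , dull characters", "-"),
]
# Held-out test set: never used when counting.
TEST = [
    ("a great and moving score", "+"),
    ("dull plot , terrible dialogue", "-"),
]


def train(examples):
    """Count word frequencies separately for each label."""
    counts = {"+": Counter(), "-": Counter()}
    for text, label in examples:
        counts[label].update(text.split())
    return counts


def classify(text, counts):
    """Sum (positive count - negative count) over the words of the text."""
    score = sum(counts["+"][t] - counts["-"][t] for t in text.split())
    return "+" if score >= 0 else "-"  # ties go to '+' (assumption)


counts = train(TRAIN)
correct = sum(classify(text, counts) == label for text, label in TEST)
test_accuracy = correct / len(TEST)
print(test_accuracy)
```

Words that occur equally often in both classes, such as "a" and "and" here, contribute nothing to the score, so the training counts decide which words carry sentiment rather than a hand-built lexicon.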

Tokenisation

Let’s take another look at the movie reviews corpus:

>>> from nltk.corpus import movie_reviews
>>> print('\n'.join(' '.join(sent) for sent in movie_reviews.sents()[:5]))
plot : two teen couples go to a church party , drink and then drive .
they get into an accident .
one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares .
what ' s the deal ?
watch the movie and " sorta " find out .

What do you notice about spelling conventions? Spacing?

Tokenisation

Normal written conventions sometimes do not reflect the “logical” organisation of textual symbols. For example, some punctuation marks are written adjacent to the previous or following word, even though they are not part of it. (The details vary according to language and style guide!)

Given a string of raw text, a tokeniser adds logical boundaries between separate word/punctuation tokens (occurrences) not already separated by spaces:

Daniels made several appearances as C-3PO on numerous TV shows and commercials, notably on a Star Wars-themed episode of The Donny and Marie Show in 1977, Disneyland’s 35th Anniversary.

⇒ Daniels made several appearances as C-3PO on numerous TV shows and commercials , notably on a Star Wars - themed episode of The Donny and Marie Show in 1977 , Disneyland ’s 35th Anniversary .

To a large extent, this can be automated by rules. But there are always difficult cases.
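
As an illustration of tokenisation by rules, here is a small regular-expression tokeniser in Python 3. It is a sketch under assumed conventions (keep internal hyphens, split off the possessive 's, make every other punctuation mark its own token), not the rule set used by NLTK or by any particular treebank.

```python
import re

TOKEN_PATTERN = re.compile(r"""
    \w+(?:-\w+)*    # a word, keeping internal hyphens (C-3PO, Wars-themed)
  | ['’]s\b         # the possessive clitic as a token of its own
  | [^\w\s]         # any other single punctuation mark
""", re.VERBOSE)


def tokenise(text):
    """Return the list of tokens found by the rules above."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenise("a Star Wars-themed episode in 1977, "
                  "Disneyland's 35th Anniversary.")
print(tokens)

# A difficult case for these rules: the contraction is split in three.
hard = tokenise("I can't stand this movie")
print(hard)
```

The second call shows why difficult cases remain: nothing in these three rules knows that "can't" should become "ca" + "n't" or stay whole, so it is broken at the apostrophe.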

Tokenisation in NLTK

>>> import nltk
>>> nltk.word_tokenize("Daniels made several appearances as C-3PO on numerous TV shows and commercials, notably on a Star Wars-themed episode of The Donny and Marie Show in 1977, Disneyland's 35th Anniversary.")
['Daniels', 'made', 'several', 'appearances', 'as', 'C-3PO', 'on',
 'numerous', 'TV', 'shows', 'and', 'commercials', ',', 'notably', 'on',
 'a', 'Star', 'Wars-themed', 'episode', 'of', 'The', 'Donny', 'and',
 'Marie', 'Show', 'in', '1977', ',', 'Disneyland', "'s", '35th',
 'Anniversary', '.']

English tokenisation conventions vary somewhat—e.g., with respect to:

  • clitics (contracted forms) ’s, n’t, ’re, etc.
  • hyphens in compounds like president-elect (fun fact: this convention changed between versions of the Penn Treebank!)

Preprocessing/normalisation: The tip of the iceberg

(Word-level) tokenisation is just part of the larger process of preprocessing or normalisation, which may include

  • encoding conversion
  • removal of markup
  • insertion of markup
  • case conversion
  • sentence boundary detection

NLTK provides nltk.sent_tokenize() for sentence tokenisation, but it is far from perfect (and indeed the fact of the matter is not always clear).
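
To see why sentence boundary detection is hard, consider the naive rule sketched below in Python 3: split after ".", "!" or "?" when the next word starts with a capital letter. The rule and the function name are assumptions for illustration only, and the sample text is a shortened, made-up sentence pair rather than a quotation.

```python
import re


def naive_sentence_split(text):
    """Split after ., ! or ? when followed by whitespace and a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())


text = ("In April 1938, Bernarr A. Macfadden offered a prize. "
        "Crawford immediately complied.")
sentences = naive_sentence_split(text)
for sentence in sentences:
    print(sentence)
```

The text contains two sentences, but the rule returns three pieces, because the full stop after the middle initial "A." looks exactly like a sentence end.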

Preprocessing/normalisation: an example

Consider the following Wikipedia extract (from https://en.wikipedia.org/wiki/The_U.S.Air_Force%28song%29)

In April 1938, Bernarr A. Macfadden, publisher of Liberty magazine stepped in, offering a prize of $1,000 to the winning composer, stipulating that the song must be of simple “harmonic structure”, “within the limits of [an] untrained voice”, and its beat in “march tempo of military pattern”.

The contest rules required the winner to submit his entry in written form, and Crawford immediately complied. However his original title, What Do You think of the Air Corps Now?, was soon officially changed to The Army Air Corps.

The marked-up original for the latter part of the second paragraph above is the following (without the line breaks):

However his original title, What Do You think of the Air Corps Now?, was soon officially changed to The Army Air Corps.

Preprocessing/normalisation: an example, cont’d

It should be evident that a large number of decisions have to be made, many of them dependent on the eventual intended use of the output, before a satisfactory preprocessor for such data can be produced. Documenting those decisions and their implementation is then a key step in establishing the credibility of any subsequent experiments.

Such documentation is especially important if some preprocessing has been done on a corpus before it is distributed publicly. You may have noted, for example, that the movie review corpus we looked at earlier has already had case conversion (in this case, lower-casing) performed, as well as some separation of punctuation...
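
One lightweight way to document preprocessing decisions is to make each one an explicit, named option that can be stored alongside the output. The Python 3 sketch below does this for three steps from the list above. The class and option names, the `<i>` tags in the sample string, and the regular expressions are all hypothetical choices made for illustration.

```python
import re
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PreprocessConfig:
    """Each field records one preprocessing decision."""
    strip_tags: bool = True            # removal of markup
    lowercase: bool = True             # case conversion
    separate_punctuation: bool = True  # crude word-level tokenisation


def preprocess(text, config):
    """Apply the enabled steps in a fixed order; return space-separated tokens."""
    if config.strip_tags:
        text = re.sub(r"<[^>]+>", "", text)          # drop HTML-style tags
    if config.lowercase:
        text = text.lower()
    if config.separate_punctuation:
        text = re.sub(r"([^\w\s])", r" \1 ", text)   # pad punctuation marks
    return " ".join(text.split())                     # normalise whitespace


config = PreprocessConfig()
# Hypothetical marked-up input; the tags are assumed for the example.
raw = ("However his original title, <i>What Do You think of the Air Corps "
       "Now?</i>, was soon officially changed.")

print(asdict(config))  # the decisions, ready to store with the corpus
result = preprocess(raw, config)
print(result)
```

Saving the output of `asdict(config)` with the processed corpus lets a later user see, for example, that lower-casing was applied and cannot be undone.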