Plagiarism Detection: Identifying Unoriginal Text using Stylometry and Lexical Markers

These notes cover the concept of plagiarism and its detection through various methods, with a focus on authorship attribution, text classification, language identification, and stylometry. They discuss the use of lexical markers, specifically function words, to identify plagiarism. They also touch upon the history of plagiarism detection and its significance in academic and journalistic contexts.

Uploaded on 11/08/2009

Language and Computers (Ling 261)
Topic 3.1: Text Classification
Markus Dickinson
Dept. of Linguistics, Georgetown
Spring 2007

Outline
Introduction: Text Classification, Language Identification
Authorship Attribution: Author Identification, Stylometry, Lexical Markers, Lexical Markers: Function Words
Plagiarism Detection: What is plagiarism?, Plagiarism Example
Plagiarism Detection: Detection Goals, Previous Approaches
References

Authorship Attribution

- Authorship attribution is the process of identifying who wrote a text.
- Potential applications include:
  - Author Identification (Madison or Hamilton... who penned The Federalist Papers?)
  - Forensic Evidence (suicide or murder... who wrote the note?)
  - Plagiarism Detection (pass or fail... who did the work?)

Language identification

- We can attempt to classify documents according to the language a document is (mostly) written in.
- Can sometimes tell by
  - which characters are used,
    - e.g., Liebe Grüße uses ü and ß → German
  - which character encoding is being used
    - e.g., ISO 8859-8 is used to encode Hebrew characters → text is written in Hebrew
- But how can you tell if you are reading English vs. Japanese transliterated into the Roman alphabet? Or Swedish vs. Norwegian? And all phonetically transcribed text is encoded in the same IPA encoding!
- Consider what you base your guess on when I ask whether the following is Portuguese or Polish:
  Czy brak planów zagospodarowania hamuje rozwój Warszawy?
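
The character cue can be sketched as a lookup from diagnostic characters to candidate languages. This is only an illustration: the table `DIAGNOSTIC_CHARS` and the function `candidate_languages` are hypothetical names, and the character inventory is deliberately tiny. Since ó occurs in both Polish and Portuguese spelling, the sketch cannot settle the example sentence on characters alone.

```python
# Illustrative only: a tiny, incomplete table of characters and the
# languages (among German, Polish, Portuguese) whose spelling uses them.
DIAGNOSTIC_CHARS = {
    "ß": {"German"},
    "ł": {"Polish"},
    "ã": {"Portuguese"},
    "ó": {"Polish", "Portuguese"},
}

def candidate_languages(text):
    """Intersect the language sets of all diagnostic characters found.

    Returns None when no diagnostic character occurs in the text.
    """
    candidates = None
    for ch in text.lower():
        langs = DIAGNOSTIC_CHARS.get(ch)
        if langs is None:
            continue  # character tells us nothing in this toy table
        candidates = set(langs) if candidates is None else candidates & langs
    return candidates

print(candidate_languages("Liebe Grüße"))
print(candidate_languages(
    "Czy brak planów zagospodarowania hamuje rozwój Warszawy?"))
```

The second call leaves two candidates, which is why something beyond single characters is needed.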

Language identification

N-grams

- One simple technique for identifying languages is to use n-grams, where an n-gram is a stretch of n tokens (i.e., letters or words):
  - Go through texts for which we know which language they are written in and store the n-grams of letters found, for a certain n.
  - e.g., extracting the trigrams (3-grams) for the last sentence we'd get: "Go ", "o t", " th", "thr", "hro", "rou", ...
- This provides us with an indication of what sequences of letters are possible in a given language (and how frequently they occur).
  - e.g., thr is not a likely Japanese string.
- How do we make this more concrete?
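
One way to make this concrete is sketched below in Python. It is a minimal illustration, not the method used in the course: the function names (`char_ngrams`, `profile`, `overlap_score`, `identify`) and the two one-sentence training texts are assumptions made for the example. The profile stores n-gram counts, but for simplicity the score only checks whether each n-gram was seen at all.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Return every run of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def profile(text, n=3):
    """Count how often each character n-gram occurs in a training text."""
    return Counter(char_ngrams(text.lower(), n))

def overlap_score(text, lang_profile, n=3):
    """Fraction of the text's n-grams that were seen in the profile."""
    grams = char_ngrams(text.lower(), n)
    if not grams:
        return 0.0
    seen = sum(1 for g in grams if g in lang_profile)
    return seen / len(grams)

def identify(text, profiles, n=3):
    """Pick the language whose profile covers the most n-grams of text."""
    return max(profiles,
               key=lambda lang: overlap_score(text, profiles[lang], n))

# Toy training data (assumed); real profiles need far more text.
training = {
    "English": "go through the texts for which we know the language "
               "they are written in",
    "Polish": "czy brak planów zagospodarowania hamuje rozwój warszawy",
}
profiles = {lang: profile(sample) for lang, sample in training.items()}

print(char_ngrams("Go through", 3)[:6])
print(identify("through the language", profiles))
```

A fuller version would weight n-grams by their relative frequency instead of treating every seen n-gram equally.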

Language identification

Different techniques

- Although n-grams do not capture abstract linguistic knowledge, they are a simple and surprisingly effective technique, used throughout computational linguistics.
- Another simple technique for language identification would be to look for keywords in the documents, e.g., capture → English, je → French, etc.
  - Requires knowledge of which words are the best indicators for a particular language.
  - Words occurring frequently and independent of the topic of the text are best, e.g., so-called function words like articles (e.g., in English the, a, ...), complementizers (e.g., in English that, whether, if, ...).
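
The keyword idea can be sketched as a simple vote count. The word lists in `FUNCTION_WORDS` are hand-picked for illustration and are an assumption, as are the names `keyword_votes` and `guess_language`; a usable system would need much larger lists chosen from data.

```python
import re

# Hypothetical, hand-picked function-word lists (not from the slides,
# apart from the English examples the, a, that, whether, if and French je).
FUNCTION_WORDS = {
    "English": {"the", "a", "that", "whether", "if", "of", "and"},
    "French": {"je", "le", "la", "les", "que", "et", "de"},
}

def keyword_votes(text):
    """Count, per language, how many tokens are known function words."""
    tokens = re.findall(r"\w+", text.lower())
    return {
        lang: sum(1 for tok in tokens if tok in words)
        for lang, words in FUNCTION_WORDS.items()
    }

def guess_language(text):
    """Return the language with the most votes, or None if none matched."""
    votes = keyword_votes(text)
    best = max(votes, key=votes.get)
    return best if votes[best] > 0 else None

print(keyword_votes("I wonder whether the answer is in the book"))
print(guess_language("Je pense que la réponse est dans le livre"))
```

Because function words occur in nearly every text regardless of topic, even short inputs usually contain a few of them, which is what makes them good indicators.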

Identifying the Author

- In a classic study, Mosteller and Wallace (1964) applied authorship detection techniques to The Federalist Papers.
- The Federalist Papers were a series of 85 articles written between 1787 and 1788 by James Madison, Alexander Hamilton and John Jay to persuade New York to ratify the Constitution.
- Some of the papers were clearly written by one of the three; 12 are in question, written either by Hamilton or Madison.
- Mosteller and Wallace examined the frequency of various words in the disputed papers and compared each to a model of known Hamilton writings and known Madison writings.
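
The comparison of word frequencies against known writings can be illustrated with a much-simplified sketch. It is not a reconstruction of Mosteller and Wallace's statistical analysis: the marker list `MARKERS`, the distance measure, and the function names are assumptions, and the texts are invented toy strings, not excerpts from The Federalist Papers.

```python
import re

MARKERS = ["upon", "while", "whilst", "by", "to"]  # assumed marker words

def rates_per_thousand(text, markers=MARKERS):
    """Occurrences of each marker word per 1000 tokens of text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return [1000 * tokens.count(m) / total for m in markers]

def distance(a, b):
    """Sum of absolute differences between two rate vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def attribute(disputed, known):
    """Return the candidate whose known-writing rates are closest."""
    target = rates_per_thousand(disputed)
    return min(
        known,
        key=lambda name: distance(target, rates_per_thousand(known[name])),
    )

# Invented toy strings standing in for each candidate's known writings.
known = {
    "Author A": "upon reflection we rely upon reason while others wait",
    "Author B": "whilst they argue it is settled by law and by custom",
}
disputed = "the matter is decided by reason whilst the debate continues"

print(attribute(disputed, known))
```

With realistic amounts of text, the per-author rates become stable enough that a disputed paper can be compared against each author's profile in this way.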

Stylometric Approach

- The basic approach:
  - Extract style markers
  - Use the markers to classify texts
- Style markers may be based on words, grammar or a combination.

Lexical Style Markers

- Lexical style markers are words that give clues about authorship.
- There are two types of markers: vocabulary richness and frequency of function words.
  - Reminder: Function words such as “to” and “that” carry little meaning but occur often in a text.
- Function words are independent of topic, but the idea is that which function words you choose and where you use them are enough to identify you as an author.
- How can we use lexical markers to detect plagiarism?
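
As an illustration of the vocabulary-richness marker, the sketch below computes a type-token ratio (distinct word forms divided by total words). The slides do not say which richness measure is meant, so the choice of type-token ratio is an assumption, as are the names `tokenize` and `type_token_ratio`.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(text):
    """Distinct word forms divided by total words (0.0 for empty text)."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0

repetitive = "the cat saw the cat and the cat saw the dog"
varied = "a curious tabby observed another feline near our garden gate"

print(type_token_ratio(repetitive))
print(type_token_ratio(varied))
```

The ratio depends on text length, so it is only meaningful when comparing passages of similar size, for example sections of one essay or submissions of similar length.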

What is plagiarism?

- Clough (2003) defines text reuse as the deliberate or unintentional use of existing text for the creation of a new text.
- Plagiarism is one kind of text reuse.
- Reusing newswire text in journalistic publications is another instance of text reuse.

Types of Plagiarism

Clough (2003) outlines six forms of plagiarism:

1. Word-for-word – Whole phrases, sentences or passages are copied, but not attributed.
2. Paraphrasing – The unattributed source material is rewritten, but is still recognizable in the new text.
3. Secondary Source – Sources are cited, but extracted from a secondary source (not the original).
4. Source Form – A source’s argument structure/text organization is copied.
5. Ideas – Thoughts (independent of form) are copied without attribution.
6. Authorship – Authorship of an entire text is falsely claimed.

Word-for-Word Plagiarism: Copy

Critical care nurses have a hierarchy of roles. The nurse manager hires and fires nurses. S/he does not directly care for patients but does follow unusual or long-term cases. On each shift a resource nurse attends to the functioning of the unit as a whole, such as making sure beds are available in the operating room, and also has a patient assignment....

Recognizing Plagiarism (1)

The following factors may indicate plagiarism:

- Vocabulary use beyond the skill level of the writer (Ex: technical/advanced terms).
- A drastic change in the quality of writing compared to previous submissions.
- Style or vocabulary inconsistencies within a text.
- Choppy text that lacks transitions or smooth flow, indicating a “cut-and-paste” job.

Plagiarism Detection

1. Detection in a single text:
   - Identify inconsistencies within a text
   - Find sources for the inconsistencies
2. Detection across multiple texts:
   - Identify unacceptable collaborations
   - Identify direct copying
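
Direct copying across texts can be illustrated by measuring how many word n-grams of a suspect text also occur in a possible source. The slides do not prescribe a method here, so this containment measure is an assumed, commonly used stand-in; `word_ngrams` and `containment` are hypothetical names, and the suspect sentences are invented.

```python
import re

def word_ngrams(text, n=3):
    """Set of word n-grams (as tuples), ignoring case and punctuation."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def containment(suspect, source, n=3):
    """Fraction of the suspect's n-grams that also occur in the source."""
    suspect_grams = word_ngrams(suspect, n)
    if not suspect_grams:
        return 0.0
    shared = suspect_grams & word_ngrams(source, n)
    return len(shared) / len(suspect_grams)

source = "The nurse manager hires and fires nurses."
copied = "The nurse manager hires and fires nurses in the unit."
independent = "Managers in hospitals are responsible for staffing decisions."

print(containment(copied, source))
print(containment(independent, source))
```

A high value suggests word-for-word reuse; paraphrased or idea-level plagiarism would largely escape this check because the shared word sequences disappear.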

Detection Goals

We would like to...

- Maximize true positives (texts correctly marked as instances of plagiarism) and true negatives (texts correctly marked as not instances of plagiarism).
- Minimize false positives (texts incorrectly marked as instances of plagiarism) and false negatives (texts incorrectly marked as not instances of plagiarism).
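
These four outcomes can be counted directly once a detector's verdicts are compared against known answers. The sketch below is illustrative: the function names and the sample verdicts are made up, and precision and recall are standard summary measures added here, not terms used in the slides.

```python
def confusion_counts(predicted, actual):
    """Count TP, TN, FP, FN for parallel lists of booleans.

    True means "plagiarized"; predicted holds the detector's verdicts
    and actual holds the ground truth.
    """
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for p, a in zip(predicted, actual):
        if p and a:
            counts["TP"] += 1      # correctly marked as plagiarism
        elif not p and not a:
            counts["TN"] += 1      # correctly marked as not plagiarism
        elif p and not a:
            counts["FP"] += 1      # incorrectly marked as plagiarism
        else:
            counts["FN"] += 1      # plagiarism that was missed
    return counts

def precision_recall(counts):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    flagged = counts["TP"] + counts["FP"]
    plagiarized = counts["TP"] + counts["FN"]
    precision = counts["TP"] / flagged if flagged else 0.0
    recall = counts["TP"] / plagiarized if plagiarized else 0.0
    return precision, recall

# Made-up verdicts for six texts.
predicted = [True, True, False, False, True, False]
actual = [True, False, False, True, True, False]

counts = confusion_counts(predicted, actual)
print(counts)
print(precision_recall(counts))
```

Maximizing true positives while minimizing false positives usually involves a trade-off: a stricter detector flags fewer innocent texts but misses more plagiarized ones.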