Stylometry: Recognizing Authorship Through Writing Styles - Prof. Rachel A. Greenstadt, Study notes of Cryptography and System Security

These slides introduce stylometry, the field concerned with attributing authorship to documents based on the linguistic styles they exhibit. The presentation covers several classification techniques, including decision trees, neural networks, and naive Bayesian classifiers. It also discusses challenges in stylometry, such as obfuscation and imitation attacks, and presents methods for detecting them.

Typology: Study notes

Pre 2010

Uploaded on 08/19/2009

Practical Attacks Against Authorship Recognition Techniques
(…and a little bit of AI)

Mike Brennan

What is AI?

(Wait, I don’t think I need this slide! We’re all computer scientists here, right?)

What is a Classifier?

• Given some set of information about a target, determine its classification.

• For example, given the entire course history of a student at Drexel, you would probably be able to figure out what their major is.

  • Feature set might include a comparison of the classes they took to the required classes for each major.

  • What if they switched majors?
    • We could add another feature that looks at how recently the comparison set of classes was taken.
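
The course-history example can be sketched in a few lines of Python. This is only an illustration: the majors, the course codes, and the function names `overlap_features` and `classify_major` are hypothetical and do not come from the slides. The single feature is the fraction of each major's required classes that the student has taken.

```python
# Hypothetical majors and required courses, invented for this sketch.
REQUIRED = {
    "Computer Science": {"CS171", "CS172", "CS260", "MATH221"},
    "Mathematics": {"MATH121", "MATH122", "MATH221", "MATH331"},
}


def overlap_features(courses_taken):
    """Fraction of each major's required classes the student has taken."""
    taken = set(courses_taken)
    return {major: len(taken & req) / len(req) for major, req in REQUIRED.items()}


def classify_major(courses_taken):
    """Pick the major whose requirements best match the course history."""
    features = overlap_features(courses_taken)
    return max(features, key=features.get)


history = ["CS171", "CS172", "MATH221", "ENGL101"]
print(overlap_features(history))
print(classify_major(history))
```

A "how recently" feature for students who switched majors could be added the same way, for instance by weighting each matched course by the term in which it was taken.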

Learning With Decision Trees

  • Examples described by attribute values (Boolean, discrete, continuous)
  • E.g., situations where I will/won't wait for a table:
  • Classification of examples is positive (T) or negative (F)
  • Goal: Learn the definition for the goal predicate WillWait (slide from Russell & Norvig)
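
The slides do not say how the tree is built. A common approach, and the one Russell & Norvig describe, is to split on the attribute with the highest information gain. The sketch below shows that one step on a handful of made-up examples; the data is invented for illustration and is not the textbook's table, and only the attribute names `Patrons` and `Hungry` are borrowed from the restaurant example.

```python
from collections import Counter
from math import log2

# Made-up toy examples (attribute values, WillWait label); not the textbook's data.
EXAMPLES = [
    ({"Patrons": "Some", "Hungry": True}, True),
    ({"Patrons": "Full", "Hungry": True}, False),
    ({"Patrons": "Some", "Hungry": False}, True),
    ({"Patrons": "None", "Hungry": False}, False),
    ({"Patrons": "Full", "Hungry": False}, False),
    ({"Patrons": "Some", "Hungry": True}, True),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(examples, attribute):
    """Entropy of the labels minus the weighted entropy after splitting."""
    labels = [label for _, label in examples]
    remainder = 0.0
    for value in {attrs[attribute] for attrs, _ in examples}:
        subset = [label for attrs, label in examples if attrs[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder


def best_attribute(examples):
    """Attribute to test at the root: the one with the highest gain."""
    attributes = examples[0][0].keys()
    return max(attributes, key=lambda a: information_gain(examples, a))


print(best_attribute(EXAMPLES))
```

A full learner would apply `best_attribute` recursively to each subset until every leaf holds only T or only F examples.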

Neural Networks

• Attempt to build a computation system based on the parallel architecture of brains.

• Characteristics:

– Many simple processing elements

– Many connections

– Simple messages

– Adaptive interaction

  • Connection “weights” altered as data flows through. (slide from Rachel Greenstadt)
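
To make "weights altered as data flows through" concrete, here is a minimal sketch of a single processing element trained with the classic perceptron learning rule. The choice of rule, the AND task, and the names `step` and `train_perceptron` are assumptions made for illustration; the slides do not name a specific learning rule.

```python
def step(x):
    """Threshold activation: fire (1) when the weighted input is non-negative."""
    return 1 if x >= 0 else 0


def predict(weights, bias, inputs):
    """Output of one processing element for the given inputs."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_perceptron(samples, epochs=20, lr=1):
    """Adjust the connection weights after every example that passes through."""
    weights = [0, 0]
    bias = 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # Perceptron rule: nudge each weight in proportion to its input.
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias


# Logical AND as a tiny, linearly separable training set.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND)
predictions = [predict(weights, bias, inputs) for inputs, _ in AND]
print(weights, bias, predictions)
```

A real network connects many such elements in layers and uses a different training procedure, but the idea of adjusting weights in response to each example is the same.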

Neural Networks

(image from Russell & Norvig)

Naïve Bayesian Classifier

  • Given a target, determine which class it belongs to, based on the conditional probabilities of the target's features under each class.
    – Conditional probability: what is the probability of X given that we know Y?
  • Example: Movie Review Classifier
    • Features: frequency of each word in positive and negative reviews.
  • For each class (positive and negative), multiply together the probabilities of the review's words given that class. Classify the review as whichever class has the greater probability.
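
The movie review example can be sketched as follows. The training reviews are made up, and the function names are hypothetical. Two details go beyond the slide and are assumptions of this sketch: add-one smoothing, so that an unseen word does not zero out the product, and summing log-probabilities, which gives the same ranking as multiplying probabilities but avoids numerical underflow. Class priors are left out because the two toy classes are the same size.

```python
from collections import Counter
from math import log

# Made-up training reviews, purely for illustration.
TRAINING = {
    "positive": ["a great and moving film", "great acting and a great story"],
    "negative": ["a dull and boring film", "boring story and awful acting"],
}


def train(training):
    """Count word frequencies per class and collect the shared vocabulary."""
    counts = {
        label: Counter(" ".join(docs).split()) for label, docs in training.items()
    }
    vocabulary = set().union(*counts.values())
    return counts, vocabulary


def log_score(review, label, counts, vocabulary):
    """Sum of log P(word | label), with add-one smoothing for unseen words."""
    total = sum(counts[label].values())
    return sum(
        log((counts[label][word] + 1) / (total + len(vocabulary)))
        for word in review.split()
    )


def classify(review, counts, vocabulary):
    """Return the class under which the review's words are most probable."""
    return max(counts, key=lambda label: log_score(review, label, counts, vocabulary))


counts, vocabulary = train(TRAINING)
print(classify("great story", counts, vocabulary))
print(classify("boring film", counts, vocabulary))
```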

Stylometry

• Stylometry is the study of attributing authorship to documents based on the linguistic styles they exhibit.

– Handwriting doesn’t count!

• In general the basic question of Stylometry is “who wrote this document?”

Why is Stylometry Important?

• Great, you can figure out who wrote a 200-year-old document, so what?

• From the Institute for Linguistic Evidence:

  • “In some criminal, civil, and security matters, language can be evidence. A suicide note, a threatening letter, anonymous communications, business emails, blog posts, trademarks – all of these can help investigators, attorneys, human resource executives and private individuals understand the heart of an incident. When you are faced with [a threat or incident] you need reliable, validated methods [of stylometry].”

The Problem

• An underlying assumption in most Stylometry research is that people are honest in their writing style. What if they’re not?

Related Work

• Some work has been done in looking into how well methods of stylometry can distinguish between true authors and imitators.

  • Somers & Tweedie, 2003: Alice in Wonderland.
  • Mixed results.

• Overall this area has been largely overlooked. Patrick Juola, an expert in computational linguistics formerly of Oxford, states in his 2008 book:

  • “There is obviously great potential for further work here.”

Study Format and Setup

• 15 individual authors. Participation had three parts:

  • Submit 5000 words of pre-existing writing from a formal source.
    • Formal means school essays, professional reports, etc. No slang, abbreviations, casual conversation.
  • Write a new 500-word passage as an obfuscation attack.
    • Task: Describe your neighborhood.
  • Write a new 500-word passage as an imitation attack.
    • Task: Using a passage from The Road as a model, imitate Cormac McCarthy in a third-person narrative about your day, starting from when you wake up.

Some Attack Examples…

• “Light sliced through the blinds, and construction began in the adjacent apartment. The harsh cacophony crashed through the wall.”

• “Hot water in the mug. Brush in the mug. The blade read "Wilkinson Sword" on the layered wax paper packaging.”

• “He fills the coffee pot with water, after cleaning out the putrid remains of yesterday's brew. The beans are in the freezer, he remembers.”

Our Approach: Method 1

• Statistical method using the Signature Stylometric System.

– Assigns a Chi-Square value for each comparison. The higher the number, the less likely they are the same author.

• Three features: word length, letter usage, punctuation usage.

• Average Accuracy: 95%
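
As a rough illustration of the chi-square comparison, the sketch below compares the word-length histograms of two texts with a standard two-sample chi-square statistic. How the Signature system actually computes its value is not given in the slides, so the binning, the formula, and the names `word_length_counts` and `chi_square` are all assumptions. Only the word-length feature is shown; letter and punctuation usage could be counted the same way.

```python
from collections import Counter


def word_length_counts(text, max_len=8):
    """Histogram of word lengths; lengths above max_len share one bin."""
    words = [w.strip(".,;:!?\"'()") for w in text.split()]
    return Counter(min(len(w), max_len) for w in words if w)


def chi_square(counts_a, counts_b):
    """Two-sample chi-square statistic; assumes both texts are non-empty."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    statistic = 0.0
    for length in set(counts_a) | set(counts_b):
        pooled = counts_a[length] + counts_b[length]
        for observed, total in ((counts_a[length], total_a), (counts_b[length], total_b)):
            # Expected count if both texts shared one word-length distribution.
            expected = total * pooled / (total_a + total_b)
            statistic += (observed - expected) ** 2 / expected
    return statistic


known = word_length_counts("The cat sat on the mat.")
other = word_length_counts("Extraordinarily complicated sentences demonstrate verbosity.")
print(chi_square(known, known))  # identical distributions
print(chi_square(known, other))  # very different distributions
```

As on the slide, a higher value means the two samples are less likely to share an author. A 500-word passage gives far more stable counts than these one-sentence toy inputs.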