Simple English Wikipedia: A New Text Simplification Task

William Coster

Computer Science Department

Pomona College

Claremont, CA 91711

[email protected]

David Kauchak

Computer Science Department

Pomona College

Claremont, CA 91711

[email protected]

Abstract

In this paper we examine the task of sentence simplification, which aims to reduce the reading complexity of a sentence by incorporating more accessible vocabulary and sentence structure. We introduce a new data set that pairs English Wikipedia with Simple English Wikipedia and is orders of magnitude larger than any previously examined for sentence simplification. The data contains the full range of simplification operations including rewording, reordering, insertion and deletion. We provide an analysis of this corpus as well as preliminary results using a phrase-based translation approach for simplification.

1 Introduction

The task of text simplification aims to reduce the complexity of text while maintaining the content (Chandrasekar and Srinivas, 1997; Carroll et al., 1998; Feng, 2008). In this paper, we explore the sentence simplification problem: given a sentence, the goal is to produce an equivalent sentence where the vocabulary and sentence structure are simpler.

Text simplification has a number of important applications. Simplification techniques can be used to make text resources available to a broader range of readers, including children, language learners, the elderly, the hearing impaired and people with aphasia or cognitive disabilities (Carroll et al., 1998; Feng, 2008). As a preprocessing step, simplification can improve the performance of NLP tasks, including parsing, semantic role labeling, machine translation and summarization (Miwa et al., 2010; Jonnalagadda et al., 2009; Vickrey and Koller, 2008; Chandrasekar and Srinivas, 1997). Finally, models for text simplification are similar to models for sentence compression; advances in simplification can benefit compression, which has applications in mobile devices, summarization and captioning (Knight and Marcu, 2002; McDonald, 2006; Galley and McKeown, 2007; Nomoto, 2009; Cohn and Lapata, 2009).

One of the key challenges for text simplification is data availability. The small amount of simplification data currently available has prevented the application of data-driven techniques like those used in other text-to-text translation areas (Och and Ney, 2004; Chiang, 2010). Most prior techniques for text simplification have relied either on hand-crafted rules (Vickrey and Koller, 2008; Feng, 2008) or on rules learned within a very restricted rule space (Chandrasekar and Srinivas, 1997).

We have generated a data set consisting of 137K aligned simplified/unsimplified sentence pairs by pairing documents, and then sentences, from English Wikipedia [1] with the corresponding documents and sentences from Simple English Wikipedia [2]. Simple English Wikipedia contains articles aimed at children and English language learners; its content is similar to that of English Wikipedia but uses simpler vocabulary and grammar.
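The two-stage pairing described above (documents first, then sentences) can be illustrated with a minimal sketch. The details here are assumptions made for illustration only: documents are matched by identical title, and sentences are matched by Jaccard word overlap with a hypothetical threshold of 0.5. This introduction does not specify the alignment method actually used to build the data set, so the function names and the similarity measure below should not be read as the authors' procedure.

```python
import re


def tokens(sentence):
    """Return the set of lowercased word tokens in a sentence."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def jaccard(a, b):
    """Jaccard word overlap between two sentences (0.0 if either is empty)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def pair_documents(normal_docs, simple_docs):
    """Pair documents that share a title.

    Both arguments map a title to a list of sentences (an assumed layout).
    """
    shared = sorted(normal_docs.keys() & simple_docs.keys())
    return [(normal_docs[t], simple_docs[t]) for t in shared]


def pair_sentences(normal_sents, simple_sents, threshold=0.5):
    """Pair each normal sentence with its most similar simple sentence.

    A pair is kept only if the overlap reaches the (hypothetical) threshold.
    """
    pairs = []
    for n in normal_sents:
        best = max(simple_sents, key=lambda s: jaccard(n, s), default=None)
        if best is not None and jaccard(n, best) >= threshold:
            pairs.append((n, best))
    return pairs


normal = {
    "Cat": ["The cat is a small carnivorous mammal.",
            "It is often kept as a pet."],
    "Dog": ["The dog is a domesticated descendant of the wolf."],
}
simple = {
    "Cat": ["The cat is a small mammal.", "Many people keep cats."],
}

aligned = []
for normal_sents, simple_sents in pair_documents(normal, simple):
    aligned.extend(pair_sentences(normal_sents, simple_sents))
```

In this toy input only the "Cat" article exists in both collections, and only its first sentence has a close enough counterpart to be kept. A real alignment would also need to handle sentence splitting, merged sentences and ordering constraints, none of which this sketch attempts.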

Figure 1 shows example sentence simplifications from the data set. Like machine translation and other text-to-text domains, text simplification involves the full range of transformation operations including deletion, rewording, reordering and insertion.
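To make these operation types concrete, the following sketch labels word-level differences between an unsimplified and a simplified sentence using Python's standard `difflib.SequenceMatcher`. The mapping from diff opcodes to operation names is an assumption made for illustration, not an analysis method taken from this paper. A plain diff also cannot recognize reordering directly; a moved phrase would surface as a deletion in one place and an insertion in another.

```python
import difflib

# Assumed mapping from difflib opcode tags to simplification operations.
OPERATION_NAMES = {
    "replace": "rewording",
    "delete": "deletion",
    "insert": "insertion",
}


def edit_operations(normal, simple):
    """List word-level operations that turn `normal` into `simple`.

    Each item is (operation, words removed from normal, words added in simple).
    """
    a = normal.lower().split()
    b = simple.lower().split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    operations = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged words are not an operation
        operations.append((OPERATION_NAMES[tag], a[i1:i2], b[j1:j2]))
    return operations


deletion = edit_operations("the cat is a small carnivorous mammal",
                           "the cat is a small mammal")
rewording = edit_operations("he purchased a car", "he bought a car")
insertion = edit_operations("he left", "he left early")
```

Each of the three toy pairs yields a single operation of the named type. Real sentence pairs typically mix several operations at once.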

[1] http://en.wikipedia.org/

[2] http://simple.wikipedia.org

