
Simple English Wikipedia: A New Text Simplification Task

William Coster
Computer Science Department
Pomona College
Claremont, CA 91711

David Kauchak
Computer Science Department
Pomona College
Claremont, CA 91711

Abstract

In this paper we examine the task of sentence simplification, which aims to reduce the reading complexity of a sentence by incorporating more accessible vocabulary and sentence structure. We introduce a new data set that pairs English Wikipedia with Simple English Wikipedia and is orders of magnitude larger than any previously examined for sentence simplification. The data contains the full range of simplification operations including rewording, reordering, insertion and deletion. We provide an analysis of this corpus as well as preliminary results using a phrase-based translation approach for simplification.

1 Introduction

The task of text simplification aims to reduce the complexity of text while maintaining the content (Chandrasekar and Srinivas, 1997; Carroll et al., 1998; Feng, 2008). In this paper, we explore the sentence simplification problem: given a sentence, the goal is to produce an equivalent sentence where the vocabulary and sentence structure are simpler.

Text simplification has a number of important applications. Simplification techniques can be used to make text resources available to a broader range of readers, including children, language learners, the elderly, the hearing impaired and people with aphasia or cognitive disabilities (Carroll et al., 1998; Feng, 2008). As a preprocessing step, simplification can improve the performance of NLP tasks, including parsing, semantic role labeling, machine translation and summarization (Miwa et al., 2010; Jonnalagadda et al., 2009; Vickrey and Koller, 2008; Chandrasekar and Srinivas, 1997). Finally, models for text simplification are similar to models for sentence compression; advances in simplification can benefit compression, which has applications in mobile devices, summarization and captioning (Knight and Marcu, 2002; McDonald, 2006; Galley and McKeown, 2007; Nomoto, 2009; Cohn and Lapata, 2009).

One of the key challenges for text simplification is data availability. The small amount of simplification data currently available has prevented the application of data-driven techniques like those used in other text-to-text translation areas (Och and Ney, 2004; Chiang, 2010). Most prior techniques for text simplification have involved either hand-crafted rules (Vickrey and Koller, 2008; Feng, 2008) or rules learned within a very restricted rule space (Chandrasekar and Srinivas, 1997).

We have generated a data set consisting of 137K aligned simplified/unsimplified sentence pairs by pairing documents, then sentences, from English Wikipedia^1 with corresponding documents and sentences from Simple English Wikipedia^2. Simple English Wikipedia contains articles aimed at children and English language learners and contains similar content to English Wikipedia but with simpler vocabulary and grammar.

Figure 1 shows example sentence simplifications from the data set. Like machine translation and other text-to-text domains, text simplification involves the full range of transformation operations including deletion, rewording, reordering and insertion.

(^1) http://en.wikipedia.org/
(^2) http://simple.wikipedia.org

a. Normal: As Isolde arrives at his side, Tristan dies with her name on his lips.
   Simple: As Isolde arrives at his side, Tristan dies while speaking her name.
b. Normal: Alfonso Perez Munoz, usually referred to as Alfonso, is a former Spanish footballer, in the striker position.
   Simple: Alfonso Perez is a former Spanish football player.
c. Normal: Endemic types or species are especially likely to develop on islands because of their geographical isolation.
   Simple: Endemic types are most likely to develop on islands because they are isolated.
d. Normal: The reverse process, producing electrical energy from mechanical, energy, is accomplished by a generator or dynamo.
   Simple: A dynamo or an electric generator does the reverse: it changes mechanical movement into electric energy.

Figure 1: Example sentence simplifications extracted from Wikipedia. Normal refers to a sentence in an English Wikipedia article and Simple to a corresponding sentence in Simple English Wikipedia.

2 Previous Data

Wikipedia and Simple English Wikipedia have both received some recent attention as a useful resource for text simplification and the related task of text compression. Yamangil and Nelken (2008) examine the history logs of English Wikipedia to learn sentence compression rules. Yatskar et al. (2010) learn a set of candidate phrase simplification rules based on edits identified in the revision histories of both Simple English Wikipedia and English Wikipedia. However, they only provide a list of the top phrasal simplifications and do not utilize them in an end-to-end simplification system. Finally, Napoles and Dredze (2010) provide an analysis of the differences between documents in English Wikipedia and Simple English Wikipedia, though they do not view the data set as a parallel corpus.

Although the simplification problem shares some characteristics with the text compression problem, existing text compression data sets are small and contain a restricted set of possible transformations (often only deletion). Knight and Marcu (2002) introduced the Ziff-Davis corpus, which contains 1K sentence pairs. Cohn and Lapata (2009) manually generated two parallel corpora from news stories totaling 3K sentence pairs. Finally, Nomoto (2009) generated a data set based on RSS feeds containing 2K sentence pairs.

3 Simplification Corpus Generation

We generated a parallel simplification corpus by aligning sentences between English Wikipedia and Simple English Wikipedia. We obtained complete copies of English Wikipedia and Simple English Wikipedia in May 2010. We first paired the articles by title, then removed all article pairs where either article: contained only a single line, was flagged as a stub, was flagged as a disambiguation page or was a meta-page about Wikipedia. After pairing and filtering, 10,588 aligned, content article pairs remained (a 90% reduction from the original 110K Simple English Wikipedia articles). Throughout the rest of this paper we will refer to unsimplified text from English Wikipedia as normal and to the simplified version from Simple English Wikipedia as simple.

To generate aligned sentence pairs from the aligned document pairs we followed an approach similar to those utilized in previous monolingual alignment problems (Barzilay and Elhadad, 2003; Nelken and Shieber, 2006). Paragraphs were identified based on formatting information available in the articles. Each simple paragraph was then aligned to every normal paragraph where the TF-IDF cosine similarity was over a threshold of 0.5. We initially investigated the paragraph clustering preprocessing step of Barzilay and Elhadad (2003), but did not find a qualitative difference and opted for the simpler similarity-based alignment approach, which does not require manual annotation.
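The paragraph alignment step can be illustrated with a minimal sketch. This is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, and the function names (tfidf_vectors, cosine, align_paragraphs) are assumptions made for illustration; only the TF-IDF cosine similarity criterion and the 0.5 threshold come from the text.

```python
import math
from collections import Counter


def tfidf_vectors(paragraphs):
    """Return one sparse TF-IDF vector (a dict) per tokenized paragraph."""
    n = len(paragraphs)
    df = Counter()
    for tokens in paragraphs:
        df.update(set(tokens))  # document frequency: count each term once per paragraph
    vectors = []
    for tokens in paragraphs:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption) keeps weights positive for very common terms.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align_paragraphs(simple_paras, normal_paras, threshold=0.5):
    """Pair each simple paragraph with every normal paragraph above the threshold."""
    tokenized = [p.lower().split() for p in simple_paras + normal_paras]
    vectors = tfidf_vectors(tokenized)
    simple_vecs = vectors[: len(simple_paras)]
    normal_vecs = vectors[len(simple_paras):]
    return [
        (i, j)
        for i, sv in enumerate(simple_vecs)
        for j, nv in enumerate(normal_vecs)
        if cosine(sv, nv) > threshold
    ]


if __name__ == "__main__":
    simple = ["the cat sat on the mat", "dogs bark loudly at night"]
    normal = ["the cat sat upon the mat", "stock markets fell sharply today"]
    print(align_paragraphs(simple, normal))  # [(0, 0)]
```

Because a simple paragraph is paired with every normal paragraph above the threshold, the result is a many-to-many relation rather than a one-to-one matching.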

rewordings – a normal word is changed to a different simple word, deletions – a normal word is deleted, reorderings – non-monotonic alignment, splits – a normal word is split into multiple simple words, and merges – multiple normal words are condensed to a single simple word.

Transformation   %
rewordings       65%
deletions        47%
reorders         34%
merges           31%
splits           27%

Table 2: Percentage of sentence pairs that contained word-level operations based on the induced word alignment. Splits and merges are from the perspective of words in the normal sentence. These are not mutually exclusive events.

Table 2 shows the percentage of each of these phenomena occurring in the sentence pairs. All of the different operations occur frequently in the data set, with rewordings being particularly prevalent.
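As a rough illustration of how such operations might be read off a word alignment, the sketch below labels one sentence pair. It is an assumption-laden reading of the definitions above, not the procedure used to produce Table 2: the alignment is taken as a set of (normal index, simple index) links, a rewording is taken to be a one-to-one link between different words, and the function name word_level_operations and the example alignment are hypothetical.

```python
from collections import defaultdict


def word_level_operations(normal, simple, alignment):
    """Return the set of word-level operations present in one aligned sentence pair.

    normal, simple: lists of tokens.
    alignment: set of (i, j) links between normal[i] and simple[j].
    """
    n_to_s = defaultdict(set)
    s_to_n = defaultdict(set)
    for i, j in alignment:
        n_to_s[i].add(j)
        s_to_n[j].add(i)

    ops = set()
    # Deletion: some normal word has no link to any simple word.
    if any(i not in n_to_s for i in range(len(normal))):
        ops.add("deletion")
    # Split: one normal word linked to several simple words.
    if any(len(js) > 1 for js in n_to_s.values()):
        ops.add("split")
    # Merge: several normal words linked to one simple word.
    if any(len(linked) > 1 for linked in s_to_n.values()):
        ops.add("merge")
    # Rewording: a one-to-one link whose two words differ.
    for i, j in alignment:
        one_to_one = len(n_to_s[i]) == 1 and len(s_to_n[j]) == 1
        if one_to_one and normal[i].lower() != simple[j].lower():
            ops.add("rewording")
    # Reordering: two links cross, so the alignment is non-monotonic.
    links = sorted(alignment)
    for a in range(len(links)):
        for b in range(a + 1, len(links)):
            (i1, j1), (i2, j2) = links[a], links[b]
            if i1 < i2 and j1 > j2:
                ops.add("reordering")
    return ops


if __name__ == "__main__":
    normal = "Tristan dies with her name on his lips".split()
    simple = "Tristan dies while speaking her name".split()
    # Hypothetical hand-written alignment for illustration only.
    alignment = {(0, 0), (1, 1), (2, 2), (3, 4), (4, 5)}
    print(sorted(word_level_operations(normal, simple, alignment)))
    # ['deletion', 'rewording']
```

Since a sentence pair can exhibit several operations at once, counting pairs per operation in this way yields percentages that need not sum to 100%, consistent with the note in the Table 2 caption.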

5 Sentence-level Text Simplification

To understand the usefulness of this data we ran preliminary experiments to learn a sentence-level simplification system. We view the problem of text simplification as an English-to-English translation problem. Motivated by the importance of lexical changes, we used Moses, a phrase-based machine translation system (Och and Ney, 2004).^3 We trained Moses on 124K pairs from the data set and the n-gram language model on the simple side of this data. We trained the hyper-parameters of the log-linear model on a 500 sentence pair development set.

We compared the trained system to a baseline of not doing any simplification (NONE). We evaluated the two approaches on a test set of 1300 sentence pairs. Since there is currently no standard for automatically evaluating sentence simplification, we used three different automatic measures that have been used in related domains: BLEU, which has been used extensively in machine translation (Papineni et al., 2002), and word-level F1 and simple string accuracy (SSA), which have been suggested for text compression (Clarke and Lapata, 2006). All three of these measures have been shown to correlate with human judgements in their respective domains.

System         BLEU     word-F1   SSA
NONE           0.5937   0.5967    0.
Moses          0.5987   0.6076    0.
Moses-Oracle   0.6317   0.6661    0.

Table 3: Test scores for the baseline (NONE), Moses and Moses-Oracle.

Table 3 shows the results of our initial test. All differences are statistically significant at p = 0.01, measured using bootstrap resampling with 100 samples (Koehn, 2004). Although the baseline does well (recall that over a quarter of the sentence pairs in the data set are identical) the phrase-based approach does obtain a statistically significant improvement.

To understand the limits of the phrase-based model for text simplification, we generated an n-best list of the 1000 most-likely simplifications for each test sentence. We then greedily picked the simplification from this n-best list that had the highest sentence-level BLEU score based on the test examples, labeled Moses-Oracle in Table 3. The large difference between Moses and Moses-Oracle indicates possible room for improvement utilizing better parameter estimation or n-best list reranking techniques (Och et al., 2004; Ge and Mooney, 2006).

(^3) We also experimented with T3 (Cohn and Lapata, 2009) but the results were poor and are not presented here.
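The oracle selection can be sketched in a few lines. This is a simplified illustration rather than the experimental setup: it scores hypotheses with a bag-of-words F1 as a stand-in for sentence-level BLEU, and the exact definition of word-level F1 used in the experiments is not given in the text, so the word_f1 function below is an assumption. The names word_f1 and oracle_pick are hypothetical.

```python
from collections import Counter


def word_f1(candidate, reference):
    """Bag-of-words F1 between two whitespace-tokenized sentences (an assumed definition)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection of word counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def oracle_pick(nbest, reference, score=word_f1):
    """Pick the n-best hypothesis that scores highest against the reference.

    max() returns the first maximal item, so ties go to the hypothesis
    that the decoder ranked higher.
    """
    return max(nbest, key=lambda hyp: score(hyp, reference))


if __name__ == "__main__":
    reference = "Alfonso Perez is a former Spanish football player"
    nbest = [
        "Alfonso Perez Munoz is a former Spanish footballer",
        "Alfonso Perez is a former Spanish football player",
    ]
    print(oracle_pick(nbest, reference))
```

Because the oracle consults the reference, its score is an upper bound on what reranking the same n-best list could achieve, not a result obtainable at test time.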

6 Conclusion

We have described a new text simplification data set generated from aligning sentences in Simple English Wikipedia with sentences in English Wikipedia. The data set is orders of magnitude larger than any currently available for text simplification or for the related field of text compression and is publicly available.^4 We provided preliminary text simplification results using Moses, a phrase-based translation system, and saw a statistically significant improvement of 0.005 BLEU over the baseline of no simplification and showed that further improvement of up to 0.034 BLEU may be possible based on the oracle results. In the future, we hope to explore alignment techniques more tailored to simplification as well as applications of this data to text simplification.

(^4) http://www.cs.middlebury.edu/~dkauchak/simplification/

References

Regina Barzilay and Noemie Elhadad. 2003. Sentence alignment for monolingual comparable corpora. In Proceedings of EMNLP.

John Carroll, Gido Minnen, Yvonne Canning, Siobhan Devlin, and John Tait. 1998. Practical simplification of English newspaper text to assist aphasic readers. In Proceedings of AAAI Workshop on Integrating AI and Assistive Technology.

Raman Chandrasekar and Bangalore Srinivas. 1997. Automatic induction of rules for text simplification. In Knowledge Based Systems.

David Chiang. 2010. Learning to translate with source and target syntax. In Proceedings of ACL.

James Clarke and Mirella Lapata. 2006. Models for sentence compression: A comparison across domains, training requirements and evaluation measures. In Proceedings of ACL.

Trevor Cohn and Mirella Lapata. 2009. Sentence compression as tree transduction. Journal of Artificial Intelligence Research.

Lijun Feng. 2008. Text simplification: A survey. CUNY Technical Report.

Michel Galley and Kathleen McKeown. 2007. Lexicalized Markov grammars for sentence compression. In Proceedings of HLT/NAACL.

Ruifang Ge and Raymond Mooney. 2006. Discriminative reranking for semantic parsing. In Proceedings of COLING.

Siddhartha Jonnalagadda, Luis Tari, Jorg Hakenberg, Chitta Baral, and Graciela Gonzalez. 2009. Towards effective sentence simplification for automatic processing of biomedical text. In Proceedings of HLT/NAACL.

Dan Klein and Christopher Manning. 2003. Accurate unlexicalized parsing. In Proceedings of ACL.

Kevin Knight and Daniel Marcu. 2002. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of ACL.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP.

Ryan McDonald. 2006. Discriminative sentence compression with soft syntactic evidence. In Proceedings of EACL.

Makoto Miwa, Rune Saetre, Yusuke Miyao, and Jun'ichi Tsujii. 2010. Entity-focused sentence simplification for relation extraction. In Proceedings of COLING.

Courtney Napoles and Mark Dredze. 2010. Learning simple Wikipedia: A cogitation in ascertaining abecedarian language. In Proceedings of HLT/NAACL Workshop on Computational Linguistics and Writing.

Rani Nelken and Stuart Shieber. 2006. Towards robust context-sensitive sentence alignment for monolingual corpora. In Proceedings of AMTA.

Tadashi Nomoto. 2007. Discriminative sentence compression with conditional random fields. In Information Processing and Management.

Tadashi Nomoto. 2008. A generic sentence trimmer with CRFs. In Proceedings of HLT/NAACL.

Tadashi Nomoto. 2009. A comparison of model free versus model intensive approaches to sentence compression. In Proceedings of EMNLP.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Franz Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics.

Franz Josef Och, Kenji Yamada, Stanford U, Alex Fraser, Daniel Gildea, and Viren Jain. 2004. A smorgasbord of features for statistical machine translation. In Proceedings of HLT/NAACL.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL.

Emily Pitler. 2010. Methods for sentence compression. Technical Report MS-CIS-10-20, University of Pennsylvania.

Jenine Turner and Eugene Charniak. 2005. Supervised and unsupervised learning for sentence compression. In Proceedings of ACL.

David Vickrey and Daphne Koller. 2008. Sentence simplification for semantic role labeling. In Proceedings of ACL.

Elif Yamangil and Rani Nelken. 2008. Mining Wikipedia revision histories for improving sentence compression. In ACL.

Mark Yatskar, Bo Pang, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. 2010. For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia. In HLT/NAACL Short Papers.