Natural Language Processing With Python, Study notes in Culture

A book about Python programming, written by Steven Bird, Ewan Klein, and Edward Loper

Type: Study notes

2015

Shared on 23/02/2015


Natural Language Processing with Python

Steven Bird, Ewan Klein, and Edward Loper

Beijing Cambridge Farnham Köln Sebastopol Taipei Tokyo

Table of Contents

  • Preface
  1. Language Processing and Python
    • 1.1 Computing with Language: Texts and Words
    • 1.2 A Closer Look at Python: Texts as Lists of Words
    • 1.3 Computing with Language: Simple Statistics
    • 1.4 Back to Python: Making Decisions and Taking Control
    • 1.5 Automatic Natural Language Understanding
    • 1.6 Summary
    • 1.7 Further Reading
    • 1.8 Exercises
  2. Accessing Text Corpora and Lexical Resources
    • 2.1 Accessing Text Corpora
    • 2.2 Conditional Frequency Distributions
    • 2.3 More Python: Reusing Code
    • 2.4 Lexical Resources
    • 2.5 WordNet
    • 2.6 Summary
    • 2.7 Further Reading
    • 2.8 Exercises
  3. Processing Raw Text
    • 3.1 Accessing Text from the Web and from Disk
    • 3.2 Strings: Text Processing at the Lowest Level
    • 3.3 Text Processing with Unicode
    • 3.4 Regular Expressions for Detecting Word Patterns
    • 3.5 Useful Applications of Regular Expressions
    • 3.6 Normalizing Text
    • 3.7 Regular Expressions for Tokenizing Text
    • 3.8 Segmentation
    • 3.9 Formatting: From Lists to Strings
    • 3.10 Summary
    • 3.11 Further Reading
    • 3.12 Exercises
  4. Writing Structured Programs
    • 4.1 Back to the Basics
    • 4.2 Sequences
    • 4.3 Questions of Style
    • 4.4 Functions: The Foundation of Structured Programming
    • 4.5 Doing More with Functions
    • 4.6 Program Development
    • 4.7 Algorithm Design
    • 4.8 A Sample of Python Libraries
    • 4.9 Summary
    • 4.10 Further Reading
    • 4.11 Exercises
  5. Categorizing and Tagging Words
    • 5.1 Using a Tagger
    • 5.2 Tagged Corpora
    • 5.3 Mapping Words to Properties Using Python Dictionaries
    • 5.4 Automatic Tagging
    • 5.5 N-Gram Tagging
    • 5.6 Transformation-Based Tagging
    • 5.7 How to Determine the Category of a Word
    • 5.8 Summary
    • 5.9 Further Reading
    • 5.10 Exercises
  6. Learning to Classify Text
    • 6.1 Supervised Classification
    • 6.2 Further Examples of Supervised Classification
    • 6.3 Evaluation
    • 6.4 Decision Trees
    • 6.5 Naive Bayes Classifiers
    • 6.6 Maximum Entropy Classifiers
    • 6.7 Modeling Linguistic Patterns
    • 6.8 Summary
    • 6.9 Further Reading
    • 6.10 Exercises
  7. Extracting Information from Text
    • 7.1 Information Extraction
    • 7.2 Chunking
    • 7.3 Developing and Evaluating Chunkers
    • 7.4 Recursion in Linguistic Structure
    • 7.5 Named Entity Recognition
    • 7.6 Relation Extraction
    • 7.7 Summary
    • 7.8 Further Reading
    • 7.9 Exercises
  8. Analyzing Sentence Structure
    • 8.1 Some Grammatical Dilemmas
    • 8.2 What’s the Use of Syntax?
    • 8.3 Context-Free Grammar
    • 8.4 Parsing with Context-Free Grammar
    • 8.5 Dependencies and Dependency Grammar
    • 8.6 Grammar Development
    • 8.7 Summary
    • 8.8 Further Reading
    • 8.9 Exercises
  9. Building Feature-Based Grammars
    • 9.1 Grammatical Features
    • 9.2 Processing Feature Structures
    • 9.3 Extending a Feature-Based Grammar
    • 9.4 Summary
    • 9.5 Further Reading
    • 9.6 Exercises
  10. Analyzing the Meaning of Sentences
    • 10.1 Natural Language Understanding
    • 10.2 Propositional Logic
    • 10.3 First-Order Logic
    • 10.4 The Semantics of English Sentences
    • 10.5 Discourse Semantics
    • 10.6 Summary
    • 10.7 Further Reading
    • 10.8 Exercises
  11. Managing Linguistic Data
    • 11.1 Corpus Structure: A Case Study
    • 11.2 The Life Cycle of a Corpus
    • 11.3 Acquiring Data
    • 11.4 Working with XML
    • 11.5 Working with Toolbox Data
    • 11.6 Describing Language Resources Using OLAC Metadata
    • 11.7 Summary
    • 11.8 Further Reading
    • 11.9 Exercises
  • Afterword: The Language Challenge
  • Bibliography
  • NLTK Index
  • General Index

Preface

This is a book about Natural Language Processing. By “natural language” we mean a language that is used for everyday communication by humans; languages such as English, Hindi, or Portuguese. In contrast to artificial languages such as programming languages and mathematical notations, natural languages have evolved as they pass from generation to generation, and are hard to pin down with explicit rules. We will take Natural Language Processing (NLP for short) in a wide sense to cover any kind of computer manipulation of natural language. At one extreme, it could be as simple as counting word frequencies to compare different writing styles. At the other extreme, NLP involves “understanding” complete human utterances, at least to the extent of being able to give useful responses to them.

Technologies based on NLP are becoming increasingly widespread. For example, phones and handheld computers support predictive text and handwriting recognition; web search engines give access to information locked up in unstructured text; machine translation allows us to retrieve texts written in Chinese and read them in Spanish. By providing more natural human-machine interfaces, and more sophisticated access to stored information, language processing has come to play a central role in the multilingual information society.

This book provides a highly accessible introduction to the field of NLP. It can be used for individual study or as the textbook for a course on natural language processing or computational linguistics, or as a supplement to courses in artificial intelligence, text mining, or corpus linguistics. The book is intensely practical, containing hundreds of fully worked examples and graded exercises.

The book is based on the Python programming language together with an open source library called the Natural Language Toolkit (NLTK). NLTK includes extensive software, data, and documentation, all freely downloadable from http://www.nltk.org/. Distributions are provided for Windows, Macintosh, and Unix platforms. We strongly encourage you to download Python and NLTK, and try out the examples and exercises along the way.

Note that this book is not a reference work. Its coverage of Python and NLP is selective, and presented in a tutorial style. For reference material, please consult the substantial quantity of searchable resources available at http://python.org/ and http://www.nltk.org/.

This book is not an advanced computer science text. The content ranges from introductory to intermediate, and is directed at readers who want to learn how to analyze text using Python and the Natural Language Toolkit. To learn about advanced algorithms implemented in NLTK, you can examine the Python code linked from http://www.nltk.org/, and consult the other materials cited in this book.

What You Will Learn

By digging into the material presented here, you will learn:

  • How simple programs can help you manipulate and analyze language data, and how to write these programs
  • How key concepts from NLP and linguistics are used to describe and analyze language
  • How data structures and algorithms are used in NLP
  • How language data is stored in standard formats, and how data can be used to evaluate the performance of NLP techniques

Depending on your background, and your motivation for being interested in NLP, you will gain different kinds of skills and knowledge from this book, as set out in Table P-1.

Table P-1. Skills and knowledge to be gained from reading this book, depending on readers’ goals and background

| Goals | Background in arts and humanities | Background in science and engineering |
|---|---|---|
| Language analysis | Manipulating large corpora, exploring linguistic models, and testing empirical claims. | Using techniques in data modeling, data mining, and knowledge discovery to analyze natural language. |
| Language technology | Building robust systems to perform linguistic tasks with technological applications. | Using linguistic algorithms and data structures in robust language processing software. |

Organization

The early chapters are organized in order of conceptual difficulty, starting with a practical introduction to language processing that shows how to explore interesting bodies of text using tiny Python programs (Chapters 1–3). This is followed by a chapter on structured programming (Chapter 4) that consolidates the programming topics scattered across the preceding chapters. After this, the pace picks up, and we move on to a series of chapters covering fundamental topics in language processing: tagging, classification, and information extraction (Chapters 5–7). The next three chapters look at ways to parse a sentence, recognize its syntactic structure, and construct representations of meaning (Chapters 8–10). The final chapter is devoted to linguistic data and how it can be managed effectively (Chapter 11). The book concludes with an Afterword, briefly discussing the past and future of the field.

Within each chapter, we switch between different styles of presentation. In one style, natural language is the driver. We analyze language, explore linguistic concepts, and use programming examples to support the discussion. We often employ Python constructs that have not been introduced systematically, so you can see their purpose before delving into the details of how and why they work. This is just like learning idiomatic expressions in a foreign language: you’re able to buy a nice pastry without first having learned the intricacies of question formation. In the other style of presentation, the programming language will be the driver. We’ll analyze programs, explore algorithms, and the linguistic examples will play a supporting role.

Each chapter ends with a series of graded exercises, which are useful for consolidating the material. The exercises are graded according to the following scheme: ○ is for easy exercises that involve minor modifications to supplied code samples or other simple activities; ◑ is for intermediate exercises that explore an aspect of the material in more depth, requiring careful analysis and design; ● is for difficult, open-ended tasks that will challenge your understanding of the material and force you to think independently (readers new to programming should skip these).

Each chapter has a further reading section and an online “extras” section at http://www.nltk.org/, with pointers to more advanced materials and online resources. Online versions of all the code examples are also available there.

Why Python?

Python is a simple yet powerful programming language with excellent functionality for processing linguistic data. Python can be downloaded for free from http://www.python.org/. Installers are available for all platforms.

Here is a five-line Python program that processes file.txt and prints all the words ending in ing:

    >>> for line in open("file.txt"):
    ...     for word in line.split():
    ...         if word.endswith('ing'):
    ...             print(word)
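As a hedged sketch, the same loop can be wrapped in a small function and tried on an in-memory sample instead of a real file. The function name `words_ending_with`, its `suffix` parameter, and the sample text are illustrative assumptions rather than part of the book's code, and the sketch assumes Python 3.

```python
import io


def words_ending_with(lines, suffix="ing"):
    """Yield each whitespace-separated word that ends with `suffix`."""
    for line in lines:
        for word in line.split():
            if word.endswith(suffix):
                yield word


# Hypothetical sample text standing in for file.txt.
sample = io.StringIO("I was walking home\nsinging a song this evening\n")
ing_words = list(words_ending_with(sample))
print(ing_words)
```

Because the function accepts any iterable of lines, an open file object can be passed in place of the `StringIO` sample.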

This program illustrates some of the main features of Python. First, whitespace is used to nest lines of code; thus the line starting with if falls inside the scope of the previous line starting with for; this ensures that the ing test is performed for each word. Second, Python is object-oriented; each variable is an entity that has certain defined attributes and methods. For example, the value of the variable line is more than a sequence of characters. It is a string object that has a “method” (or operation) called split() that

NLTK-Data
  This contains the linguistic corpora that are analyzed and processed in the book.

NumPy (recommended)
  This is a scientific computing library with support for multidimensional arrays and linear algebra, required for certain probability, tagging, clustering, and classification tasks.

Matplotlib (recommended)
  This is a 2D plotting library for data visualization, and is used in some of the book’s code samples that produce line graphs and bar charts.

NetworkX (optional)
  This is a library for storing and manipulating network structures consisting of nodes and edges. For visualizing semantic networks, also install the Graphviz library.

Prover9 (optional)
  This is an automated theorem prover for first-order and equational logic, used to support inference in language processing.

Natural Language Toolkit (NLTK)

NLTK was originally created in 2001 as part of a computational linguistics course in the Department of Computer and Information Science at the University of Pennsylvania. Since then it has been developed and expanded with the help of dozens of contributors. It has now been adopted in courses in dozens of universities, and serves as the basis of many research projects. Table P-2 lists the most important NLTK modules.

Table P-2. Language processing tasks and corresponding NLTK modules with examples of functionality

| Language processing task | NLTK modules | Functionality |
|---|---|---|
| Accessing corpora | nltk.corpus | Standardized interfaces to corpora and lexicons |
| String processing | nltk.tokenize, nltk.stem | Tokenizers, sentence tokenizers, stemmers |
| Collocation discovery | nltk.collocations | t-test, chi-squared, point-wise mutual information |
| Part-of-speech tagging | nltk.tag | n-gram, backoff, Brill, HMM, TnT |
| Classification | nltk.classify, nltk.cluster | Decision tree, maximum entropy, naive Bayes, EM, k-means |
| Chunking | nltk.chunk | Regular expression, n-gram, named entity |
| Parsing | nltk.parse | Chart, feature-based, unification, probabilistic, dependency |
| Semantic interpretation | nltk.sem, nltk.inference | Lambda calculus, first-order logic, model checking |
| Evaluation metrics | nltk.metrics | Precision, recall, agreement coefficients |
| Probability and estimation | nltk.probability | Frequency distributions, smoothed probability distributions |
| Applications | nltk.app, nltk.chat | Graphical concordancer, parsers, WordNet browser, chatbots |
| Linguistic fieldwork | nltk.toolbox | Manipulate data in SIL Toolbox format |
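As a rough illustration of how a few of the modules in Table P-2 fit together, the sketch below tokenizes a sentence with nltk.tokenize, stems the tokens with nltk.stem, and counts the stems with nltk.probability. It assumes NLTK 3 on Python 3; the sample sentence and the variable names are made up for illustration. These particular calls should not need the NLTK-Data corpora.

```python
from nltk.probability import FreqDist
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Hypothetical sample sentence, not taken from the book.
text = "The cats were running, and the dog was running too."

# nltk.tokenize: split into alphabetic and punctuation tokens.
tokens = wordpunct_tokenize(text.lower())

# nltk.stem: reduce each alphabetic token to its Porter stem.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens if token.isalpha()]

# nltk.probability: count how often each stem occurs.
freq = FreqDist(stems)
print(freq["run"], freq["cat"])
```

Keeping tokenizing, stemming, and counting as separate steps mirrors the toolkit's modular layout: each stage can be swapped for another implementation without touching the others.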

NLTK was designed with four primary goals in mind:

Simplicity
  To provide an intuitive framework along with substantial building blocks, giving users a practical knowledge of NLP without getting bogged down in the tedious house-keeping usually associated with processing annotated language data

Consistency
  To provide a uniform framework with consistent interfaces and data structures, and easily guessable method names

Extensibility
  To provide a structure into which new software modules can be easily accommodated, including alternative implementations and competing approaches to the same task

Modularity
  To provide components that can be used independently without needing to understand the rest of the toolkit

Contrasting with these goals are three non-requirements: potentially useful qualities that we have deliberately avoided. First, while the toolkit provides a wide range of functions, it is not encyclopedic; it is a toolkit, not a system, and it will continue to evolve with the field of NLP. Second, while the toolkit is efficient enough to support meaningful tasks, it is not highly optimized for runtime performance; such optimizations often involve more complex algorithms, or implementations in lower-level programming languages such as C or C++. This would make the software less readable and more difficult to install. Third, we have tried to avoid clever programming tricks, since we believe that clear implementations are preferable to ingenious yet indecipherable ones.

For Instructors

Natural Language Processing is often taught within the confines of a single-semester course at the advanced undergraduate level or postgraduate level. Many instructors have found that it is difficult to cover both the theoretical and practical sides of the subject in such a short span of time. Some courses focus on theory to the exclusion of practical exercises, and deprive students of the challenge and excitement of writing programs to automatically process language. Other courses are simply designed to teach programming for linguists, and do not manage to cover any significant NLP content. NLTK was originally developed to address this problem, making it feasible to cover a substantial amount of theory and practice within a single-semester course, even if students have no prior programming experience.

| Chapter | Arts and Humanities | Science and Engineering |
|---|---|---|
| Chapter 5, Categorizing and Tagging Words | 2–4 | 2– |
| Chapter 6, Learning to Classify Text | 0–2 | 2– |
| Chapter 7, Extracting Information from Text | 2 | 2– |
| Chapter 8, Analyzing Sentence Structure | 2–4 | 2– |
| Chapter 9, Building Feature-Based Grammars | 2–4 | 1– |
| Chapter 10, Analyzing the Meaning of Sentences | 1–2 | 1– |
| Chapter 11, Managing Linguistic Data | 1–2 | 1– |
| Total | 18–36 | 18– |

Conventions Used in This Book

The following typographical conventions are used in this book:

Bold
  Indicates new terms.

Italic
  Used within paragraphs to refer to linguistic examples, the names of texts, and URLs; also used for filenames and file extensions.

Constant width
  Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, statements, and keywords; also used for program names.

Constant width italic
  Shows text that should be replaced with user-supplied values or by values determined by context; also used for metavariables within program code examples.

This icon signifies a tip, suggestion, or general note.

This icon indicates a warning or caution.

Using Code Examples

This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Natural Language Processing with Python, by Steven Bird, Ewan Klein, and Edward Loper. Copyright 2009 Steven Bird, Ewan Klein, and Edward Loper, 978-0-596-51649-9.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at [email protected].

Safari® Books Online

When you see a Safari® Books Online icon on the cover of your favorite technology book, that means the book is available online through the O’Reilly Network Safari Bookshelf.

Safari offers a solution that’s better than e-books. It’s a virtual library that lets you easily search thousands of top tech books, cut and paste code samples, download chapters, and find quick answers when you need the most accurate, current information. Try it for free at http://my.safaribooksonline.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at:

http://www.oreilly.com/catalog/
