Statistical Natural Language Parsing: The Rise of Data and Statistics, Study notes of Computer Science

A comprehensive overview of statistical natural language parsing, highlighting the shift from traditional symbolic, grammar-based approaches to data-driven methods. It explores the role of annotated data, probabilistic context-free grammars (PCFGs), and Chomsky normal form in parsing. The document also covers the CKY parsing algorithm, a widely used technique for efficient parsing with PCFGs. It includes illustrative examples and explanations of key concepts for students and researchers in computational linguistics and natural language processing.

Typology: Study notes

2021/2022

Uploaded on 01/21/2025

vu-le-12
Statistical Natural Language Parsing
Nguyen Phuong Thai

LLOs

  • LLO1: Present the basic knowledge of syntactic parsing, including the problem and the approaches to solving it, the CFG and PCFG grammar models, statistical and treebank-based parsing, and the evaluation of parsing results.
  • LLO2: Apply CFG/PCFG grammars to describe the syntax of natural language, for example reading and understanding treebank data, building parse trees by hand, and computing the probability of a tree.
  • LLO3: Apply the algorithms: binarization of context-free grammars (conversion to Chomsky normal form) and building the parse chart with the CKY algorithm.

Two views of linguistic structure: 1. Constituency (phrase structure)

Phrase structure organizes words into nested constituents.

  • How do we know what is a constituent? (Not that linguists don't argue about some cases.)
    • Distribution: a constituent behaves as a unit that can appear in different places:
      • John talked [to the children] [about drugs].
      • John talked [about drugs] [to the children].
      • *John talked drugs to the children about
    • Substitution/expansion/pro-forms:
      • I sat [on the box/right on top of the box/there].
    • Coordination, regular internal structure, no intrusion, fragments, semantics, …

Two views of linguistic structure: 2. Dependency structure

Dependency structure shows which words depend on (modify or are arguments of) which other words.

Example: The boy put the tortoise on the rug

[Figure: dependency tree for the example sentence]
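
A dependency structure is often stored as one head index per word. The sketch below is illustrative only: the arcs are an assumed analysis of the example sentence (with "put" as the root), and the names `words`, `heads`, and `dependents` are hypothetical.

```python
# Words of the example sentence; positions are 1-based, 0 is an artificial root.
words = ["The", "boy", "put", "the", "tortoise", "on", "the", "rug"]

# heads[i] is the 1-based position of the head of word i+1 (0 marks the root).
# Assumed analysis: determiners attach to their nouns; "boy", "tortoise" and
# "on" attach to "put"; "rug" attaches to "on".
heads = [2, 3, 0, 5, 3, 3, 8, 6]


def dependents(head_position):
    """Return the words whose head sits at the given 1-based position."""
    return [words[i] for i, h in enumerate(heads) if h == head_position]


root = words[heads.index(0)]
print(root, dependents(heads.index(0) + 1))
```

Each word has exactly one head, so a single list of integers is enough to encode the whole tree.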

Statistical Natural Language Parsing

Parsing: The rise of data and statistics

Classical NLP Parsing: The problem and its solution

Categorical constraints can be added to grammars to limit unlikely/weird parses for sentences

  • But the attempt makes the grammars not robust
    • In traditional systems, commonly 30% of sentences in even an edited text would have no parse.
  • A less constrained grammar can parse more sentences
  • But simple sentences end up with ever more parses with no way to choose between them

We need mechanisms that allow us to find the most likely parse(s) for a sentence

  • Statistical parsing lets us work with very loose grammars that admit millions of parses for sentences but still quickly find the best parse(s)

The rise of annotated data: The Penn Treebank

( (S
    (NP-SBJ (DT The) (NN move))
    (VP (VBD followed)
      (NP
        (NP (DT a) (NN round))
        (PP (IN of)
          (NP
            (NP (JJ similar) (NNS increases))
            (PP (IN by)
              (NP (JJ other) (NNS lenders)))
            (PP (IN against)
              (NP (NNP Arizona) (JJ real) (NN estate) (NNS loans))))))
      (, ,)
      (S-ADV
        (NP-SBJ (-NONE- *))
        (VP (VBG reflecting)
          (NP
            (NP (DT a) (VBG continuing) (NN decline))
            (PP-LOC (IN in)
              (NP (DT that) (NN market)))))))
    (. .)))

[Marcus et al. 1993, Computational Linguistics]
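
Bracketed treebank annotations of this kind can be loaded into a nested structure with a few lines of code. The following is a minimal sketch, not the Penn Treebank's own tooling; `parse_brackets` and `leaves` are hypothetical helper names, and the two fragments used at the end are subtrees copied from the tree above.

```python
def parse_brackets(text):
    """Parse a Penn Treebank style bracketing into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "a constituent must start with '('"
        pos += 1
        label = ""
        # The outermost wrapper "( (S ...))" has no label of its own.
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return read()


def leaves(tree):
    """Collect the words of a parsed tree from left to right."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


subject = parse_brackets("(NP-SBJ (DT The) (NN move))")
location = parse_brackets("(PP-LOC (IN in) (NP (DT that) (NN market)))")
print(subject[0], leaves(subject))
print(location[0], leaves(location))
```

Note that `leaves` also returns empty elements such as the `*` under `-NONE-`; a real pipeline would usually filter those out.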

CFGs and PCFGs

(Probabilistic) Context-Free Grammars

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP  e PP P NP people fish tanks people fish with rods N people N fish N tanks N rods V people V fish V tanks P with

Phrase structure grammars in NLP

G = (T, C, N, S, L, R)

  • T is a set of terminal symbols
  • C is a set of preterminal symbols
  • N is a set of nonterminal symbols
  • S is the start symbol (S ∈ N)
  • L is the lexicon, a set of items of the form X → x
    • X ∈ C and x ∈ T
  • R is the grammar, a set of items of the form X → γ
    • X ∈ N and γ ∈ (N ∪ C)*
  • By usual convention, S is the start symbol, but in statistical NLP, we usually have an extra node at the top (ROOT, TOP)

We usually write e for an empty sequence, rather than nothing
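
The six-part definition maps directly onto a small data structure. The sketch below encodes the people/fish toy grammar from the earlier slide and checks the membership conditions on L and R; the class name `Grammar` and the function `is_well_formed` are hypothetical, chosen only for this illustration.

```python
from typing import NamedTuple


class Grammar(NamedTuple):
    T: frozenset  # terminal symbols (words)
    C: frozenset  # preterminal symbols
    N: frozenset  # nonterminal symbols
    S: str        # start symbol
    L: frozenset  # lexicon: pairs (X, x) standing for X -> x
    R: frozenset  # grammar rules: pairs (X, gamma), gamma a tuple of symbols


toy = Grammar(
    T=frozenset({"people", "fish", "tanks", "rods", "with"}),
    C=frozenset({"N", "V", "P"}),
    N=frozenset({"S", "NP", "VP", "PP"}),
    S="S",
    L=frozenset({
        ("N", "people"), ("N", "fish"), ("N", "tanks"), ("N", "rods"),
        ("V", "people"), ("V", "fish"), ("V", "tanks"),
        ("P", "with"),
    }),
    R=frozenset({
        ("S", ("NP", "VP")),
        ("VP", ("V", "NP")),
        ("VP", ("V", "NP", "PP")),
        ("NP", ("NP", "NP")),
        ("NP", ("NP", "PP")),
        ("NP", ("N",)),
        ("NP", ()),  # the empty sequence e
        ("PP", ("P", "NP")),
    }),
)


def is_well_formed(g):
    """Check S in N, X in C and x in T for the lexicon, X in N and gamma in (N u C)* for rules."""
    lexicon_ok = all(X in g.C and x in g.T for X, x in g.L)
    rules_ok = all(
        X in g.N and all(symbol in g.N | g.C for symbol in gamma)
        for X, gamma in g.R
    )
    return g.S in g.N and lexicon_ok and rules_ok


print(is_well_formed(toy))
```

Representing the empty sequence e as an empty tuple keeps it inside (N ∪ C)* without needing a special symbol.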

A PCFG

[With empty NP removed, so less ambiguous]

  • S NP VP 1.
  • VP V NP 0.
  • VP V NP PP 0.
  • NP NP NP 0.
  • NP NP PP 0.
  • NP N 0.
  • PP P NP 1.
  • N  people 0.
  • N  fish 0.
  • N  tanks 0.
  • N  rods 0.
  • V  people 0.
  • V  fish 0.
  • V  tanks 0.
  • P  with 1.

The probability of trees and strings

  • P(t): The probability of a tree t is the product of the probabilities of the rules used to generate it.

  • P(s): The probability of the string s is the sum of the probabilities of the trees which have that string as their yield:

$$P(s) = \sum_j P(s, t_j) = \sum_j P(t_j), \quad \text{where each } t_j \text{ is a parse of } s$$
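
Both definitions can be checked on a small example. The sketch below is illustrative only: the rule probabilities are placeholder values chosen so that each left-hand side sums to 1 (they are not the values from the slides), the sentence "people fish tanks with rods" is an assumed example built from the toy lexicon, and `tree_prob` and `tree_yield` are hypothetical helper names. Trees are nested tuples of the form (label, child, ...), with a word as the only child of a preterminal.

```python
import math

# Placeholder PCFG over the toy grammar without the empty NP rule.
# Keys are (left-hand side, right-hand side tuple); values are assumed probabilities.
rules = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.5,
    ("VP", ("V", "NP", "PP")): 0.5,
    ("NP", ("NP", "NP")): 0.2,
    ("NP", ("NP", "PP")): 0.2,
    ("NP", ("N",)): 0.6,
    ("PP", ("P", "NP")): 1.0,
    ("N", ("people",)): 0.25, ("N", ("fish",)): 0.25,
    ("N", ("tanks",)): 0.25, ("N", ("rods",)): 0.25,
    ("V", ("people",)): 0.2, ("V", ("fish",)): 0.4, ("V", ("tanks",)): 0.4,
    ("P", ("with",)): 1.0,
}


def tree_prob(tree):
    """P(t): product of the probabilities of all rules used in the tree."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return rules[(label, (children[0],))]  # lexical rule X -> word
    rhs = tuple(child[0] for child in children)
    return rules[(label, rhs)] * math.prod(tree_prob(c) for c in children)


def tree_yield(tree):
    """The words at the leaves of the tree, left to right."""
    _, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]
    return [word for child in children for word in tree_yield(child)]


def np(word):
    """Shorthand for the subtree NP -> N -> word."""
    return ("NP", ("N", word))


pp = ("PP", ("P", "with"), np("rods"))
# Parse 1: the PP attaches to the verb phrase (VP -> V NP PP).
t_vp = ("S", np("people"), ("VP", ("V", "fish"), np("tanks"), pp))
# Parse 2: the PP attaches to the object noun phrase (NP -> NP PP).
t_np = ("S", np("people"), ("VP", ("V", "fish"), ("NP", np("tanks"), pp)))

parses = [t_vp, t_np]
p_string = sum(tree_prob(t) for t in parses)  # P(s) = sum over the parses of s
print(tree_prob(t_vp), tree_prob(t_np), p_string)
```

Under this grammar these two trees are the only parses of the sentence, so their sum is P(s); a parser would also return the tree with the larger P(t) as the best parse.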