Statistical Natural Language Parsing: The Rise of Data and Statistics, Study notes of Computer Science

Hong Kong Baptist University Computer Science

An overview of statistical natural language parsing, highlighting the shift from traditional symbolic grammar-based approaches to data-driven methods. It covers the role of annotated data, probabilistic context-free grammars (PCFGs), and Chomsky normal form in parsing, as well as the CKY parsing algorithm, a widely used technique for efficient parsing with PCFGs. It includes illustrative examples and explanations of key concepts for students and researchers in computational linguistics and natural language processing.

Typology: Study notes

2021/2022

Uploaded on 01/21/2025

vu-le-12 🇭🇰


Statistical Natural Language Parsing

Nguyen Phuong Thai


LLOs

- LLO1: Present the basics of syntactic parsing, including the problem and the approaches to solving it, the CFG and PCFG grammar models, statistical parsing with treebanks, and the evaluation of parsing results.
- LLO2: Apply CFG/PCFG grammars to describe the syntax of natural language, for example reading treebank data, building parse trees by hand, and computing the probability of a tree.
- LLO3: Apply the algorithms: binarization of context-free grammars (conversion to Chomsky normal form) and building the parse chart with the CKY algorithm.

Two views of linguistic structure:

1. Constituency (phrase structure)

Phrase structure organizes words into nested constituents.

How do we know what is a constituent? (Not that linguists don't argue about some cases.)
- Distribution: a constituent behaves as a unit that can appear in different places:
  - John talked [to the children] [about drugs].
  - John talked [about drugs] [to the children].
  - *John talked drugs to the children about
- Substitution/expansion/pro-forms:
  - I sat [on the box/right on top of the box/there].
- Coordination, regular internal structure, no intrusion, fragments, semantics, …

Two views of linguistic structure:

2. Dependency structure

Dependency structure shows which words depend on (modify or are arguments of) which other words.

[Figure: dependency tree for the sentence "The boy put the tortoise on the rug"]
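A dependency analysis can be stored as one head index per word. The sketch below is only an illustration: the head assignments are an assumption read off the example sentence (with "put" as the root and the preposition "on" heading "rug"), and the names `SENTENCE`, `HEADS`, `dependents`, and `root` are hypothetical.

```python
# Words of the example sentence, numbered from 1 in the HEADS list below.
SENTENCE = ["The", "boy", "put", "the", "tortoise", "on", "the", "rug"]

# HEADS[i] is the 1-based position of the head of word i+1; 0 marks the root.
# These attachments are an assumed analysis, not taken from the slides.
HEADS = [2, 3, 0, 5, 3, 3, 8, 6]


def root():
    """Return the word that depends on no other word."""
    return SENTENCE[HEADS.index(0)]


def dependents(position):
    """Return the words whose head is the word at the given 1-based position."""
    return [SENTENCE[i] for i, head in enumerate(HEADS) if head == position]


if __name__ == "__main__":
    print(root())          # the root word
    print(dependents(3))   # words that depend directly on "put"
```

Storing heads in a flat list keeps the structure compact, and every word has exactly one head by construction.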


Parsing: The rise of data and statistics

Classical NLP Parsing: The problem and its solution

Categorical constraints can be added to grammars to limit unlikely/weird parses for sentences.
- But the attempt makes the grammars not robust.
  - In traditional systems, commonly 30% of sentences in even an edited text would have no parse.
- A less constrained grammar can parse more sentences.
- But simple sentences end up with ever more parses, with no way to choose between them.

We need mechanisms that allow us to find the most likely parse(s) for a sentence.
- Statistical parsing lets us work with very loose grammars that admit millions of parses for sentences but still quickly find the best parse(s).
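To get a feel for how quickly the number of candidate parses can grow, the sketch below counts the unlabeled binary bracketings of a sentence, which is the Catalan number for the sentence length minus one. This is an illustration only: the number of parses an actual grammar admits depends on its rules and labels, and the function name `binary_bracketings` is hypothetical.

```python
from math import comb


def binary_bracketings(num_words):
    """Count the unlabeled binary tree shapes over a sentence of num_words words.

    This is the Catalan number C(n) with n = num_words - 1.
    """
    if num_words < 1:
        raise ValueError("a sentence needs at least one word")
    n = num_words - 1
    return comb(2 * n, n) // (n + 1)


if __name__ == "__main__":
    for length in (5, 10, 20):
        print(length, binary_bracketings(length))
```

Even before category labels are considered, the count grows exponentially with sentence length, which is why enumerating all parses and choosing among them by hand-written constraints does not scale.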

The rise of annotated data: The Penn Treebank

    ( (S
        (NP-SBJ (DT The) (NN move))
        (VP (VBD followed)
          (NP
            (NP (DT a) (NN round))
            (PP (IN of)
              (NP
                (NP (JJ similar) (NNS increases))
                (PP (IN by)
                  (NP (JJ other) (NNS lenders)))
                (PP (IN against)
                  (NP (NNP Arizona) (JJ real) (NN estate) (NNS loans))))))
          (, ,)
          (S-ADV
            (NP-SBJ (-NONE- *))
            (VP (VBG reflecting)
              (NP
                (NP (DT a) (VBG continuing) (NN decline))
                (PP-LOC (IN in)
                  (NP (DT that) (NN market)))))))
        (. .)))

[Marcus et al. 1993, Computational Linguistics]
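Bracketed treebank entries like the one above can be read with a small recursive parser. The following is a minimal sketch, not the official treebank tooling: the function names `tokenize`, `parse`, and `leaves` are hypothetical, and the input in the usage example is a shortened fragment of the tree, not the full entry.

```python
def tokenize(text):
    """Split a bracketed tree string into parentheses and symbols."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens, pos=0):
    """Parse the bracketed node starting at tokens[pos].

    Returns (node, next_pos), where node is a nested list such as
    ["NP", ["DT", "a"], ["NN", "round"]].
    """
    if tokens[pos] != "(":
        raise ValueError("expected '(' at position %d" % pos)
    pos += 1
    node = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
            node.append(child)
        else:
            node.append(tokens[pos])
            pos += 1
    return node, pos + 1


def leaves(node):
    """Return the words (the yield) of a parsed tree, left to right."""
    # A preterminal node has the form [tag, word].
    if len(node) == 2 and all(isinstance(item, str) for item in node):
        return [node[1]]
    words = []
    for child in node:
        if isinstance(child, list):
            words.extend(leaves(child))
    return words


if __name__ == "__main__":
    fragment = "(S (NP-SBJ (DT The) (NN move)) (VP (VBD followed) (NP (DT a) (NN round))))"
    tree, _ = parse(tokenize(fragment))
    print(tree[0])        # label of the top node
    print(leaves(tree))   # the words of the fragment
```

Nested lists keep the sketch short. Note that empty elements such as `(-NONE- *)` would be returned by `leaves` as the word `*` unless they are filtered out.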

CFGs and PCFGs: (Probabilistic) Context-Free Grammars

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP  e PP P NP people fish tanks people fish with rods N people N fish N tanks N rods V people V fish V tanks P with

Phrase structure grammars in NLP

G = (T, C, N, S, L, R)

- T is a set of terminal symbols
- C is a set of preterminal symbols
- N is a set of nonterminal symbols
- S is the start symbol (S ∈ N)
- L is the lexicon, a set of items of the form X → x
  - X ∈ C and x ∈ T
- R is the grammar, a set of items of the form X → γ
  - X ∈ N and γ ∈ (N ∪ C)*

By usual convention, S is the start symbol, but in statistical NLP we usually have an extra node at the top (ROOT, TOP).

We usually write e for an empty sequence, rather than nothing.
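The six-part definition can be written down directly as a data structure and checked against its own constraints. This is a sketch under stated assumptions: the class `Grammar`, its field names, and `is_well_formed` are hypothetical, and the empty sequence e is represented as an empty tuple.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    terminals: frozenset      # T
    preterminals: frozenset   # C
    nonterminals: frozenset   # N
    start: str                # S
    lexicon: frozenset        # L: pairs (X, x)
    rules: frozenset          # R: pairs (X, rhs), rhs a tuple of symbols


def is_well_formed(g):
    """Check S in N, X in C and x in T for L, X in N and rhs in (N u C)* for R."""
    if g.start not in g.nonterminals:
        return False
    lexicon_ok = all(x_cat in g.preterminals and word in g.terminals
                     for x_cat, word in g.lexicon)
    symbols = g.nonterminals | g.preterminals
    rules_ok = all(lhs in g.nonterminals and all(s in symbols for s in rhs)
                   for lhs, rhs in g.rules)
    return lexicon_ok and rules_ok


# The toy "people fish tanks" grammar, with NP -> e written as an empty tuple.
TOY = Grammar(
    terminals=frozenset({"people", "fish", "tanks", "rods", "with"}),
    preterminals=frozenset({"N", "V", "P"}),
    nonterminals=frozenset({"S", "NP", "VP", "PP"}),
    start="S",
    lexicon=frozenset({("N", "people"), ("N", "fish"), ("N", "tanks"),
                       ("N", "rods"), ("V", "people"), ("V", "fish"),
                       ("V", "tanks"), ("P", "with")}),
    rules=frozenset({("S", ("NP", "VP")), ("VP", ("V", "NP")),
                     ("VP", ("V", "NP", "PP")), ("NP", ("NP", "NP")),
                     ("NP", ("NP", "PP")), ("NP", ("N",)), ("NP", ()),
                     ("PP", ("P", "NP"))}),
)

if __name__ == "__main__":
    print(is_well_formed(TOY))
```

Keeping the lexicon separate from the rules mirrors the definition: only lexicon items mention words, and grammar rules only mention category symbols.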


A PCFG

[With empty NP removed so less

S NP VP 1.
VP V NP 0.
VP V NP PP 0.
NP NP NP 0.
NP NP PP 0.
NP N 0.
PP P NP 1.
N  people 0.
N  fish 0.
N  tanks 0.
N  rods 0.
V  people 0.
V  fish 0.
V  tanks 0.
P  with 1.
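In a PCFG, the probabilities of all rules that share a left-hand side sum to 1. The sketch below checks this property. The probability values are illustrative assumptions, because the values in the table above are cut off after the decimal point in this copy. The names `PCFG`, `lhs_totals`, and `is_normalized` are hypothetical.

```python
import math
from collections import defaultdict

# (left-hand side, right-hand side, probability).
# All probabilities here are ASSUMED example values, not recovered from the table.
PCFG = [
    ("S", ("NP", "VP"), 1.0),
    ("VP", ("V", "NP"), 0.6),
    ("VP", ("V", "NP", "PP"), 0.4),
    ("NP", ("NP", "NP"), 0.1),
    ("NP", ("NP", "PP"), 0.2),
    ("NP", ("N",), 0.7),
    ("PP", ("P", "NP"), 1.0),
    ("N", ("people",), 0.5),
    ("N", ("fish",), 0.2),
    ("N", ("tanks",), 0.2),
    ("N", ("rods",), 0.1),
    ("V", ("people",), 0.1),
    ("V", ("fish",), 0.6),
    ("V", ("tanks",), 0.3),
    ("P", ("with",), 1.0),
]


def lhs_totals(rules):
    """Sum the rule probabilities for each left-hand side."""
    totals = defaultdict(float)
    for lhs, _rhs, prob in rules:
        totals[lhs] += prob
    return dict(totals)


def is_normalized(rules, tol=1e-9):
    """True if the probabilities for every left-hand side sum to 1."""
    return all(math.isclose(total, 1.0, abs_tol=tol)
               for total in lhs_totals(rules).values())


if __name__ == "__main__":
    print(lhs_totals(PCFG))
    print(is_normalized(PCFG))
```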

The probability of trees and strings

- P(t): the probability of a tree t is the product of the probabilities of the rules used to generate it.
- P(s): the probability of the string s is the sum of the probabilities of the trees which have that string as their yield:

    P(s) = Σ_j P(s, t_j) = Σ_j P(t_j), where each t_j is a parse of s
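Both definitions can be worked through on the sentence "people fish tanks with rods", which has a reading where the PP attaches to the verb and one where it attaches to the noun. The sketch below is illustrative only: the rule probabilities are assumed example values (the table in this copy is truncated), the two trees are built by hand, and the names `RULES`, `LEXICON`, and `tree_prob` are hypothetical.

```python
import math

# ASSUMED example probabilities; not recovered from the truncated table.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.6,
    ("VP", ("V", "NP", "PP")): 0.4,
    ("NP", ("NP", "NP")): 0.1,
    ("NP", ("NP", "PP")): 0.2,
    ("NP", ("N",)): 0.7,
    ("PP", ("P", "NP")): 1.0,
}
LEXICON = {
    ("N", "people"): 0.5, ("N", "fish"): 0.2, ("N", "tanks"): 0.2,
    ("N", "rods"): 0.1, ("V", "people"): 0.1, ("V", "fish"): 0.6,
    ("V", "tanks"): 0.3, ("P", "with"): 1.0,
}


def tree_prob(tree):
    """P(t): product of the probabilities of all rules used in the tree.

    A tree is a tuple (label, child, ...); a preterminal is (tag, word).
    """
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return LEXICON[(label, children[0])]
    rhs = tuple(child[0] for child in children)
    prob = RULES[(label, rhs)]
    for child in children:
        prob *= tree_prob(child)
    return prob


def np(word):
    """Helper: the unary chain NP -> N -> word."""
    return ("NP", ("N", word))


PP = ("PP", ("P", "with"), np("rods"))
# Reading 1: the PP attaches to the verb (VP -> V NP PP).
T1 = ("S", np("people"), ("VP", ("V", "fish"), np("tanks"), PP))
# Reading 2: the PP attaches to the noun (NP -> NP PP).
T2 = ("S", np("people"), ("VP", ("V", "fish"), ("NP", np("tanks"), PP)))

if __name__ == "__main__":
    p1, p2 = tree_prob(T1), tree_prob(T2)
    print(p1, p2)
    # P(s) sums over the parses of s; here only the two hand-built trees.
    print(p1 + p2)
```

Under these assumed values the verb-attachment tree is the more probable of the two, and the sum of the two tree probabilities equals P(s) only to the extent that these are all the parses the grammar assigns to the sentence.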
