




























































































Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
An in-depth exploration of part-of-speech tagging and sense disambiguation in speech synthesis, as covered in the ece 598: speech synthesis course at the university of illinois at urbana-champaign. The challenges of part-of-speech tagging, particularly in languages with minimal morphology, and the importance of tagging for speech synthesis. It also delves into sense disambiguation, using the french language as an example, and the decision-list approach for disambiguation. The document also touches upon other topics related to speech synthesis, such as word pronunciation, abbreviation expansion, and language modeling.
Typology: Study Guides, Projects, Research
1 / 109
This page cannot be seen from the preview
Don't miss anything!





























































































Richard Sproat http://www.linguistics.uiuc.edu/rws/ URL for this course: http://catarina.ai.uiuc.edu/ECE598/
? Part of speech tagging ? Word-sense disambiguation ? Word pronunciation ? Preprocessing: Abbreviation expansion, etc.
? Word segmentation in Asian languages ? Architectures for multilingual linguistic analysis
ECE 598: Linguistic Analysis
? It’s more obvious what the basic set of tags should be since words fall into ? The morphology gives important cues to what the part of speech is: cantaremos is highly likely to be a verb given the ending -ar-emos.
? there are fewer cues in English than there are in Spanish ? for some languages like Chinese, cues are almost completely absent and linguists can’t even agree on whether (e.g.) Chinese distinguishes verbs from adjectives.
But usually these analyses assume an additional set of morphosyntactic features.
eat/VB, eat/VBP, eats/VBZ, ate/VBD, eaten/VBN
“the Penn Treebank project collapsed many tags compared to the original Brown tagset, and got better results.” (http://www.ilc.cnr.it/EAGLES96/ morphsyn/node18.html)
But choosing the right size tagset depends upon the intended application.
As far as I know, there is no demonstration of what is the “optimal” tagset.
? “the Penn Treebank tagset is based on that of the Brown Corpus. However the stochastic orientation of the Penn Treebank and the resulting concern with sparse data led us to modify the Brown tagset by paring it down considerably” (Marcus, Santorini and Marcinkiewicz, 1993). ? eliminated distinctions that were lexically recoverable: thus no separate tags for be, do, have. ? as well as distinctions that were syntactically recoverable (e.g. the distinction between subject and object pronouns)
Even with a well-designed tagset, there are cases that even experts find it difficult to agree on.
1 35, 2 3, 3 264 4 61 5 12 6 2 7 1 “still”
ENGTWOL (Voutilainen, 1995) is an FST-based rule-based system for English tagging.
Example rule:
Adverbial that rule: Given input “that”: if:
(+1 A/ADV/QUANT) /* next word is adj, adv. or quant / (+2 SENT-LIM) / following is sentence boundary / (NOT -1 SVOC/A) / prev word not adj comp verb */
then eliminate non-ADV tags else eliminate ADV tag
ti = argmaxjP (tj|ti− 1 )P (wi|tj)
j
P (tj|tj− 1 )P (wj|tj)
which is given by
But since we know the word sequence we can eliminate that and just maximize
You can assume an initial distribution of tags over the corpus (given a dictionary and perhaps some lingustically base guesses) and then use an algorithm such as expectation maximization (EM).
TBL was proposed by Eric Brill in his 1995 U Penn dissertation. It is a “weakly statistical” method
Thus: Change NN to VB when the previous tag is TO
So: expected/VBD to/TO race/NN → expected/VBD to/TO race/VB
The rulespace is searched for the rule that gives the most improvement given the corpus.