




Distributional Properties, Phonotactical Rules, Syllable Template, Consonant Clusters and Lexical Constraints.
Typology: Exercises





Massachusetts Institute of Technology Department of Electrical Engineering & Computer Science
6.345 Automatic Speech Recognition Spring, 2003
Issued: 02/14/ Due: 02/26/
A language is limited not only by its inventory of basic sound units, but also by the allowable combinations of these sounds. This assignment is intended to give you some feeling for such constraints.
To do this, we will use an interactive software facility named Crystal, which runs on the Linux workstations. Crystal provides many functions for studying and displaying the distributional constraints of a lexicon. For the purpose of this lab, we will use the Merriam Pocket dictionary, which contains about 20,000 entries, as the working lexicon. To start the lab, simply enter the command:
% start_lab2.cmd
We will begin our investigation by examining some of the distributional properties of this lexicon of English words.
T1: In this exercise, we will study the properties of the most common words in the English language. Click on Sort by Brown Corpus Frequency (BCF)^1 in the Search Results sub-window, which will sort the words in the dictionary according to their number of occurrences in the Brown Corpus. Study the counts and properties of the top 15 words in the list.
Q1: What are the common characteristics of the 15 most frequent words (e.g., number of syllables, part of speech)?
^1 The Brown Corpus is a corpus of over one million words gathered at Brown University. These words were taken from various sources such as books, papers, and magazines, and their frequencies of occurrence were recorded.
T2: In this exercise, we will study the properties of the most frequent two- and three-syllable words in the English language. Set the Search Type to stress and type . . ( . ? ) in Search String. Note that all characters in the search string are separated by spaces. The first two dots match two syllables, while the third dot and the question mark in parentheses match an optional third syllable.
Q2: What are the most frequent two- and three-syllable words, and how highly are they ranked in the lexicon? When looking at only two-syllable words by using . . as the search string, which syllable is more likely to be stressed? For the second part, use S to match a stressed syllable.
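To make the search-string semantics concrete, the following is a minimal Python sketch of how a space-separated pattern such as . . ( . ? ) could be translated into an ordinary regular expression. It is not Crystal's implementation. The function names compile_pattern and matches are hypothetical, and so is the symbol U used below for an unstressed syllable (the handout only specifies S for a stressed one).

```python
import re

def compile_pattern(pattern: str) -> re.Pattern:
    """Translate a space-separated search string into a Python regex.

    A sequence of units (syllables or phonemes) is encoded as text in which
    every unit is followed by one space, so a single unit looks like "<unit> ".
    """
    operators = {"(": "(?:", ")": ")", "|": "|", "*": "*", "?": "?"}
    parts = []
    for tok in pattern.split():
        if tok in operators:
            parts.append(operators[tok])                  # grouping / repetition
        elif tok == ".":
            parts.append(r"(?:\S+ )")                     # any single unit
        else:
            parts.append("(?:" + re.escape(tok) + " )")   # one literal unit
    return re.compile("".join(parts))

def matches(pattern: str, units: list[str]) -> bool:
    """Return True if the whole unit sequence matches the pattern."""
    encoded = "".join(unit + " " for unit in units)
    return compile_pattern(pattern).fullmatch(encoded) is not None

# Two or three syllables; "U" (unstressed) is an assumed symbol.
print(matches(". . ( . ? )", ["S", "U"]))            # two syllables
print(matches(". . ( . ? )", ["U", "S", "U"]))       # three syllables
print(matches(". . ( . ? )", ["S", "U", "U", "U"]))  # four syllables
print(matches("S .", ["U", "S"]))                    # stress on first syllable?
```

Because every unit carries its own trailing space, the dot can never swallow more than one syllable, which mirrors the handout's rule that all characters in the search string are separated by spaces.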
T3: In this exercise, we will study the distributional properties of syllable patterns for English. Restore the original lexicon by clicking on it in the history sub-window. Click on Syllables per Word in the Statistics sub-window. The distribution of the syllable patterns in the Brown Corpus is different from that in the dictionary, because some words in the dictionary occur more often than others. To weight words by their Brown Corpus frequencies, click on Weight by BCF in the Statistics sub-window. The Syllables per Word graph should now be weighted by Brown Corpus frequencies.
Q3: It turns out that all of the words in the lexicon contain eight or fewer syllables. What is the most frequent number of syllables per word? Describe the probability distribution for Number of Syllables per Word. How would your answer differ when the words are weighted by their Brown Corpus frequencies?
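The difference between the dictionary distribution and the frequency-weighted distribution can be illustrated with a short Python sketch. The function name syllable_distribution and the three toy entries are hypothetical. The counts are invented for illustration and are not taken from the Brown Corpus or the lab lexicon.

```python
from collections import defaultdict

def syllable_distribution(entries, weighted=False):
    """Return {syllable_count: probability}.

    entries: iterable of (word, syllable_count, corpus_frequency).
    Unweighted, every dictionary entry counts once; weighted, every entry
    counts as many times as it occurs in the corpus.
    """
    totals = defaultdict(float)
    for _word, n_syllables, frequency in entries:
        totals[n_syllables] += frequency if weighted else 1.0
    grand_total = sum(totals.values())
    return {n: t / grand_total for n, t in sorted(totals.items())}

# Toy entries (made-up frequencies): one very common one-syllable word
# and two rare two-syllable words.
toy_lexicon = [("w1", 1, 90), ("w2", 2, 5), ("w3", 2, 5)]

print(syllable_distribution(toy_lexicon))                 # by dictionary entry
print(syllable_distribution(toy_lexicon, weighted=True))  # by corpus frequency
```

In this toy case the two-syllable words dominate the dictionary count, while the single frequent one-syllable word dominates once frequencies are taken into account.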
T4: In this exercise, we will study the distribution of stress patterns for English. Click on Stress Pattern Occurrences in the Statistics sub-window. Also, view the distribution as weighted by Brown Corpus frequencies.
Q4: What is the most frequent polysyllabic stress pattern? How would your answer differ when the words are weighted by their Brown Corpus frequencies?
T5: In this exercise, we will study the distributional properties of phonemes for English. Click on Phoneme Occurrences in the Statistics sub-window. Also, view the distribution as weighted by Brown Corpus frequencies.
Q5: Of the ten most frequently occurring phonemes in the lexicon, what are the most common manner of production and place of articulation? How would your answer differ when the words are weighted by their Brown Corpus frequencies?
The study of the allowable sound sequences of a language is called phonotactics. This part of the assignment exposes you to some of the common phonotactic rules of English.
There are only a limited number of distinct word-initial and word-final consonant clusters in the English language. We will study their properties in this part of the lab.
T6: First, restore Search Type to phonemic. Search for word-initial consonant clusters in the original lexicon containing at least two consonants by typing C C ( C * ) V . * in Search String. The C C ( C * ) portion matches two or more consonants, while the V portion matches exactly one vowel. Finally, the . * portion matches the remaining zero or more phonemes of an arbitrary word. Pay special attention to the existence of
Next, restore the original lexicon by clicking on it in the history sub-window. Search for all possible word-final consonant clusters in the lexicon by typing . * V C * in
(you can verify this by searching the lexicon with . * t k t . * or . * k t k . *). Are the following two phonemic transcriptions possible?
What is the maximum length of a word-initial consonant cluster? At this length, how many consonant clusters are there and what are they?
T7: Search for words with two adjacent vowels by typing . * V V . * in Search String. Be sure to restore the original lexicon and to ignore syllable boundaries by enabling Ignore Syllable Boundaries.
Q7: How many words have two vowels in a row? How many of them have a schwa as the second vowel? How many have a schwa as the first vowel? Use ( ax | ix ) to match both plain and front schwas. What do two adjacent vowels imply about the syllable structure of the two syllables to which they belong?
T8: The homorganic nasal-stop rule states that nasal-stop clusters must agree in place of articulation. Verify this by examining all the occurrences of nasal-stop clusters in the lexicon. You can search for all words containing nasal-stop sequences by typing . * NASAL STOP . * in Search String. You can also search for more specific examples; for instance, type . * n ( d | t ) . * in Search String. You will want to experiment with how ignoring or accounting for syllable boundaries affects the results.
Q8: How often is the homorganic nasal-stop rule violated? Can you generalize a rule to summarize when it is broken?
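As a cross-check on what counts as a violation, the rule can be written down in a few lines of Python. This is a sketch only: the function name nasal_stop_violations is hypothetical, the phoneme symbols follow the class table used in this lab, and the two example transcriptions are illustrative rather than copied from the lexicon.

```python
# Place of articulation for the nasals and stops used in this lab.
PLACE = {
    "m": "labial",    "b": "labial",    "p": "labial",
    "n": "alveolar",  "d": "alveolar",  "t": "alveolar",
    "ng": "velar",    "g": "velar",     "k": "velar",
}
NASALS = {"m", "n", "ng"}
STOPS = {"b", "d", "g", "p", "t", "k"}

def nasal_stop_violations(phonemes):
    """Return every adjacent (nasal, stop) pair whose places disagree."""
    violations = []
    for first, second in zip(phonemes, phonemes[1:]):
        if first in NASALS and second in STOPS and PLACE[first] != PLACE[second]:
            violations.append((first, second))
    return violations

# Illustrative transcriptions (assumed, not taken from the lab lexicon).
print(nasal_stop_violations(["k", "ae", "m", "p"]))        # m-p: both labial
print(nasal_stop_violations(["ih", "n", "p", "uh", "t"]))  # n-p: alveolar + labial
```

The sketch treats the phoneme list as flat, which corresponds to searching with Ignore Syllable Boundaries enabled; a cluster that straddles a syllable boundary is therefore still reported.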
In this part of the lab you will investigate the extent to which a given word can be disambiguated from competitors based on partial phonetic information.
T9: You have done some spectrogram reading practice in class. In this exercise, we will show that the use of lexical access can greatly assist the task. In Figures 3, 4, and 5, you will find three spectrograms of isolated words. Start with a very coarse transcription of the spectrogram by hand. If you cannot determine the phones, try to come up with phone classes such as vowel, nasal, strong fricative, voiced stop, etc. Perform a search on the lexicon based on your partial hypothesis. If you cannot determine the words, try to refine your hypothesis and search again. The search pattern should be expressed as a regular expression, many examples of which have already been given in the previous tasks. The following classes have been defined along with abbreviations, or you can use the OR operator, |, to create custom classes. Enable Ignore Syllable Boundaries so that you will not have to explicitly specify syllable boundaries.
CLASS               ABBREVIATION  MEMBERS
VOWEL               V             all vowels
RETROFLEXED         R             r axr er
FRICATIVE           F             s sh z zh f th v dh
STRONG-FRICATIVE    SF            s sh z zh
WEAK-FRICATIVE      WF            f th v dh
NASAL               N             m n ng
GLIDE               G             w y
LIQUID              L             l r
SEMIVOWEL           SV            l r w y
ASPIRANT                          hh
STOP                S             b d g p t k
VOICED-STOP         VS            b d g
UNVOICED-STOP       US            p t k
AFFRICATE           A             ch jh
SYLLABIC-CONSONANT  SC            el em en
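The way a coarse class-level hypothesis narrows down the lexicon can be sketched in Python as follows. This is an illustration under stated assumptions, not Crystal's code: the names CLASSES, to_regex, and search are hypothetical, the vowel list is a small assumed subset (the table only says "all vowels"), ASPIRANT is omitted because no abbreviation is listed for it, and the three-word lexicon with its transcriptions is made up for the example.

```python
import re

# Class abbreviations from the table above. The "V" entry is an assumed,
# incomplete vowel subset used only to keep the example small.
CLASSES = {
    "V": ["aa", "ae", "ah", "ax", "eh", "ih", "ix", "iy", "uw"],
    "R": ["r", "axr", "er"],
    "F": ["s", "sh", "z", "zh", "f", "th", "v", "dh"],
    "SF": ["s", "sh", "z", "zh"],
    "WF": ["f", "th", "v", "dh"],
    "N": ["m", "n", "ng"],
    "G": ["w", "y"],
    "L": ["l", "r"],
    "SV": ["l", "r", "w", "y"],
    "S": ["b", "d", "g", "p", "t", "k"],
    "VS": ["b", "d", "g"],
    "US": ["p", "t", "k"],
    "A": ["ch", "jh"],
    "SC": ["el", "em", "en"],
}
OPERATORS = {"(": "(?:", ")": ")", "|": "|", "*": "*", "?": "?"}

def to_regex(pattern: str) -> re.Pattern:
    """Expand class abbreviations and build a regex over 'phone ' units."""
    parts = []
    for tok in pattern.split():
        if tok in OPERATORS:
            parts.append(OPERATORS[tok])
        elif tok == ".":
            parts.append(r"(?:\S+ )")                 # any single phoneme
        else:
            members = CLASSES.get(tok, [tok])         # class or literal phoneme
            alternatives = "|".join(re.escape(m) for m in members)
            parts.append("(?:(?:" + alternatives + ") )")
    return re.compile("".join(parts))

def search(pattern: str, lexicon: dict) -> list:
    """Return the words whose full phoneme sequence matches the pattern."""
    regex = to_regex(pattern)
    return [word for word, phones in lexicon.items()
            if regex.fullmatch("".join(p + " " for p in phones))]

# Made-up mini lexicon with illustrative transcriptions.
toy_lexicon = {
    "sun": ["s", "ah", "n"],
    "camp": ["k", "ae", "m", "p"],
    "cat": ["k", "ae", "t"],
}
print(search("SF V N", toy_lexicon))    # strong fricative, vowel, nasal
print(search(". * N US", toy_lexicon))  # ends in nasal + unvoiced stop
```

Even a three-class hypothesis such as SF V N rules out most of this toy lexicon, which is the effect the exercise asks you to observe on the full dictionary.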
Q9: What are the words in each spectrogram? What partial phonetic hypothesis led you to the answer with the help of lexical search?
Figure 4: Mystery word #2. [Panels: Zero Crossing Rate (0 to 16 kHz), Total Energy (dB), Energy -- 125 Hz to 750 Hz (dB), Waveform, Wide Band Spectrogram (0 to 8 kHz); horizontal axis: Time (seconds).]
Figure 5: Mystery word #3. [Panels: Zero Crossing Rate (0 to 16 kHz), Total Energy (dB), Energy -- 125 Hz to 750 Hz (dB), Waveform, Wide Band Spectrogram (0 to 8 kHz); horizontal axis: Time (seconds).]