



HOW TO HANDLE URDU LANGUAGE GRAMMAR
Typology: Thesis




This paper describes the development of an inflectional morphological analyzer that can analyze the different inflections of a Pashto verb, noun or adjective. The system is corpus-based. It accepts input in the form of a transliterated Pashto verbal, nominal or adjectival inflection, converts it to its Arabic-scripted Pashto equivalent, morphologically analyzes the word, and then searches for and displays all the sentences in the corpus in which the word is used.
Pashto is a morphologically rich language. Among the many applications of Natural Language Processing (NLP), one is a system that provides all the morphological tags of a given word and retrieves examples of its use from a corpus of real-life data. This work deals with the design and development of such an application. The developed system can morphologically analyze any verbal, nominal or adjectival inflection and provide examples of its use, which are retrieved from the Pashto corpus [1]. The system has several possible uses. A linguist can use it to morphologically analyze a particular word and see examples of its everyday use. Another, very important, use is in the development of a part-of-speech (POS) tagger for Pashto. The rest of the paper is organized as follows. Section 2 provides a brief overview of the morphology of Pashto verbs, nouns and adjectives. Section 3 sheds light on the analysis of verbal, nominal and adjectival inflections. Section 4 covers the modeling and design of the morphological analyzer. Section 5 discusses the implementation of the morphological analyzer. Section 6 provides details of the overall corpus-based morphological analyzer for Pashto.
Before describing the computational work, it is important to summarize briefly the work of the Pashto linguists that we studied: Penzl [2], Khattak [3], Tegey and Robson [4], and Babrakzai [5]. Their work forms the basis for the research presented in this paper. Khattak [3] identifies the different facets for which a Pashto verb inflects. He says, “The formal distinctions of the Pashto verb reflect a variety of categories: tense, aspect, mood and voice. Referring to the NPs in the subject or object position, the verb also inflects for person, number and gender.” Khattak [3] further says that the morphology of the Pashto verb shows only two simple tenses: present and past. The future is expressed with the help of the modal clitic ba. Babrakzai [5] provides the basic structure of a Pashto verb, given below, where # indicates the potential positions for clitics.
Verb = [aspect # negative # stem + agreement #]
Babrakzai [5] defines agreement as follows: “System of inflection that records a nominal’s inherent features (usually person, number, gender/ or case) on another category, generally a verb, adjective or a determiner”. According to Tegey and Robson [4], agreement is indicated with personal endings, i.e. suffixes following the verb stem which show person and number. The category of gender is restricted to the third person forms of simple verbs and to the third person singular forms of the auxiliary [2], called the copula verbs of 'to be' [6]. However, gender is also found in the third person plural form of this auxiliary in the Yousafzai dialect [7].
A Pashto noun inflects for gender, number and case [2]. Different Pashto grammarians [2, 8, 9] categorize Pashto nouns into different masculine and feminine classes according to their final phonemes. Bellew [10] and others have also contributed significantly to the study of Pashto nouns. Pashto adjectives have more or less the same inflectional properties and morphological behavior as Pashto nouns.
Different verbal, nominal and adjectival inflections were manually extracted from about 30,000 words of written Pashto data. These include over 2000 verbal, 2500 nominal and 1800 adjectival inflections. The inflections were decomposed into stems and affixes. This lengthy analysis phase revealed the personal suffixes of a Pashto verb, given in Table 1.
Table 1: Personal suffixes
Person                                                Suffix
First person singular (Present + Past)                -әm
First person plural (Present + Past)                  -u
Second person singular (Present + Past)               -ee
Second person plural (Present + Past)                 -әi
Third person singular and plural in present tense     -i
Third person masculine singular (Past)                -o
Third person masculine plural (Past)                  -
Third person feminine singular (Past)                 -a
Third person feminine plural (Past)                   -ee
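The suffix inventory in Table 1 can be read as a small lookup structure. The following Python fragment is only an illustrative sketch, not the system described in this paper: it strips a personal suffix from a transliterated form and lists every reading the suffix allows. The function name and the placeholder stem `stem` are assumptions, and the third person masculine plural (Past) entry is omitted because its suffix is not legible in the table.

```python
# Personal suffixes from Table 1, in table order.
# The third person masculine plural (Past) row is left out because its
# suffix is not legible in the table.
PERSONAL_SUFFIXES = [
    ("әm", "first person singular (present + past)"),
    ("u", "first person plural (present + past)"),
    ("ee", "second person singular (present + past)"),
    ("әi", "second person plural (present + past)"),
    ("i", "third person singular/plural (present)"),
    ("o", "third person masculine singular (past)"),
    ("a", "third person feminine singular (past)"),
    ("ee", "third person feminine plural (past)"),
]


def candidate_analyses(word):
    """Return (stem, suffix, label) for every suffix the word could end in.

    This is plain string matching: it does not check that the remaining
    stem exists in any lexicon, so it over-generates on purpose.
    """
    results = []
    for suffix, label in PERSONAL_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            results.append((word[: -len(suffix)], suffix, label))
    return results


if __name__ == "__main__":
    # "stem" is a placeholder, not a real Pashto verb stem.
    for analysis in candidate_analyses("stemee"):
        print(analysis)
```

A form ending in -ee receives two readings (second person singular, and third person feminine plural in the past), which shows why suffix stripping alone cannot disambiguate an inflection.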
Various other verbal affixes, revealed in this analysis, are listed in table 2.
Table 2: Various affixes used in verb morphology
Morphological property        Affix
Perfective marking prefix     wә-
Past marking infix            -әl-
Passive participle suffix     -e
Perfect participle suffix     -e
Optative suffix               -e or -ɑy
The analysis of Pashto nominal inflections shows that Pashto nouns fall into various types (classes) based on their ending phoneme. Pashto nouns are classified into seven masculine and seven feminine classes. Each class has a particular type of ending phoneme, and the suffix that expresses a given facet differs from class to class. For example, the suffixes for direct plural formation of the various masculine classes of nouns are given in Table 3.
Table 3: Suffixes for various masculine classes of nouns
Noun class                      Suffix
First masculine (animate)       -ɑn
First masculine (inanimate)     -una
Second masculine                -i (loud-stressed)
Third masculine                 -i (weak-stressed)
Fourth masculine (human)        -una
Fourth masculine (animal)       -ɑn
Fifth masculine                 -gɑn or -wɑn
Sixth masculine                 -una
Seventh masculine               -yɑn
Two classes may share the same direct plural suffix, but in that case their other suffixes, e.g. the vocative suffix, differ. Hence they are distinct classes. The case of Pashto adjectives is similar to that of Pashto nouns, as revealed by the analysis of adjectival inflections. Based on the ending phonemes of Pashto adjectives, eight classes are defined [11].
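The class-dependent suffixation in Table 3 can be pictured as a mapping from noun class to direct plural suffix. The Python sketch below is a simplified illustration under stated assumptions: it concatenates stem and suffix naively, ignores stress and any stem changes, and uses a placeholder stem. The function names are hypothetical and do not come from the system itself.

```python
# Direct plural suffixes of the masculine noun classes (Table 3).
# The loud/weak stress distinction on -i is not represented here.
DIRECT_PLURAL_SUFFIX = {
    "first masculine (animate)": ["ɑn"],
    "first masculine (inanimate)": ["una"],
    "second masculine": ["i"],
    "third masculine": ["i"],
    "fourth masculine (human)": ["una"],
    "fourth masculine (animal)": ["ɑn"],
    "fifth masculine": ["gɑn", "wɑn"],
    "sixth masculine": ["una"],
    "seventh masculine": ["yɑn"],
}


def direct_plurals(stem, noun_class):
    """Naively attach every direct plural suffix of the class to the stem."""
    return [stem + suffix for suffix in DIRECT_PLURAL_SUFFIX[noun_class]]


def classes_with_suffix(suffix):
    """List the classes that use the given direct plural suffix."""
    return [
        noun_class
        for noun_class, suffixes in DIRECT_PLURAL_SUFFIX.items()
        if suffix in suffixes
    ]


if __name__ == "__main__":
    # "stem" is a placeholder, not a real Pashto noun stem.
    print(direct_plurals("stem", "fifth masculine"))
    print(classes_with_suffix("una"))
```

The second call returns three classes for -una, which mirrors the point made above: a shared plural suffix does not merge classes, because the classes still differ in their other suffixes.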
The morphological analyzer is modeled using Finite State Transducers (FSTs). FSTs combine the lexicon and the rules, as Beesley and Karttunen [12] put it: “An FST incorporates all the lexicon and rule information in a single network data structure, mapping directly between a language of underlying or “lexical” strings and a language of surface strings”. The rules devised in this research work are productive: more verbs, nouns and adjectives can be added to the system without changing the rules. After the various affixes in the morphology were identified, the order in which these affixes attach to the verbal, nominal or adjectival stem was determined. The determination of this order served as a
Figure 3: The masculine form of the fifth class of adjectives
These FSTs are ready to be implemented. The next section sheds light on the implementation of these FSTs.
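To make the idea of mapping lexical strings to surface strings concrete, the following Python sketch walks a tiny hand-written transducer over a sequence of lexical symbols. It is a toy under explicit assumptions, not the FSTs designed in this work: the tag names (`+Perf`, `+Past`, `+Pres`, `+1SG`, and so on), the state names and the placeholder stem `STEM` are all hypothetical, only a few affixes from Tables 1 and 2 are included, and affixes are concatenated without any phonological adjustment.

```python
# Arcs of a toy transducer: state -> {lexical symbol: (surface output, next state)}.
# Affix order follows the verb structure: aspect, stem, tense marking, agreement.
ARCS = {
    "aspect": {"+Perf": ("wә", "stem")},
    "tense": {"+Past": ("әl", "agreement"), "+Pres": ("", "agreement")},
    "agreement": {
        "+1SG": ("әm", "final"),
        "+1PL": ("u", "final"),
        "+2SG": ("ee", "final"),
        "+2PL": ("әi", "final"),
    },
}

# Placeholder stem lexicon; adding stems requires no change to ARCS.
STEMS = {"STEM"}


def transduce(lexical_symbols):
    """Map a sequence of lexical symbols to a surface string, or None if rejected."""
    state = "aspect"
    surface = []
    for symbol in lexical_symbols:
        # A stem may follow the perfective prefix or start the word directly.
        if state in ("aspect", "stem") and symbol in STEMS:
            surface.append(symbol)
            state = "tense"
            continue
        arc = ARCS.get(state, {}).get(symbol)
        if arc is None:
            return None  # no arc for this symbol in the current state
        output, state = arc
        surface.append(output)
    return "".join(surface) if state == "final" else None


if __name__ == "__main__":
    print(transduce(["+Perf", "STEM", "+Past", "+1SG"]))
    print(transduce(["STEM", "+Pres", "+2SG"]))
    print(transduce(["+Past", "STEM"]))  # wrong affix order is rejected
```

Keeping the stems in a separate set from the arcs reflects the productivity noted above: new stems can be added without touching the affix rules.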
This section provides the implementation details of the morphological analyzer. The FSTs developed during the modeling and design phase are implemented using four programming languages and tools: C# (in the .NET framework), the Xerox tools lexc and xfst, and Microsoft Access. A Romanized transliteration scheme, similar to that of Penzl [2], is used instead of the actual Arabic script. Although most of the transliteration symbols are adopted from [2], some differ from that scheme: the diacritic symbols used by Penzl are replaced by alternative keyboard symbols in this work, because they are either difficult to type or not available on a keyboard. The symbols adopted from Penzl are shown in Table 6, and the additions made to them in Table 7.
Table 6: Adopted transliteration symbols
Alphabet    Transliteration    Alphabet    Transliteration
ا           aa                 ش           sh
ب           b                  ښ           ss
پ           P                  غ           gh
ت           T                  ف           f
ټ           Tt                 ق           q
ج           Dzh                ک           k
ځ           Dz                 ګ           g
چ           Tsh                ل           l
د           D                  م           m
ډ           Dd                 ن           n
ر           R                  ڼ           nn
ړ           Rr                 و           w
ز           Z                  ى           y
ژ           Zh                 ي           i
ږ           Zz                 ې           ee
س           S                  و           u

Table 7: Additional transliteration symbols
Alphabet    Transliteration    Alphabet    Transliteration
ؤ           Aw                 ع           ah
و           Oo                 ۀ           @
ح           h?                 ۍ           @i
خ           X                  ے           e
ذ           z?                 ـ)          A?
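Because some transliteration symbols are prefixes of others (for example n and nn), a converter from transliteration to Arabic script has to prefer the longest matching symbol. The Python sketch below illustrates this with a greedy longest-match scan over a subset of the symbols in Table 6. It is an assumption-laden illustration rather than the transducer built in this work: the function name is hypothetical, only some symbols are included, and no contextual spelling rules are modeled.

```python
# A subset of the transliteration symbols from Table 6.
# Note that both "w" and "u" map to the same letter in the table.
TRANSLIT = {
    "aa": "ا", "b": "ب", "sh": "ش", "ss": "ښ", "gh": "غ", "f": "ف",
    "q": "ق", "g": "ګ", "l": "ل", "m": "م", "n": "ن", "nn": "ڼ",
    "w": "و", "y": "ى", "i": "ي", "ee": "ې", "u": "و",
}

# Try longer symbols first so that "nn" wins over "n" followed by "n".
_SYMBOLS = sorted(TRANSLIT, key=len, reverse=True)


def transliterate(text):
    """Convert a transliterated string to Arabic script by greedy longest match."""
    output = []
    position = 0
    while position < len(text):
        for symbol in _SYMBOLS:
            if text.startswith(symbol, position):
                output.append(TRANSLIT[symbol])
                position += len(symbol)
                break
        else:
            raise ValueError(
                f"no transliteration symbol at position {position}: {text[position:]!r}"
            )
    return "".join(output)


if __name__ == "__main__":
    print(transliterate("shb"))
    print(transliterate("nnaa"))
```

Raising an error on an unknown symbol, rather than skipping it, keeps typing mistakes in the input visible to the user.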
All the FSTs were implemented in lexc. The resulting binary files were opened in xfst and saved as text files listing the lexical strings and their corresponding surface strings. These files were then loaded into MS Access database tables. One of these tables is shown in figure 4.
Figure 4: The MS-Access nouns' table
Thus, a lexicon is obtained in which all the inflection rules for verbs, nouns and adjectives are incorporated. This lexicon contains the various possible inflections of 200 root verbs, 250 root nouns and 140 root adjectives. It constitutes the morphological analyzer for the verbal, nominal and adjectival inflectional system of Pashto.
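Once surface forms and their lexical forms are stored in a table, analysis reduces to a lookup by surface form. The Python sketch below shows the idea with an in-memory SQLite table standing in for the MS Access database; this substitution, the column names, the tag spellings and the sample rows are all assumptions made for illustration and are not taken from the actual lexicon.

```python
import sqlite3


def build_lexicon(pairs):
    """Create an in-memory VERB table from (surface, lexical) pairs."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE VERB (surface TEXT, lexical TEXT)")
    connection.executemany(
        "INSERT INTO VERB (surface, lexical) VALUES (?, ?)", pairs
    )
    connection.commit()
    return connection


def analyze(connection, surface):
    """Return every lexical form stored for the given surface form."""
    rows = connection.execute(
        "SELECT lexical FROM VERB WHERE surface = ? ORDER BY lexical",
        (surface,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    # Hypothetical rows; "stem" is a placeholder and the tags are invented.
    lexicon = build_lexicon([
        ("stemee", "stem+2SG"),
        ("stemee", "stem+3FPL+Past"),
        ("stemu", "stem+1PL"),
    ])
    print(analyze(lexicon, "stemee"))
    print(analyze(lexicon, "unknown"))
```

A word that is not in the table yields an empty list, which corresponds to the restriction that only listed verbs, nouns and adjectives can be analyzed.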
Several components were designed and developed during this research work in addition to the morphological analyzer. These components are combined to form the overall corpus-based finite state morphological analyzer for Pashto. Each component is discussed briefly below.
The first component is a finite state morphological analyzer. It analyzes any verbal, nominal or adjectival inflection morphologically, provided that the verb, noun or adjective to be analyzed is listed in the lexicon. This morphological analyzer is the result of implementing the verbal, nominal and adjectival FSTs.
The second component is a monitor corpus of written Pashto data [1]. This corpus currently contains 24,000 words of Pashto data, and its size is increasing. The corpus is used for evaluating the results of the finite state morphological analyzer.
The third component is a Microsoft Access database, in which the output of xfst is saved. This database contains a VERB, a NOUN and an ADJECTIVE table. All the surface forms and the corresponding lexical forms obtained from the implementation of the FSTs are stored in these tables.
The fourth component is an English-to-Pashto spelling transducer, one of the most important components designed and developed during this research work. This transducer maps a transliterated string to an Arabic-scripted Pashto word.
All these components are integrated as depicted in the flowchart in figure 5.
Figure 5: The flowchart of the whole system
By combining all the components, an application is developed that takes a transliterated Pashto verbal, nominal or adjectival inflection as input, converts it into an Arabic-scripted Pashto word, morphologically analyzes it, and provides all the sentences from the Pashto corpus in which the input word is used. A sample interaction with this application, which has a user-friendly interface, is given in figure 6.
Figure 6: Sample interaction with the system
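The final step of the application, retrieving the corpus sentences that contain the input word, can be sketched as follows in Python. This is a minimal illustration and not the C# implementation: the sentence delimiters, the whitespace tokenization and the function names are assumptions, and the sample corpus uses placeholder tokens instead of Pashto text.

```python
import re

# Assumed sentence delimiters: Latin punctuation, the Arabic question mark
# (U+061F) and the Arabic full stop (U+06D4).
SENTENCE_END = re.compile(r"[.!?\u061F\u06D4]+")


def split_sentences(text):
    """Split corpus text into trimmed, non-empty sentences."""
    return [part.strip() for part in SENTENCE_END.split(text) if part.strip()]


def find_sentences(corpus_text, word):
    """Return the sentences in which the word occurs as a whole token."""
    return [
        sentence
        for sentence in split_sentences(corpus_text)
        if word in sentence.split()
    ]


if __name__ == "__main__":
    # Placeholder corpus; real input would be Arabic-scripted Pashto text.
    corpus = "aaa bbb ccc. ddd bbb. eee fff\u06D4 bbbx"
    print(find_sentences(corpus, "bbb"))
```

Matching on whole tokens rather than substrings avoids returning sentences in which the input merely appears inside a longer word.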
In the next section, the accuracy of the system is discussed and a brief error analysis is provided.