Proceedings of the Conference on Language & Technology 2009

A Corpus-Based Finite State Morphological Analyzer for Pashto

Fatima Tuz Zuhra and Mohammad Abid Khan

Department of Computer Science, University of Peshawar, Peshawar, Pakistan

[email protected], [email protected]

Abstract

This paper describes the development of an inflectional morphological analyzer that can analyze the different inflections of a Pashto verb, noun or adjective. The system is corpus-based. It accepts input in the form of a transliterated Pashto verbal, nominal or adjectival inflection, converts it to its Arabic-script Pashto equivalent, morphologically analyzes the word, and then searches for and displays all the sentences in the corpus in which the word is used.

1. Introduction

Pashto is a morphologically rich language. Among the many applications of Natural Language Processing (NLP), one is a system that provides all the morphological tags of a given word and retrieves examples of its use from a corpus of real-life data. This work deals with the design and development of such an application. The developed system can morphologically analyze any verbal, nominal or adjectival inflection and provide examples of its use. These examples are retrieved from the Pashto corpus [1].
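
The input-to-examples flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the mapping table `TRANSLIT`, the function names and the two toy sentences are all assumptions made for the example, since this section gives neither the transliteration scheme nor the corpus format. The morphological analysis step is left out here.

```python
# Hypothetical Latin-to-Arabic-script table (an assumption for this
# sketch; the paper's actual transliteration scheme is not shown here).
TRANSLIT = {"k": "ک", "o": "و", "r": "ر"}


def transliterate(word):
    """Map each Latin character to its Arabic-script counterpart."""
    return "".join(TRANSLIT[ch] for ch in word)


def find_examples(word, sentences):
    """Return every sentence that contains `word` as a whole token."""
    return [s for s in sentences if word in s.split()]


# Toy stand-in for the corpus: one sentence per string.
corpus = ["کور لوی دی", "دا کتاب دی"]

query = transliterate("kor")             # Arabic-script form of the input
examples = find_examples(query, corpus)  # sentences that use the word
```

Matching on whole whitespace-separated tokens, rather than on substrings, keeps a short word from matching inside longer, unrelated words.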

The system developed in this work has several uses. A linguist can use it to morphologically analyze a particular word and see examples of its everyday use. Another, very important, use is in the development of a part-of-speech (POS) tagger for the Pashto language.

The rest of the paper is organized as follows. Section 2 provides a brief overview of the morphology of Pashto verbs, nouns and adjectives. Section 3 covers the analysis of verbal, nominal and adjectival inflections. Section 4 describes the modeling and design of the morphological analyzer. Section 5 discusses its implementation. Section 6 provides details of the overall corpus-based morphological analyzer for Pashto.

2. A brief overview of Pashto morphology

Before starting the computational work, we studied the work of several Pashto linguists, which is briefly summarized here: Penzl [2], Khattak [3], Tegey and Robson [4], and Babrakzai [5]. Their work forms the basis for the research presented in this paper.

Khattak [3] identifies the different facets for which a Pashto verb inflects. He says, “The formal distinctions of the Pashto verb reflect a variety of categories: tense, aspect, mood and voice. Referring to the NPs in the subject or object position, the verb also inflects for person, number and gender.”

Khattak [3] further says that the morphology of the Pashto verb shows only two simple tenses: present and past. The future is expressed with the help of the modal clitic ba.

Babrakzai [5] gives the basic structure of a Pashto verb shown below, where # marks the potential positions for clitics.

Verb = [aspect # negative # stem + agreement #]
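
One way to make this template concrete is to treat each slot as a field of a small record. The sketch below is only an illustration under stated assumptions: the class name `VerbForm`, its methods and the transliterated morphemes in the usage lines are hypothetical and are not taken from the paper or from Babrakzai [5]. For simplicity, a clitic position is emitted only after a slot that is actually filled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VerbForm:
    """Slots of the template: aspect # negative # stem + agreement #."""
    stem: str
    agreement: str
    aspect: Optional[str] = None
    negative: Optional[str] = None

    def slots(self) -> List[str]:
        """Filled slots in template order; '#' marks a clitic position."""
        out: List[str] = []
        for prefix in (self.aspect, self.negative):
            if prefix is not None:
                out += [prefix, "#"]
        # Stem and agreement are joined by '+', so no clitic can
        # intervene between them.
        out += [self.stem + self.agreement, "#"]
        return out

    def surface(self) -> str:
        """The form without clitic markers, slots separated by spaces."""
        return " ".join(s for s in self.slots() if s != "#")


# Illustrative morphemes only (assumed transliteration).
form = VerbForm(stem="lik", agreement="əm", aspect="wə", negative="nə")
layout = form.slots()
text = form.surface()
```

Keeping the clitic positions explicit in `slots()` makes it straightforward to later insert a clitic, such as the modal clitic ba mentioned above, at any marked position without touching the stem and its agreement suffix.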

Babrakzai [5] defines agreement as follows:

“System of inflection that records a nominal’s inherent features (usually person, number, gender/ or case) on another category, generally a verb, adjective or a determiner”.

According to Tegey and Robson [4], agreement is indicated by personal endings, i.e. suffixes that follow the verb stem and show person and number.
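
The idea of reading person and number off a personal ending can be illustrated with a table-driven suffix lookup. This is a deliberate simplification rather than a full finite-state transducer, and it is not the authors' analyzer. The endings in `ENDINGS` are illustrative present-tense endings in an assumed transliteration, and the stem lexicon passed to `analyze` is hypothetical.

```python
# Illustrative present-tense personal endings (assumed transliteration,
# not taken from the paper). One ending may carry several tag sets.
ENDINGS = {
    "əm": [("1", "sg")],
    "e": [("2", "sg")],
    "i": [("3", "sg"), ("3", "pl")],
    "u": [("1", "pl")],
    "əy": [("2", "pl")],
}


def analyze(word, stems):
    """Return (stem, person, number) for every split of `word` into a
    known stem followed by a known personal ending."""
    results = []
    for ending, tags in ENDINGS.items():
        if word.endswith(ending):
            stem = word[: -len(ending)]
            if stem in stems:
                results.extend((stem, person, number) for person, number in tags)
    return results


lexicon = {"lik"}                    # hypothetical one-stem lexicon
first = analyze("likəm", lexicon)    # one analysis
third = analyze("liki", lexicon)     # ambiguous: singular or plural
```

Returning a list rather than a single tag lets the caller see every reading when an ending is ambiguous, which is the behaviour a morphological analyzer is generally expected to have.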

Gender marking is restricted to the third person forms of simple verbs and to the third person singular forms of the auxiliary [2], i.e. the copula 'to be' [6]. In the Yousafzai dialect, however, gender is also marked in the third person plural form of this auxiliary [7].

