Download Classifiers for parsing and more Study Guides, Projects, Research Natural Language Processing (NLP) in PDF only on Docsity!
The Selection of a Classifier for a
Data-driven Parser
Author One
, Author
Institution, Department, Address, City, Country
Institution, Department, Address, City, Country
Abstract. There is a large number of classifiers that can be used
for generating a parse model; i.e., as an oracle for guiding data-
driven parsers when parsing natural languages. In this paper we
present a general and simple approach for generating a parse model.
Additionally, we present a large number of experiments on various
classifiers. We also present the effect of various parse models that
are generated from different classifiers on a data-driven parser to see
they way each model contributes to parsing performance.
1 Introduction
The objective of this study is to present an approach for generating dif-
ferent parse models, which are used for guiding parsers during natural language
parsing, from different machine learning classifiers. There are various classifica-
tion algorithms that can be used for this purpose. However, each classifier may
learn from a set of data in different ways, which means that they may affect
parsing performance in different ways. In Section 3 we present a data-driven
parser that we have used for evaluating different parse models that are generated
from different classifiers. In Section 4 we show a simple approach for generating
a parse model from the J48 classifier while in Section 6.1 we show the accuracy
of a large number of classifiers. Section 6.2 covers the effect of each parse model
on parsing performance. Finally, in Section 7 we compare our parser with the
arc-standard algorithm of MaltParser.
2 Dataset
We have used the Penn Arabic Treebank (PATB) (Maamouri and Bies,
2004) part 1 version 3 for evaluating various classifiers. We have also used
this dataset for training and testing our data-driven parser, which is a re-
implementation of the arc-standard version of MaltParser (Nivre et al., 2010;
Kuhlmann and Nivre, 2010; Nivre et al., 2006), and the arc-standard algorithm.
We have converted the phrase structure trees of the PATB to dependency
structure trees using the standard conversion algorithm for transforming phrase
structure trees to dependency trees, as described by Xia and Palmer (2001).
In order to perform a 5-fold validation, we have systematically generated
five sets of testing data and five sets of training data from the treebank, where
the testing data is not part of the training data. The training data contains
approximately 4853 sentences. The average length of sentences is 29 words and
the total number of testing sentences in each fold is about 970 sentences.
3 A Basic Shift-Reduce Parser
Our parser is based on the arc-standard algorithm of MaltParser (Kuhl-
mann and Nivre, 2010). This algorithm deterministically generates dependency
trees using two data-structures: a queue of input words, and a stack of items
that have been looked at by the parser. Three parse actions are applied to the
queue and the stack: SHIFT, LEFT-ARC and RIGHT-ARC (we will write LA
and RA for LEFT-ARC and RIGHT-ARC respectively to save space). SHIFT
moves the head of the queue onto the top of the stack, LA makes the head of
the queue a parent of the topmost item on the stack and pops this item from the
stack, and RA makes the topmost item on the stack a parent of the head of the
queue; RA removes the head of the queue and moves the topmost item on the
stack back to the queue. MaltParser uses a support vector machine classifier for
generating a parse model from a set of parsed trees, which is used for predicting
the next parse action given the current state of the parser.
We re-implement the arc-standard algorithm so that we can train it on
different classifiers and also run it non-deterministically. We will call our parser
NDParser. At each parse step, we generate a state for LA, RA, and SHIFT, and
we will assign different scores to each state. A score is computed for each newly
generated state by computing two different scores: (i) a score that is based on
the recommendation made by a parse model. For example, for a SHIFT state
the parser gives a score of 1 if a SHIFT operation is recommended by the model.
Otherwise a score of 0 is given (and the same applies to LA and RA). (ii) the
score of the given state (which is the state that the new parse state is generated
from). The sum of these two scores is assigned to the newly generated state.
The advantage of assigning a score to each parse state is that we can rank
a collection of parse states by using their scores and then process the state with
situation – for instance, in a situation like in step 6 in Fig. 2 the parser should
use SHIFT instead of RA for the reason explained above.
Dependency relations: (sat>cat) (sat>on) (cat>the) (on>mat) (mat>the)
Steps Action Queue Stack Arcs
1 θ [the,cat,sat,on,the,mat] [] θ 2 SHIFT [cat,sat,on,the,mat] [the] θ 3 LA [cat,sat,on,the,mat] [] A1=(cat>the) 4 SHIFT [sat,on,the,mat] [cat] A 5 LA [sat,on,the,mat] [] A2=A1∪(sat>cat) 6 SHIFT [on,the,mat] [sat] A 7 SHIFT [the,mat] [on,sat] A 8 SHIFT [mat] [the,on,sat] A 9 LA [mat] [on,mat] A3=A2∪(mat>the) 10 RA [on] [sat] A4=A3∪(on mat) 11 RA [sat] [] A5=A4∪(sat>on) 12 SHIFT [] [sat] A 13 θ [] [sat] A
Fig. 2: Action sequence for parsing the sentence ‘the cat sat on the mat’.
Given a set of such data-points, it is possible to extract and record the
parse states and train a classifier for building a parse model, which can be used
for predicting parse operation; i.e., it can be used for guiding the parser. The
task here is to classify intermediate states of the parser into three groups: cases
where SHIFT should be performed, cases where LA should be performed, and
cases where RA should be performed.
Step 3. Preparing recorded parse states for classification: from the set of
parse states that we obtain in step 2, we populate an .arff file with the correct
data format, i.e., the format that is accepted by WEKA. An example of a set of
WEKA-style data format is shown in Fig. 3, which is based on the parse states
shown in Fig. 2. Here we have extracted the word forms as a feature for learning
but it is possible to use a number of different features (such as POS tags, word
position etc.) as values for the queue and the stack attribute parameters.
@relation states @attribute queue_word_pos_1{‘the’,‘cat’,‘sat’,‘on’,‘mat’,‘-’} @attribute queue_word_pos_2{‘cat’,‘sat’,‘on’,‘the’,‘mat’, ‘-’} @attribute stack_word_pos_1{‘-’, ‘the’,‘cat’,‘sat’,‘on’} @attribute stack_word_pos_2{‘-’,‘sat’,‘on’} @attribute parse_action{‘SHIFT’, ‘LEFT-ARC’, ‘RIGHT-ARC’} @data ‘the’, ‘cat’, ‘-’, ‘-’, ‘SHIFT’ ‘cat’, ‘sat’, ‘the’, ’-’, ‘LEFT-ARC’ ‘cat’, ‘sat’, ‘-’, ‘-’, ‘SHIFT’ ‘sat’, ‘on’, ‘cat’, ‘-’, ‘LEFT-ARC’ ‘sat’, ‘on’, ‘-’, ‘-’, ‘SHIFT’ ‘on’, ‘the’, ‘sat’, ‘-’, ‘SHIFT’ ‘the’, ‘mat’, ‘on’, ‘sat’, ‘SHIFT’ ‘mat’, ‘-’, ‘the’, ‘on’, ‘LEFT-ARC’ ‘mat’, ‘-’, ‘on’, ‘sat’, ‘RIGHT-ARC’ ‘on’, ‘-’, ‘sat’, ‘-’, ‘RIGHT-ARC’ ‘sat’, ‘-’, ‘-’, ‘-’, ‘SHIFT’
Fig. 3: An example of data for a .arff file.
Additionally, one can use many different window sizes for the queue and the
stack in the data selection as instances for the classification algorithms to learn
from. In Table 3, we use the window sizes of two items for the queue and two
items for the stack, while the dash mark (‘-’) represents an empty item where
the queue or the stack did not contain an item in the given position.
Step 4. Training a classifier using the .arff file: we supply WEKA with
the data prepared in step 3 (i.e., the .arff file) and then we select a classification
algorithm for learning. Fig. 4 is a screenshot of the J48 classification algorithm
output from WEKA. The classification rules inferred by a decision tree classifier
take the form of questions, such as Is the POS tag on the top of the stack (AB-
BREV)?, and a possible answer where the possible answer is a further question
such as (Is POS tag in the head of queue (ABBREV)?) or a classification such
as (This is the kind of situation where you should carry out a RIGHT-ARC.)
Fig. 4: A screenshot of the J48 classifier output using WEKA.
Step 5. Generating a parse model from the classification output: finally, we
convert the output produced by the classification algorithm to an appropriate
question-answer model that we can use for guiding the parser to parse new
sentences. Fig. 5 is a sample of some questions and answers we have extracted
from the J48 (Quinlan, 1992) classifier’s output.
5 Labelled Attachment Score
In this section we show the way we obtain labelled attachment scores,
which is largely different from the way this is implemented in the standard
implementation of MaltParser. As in the arc-standard algorithm, for each de-
pendency relation between two words, a semantic label is attached to indicate
the semantic role of the daughter item with its parent. However, the way we
assign labels to dependency relations during parsing is that we extract patterns
from the training data during the training phase. This contrasts with the ap-
proach used in MaltParser whereby labels are predicted with the LA and RA
We consider a classifier appropriate for producing a parse model if it meets
two requirements: (i) it produces good classification accuracy. Although the
accuracy of the classifiers that are presented in Table 1 may not directly reflect
the accuracy of a parser that uses its recommendations, a classifier that produces
a high level of accuracy is more likely to assist a parser to make more informed
parse decisions at each parse step than a classifier that produces a low level of
accuracy; and (ii) its output can be used for generating a parse model which
can be used for making recommendations to a data-driven parser, for example,
what action (SHIFT, LA, or RA) the parser should take in a specific situation.
We have used various features for training different classification algo-
rithms. These features included POS tags, word forms, word locations in sen-
tences, their spans (i.e., their start and end positions in sentences). Additionally,
we have used a combination of these features such as word forms with POS tags,
word forms with word location or word spans, and similar combination of POS
tags with other features. Also various window sizes are used for the queue and
the stack, ranging between two items to four items. The use of these features
for training each classifier along with the classification accuracy is presented in
Table 1.
During the evaluation of the classifiers, some widely used classifiers did
not yield encouraging results. For example, the LiBSVM classifier (Chang and
Lin, 2001) which is used in MaltParser did not perform well with the set of
features that we have supplied. It only managed to learn successfully from one
feature (POS tags), while the accuracy was well below the accuracy of some
of the other classifiers. The entries for LiBSVM in Table 1 are incomplete
because training takes so long (3 days per case) that future experiments seemed
infeasible. However, the fact that it produces no better classification than the
J48 classifier in the cases that we have looked at suggests that it is unlikely to
substantially outperform it in the remaining cases.
From the large number of experiments we have conducted on several classi-
fiers, we will evaluate NDParser by training it using the classification algorithms
that produced a high classification accuracy in the following section.
6.2 Evaluating NDParser with Various Classi-
fiers
As presented in Table 1 the classification accuracy varies because each
classifier learns differently from the set of training data. In this section, we in-
vestigate the effect of different classifiers on parsing. Our objective is to identify
the algorithms that help the parser perform best in terms of accuracy and speed
(We measure speed as second per dependency relation).
These experiments also highlight whether generating different parsing mod-
els by using different classifiers contribute in different ways to parsing perfor-
mance. The optimal classification of intermediate states may not necessarily
Table 1: Classification accuracy with various features and settings.
Word (%) - - - - - - - - - - - Word + location (%) - - - - - - - - - - - Word + location + span (%) - - - - - - - - - - - POS + location (%) - - - - - - - - - - - POS + location + span (%) - - - - - - - - - - - POS + span (%) - - - - - - - - - - - Word + POS (%) - - - - - - - - - - - Word + POS + location (%) - - - - - - - - - - - Word + POS + location + span (%) - - - - - - - - - - -
- J
- Items on Queue
- Items on Stack
- Word (%) 68.08 68.12 68.24 67.81 68.29 68.37 68.53 68.10 68.56 68.67 68.
- Word + location (%) 72.08 71.79 71.92 72.03 72.23 71.88 71.67 72.10 72.41 72.11 71.
- Word + location + span (%) 72.65 72.08 71.73 72.50 72.76 72.25 72.00 72.49 72.87 72.45 72.
- Word + span (%) 70.67 70.46 70.17 70.66 70.81 70.64 70.43 70.68 70.83 70.79 70.
- POS (%) 84.76 84.88 84.94 84.48 85.63 85.77 85.80 84.75 85.89 86.05 86.
- POS + location (%) 85.27 85.27 85.27 85.04 85.89 85.91 85.92 85.25 86.96 86.08 86.
- POS + location + span (%) 85.12 85.11 85.23 85.10 85.81 85.84 85.92 85.28 85.95 85.97 85.
- POS + span (%) 84.98 85.00 85.00 84.93 85.71 85.69 85.67 85.09 85.88 85.88 85.
- Word + POS (%) 85.25 85.25 85.28 85.05 86.23 86.24 86.24 85.25 86.46 86.47 86.
- Word + POS + location (%) 85.93 85.93 85.83 85.77 86.57 86.53 86.45 85.84 86.63 86.57 86.
- Word + POS + location + span (%) 85.90 85.84 85.83 85.89 86.48 86.54 86.47 85.94 86.49 86.55 86.
- Word + POS + span (%) 85.87 85.76 85.79 85.89 86.50 86.45 86.43 85.90 86.53 86.49 86.
- Items on Queue LIBSVM
- Items on Stack
- POS (%) 74.40 74.73 74.62 74.32 75.41 75.43 75.39 74.58 75.62 75.63 75. Word + span (%) - - - - - - - - - - - - Id Word + POS + span (%) - - - - - - - - - - -
- Items on Queue
- Items on Stack
- Word (%) 67.85 67.71 67.65 67.64 67.94 67.77 67.68 67.74 67.97 67.79 67.
- Word + location (%) 68.70 63.63 62.37 67.83 65.22 63.04 61.89 66.92 64.41 62.55 61.
- Word + location + span (%) 68.61 65.65 63.69 67.77 64.85 63.18 62.26 66.89 64.23 62.79 62.
- Word + span (%) 67.08 64.46 62.42 61.21 66.14 63.60 61.84 65.24 62.81 61.25 60.
- POS (%) 83.15 83.04 81.64 83.74 83.41 81.78 80.54 82.31 81.47 79.70 78.
- POS + location (%) 77.55 75.79 74.80 76.25 75.48 75.04 74.95 74.94 74.83 74.65 74.
- POS + location + span (%) 77.57 75.71 74.51 76.31 75.42 74.92 74.83 75.13 74.85 74.63 74.
- POS + span (%) 77.42 75.51 74.49 74.28 76.10 75.17 74.59 74.92 74.64 74.41 74.
- Word + POS (%) 83.62 82.25 81.02 83.34 82.89 81.29 80.36 81.76 81.04 79.67 79.
- Word + POS + location (%) 77.36 75.84 75.13 76.31 75.73 75.39 75.33 75.21 75.29 75.07 75.
- Word + POS + location + span (%) 77.38 75.72 74.92 76.39 75.64 75.22 75.15 75.36 75.21 75.00 74.
- Word + POS + span (%) 77.23 75.54 74.68 74.55 76.17 75.45 75.04 75.18 75.04 74.81 74.
- Items on Queue RandomTree
- Items on Stack
- Word (%) 67.72 68.04 68.00 68.00 67.82 68.27 68.26 68.32 68.01 68.47 68.
- Word + location (%) 71.36 70.67 69.48 70.25 71.32 70.64 69.35 68.67 71.05 70.19 69.
- Word + span (%) 69.97 69.27 68.23 67.46 69.80 68.99 68.25 67.39 69.63 68.67 67. Word + location + span (%) 72.02 70.55 - - 71.83 70.27 - - 71.50 69.937 -
- POS (%) 83.67 84.82 84.50 83.71 84.47 85.28 84.71 84.26 84.26 84.78 84.
- POS + location (%) 83.04 82.93 80.76 79.18 82.77 81.41 80.15 78.84 81.71 80.09 78.
- POS + location + span (%) 82.54 81.00 79.17 76.32 81.78 79.41 78.02 76.26 80.27 78.79 77.
- POS + span (%) 83.15 81.47 79.89 79.19 81.50 80.89 78.69 77.78 80.60 79.93 77.
- Word + POS (%) 84.08 85.02 84.34 83.62 84.80 85.37 84.34 83.57 84.36 84.46 83.
- Word + POS + location (%) 83.33 82.06 81.23 80.10 82.02 81.84 80.62 79.17 81.24 80.27 79.
- Word + POS + location + span (%) 82.91 80.68 78.92 77.35 81.75 79.59 77.50 76.33 80.48 78.09 76.
- Word + POS + span (%) 82.89 81.75 79.46 78.42 81.71 80.58 77.87 77.34 80.37 78.95 76.
- Items on Queue NaiveBayes
- Items on Stack
- Word (%) 66.99 61.83 60.13 65.70 65.95 65.48 63.37 64.57 65.06 64.68 64.
- Word + location (%) 60.72 57.56 57.02 64.45 64.12 57.03 55.68 62.10 62.21 54.67 52.
- Word + location + span (%) 50.95 47.39 49.98 55.98 47.29 44.51 47.60 51.12 45.28 42.79 45.
- Word + span (%) 56.69 51.92 53.47 61.93 55.32 48.75 51.09 58.75 52.30 46.46 48.
- POS (%) 76.58 74.79 70.42 76.95 76.78 76.19 74.02 75.40 76.01 75.38 74.
- POS + location (%) 73.49 66.60 64.05 74.76 74.70 71.13 67.13 71.35 72.39 70.68 67.
- POS + location + span (%) 65.37 59.14 58.43 69.06 65.72 58.22 57.11 63.75 61.27 55.49 54.
- POS + span (%) 69.93 62.92 61.15 71.68 70.93 65.13 61.31 67.19 67.62 63.37 59.
- Word + POS (%) 76.72 71.35 66.00 73.84 74.67 73.66 71.04 70.76 72.12 72.21 71.
- Word + POS + location (%) 73.52 64.61 62.25 71.93 72.50 69.20 63.57 68.08 69.13 67.89 62.
- Word + POS + location + span (%) 65.40 58.09 57.62 67.91 64.58 56.72 55.55 62.92 59.78 53.95 52.
- Word + POS + span (%) 69.66 61.49 59.98 69.87 69.69 62.84 59.08 65.67 65.33 60.50 56.
- Items on Queue DecisionStump
- Items on Stack
- Word (%) 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.
- Word + location (%) 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.
- Word + location + span (%) 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.
- Word + span (%) 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.
- POS (%) 60.58 60.58 60.58 60.58 60.58 60.58 60.58 60.58 60.58 60.58 60.
- POS + location (%) 60.58 60.58 60.58 60.58 60.58 60.58 60.58 60.58 60.58 60.58 60.
- POS + location + span (%) 60.58 60.58 60.58 60.58 60.58 60.58 60.58 60.58 60.58 60.58 60.
- POS + span (%) 60.58 60.58 60.58 60.58 60.58 60.58 60.58 60.58 60.58 60.58 60.
- Word + POS (%) 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.
- Word + POS + location (%) 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.
- Word + POS + location + span (%) 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.
- Word + POS + span (%) 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.77 63.
a training feature is lower than when using POS tags and their locations as
training features. Using only POS tags as a training feature yielded 86.04%
classification accuracy while using POS tags and locations as training features
yielded 86.96% accuracy for classification.
However, although using only POS tags as a training feature yielded lower
classification accuracy than using POS tags and location as training features,
when we look at parsing accuracy the reverse is true. The second and fourth rows
of Table 2 show that using only POS tags as a training feature yielded higher
parsing accuracy than using POS tags and location as training features. Using
only POS tags as a training feature for data classification with the J48 classifier
improved the parsing accuracy by 4.2% for unlabelled attachment score, 4% for
labelled attachment score and 0.3% for labelled accuracy. Additionally, using
only POS tags as a training feature for the same classifier makes the parser 44%
quicker than using POS tags and location as training features.
Furthermore, using a smaller window size for the queue and the stack
improved the performance of our parser. Training our parser with the same
classifier (the J48 classifier) and with the same training feature (POS tags only)
but with a smaller window size (four items on the queue and three items on
the stack) improved the parsing performance, 0.04% for unlabelled attachment
score, 0.5% for labelled attachment score and 5.8% for its speed, over using a
large window size (four items on the queue and four items on the stack).
The reason for the improved parsing speed when training the parser on
smaller sets of feature is that the question set that the parser searches through
is smaller and hence it will find an answer more quickly compared to situations
where there are several features for searching through. The reason for the
improved parsing accuracy when training on smaller set of features is likely
because the parser learns consistently from a smaller set of feature than from
several features where the training is not large enough to learn more effectively.
7 A Comparison with the State-of-the-
Art Parser
In this section we compare our parser with MaltParser, as shown in Ta-
ble 3, where we have conducted a 5-fold cross validation on both parsers. We
have trained our parser using the J48 classifier with POS tags as features and
a window size of four items on the queue and three items on the stack. We
can note that our parser is 40.3% more efficient than MaltParser. Although
the unlabelled attachment accuracy of our parser is slightly lower than that of
MaltParser (0.7%), the labelled attachment score and the labelled score is more
accurate than MaltParser by 1% and 1.4% respectively. We believe that this
improved accuracy of labelled attachment score and labelled score is because our
parser have information about intermediate items between parent and daugh-
ter, which are collected during training (see Section 5 for more details) where
such information is not available to MaltParser. MaltParser learns models that
contain information about parent and daughter relations and their labels dur-
ing training phase where the information about the intermediate items between
parents and daughters that we use is not recorded.
Table 3: Parser performance of MaltParser and NDParser.
Parser UAS(%) LAS(%) LA(%) Speed
MaltParser 75.2 70.0 92.2 0.
NDParser 74.5 71.0 93.6 0.
The results for MaltParser in Table 3 were obtained by training and testing
it on the dependency trees that we extracted from the PATB. The structure
of these trees depends on the head-percolation table (Magerman, 1995) that is
used during the conversion process. It is likely that this underlies the differences
between our results from MaltParser and the results published by Nivre (2008)
(77.76% for UAS, 65.79% for LAS, and 79.30% for LA), where Niver’s results
for UAS are slightly better than the ones we obtained in this study, for LAS
and LA they are slightly worse.
8 Conclusion
In this paper we have presented a simple approach for evaluating and gen-
erating parse models from various machine learning classifiers. We have shown
that generating a parse model, which is used for guiding data-driven parsers,
from different classifiers affects parsing performance in different ways. We have
discovered that generating a parse model using a small set of features and set-
tings improves parsing accuracy and speed compared with using large features
and settings. We have presented a basic shift-reduce parser, which is based on
the arc-standard algorithm of MaltParser, and we have evaluated our parser
with a various parse models that were generated from different classification
algorithms, features and settings.
References
Chang, C.-c. and Lin, C.-J. (2001), ‘Libsvm: A Library for Support Vector
Machines’.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. and Written,
I. H. (2009), ‘The weka data mining software: An update’, SIGKDD Explo-
rations 11 (1).