





A new approach to automated essay scoring (aes) using memory networks and word embeddings. The study demonstrates state-of-the-art performance in 7 out of 8 essay sets and the efficiency of the model. The authors compare their model with other aes systems and discuss the importance of external memory in improving performance.






ABSTRACT
The need for automated grading tools for essay writing and open-ended assignments has received increasing attention due to the unprecedented scale of Massive Open Online Courses (MOOCs) and the fact that more and more students rely on computers to complete and submit their school work. In this paper, we propose an efficient memory networks-powered automated grading model. The idea of our model stems from the philosophy that with enough graded samples for each score in the rubric, such samples can be used to grade future work that is found to be similar. For each possible score in the rubric, a student response graded with the same score is collected. These selected responses represent the grading criteria specified in the rubric and are stored in the memory component. Our model learns to predict a score for an ungraded response by computing the relevance between the ungraded response and each selected response in memory. The evaluation was conducted on the Kaggle Automated Student Assessment Prize (ASAP) dataset. The results show that our model achieves state-of-the-art performance in 7 out of 8 essay sets and can be trained efficiently due to the simplicity of the model structure.
ACM Classification Keywords
I.2.7. ARTIFICIAL INTELLIGENCE: Natural Language Processing
Author Keywords Automated grading; neural networks; memory networks; word embeddings; natural language processing
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. L@S’17, April 20–21, 2017, Boston, MA, USA © 2017 Copyright held by the owner/author(s). Publication rights licensed to ACM. ISBN 123-4567-24-567/08/06... $15. DOI: http://dx.doi.org/10.475/123_
INTRODUCTION
Automated grading is a critical part of Massive Open Online Course (MOOC) systems and of any intelligent tutoring system (ITS) at scale. Many studies have been conducted to improve automated grading for assignments with simple fixed-form answers, short answers [3, 15, 19, 26, 21], or long-form answers [26, 2, 7, 14]. Some standardized tests, such as the Test of English as a Foreign Language (TOEFL) and the Graduate Record Examination (GRE), assess student writing skills. Manually grading these essays is time-consuming. Thus automated essay scoring (AES) systems have been used in these tests to reduce the time and cost of grading essays. Moreover, as MOOCs become widespread and the number of students enrolled in one course increases, the need for grading and providing feedback on written assignments is more critical than ever.
As part of the automated grading system, AES has been the subject of numerous efforts to improve its performance. AES uses statistical and Natural Language Processing (NLP) techniques to automatically predict a score for an essay based on the essay prompt and rubric. Essay writing is a common student assessment process in schools and universities. In this task, students are required to write essays of various lengths, given a prompt or essay topic.
Most existing AES systems are built on the basis of predefined features, e.g. number of words, average word length, and number of spelling errors, and a machine learning algorithm [4]. It is normally a heavy burden to find effective features for AES. Moreover, the performance of the AES systems is constrained by the effectiveness of the predefined features. Recently another kind of approach has emerged, employing neural network models to learn the features automatically in an end-to-end manner [29]. By this means, a direct prediction of essay scores can be achieved without performing any feature extraction. The model based on long short-term memory (LSTM) networks in [29] has demonstrated promise in accomplishing multiple types of automated grading tasks.
Neural Networks have achieved promising results on various NLP tasks, including machine translation [1, 5], sentiment analysis [6], and question answering [13, 31, 18, 28]. Neural Network models, in terms of NLP tasks, use word vectors to learn distributed representations from text. The advantages are that these models do not require hand-engineered features and can be trained to solve tasks in an end-to-end fashion.
Recent work [29] has exploited several Recurrent Neural Network (RNN) models to solve AES tasks. The results show that neural-based models outperform even strong baselines. Memory Networks (MN) [31, 18, 28] have been recently introduced to deal with complex reasoning and inference NLP tasks and have been shown to outperform RNNs on some complex reasoning tasks [28]. MN is a class of models which contains an external scalable memory and a controller to read from and write to that memory. The notion of neural networks with memory was introduced to solve complex reasoning and inference AI tasks which require remembering external contexts. Some work [18, 28] has shown the success of MN on different kinds of tasks, e.g. bAbI tasks [30], MovieQA, and WikiQA [18].
To our knowledge, no study has been conducted to investigate the feasibility and effectiveness of MN applied to automated grading tasks. In this study, we develop a generic model for such tasks using Memory Networks, inspired by their capability to store rich representations of data and reason over that data in memory. For each essay score, we select one essay exhibiting the same score from student responses as a sample for that grade. All collected sample responses are loaded into the memory of the model. The model is trained with the rest of the student responses in a supervised learning manner to compute the relevance between the representation of an ungraded response and that of each sample. The intuition is that as a part of a scoring rubric, a number of sample responses of variable quality are usually provided to students and graders to help them better understand the rubric. These collected responses are characterized by the expectations of quality described in the rubric. The model is expected to learn the grading criteria from these responses. We evaluate our model on a publicly available essay grading data set from the Kaggle Automated Student Assessment Prize (ASAP) competition (https://www.kaggle.com/c/asap-aes). Our experiments show that our model achieves state-of-the-art results on this dataset and that training of the model is efficient and cost-effective.
The rest of the paper is organized as follows. Section 2 gives an overview of related work in this research area. Section 3 provides detailed information on our model. Section 4 describes the ASAP dataset and evaluation metrics used to test our framework. Furthermore, it contains the details of our implementation and experimental setup to help other researchers replicate our work. In Section 5, we present the results of our framework and compare them with other models. Finally, we discuss the results and conclude the paper.
RELATED WORK
Automated Grading
MOOCs were introduced in 2008 and have recently become more popular. Most MOOC systems offer automated grading as an important feature to demonstrate that they can interact efficiently with a massive number of online users. Some specific assignment types have been adopted for automated grading because their correct answers have simple fixed forms, such as multiple-choice questions. Programming assignments are representative of these kinds of assignments, with simple-form answers such as "yes" or "no" [8, 11]. Beyond providing answers for one specific assignment, more efforts have been devoted to providing feedback on many different assignments according to the shared features of the program code [22, 25].
However, many assignment types cannot be handled well with simple feedback alone. Some studies have attempted to fix this problem using a semi-automatic grading approach. This kind of approach aims to optimize the collaboration between humans and machines when grading short answers [20, 3]. Another approach is to predict the score directly. One research direction of this approach applies information extraction techniques, either constructing specific answer patterns manually or training on a large dataset with strong supervision [2, 3, 24]. Another direction is to compare the students' answers with an established standard answer using an unsupervised text-similarity approach [21].
Most of the studies mentioned above deal with simple fixed-form or short-answer assignments. Some complex assignments have long-form answers instead of short, simple ones. Essay writing on a given topic is a typical assignment with long-form answers, and AES has become an important research branch of automated grading.
AES is generally treated as a machine learning problem. The existing AES solutions can be grouped from different points of view. Most AES systems developed so far are based on a number of predefined features. These features include essay length, number of words, lexicon and grammar, syntactic features, readability, text coherence, essay organization, and so on [4]. Recently, another line of work has emerged that treats whole essays as inputs and learns the features automatically in an end-to-end manner [29]. Without preliminary feature extraction, the workload is lightened. Moreover, prediction accuracy is improved by removing the dependency on the effectiveness of predefined features.
Based on the learning techniques used in existing solutions, we divide them into three categories: the regression-based approach, the classification-based approach, and the preference-ranking-based approach. PEG-system and E-rater are two examples of the regression-based approach. Specifically, when the score range of the essays is wide, the regression-based approach is normally adopted since it treats the essay score as a continuous value.
Besides essay writing, some complex assignments such as medicinal assignments have utilized regression models as well [9].
Figure 1. An illustration of memory networks for AES. The score range is 0 - 3. For each score, only one sample with the same score is selected from student responses. There are 4 samples in total in memory. Input representation layer is not included.
response representation to a d-dimensional feature space. The intuition is that responses with the same grade are highly likely to have similar representations in the feature space.
Memory Reading
After the weight vector p is calculated, the output of the memory is computed as a weighted sum of each piece of memory in m:

$$ o = \sum_i p_i \, m_i C^{T} \tag{2} $$
where C is a k × d matrix used to transfer the response representation to the feature space. The k × d matrix C may be identical to A, but in our experiments we found that training a separate C leads to better performance. From the equation, we can see that the weight vector p controls the amount of content that is read from each memory piece.
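The weighted read in equation 2 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `matvec_t` and `read_memory` and all toy values are hypothetical, and the weight vector p is taken as given rather than computed by the addressing step.

```python
def matvec_t(v, M):
    """Return v @ M^T for a row vector v (length d) and a k x d matrix M."""
    return [sum(vj * mj for vj, mj in zip(v, row)) for row in M]


def read_memory(p, memory, C):
    """Equation 2: o = sum_i p_i * (m_i C^T), a vector of length k."""
    k = len(C)
    o = [0.0] * k
    for p_i, m_i in zip(p, memory):
        projected = matvec_t(m_i, C)  # m_i C^T, length k
        for j in range(k):
            o[j] += p_i * projected[j]
    return o


# Toy example (made-up numbers): 2 memory pieces, d = 3, k = 2.
memory = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
C = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
p = [0.75, 0.25]  # assumed to come from the memory addressing step
o = read_memory(p, memory, C)
```

With these toy values the first memory piece contributes three times as much to o as the second, which is the sense in which p controls how much is read from each piece.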
Multiple Hops
The success of neural networks is due to their ability to learn multiple layers of neurons, where each layer transforms the representation at the previous level into a more abstract, higher-level representation. Inspired by this idea, we stack multiple memory addressing and memory reading steps together to handle multi-hop operations.
After receiving the output o from equation 2, the ungraded response u is updated with:
$$ u_2 = \mathrm{ReLU}(R_1 (u + o)) \tag{3} $$

where $R_1$ is a k × k matrix, $u = x A^{T}$, and $\mathrm{ReLU}(y) = \max(0, y)$. The memory addressing and memory reading steps are then repeated, using a different matrix $R_j$ on each hop j. The memory addressing step is modified accordingly to use the updated representation of the ungraded response:

$$ p_i = \mathrm{Softmax}(u_j \cdot m_i B) \tag{4} $$
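The hop loop formed by equations 2 to 4 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, B is assumed to be stored as a d × k matrix so that m_i B has length k, and `C_t` stands for C^T stored as a d × k matrix.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def row_times_matrix(v, M):
    """v @ M for a row vector v (length n) and an n x m matrix M."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def run_hops(u, memory, B, C_t, R_list):
    """Repeat addressing (Eq. 4), reading (Eq. 2) and the update (Eq. 3)."""
    for R in R_list:
        # Addressing: relevance of the current state to each memory piece.
        p = softmax([dot(u, row_times_matrix(m, B)) for m in memory])
        # Reading: weighted sum of the projected memory pieces.
        o = [0.0] * len(u)
        for p_i, m in zip(p, memory):
            for j, value in enumerate(row_times_matrix(m, C_t)):
                o[j] += p_i * value
        # Update: next state = ReLU(R (u + o)), with R a k x k matrix.
        summed = [a + b for a, b in zip(u, o)]
        u = [max(0.0, dot(row, summed)) for row in R]
    return u


# Toy example (made-up numbers): d = k = 2, identity matrices, two hops.
identity = [[1.0, 0.0], [0.0, 1.0]]
memory = [[1.0, 0.0], [0.0, 1.0]]
u_final = run_hops([1.0, 0.0], memory, identity, identity, [identity, identity])
```

In this toy setup each hop adds a vector whose entries sum to 1 (the attention weights), so the state drifts toward the memory piece it already resembles most.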
Output Layer
After a fixed number of hops H, the resulting state $u_H$ is used to predict a final score over the possible scores:

$$ \hat{s} = \mathrm{Softmax}(u_H W + b) \tag{5} $$

where W is a k × r matrix, r is the number of possible scores, and b is the bias value. Note that the number of output nodes equals the length of the score range. We calculate a distribution over all possible scores and select the most probable score as the prediction. The whole network is trained in an end-to-end fashion without any hand-engineered features, and the matrices A, B, C, W and $R_1, \ldots, R_H$ are learned through backpropagation and stochastic gradient descent by minimizing a standard cross-entropy loss between the predicted score $\hat{s}$ and the actual score s.
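A small sketch of the output layer and its loss, written in plain Python for illustration. The names `predict_score` and `cross_entropy` are hypothetical, the bias b is assumed to be a vector with one entry per possible score, and the toy numbers are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_score(u_H, W, b):
    """Equation 5: distribution over r scores and the index of the most probable one."""
    r = len(b)
    logits = [sum(u_H[i] * W[i][j] for i in range(len(u_H))) + b[j] for j in range(r)]
    probs = softmax(logits)
    best = max(range(r), key=lambda j: probs[j])
    return probs, best


def cross_entropy(probs, true_index):
    """Cross-entropy loss for a single response with a one-hot target."""
    return -math.log(probs[true_index])


# Toy example (made-up numbers): k = 2 state dimensions, r = 3 possible scores.
u_H = [1.0, 2.0]
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b = [0.0, 0.0, 0.0]
probs, best = predict_score(u_H, W, b)
loss = cross_entropy(probs, true_index=1)
```

The returned index counts from the lowest score in the rubric, so for a score range starting at a value other than 0 the predicted score would be the minimum score plus `best`.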
EXPERIMENTAL SETUP
Dataset
The dataset used in this study comes from the Kaggle Automated Student Assessment Prize (ASAP) competition sponsored by the William and Flora Hewlett Foundation (Hewlett). There are 8 sets of essays, and each set is generated from a single prompt. All responses collected in the dataset were written by students ranging from grade 7 to grade 10. The score range varies across essay sets. All essays were graded by at least 2 human graders. The average length of the essays differs for each essay set, ranging from 150 words to 650 words. Selected details for each essay set are shown in Table 1.
Evaluation Metric Quadratic weighted Kappa (QWK) is used to measure the agreement between the human grader and the model. We choose to use this metric because it is the official evaluation metric of the ASAP competition. Other work such as [4, 29, 24] that uses the ASAP dataset also uses this evaluation metric. QWK is calculated using
$$ \kappa = 1 - \frac{\sum_{i,j} w_{i,j} O_{i,j}}{\sum_{i,j} w_{i,j} E_{i,j}} $$

where O, w and E are the matrices of observed scores, weights, and expected scores respectively. Entry $O_{i,j}$ is the number of student responses that receive a score i from the first grader and a score j from the second grader (the model in our experiment). The weights are $w_{i,j} = (i - j)^2 / (N - 1)^2$, where N is the number of possible scores. Matrix E is calculated by taking the outer product between the score histogram vectors of the two graders and normalizing it to have the same sum as O.
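The QWK definition above translates directly into code. The following is a plain-Python sketch that follows the description term by term; the function name and the toy score lists are hypothetical, and the choice to raise an error when the denominator is zero is an assumption, since the text does not cover that case.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """QWK between two lists of integer scores on the same responses."""
    n = max_score - min_score + 1  # N, the number of possible scores
    if n < 2:
        raise ValueError("need at least two possible scores")
    # O[i][j]: responses scored i by the first grader and j by the second.
    observed = [[0.0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1.0
    total = float(len(rater_a))
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            # E: outer product of the histograms, scaled to the same sum as O.
            expected = hist_a[i] * hist_b[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected
    if denominator == 0.0:
        raise ValueError("kappa is undefined when both graders use one score")
    return 1.0 - numerator / denominator


# Toy examples (made-up scores on a 0-3 range).
kappa_perfect = quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 0, 3)
kappa_reversed = quadratic_weighted_kappa([0, 1, 2, 3], [3, 2, 1, 0], 0, 3)
```

Identical score lists give a kappa of 1, while the fully reversed list in the second example gives a value of about -1, showing that the metric penalizes large disagreements more heavily than small ones.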
Implementation Details
The model was implemented using the TensorFlow framework [16]. We used the Adam stochastic gradient descent optimizer [12] for optimizing the learned parameters. The learning rate was set to 0. and the batch size for each iteration to 32 for all models. As the final prediction layer, we used a fully connected layer with a softmax activation function on top of the output from the memory reading layer. The model learned the parameters by minimizing a standard cross-entropy loss between the predicted score and the correct score.
For regularization we used L2 loss on all learned parameters with lambda set to 0.3 and limited the norm of the gradients to be below 10. Moreover, we added gradient noise sampled from a Gaussian distribution with mean 0 and variance 0. when training the memory networks.
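The regularization steps described above can be sketched in plain Python as follows. Several details are assumptions made only for illustration: the L2 penalty is written as lambda times the sum of squares (some frameworks halve this), noise is added after clipping, and the variance passed in the example is an arbitrary placeholder because the value in the text is truncated.

```python
import math
import random


def l2_penalty(params, lam=0.3):
    """L2 regularization term: lam * sum of squared parameters (assumed form)."""
    return lam * sum(w * w for w in params)


def clip_by_norm(grad, max_norm=10.0):
    """Rescale the gradient so that its Euclidean norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def add_gaussian_noise(grad, variance, rng):
    """Add zero-mean Gaussian noise with the given variance to each entry."""
    std = math.sqrt(variance)
    return [g + rng.gauss(0.0, std) for g in grad]


# Toy example (made-up gradient; variance=0.01 is a placeholder, not the paper's value).
rng = random.Random(0)
clipped = clip_by_norm([30.0, 40.0])
noisy = add_gaussian_noise(clipped, variance=0.01, rng=rng)
penalty = l2_penalty([1.0, 2.0])
```

Here a gradient of norm 50 is scaled down to norm 10 while keeping its direction, and gradients already within the limit are returned unchanged.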
We used the publicly available pre-trained GloVe word embeddings [23], which were trained on 42 billion tokens of web data from Common Crawl (http://commoncrawl.org/). The dimension of each word vector is 300. Word2vec [17] is another popular word embedding algorithm, and pre-trained word embeddings from it are also publicly available. As the results in [23] show, GloVe outperforms word2vec on word analogy, word similarity, and named entity recognition tasks.

5-fold cross validation was used to evaluate our model. For each fold, the data was split into two parts: 80% as training data and 20% as testing data. The sampled response for each score is selected from the training data. A separate model was trained on each essay set because the score range varies among the 8 essay sets. We trained each model for 200 epochs using batch gradient descent.

Figure 2. An illustration of baseline LSTM model for AES
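The evaluation protocol (5 folds, an 80/20 split per fold, and one memory sample per score drawn from the training part) might look like the sketch below. The function names are hypothetical, and picking the memory sample uniformly at random is an assumption, since the text does not say how the sample for each score is chosen.

```python
import random


def k_fold_splits(n_items, n_folds=5, seed=0):
    """Yield (train, test) index lists; each fold holds out about 1/n_folds of the data."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [i for f in range(n_folds) if f != held_out for i in folds[f]]
        yield train, test


def select_memory_samples(train_indices, scores, rng):
    """Pick one training response per distinct score (random choice is an assumption)."""
    by_score = {}
    for i in train_indices:
        by_score.setdefault(scores[i], []).append(i)
    return {s: rng.choice(by_score[s]) for s in sorted(by_score)}


# Toy data (made up): 20 responses with scores in the range 0-3.
scores = [i % 4 for i in range(20)]
rng = random.Random(1)
splits = list(k_fold_splits(len(scores)))
train, test = splits[0]
memory = select_memory_samples(train, scores, rng)
# Responses placed in memory are left out of the remaining training data.
in_memory = set(memory.values())
remaining_train = [i for i in train if i not in in_memory]
```

With a score range of 0 to 3 this yields 4 memory samples per fold, matching the setting illustrated in Figure 1.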
Baselines
In [29], the authors compare their system with the Enhanced AI Scoring Engine (EASE), an open-source AES system, to demonstrate the improvement in performance. EASE, like traditional NLP techniques, requires fine-grained hand-engineered features and builds a regression model on top of these features. The reason we use this system as a baseline is that it achieved the best QWK scores among all open-source systems that participated in the ASAP competition. [31] described a set of reliable features and reported the results of two models using these features: support vector regression (SVR) and Bayesian linear ridge regression (BLRR).
[29] examined several neural network models, e.g. RNNs and Convolutional Neural Networks (CNNs), on the ASAP dataset. In their experiments, Long Short-Term Memory networks (LSTM) [36], a variant of RNN, achieved the best performance. LSTM is designed to have three gates in each hidden node: an input gate, a forget gate, and an output gate. By controlling these three gates, LSTM is capable of capturing long-term dependencies. The structure of the LSTM model described in [10] is presented in Figure 2.
To verify the efficacy of GloVe word embeddings and external memory, we developed a simple multi-layer feed-forward neural network (FNN) model, which is similar to our model in structure but has no external memory. We refer to this baseline model as FNN for the rest of the paper. As shown in Figure 3, each word of a student response is first converted to a continuous vector using GloVe word embeddings. The vector representation for the response is obtained by applying PE to all word vectors from the response. The representation is then fed into 4 hidden layers, each of which has 100 hidden nodes. At the output layer, a softmax operation is applied to the resulting states of the last hidden layer to predict the final score. The model is also trained using the Adam optimizer by minimizing the standard cross entropy between $\hat{s}$ and the true score s. FNN is properly defined by the
Set   MN     FNN    EASE(SVR)   EASE(BLRR)   LSTM   LSTM+CNN   Human
1     0.83   0.75   0.78        0.76         0.78   0.82       0.
2     0.72   0.7    0.62        0.61         0.69   0.69       0.
3     0.72   0.7    0.63        0.62         0.68   0.69       0.
4     0.82   0.8    0.75        0.74         0.8    0.81       0.
5     0.83   0.8    0.78        0.78         0.82   0.81       0.
6     0.83   0.79   0.77        0.78         0.81   0.82       0.
7     0.79   0.73   0.73        0.73         0.81   0.81       0.
8     0.68   0.63   0.53        0.62         0.59   0.64       0.
Avg   0.78   0.74   0.7         0.71         0.75   0.76       0.

Table 2. QWK scores on ASAP dataset.
Set   FNN   MN    LSTM
1     0.2   1.1   15.
2     0.2   1     19.
3     0.2   1     7
4     0.1   1     7
5     0.2   1     8
6     0.2   1     8.
7     0.2   1.5   10
8     0.1   1.4   6.
Avg   0.2   1.1   10.

Table 3. Average runtime (seconds) of each training epoch
hand, MN is 9 times faster than LSTM since the computation of GloVe with PE is a simple element-wise sum and MN is insensitive to the length of a response. FNN is the fastest since the structure of FNN is the simplest. Unlike MN, FNN does not need to loop through each memory piece to measure the relevance of two student responses at training time.
DISCUSSION AND CONCLUSION
In this study, we develop a generic model for automated grading tasks using memory networks and word embeddings. To the best of our knowledge, this is the first study in which memory networks are applied to this kind of task. Our model is tested on the ASAP dataset and achieves state-of-the-art performance in 7 out of 8 essay sets. Similar to other neural network models for AES, our model can be trained in an end-to-end fashion and does not require any hand-engineered features. Compared to RNN and CNN models, using GloVe word embeddings with PE to represent a student response makes our model simple and cost-effective. Adding external memory improves the performance over the FNN model, which means our model is able to take advantage of the sampled responses stored in the external memory.
Our model can be generalized to automatically grade assignments from other subjects. As shown above, there are two key factors in the performance: a reliable representation and the memory component. In order to apply our model to other kinds of assignments, learning a good vector representation for the assignment is the first step. It is analogous to how a regression model is built for supervised NLP tasks: first extract numerical hand-engineered features from text and then apply a regression model on these generated features to predict true labels. In the context of neural networks, a vector is required to represent the student response. Learning the vector can be a part of the predictive model. For example, the word embeddings in [10] are learned from their predictive model. These vectors can also come from pre-trained models, like GloVe and word2vec. The next step is to select characteristic samples and store these samples in memory. The purpose of this step is to teach the model to understand the grading strategy and eventually associate a vector representation with a score.

However, we only tested our model on one dataset. There is a need to explore our model with more datasets that contain various formats of assignments to verify our model. Furthermore, the representation of the assignment and the mechanism for measuring relevance among assignments are still elementary. Future work should therefore focus on these two areas to improve the generalizability of the model. A lot of effort is still needed to better interpret memory networks and explain the key factors behind our performance improvement.
ACKNOWLEDGMENTS
We acknowledge funding from multiple NSF grants (ACI-1440753, DRL-1252297, DRL-1109483, DRL-1316736 & DRL-1031398), the U.S. Department of Education (IES R305A120125 & R305C100024 and GAANN), the ONR, and the Gates Foundation.
REFERENCES
Weston. 2016. Key-Value Memory Networks for Directly Reading Documents. CoRR abs/1606.03126 (2016).