A Memory-Augmented Neural Model for Automated Grading

Siyuan Zhao, Yaqiong Zhang, Xiaolu Xiong, Anthony Botelho, Neil Heffernan

Worcester Polytechnic Institute, Worcester, MA 01609, USA

ABSTRACT

The need for automated grading tools for essay writing and open-ended assignments has received increasing attention due to the unprecedented scale of Massive Open Online Courses (MOOCs) and the fact that more and more students rely on computers to complete and submit their school work. In this paper, we propose an efficient automated grading model powered by memory networks. The idea of our model stems from the philosophy that, with enough graded samples for each score in the rubric, such samples can be used to grade future work that is found to be similar. For each possible score in the rubric, a student response graded with that score is collected. These selected responses represent the grading criteria specified in the rubric and are stored in the memory component. Our model learns to predict a score for an ungraded response by computing the relevance between the ungraded response and each selected response in memory. The evaluation was conducted on the Kaggle Automated Student Assessment Prize (ASAP) dataset. The results show that our model achieves state-of-the-art performance in 7 out of 8 essay sets and can be trained efficiently due to the simplicity of the model structure.

ACM Classification Keywords

I.2.7. ARTIFICIAL INTELLIGENCE: Natural Language Processing

Author Keywords

Automated grading; neural networks; memory networks; word embeddings; natural language processing

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
L@S’17, April 20–21, 2017, Boston, MA, USA
© 2017 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ISBN 123-4567-24-567/08/06...$15.00
DOI: http://dx.doi.org/10.475/123_4

INTRODUCTION

Automated grading is a critical part of Massive Open Online Course (MOOC) platforms and of any intelligent tutoring system (ITS) at scale. Many studies have been conducted to improve automated grading for assignments with simple fixed-form answers, short answers [3, 15, 19, 26, 21], or long-form answers [26, 2, 7, 14]. Some standardized tests, such as the Test of English as a Foreign Language (TOEFL) and the Graduate Record Examination (GRE), assess student writing skills. Manually grading these essays is time-consuming, so automated essay scoring (AES) systems have been used in these tests to reduce the time and cost of grading. Moreover, as MOOCs become widespread and the number of students enrolled in a single course increases, the need for grading and providing feedback on written assignments is more critical than ever.

As one branch of automated grading, AES has been the subject of numerous efforts to improve its performance. AES uses statistical and natural language processing (NLP) techniques to automatically predict a score for an essay based on the essay prompt and rubric. Essay writing is a common form of student assessment in schools and universities. In this task, students are required to write essays of various lengths, given a prompt or essay topic.

Most existing AES systems are built on predefined features (e.g., number of words, average word length, and number of spelling errors) combined with a machine learning algorithm [4]. Finding effective features for AES is normally a heavy burden. Moreover, the performance of such systems is constrained by the effectiveness of the predefined features. Recently, another kind of approach has emerged that employs neural network models to learn the features automatically in an end-to-end manner [29]. In this way, essay scores can be predicted directly without any feature extraction. The model based on long short-term memory (LSTM) networks in [29] has demonstrated promise in accomplishing multiple types of automated grading tasks.
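To make the notion of predefined features concrete, the following is a minimal sketch of the three example features named above. It is only an illustration: the function name `handcrafted_features`, the punctuation handling, and the use of a plain vocabulary set as a stand-in for a spell checker are all assumptions, not part of any cited system.

```python
def handcrafted_features(essay, vocabulary):
    """Compute three simple surface features of an essay.

    `vocabulary` is a hypothetical set of correctly spelled lowercase words;
    any token outside it is counted as a spelling error.
    """
    # Split on whitespace and strip surrounding punctuation from each token.
    words = [w.strip(".,;:!?\"'()").lower() for w in essay.split()]
    words = [w for w in words if w]

    num_words = len(words)
    avg_word_length = sum(len(w) for w in words) / num_words if num_words else 0.0
    num_spelling_errors = sum(1 for w in words if w not in vocabulary)

    return {
        "num_words": num_words,
        "avg_word_length": avg_word_length,
        "num_spelling_errors": num_spelling_errors,
    }
```

A feature vector of this kind would then be passed to a regression or classification model, which is the step that the end-to-end neural approaches aim to make unnecessary.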

Neural networks have achieved promising results on various NLP tasks, including machine translation [1, 5], sentiment analysis [6], and question answering [13, 31, 18, 28]. For NLP tasks, neural network models use word vectors to learn distributed representations from text. The advantages are that these models do not require hand-engineered features and can be trained to solve tasks in an end-to-end fashion.

Recent work [29] has explored several recurrent neural network (RNN) models for AES tasks. The results show that neural models outperform even strong baselines. Memory Networks (MN) [31, 18, 28] have recently been introduced to deal with NLP tasks that require complex reasoning and inference, and have been shown to outperform RNNs on some complex reasoning tasks [28]. MN is a class of models that contain an external scalable memory and a controller that reads from and writes to that memory. The notion of neural networks with memory was introduced to solve AI tasks that require complex reasoning and inference over remembered external context. Prior work [18, 28] has shown the success of MN on different kinds of tasks, e.g., bAbI tasks [30], MovieQA, and WikiQA [18].

To our knowledge, no study has investigated the feasibility and effectiveness of applying MN to automated grading tasks. In this study, we develop a generic model for such tasks using Memory Networks, inspired by their capability to store rich representations of data and to reason over that data in memory. For each possible essay score, we select from the student responses one essay that received that score as a sample for that grade. All collected sample responses are loaded into the memory of the model. The model is then trained on the remaining student responses in a supervised manner to compute the relevance between the representation of an ungraded response and that of each sample. The intuition is that, as part of a scoring rubric, a number of sample responses of varying quality are usually provided to students and graders to help them better understand the rubric. These collected responses embody the expectations of quality described in the rubric, and the model is expected to learn the grading criteria from them. We evaluate our model on a publicly available essay grading dataset from the Kaggle Automated Student Assessment Prize (ASAP) competition (https://www.kaggle.com/c/asap-aes). Our experiments show that our model achieves state-of-the-art results on this dataset and that training the model is efficient and cost-effective.
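The sample-selection step described above can be sketched as follows. This is a hedged illustration rather than the authors' procedure: the text does not say how the one sample per score is picked, so random choice is an assumption here, and the function name `select_memory_samples` is hypothetical.

```python
import random

def select_memory_samples(responses, scores, seed=0):
    """Pick one graded response per distinct score to act as memory.

    Returns (memory, remaining): `memory` maps each score to one response
    with that score; `remaining` holds the (response, score) pairs that are
    left over for supervised training.
    """
    rng = random.Random(seed)

    # Group response indices by the score they received.
    by_score = {}
    for idx, score in enumerate(scores):
        by_score.setdefault(score, []).append(idx)

    # Assumption: the sample for each score is drawn uniformly at random.
    chosen = {score: rng.choice(idxs) for score, idxs in sorted(by_score.items())}
    memory = {score: responses[idx] for score, idx in chosen.items()}

    chosen_indices = set(chosen.values())
    remaining = [
        (responses[idx], scores[idx])
        for idx in range(len(responses))
        if idx not in chosen_indices
    ]
    return memory, remaining
```

With a score range of 0 to 3, `memory` would hold four responses, matching the four memory slots shown in Figure 1.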

The rest of the paper is organized as follows. Section 2 gives an overview of related work in this research area. Section 3 provides detailed information about our model. Section 4 describes the ASAP dataset and the evaluation metrics used to test our framework; it also contains the details of our implementation and experimental setup to help other researchers replicate our work. In Section 5, we present the results of our framework and compare them with other models. Finally, we discuss the results and conclude the paper.

RELATED WORK

Automated Grading

MOOCs were introduced in 2008 and have become more popular recently. Most MOOC systems provide automated grading as an important feature to demonstrate the efficiency of their interaction with a massive number of online users. Some assignment types were adopted for automated grading because their correct answers have simple fixed forms, such as multiple-choice questions. Programming assignments are representative of assignments whose outcome can be reduced to a simple answer such as "yes" or "no" [8, 11]. Beyond providing answers for one specific assignment, more effort has been devoted to providing feedback on many different assignments according to the shared features of the program code [22, 25].

However, many assignment types cannot be handled well with simple feedback alone. Some studies have attempted to address this problem with semi-automatic grading approaches, which aim to optimize the collaboration between humans and machines when grading short answers [20, 3]. Another approach is to predict the grade directly. One research direction within this approach applies information extraction techniques, either constructing specific answer patterns manually or training on a large dataset with strong supervision [2, 3, 24]. Another direction compares the students’ answers with an established standard answer using an unsupervised text-similarity approach [21].

Most of the studies mentioned above deal with assignments that have simple fixed-form answers or short answers. Some complex assignments have long-form answers instead. Essay writing on a given topic is a typical assignment with long-form answers, and AES has become an important research branch of automated grading.

AES is generally treated as a machine learning problem. The existing AES solutions can be grouped from different points of view. Most existing AES systems are based on a number of predefined features. These features include essay length, number of words, lexicon and grammar, syntactic features, readability, text coherence, essay organization, and so on [4]. Recently, another line of work has emerged that treats whole essays as inputs and learns the features automatically in an end-to-end manner [29]. Because no feature extraction has to be done beforehand, the workload is reduced. Moreover, prediction accuracy is improved by removing the dependency on the effectiveness of predefined features.

Based on the learning techniques used in existing solutions, we divide them into three categories: regression-based approaches, classification-based approaches, and preference-ranking-based approaches. The PEG system and E-rater are two examples of the regression-based approach. Specifically, when the score range of the essays is wide, the regression-based approach is normally adopted since it treats the essay score as a continuous value.

Besides essay writing, some complex assignments, such as medical assignments, have been graded with regression models as well [9].

Figure 1. An illustration of memory networks for AES. The score range is 0 - 3. For each score, only one sample with the same score is selected from student responses. There are 4 samples in total in memory. Input representation layer is not included.

response representation to a d-dimensional feature space. The intuition is that responses with the same grade are highly likely to have similar representations in the feature space.

Memory Reading

After the weight vector p is calculated, the output of the memory is computed as a weighted sum of each piece of memory in m:

$$o = \sum_i p_i \, m_i C^{T} \tag{2}$$

where C is a k × d matrix used to transform the response representation to the feature space. The k × d matrix C may be identical to A, but in our experiments we found that training a separate C leads to better performance. From the equation, we can see that the weight vector p controls the amount of content that is read from each memory piece.
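As a minimal sketch of Equation 2 in plain Python (not the authors' TensorFlow code), the memory output can be computed as below. The function names `project` and `read_memory` are hypothetical, and the sketch assumes each memory piece m_i is a list of d numbers and C is given as k rows of length d.

```python
def project(m_i, C):
    """Compute m_i C^T: map a length-d memory vector to k dimensions.

    C is a k x d matrix stored as a list of k rows, so entry j of the
    result is the dot product of m_i with row j of C.
    """
    return [sum(x * c for x, c in zip(m_i, c_row)) for c_row in C]

def read_memory(p, memory, C):
    """Equation 2: o = sum_i p_i * (m_i C^T)."""
    o = [0.0] * len(C)
    for p_i, m_i in zip(p, memory):
        projected = project(m_i, C)
        for j, value in enumerate(projected):
            o[j] += p_i * value
    return o
```

Because p comes from a softmax, its entries sum to one, so o is a convex combination of the projected memory pieces: a weight close to 1 on one sample means the output is dominated by that sample.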

Multiple Hops

The success of neural networks is due to their ability to learn multiple layers of neurons, where each layer transforms the representation at the previous level into a higher-level, more abstract representation. Inspired by this idea, we stack multiple memory addressing and memory reading steps together to handle multi-hop operations.

After receiving the output o from Equation 2, the representation of the ungraded response u is updated with:

$$u_2 = \mathrm{ReLU}(R_1 (u + o)) \tag{3}$$

where $R_1$ is a $k \times k$ matrix, $u = x A^{T}$, and $\mathrm{ReLU}(y) = \max(0, y)$. The memory addressing and memory reading steps are then repeated, using a different matrix $R_j$ on each hop $j$. The memory addressing step is modified accordingly to use the updated representation of the ungraded response:

$$p_i = \mathrm{Softmax}(u_j \cdot m_i B) \tag{4}$$
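The hop update can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: B is taken to be a d × k matrix so that m_i B has the same length k as u (the excerpt does not state B's shape), the same addressing formula is used on every hop including the first, and all function names are hypothetical.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def row_times_matrix(row, matrix):
    """Row vector (length n) times an n x m matrix -> length m."""
    return [sum(row[i] * matrix[i][j] for i in range(len(row)))
            for j in range(len(matrix[0]))]

def address(u, memory, B):
    """Equation 4: p_i = Softmax(u . m_i B), with B assumed to be d x k."""
    return softmax([dot(u, row_times_matrix(m_i, B)) for m_i in memory])

def read(p, memory, C):
    """Equation 2: o = sum_i p_i m_i C^T, with C a k x d matrix."""
    o = [0.0] * len(C)
    for p_i, m_i in zip(p, memory):
        for j, c_row in enumerate(C):
            o[j] += p_i * dot(m_i, c_row)
    return o

def hop(u, memory, B, C, R):
    """Equation 3: next u = ReLU(R (u + o)), with R a k x k matrix."""
    o = read(address(u, memory, B), memory, C)
    summed = [a + b for a, b in zip(u, o)]
    return [max(0.0, dot(r_row, summed)) for r_row in R]

def run_hops(u, memory, B, C, R_list):
    """Apply one hop per matrix in R_list and return the final state."""
    for R in R_list:
        u = hop(u, memory, B, C, R)
    return u
```

Calling `run_hops` with a list of H matrices yields the state that the output layer consumes; each hop re-weights the memory using the updated representation.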

Output Layer

After a fixed number of hops H, the resulting state $u_H$ is used to predict a final score over the possible scores:

$$\hat{s} = \mathrm{Softmax}(u_H W + b) \tag{5}$$

where $W$ is a $k \times r$ matrix, $r$ is the number of possible scores, and $b$ is the bias. Note that the number of output nodes equals the number of possible scores. We calculate a distribution over all possible scores and select the most probable score as the prediction. The whole network is trained end to end without any hand-engineered features, and the matrices $A$, $B$, $C$, $W$ and $R_1, \ldots, R_H$ are learned through backpropagation and stochastic gradient descent by minimizing a standard cross-entropy loss between the predicted score $\hat{s}$ and the actual score $s$.
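The output layer and its loss can be illustrated with a short plain-Python sketch. It assumes that the bias b is a vector with one entry per possible score and that the true score is given as an index into the score range; the function names are hypothetical.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def predict_distribution(u_H, W, b):
    """Equation 5: s_hat = Softmax(u_H W + b), with W a k x r matrix."""
    logits = [
        sum(u_H[i] * W[i][j] for i in range(len(u_H))) + b[j]
        for j in range(len(b))
    ]
    return softmax(logits)

def predict_score(s_hat, min_score=0):
    """Select the most probable score; min_score shifts index to score."""
    best_index = max(range(len(s_hat)), key=lambda j: s_hat[j])
    return min_score + best_index

def cross_entropy(s_hat, true_index):
    """Cross-entropy between s_hat and a one-hot target at true_index."""
    return -math.log(s_hat[true_index])
```

For an essay set scored from 2 to 12, for example, `predict_score(s_hat, min_score=2)` would map the eleven output nodes back to actual scores.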

EXPERIMENTAL SETUP

Dataset

The dataset used in this study comes from the Kaggle Automated Student Assessment Prize (ASAP) competition sponsored by the William and Flora Hewlett Foundation (Hewlett). There are 8 sets of essays, and each set was generated from a single prompt. All responses in the dataset were written by students in grades 7 to 10. The score range varies across essay sets. All essays were graded by at least 2 human graders. The average essay length differs for each essay set, ranging from 150 to 650 words. Selected details for each essay set are shown in Table 1.

Evaluation Metric

Quadratic weighted kappa (QWK) is used to measure the agreement between the human grader and the model. We chose this metric because it is the official evaluation metric of the ASAP competition. Other work on the ASAP dataset, such as [4, 29, 24], also uses it. QWK is calculated using

$$\kappa = 1 - \frac{\sum_{i,j} w_{i,j} O_{i,j}}{\sum_{i,j} w_{i,j} E_{i,j}}$$

where $O$, $w$ and $E$ are the matrices of observed scores, weights, and expected scores, respectively. The entry $O_{i,j}$ is the number of student responses that receive a score $i$ from the first grader and a score $j$ from the second grader (the model in our experiment). The weights are $w_{i,j} = (i - j)^2 / (N - 1)^2$, where $N$ is the number of possible scores. The matrix $E$ is calculated by taking the outer product between the score vectors of the two graders (the counts of each score assigned by each grader), which is then normalized to have the same sum as $O$.
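The QWK definition above translates directly into code. The following plain-Python sketch follows the formula term by term; the function name and the `min_score`/`max_score` parameters are illustrative choices, and the sketch assumes at least two possible scores and that the expected-disagreement sum is non-zero.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Quadratic weighted kappa between two equal-length lists of integer scores."""
    n = max_score - min_score + 1  # N, the number of possible scores

    # Observed matrix O: O[i][j] counts responses scored i by A and j by B.
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    # Score histograms of each rater (row sums and column sums of O).
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    total = len(rater_a)

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            # Outer product of the histograms, scaled to sum to `total` like O.
            expected = hist_a[i] * hist_b[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected
    return 1.0 - numerator / denominator
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate agreement worse than chance.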

Implementation Details

The model was implemented using the TensorFlow framework [16]. We used Adam stochastic gradient descent [12] to optimize the learned parameters. The learning rate was set to 0. and the batch size for each iteration to 32 for all models. As the final prediction layer, we used a fully connected layer with a softmax activation function on top of the output from the memory reading layer. The model learned the parameters by minimizing a standard cross-entropy loss between the predicted score and the correct score.

For regularization we used L2 loss on all learned parameters with lambda set to 0.3 and limited the norm of the gradients to be below 10. Moreover, we added gradient noise sampled from a Gaussian distribution with mean 0 and variance 0. when training the memory networks.
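The two gradient treatments mentioned above, norm clipping and additive Gaussian noise, can be sketched in plain Python as follows. This is an illustration only: whether the norm limit applies per parameter tensor or globally, and whether noise is added before or after clipping, are not stated in the text, so the per-vector clipping and the noise-then-clip order used here are assumptions. The noise variance is left as a required argument because its value is not recoverable from the text.

```python
import math
import random

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

def add_gaussian_noise(grad, variance, rng):
    """Add zero-mean Gaussian noise with the given variance to each entry."""
    std = math.sqrt(variance)
    return [g + rng.gauss(0.0, std) for g in grad]

def prepare_gradient(grad, max_norm, noise_variance, rng):
    """Assumed order: add noise first, then clip the noisy gradient."""
    return clip_by_norm(add_gaussian_noise(grad, noise_variance, rng), max_norm)
```

For example, `prepare_gradient(grad, max_norm=10.0, noise_variance=v, rng=random.Random(0))` would be applied to each gradient before the optimizer step, with `v` set to the variance used in training.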

Figure 2. An illustration of baseline LSTM model for AES

We used the publicly available pre-trained GloVe word embeddings [23], which were trained on 42 billion tokens of web data from Common Crawl (http://commoncrawl.org/). The dimension of each word vector is 300. Word2vec [17] is another popular word embedding algorithm, and pre-trained word embeddings from it are also publicly available. As shown by the results in [23], GloVe outperforms word2vec on word analogy, word similarity, and named entity recognition tasks. 5-fold cross-validation was used to evaluate our model. For each fold, the data was split into two parts: 80% as training data and 20% as testing data. The sample response for each score is selected from the training data. A separate model was trained on each essay set because the score range varies among the 8 essay sets. We trained each model for 200 epochs using batch gradient descent.
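The 5-fold protocol with an 80%/20% split per fold can be sketched as below. Shuffling the data before splitting and the function name `five_fold_splits` are assumptions made for illustration; the text does not describe how the folds were formed.

```python
import random

def five_fold_splits(n_items, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) for each cross-validation fold.

    Each item lands in the test part of exactly one fold, so with 5 folds
    every split uses about 80% of the data for training and 20% for testing.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # assumption: shuffle before splitting

    # Deal the shuffled indices into n_folds groups of near-equal size.
    folds = [indices[f::n_folds] for f in range(n_folds)]

    for f in range(n_folds):
        test = folds[f]
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield train, test
```

Within each fold, the memory samples would be drawn from `train` only, so that no test response is ever stored in memory.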

Baselines

In [29], the proposed system is compared with the Enhanced AI Scoring Engine (EASE), an open-source AES system, to demonstrate the improvement in performance. EASE, like traditional NLP techniques, requires fine-grained hand-engineered features and builds a regression model on top of these features. We use this system as a baseline because it achieved the best QWK scores among all open-source systems that participated in the ASAP competition. [31] described a set of reliable features and reported the results of two models using these features: support vector regression (SVR) and Bayesian linear ridge regression (BLRR).

[29] examined several neural network models, e.g., RNNs and convolutional neural networks (CNN), on the ASAP dataset. In their experiments, long short-term memory networks (LSTM) [36], a variant of RNN, achieved the best performance. LSTM is designed to have three gates in each hidden node: an input gate, a forget gate, and an output gate. By controlling these three gates, LSTM is capable of capturing long-term dependencies. The structure of the LSTM model described in [10] is presented in Figure 2.

To verify the efficacy of GloVe word embeddings and external memory, we developed a simple multi-layer feed-forward neural network (FNN) model, which is similar to our model in structure but has no external memory. We refer to this baseline model as FNN for the rest of the paper. As shown in Figure 3, each word of a student response is first converted to a continuous vector using GloVe word embeddings. The vector representation of the response is obtained by applying PE to all word vectors from the response. The representation is then fed into 4 hidden layers, each of which has 100 hidden nodes. At the output layer, a softmax operation is applied to the resulting states of the last hidden layer to predict the final score. The model is also trained using the Adam optimizer by minimizing the standard cross-entropy between $\hat{s}$ and the true score $s$. FNN is properly defined by the

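| Set | MN | FNN | EASE(SVR) | EASE(BLRR) | LSTM | LSTM+CNN | Human |
|-----|------|------|------|------|------|------|----|
| 1 | 0.83 | 0.75 | 0.78 | 0.76 | 0.78 | 0.82 | 0. |
| 2 | 0.72 | 0.7 | 0.62 | 0.61 | 0.69 | 0.69 | 0. |
| 3 | 0.72 | 0.7 | 0.63 | 0.62 | 0.68 | 0.69 | 0. |
| 4 | 0.82 | 0.8 | 0.75 | 0.74 | 0.8 | 0.81 | 0. |
| 5 | 0.83 | 0.8 | 0.78 | 0.78 | 0.82 | 0.81 | 0. |
| 6 | 0.83 | 0.79 | 0.77 | 0.78 | 0.81 | 0.82 | 0. |
| 7 | 0.79 | 0.73 | 0.73 | 0.73 | 0.81 | 0.81 | 0. |
| 8 | 0.68 | 0.63 | 0.53 | 0.62 | 0.59 | 0.64 | 0. |
| Avg | 0.78 | 0.74 | 0.7 | 0.71 | 0.75 | 0.76 | 0. |

Table 2. QWK scores on ASAP dataset.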

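| Set | FNN | MN | LSTM |
|-----|-----|-----|------|
| 1 | 0.2 | 1.1 | 15. |
| 2 | 0.2 | 1 | 19. |
| 3 | 0.2 | 1 | 7 |
| 4 | 0.1 | 1 | 7 |
| 5 | 0.2 | 1 | 8 |
| 6 | 0.2 | 1 | 8. |
| 7 | 0.2 | 1.5 | 10 |
| 8 | 0.1 | 1.4 | 6. |
| Avg | 0.2 | 1.1 | 10. |

Table 3. Average runtime (seconds) of each training epoch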

hand, MN is 9 times faster than LSTM since the computation of GloVe with PE is a simple element-wise sum and MN is insensitive to the length of a response. FNN is the fastest since the structure of FNN is the simplest. Unlike MN, FNN does not need to loop through each memory piece to measure the relevance of two student responses at training time.

DISCUSSION AND CONCLUSION

In this study, we develop a generic model for automated grading tasks using memory networks and word embeddings. To the best of our knowledge, this is the first study in which memory networks are applied to this kind of task. Our model is tested on the ASAP dataset and achieves state-of-the-art performance in 7 out of 8 essay sets. Similar to other neural network models for AES, our model can be trained in an end-to-end fashion and does not require any hand-engineered features. Compared to RNN and CNN models, using GloVe word embeddings with PE to represent a student response makes our model simple and cost-effective. Adding external memory improves performance over the FNN model, which means our model is able to take advantage of the sampled responses stored in the external memory.

Our model can be generalized to automatically grade assignments from other subjects. As shown above, there are two key factors in its performance: a reliable representation and the memory component. To apply our model to other kinds of assignments, learning a good vector representation for the assignment is the first step. This is analogous to how a regression model is built for supervised NLP tasks: first extract numerical hand-engineered features from text, and then apply a regression model to these features to predict the true labels. In the context of neural networks, a vector is required to represent the student response. Learning the vector can be part of the predictive model. For example, the word embeddings in [10] are learned from their predictive model. These vectors can also come from pre-trained models, like GloVe and word2vec. The next step is to select characteristic samples and store them in memory. The purpose of this step is to teach the model the grading strategy so that it eventually associates a vector representation with a score.

However, we only tested our model on one dataset. There is a need to explore our model on more datasets that contain various formats of assignments in order to verify it. Furthermore, the representation of the assignment and the mechanism for measuring relevance among assignments are still elementary. Future work should therefore focus on these two areas to improve the generalizability of the model. A lot of effort is still needed to better interpret memory networks and explain the key factors behind our performance improvement.

ACKNOWLEDGMENTS

We acknowledge funding from multiple NSF grants (ACI-1440753, DRL-1252297, DRL-1109483, DRL-1316736 & DRL-1031398), the U.S. Department of Education (IES R305A120125 & R305C100024 and GAANN), the ONR, and the Gates Foundation.

REFERENCES

  1. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. (1 Sept. 2014).
  2. S P Balfour. 2013. Assessing writing in MOOCs: Automated essay scoring and calibrated peer review. In Research and Practice in Assessment 8, Vol. 1. 40–48.
  3. Michael Brooks, Sumit Basu, Charles Jacobs, and Lucy Vanderwende. 2014. Divide and correct: using clusters to grade short answers at scale. In L@S.
  4. Hongbo Chen and Ben He. 2013. Automated Essay Scoring by Maximizing Human-Machine Agreement. In EMNLP.
  5. Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. (3 June 2014).
  6. Cícero Nogueira dos Santos and Maira Gatti. 2014. Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts. In COLING. 69–78.
  7. Rehab Duwairi. 2006. A framework for the computerized assessment of university student essays. Comput. Human Behav. 22 (2006), 381–388.
  8. G E Forsythe and N Wirth. 1965. Automatic Grading Programs. Commun. ACM 8, 5 (1965), 275–278.
  9. Chase Geigle, Chengxiang Zhai, and Duncan C Ferguson. An Exploration of Automated Grading of Complex Assignments. In L@S.
  10. Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing Machines. (20 Oct. 2014).
  11. Michael T Helmick. 2007. Interface-based programming assignments and automatic grading of java programs. In ITiCSE.
  12. Diederik P Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2014).
  13. Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2016. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. In ICML.
  14. Leah S Larkey. 1998. Automatic Essay Grading Using Text Categorization Techniques. In SIGIR.
  15. Claudia Leacock and Martin Chodorow. 2003. C-rater: Automated Scoring of Short-Answer Questions. Comput. Hum. 37 (2003), 389–405.
  16. Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. (2015).
  17. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, C J C Burges, L Bottou, M Welling, Z Ghahramani, and K Q Weinberger (Eds.). Curran Associates, Inc., 3111–3119.
  18. Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-Value Memory Networks for Directly Reading Documents. CoRR abs/1606.03126 (2016).
  19. Tom Mitchell, Terry Russell, Peter Broomhead, and Nicola Aldridge. 2002. Towards Robust Computerised Marking of Free-text Responses.
  20. P Mitros, V Paruchuri, J Rogosic, and others. 2013. An integrated framework for the grading of freeform responses. The Sixth Conference of (2013).
  21. Michael Mohler and Rada Mihalcea. 2009. Text-to-Text Semantic Similarity for Automatic Short Answer Grading. In EACL.
  22. Andy Nguyen, Chris Piech, Jonathan Huang, and Leonidas J Guibas. 2014. Codewebs: scalable homework search for massive open online programming courses. In WWW.
  23. Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global Vectors for Word Representation. In EMNLP, Vol. 14. 1532–1543.
  24. Peter Phandi, Kian Ming Adam Chai, and Hwee Tou Ng. Flexible Domain Adaptation for Automated Essay Scoring Using Correlated Linear Regression. In EMNLP.
  25. Chris Piech, Jonathan Huang, Andy Nguyen, Mike Phulsuksombati, Mehran Sahami, and Leonidas J Guibas. 2015. Learning Program Embeddings to Propagate Feedback on Student Code. CoRR abs/1505. (2015).
  26. Carolyn Penstein Rosé, Antonio Roque, Dumisizwe Bhembe, and Kurt VanLehn. 2003. A Hybrid Text Classification Approach For Analysis Of Student Essays.
  27. Lawrence M Rudner and Tahung Liang. 2002. Automated Essay Scoring Using Bayes’ Theorem. The Journal of Technology, Learning and Assessment 1, 2 (1 June 2002).
  28. Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-To-End Memory Networks. In Advances in Neural Information Processing Systems 28, C Cortes, N D Lawrence, D D Lee, M Sugiyama, and R Garnett (Eds.). Curran Associates, Inc., 2440–2448.
  29. Kaveh Taghipour and Hwee Tou Ng. 2016. A Neural Approach to Automated Essay Scoring. In EMNLP.
  30. Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. 2015. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. (19 Feb. 2015).
  31. Jason Weston, Sumit Chopra, and Antoine Bordes. 2014. Memory Networks. CoRR abs/1410.3916 (2014).
  32. Caiming Xiong, Stephen Merity, and Richard Socher. 2016. Dynamic Memory Networks for Visual and Textual Question Answering. (4 March 2016).