Deep Averaging Networks: Marrying Speed and Accuracy in Sentiment Analysis

This document presents a deep neural network model, the deep averaging network (DAN), which aims to combine the speed of unordered composition functions with the accuracy of syntactic functions in sentiment analysis. The model applies a composition function to word embeddings and stacks nonlinear layers over the traditional NBOW model. The authors also introduce word dropout to improve robustness. The document compares DANs to various baselines, including syntactic functions, and presents experimental results.

Type: Summary

Shared on 21/10/2021 by orma-amenic


Deep Unordered Composition Rivals Syntactic Methods for Text Classification

Mohit Iyyer,^1 Varun Manjunatha,^1 Jordan Boyd-Graber,^2 Hal Daumé III^1

^1 University of Maryland, Department of Computer Science and UMIACS

^2 University of Colorado, Department of Computer Science

{miyyer,varunm,hal}@umiacs.umd.edu, [email protected]

Abstract

Many existing deep learning models for natural language processing tasks focus on learning the compositionality of their inputs, which requires many expensive computations. We present a simple deep neural network that competes with and, in some cases, outperforms such models on sentiment analysis and factoid question answering tasks while taking only a fraction of the training time. While our model is syntactically-ignorant, we show significant improvements over previous bag-of-words models by deepening our network and applying a novel variant of dropout. Moreover, our model performs better than syntactic models on datasets with high syntactic variance. We show that our model makes similar errors to syntactically-aware models, indicating that for the tasks we consider, nonlinearly transforming the input is more important than tailoring a network to incorporate word order and syntax.

1 Introduction

Vector space models for natural language processing (NLP) represent words using low-dimensional vectors called embeddings. To apply vector space models to sentences or documents, one must first select an appropriate composition function, which is a mathematical process for combining multiple words into a single vector.

Composition functions fall into two classes: unordered and syntactic. Unordered functions treat input texts as bags of word embeddings, while syntactic functions take word order and sentence structure into account. Previously published experimental results have shown that syntactic functions outperform unordered functions on many tasks (Socher et al., 2013b; Kalchbrenner and Blunsom, 2013). However, there is a tradeoff: syntactic functions require more training time than unordered composition functions and are prohibitively expensive in the case of huge datasets or limited computing resources. For example, the recursive neural network (Section 2) computes costly matrix/tensor products and nonlinearities at every node of a syntactic parse tree, which limits it to smaller datasets that can be reliably parsed.

We introduce a deep unordered model that obtains near state-of-the-art accuracies on a variety of sentence and document-level tasks with just minutes of training time on an average laptop computer. This model, the deep averaging network (DAN), works in three simple steps:

  1. take the vector average of the embeddings associated with an input sequence of tokens
  2. pass that average through one or more feed-forward layers
  3. perform (linear) classification on the final layer’s representation

The model can be improved by applying a novel dropout-inspired regularizer: for each training instance, randomly drop some of the tokens’ embeddings before computing the average.

We evaluate DANs on sentiment analysis and factoid question answering tasks at both the sentence and document level in Section 4. Our model’s successes demonstrate that for these tasks, the choice of composition function is not as important as initializing with pretrained embeddings and using a deep network. Furthermore, DANs, unlike more complex composition functions, can be effectively trained on data that have high syntactic variance. A qualitative analysis of the learned layers suggests that the model works by magnifying tiny but meaningful differences in the vector average through multiple hidden layers, and a detailed error analysis shows that syntactically-aware models actually make very similar errors to those of the more naïve DAN.
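The three steps listed in the introduction can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`average`, `dense`, `dan_scores`, `random_layer`), the ReLU activation, the toy 4-dimensional embeddings, and the random weights are all hypothetical choices made for the example.

```python
import random


def average(vectors):
    """Step 1: element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def dense(x, weights, bias):
    """Affine map W x + b, where `weights` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    """Element-wise ReLU (an assumed choice of nonlinearity)."""
    return [max(0.0, xi) for xi in x]


def dan_scores(tokens, embeddings, hidden_layers, output_layer):
    """Return one linear score per label for a token sequence."""
    z = average([embeddings[t] for t in tokens])   # step 1: average
    for weights, bias in hidden_layers:            # step 2: feed-forward layers
        z = relu(dense(z, weights, bias))
    weights, bias = output_layer                   # step 3: linear classification
    return dense(z, weights, bias)


def random_layer(rng, n_out, n_in):
    """Hypothetical initializer: small uniform weights and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
sentence = ["the", "film", "was", "awesome"]
embeddings = {w: [rng.uniform(-1.0, 1.0) for _ in range(4)] for w in sentence}
hidden_layers = [random_layer(rng, 4, 4), random_layer(rng, 4, 4)]
output_layer = random_layer(rng, 2, 4)

scores = dan_scores(sentence, embeddings, hidden_layers, output_layer)
predicted_label = scores.index(max(scores))

# Because step 1 is an average, reversing the tokens leaves the scores
# unchanged up to floating-point rounding.
shuffled_scores = dan_scores(sentence[::-1], embeddings, hidden_layers, output_layer)
```

The last two lines illustrate what "unordered" means in practice: any permutation of the input tokens yields the same representation, so word order cannot influence the prediction.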

2 Unordered vs. Syntactic Composition

Our goal is to marry the speed of unordered functions with the accuracy of syntactic functions. In this section, we first describe a class of unordered composition functions dubbed “neural bag-of-words models” (NBOW). We then explore more complex syntactic functions designed to avoid many of the pitfalls associated with NBOW models. Finally, we present the deep averaging network (DAN), which stacks nonlinear layers over the traditional NBOW model and achieves performance on par with or better than that of syntactic functions.

2.1 Neural Bag-of-Words Models

For simplicity, consider text classification: map an input sequence of tokens $X$ to one of $k$ labels. We first apply a composition function $g$ to the sequence of word embeddings $v_w$ for $w \in X$. The output of this composition function is a vector $z$ that serves as input to a logistic regression function.

In our instantiation of NBOW, $g$ averages word embeddings^1

$$z = g(w \in X) = \frac{1}{|X|} \sum_{w \in X} v_w. \tag{1}$$

Feeding $z$ to a softmax layer induces estimated probabilities for each output label

$$\hat{y} = \mathrm{softmax}(W_s \cdot z + b), \tag{2}$$

where the softmax function is

$$\mathrm{softmax}(q) = \frac{\exp q}{\sum_{j=1}^{k} \exp q_j}, \tag{3}$$

$W_s$ is a $k \times d$ matrix for a dataset with $k$ output labels, and $b$ is a bias term.

We train the NBOW model to minimize cross-entropy error, which for a single training instance with ground-truth label $y$ is

$$\ell(\hat{y}) = \sum_{p=1}^{k} y_p \log(\hat{y}_p). \tag{4}$$

^1 Preliminary experiments indicate that averaging outperforms the vector sum used in NBOW from Kalchbrenner et al. (2014).
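As a worked instance of Equations 1 to 4, the sketch below computes the NBOW average, the softmax probabilities, and the loss for a toy input. The 2-dimensional embeddings, the weight matrix `W_s`, and the helper names are made up for illustration. One assumption is worth flagging: `cross_entropy` uses the conventional negated form of the sum in Equation 4, so that a smaller value corresponds to a better prediction.

```python
import math


def nbow_average(token_vectors):
    """Equation 1: mean of the word embeddings."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def softmax(q):
    """Equation 3, with the maximum subtracted for numerical stability."""
    m = max(q)
    exps = [math.exp(qi - m) for qi in q]
    total = sum(exps)
    return [e / total for e in exps]


def predict(z, W_s, b):
    """Equation 2: softmax over the affine scores W_s z + b."""
    scores = [sum(w * zi for w, zi in zip(row, z)) + bi
              for row, bi in zip(W_s, b)]
    return softmax(scores)


def cross_entropy(y_true, y_hat):
    """Negated sum from Equation 4 (assumed sign convention)."""
    return -sum(y_p * math.log(p_hat)
                for y_p, p_hat in zip(y_true, y_hat) if y_p > 0)


# Toy input: two tokens with 2-d embeddings, k = 2 labels.
z = nbow_average([[1.0, 0.0], [0.0, 1.0]])   # [0.5, 0.5]
W_s = [[2.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]

y_hat = predict(z, W_s, b)                   # approximately [0.731, 0.269]
loss = cross_entropy([1.0, 0.0], y_hat)      # approximately 0.313
```

With these numbers the affine scores are [1, 0], so the first label receives probability 1 / (1 + e^-1) and the loss for a one-hot label on the first class is log(1 + e^-1).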

Before we describe our deep extension of the NBOW model, we take a quick detour to discuss syntactic composition functions. Connections to other representation frameworks are discussed further in Section 4.

2.2 Considering Syntax for Composition

Given a sentence like “You’ll be more entertained getting hit by a bus”, an unordered model like NBOW might be deceived by the word “entertained” to return a positive prediction. In contrast, syntactic composition functions rely on the order and structure of the input to learn how one word or phrase affects another, sacrificing computational efficiency in the process. In subsequent sections, we argue that this complexity is not matched by a corresponding gain in performance.

Recursive neural networks (RecNNs) are syntactic functions that rely on natural language’s inherent structure to achieve state-of-the-art accuracies on sentiment analysis tasks (Tai et al., 2015). As in NBOW, each word type has an associated embedding. However, the composition function g now depends on a parse tree of the input sequence. The representation for any internal node in a binary parse tree is computed as a nonlinear function of the representations of its children (Figure 1, left). A more powerful RecNN variant is the recursive neural tensor network (RecNTN), which modifies g to include a costly tensor product (Socher et al., 2013b).

While RecNNs can model complex linguistic phenomena like negation (Hermann et al., 2013), they require much more training time than NBOW models. The nonlinearities and matrix/tensor products at each node of the parse tree are expensive, especially as model dimensionality increases. RecNNs also require an error signal at every node. One root softmax is not strong enough for the model to learn compositional relations and leads to worse accuracies than standard bag-of-words models (Li, 2014). Finally, RecNNs require relatively consistent syntax between training and test data due to their reliance on parse trees and thus cannot effectively incorporate out-of-domain data, as we show in our question-answering experiments.
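To make the contrast with averaging concrete, here is a small sketch of recursive composition over a binary parse tree. The text only states that an internal node's vector is a nonlinear function of its children's vectors; the specific form used here, tanh(W_left · l + W_right · r + b), the nested-tuple tree encoding, and all names and numbers are assumptions chosen for illustration, not the formulation of any cited model.

```python
import math


def matvec(matrix, vector):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def compose(tree, embeddings, w_left, w_right, bias):
    """Return the vector for a tree given as a word or a (left, right) pair."""
    if isinstance(tree, str):           # leaf: look up the word embedding
        return embeddings[tree]
    left, right = tree
    l = compose(left, embeddings, w_left, w_right, bias)
    r = compose(right, embeddings, w_left, w_right, bias)
    pre = [a + b + c
           for a, b, c in zip(matvec(w_left, l), matvec(w_right, r), bias)]
    return [math.tanh(x) for x in pre]  # one nonlinearity per internal node


embeddings = {"not": [1.0, 0.0], "bad": [0.0, 0.5]}   # toy 2-d embeddings
w_left = [[1.0, 0.0], [0.0, 1.0]]    # identity
w_right = [[0.0, 1.0], [1.0, 0.0]]   # swaps the two coordinates
bias = [0.0, 0.0]

not_bad = compose(("not", "bad"), embeddings, w_left, w_right, bias)
bad_not = compose(("bad", "not"), embeddings, w_left, w_right, bias)
```

Unlike an average, swapping the children changes the result, which is how such a model can be sensitive to order and structure. The price is that a binary tree over n words has n - 1 internal nodes, each needing its own matrix products and nonlinearity, plus a parse of the input.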
Kim (2014) shows that some of these issues can be avoided by using a convolutional network instead of a RecNN, but the computational complexity increases even further (see Section 4 for runtime comparisons).

What contributes most to the power of syntactic

which we call word dropout, our network theoretically sees $2^{|X|}$ different token sequences for each input $X$. We posit a vector $r$ with $|X|$ independent Bernoulli trials, each of which equals 1 with probability $p$. The embedding $v_w$ for token $w$ in $X$ is dropped from the average if $r_w$ is 0, which exponentially increases the number of unique examples the network sees during training. This allows us to modify Equation 1:

$$r_w \sim \mathrm{Bernoulli}(p) \tag{6}$$

$$\hat{X} = \{w \mid w \in X \text{ and } r_w > 0\} \tag{7}$$

$$z = g(w \in X) = \frac{\sum_{w \in \hat{X}} v_w}{|\hat{X}|} \tag{8}$$

Depending on the choice of $p$, many of the “dropped” versions of an original training instance will be very similar to each other, but for shorter inputs this is less likely. We might drop a very important token, such as “horrible” in “the crab rangoon was especially horrible”; however, since the number of word types that are predictive of the output labels is low compared to non-predictive ones (e.g., neutral words in sentiment analysis), we always see improvements using this technique.

Theoretically, word dropout can also be applied to other neural network-based approaches. However, we observe no significant performance differences in preliminary experiments when applying word dropout to leaf nodes in RecNNs for sentiment analysis (dropped leaf representations are set to zero vectors), and it slightly hurts performance on the question answering task.
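Equations 6 to 8 translate almost directly into code. The sketch below is a minimal illustration with made-up 2-dimensional embeddings. Following Equation 6, the probability argument is the chance that a token is kept, so it is named `keep_prob` here. The fallback for the case where every token is dropped is an assumption added to avoid a division by zero; the text does not say how that case is handled.

```python
import random


def word_dropout_average(tokens, embeddings, keep_prob, rng):
    """Average the embeddings of the tokens that survive word dropout."""
    # Eq. 6 and 7: one Bernoulli trial per token; a token is kept when r_w = 1.
    kept = [t for t in tokens if rng.random() < keep_prob]
    if not kept:
        # Assumption (not from the text): if every token is dropped,
        # fall back to the full input to avoid dividing by zero.
        kept = list(tokens)
    dim = len(embeddings[kept[0]])
    # Eq. 8: average over the surviving tokens only.
    return [sum(embeddings[t][i] for t in kept) / len(kept) for i in range(dim)]


# Hypothetical toy embeddings for the example sentence in the text.
embeddings = {
    "the": [0.0, 0.2],
    "crab": [0.4, 0.0],
    "rangoon": [0.2, 0.2],
    "was": [0.0, 0.0],
    "especially": [0.2, -0.2],
    "horrible": [-1.0, 0.4],
}
sentence = ["the", "crab", "rangoon", "was", "especially", "horrible"]
rng = random.Random(0)

full = word_dropout_average(sentence, embeddings, 1.0, rng)   # no dropout
noisy = [word_dropout_average(sentence, embeddings, 0.7, rng) for _ in range(5)]
all_dropped = word_dropout_average(sentence, embeddings, 0.0, rng)  # fallback path
```

Each call with `keep_prob` below 1 produces a slightly different average for the same sentence, which is the source of the extra training variety described above. Dropout of this kind is applied only during training; at test time the plain average is used.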

4 Experiments

We compare DANs to both the shallow NBOW model as well as more complicated syntactic models on sentence and document-level sentiment analysis and factoid question answering tasks. The DAN architecture we use for each task is almost identical, differing across tasks only in the type of output layer and the choice of activation function. Our results show that DANs outperform other bag-of-words models and many syntactic models with very little training time.^2 On the question-answering task, DANs effectively train on out-of-domain data, while RecNNs struggle to reconcile the syntactic differences between the training and test data.

^2 Code at http://github.com/miyyer/dan.

Model | RT | SST fine | SST bin | IMDB | Time (s)
DAN-ROOT | - | 46.9 | 85.7 | - | 31
DAN-RAND | 77.3 | 45.4 | 83.2 | 88.8 | 136
DAN | 80.3 | 47.7 | 86.3 | 89.4 | 136
NBOW-RAND | 76.2 | 42.3 | 81.4 | 88.9 | 91
NBOW | 79.0 | 43.6 | 83.6 | 89.0 | 91
BiNB | - | 41.9 | 83.1 | - | -
NBSVM-bi | 79.4 | - | - | 91.2 | -
RecNN* | 77.7 | 43.2 | 82.4 | - | -
RecNTN* | - | 45.7 | 85.4 | - | -
DRecNN | - | 49.8 | 86.6 | - | 431
TreeLSTM | - | 50.6 | 86.9 | - | -
DCNN* | - | 48.5 | 86.9 | 89.4 | -
PVEC* | - | 48.7 | 87.8 | 92.6 | -
CNN-MC | 81.1 | 47.4 | 88.1 | - | 2,
WRRBM* | - | - | - | 89.2 | -

Table 1: DANs achieve comparable sentiment accuracies to syntactic functions (bottom third of table) but require much less training time (measured as time of a single epoch on the SST fine-grained task). Asterisked models are initialized either with different pretrained embeddings or randomly.

4.1 Sentiment Analysis

Recently, syntactic composition functions have revolutionized both fine-grained and binary (positive or negative) sentiment analysis. We conduct sentence-level sentiment experiments on the Rotten Tomatoes (RT) movie reviews dataset (Pang and Lee, 2005) and its extension with phrase-level labels, the Stanford Sentiment Treebank (SST) introduced by Socher et al. (2013b). Our model is also effective on the document-level IMDB movie review dataset of Maas et al. (2011).

4.1.1 Neural Baselines

Most neural approaches to sentiment analysis are variants of either recursive or convolutional networks. Our recursive neural network baselines include standard RecNNs (Socher et al., 2011b), RecNTNs, the deep recursive network (DRecNN) proposed by İrsoy and Cardie (2014), and the TREE-LSTM of Tai et al. (2015). Convolutional network baselines include the dynamic convolutional network (Kalchbrenner et al., 2014, DCNN) and the convolutional neural network multichannel (Kim, 2014, CNN-MC). Our other neural baselines are the sliding-window based paragraph vector (Le and Mikolov, 2014, PVEC)^3 and the word-representation restricted Boltzmann machine (Dahl et al., 2012, WRRBM), which only works on the document-level IMDB task.^4

^3 PVEC is computationally expensive at both training and test time and requires enough memory to store a vector for every paragraph in the training data.

4.1.2 Non-Neural Baselines

We also compare to non-neural baselines, specifically the bigram naïve Bayes (BINB) and naïve Bayes support vector machine (NBSVM-BI) models introduced by Wang and Manning (2012), both of which are memory-intensive due to huge feature spaces of size $|V|^2$.

4.1.3 DAN Configurations

In Table 1, we compare a variety of DAN and NBOW configurations^5 to the baselines described above. In particular, we are interested in not only comparing DAN accuracies to those of the baselines, but also how initializing with pretrained embeddings and restricting the model to only root-level labels affects performance. With this in mind, the NBOW-RAND and DAN-RAND models are initialized with random 300-dimensional word embeddings, while the other models are initialized with publicly-available 300-d GloVe vectors trained over the Common Crawl (Pennington et al., 2014). The DAN-ROOT model only has access to sentence-level labels for SST experiments, while all other models are trained on labeled phrases (if they exist) in addition to sentences. We train all NBOW and DAN models using AdaGrad (Duchi et al., 2011).

We apply DANs to documents by averaging the embeddings for all of a document’s tokens and then feeding that average through multiple layers as before. Since the representations computed by DANs are always d-dimensional vectors regardless of the input size, they are efficient with respect to both memory and computational cost. We find that the hyperparameters selected on the SST also work well for the IMDB task.

4.1.4 Dataset Details

We evaluate over both fine-grained and binary sentence-level classification tasks on the SST, and just the binary task on RT and IMDB. In the fine-grained SST setting, each sentence has a label from zero to four where two is the neutral class. For the binary task, we ignore all neutral sentences.^6

^4 The WRRBM is trained using a slow Metropolis-Hastings algorithm.

^5 Best hyperparameters chosen by cross-validation: three 300-d ReLU layers, word dropout probability p = 0.3, L2 regularization weight of 1e-5 applied to all parameters.

^6 Our fine-grained SST split is {train: 8,544, dev: 1,101, test: 2,210}, while our binary split is {train: 6,920, dev: 872, test: 1,821}. Split sizes increase by an order of magnitude when labeled phrases are added to the training set. For RT, we do 10-fold CV over a balanced binary dataset of 10, sentences. Similarly, for the IMDB experiments we use the provided balanced binary training set of 25,000 documents.

4.1.5 Results

The DAN achieves the second best reported result on the RT dataset, behind only the significantly slower CNN-MC model. It’s also competitive with more complex models on the SST and outperforms the DCNN and WRRBM on the document-level IMDB task. Interestingly, the DAN achieves good performance on the SST when trained with only sentence-level labels, indicating that it does not suffer from the vanishing error signal problem that plagues RecNNs. Since acquiring labelled phrases is often expensive (Sayeed et al., 2012; Iyyer et al., 2014b), this result is promising for large or messy datasets where fine-grained annotation is infeasible.

4.1.6 Timing Experiments

DANs require less time per epoch and, in general, require fewer epochs than their syntactic counterparts. We compare DAN runtime on the SST to publicly-available implementations of syntactic baselines in the last column of Table 1; the reported times are for a single epoch to control for hyperparameter choices such as learning rate, and all models use 300-d word vectors. Training a DAN on just sentence-level labels on the SST takes under five minutes on a single core of a laptop; when labeled phrases are added as separate training instances, training time jumps to twenty minutes.^7 All timing experiments were performed on a single core of an Intel I7 processor with 8GB of RAM.

^7 We also find that DANs take significantly fewer epochs to reach convergence than syntactic models.

4.2 Factoid Question Answering

DANs work well for sentiment analysis, but how do they do on other NLP tasks? We shift gears to a paragraph-length factoid question answering task and find that our model outperforms other unordered functions as well as a more complex syntactic RecNN model. More interestingly, we find that unlike the RecNN, the DAN significantly benefits from out-of-domain Wikipedia training data.

Quiz bowl is a trivia competition in which players are asked four-to-six sentence questions about entities (e.g., authors, battles, or events). It is an ideal task to evaluate DANs because there is prior

[Figure 3 plot, "Perturbation Response vs. Layer". x-axis: Layer (0 to 5); y-axis: Perturbation Response (0 to 50); series: cool, okay, the worst, underwhelming.]

Figure 3: Perturbation response (difference in 1-norm) at each layer of a 5-layer DAN after replacing awesome in the film’s performances were awesome with four words of varying sentiment polarity. While the shallow NBOW model does not show any meaningful distinctions, we see that as the network gets deeper, negative sentences are increasingly different from the original positive sentence.

[Figure 4 plot, "Effect of Depth on Sentiment Accuracy". x-axis: Number of Layers (0 to 6); y-axis: Binary Classification Accuracy (83 to 87); series: DAN, DAN-ROOT.]

Figure 4: Two to three layers is optimal for the DAN on the SST binary sentiment analysis task, but adding any depth at all is an improvement over the shallow NBOW model.

5 How Do DANs Work?

In this section we first examine how the deep layers of the DAN amplify tiny differences in the vector average that are predictive of the output labels. Next, we compare DANs to DRecNNs on sentences that contain negations and contrastive conjunctions and find that both models make similar errors despite the latter’s increased complexity. Finally, we analyze the predictive ability of unsupervised word embeddings on a simple sentiment task in an effort to explain why initialization with these embeddings improves the DAN.

5.1 Perturbation Analysis

Following the work of İrsoy and Cardie (2014), we examine our network by measuring the response at each hidden layer to perturbations in an input sentence. In particular, we use the template the film’s performances were awesome and replace the final word with increasingly negative polarity words (cool, okay, underwhelming, the worst). For each perturbed sentence, we observe how much the hidden layers differ from those associated with the original template in 1-norm.

Figure 3 shows that as a DAN gets deeper, the differences between negative and positive sentences become increasingly amplified. While nonexistent in the shallow NBOW model, these differences are visible even with just a single hidden layer, thus explaining why deepening the NBOW improves sentiment analysis as shown in Figure 4.
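The measurement itself is simple to express in code. The sketch below computes the 1-norm difference between two sentences at the averaging layer and at each hidden layer. Everything numeric in it is hypothetical: the 2-dimensional embeddings are invented, and the two hidden layers simply double their input so that the amplification is easy to follow by hand. It illustrates the procedure only and does not reproduce the values in Figure 3.

```python
def layer_activations(tokens, embeddings, hidden_layers):
    """Return [average, hidden layer 1, hidden layer 2, ...] for a sentence."""
    n = len(tokens)
    z = [sum(column) / n for column in zip(*(embeddings[t] for t in tokens))]
    activations = [z]
    for weights, bias in hidden_layers:
        # Affine map followed by ReLU (an assumed nonlinearity).
        z = [max(0.0, sum(w * x for w, x in zip(row, z)) + b)
             for row, b in zip(weights, bias)]
        activations.append(z)
    return activations


def perturbation_response(original, perturbed, embeddings, hidden_layers):
    """1-norm of the difference between the two sentences at every layer."""
    a = layer_activations(original, embeddings, hidden_layers)
    b = layer_activations(perturbed, embeddings, hidden_layers)
    return [sum(abs(x - y) for x, y in zip(la, lb)) for la, lb in zip(a, b)]


# Hypothetical toy embeddings and weights.
embeddings = {
    "the": [0.1, 0.1], "film's": [0.2, 0.1], "performances": [0.1, 0.2],
    "were": [0.1, 0.1], "awesome": [1.0, 0.2], "okay": [0.5, 0.2],
}
doubling = ([[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0])
hidden_layers = [doubling, doubling]

template = ["the", "film's", "performances", "were", "awesome"]
perturbed = template[:-1] + ["okay"]

response = perturbation_response(template, perturbed, embeddings, hidden_layers)
# response is approximately [0.1, 0.2, 0.4]: index 0 is the plain average,
# and each later entry is one hidden layer deeper.
```

In this toy setup a single changed word moves the average by only 0.1, and each layer doubles that gap. A trained network is not constrained to behave this way, but the bookkeeping is the same: run both sentences through the network and compare the activations layer by layer.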

5.2 Handling Negations and “but”: Where Syntax is Still Needed

While DANs outperform other bag-of-words models, how can they model linguistic phenomena such as negation without considering word order? To evaluate DANs over tougher inputs, we collect 92 sentences, each of which contains at least one negation and one contrastive conjunction, from the dev and test sets of the SST.^9 Our fine-grained accuracy is higher on this subset than on the full dataset, improving almost five percent absolute accuracy to 53.3%. The DRecNN model of İrsoy and Cardie (2014) obtains a similar accuracy of 51.1%, contrary to our intuition that syntactic functions should outperform unordered functions on sentences that clearly require syntax to understand.^10

Are these sentences truly difficult to classify? A close inspection reveals that both the DAN and the DRecNN have an overwhelming tendency to predict negative sentiment (60.9% and 55.4% of the time for the DAN and DRecNN respectively) when they see a negation compared to positive sentiment (35.9% for DANs, 34.8% for DRecNNs). If we further restrict our subset of sentences to only those with positive ground truth labels, we find that while both models struggle, the DRecNN obtains 41.7% accuracy, outperforming the DAN’s 37.5%.

To understand why a negation or contrastive conjunction triggers a negative sentiment prediction, we show six sentences from the negation subset and four synthetic sentences in Table 3, along with both models’ predictions. The token-level predictions in the table (shown as colored boxes) are computed by passing each token through the DAN as separate test instances. The tokens not and n’t are strongly predictive of negative sentiment. While this simplified “negation” works for many sentences in the datasets we consider, it prevents the DAN from reasoning about double negatives, as in “this movie was not bad”. The DRecNN does slightly better in this case by predicting a lesser negative polarity than the DAN; however, we theorize that still more powerful syntactic composition functions (and more labelled instances of negation and related phenomena) are necessary to truly solve this problem.

^9 We search for non-neutral sentences containing not / n’t, and but. 48 of the sentences are positive while 44 are negative.

^10 Both models are initialized with pretrained 300-d GloVe embeddings for fair comparison.

Sentence | DAN | DRecNN | Ground Truth
a lousy movie that’s not merely unwatchable , but also unlistenable | negative | negative | negative
if you’re not a prepubescent girl , you’ll be laughing at britney spears ’ movie-starring debut whenever it does n’t have you impatiently squinting at your watch | negative | negative | negative
blessed with immense physical prowess he may well be, but ahola is simply not an actor | positive | neutral | negative
who knows what exactly godard is on about in this film , but his words and images do n’t have to add up to mesmerize you. | positive | positive | positive
it’s so good that its relentless , polished wit can withstand not only inept school productions , but even oliver parker ’s movie adaptation | negative | positive | positive
too bad , but thanks to some lovely comedic moments and several fine performances , it’s not a total loss | negative | negative | positive
this movie was not good | negative | negative | negative
this movie was good | positive | positive | positive
this movie was bad | negative | negative | negative
the movie was not bad | negative | negative | positive

Table 3: Predictions of DAN and DRecNN models on real (top) and synthetic (bottom) sentences that contain negations and contrastive conjunctions. In the first column, words colored red individually predict the negative label when fed to a DAN, while blue words predict positive. The DAN learns that the negators not and n’t are strong negative predictors, which means it is unable to capture double negation as in the last real example and the last synthetic example. The DRecNN does slightly better on the synthetic double negation, predicting a lower negative polarity.
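The selection rule in footnote 9 (non-neutral sentences containing not or n’t together with but) amounts to a simple token filter. The sketch below is a hypothetical illustration of such a filter. It assumes whitespace-tokenized text in which n't is its own token, as in the Table 3 sentences, and it uses plain string labels. The function name and the four example sentences are chosen for the demonstration and are not the actual 92-sentence subset.

```python
NEGATORS = {"not", "n't"}


def in_negation_subset(sentence, label):
    """True for non-neutral sentences with a negator and the word 'but'."""
    tokens = sentence.split()
    has_negator = any(token in NEGATORS for token in tokens)
    return label != "neutral" and has_negator and "but" in tokens


examples = [
    ("a lousy movie that's not merely unwatchable , but also unlistenable",
     "negative"),
    ("this movie was not good", "negative"),
    ("too bad , but thanks to some lovely comedic moments and several fine "
     "performances , it's not a total loss", "positive"),
    ("this movie was good", "positive"),
]

# Keeps the first and third sentences: both contain a negator and "but".
subset = [sentence for sentence, label in examples
          if in_negation_subset(sentence, label)]
```

The two synthetic sentences are rejected because they lack a contrastive conjunction, which is why Table 3 lists them separately from the real examples.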

5.3 Unsupervised Embeddings Capture Sentiment

Our model consistently converges slower to a worse solution (dropping 3% in absolute accuracy on coarse-grained SST) when we randomly initialize the word embeddings. This does not apply to just DANs; both convolutional and recursive networks do the same (Kim, 2014; İrsoy and Cardie, 2014). Why are initializations with these embeddings so crucial to obtaining good performance? Is it possible that unsupervised training algorithms are already capturing sentiment?

We investigate this theory by conducting a simple experiment: given a sentiment lexicon containing both positive and negative words, we train a logistic regression to discriminate between the associated word embeddings (without any fine-tuning). We use the lexicon created by Hu and Liu (2004), which consists of 2,006 positive words and 4, negative words. We balance and split the dataset into 3,000 training words and 1,000 test words. Using 300-dimensional GloVe embeddings pretrained over the Common Crawl, we obtain over 95% accuracy on the unseen test set, supporting the hypothesis that unsupervised pretraining over large corpora can capture properties such as sentiment.

Intuitively, after the embeddings are fine-tuned during DAN training, we might expect a decrease in the norms of stopwords and an increase in the

References

Carmen Banea, Di Chen, Rada Mihalcea, Claire Cardie, and Janyce Wiebe. 2014. Simcompass: Using deep learn- ing word embeddings to assess cross-level similarity. In SemEval.

Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective- noun constructions in semantic space. In Proceedings of Empirical Methods in Natural Language Processing.

Yoshua Bengio, R ejean Ducharme, Pascal Vincent, and Chris-´ tian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intel- ligence, 35(8):1798–1828.

Jordan Boyd-Graber, Brianna Satinoff, He He, and Hal Daum e´ III. 2012. Besting the quiz master: Crowdsourcing incre- mental classification games. In Proceedings of Empirical Methods in Natural Language Processing.

Danqi Chen and Christopher D Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of Empirical Methods in Natural Language Processing.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio.

  1. Learning phrase representations using rnn encoder- decoder for statistical machine translation. In Proceedings of Empirical Methods in Natural Language Processing.

Bob Coecke, Mehrnoosh Sadrzadeh, and Stephen Clark. 2010. Mathematical foundations for a compositional distribu- tional model of meaning. Linguistic Analysis (Lambek Festschirft).

Ronan Collobert and Jason Weston. 2008. A unified ar- chitecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the International Conference of Machine Learning.

George E Dahl, Ryan P Adams, and Hugo Larochelle. 2012. Training restricted boltzmann machines on word observa- tions. In Proceedings of the International Conference of Machine Learning.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research.

Katrin Erk and Sebastian Pad ´o. 2008. A structured vector space model for word meaning in context. In Proceedings of Empirical Methods in Natural Language Processing.

Edward Grefenstette and Mehrnoosh Sadrzadeh. 2011. Ex- perimental support for a categorical compositional distri- butional model of meaning. In Proceedings of Empirical Methods in Natural Language Processing.

Karl Moritz Hermann, Edward Grefenstette, and Phil Blun- som. 2013. ”not not bad” is not ”bad”: A distributional account of negation. Proceedings of the ACL Workshop on Continuous Vector Space Models and their Compositional- ity.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Knowledge Discovery and Data Mining.

Ozan İrsoy and Claire Cardie. 2014. Deep recursive neural networks for compositionality in language. In Proceedings of Advances in Neural Information Processing Systems.

Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher, and Hal Daumé III. 2014a. A neural network for factoid question answering over paragraphs. In Proceedings of Empirical Methods in Natural Language Processing.

Mohit Iyyer, Peter Enns, Jordan Boyd-Graber, and Philip Resnik. 2014b. Political ideology detection using recursive neural networks. In Proceedings of the Association for Computational Linguistics.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent convolutional neural networks for discourse compositionality. In ACL Workshop on Continuous Vector Space Models and their Compositionality.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of the Association for Computational Linguistics.

Dimitri Kartsaklis and Mehrnoosh Sadrzadeh. 2013. Prior disambiguation of word tensors for constructing sentence vectors. In Proceedings of Empirical Methods in Natural Language Processing.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of Empirical Methods in Natural Language Processing.

Quoc V Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the International Conference of Machine Learning.

Jiwei Li. 2014. Feature weight tuning for recursive neural networks. CoRR, abs/1412.3714.

Shujie Liu, Nan Yang, Mu Li, and Ming Zhou. 2014. A recursive recurrent neural network for statistical machine translation. In Proceedings of the Association for Computational Linguistics.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the Association for Computational Linguistics.

Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. In Proceedings of the Association for Computational Linguistics.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of Empirical Methods in Natural Language Processing.

Asad B. Sayeed, Jordan Boyd-Graber, Bryan Rusk, and Amy Weinberg. 2012. Grammatical structures for word-level sentiment detection. In North American Association of Computational Linguistics.

Richard Socher, Eric H. Huang, Jeffrey Pennington, Andrew Y. Ng, and Christopher D. Manning. 2011a. Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. In Proceedings of Advances in Neural Information Processing Systems.

Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. 2011b. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. In Proceedings of Empirical Methods in Natural Language Processing.

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013a. Parsing With Compositional Vector Grammars. In Proceedings of the Association for Computational Linguistics.

Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013b. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of Empirical Methods in Natural Language Processing.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1).

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of Advances in Neural Information Processing Systems.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks.

Tim Van de Cruys. 2014. A neural network approach to selectional preference acquisition. In Proceedings of Empirical Methods in Natural Language Processing.

Sida I. Wang and Christopher D. Manning. 2012. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the Association for Computational Linguistics.

Jason Weston, Samy Bengio, and Nicolas Usunier. 2011. Wsabie: Scaling up to large vocabulary image annotation. In International Joint Conference on Artificial Intelligence.