













































































































































Human Computer Interaction and Design EIT Digital Master School Interactive Technology University of Twente P.O. Box 217 7500 AE Enschede The Netherlands
ArgueBot: Enabling debates through a hybrid retrieval-generation-based chatbot
Supervisors from University of Twente
Dr. M. Theune Prof. Dr. D.K.J. Heylen J.B. van Waterschoot, MSc
Supervisor from Findwise
J. Bratt, MSc
Abstract
The goal of this study is to develop a debate platform, the ArgueBot, that can maintain a meaningful, human-like debate with the user on various topics. The ArgueBot uses a hybrid model that combines retrieval-based and generative models. The retrieval model uses cosine similarity to compare the user input with the argument candidates for a specific debate. The generative model compensates for the main limitation of the retrieval model, which is restricted to the arguments stored in the database. The ArgueBot’s architecture uses Dialogflow, Flask, spaCy, and machine learning technologies. User tests and a survey were used to evaluate the chatbot’s performance. The user tests showed that the ArgueBot has potential, but it needs better context understanding, a more accurate stance classifier, and a better generative model.
algorithm that improves the argument detection; and finally Text-to-Speech (TTS) systems that convert text into spoken voice output and give the Debater its voice.
In the meantime, chatbots are gaining more and more momentum as a new platform for human-computer interaction. According to Gartner, Inc., by 2022 twenty-five percent of enterprises will have integrated virtual customer assistants and chatbots within their platforms (^4). However, current chatbot systems still have several limitations, such as an incorrect understanding of the context (meaning) of the user utterance, a lack of empathy, and the inability to understand the social and emotional cues that exist in human-to-human communication (Klopfenstein et al., 2017; Moore et al., 2017).
1.1 Problem Statement
This research aims to create a chatbot, called ArgueBot, that can maintain a meaningful, human-like debate with users on various topics.
The problem statement for this research is defined as:
How can a hybrid retrieval-generation-based chatbot maintain a debate with a user for various topics?
The problem statement can be divided into sub-questions:
SQ:1 How can the model recognize and handle the arguments?
SQ:2 How can stance classification be applied to conversational agents?
SQ:3 What is an appropriate model for the chatbot’s response generation?
SQ:4 How can human-like conversation with the chatbot be carried out in the debate domain?
SQ:5 How can such a chatbot be evaluated?
The research presented in this thesis was carried out at Findwise AB, a consultancy company that provides search-driven solutions (^5). Findwise supported this project with guidance and testing.
(^4) https://www.gartner.com/smarterwithgartner/gartner-top-10-strategic-technology-trends-for-2019/
(^5) https://findwise.com/en
1.2 Thesis Structure
Chapter 2. Background and Related Work
This chapter elaborates on the background of the research topic and on related work within the field. It gives more information about existing methods for argumentation mining and for building and evaluating chatbots. Moreover, research questions SQ: 1, 2, 3 and 5 will be answered in relation to the previous work.
Chapter 3. First Implementation with Basic Functionalities
This chapter describes the chosen methods and user tests for the first implementation of the ArgueBot. Here, research questions SQ: 1, 2, 3, 4 and 5 will be answered in relation to the first implementation of the ArgueBot.
Chapter 4. Second Implementation with Machine Learning
This chapter describes the changes made in the second implementation of the ArgueBot. Here, research questions SQ: 1, 2 and 3 will be answered in relation to the second implementation of the ArgueBot.
Chapter 5. Final evaluation of the ArgueBot
This chapter will present the results for the evaluation of the second implementation of the ArgueBot. Here, research question SQ: 5 will be answered in relation to the second implementation of the ArgueBot.
Chapter 6. Discussion
Here, the results presented in the previous chapter will be discussed, together with their challenges and limitations.
Chapter 7. Conclusion
This chapter will summarize the findings and propose how they can be further improved in future work. Here, all research questions will be answered with regard to the whole project.
Consider the following text extracted from the Wikipedia article "Ethics of artificial intelligence": "Joseph Weizenbaum argued in 1976 that AI technology should not be used to replace people in positions that require respect and care, such as any of these: customer service representative [...], therapist [...], nursemaid for the elderly [...], soldier, judge, police officer. Weizenbaum explains that we require authentic feelings of empathy from people in these positions. If machines replace them, we will find ourselves alienated, devalued and frustrated. Artificial intelligence, if used in this way, represents a threat to human dignity [...]"^1
Here, "AI technology should not be used to replace people in positions that require respect and care" is the conclusion (the claim, the core of the argument). "Weizenbaum explains that we require authentic feelings of empathy from people in these positions", "If machines replace them, we will find ourselves alienated, devalued and frustrated", and "Artificial intelligence, if used in this way, represents a threat to human dignity" are the premises (statements that provide reason, evidence or support for the conclusion).
An inference is a process of drawing conclusions based on the premises and in the above-mentioned example would be:
Habernal and Gurevych (2017) proposed a machine learning model for identifying argument components that uses the following feature sets: baseline lexical features; structural, morphological, and syntactic features; semantic, coreference, and discourse features; and embedding features. These feature sets were used to identify argument components and extract the arguments from annotated forum posts.
Another method was proposed by Levy et al. (2017), who used it to detect topic-relevant claims in data extracted from Wikipedia. The study used a claim sentence query to extract sentences containing the word “that”, followed by the claim topic, followed by any word from a pre-defined lexicon. This lexicon included words characteristic of claims, such as argue, disagree, argument, claim, conflict and others (Levy et al., 2017).
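The claim sentence query can be illustrated with a minimal sketch. It is not the implementation of Levy et al. (2017): the function name, the tiny lexicon (only the five words quoted above), and the exact-match rule on lexicon words are assumptions made for illustration.

```python
import re

# Hypothetical mini-lexicon: only the five words quoted in the text.
CLAIM_LEXICON = {"argue", "disagree", "argument", "claim", "conflict"}


def is_claim_candidate(sentence, topic, lexicon=CLAIM_LEXICON):
    """Check the pattern: "that" ... topic ... any lexicon word."""
    text = sentence.lower()
    that_match = re.search(r"\bthat\b", text)
    if that_match is None:
        return False
    # The topic must appear after the word "that".
    topic_pos = text.find(topic.lower(), that_match.end())
    if topic_pos == -1:
        return False
    # A lexicon word must appear after the topic (exact match, no stemming).
    tail_tokens = re.findall(r"[a-z]+", text[topic_pos + len(topic):])
    return any(token in lexicon for token in tail_tokens)
```

For example, `is_claim_candidate("Critics say that online gambling fuels conflict in families.", "online gambling")` matches the pattern, whereas a sentence without "that" does not.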
(^1) https://en.wikipedia.org/wiki/Ethics_of_artificial_intelligence
2.1 Argument mining 5
Furthermore, information retrieval techniques can be used to structure the arguments (indexing) and to relate them to each other by computing how similar or dissimilar they are, which makes it possible to find and retrieve the most relevant arguments and counterarguments (Stab et al., 2018; Ma et al., 2018; Wachsmuth et al., 2017; Wachsmuth et al., 2018; Zhu et al., 2018). Information retrieval can be defined as finding unstructured data that satisfies an information need from within a large collection of data (Manning, 2008). Here, unstructured data is data without a clear, semantically distinguishable structure that is easy for a computer to understand.
Stance classification is a field within argument mining that helps to identify whether an argument is for or against the issue being debated. Mandya et al. (2016) proposed to extract the following features for stance classification: topic-stance features (specific words associated with topics); stance-bearing terminology (words connected by adjectival modifier (amod) and noun compound modifier (nn) dependency relations that can indicate the stance in the argument); logical point features (extraction of words following the subject-verb-object (SVO) rule, which might capture the claim); and unigrams and dependency features (used to classify shorter posts).
Levy et al. (2017) proposed a method of claim stance classification in regard to a given topic. The study used precise semantic analysis of the debate topic and the claim (the sentiment of the claim towards its target), including target identification (through detecting the noun phrases in the claim), and contrast detection between the claim and the topic targets (through their relations), where each of these tasks had a separate machine learning classifier.
2.2 Chatbots
This section presents recent developments in conversational agents, also known as chatbots. A chatbot is a computer program that can mimic written or spoken human speech in interactions with humans (Kim et al., 2018).
Chatbots can be broadly classified into generative which generate a response based on natural language generation techniques (Kim et al., 2018; Le et al., 2018), and
To date, at least two debate chatbots have been built: Debbie, which uses a similarity algorithm to retrieve counter-arguments (Rakshit et al., 2019), and Dave, which used retrieval-based and generative models separately (Le et al., 2018).
Chatbot Debbie used corpora compiled by Swanson et al. (2015) containing controversial topics from the Internet Argument Corpus (Abbott et al., 2016) and dialogues from online debate forums. The authors used the Argument Quality (AQ) regressor to choose the best arguments from a database containing statements for and against three controversial topics: death penalty, gay marriage, and gun control. In Debbie, users could pick a topic and specify their stance (the chatbot assumes that the user utterance is always argumentative). The system then used a similarity algorithm based on the UMBC STS score (which combines lexical similarity features such as latent semantic word similarity and WordNet knowledge) to retrieve a ranked list of the most appropriate counter-arguments that had not previously been used by the chatbot. The authors grouped the arguments into clusters (groups of documents that are semantically similar (Manning, 2008)) to speed up the retrieval process. Debbie continues the debate until the user terminates the chat. The chatbot was evaluated by comparing the average response times of the different retrieval methods used in the implementation (Rakshit et al., 2019).
Chatbot Dave (Le et al., 2018) also used Internet Argument Corpora (Abbott et al.,
(^2) https://www.kaggle.com/c/quora-question-pairs
implemented in the chatbot interface, where the users were able to rate each chatbot response from 1 (very bad) to 5 (very good) (Le et al., 2018).
The chatbot described in this work differs from the above-mentioned chatbots in several ways. Firstly, the dataset used for the chatbot’s knowledge base is different, which results in different discussion topics. Secondly, the implementation model is different: while chatbot Debbie uses the UMBC STS similarity score and chatbot Dave uses the Manhattan LSTM similarity model, this project uses cosine similarity in combination with GloVe embedding vectors. Additionally, the final implementation of the chatbot presented in this work uses a hybrid model that combines the retrieval and the generative models.
The Turing test, which tests a machine’s ability to exhibit intelligent behavior equivalent to human intelligence (Turing, 1950), inspired many researchers and engineers to develop conversational systems. One such example is Eliza, a computer program that could imitate human-to-human conversations through pattern matching and specific phrasing (Weizenbaum, 1966). The most recent chatbot that passed the Turing test is Mitsuku (four-time Loebner Prize winner), built in Pandorabots (^3) using the Artificial Intelligence Markup Language (AIML). However, chatbots built using AIML have difficulty maintaining a dialogue over a longer time (Shum et al., 2018) and are not able to extract the complex information needed in the debate domain.
Currently, there are many online tools available for building chatbots: Dialogflow, Microsoft Bot Framework (Cortana), IBM Watson Conversation, and many others. Among these, Dialogflow (^4) is a free platform for creating interfaces based on natural language conversations, whose functionality can be extended with webhooks (a way to send information between different applications). Both Microsoft Bot Framework and IBM Watson Conversation have a free version that allows only a limited number of API calls per month.
When it comes to evaluating a chatbot’s performance, the most recent tool is ChatEval (^5), which includes evaluation datasets with both human-annotated and automated baselines (Sedoc et al., 2018). The Turing test can be used to evaluate how human-
(^3) https://pandorabots.com/docs/ (^4) https://dialogflow.com/ (^5) https://chateval.org/
Chapter 3. First Implementation with Basic Functionalities
obedient is to strictly limit the spectrum of acceptable opinion, but allow very lively debate within that spectrum—even encourage the more critical and dissident views. That gives people the sense that there’s free thinking going on, while all the time the presuppositions of the system are being reinforced by the limits put on the range of the debate.
— Noam Chomsky (The Common Good (1998))
This chapter describes the first implementation of the ArgueBot platform, the design choices, and how it was tested. The goal of the first implementation was to build the base functionalities for interaction with the user. Henceforward, the ArgueBot chatbot will be referred to as an agent.
3.1 Dataset
The knowledge base for the chatbot consists of the ArguAna Counterargs corpus (Wachsmuth et al., 2018). Table 3.1 lists the 15 topics in the dataset, which contains 1069 debates with 6779 points and 6753 counterpoints (see an example of how a debate is composed in figure 3.1) distributed between test, training and validation folders. The arguments consist of points with both pro and con stances towards the debate’s statement. Each such point includes a conclusion, premises and an inference within its text, which are not separated or labelled (see chapter 2.1.1). Each debate has an introduction with the relevant information needed to make an argument. The data in the dataset was crawled from idebate.org (^1), the website of an international debate education association for young people that offers debates written by experienced debaters from around the world. The ArguAna Counterargs corpus therefore includes high-quality arguments, strengthened with citations. The downside of the corpus is its formal style of argumentation, which might differ from the written arguments provided by the user in the chatbot. This corpus was chosen because it includes debate background and arguments with different stances; it therefore provides stance labels for each argument and eliminates the problem of stance classification of the existing data.

(^1) https://idebate.org/
Topic                 Debates   Points   Counterpoints
Culture                    46      278             278
Digital freedoms           48      341             341
Economy                    95      590             588
Education                  58      382             381
Environment                36      215             215
Free speech debate         43      274             273
Health                     57      334             333
International             196     1315            1307
Law                       116      732             730
Philosophy                 50      320             320
Politics                  155      982             978
Religion                   30      179             179
Science                    41      271             269
Society                    75      436             431
Sport                      23      130             130

Training set              644     4083            4065
Validation set            211     1290            1287
Test set                  214     1406            1401

Total                    1069     6779            6753

Tab. 3.1.: Distribution of debates, points, and counters over the topics in the dataset (Wachsmuth et al., 2018)
The first implementation used 12 debates marked as "Junior" in the dataset, with the claims: "Ban online gambling", "Ban animal testing", "Kill One to Save Many", "Banning School Uniforms", "Poetry should not be taught in schools", "Raise the school leaving age to 18", "Ban the niqab and other face coverings in schools", "Dreaming of a white Christmas", "Introduce a “fat tax”", "Homework is a waste of time", "Every child should have a mobile phone", "Sponsoring children in developing countries". These debates were designed for a younger audience and included simplified topics with simplified arguments, which matched the purpose of the first implementation: creating a platform with basic functionalities using simplified debates. Each debate included at least six arguments (at least three for and three against the main claim). Each argument included one point and one counterpoint, each generally about four sentences long.
Every time the user chooses a new debate topic, the model finds the 100 most used words for that debate in the database and generates a debate object (a memory object for the specific user, to be used by the model) with response candidates that are also saved in the database. These 100 words, henceforth called "debate-specific words", are then sent to the argument entity in Dialogflow through the API. When the user sends a chat message, the message is forwarded to Dialogflow, which detects the intent (context) of the message using the debate-specific words in the argument entity and the sentence composition. If the user input is classified as an argument, it is analyzed further by the model: the model checks how similar the user argument is to the argument candidates stored in the database and retrieves the appropriate response. If Dialogflow classifies the user input with some other intent, the model replies with a predefined response. Each section in Figure 3.2 marked with a blue rectangle is described in more detail below.
Fig. 3.2.: Architecture of first implementation
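The selection of the debate-specific words can be sketched as a simple frequency count. This is a minimal illustration, not the ArgueBot code: the function name and the tokenization are assumptions. The text does not say whether stop words were filtered, so the sketch does not filter them.

```python
import re
from collections import Counter


def debate_specific_words(argument_texts, top_n=100):
    """Return the top_n most frequent lowercase word tokens across the texts."""
    counts = Counter()
    for text in argument_texts:
        # Naive tokenization: runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return [word for word, _ in counts.most_common(top_n)]
```

In the architecture described above, the resulting list would then be sent to the argument entity in Dialogflow; that API call is omitted here.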
Pre-processing included removing information within brackets, such as citations and explanations. Additional information for the debate backgrounds that explained the nature of the debate was also removed. These were removed by using regular expressions.
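A minimal sketch of this kind of regular-expression cleaning is given below. The actual expressions used for the ArgueBot are not given in the text, so the pattern and the function name are assumptions. The sketch handles round and square brackets without nesting.

```python
import re

# Matches "(...)" or "[...]" without nested brackets, plus leading whitespace.
BRACKETED = re.compile(r"\s*(\([^()]*\)|\[[^\[\]]*\])")


def strip_bracketed(text):
    """Remove bracketed citations and explanations, then tidy the spacing."""
    cleaned = BRACKETED.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```

For instance, `strip_bracketed("Gambling is addictive [1] and costly (see Smith, 2010).")` yields the sentence without the two bracketed parts.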
The debate names were changed by a script from, for example, "This house Would Ban School Uniforms - Junior" to "banning school uniforms - Junior".
3.2 Architecture 14
The "This House" wording format belongs to the British Parliamentary debate style that is a default format for many university societies. British Parliament consists of "Houses", thus "this house.." represents a motion to be discussed in the debate.
The name change consisted of tokenizing the name, removing the first two tokens if they were "this" and "house", checking the tense of the verb, and changing it to the present participle ("-ing") form. Tokenizing and verb-checking were implemented with the spaCy library (^2). The arguments were then vectorized using spaCy’s GloVe vectors model package "en_vectors_web_lg" and transformed into strings to save space. The use of these vectors is explained further in the next section.
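The renaming steps can be sketched in plain Python as follows. This is a simplified stand-in for the spaCy-based script: the whitespace tokenization, the naive "-ing" rule, and the dropping of an auxiliary such as "would" (inferred from the example above, not stated in the text) are all assumptions.

```python
VOWELS = "aeiou"
AUXILIARIES = {"would", "should"}  # assumption: auxiliaries are dropped


def present_participle(verb):
    """Naive '-ing' form; fails on irregular cases such as 'open' or 'be'."""
    if verb.endswith("ie"):
        return verb[:-2] + "ying"
    if verb.endswith("e") and not verb.endswith("ee"):
        return verb[:-1] + "ing"
    # Double the final consonant of short consonant-vowel-consonant endings.
    if (len(verb) >= 3 and verb[-1] not in VOWELS + "wxy"
            and verb[-2] in VOWELS and verb[-3] not in VOWELS):
        return verb + verb[-1] + "ing"
    return verb + "ing"


def normalize_debate_name(name):
    """Turn 'This house Would Ban X - Junior' into 'banning x - Junior'."""
    title, sep, suffix = name.partition(" - ")
    tokens = title.lower().split()
    if tokens[:2] == ["this", "house"]:
        tokens = tokens[2:]
    if tokens and tokens[0] in AUXILIARIES:
        tokens = tokens[1:]
    if tokens:
        tokens[0] = present_participle(tokens[0])
    return " ".join(tokens) + sep + suffix
```

A part-of-speech tagger, as used in the original script, is more robust than this rule of thumb because it can confirm that the first remaining token is a verb.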
The pre-processed debates with their arguments were saved to an SQLite database (^3) to reduce the computing time for the model and to make the retrieval process easier.
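Storing the vectorized arguments as strings in SQLite could look like the sketch below. The table layout, the column names, and the space-separated string encoding are hypothetical; the text only states that the vectors were transformed into strings and saved to SQLite.

```python
import sqlite3


def create_schema(conn):
    """Create a hypothetical table for pre-processed arguments."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS arguments ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "debate TEXT, stance TEXT, text TEXT, vector TEXT)"
    )


def vector_to_string(vector):
    """Encode a vector as a space-separated string."""
    return " ".join(f"{value:.6f}" for value in vector)


def string_to_vector(encoded):
    """Decode a space-separated string back into a list of floats."""
    return [float(value) for value in encoded.split()]


def save_argument(conn, debate, stance, text, vector):
    conn.execute(
        "INSERT INTO arguments (debate, stance, text, vector) VALUES (?, ?, ?, ?)",
        (debate, stance, text, vector_to_string(vector)),
    )


def load_candidates(conn, debate):
    """Return (id, stance, text, vector) tuples for one debate."""
    rows = conn.execute(
        "SELECT id, stance, text, vector FROM arguments WHERE debate = ?",
        (debate,),
    ).fetchall()
    return [(row[0], row[1], row[2], string_to_vector(row[3])) for row in rows]
```

Calling `sqlite3.connect(":memory:")` gives a throwaway database that is convenient for trying the functions out.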
The model serves two main purposes: handling the debate object (a memory object used by the model) for each user, and analyzing the user input.
The user-input handler used the spaCy library to vectorize the input. It used cosine similarity to compare the vectorized user input to all the argument candidates for the chosen debate. It then retrieved the id of the argument that had the highest similarity and sent it to the debate model. The cosine similarity between two vectors is a measure that calculates the cosine of the angle between these vectors projected in a multi-dimensional space. Given two vectors $\vec{a}$ and $\vec{b}$, their cosine similarity is

$$\cos \phi = \frac{\vec{a} \cdot \vec{b}}{\lVert \vec{a} \rVert \times \lVert \vec{b} \rVert}$$

where $\vec{a}$ and $\vec{b}$ are multi-dimensional vectors over the term set $T = \{t_1, \ldots, t_m\}$ and each dimension represents a word with its weight in the sentence. The cosine similarity is a non-negative number between 0 and 1 (Huang, 2008).
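The retrieval step can be sketched in plain Python as follows. This is an illustration under assumptions, not the ArgueBot code: the function names are hypothetical, and the vectors are passed in as plain lists rather than produced by spaCy.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def most_similar_argument(user_vector, candidates):
    """Return (argument_id, score) for the best match.

    candidates maps an argument id to its vector and must not be empty.
    """
    scores = {
        arg_id: cosine_similarity(user_vector, vector)
        for arg_id, vector in candidates.items()
    }
    best_id = max(scores, key=scores.get)
    return best_id, scores[best_id]
```

With dense embedding vectors, whose components may be negative, the score computed this way can also fall below zero; the range from 0 to 1 applies to vectors with non-negative weights.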
The handler then used the VADER sentiment analyzer from the NLTK library (^4) to classify the stance of the user input through sentiment analysis. The polarity of the user input (positive, neutral, or negative sentiment) determined whether it was classified as for or against the main claim of the debate: positive sentiment was interpreted as a "pro" stance, negative sentiment as a "con" stance, and neutral sentiment as an undefined stance.
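The mapping from sentiment to stance can be sketched as below. The sketch takes a VADER compound score as input, which in NLTK is obtained from `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]`. The cut-off of 0.05 is the threshold conventionally used with VADER and is an assumption here, since the text does not state which threshold was used.

```python
def stance_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a debate stance.

    The threshold is an assumed value, not taken from the thesis.
    """
    if compound >= threshold:
        return "pro"
    if compound <= -threshold:
        return "con"
    return "undefined"
```

This mapping treats sentiment as a proxy for stance, so a negatively worded argument in favor of the claim would be labelled "con".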
(^2) https://spacy.io/ (^3) https://www.sqlite.org/index.html (^4) https://www.nltk.org/