Chatbots in Argument Mining: Types, Stance Classification, and Evaluation (Lecture notes, Machine Learning)

Topics: Chatbots, Artificial Intelligence, Information Retrieval, Natural Language Processing, Argument Mining

This document discusses the limitations of current chatbot systems in understanding context and emotional cues, and proposes methods for argument mining and stance classification using chatbots. It covers different types of chatbots, evaluation tools, and datasets, and suggests potential solutions to improve chatbot performance in the debate domain.

What you will learn

  • How can personalized generated responses improve chatbot performance in the debate domain?
  • How can argument mining be used with chatbots for stance classification?
  • What are the limitations of current chatbot systems in understanding context and emotional cues?
  • What are the different types of chatbots and how do they differ in functionality?
  • What evaluation tools are used to assess the performance of chatbot systems?

University of Twente
Human Computer Interaction and Design
EIT Digital Master School, Interactive Technology
University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands

M.Sc. Thesis

ArgueBot: Enabling debates through a hybrid retrieval-generation-based chatbot

Iryna Kulatska

Supervisors from University of Twente: Dr. M. Theune, Prof. Dr. D.K.J. Heylen, J.B. van Waterschoot, MSc
Supervisor from Findwise: J. Bratt, MSc

2019

Abstract

The goal of this study is to develop a debate platform, the ArgueBot, that can maintain a meaningful debate with the user on various topics and carry out human-like debates. The ArgueBot uses a hybrid model, combining retrieval- and generative-based models. The retrieval model uses cosine similarity to compare the user input with the argument candidates for a specific debate. The generative model is used to compensate for the limitations of the retrieval model, which is restricted to the arguments stored in the database. The ArgueBot utilizes Dialogflow, Flask, spaCy, and machine learning technologies within its architecture. User tests and a survey are used to evaluate the chatbot's performance. The user tests showed that there is potential in the ArgueBot, but that it needs better context understanding, a more accurate stance classifier, and a better generative model.

Contents (partial)

5.2 Survey results 47
5.2.1 User Background 47
5.2.2 Debate information 47
5.2.3 Grammar 48
5.2.4 Conversation flow 48
5.2.5 Response quality 49
5.3 Conversation length 51
5.4 Conclusion 53
6 Discussion 54
6.1 ArgueBot 54
6.2 Stance Classification 56
6.3 Generative Model 57
6.4 Hybrid Model 57
7 Conclusion 59
Bibliography 61
Footnotes 65
A Appendix Survey ArgueBot 1.0 70
B Appendix Survey ArgueBot 2.0 72

1 Introduction

„Opinion is the medium between knowledge and ignorance." — Plato (c. 427 BC – c. 347 BC)

A debate can be defined as a "careful weighing of the reasons for or against something" (https://www.merriam-webster.com/thesaurus/debate). Debates can be traced back to Ancient Greece, where philosophical minds debated politics and the nature of life. Throughout history, debating has been an essential tool in individual and collective decision making and has aided idea generation and policy building. Furthermore, the ability to articulate and evaluate arguments improves one's critical thinking and creativity (Keller et al., 2001). In the age of flourishing social media, worldwide debates have become possible, in which people with different backgrounds can engage in discussions about every possible topic across the globe.
One such example is Doha Debates, which through live debates, videos, blogs, and podcasts sparks discussion and collaborative solutions for today's global challenges, such as the global refugee crisis, Artificial Intelligence (AI), gender inequality, and water shortages (https://dohadebates.com/).

The latest advances in technology, such as Natural Language and Speech Processing, machine learning algorithms, Argument Mining, and Information Retrieval, have enabled human-computer interaction in the debate domain. One such example is the IBM Debater project, a conversational AI system that can give a speech on a given topic and debate with humans (https://www.research.ibm.com/artificial-intelligence/project-debater/). The system combines several technologies: Argument Mining to identify argument components in the debate; Stance Classification and Sentiment Analysis to classify whether an argument is for or against a given topic; Deep Neural Nets (DNNs) and Weak Supervision, a machine learning approach that improves argument detection; and finally Text-to-Speech (TTS) systems, which convert text into spoken voice output and give the Debater its voice.

In the meantime, chatbots are gaining more and more momentum as a new platform for human-computer interaction. According to Gartner, Inc., by 2022 twenty-five percent of enterprises will have integrated virtual customer assistants or chatbots into their platforms (https://www.gartner.com/smarterwithgartner/gartner-top-10-strategic-technology-trends-for-2019/). However, current chatbot systems still have several limitations, such as incorrectly understanding the context (meaning) of the user utterance, a lack of empathy, and the inability to understand the social and emotional cues that exist in human-to-human communication (Klopfenstein et al., 2017; Moore et al., 2017).

1.1 Problem Statement

This research aims to create a chatbot that can maintain a meaningful debate with users on various topics. The goal of the chatbot, called ArgueBot, is to be able to carry out human-like debates with the users. The problem statement for this research is defined as: How can a hybrid retrieval-generation-based chatbot maintain a debate with a user for various topics? The problem statement can be divided into sub-questions:

SQ:1 How can the model recognize and handle the arguments?
SQ:2 How can stance classification be applied for the conversational agents?
SQ:3 What is an appropriate model for the chatbot's response generation?
SQ:4 How can human-like conversation with the chatbot be carried out in the debate domain?
SQ:5 How can such a chatbot be evaluated?

The research presented in this thesis was carried out at Findwise AB, a consultancy company that provides search-driven solutions (https://findwise.com/en). Findwise supported this project with guidance and testing.

Consider the following text extracted from the Wikipedia article "Ethics of artificial intelligence": "Joseph Weizenbaum argued in 1976 that AI technology should not be used to replace people in positions that require respect and care, such as any of these: customer service representative [...], therapist [...], nursemaid for the elderly [...], soldier, judge, police officer. Weizenbaum explains that we require authentic feelings of empathy from people in these positions. If machines replace them, we will find ourselves alienated, devalued and frustrated.
Artificial intelligence, if used in this way, represents a threat to human dignity [...]" (https://en.wikipedia.org/wiki/Ethics_of_artificial_intelligence)

Here, "AI technology should not be used to replace people in positions that require respect and care" is a conclusion (the claim, the core of the argument). "Weizenbaum explains that we require authentic feelings of empathy from people in these positions", "If machines replace them, we will find ourselves alienated, devalued and frustrated", and "Artificial intelligence, if used in this way, represents a threat to human dignity" are the premises (statements that provide reason, evidence, or support for the conclusion). An inference is the process of drawing a conclusion based on the premises; in the above example it would be:

1. humans need to feel empathy, which technologies cannot provide in the same way as professionals do;
2. the absence of empathy and authentic feelings can result in human disappointment, which threatens mental health;
3. therefore, AI should not replace professionals in positions that require respect and care.

Habernal and Gurevych (2017) proposed a machine-learning model for identifying argument components with the following feature sets: baseline lexical features; structural, morphological, and syntactic features; semantic, coreference, and discourse features; and embedding features. These feature sets were used to identify argument components and extract the arguments from annotated forum posts. Another method was proposed by Levy et al. (2017), who used it for detecting topic-relevant claims in data extracted from Wikipedia. The study used a claim sentence query to extract sentences containing the word "that" followed by the claim topic, followed by any word from a pre-defined lexicon. This lexicon included words characteristic of claims, such as argue, disagree, argument, claim, and conflict (Levy et al., 2017).

Furthermore, information retrieval techniques can be used to structure the arguments (indexing) and relate them to each other by computing how similar or dissimilar they are, making it possible to find and retrieve the most relevant arguments and counterarguments (Stab et al., 2018; Ma et al., 2018; Wachsmuth et al., 2017; Wachsmuth et al., 2018; Zhu et al., 2018). Information retrieval can be defined as finding unstructured data (data without a clear, semantically distinguishable structure that is easy for a computer to understand) that satisfies an information need, from within a large collection of data (Manning, 2008).

2.1.2 Stance classification

Stance classification is a field within argument mining that helps to identify whether an argument is for or against the issue being debated. Mandya et al. (2016) proposed extracting the following features for stance classification: topic-stance features (specific words associated with topics); stance-bearing terminology (words connected by adjectival modifier (amod) and noun compound modifier (nn) dependency relations, which can indicate the stance of the argument); logical point features (words extracted following the subject-verb-object (SVO) rule, which may capture the claim); and unigram and dependency features (used to classify shorter posts).
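The rule-based claim detection described earlier (the "that"-rule of Levy et al., 2017) can be sketched in a few lines. The following is a rough, hypothetical reconstruction: the lexicon is a small stand-in (the real lexicon is much larger), and the pattern is simplified to token and substring checks rather than the study's exact query.

```python
import re

# Hypothetical mini-lexicon of claim-indicating words; the real lexicon
# described by Levy et al. (2017) is much larger.
CLAIM_LEXICON = {"argue", "argues", "claim", "claims", "disagree", "argument", "conflict"}

def looks_like_claim(sentence: str, topic: str) -> bool:
    """Simplified rule: the sentence contains the word 'that', mentions the
    claim topic after it, and uses a word from the claim lexicon."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    if "that" not in tokens:
        return False
    after_that = " ".join(tokens[tokens.index("that"):])
    return topic.lower() in after_that and any(w in tokens for w in CLAIM_LEXICON)

print(looks_like_claim("Critics argue that animal testing is cruel.", "animal testing"))  # True
print(looks_like_claim("Animal testing happens in labs.", "animal testing"))              # False
```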
Levy et al. (2017) also proposed a method for claim stance classification with respect to a given topic. The study used precise semantic analysis of the debate topic and the claim (the sentiment of the claim towards its target), including target identification (by detecting the noun phrases in the claim) and contrast detection between the claim and topic targets (through their relations), where each of these tasks had a separate machine learning classifier.

2.2 Chatbots

This section presents recent developments in conversational agents, also known as chatbots. A chatbot is a computer program that can mimic written or spoken human speech in interactions with humans (Kim et al., 2018).

2.2.1 Types of chatbots

Chatbots can be broadly classified into generative chatbots, which generate a response using natural language generation techniques (Kim et al., 2018; Le et al., 2018), and retrieval-based chatbots, which select the most appropriate response using information retrieval techniques (Zhu et al., 2018; Rakshit et al., 2019; Le et al., 2018). Retrieval-based models require a database of possible responses to choose from. Such a model first retrieves the most likely candidates that match the current utterance from the database and then selects the most appropriate one as the response. Generative models, by contrast, build responses from scratch using machine learning techniques. Here, the model is trained on a dataset of real dialogues and generates responses by "translating" inputs into responses. Statistical Machine Translation (SMT) models are among the most recent models used for generating chatbot responses (Cahn, 2017).

2.2.2 Hybrid model

A hybrid chatbot model that combines generative and retrieval models has been explored in several studies. Tammewar et al. (2018) developed a personal assistant application for scheduling and cancelling reminders. In this study, a graph-based retrieval model contained a set of nodes representing the different conversational states to navigate between and was used for the expected conversation flow; the generative model was applied when the conversation deviated from the expected flow. Another study, Yang et al. (2019), proposed a hybrid neural conversational model that combines generation and retrieval models with a hybrid ranking module, which selects the best response from the generated and retrieved candidates.

The model described in this work is similar to the model proposed by Tammewar et al. (2018), as it applies the same strategy of using the generative model when the retrieval model is not able to give a response, although the chatbots' purposes differ: the chatbot developed in this study aims to maintain a debate with the user, not to be a scheduling assistant. It also differs from the Yang et al. (2019) study, as it does not control the responses through a ranking module. This study prioritizes the retrieval module and applies the generative model to overcome the dataset's limitations, while Yang et al. (2019) treat the responses from the different modules as equally important.
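The retrieval-first strategy can be summarized in a few lines. This is a minimal sketch with toy stand-ins for both modules: the candidate list, threshold, and string-similarity measure are illustrative only; the actual system uses cosine similarity over word vectors and a seq2seq generator, described in later chapters.

```python
from difflib import SequenceMatcher

# Toy stand-ins for the two modules; names and threshold are illustrative only.
CANDIDATES = [
    "Compulsory voting increases turnout and legitimacy.",
    "Forcing people to vote infringes on individual freedom.",
]
THRESHOLD = 0.3

def retrieve_response(user_input: str):
    best = max(CANDIDATES, key=lambda c: SequenceMatcher(None, user_input, c).ratio())
    score = SequenceMatcher(None, user_input, best).ratio()
    return best if score >= THRESHOLD else None

def generate_response(user_input: str) -> str:
    return "generated fallback response (a seq2seq model would run here)"

def hybrid_response(user_input: str) -> str:
    # Retrieval has priority; generation only covers its gaps,
    # mirroring the strategy of Tammewar et al. (2018).
    return retrieve_response(user_input) or generate_response(user_input)

print(hybrid_response("I think voting should be compulsory"))
```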
[...] how human-like the chatbot is. Chatbots can also be evaluated by conducting user tests and using surveys to determine user satisfaction (Higashinaka et al., 2018).

2.3 Conclusion

Concluding the literature review, there are various methods for argument mining and for developing chatbots. Argument mining can include component extraction and stance classification. Chatbots can be retrieval-based, generative, or hybrid, where hybrid models combine retrieval and generative models. There are many online tools for building chatbots, and some of them offer free versions. At least two debate chatbots have been built previously: the retrieval-based chatbot Debbie and the combined retrieval- and generative-based chatbot Dave. In this chapter, the following research sub-questions were answered in relation to previous work:

SQ:1 How can the model recognize and handle the arguments? Argument extraction can be done through feature extraction or through a rule-matching method (the "that"-rule). The arguments can either be retrieved or generated, depending on the chatbot model.

SQ:2 How can stance classification be applied for the conversational agents? Feature extraction, semantic and sentiment analysis, and machine learning can be used to classify the stance of an argument.

SQ:3 What is an appropriate model for the chatbot's response generation? Depending on the chatbot's type, the responses can be produced by a retrieval, generative, or hybrid model. The retrieval model can use a Manhattan LSTM or the UMBC STS similarity score to select an appropriate response, while the generative model can use a hierarchical recurrent (RNN) encoder-decoder architecture. A hybrid model can fall back on the generative model when the retrieval model cannot give a response, or use a hybrid ranking module to select the best response from both the retrieved and generated candidates.

SQ:5 How can such a chatbot be evaluated? The ChatEval tool, the Turing test, and user tests in combination with surveys can be used to evaluate the chatbot.

3 First Implementation with Basic Functionalities

„The smart way to keep people passive and obedient is to strictly limit the spectrum of acceptable opinion, but allow very lively debate within that spectrum—even encourage the more critical and dissident views. That gives people the sense that there's free thinking going on, while all the time the presuppositions of the system are being reinforced by the limits put on the range of the debate." — Noam Chomsky, The Common Good (1998)

This chapter describes the first implementation of the ArgueBot platform, its design choices, and how it was tested. The goal of the first implementation was to build the base functionalities for interaction with the user. Henceforward, the ArgueBot chatbot will be referred to as an agent.

3.1 Dataset

The knowledge base for the chatbot consists of the ArguAna Counterargs corpus (Wachsmuth et al., 2018). Table 3.1 lists the 15 topics in the dataset, which contains 1069 debates with 6779 points and 6753 counterpoints (see figure 3.1 for an example of how a debate is composed), distributed over test, training, and validation folders. Arguments consist of points with both pro and con stances towards the debate's statement. Each point includes a conclusion, premises, and an inference within its text, which are not separated or labelled (see chapter 2.1.1). Each debate has an introduction with the relevant information needed to make an argument. The data in the dataset was crawled from idebate.com (https://idebate.org/), an international debate education association for young people that offers debates written by experienced debaters from around the world.
The ArguAna Counterargs corpus therefore includes high-quality arguments, strengthened with citations. The downside of the corpus is its formal style of argumentation, which may differ from the written arguments provided by users of the chatbot. This corpus was chosen because it includes the debate background and arguments with different stances, thereby providing stance labels for each argument and eliminating the problem of stance-classifying the existing data.

Topic               Debates  Points  Counterpoints
Culture             46       278     278
Digital freedoms    48       341     341
Economy             95       590     588
Education           58       382     381
Environment         36       215     215
Free speech debate  43       274     273
Health              57       334     333
International       196      1315    1307
Law                 116      732     730
Philosophy          50       320     320
Politics            155      982     978
Religion            30       179     179
Science             41       271     269
Society             75       436     431
Sport               23       130     130
Training set        644      4083    4065
Validation set      211      1290    1287
Test set            214      1406    1401
Total               1069     6779    6753

Tab. 3.1.: Distribution of debates, points, and counters over the topics in the dataset (Wachsmuth et al., 2018)

The first implementation used 12 debates marked as "Junior" in the dataset, with the claims: "Ban online gambling", "Ban animal testing", "Kill One to Save Many", "Banning School Uniforms", "Poetry should not be taught in schools", "Raise the school leaving age to 18", "Ban the niqab and other face coverings in schools", "Dreaming of a white Christmas", "Introduce a 'fat tax'", "Homework is a waste of time", "Every child should have a mobile phone", and "Sponsoring children in developing countries". These debates were designed for a younger audience and included simplified topics with simplified arguments, which aligned with the purpose of the first implementation: creating the platform with basic functionalities using simplified debates. Each debate included at least six arguments (at least three for and three against the main claim). Each argument included one point and one counterpoint, each generally 4-8 sentences long.

The "This House" wording format belongs to the British Parliamentary debate style, which is the default format for many university societies. The British Parliament consists of "Houses"; thus "This House..." represents a motion to be discussed in the debate. The name change consisted of tokenizing the debate name, removing the first two tokens if they were "this" and "house", checking the tense of the verb, and changing it to the present participle ("-ing") form. Tokenizing and verb-checking were implemented using the spaCy library (https://spacy.io/). The arguments were then vectorized using spaCy's GloVe vector model package "en_vectors_web_lg" and transformed into strings to save space; the use of these vectors is explained further in the next section. The pre-processed debates with their arguments were saved to an SQLite database (https://www.sqlite.org/index.html) to reduce the computing time for the model and make the retrieval process easier.
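The title normalization just described could look roughly as follows. This is a hypothetical reconstruction (the thesis used spaCy, but the exact code is not shown), assuming the small English spaCy model; real verb inflection needs more care than this naive lemma + "-ing" heuristic (e.g. consonant doubling in "ban" vs "banning").

```python
# Hypothetical reconstruction of the debate-title normalization step.
import spacy

nlp = spacy.load("en_core_web_sm")  # python -m spacy download en_core_web_sm

def normalize_title(title: str) -> str:
    tokens = list(nlp(title))
    # Drop a leading "This House" if present.
    if len(tokens) > 2 and tokens[0].lower_ == "this" and tokens[1].lower_ == "house":
        tokens = tokens[2:]
    words = []
    for i, tok in enumerate(tokens):
        # Naive heuristic: turn a leading verb into its "-ing" form.
        if i == 0 and tok.pos_ == "VERB" and not tok.text.endswith("ing"):
            words.append(tok.lemma_ + "ing")
        else:
            words.append(tok.text)
    return " ".join(words).capitalize()

print(normalize_title("This House supports school uniforms"))  # "Supporting school uniforms"
```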
3.2.2 Model for data analysis

The model serves two main purposes: handling the debate object (the memory object used by the model) for each user, and handling the analysis of the user input. The user-input handler used the spaCy library to vectorize the input. It used cosine similarity to compare the vectorized user input to all the argument candidates for the chosen debate; it then retrieved the id of the argument with the highest similarity and sent it to the debate model. The cosine similarity between two vectors is a measure that calculates the cosine of the angle between the vectors projected in a multi-dimensional space. Given two vectors $\vec{a}$ and $\vec{b}$, their cosine similarity is

$$\cos\varphi = \frac{\vec{a} \cdot \vec{b}}{\lVert\vec{a}\rVert \, \lVert\vec{b}\rVert} \tag{3.1}$$

where $\vec{a}$ and $\vec{b}$ are multi-dimensional vectors over the term set $T = \{t_1, \ldots, t_m\}$ and each dimension represents a word with its weight in the sentence. The cosine similarity is a non-negative number between 0 and 1 (Huang, 2008). (A code sketch of this retrieval step and of the debate-specific word extraction follows the list below.)

The user-input handler then used the NLTK Vader library (https://www.nltk.org/) to classify the stance of the user input through sentiment analysis. The polarity of the user input (whether it has positive, neutral, or negative sentiment) was used to classify whether it was for or against the main claim of the debate: positive sentiment was taken as a "pro" stance, negative sentiment as a "con" stance, and neutral sentiment as an undefined stance.

The debate-object handler managed multiple things:

• it randomly assigned the agent's stance (for or against the main claim);
• it transformed the vectors of all argument candidates from strings (see the Data Pre-Processing section 3.2.1 for why the vectors were saved as strings) into NumPy vectors (NumPy is a Python library often used for computations; https://www.numpy.org/). The model then rendered all candidates with their ids and corresponding vectors for the similarity computation done by the user-input handler;
• it assigned the argument with the highest similarity, received from the user-input handler, as the active argument for retrieval. The retrieval process compared the user's stance (computed from the polarity given by the sentiment analysis in the user-input handler) with the agent's stance and retrieved the first sentences from the point or the counterpoint, based on the agent's stance. If the user's and the agent's stances were the same, the agent responded with "I agree"; when the stances differed, with "I disagree". If the stance of the user input was undefined (neutral polarity), the model retrieved the response without agreeing or disagreeing. If the agent's stance was "pro" the main claim of the debate and the active argument was "pro" as well, the model retrieved the first sentence of the point in the argument; if the active argument was "con", i.e. against the main claim, the model retrieved the first two sentences from the counterpoint. The model then updated the argument by removing the used sentences from the database for that user. When the next user input was assigned to an already-used argument (the one with the highest cosine similarity), the next two sentences were retrieved, until the argument became empty. If no sentences were left to retrieve, the agent responded: "You already used this argument". This was done under the assumption that the user is continuing the same argument as before, given the highest cosine similarity;
• it retrieved the 100 most frequent words in the dataset for the debate (referred to as "debate-specific words") for the argument entity in Dialogflow, which helps with argument detection (see section 3.2.3 for more information). It first tokenized all the sentences of all the arguments in the debate using the spaCy library. Then it checked, for each token, that it was not a stop word (such as "the", "a", "an", "in", and other commonly used words that do not carry any necessary information) or a punctuation mark, and saved the lemma form of the word (its base or dictionary form) to an array. It then used the Count function to retrieve the 100 most used words from the array.
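The two computations above, the cosine-similarity retrieval of equation (3.1) and the debate-specific word extraction, could look roughly as follows. This is a sketch, not the thesis code: it assumes a spaCy model with word vectors is installed (the thesis used "en_vectors_web_lg"; "en_core_web_lg" is used here so that lemmas are also available), and the candidate arguments are a toy stand-in for the database.

```python
# Minimal sketch of retrieval by cosine similarity and of frequent-word
# extraction; python -m spacy download en_core_web_lg installs the model.
from collections import Counter
import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")

candidates = {
    1: "School uniforms reduce bullying based on clothing.",
    2: "Homework reinforces what students learn in class.",
}
candidate_vectors = {i: nlp(t).vector for i, t in candidates.items()}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Equation (3.1): cos(phi) = a.b / (|a| |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_argument_id(user_input: str) -> int:
    v = nlp(user_input).vector
    return max(candidate_vectors, key=lambda i: cosine(v, candidate_vectors[i]))

def debate_specific_words(sentences, top_n=100):
    # Lemmatize, drop stop words and punctuation, count the rest.
    lemmas = [tok.lemma_ for s in sentences for tok in nlp(s)
              if not tok.is_stop and not tok.is_punct]
    return [word for word, _ in Counter(lemmas).most_common(top_n)]

print(best_argument_id("Uniforms stop kids being teased about clothes"))
print(debate_specific_words(candidates.values(), top_n=5))
```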
3.2.3 Dialogflow

Dialogflow is a platform for creating interfaces based on natural-language conversations (https://dialogflow.com/). Dialogflow provides a set of pre-built agents with intents, which map user inputs to responses, and entities, which contain information that can be extracted from the user input. Dialogflow has a console interface where it is possible to create intents and entities, fill in responses for specific phrases, and pre-train the intents with phrases that the user would typically input. The Dialogflow API (https://github.com/googleapis/dialogflow-python-client-v2) makes it possible to access Dialogflow functionalities from the ArgueBot application and to control the responses through a webhook. The webhook is a URL to the chatbot platform that sends the agent's response, retrieved from the model, back to Dialogflow (a minimal webhook sketch follows the list below).

The Dialogflow implementation included:

• an argument entity with debate-specific words (the 100 most used words in the debate), which the model updated for every chosen debate; these entities helped to detect the argument intent and made intent detection for multiple users possible;
• the Default Welcome Intent, which recognized greetings from the user;
• the Default Fallback Intent, which, when no other intent was matched, responded with Try to start your argument with "I think...";
• an argument intent, consisting of debate-specific words, which helped to differentiate whether the user input was an argument or not (pre-trained on the phrases "there is test", "I think that test", "I argue that test", and "in some test", where "test" was the default word in the argument entity and was included in the argument entity for every specific debate);
• a stance intent, for when the user asked the agent for its stance (pre-trained on phrases such as "Are you for or against the debate?", "What is your stance?", and "Are you pro or con?");
• a why intent, for when the user misunderstood the agent or wanted more explanation of a specific argument (pre-trained on phrases such as "why?", "what?", "I don't understand", "What do you mean by that?", and "Can you explain more");
• the Small Talk pre-built agent, customized through manual input to give specific responses when the user used small-talk phrases. Small talk could be manually customized for several areas: about agent (questions such as "Who are you?", "Are you real?", and "You are bad"); courtesy (phrases such as "Great!", "Thank you!", and "Well done"); emotions (including [...]
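How such a webhook might look in Flask is sketched below. The thesis does not show its webhook code, so the route, the intent names, and the retrieval call are illustrative assumptions; only the general Dialogflow v2 request/response shape (a queryResult object in, a fulfillmentText field out) is standard.

```python
# Hypothetical sketch of a Dialogflow fulfillment webhook in Flask.
from flask import Flask, jsonify, request

app = Flask(__name__)

def retrieve_argument(user_text: str) -> str:
    # Stand-in for the retrieval model described in section 3.2.2.
    return "Working at an early age can be an advantage in certain circumstances."

@app.route("/webhook", methods=["POST"])
def webhook():
    payload = request.get_json(force=True)
    intent = payload["queryResult"]["intent"]["displayName"]
    user_text = payload["queryResult"]["queryText"]
    if intent == "argument":
        reply = retrieve_argument(user_text)
    else:
        reply = 'Try to start your argument with "I think..."'
    # Dialogflow v2 expects the reply in the fulfillmentText field.
    return jsonify({"fulfillmentText": reply})

if __name__ == "__main__":
    app.run(port=5000)
```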
User input: Hello
Intent: Default Welcome Intent
Response model: responds with a random argument retrieved from the database
Agent output: Let's start the debate! I think that forced education achieves little. What do you think about it?

User input: It's not so much forcing I think
Intent: Small talk
Response model: the Dialogflow agent replies with a customised response
Agent output: Understood.

User input: I think the brain is still too much in development early on.
Intent: argument intent
Response model: triggers the argument intent; the user-input analyser is activated, determines that the polarity of the sentence is neutral, and retrieves the first two sentences from the argument with the highest cosine similarity to the user input
Agent output: Working at an early age can be an advantage in certain circumstances. Many families, particularly in countries with little welfare, need their children to bring income into the household.

User input: I think the return on investment is solid. Children with more education can be more free in what they can do as an adult.
Intent: argument intent
Response model: the intent is classified as an argument and the polarity is computed as "positive", which the model takes as a "pro" stance
Agent output: I agree! Working at an earlier age can help these families to survive. Furthermore, anyone who is having difficulties getting educational qualifications can gain an advantage by leaving school and gaining work experience.

User input: I agree with you that if you start working at a younger age, say 16, children can still learn later if they want to
Intent: Default Fallback intent
Response model: this user input is not understood by the model, which triggers the pre-defined Default Fallback response
Agent output: Try to start your argument with "I think..."

User input: What is an apprenticeship?
Intent: argument intent
Response model: the user asks for clarification, but the input is classified as an argument, which triggers the argument intent with neutral polarity
Agent output: As well as these direct costs there is also a huge amount of losses that a country would face. Young people who leave school and enter the workforce contribute to the economy through taxes and contributions to pensions, which the country would no longer receive if people remained in school.

Tab. 3.2.: The system's response model, with examples from a dialogue on the debate topic "Make voting compulsory"

3.3 User tests and results

This section describes the evaluation strategy for the first implementation of the chatbot and its results. The evaluation was conducted through user tests on a platform built for the testing purpose (see figure 3.4) and through a survey created in Google Forms (see Appendix A). The purpose of these user tests was to identify how users interact with the chatbot and how it can be further improved. Before the user tests, the platform was tested by one colleague in order to assure the test quality; this test is not included in the results for the first implementation.

The user tests were conducted over three days and had 14 participants. The testers were mainly colleagues from Findwise, who received the link to the ArgueBot through an internal communication system; other testers were acquaintances contacted via Facebook. The user tests were anonymous, and therefore no demographic information is available for the participants.

The testing platform had two pages. The front page included information about the project and the user consent form (see figure 3.4a). After the user gave their consent, the ArgueBot platform redirected the user to the main page (see figure 3.4b). The main page had an option box where the debate topic could be selected, information about the debate, the chat box for conversing with the ArgueBot, and the link to the survey. To make the interaction anonymous, an identification code was given to the user and could be found at the top of the page.
The user was later instructed to provide this identification code when filling in the survey, in order to connect the survey answers to a specific conversation for further analysis. The survey included:

• two yes/no questions asking the users whether they found the background information clearly presented and helpful, respectively;
• three rating questions, asking the users to rate on a scale from 1 to 10 how natural (human-like) the conversation flow with the chatbot felt, how satisfactory the chatbot's grammar was, and how satisfactory its response quality was;
• open questions asking the users to elaborate on their answers;
• a separate question at the end of the survey where the users could leave additional feedback.

Fig. 3.4.: The interface of the ArgueBot for the first implementation: (a) the front page; (b) the main page

4 Second Implementation with Machine Learning

„[Language] makes infinite use of finite means." — Wilhelm von Humboldt

This chapter presents the second and final implementation of the ArgueBot platform and its design choices. The user testing for this implementation is more extensive and is therefore described in a separate chapter (chapter 5).

4.1 ArgueBot 2.0

This section introduces the changes made in the second implementation of the ArgueBot and their motivations. The major changes, stance classification and the generative model, are each described in a separate section.

4.1.1 Dataset

The dataset for argument retrieval was extended from the Junior topics to the debates in the test set of the ArguAna Counterargs corpus (Wachsmuth et al., 2018). Some debates belonged to several topics, which resulted in duplicates. Code for filtering out the duplicates was included in the pre-processing and resulted in 175 debates saved to the database, out of the 214 in the dataset. The distribution of debates with their points and counterpoints over the topics in the second implementation is shown in Table 4.1.

Topic               Debates  Points  Counterpoints
Culture             7        54      54
Digital freedoms    9        61      61
Economy             17       125     125
Education           10       76      76
Environment         5        36      36
Free speech debate  9        58      58
Health              10       77      77
International       30       233     233
Law                 19       134     134
Philosophy          10       85      85
Politics            26       194     194
Religion            5        36      36
Science             8        57      57
Society             6        39      39
Sport               4        30      30
Total               175      1295    1295

Tab. 4.1.: Distribution of debates, points, and counters over the topics in the database for the second implementation

4.1.2 Architecture

Figure 4.1 shows the architecture of the second implementation. As in the first implementation, the dataset is first pre-processed and then saved into the database. Here, the junior debates were replaced with more complex debates (see the distribution over the topics in Table 4.1). Because of the higher argument complexity, more extensive pre-processing was applied, which included removing notes, annotations, references, and footnotes. Moreover, the number of most used words extracted for updating the argument entity in Dialogflow was increased from 100 to 300. The interaction with the user is quite similar to the first implementation, but instead of using sentiment analysis, a stance classifier developed using Machine Learning (ML) was added.
Moreover, instead of the Fallback message Try to start your argument with "I think...", used by the Fallback intent when Dialogflow could not match the user input to any of the available intents, the user now received a generated argument created by the Generative Model.

The model that retrieved sentences from the database now chose the two arguments with the highest similarity to the user input instead of one. If the first argument had no sentences left to retrieve, the next available sentence from the second argument was retrieved (a sketch of this fallback appears at the end of this section). This replaced the response "You already used this argument", which, according to the user tests of the first implementation, was often wrong and felt unnatural. Moreover, instead of retrieving two sentences at a time from the argument, as in the first implementation, one sentence was retrieved; this was done to maximize the number of candidates for retrieval.

To differentiate between the models used in this implementation, the model that retrieves a response for a specific user from the database will henceforth be referred to as the Retrieval Model. To differentiate between responses created by different models during the user tests, the responses created by the generative model included a "GR" token at the end of the sentence. The chatbot's model is hence hybrid, as it uses both a retrieval and a generative model. Additionally, a new intent was created to explain the purpose of the "GR" token, in case the user asked what "GR" means.

The major changes made to the ArgueBot are marked with dashed rectangles in figure 4.1 and are explained thoroughly in the next two sections: section 4.2 for the stance classification with ML and section 4.3 for the Generative Model.

Fig. 4.1.: Architecture of the second implementation

Examine figure 4.2, which shows part of a debate conducted by one of the participants on the topic "Making voting compulsory" during the user tests for the second implementation. Here, "user" annotates the user input, "agent" annotates the ArgueBot response, and the response flow is explained within brackets. The agent's stance is for ("pro") making voting compulsory. A small part of the original dialogue was removed, as it was most probably a typo made by the user at the beginning of the conversation. In table 4.2, some of the user input/agent response pairs were picked from the dialogue to illustrate the response model: the table shows the intent of the user input and the response model chosen by the ArgueBot, followed by the agent output as the resulting response.
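The two-argument fallback could be implemented roughly as below; the data structures are illustrative stand-ins, not the thesis code.

```python
# Sketch of the adjusted retrieval strategy: top-2 arguments with fallback.
from typing import Dict, List, Optional

def next_sentence(remaining: Dict[int, List[str]],
                  ranked_ids: List[int]) -> Optional[str]:
    """ranked_ids holds the two argument ids most similar to the user input,
    best first. Retrieve one sentence from the best argument that still has
    sentences left, and remove it so it is not repeated."""
    for arg_id in ranked_ids[:2]:
        sentences = remaining.get(arg_id, [])
        if sentences:
            return sentences.pop(0)
    return None  # both candidate arguments are exhausted

remaining = {7: ["Voting is a civic duty."], 9: ["Turnout legitimizes results."]}
print(next_sentence(remaining, [7, 9]))  # "Voting is a civic duty."
print(next_sentence(remaining, [7, 9]))  # falls back to argument 9
```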
4.2.1 Data

To create the dataset for the stance classifier, the ArguAna Counterargs corpus (Wachsmuth et al., 2018) was used. This involved more thorough pre-processing:

• all references, footnotes, and additional information within brackets were removed from the arguments using regular expressions;
• the sentences inside every point and counterpoint were then split using the spaCy library (https://spacy.io/);
• a csv file was created by saving each sentence together with its corresponding stance from the dataset in binary form (1 for "pro" and 0 for "con");
• the csv file was manually reviewed to remove references not captured by the regular expressions;
• the additional stance classification dataset "IBM Debater Claim Stance Dataset" (Bar-Haim et al., 2017) was added to the existing csv file to improve the classifier, resulting in a file with 49544 lines, where every line includes a sentence with its respective stance as a label;
• the resulting file was then split into train/validation/test files with a ratio of 70/15/15 (see the sketch after Table 4.3).

The model fits its parameters for classification on the training dataset (it learns the features of the input and their relation to the corresponding stance). The validation dataset can be used to compare the performance of the model during training and to tune the hyper-parameters used by the model (the model treats the input data in the validation dataset as unseen, predicts its stance, and evaluates how many of these predictions are correct). The test dataset is used to provide an unbiased evaluation of how well the final model fits unseen input data (similar to the validation process, but performed after the training is done). During the experiment, the validation set was only used for tuning the best-performing classification model, because of time constraints. Table 4.3 shows the distribution of the stances in the dataset; the sentences with the pro stance have a 3% higher share, so the dataset is slightly imbalanced.

            Training  Validation  Testing
Data set    34686     7432        7432
Pro stance  18254     3907        3938
Con stance  16432     3525        3494

Tab. 4.3.: The dataset used for the stance classification, in number of lines
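The 70/15/15 split could be reproduced along the following lines, assuming the sentences and binary labels were saved to a hypothetical stances.csv as described. The thesis does not state whether the split was stratified; stratification is used here to keep the pro/con shares similar across the files.

```python
# Sketch of the train/validation/test split (70/15/15).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("stances.csv", names=["sentence", "stance"])

# First carve off 70% for training, then split the remainder evenly.
train, rest = train_test_split(df, test_size=0.30, random_state=42,
                               stratify=df["stance"])
val, test = train_test_split(rest, test_size=0.50, random_state=42,
                             stratify=rest["stance"])

for name, part in [("train", train), ("val", val), ("test", test)]:
    part.to_csv(f"{name}.csv", index=False)
print(len(train), len(val), len(test))
```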
4.2.2 Methodology

This section provides an overview of the experiment done to determine the most suitable machine learning classification model for the given data. A general overview of the different ML models is given here; the chosen model is explained in more detail in the next section.

Google Colaboratory (Colab) is a free Jupyter notebook environment that provides limited access to hardware accelerators such as GPUs and TPUs. While the TPU works exclusively with TensorFlow (an open-source machine learning library developed by Google), the GPU can be used with various machine learning libraries. Hardware accelerators are preferable because of their computational speed: they can train models much faster than the CPU of a modern computer. A set of experiments was therefore conducted on Google Colab to test different machine learning models for stance classification. The PyTorch machine learning library for Python (https://pytorch.org/) was used for six models: CNN, Self-Attention Networks, LSTM, LSTM with a self-attention mechanism, RCNN, and RNN. The code was based on an existing solution on GitHub (https://github.com/prakashpandey9/Text-Classification-Pytorch).

CNNs (convolutional neural networks) are commonly used for image classification; they use kernels (often 2x2, 3x3, or 4x4 pixel squares) to select important features, save them in hidden layers, and then run over new images in order to classify them with the help of these features. For NLP tasks the method works the same way, but with tokens instead of pixels. The CNN model usually works well with data of uniform size (such as images with the same resolution), as it can extract the characteristic features for classification purposes.

Self-Attention Networks (SAN) use weights distributed over the different words of a sequence to find which combinations of words are the most important. The method has been used successfully for a range of NLP tasks, such as machine translation, sequence labeling, and relation extraction.

RNNs (recurrent neural networks) are networks that contain loops. When dealing with text, RNNs understand words they encounter later in the sentence in light of the words they encountered earlier. As the distance between related words grows, RNNs cannot make the correct connections and therefore focus more on words close to one another.

LSTM stands for Long Short-Term Memory and is built to overcome this issue in RNNs. LSTMs are capable of learning long-term dependencies with memory cells that maintain their state over time through control gates, which control which information should be let in or out. An LSTM with a self-attention mechanism (LSTM SAM) additionally learns the correlation between the current words and the previous part of the sentence and saves it into the memory cells.

RCNNs (Recurrent Convolutional Neural Networks) capture contextual information with a recurrent structure and construct the representation of the text using a convolutional neural network.

The first experiment consisted of training each model 5 times for 10 epochs (the number of iterations for which the machine learning model passes through the same data) and choosing the model with the best accuracy. The hyper-parameters used in this experiment were a batch size of 32 (the number of samples propagated through the network at once) and 512 hidden features (the number of features the network learns from each sample). The accuracy is the number of correct predictions divided by the number of all samples (Müller and Guido, 2016). For simplicity and because of time constraints, the same hyper-parameters were used for all models; they were chosen by trial and error to give overall satisfactory test accuracies for most of the models.

Table 4.4 shows the accuracy of the different models in the first experiment. Comparing the accuracy on the training and validation datasets helps developers to identify whether the model is overfitting (the validation accuracy is lower than the training accuracy) or underfitting (the opposite) during training. When a model overfits, it works well on the training set (it has learned rules specific to the training set) but is not able to generalize to new data; underfitting occurs when the model is not able to capture all aspects of the training data (Müller and Guido, 2016). From table 4.4 we can observe that SAN and LSTM SAM are prone to overfitting, CNN and LSTM overfit slightly, and RNN is prone to underfitting. To overcome these issues, a model can be tuned by adjusting the hyper-parameters so that they satisfy the over/underfitting trade-off. Due to time constraints, tuning was done only for the best-performing model. The testing accuracy shows the accuracy of the trained model on the test dataset, i.e. on data the model has not seen before; this metric is therefore used to measure the performance of the trained model.
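The accuracy metric used to compare the models is simple enough to state in code; this is a generic sketch, not the experiment's code.

```python
# Accuracy: the number of correct predictions divided by all samples.
import torch

def accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    predictions = logits.argmax(dim=1)  # class with the highest score
    return (predictions == labels).float().mean().item()

logits = torch.tensor([[0.2, 0.8], [0.9, 0.1], [0.4, 0.6]])
labels = torch.tensor([1, 0, 0])
print(accuracy(logits, labels))  # 2 of 3 correct -> 0.666...
```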
This model used hyper-parameters: 64 batch size and 768 hidden features. To illustrate the difference between interaction with the chatbot using different stance classifiers two dialogues were chosen for the same debate "Banning the development of genetically modified organisms" from the user tests and are shown in figure 4.3. Figure 4.3a shows the ArgueBot version using the LSTM as a stance classifier, while figure 4.3b shows the dialog using the LSTM SAM as a stance classifier. The blue color to the left denotes the user input, while the green color to the right denotes the chatbot response. When analyzing the dialog in figure 4.3a, two problems can be recognized: incorrect stance classification (the chatbot agrees with each of the user’s arguments), an unrelated response (the user is talking about the danger of genetically modified organism, and the chatbot responds with the benefits of genetically modified food). The combination of these problems creates a disruption in the conversation, and the user might feel that the chatbot completely misunderstands him or her, as the chatbot continues to agree with everything that the user says. Moreover, the user in this example seems to be testing the system by trying the sentences with the semantic differences ("Genetic modification is good/bad"). This inspired to include such sentences when conducting the experiment for the classifier, mentioned previously. The second dialog in figure 4.3b uses the LSTM with the self-attention mechanism as the stance classifier. It is able to classify the user inputs with different stances (it responds with both "I agree" and "I disagree" statements), but the chatbot is still not able to analyze the more complex user inputs, such as questions. The chatbot was designed with the assumption that the user will input arguments on the topic and while it can handle simple questions for clarification such as "why?" or "what?", the question "Agree with what?" is too complex to handle. Here, a clear example of the dataset limitations can be observed: the agent response "There are two problems associated with scientifically testing the impact of genetically modifying food" denotes the beginning of the argument and results in an obvious response from the user that asks "What problems?", that the agent assumes is an argument and responds with a new argument from the dataset. Moreover, the user in this example got the generated response that ends with "GR", that while grammatically correct, is not related to the debate topic of genetically modified food. 4.2 Stance classification with ML 35 (a) LSTM stance classifier (b) LSTM SAM stance classifier Fig. 4.3.: Comparison between two dialogues using different classifiers for the same debate topic "Banning the development of genetically modified organisms", where the blue color represents the user and the green color represents the agent 4.2 Stance classification with ML 36 4.2.3 LSTM with Self-Attention Mechanism The LSTM SAM was chosen as a classification model for the ArgueBot based on the experiments demonstrated in the previous section and will be further explained here. First, the overview of the LSTM networks will be given, then the LSTM SAM will be explained with the help of figure 4.4. The LSTM method was first proposed by Hochreiter and Schmidhuber (1997) to overcome the vanishing/exploding gradient problem in the RNN networks. 
4.2.3 LSTM with Self-Attention Mechanism

The LSTM SAM was chosen as the classification model for the ArgueBot based on the experiments presented in the previous section and is explained further here. First an overview of LSTM networks is given; then the LSTM SAM is explained with the help of figure 4.4.

The LSTM method was first proposed by Hochreiter and Schmidhuber (1997) to overcome the vanishing/exploding gradient problem in RNNs. This problem occurs when the gradients that carry the information needed for updating the weights of the network become too small or too large, with the result that the model is no longer able to learn. LSTM networks use a gating mechanism that controls the degree to which the LSTM units keep the previous state and store the extracted features.

Figure 4.4 shows the architecture of the LSTM network with the self-attention mechanism (LSTM SAM). First, the words are transformed into their vector representations "vn" (300-dimensional GloVe embeddings), which are then fed into the LSTM embedding layer. "A" here represents a chunk of the LSTM network, which has a chain-like connected structure: every repeating chunk "A" has three gates that control the information flow to its memory cell and passes its output to the hidden state "hn", containing the word features, and to the next chunk "A" in the chain. The three gates are: the input gate, which regulates how much of the new information the cell should keep; the forget gate, which regulates how much of the existing information the cell should throw away or keep; and the output gate, which regulates what information to pass to the next chunk in the network and to the hidden state.

The attention layer then finds the contribution of each word to the whole input by assigning a weight "wn" to each word. The sentence embedding "M" is computed as the weighted sum over the vector matrix, where each vector represents an aspect or component of the semantics (long sentences can have multiple components) belonging to the "pro" or the "con" class. The sentence embedding in the attention mechanism provides the semantic representation of the input (long-term dependencies), allowing the LSTM to carry only shorter-term context information around each word (short-term dependencies), which relieves some of the memory load of the LSTM network (Lin et al., 2017). The output "r" is a sentence feature vector containing the sentence embedding for the "pro" and "con" classes. When testing a new, unknown input (such as user input in the chatbot), the model returns the class with the highest weight.
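A compact PyTorch sketch of such a classifier is shown below. It loosely follows the structure described above (embedding, LSTM, attention-weighted sum, linear output); the sizes and names are illustrative, and the thesis itself built on an existing GitHub implementation with a richer attention matrix (Lin et al., 2017).

```python
# Illustrative LSTM classifier with a simple self-attention layer.
import torch
import torch.nn as nn

class LSTMSelfAttention(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=768, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # GloVe weights could be loaded here
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, 1)   # one attention score per time step
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):                       # (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))         # (batch, seq_len, hidden)
        weights = torch.softmax(self.attn(h), dim=1)    # (batch, seq_len, 1)
        sentence = (weights * h).sum(dim=1)             # weighted sum of hidden states
        return self.out(sentence)                       # logits for pro / con

model = LSTMSelfAttention(vocab_size=10000)
logits = model(torch.randint(0, 10000, (32, 20)))       # batch of 32, 20 tokens each
print(logits.shape)  # torch.Size([32, 2])
```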
Seq2seq models take an input sequence and return an output sequence using fixed-size encoder-decoder architectures; they can be used for machine translation, text summarization, conversational modeling, image captioning, and more. Both the encoder and the decoder use a separate GRU recurrent neural network. The GRU is a multi-layered Gated Recurrent Unit (Cho et al., 2014) that, like the LSTM, eliminates the vanishing/exploding gradient problem of RNNs through its gate architecture. While LSTM networks use three gates (input, forget, and output), the GRU has only two gates (reset and update). The reset gate controls how much of the existing information to forget between the chunks in the network, while the update gate controls what information to add to the existing information and what to throw away (a combination of the LSTM's input and forget gates). GRUs have no cell states as LSTMs do; they use only hidden states to transfer the information and therefore require less computational power than LSTM networks.

Figure 4.5 illustrates the architecture of the seq2seq encoder-decoder mechanism with GRU and attention layers. The "<SOS>" and "<EOS>" tokens represent the start and end of a sentence and are added to every input fed into the model. The vectors "vn" are the words' indexes in the dictionary. The encoder GRU iterates through every token of the input sentence and outputs an output vector "wn" and a hidden state vector "hn". It uses a bidirectional GRU with two independent RNNs, one that is fed the input in the normal sequential order (represented by the chain-like network chunks "A") and one in the reverse order (represented by the chunks "A'"), so that it can encode both the past and the future context of the sequence.

The Luong attention layer (Luong et al., 2015) is used to calculate the attention weights from the encoder outputs "hn" and the decoder hidden state of the current time step "st". These attention weights allow the decoder to focus on the most important parts of the input sentence. They are calculated through score functions, for which three different methods exist: dot, general, and concat (see equation 4.2, where st is the decoder hidden state at the current time step, hn are the encoder hidden states, and Wa and va are model parameters for making predictions based on the alignment vector). These functions are used to calculate the context vector "c", which the decoder then uses to make predictions for the next word.

$$\mathrm{score}(s_t, h_n) =
\begin{cases}
s_t^{\top} h_n & \text{dot} \\
s_t^{\top} W_a h_n & \text{general} \\
v_a^{\top} \tanh(W_a [s_t; h_n]) & \text{concat}
\end{cases} \tag{4.2}$$

The decoder GRU generates the output sentence one token at a time. It predicts the next token based on the context vector "c" and the current decoder hidden state st, until it outputs the "<EOS>" token, which represents the end of the sentence. The GRU used here is unidirectional (it has only one direction and handles only the past context).

Fig. 4.5.: Architecture of the seq2seq encoder-decoder generative model with GRU RNNs and self-attention mechanism

To test the performance of the different Luong score functions, five sentences from the user tests conducted during the first implementation (see chapter 3) were selected: "Global warming will make winters a thing of the past", "Yes, but private investment will always want to improve negative features of gambling", "So therefore there exists no further alternative path to explore beyond educational programs?", "I argue that saving anyone goes against natural selection", and "There are situations where you would have to choose which persons to save as all cant be saved, would it not be ok in this situation to let some die to save more people?". These user inputs were chosen because they did not match any intent in the first implementation and triggered the Fallback message; the second implementation of the ArgueBot was set to prompt the generative model instead. The results are presented in table 4.6. The general function was chosen for the generative model in the ArgueBot, as its generated sentences in the experiment were better composed than those of the other functions.
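The three score functions of equation (4.2) translate directly into code. The following is an illustrative PyTorch sketch, with random stand-ins for the decoder state s_t and a single encoder hidden state h_n; the dimensions are arbitrary.

```python
# Luong attention score functions: dot, general, and concat.
import torch
import torch.nn as nn

hidden = 512
s_t = torch.randn(hidden)          # decoder hidden state at the current step
h_n = torch.randn(hidden)          # one encoder hidden state

W_a = nn.Linear(hidden, hidden, bias=False)        # parameters for "general"
W_cat = nn.Linear(2 * hidden, hidden, bias=False)  # parameters for "concat"
v_a = torch.randn(hidden)

score_dot = torch.dot(s_t, h_n)
score_general = torch.dot(s_t, W_a(h_n))
score_concat = torch.dot(v_a, torch.tanh(W_cat(torch.cat([s_t, h_n]))))

print(score_dot.item(), score_general.item(), score_concat.item())
```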
Input: "Global warming will make winters a thing of the past"
dot: the global economy is not necessarily bad
general: a global warming is no longer a small priority
concat: global warming is a very short term economic and development measure to have produced human rights

Input: "Yes, but private investment will always want to improve negative features of gambling"
dot: for example the private sector has been investing for decades as the best way to progress is
general: that it is impossible to say whether she is being spent or because of the risk that is being spent on the condition
concat: but what a lot of money is going to be able to drive a service based market

Input: "So therefore there exist no further alternative path to explore beyond educational programs?"
dot: this is a problem that is simply not about what people want to learn about how they play a natural process
general: this is a clear approach that can easily be used malevolently and is not consistent
concat: all this is true will not offer that all women are expected to abide by

Input: "I argue that saving anyone goes against natural selection"
dot: i am sure it is a good thing that you have to be scientifically scientifically
general: i think that is exactly why id is an incorrect theory
concat: in fact behes claim is that they know not only because they are unable to abuse their intelligence if i know there is no god

Input: "There are situations where you would have to choose which persons to save as all cant be saved, would it not be ok in this situation to let some die to save more people?"
dot: however if a person has a right to education those who are legally prepared to use guns for criminal purposes to be able to save their life and then put their lives in life
general: the court s responsibility is to the state and to isolate people from the feeling of language and religion and is thus unacceptable in most cases when the state is being punished for society
concat: there is no reason why a uk ban on certain areas where in a year there would be no need for the vast majority of life

Tab. 4.6.: Comparison between text generations based on different Luong score functions

4.4 Conclusion

This chapter described the second implementation of the ArgueBot platform. The model of the chatbot was changed from retrieval-based to hybrid, combining both retrieval and generative models; this was done to compensate for the retrieval model being restricted to the arguments stored in the database. The stance classification method was changed from using sentiment analysis to treating stance as a binary classification problem and applying machine learning classification methods. After conducting a set of experiments, the LSTM with self-attention mechanism was chosen as the classifier to include in the chatbot.

[...] A Facebook group with the link and information about the chatbot was created, and 355 acquaintances were invited to it. The user tests resulted in 38 survey answers; the low number of answers is most probably a result of bad timing, with many acquaintances being on vacation. One user test was removed from the resulting answers, as that user did not receive any response from the chatbot because of a server error and could therefore not evaluate the interaction with the chatbot fairly.

The ArgueBot platform included a front page with information about the chatbot, instructions for the user test, and a consent form (see figure 5.1a). After agreeing to the terms of the user test, the users were redirected to the main page with the chatbot (see figure 5.1b).
The main page was separated into two sections. On the left side, there was a menu for choosing a debate topic and browsing between the different debates available for that topic. Every debate had background information. The chatbot message interface was located on the right side of the page. The identification number that every user was assigned was displayed at the top of the page, while the link to the survey, with a reminder to copy the identification code before proceeding, was located at the bottom of the page.

The survey was created in Google Forms (see Appendix B) and included the same questions as the survey for the first implementation (see Section 3.3), with two differences:

• The users were asked whether they had interacted with any chatbot before, with predefined yes/no answers: yes (ArgueBot and other chatbots), yes (ArgueBot), yes (other chatbots), no. This question provided some background on the users' previous interactions with chatbots.

• The users were asked to rate the response quality for the generated responses (marked with GR at the end of the sentence) and the retrieved responses separately. Answering the question about the generated responses was optional, as not all users received such responses.

As previously mentioned, the stance classifier was changed in the middle of the user tests, after it was discovered that the previous classifier could not distinguish between the different classes and predicted the same class for all user inputs. The change of the classifier resulted in a split of the user test answers that were affected by it. The user tests that used the LSTM as a stance classifier had 16 responses, while the user tests that used the LSTM with self-attention mechanism (LSTM SAM) resulted in 21 responses (one survey response of the 22 was removed, as mentioned before).

Fig. 5.1.: The interface of the ArgueBot for the second implementation: (a) the front page; (b) the main page

5.2 Survey results

This section presents the results of the survey and highlights the most important findings. The survey results were divided into two groups (for the two versions of the ArgueBot that used LSTM and LSTM SAM as a stance classifier), but they are analyzed jointly for the survey responses that were not impacted by the stance classifier (such as grammar and debate background).

5.2.1 User Background

Figure 5.2 shows the distribution of users with different interaction backgrounds for the classifiers used. The distribution is quite similar, with both classifier versions having users with previous experience chatting with a chatbot. The LSTM SAM version of the ArgueBot was tested by more users who had previously tested the ArgueBot in the first implementation.

Fig. 5.2.: The distribution of different user backgrounds (previous interaction experience with chatbots) for the different classifiers: (a) LSTM stance classifier; (b) LSTM SAM stance classifier

5.2.2 Debate information

The purpose of the debate information presented on the left side of the platform was to give the users the context of the debate and help them formulate their arguments. This part of the survey responses was handled jointly for both classifiers, as the debate information stayed the same for both versions of the ArgueBot. All of the users (37 of 37) thought that the debate information was clearly presented and gave mostly positive feedback on its content. The users found it interesting and relevant. Moreover, 30 of the 37 users found the debate information helpful.
The users that did not find it helpful (7 of 37) commented that the information was too long or too heavy, and suggested either a word cloud of the most important words or an outline with the major arguments. Some users who did not find it

The response quality questions were handled separately for the different classifiers used; the ratings can be seen in Figure 5.5 for the generated sentences and Figure 5.6 for the retrieved sentences. Even though the generated sentences were not affected by the change of the classifier, their ratings could still be indirectly affected by the performance of the retrieval model that used the classifier. The users could, for example, rate the retrieved sentences higher not because their quality was better, but because the quality of the generated responses was worse. In the figures, the ArgueBot version that used the LSTM stance classifier is represented in blue, while the version that used the LSTM SAM is represented in orange.

The quality of the generated sentences was rated as unsatisfactory by the majority of the respondents for both classifiers, with an average of 3.17 ± 2.04 for the LSTM version and 3.13 ± 2.85 for the LSTM SAM version. The t-test yielded a p-value of 0.97; there is, therefore, no statistically significant difference between the classifiers for the rating of the generated responses. The users thought that the generated sentences made little to no sense. Some of the users stated that the generated responses were very random and did not relate to the topic discussed.

Fig. 5.5.: User ratings for the response quality of the generated sentences, where 1 is unsatisfactory and 10 is satisfactory. The percentage score shows the distribution for the rating amongst the users answering the question for the ArgueBot with LSTM (blue color) and LSTM SAM (orange color) as a stance classifier

The quality of the retrieved sentences was rated on average 5.14 ± 2.35 for the LSTM SAM version and 4.81 ± 2.14 for the LSTM version. The t-test for the retrieved responses yielded a p-value of 0.66; compared to the 0.05 significance level, there is therefore no statistically significant difference between the results for the two classifiers. The respondents that tested the LSTM version of the ArgueBot felt that the chatbot did not understand them, that there was no common thread throughout the responses, and that the responses were often unrelated to what the user said. The respondents that tested the LSTM SAM version felt that the retrieved responses were better than the generated sentences and more related to the debate, but still limited.

Fig. 5.6.: User ratings for the response quality of the retrieved sentences, where 1 is unsatisfactory and 10 is satisfactory. The percentage score shows the distribution for the rating amongst the users for the ArgueBot with LSTM (blue color) and LSTM SAM (orange color) as a stance classifier

The difference between the user ratings for the retrieved and the generated responses can be seen in Figure 5.7 for both classifiers together. The figure shows that the retrieved responses, marked in violet (average 5 ± 2.24), were rated better than the generated responses, marked in pink (average 3.15 ± 2.48). The t-test yielded a p-value of 0.003, which is below the 0.05 significance level; the difference between the two response types is therefore statistically significant.
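As an illustration of how such a comparison can be computed, the sketch below runs a two-sample t-test on two lists of ratings with SciPy. The thesis does not state which t-test variant was used; Welch's variant, which does not assume equal variances, is shown here, and the rating values are placeholders rather than the actual survey data:

```python
from scipy import stats

# Placeholder 1-10 ratings; the real survey data is not reproduced here.
retrieved_ratings = [6, 5, 7, 4, 5, 8, 3, 6]
generated_ratings = [3, 2, 4, 1, 5, 3, 2, 4]

t_stat, p_value = stats.ttest_ind(retrieved_ratings, generated_ratings,
                                  equal_var=False)  # Welch's t-test

# A p-value below the 0.05 significance level indicates a statistically
# significant difference between the mean ratings of the two groups.
print(f"t = {t_stat:.2f}, p = {p_value:.3f}, significant: {p_value < 0.05}")
```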
Fig. 5.7.: User ratings for the response quality of the generated responses (pink color) versus the retrieved responses (violet color), where 1 is unsatisfactory and 10 is satisfactory. The percentage score shows the distribution for the rating amongst the users answering the question, for both classifiers together

5.3 Conversation length

For the conversation length, the number of turns was used as a metric. This metric shows how many input/output pairs the conversations contain (one user could have had multiple conversations on different debate topics). The conversation length can be an indicator of how engaging the user perceives the conversation with the chatbot to be (the longer the conversation, the more engaging it is assumed to be).

Figure 5.8 shows the number of turns per conversation for the two different stance classifiers used in the second implementation of the ArgueBot. One conversation here is limited to one specific debate. The average number of turns for the LSTM classifier is 8 ± 6, and for the LSTM SAM classifier it is 7 ± 8. The standard deviations for both classifiers are comparable to or higher than their means, which indicates a high spread in the data. The decision was therefore made to group the numbers of turns into ranges of five, to understand the data better. The t-test could not be performed here, as the numbers of turns had already been grouped, so it was not possible to determine whether there was any statistically significant difference between the classifiers. According to Figure 5.8, however, the distribution of the grouped numbers of turns appears similar for both classifiers, with the majority of conversations being shorter than 10 turns.

Fig. 5.8.: Comparison between the different stance classifiers for the number of turns per conversation, where blue represents the LSTM stance classifier and orange the LSTM SAM stance classifier. The percentage score shows the distribution of the ranges of the number of turns over all the conversations for that classifier

The main claim of an argument was one sentence consisting of 3-10 words (see an example of a debate structure in Figure 3.1, with "PRO/CON + number" main claims). This was done under the assumption that the user input would also be 3-10 words long, so that the similarity between the two could be computed. The model then chose the argument whose main claim had the highest similarity with the user utterance. Sometimes this worked as intended, but sometimes the user utterance differed from all the main claims available for that debate, or the main claim available in the dataset was not descriptive enough. The latter case resulted in the model retrieving the argument with the highest similarity but with no actual relevance to the user input. A better approach might be to extract the most used words (debate-specific words) in every argument (separately for a point and a counterpoint) and use them as keywords, in combination with the main claim, to compare with the user input. Another problem occurred when two different user utterances were matched to the same argument. The model then assumed that it should continue to retrieve sentences from the same argument and removed the used sentence from the database.
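Both of these problems originate in the claim-matching step described above. A minimal sketch of that step is shown below; spaCy's similarity() computes cosine similarity over averaged word vectors, and the model name, argument identifiers, and claims are illustrative, not taken from the thesis code or dataset:

```python
import spacy

# "en_core_web_md" is an assumption: a model with word vectors is required,
# but the thesis does not state which spaCy model was used.
nlp = spacy.load("en_core_web_md")

def best_argument(user_input: str, main_claims: dict) -> str:
    """Return the id of the argument whose main claim is most similar to
    the user utterance; Doc.similarity() is cosine similarity over the
    averaged word vectors of the two texts."""
    user_doc = nlp(user_input)
    scores = {arg_id: user_doc.similarity(nlp(claim))
              for arg_id, claim in main_claims.items()}
    return max(scores, key=scores.get)

# Illustrative main claims for one debate (not from the thesis dataset).
claims = {
    "PRO1": "School keeps teenagers away from crime",
    "CON1": "Teenagers should be free to start working",
}
print(best_argument("Staying in school prevents youth crime", claims))
```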
In the first implementation, when the user input was matched to an argument that did not have any sentences left to retrieve, the chatbot would reply with the default message "You already used this argument". According to the user tests, this felt unnatural to many users, as they had in fact used new arguments and the chatbot's reply was false. For the second implementation, the retrieval strategy was slightly changed: the model now handles the two arguments with the highest similarity to the user input, and if the first argument is empty, it retrieves the next sentence from the second argument instead. This strategy, while fixing the issue of running out of new sentences to reply with, did not fix the issue of the responses being irrelevant to the user input.

To improve the relevance of the chatbot's retrieved responses, two strategies can be applied: extending the dataset with more arguments and improving the similarity computations. Cosine similarity was chosen because of its compatibility with the spaCy library, but other methods could have shown better performance, such as Word Mover's Distance, the universal sentence encoder, and the Siamese Manhattan LSTM model (Adrien Sieg, 2019).

Correct context understanding (grasping the meaning of the user input) is another challenging problem. To help the model understand the context of the user input, the Dialogflow service was used. The model extracted the 300 most used words in the debate and fed them, in their lemma form, into Dialogflow as entities, which were later used by Dialogflow to classify the user input as an argument. This method does not take synonyms into consideration, and the context detection was therefore limited to the words used in the dataset for that debate. One solution could be to combine the most used words in the debate with their synonyms.

Moreover, the ArgueBot hybrid model was built with the assumption that the user would input only arguments. It could therefore not handle the complex questions that many users asked, for example to clarify some term or statement used by the chatbot. It could not distinguish between statements and questions, which caused wrong agent responses. One solution could be to apply a self-attention mechanism to the user input in combination with the previous agent responses. Zhou et al. (2018) proposed DAM networks that use an attention mechanism to capture sentence-level dependencies in a multi-turn chatbot. This solution could therefore be applied to improve user engagement with the chatbot, as it is capable of taking the user's previous inputs into consideration.

6.2 Stance Classification

In the second implementation, the ArgueBot used machine learning techniques for stance classification. The stance classifier was trained on a dataset composed of sentences with their corresponding stances from the ArguAna Counterargs corpus (Wachsmuth et al., 2018) and the IBM Debater Claim Stance Dataset (Bar-Haim et al., 2017). The machine learning methods used were the LSTM and the LSTM with self-attention, which focuses on the most important words in the sequence. The classifier was used to predict the stance of the user input, which was then compared to the chatbot's stance; the chatbot then replied with either an "I agree" or an "I disagree" statement, included at the beginning of the ArgueBot's retrieved response. The classifier is therefore limited to the arguments available for training and does not take the debate topic into account. If more time had been available for the project, this problem could have been explored further. For example, it would be interesting to combine a machine learning classifier with the feature extraction strategies suggested by Mandya et al. (2016) and with sentiment towards the main claim, as suggested by Bar-Haim et al. (2017).
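To make the classifier architecture concrete, here is a minimal sketch of a binary LSTM stance classifier of the kind described above (PyTorch; the vocabulary size, embedding dimension, hidden dimension, and class-index mapping are illustrative, not the thesis's actual hyperparameters). The LSTM SAM variant would additionally compute attention weights over all LSTM outputs, in the manner of Lin et al. (2017), instead of using only the final hidden state:

```python
import torch
import torch.nn as nn

class StanceLSTM(nn.Module):
    """Binary stance classifier: the sentence is encoded with an LSTM and
    the final hidden state is mapped to two classes (pro / con)."""
    def __init__(self, vocab_size=20000, embed_dim=300, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 2)

    def forward(self, token_ids):             # (batch, seq_len)
        embedded = self.embedding(token_ids)  # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)     # h_n: (1, batch, hidden_dim)
        return self.fc(h_n.squeeze(0))        # (batch, 2) class logits

# Toy usage: predict the stance of one (randomly tokenized) 12-token input.
model = StanceLSTM()
logits = model(torch.randint(1, 20000, (1, 12)))
stance = "pro" if logits.argmax(dim=1).item() == 0 else "con"
print(stance)
```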
The user tests showed differences between the two stance classifiers in the users' interaction with the chatbot, in relation to the naturalness of the conversation flow and the response quality of the generated and retrieved sentences, although the t-tests on the response quality ratings showed no statistically significant difference between the classifiers. The conversation flow and the quality of the generated responses were rated slightly higher for the LSTM version than for the LSTM SAM version of the ArgueBot. This can depend on several factors, for example, different users interpreting the rating scale differently. At the same time, the ratings of the generated responses depended on the performance of the classifier only indirectly, through the quality of the retrieved responses. The effect can therefore be the opposite of what the ratings suggest: the LSTM version may have been rated higher for the quality of the generated responses because the retrieved responses that used the LSTM classifier performed worse. This agrees with the observation that the LSTM SAM version was rated higher than the LSTM version for the quality of the retrieved responses. It can therefore be concluded that the LSTM SAM classifier performed better and improved the quality of the retrieved responses. At the same time, it did not improve the naturalness of the conversation flow. A conclusion can therefore be drawn that stance classification cannot be seen as a simple binary classification problem. Stance classification was shown to be a more complex problem that needs a more complex model, which was unfortunately out of the scope of this project.

The overall lesson learned from the user tests is that the stance classifier should be more reliable in order to improve the user experience. If the stance classifier predicts the wrong stance and produces the wrong agreement statement, the user assumes that the chatbot did not understand their argument, which reduces their satisfaction with the interaction.

6.3 Generative Model

The generation of sentences is also a very interesting and challenging problem. When done properly, it can enrich the conversation with the chatbot by always offering new arguments to debate. When done poorly, it can damage the conversation flow and confuse the user instead. Unfortunately, the latter was the case for the ArgueBot. According to the user tests, the generated sentences were often unrelated to the debate topic, sometimes grammatically incorrect, and did not make sense in the debate context. Because of the limitations of Google Colab, which was used to pre-train all the machine learning models, the dataset used was not large enough to generate sentences that made more sense. Google Colab and a smaller dataset were used because this was fast and convenient. If more time and computational power had been available for the project, it would be interesting to train a model on a larger dataset, composed of, for example, the Internet Argument Corpus (IAC)1, and compare the quality of the generated sentences.

1 https://nlds.soe.ucsc.edu/iac2

6.4 Hybrid Model

It is important to keep in mind that although generative models can build understandable and grammatically correct responses given the conversation context, they are likely to return general responses.
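To illustrate the hybrid strategy at the level of control flow, the following is a minimal sketch of a retrieval-first dispatch that falls back to the generative model. In the actual ArgueBot the fallback decision is driven by Dialogflow's intent matching rather than an explicit similarity threshold, so the threshold and the stand-in functions below are purely illustrative:

```python
SIMILARITY_THRESHOLD = 0.6  # illustrative; ArgueBot delegates this decision
                            # to Dialogflow's intent matching

def respond(user_input, retrieve, generate):
    """Hybrid dispatch: prefer a sentence retrieved from the argument
    database, fall back to the seq2seq generative model otherwise."""
    sentence, similarity = retrieve(user_input)
    if sentence is not None and similarity >= SIMILARITY_THRESHOLD:
        return sentence
    return generate(user_input)

# Toy stand-ins for the two components; generated responses were shown to
# the users with a "GR" marker at the end of the sentence (see Chapter 5).
reply = respond(
    "Gambling should be banned",
    retrieve=lambda text: (None, 0.0),  # no stored argument matches
    generate=lambda text: "gambling is a risk to the economy GR",
)
print(reply)
```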
One suggestion to overcome this issue can

SQ:3 What is the appropriate model for the chatbot's response generation? The ArgueBot used a hybrid model that combined both generative and retrieval models. The generative model used a self-attention mechanism to generate new arguments when the retrieval model could not be applied. The generated sentences often did not have good quality and were unrelated to the specific debate topic. Given more time and computational power, the generative model could be pre-trained on more data for a longer time, which might improve its performance. This was unfortunately out of scope for this study.

SQ:4 How can a human-like conversation with the chatbot be carried out in the debate domain? The ArgueBot used Dialogflow's pre-trained agent for small talk. Stance classification was used to react to the user input with either agreement or disagreement. In retrospect, the stance classifier did not perform well enough and decreased the naturalness of the conversation flow. A debate is often more complex than just agreeing and disagreeing with the opponent, for example including arguments of the "I agree, but..." structure.

SQ:5 How can such a chatbot be evaluated? The evaluation of the ArgueBot was conducted through user tests via the created platform and a survey created in Google Forms. The survey results were analyzed with the help of graphs and statistical metrics (mean values with their standard deviations and t-tests). Moreover, user engagement was evaluated in relation to the number of turns within each conversation.

The hybrid retrieval-generation-based chatbot can maintain a debate with a user on various topics by creating an engaging, human-like experience. The agent should recognize the context of the user input, correctly classify the user's stance, and provide relevant responses, whether they are retrieved or generated. The next step for the ArgueBot would be to improve the conversation flow through a better similarity algorithm, better context understanding, an extended dataset, and a better classifier and generative model. To improve the similarity, keywords extracted from debate-specific words can be used, in combination with the main claim, for comparison with the user input. The knowledge base of the ArgueBot can be extended with more conversational data. The stance classifier can be improved by combining a machine learning classifier with feature extraction strategies and sentiment towards the main claim. The generative model can be improved by training it for a longer period on a larger amount of data. Moreover, other models can be used to make the interaction with the chatbot more engaging, such as the DAM attention mechanism and Microsoft's Icecaps. I believe that, given these improvements, the ArgueBot can become an interesting debate partner, broaden the political discussion, and promote critical thinking in many.

Bibliography

Abbott, Rob, Brian Ecker, Pranav Anand, and Marilyn A. Walker (2016). “Internet Argument Corpus 2.0: An SQL schema for Dialogic Social Media and the Corpora to go with it.” In: LREC (cit. on p. 8).

Bar-Haim, Roy, Indrajit Bhattacharya, Francesco Dinuzzo, Amrita Saha, and Noam Slonim (2017). “Stance classification of context-dependent claims”. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 251–261 (cit. on pp. 30, 56).
Cahn, Jack (2017). “CHATBOT: Architecture, design, & development”. In: University of Pennsylvania School of Engineering and Applied Science, Department of Computer and Information Science (cit. on p. 7).

Carstens, Lucas and Francesca Toni (2015). “Towards relation based argumentation mining”. In: Proceedings of the 2nd Workshop on Argumentation Mining, pp. 29–34 (cit. on p. 39).

Carstens, Lucas and Francesca Toni (2017). “Using argumentation to improve classification in natural language problems”. In: ACM Transactions on Internet Technology (TOIT) 17.3, p. 30 (cit. on p. 39).

Chapman, Graham and Monty Python (1989). The Complete Monty Python’s Flying Circus: All the Words. Volume One. Vol. 1. Pantheon, p. 86 (cit. on p. 4).

Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, et al. (2014). “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation”. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, pp. 1724–1734 (cit. on p. 40).

Habernal, Ivan and Iryna Gurevych (2017). “Argumentation mining in user-generated web discourse”. In: Computational Linguistics 43.1, pp. 125–179 (cit. on p. 5).

Higashinaka, Ryuichiro, Masahiro Mizukami, Hidetoshi Kawabata, et al. (2018). “Role play-based question-answering by real users for building chatbots with consistent personalities”. In: Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, pp. 264–272 (cit. on p. 10).

Hochreiter, Sepp and Jürgen Schmidhuber (1997). “Long short-term memory”. In: Neural Computation 9.8, pp. 1735–1780 (cit. on p. 37).

Huang, Anna (2008). “Similarity measures for text document clustering”. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand. Vol. 4, pp. 9–56 (cit. on p. 15).
“Argumentation mining: State of the art and emerging trends”. In: ACM Transactions on Internet Technology (TOIT) 16.2, p. 10 (cit. on p. 4). Luong, Thang, Hieu Pham, and Christopher D. Manning (2015). “Effective Approaches to Attention-based Neural Machine Translation”. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal: Association for Computational Linguistics, pp. 1412–1421 (cit. on p. 40). Ma, Wenjia, WenHan Chao, Zhunchen Luo, and Xin Jiang (2018). “CRST: a claim retrieval system in Twitter”. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 43–47 (cit. on p. 6). Mandya, Angrosh, Advaith Siddharthan, and Adam Wyner (2016). “Scrutable feature sets for stance classification”. In: Proceedings of the Third Workshop on Argument Mining (ArgMining2016), pp. 60–69 (cit. on pp. 6, 56). Manning, Christopher D (2008). Introduction to information retrieval. Cambridge: Cambridge University Press (cit. on pp. 6, 8). Moore, Robert J, Raphael Arar, Guang-Jie Ren, and Margaret H Szymanski (2017). “Con- versational UX design”. In: Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems. ACM, pp. 492–497 (cit. on p. 2). Müller, Andreas C, Sarah Guido, et al. (2016). Introduction to machine learning with Python: a guide for data scientists. " O’Reilly Media, Inc." (cit. on pp. 32, 33). Bibliography 62 Footnotes Adrien Sieg (2019). Text Similarities : Estimate the degree of similarity between two texts. URL: https://medium.com/@adriensieg/text-similarities-da019229c894 (visited on June 28, 2019) (cit. on p. 55). ChatEval (2019). URL: https://chateval.org/ (visited on Mar. 4, 2019) (cit. on p. 9). Dialogflow (2019). URL: https://dialogflow.com/ (visited on Mar. 4, 2019) (cit. on pp. 9, 17). Doha Debates (2019). Why This Debate and Why Now? URL: https://dohadebates.com/ (visited on Apr. 2, 2019) (cit. on p. 1). Findwise (2019). Search Driven Solutions. URL: https://findwise.com/en (visited on Apr. 2, 2019) (cit. on p. 2). Github (2019). Dialogflow: Python Client, Github. URL: https://github.com/googleapis/ dialogflow-python-client-v2 (visited on June 10, 2019) (cit. on p. 17). IBM Research AI (2019). Project Debater. URL: https : / / www . research . ibm . com / artificial-intelligence/project-debater/ (visited on Feb. 6, 2019) (cit. on p. 1). idebate (2019). International Debate Education Association. URL: https://idebate.org/ (visited on Apr. 4, 2019) (cit. on p. 11). kaggle (2019). Quora Question Pairs. URL: https://www.kaggle.com/c/quora-question- pairs (visited on Apr. 8, 2019) (cit. on p. 8). Matthew Inkawhich (2019). Chatbot tutorial. URL: https://pytorch.org/tutorials/ beginner/chatbot_tutorial.html (visited on June 10, 2019) (cit. on p. 39). Merriam-Webster (2019). Debate. URL: https://www.merriam-webster.com/thesaurus/ debate (visited on Feb. 12, 2019) (cit. on p. 1). Ngrok (2019). URL: https://ngrok.com/ (visited on July 18, 2019) (cit. on p. 18). NLTK (2019). Natural Language Toolkit. URL: https://www.nltk.org/ (visited on Mar. 21, 2019) (cit. on p. 15). Numpy (2019). Numpy Homepage. URL: https://www.numpy.org/ (visited on July 17, 2019) (cit. on p. 16). Pallets (2019). Flask SQLAlchemy. URL: https://flask-sqlalchemy.palletsprojects. com/en/2.x/ (visited on Apr. 8, 2019) (cit. on p. 18). 65 Pandorabots (2019). Pandorabots Documentation. URL: https://pandorabots.com/docs/ (visited on Mar. 4, 2019) (cit. on p. 9). Panetta, K. (2019). 
Gartner Top 10 Strategic Technology Trends for 2019. URL: https:// www.gartner.com/smarterwithgartner/gartner-top-10-strategic-technology- trends-for-2019/ (visited on Feb. 6, 2019) (cit. on p. 2). Prakash Pandey (2019). Text Classification Pytorch. URL: https://github.com/prakashpandey9/ Text-Classification-Pytorch (visited on June 8, 2019) (cit. on p. 31). Projects, The Pallets (2019). Flask. URL: https://palletsprojects.com/p/flask/ (cit. on p. 18). PyTorch (2019). URL: https://pytorch.org/ (visited on July 26, 2019) (cit. on p. 31). spaCy (2019). Industrial-Strength Natural Language Processing. URL: https://spacy.io/ (visited on Mar. 21, 2019) (cit. on pp. 15, 30, 38). SQLite (2019). SQLite Homepage. URL: https://www.sqlite.org/index.html (visited on July 17, 2019) (cit. on p. 15). UC Santa Cruz (2019). Natural Language and Dialogue Systems: Internet Argument Corpus. URL: https://nlds.soe.ucsc.edu/iac2 (visited on Mar. 21, 2019) (cit. on pp. 39, 57). Wikipedia (2019). Ethics of artificial intelligence. URL: https://en.wikipedia.org/wiki/ Ethics_of_artificial_intelligence (visited on Mar. 20, 2019) (cit. on p. 5). Footnotes 66 List of Figures 3.1 An example of a debate’s architecture . . . . . . . . . . . . . . . . . . . 13 3.2 Architecture of first implementation . . . . . . . . . . . . . . . . . . . . 14 3.3 A conversation conducted during the user tests for the first implementa- tion of the ArgueBot on the debate topic "Raise the school leaving age to 18" . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.4 The interface of the ArgueBot for the first implementation . . . . . . . 22 4.1 Architecture of second implementation . . . . . . . . . . . . . . . . . . 27 4.2 A conversation conducted during the user tests for the second implemen- tation of the ArgueBot on the debate topic "Making voting compulsory" 28 4.3 Comparison between two dialogues using different classifiers for the same debate topic "Banning the development of genetically modified organisms", where the blue color represents the user and the green color represents the agent . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.4 Architecture of the LSTM with Self-Attention Mechanism . . . . . . . . 38 4.5 Architecture of the seq2seq encoder-decoder generative model with GRU RNNs and Self-Attention mechanism . . . . . . . . . . . . . . . . 41 5.1 The interface of the ArgueBot for the second implementation . . . . . . 46 5.2 The distribution of different user backgrounds (previous interaction experience with chatbots) for different classifiers. . . . . . . . . . . . . 47 5.3 How the users rated the grammar of the chatbot’s responses . . . . . . 48 5.4 User ratings for how natural (human-like) the conversation flow with the chatbot felt, where 1 is unnatural and 10 is natural. The percentage score shows the distribution for the rating amongst the users for the ArgueBot with LSTM (blue color) and LSTM SAM (orange color) as a stance classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 5.5 User ratings for the response quality of the generated sentences, where 1 is unsatisfactory and 10 is satisfactory. The percentage score shows the distribution for the rating amongst the users answering the question for the ArgueBot with LSTM (blue color) and LSTM SAM (orange color) as a stance classifier. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
50 67 Appendix Survey ArgueBot 1.0 ArgueBot This research is a part of master thesis, entitled “ArgueBot: Enabling a debate through a multi-tum retrieval-based chatbot”. The aim of the research is to create a chatbot that is able to maintain a meaningful debate with users on various topics throughout the whole interaction. The goal of the chatbot is to inform, challenge and provoke the user into critical thinking In this research, you will interact with the chatbot in the debate on different topics. Be aware that the opinions stated by the chatbot do not represent its developer's opinions in any way but are retrieved from the dataset containing multiple arguments with different points of view. The interaction with the chatbot will be saved for further analysis. Please, do not enter any private information into the chatbot such as your password or email. The data collected by the chatbot and this survey: + will not be passed on to third parties; will be treated confidentially; will only be used in anonymised form for this research; will be stored so that unauthorized persons cannot access them. If you have questions about your rights as a research participant or wish to obtain information, ask questions, or discuss any concerns about this study with someone ather than the researcher, please contact the Secretary of the Ethics Committee of the department of EEMCS, mail: ethics- [email protected] Thank you for taking part in this research! Yours sincerely, Iryna Kulatska [email protected] * Required By choosing "I consent" option, you confirm you hereby understood your rights and agree to continue. * © |Iconsent © |do not consent NEXT 70 Please write down the identification code, that can be found on the top of the ArgueBot webpage. Your answer Was the information about the debates clearly presented? * © Yes O No Please elaborate on your answer above Your answer Did you find the information about the debates helpful? * © Yes O No Please elaborate on your answer above Your answer Please rate the conversation flow with the chatbot in terms of how natural(human-like) you perceived it. * 1°92 3 4 5 6 7 8 9 10 umaurt OOO 00O00 0 © Natural Please elaborate on your answer above Your answer How did you perceive the grammar used by the chatbot? * 12 3 4 5 6 7 8 9 10 veybed OO OO0O000 0 O eeyaood Please elaborate on your answer above Your answer How would you rate the response quality of the chatbot. * 12345678 9 10 Unsatisfactory OOO OOO000 O Satisfactory Please elaborate on your answer above Your answer Please give any additional feedback here Your answer BACK ATs Never submit passwords through Google Farms, 71 Appendix Survey ArgueBot 2.0 ArgueBot 2.0 This research is a part of master thesis, entitled "ArgueBot: Enabling a debate through a hybrid retrieval-generation-based chatbot'”. The aim of the research is to create a chatbot that is able to maintain a meaningful debate with users on various topics throughout the whole interaction. 
The goal of the chatbot is to inform, challenge and provoke the user into critical thinking.

The data collected by the chatbot and this survey:
• will not be passed on to third parties;
• will be treated confidentially;
• will only be used in anonymised form for this research;
• will be stored so that unauthorized persons cannot access them.

If you have questions about your rights as a research participant or wish to obtain information, ask questions, or discuss any concerns about this study with someone other than the researcher, please contact the Secretary of the Ethics Committee of the department of EEMCS, mail: ethics-[email protected]

Thank you for taking part in this research!

Yours sincerely,
Iryna Kulatska
[email protected]

* Required

By choosing the "I consent" option, you confirm you hereby understood your rights and agree to continue. *
○ I consent
○ I do not consent