Chatbots in Argument Mining: Types, Stance Classification, and Evaluation, Lecture notes of Machine Learning

Mansour University College (MUC)Machine Learning

Discusses the limitations of current chatbot systems in understanding context and emotional cues, and proposes methods for argument mining and stance classification using chatbots. It covers different types of chatbots, evaluation tools, and datasets. The document also suggests potential solutions to improve chatbot performance in the debate domain.

Typology: Lecture notes

2021/2022

Uploaded on 09/07/2022


University of Twente

Human Computer Interaction and Design
EIT Digital Master School
Interactive Technology
University of Twente
P.O. Box 217
7500 AE Enschede
The Netherlands

M.Sc. Thesis

ArgueBot: Enabling debates through a hybrid retrieval-generation-based chatbot

Iryna Kulatska

Supervisors from University of Twente
Dr. M. Theune
Prof. Dr. D.K.J. Heylen
J.B. van Waterschoot, MSc

Supervisor from Findwise
J. Bratt, MSc

2019


Abstract

The goal of this study is to develop a debate platform, the ArgueBot, that is able to maintain a meaningful debate with the user on various topics. The goal of the chatbot is to carry out human-like debates with the users. The ArgueBot uses a hybrid model, combining retrieval- and generative-based models. The retrieval model uses cosine similarity to compare the user input with the argument candidates for a specific debate. The generative model is used to compensate for the limitations of the retrieval model, which is restricted to the arguments stored in the database. The ArgueBot utilizes Dialogflow, Flask, spaCy, and Machine Learning technologies within its architecture. User tests and a survey are used to evaluate the chatbot’s performance. The user tests showed that there is potential in the ArgueBot, but it needs better context understanding, a more accurate stance classifier and a better generative model.

1 Introduction
- 1.1 Problem Statement
- 1.2 Thesis Structure
2 Background and Related Work
- 2.1 Argument mining
  - 2.1.1 Arguments and their components
  - 2.1.2 Stance classification
- 2.2 Chatbots
  - 2.2.1 Types of chatbots
  - 2.2.2 Hybrid model
  - 2.2.3 Debate-chatbots
  - 2.2.4 Building a chatbot
  - 2.2.5 Evaluation
- 2.3 Conclusion
3 First Implementation with Basic Functionalities
- 3.1 Dataset
- 3.2 Architecture
  - 3.2.1 Pre-Processing
  - 3.2.2 Model for data analysis
  - 3.2.3 Dialogflow
  - 3.2.4 Flask
- 3.3 User tests and results
- 3.4 Conclusion
4 Second Implementation with Machine Learning
- 4.1 ArgueBot 2.0
  - 4.1.1 Dataset
  - 4.1.2 Architecture
- 4.2 Stance classification with ML
  - 4.2.1 Data
  - 4.2.2 Methodology
  - 4.2.3 LSTM with Self-Attention Mechanism
- 4.3 Generative Model
  - 4.3.1 Data
  - 4.3.2 Methodology
- 4.4 Conclusion
5 Final evaluation of the ArgueBot
- 5.1 Overview
- 5.2 Survey results
  - 5.2.1 User Background
  - 5.2.2 Debate information
  - 5.2.3 Grammar
  - 5.2.4 Conversation flow
  - 5.2.5 Response quality
- 5.3 Conversation length
- 5.4 Conclusion
6 Discussion
- 6.1 ArgueBot
- 6.2 Stance Classification
- 6.3 Generative Model
- 6.4 Hybrid Model
7 Conclusion
Bibliography
Footnotes
A Appendix Survey ArgueBot 1.0
B Appendix Survey ArgueBot 2.0

algorithm that improves the argument detection; and finally Text-to-Speech (TTS) Systems that convert text into spoken voice output and give the Debater its voice.

In the meantime, chatbots are gaining more and more momentum as a new platform for human-computer interaction. According to Gartner, Inc., by 2022 twenty-five percent of enterprises will have integrated virtual customer assistants and chatbots within their platforms (^4). However, current chatbot systems still have several limitations, such as incorrect understanding of the context (meaning) of the user utterance, a lack of empathy, and the inability to understand social and emotional cues that exist in human-to-human communication (Klopfenstein et al., 2017; Moore et al., 2017).

1.1 Problem Statement

The following research aims to create a chatbot that can maintain a meaningful debate with users on various topics. The goal of the chatbot, called ArgueBot, is to be able to carry out human-like debates with the users.

The problem statement for the following research is defined as:

How can a hybrid retrieval-generation-based chatbot maintain a debate with a user for various topics?

The problem statement can be divided into sub-questions:

SQ:1 How can the model recognize and handle the arguments?

SQ:2 How can stance classification be applied for the conversational agents?

SQ:3 What is an appropriate model for the chatbot’s response generation?

SQ:4 How can human-like conversation with the chatbot be carried out in the debate domain?

SQ:5 How can such a chatbot be evaluated?

The research presented in this thesis was carried out at Findwise AB, a consultancy company that provides search-driven solutions (^5). Findwise supported this project with guidance and testing.

(^4) https://www.gartner.com/smarterwithgartner/gartner-top-10-strategic-technology-trends-for-2019/
(^5) https://findwise.com/en


1.2 Thesis Structure

Chapter 2. Background and Related Work

This chapter elaborates on the background for the research topic and related work done within the field. Here, more information about existing methods for argumentation mining and for building and evaluating chatbots can be found. Moreover, research questions SQ: 1, 2, 3 and 5 will be answered in relation to the previous work.

Chapter 3. First Implementation with Basic Functionalities

This chapter describes the chosen methods and user tests for the first implementation of the ArgueBot. Here research questions SQ: 1, 2, 3, 4 and 5 will be answered in relation to the first implementation of the ArgueBot.

Chapter 4. Second Implementation with Machine Learning

This chapter describes the changes made in the second implementation of the ArgueBot. Here, research questions SQ: 1, 2 and 3 will be answered in relation to the second implementation of the ArgueBot.

Chapter 5. Final evaluation of the ArgueBot

This chapter will present the results for the evaluation of the second implementation of the ArgueBot. Here, research question SQ: 5 will be answered in relation to the second implementation of the ArgueBot.

Chapter 6. Discussion

Here, the results presented in the previous chapter with their challenges and limitations will be discussed.

Chapter 7. Conclusion

This chapter will summarize the findings and propose how they can be further improved in future work. Here, all research questions will be answered with regard to the whole project.


Consider the following text extracted from the Wikipedia article "Ethics of artificial intelligence": "Joseph Weizenbaum argued in 1976 that AI technology should not be used to replace people in positions that require respect and care, such as any of these: customer service representative [...], therapist [...], nursemaid for the elderly [...], soldier, judge, police officer. Weizenbaum explains that we require authentic feelings of empathy from people in these positions. If machines replace them, we will find ourselves alienated, devalued and frustrated. Artificial intelligence, if used in this way, represents a threat to human dignity [...]"^1

Here, "AI technology should not be used to replace people in positions that require respect and care" is a conclusion (the claim, the core of the argument). "Weizenbaum explains that we require authentic feelings of empathy from people in these positions", "If machines replace them, we will find ourselves alienated, devalued and frustrated", "Artificial intelligence, if used in this way, represents a threat to human dignity" are the premises (statements that provide reason, evidence or support for the conclusion).

An inference is the process of drawing conclusions based on the premises. In the above-mentioned example, it would be:

humans need to feel empathy, which technologies cannot provide in the same way as professionals do;
the absence of empathy and authentic feelings can result in humans’ disappointment, which threatens humans’ mental health;
therefore, AI should not replace professionals in positions that require respect and care.

Habernal and Gurevych (2017) proposed a model based on machine learning for identifying argument components containing feature sets: baseline lexical features; structural, morphological, and syntactic features; semantic, coreference, and discourse features; and embedding features. These sets of features were used to identify argument components and extract the arguments from the annotated forum posts.

Another method was proposed by Levy et al. (2017), who used it for detecting topic-relevant claims from data extracted from Wikipedia. The study used a claim sentence query to extract sentences containing the word “that” followed by the claim topic, followed by any word from a pre-defined lexicon. This lexicon included words characteristic of claims, such as argue, disagree, argument, claim, conflict and others (Levy et al., 2017).
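The claim sentence query described above can be sketched in a few lines of Python. This is only an illustration of the pattern ("that", then the topic, then a lexicon word), not the implementation of Levy et al. (2017): the function name `matches_claim_query` is hypothetical, and the lexicon contains only the five example words quoted in the text.

```python
import re

# Only the example words listed in the text; the lexicon used in the
# original study is larger and is not reproduced here.
CLAIM_LEXICON = {"argue", "disagree", "argument", "claim", "conflict"}


def matches_claim_query(sentence, topic, lexicon=CLAIM_LEXICON):
    """Return True if 'that' is followed by the topic and then by a lexicon word."""
    text = sentence.lower()
    that = re.search(r"\bthat\b", text)
    if that is None:
        return False
    # The topic has to appear somewhere after the word "that".
    topic_start = text.find(topic.lower(), that.end())
    if topic_start == -1:
        return False
    # Any lexicon word after the topic completes the pattern.
    rest = text[topic_start + len(topic):]
    return any(word in lexicon for word in re.findall(r"[a-z]+", rest))
```

For example, `matches_claim_query("Critics say that online gambling fuels conflict in families.", "online gambling")` matches, while a sentence without "that" does not.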

(^1) https://en.wikipedia.org/wiki/Ethics_of_artificial_intelligence


Furthermore, information retrieval techniques can be used to structure the arguments (indexing) and to relate them to each other by computing how similar or dissimilar they are, making it possible to find and retrieve the most relevant arguments and counterarguments (Stab et al., 2018; Ma et al., 2018; Wachsmuth et al., 2017; Wachsmuth et al., 2018; Zhu et al., 2018). Information retrieval can be defined as finding unstructured data (data without a clear, semantically distinguishable structure that is easy for a computer to understand) that complies with the information need from within a large collection of data (Manning, 2008).

2.1.2 Stance classification

Stance classification is a field within argument mining that helps to identify whether the argument is for or against the issue being debated. Mandya et al. (2016) proposed to extract the following features for stance classification: topic-stance features (specific words associated with topics); stance bearing terminology (words connected by adjectival modifier (amod) and noun compound modifier (nn) dependency relations that can indicate the stance in the argument); logical point features (extraction of words following the rule subject-verb-object (SVO) which might capture the claim); unigrams and dependency features (used to classify shorter posts).

Levy et al. (2017) proposed a method of claim stance classification in regard to a given topic. The study used precise semantic analysis of the debate topic and the claim (the sentiment of the claim towards its target), including target identification (through detecting the noun phrases in the claim), and contrast detection between the claim and the topic targets (through their relations), where each of these tasks had a separate machine learning classifier.

2.2 Chatbots

This section presents recent developments in conversational agents, also known as chatbots. A chatbot is a computer program that has the ability to mimic written or spoken human speech for interactions with humans (Kim et al., 2018).

2.2.1 Types of chatbots

Chatbots can be broadly classified into generative which generate a response based on natural language generation techniques (Kim et al., 2018; Le et al., 2018), and


2.2.3 Debate-chatbots

To date, at least two debate-chatbots were made: a chatbot Debbie, that uses a similarity algorithm to retrieve counter-arguments (Rakshit et al., 2019) and a chatbot Dave that used retrieval- and generative-based models separately (Le et al., 2018).

Chatbot Debbie used corpora compiled by Swanson et al. (2015) containing controversial topics from the Internet Argument Corpus (Abbott et al., 2016) and dialogues from online debate forums. The authors used the Argument Quality (AQ) regressor to choose the best arguments from the database containing statements for and against three controversial topics: death penalty, gay marriage, and gun control. Through the Debbie chatbot, users were able to pick a topic and specify their stance (the chatbot assumes that the user utterance is always argumentative). The system then used a similarity algorithm based on the UMBC STS score (which combines lexical similarity features such as latent semantic word similarity and WordNet knowledge) to retrieve a ranked list of the most appropriate counter-arguments that were not previously used by the chatbot. The authors created clusters (groups of documents that are semantically similar (Manning, 2008)) of arguments to speed up the retrieval process. Chatbot Debbie continues the debate until the user terminates the chat. The chatbot was evaluated by comparing the average response times for different retrieval methods used for implementation (Rakshit et al., 2019).

Chatbot Dave (Le et al., 2018) also used the Internet Argument Corpus (Abbott et al., 2016) for its knowledge base. The chatbot incorporates both a retrieval-based and a generative conversational model separately. The retrieval-based model used the Manhattan LSTM (MaLSTM) similarity model to learn the semantic similarity between messages and compare the user message with the knowledge base. To train and evaluate the MaLSTM model, a parallel corpus consisting of the Quora question pairs from Kaggle (^2) was used. The Quora dataset was used as a "ground truth" for evaluation of the similarity model. Additionally, a context tracker function was implemented to keep track of the user and system responses. The generative model used a hierarchical recurrent (RNN) encoder-decoder architecture, where each word in the response was embedded using pre-trained word embeddings. The generative model was evaluated with a perplexity metric and with the distinct-1 and distinct-2 metrics (the number of distinct uni- and bi-grams in the generated responses, scaled by the total number of tokens, used to measure the degree of diversity of the responses). These metrics were able to show the diversity of the generative model but were not useful for evaluating the conversational system. Instead, a rating system was implemented in the chatbot interface, where the users were able to rate each chatbot response from 1 (very bad) to 5 (very good) (Le et al., 2018).

(^2) https://www.kaggle.com/c/quora-question-pairs
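The distinct-1 and distinct-2 metrics mentioned above can be illustrated with a short Python sketch. It follows the definition given in the text (distinct n-grams divided by the total number of tokens); the whitespace tokenization and the function name `distinct_n` are assumptions made for this example, not details taken from Le et al. (2018).

```python
def distinct_n(responses, n):
    """Number of distinct n-grams across all responses, scaled by total tokens."""
    total_tokens = 0
    ngrams = set()
    for response in responses:
        tokens = response.lower().split()  # assumed: plain whitespace tokenization
        total_tokens += len(tokens)
        # zip over n shifted copies of the token list yields the n-grams
        ngrams.update(zip(*(tokens[i:] for i in range(n))))
    return len(ngrams) / total_tokens if total_tokens else 0.0
```

A model that repeats the same few phrases produces few distinct n-grams relative to its output length, so both scores drop towards 0.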

The chatbot described in this work is different from the above-mentioned chatbots in several ways: firstly, the dataset that is used for the knowledge base of the chatbot is different, resulting in different discussion topics within the chatbot; secondly, the model for implementation is different. While chatbot Debbie uses the UMBC STS similarity score and chatbot Dave uses the Manhattan LSTM similarity model, this project uses cosine similarity in combination with GloVe embedding vectors. Additionally, the final implementation of the chatbot presented in this work uses a hybrid model, combining both the retrieval and the generative models.

2.2.4 Building a chatbot

The Turing test, which tests a machine’s ability to exhibit intelligent behavior equivalent to human intelligence (Turing, 1950), inspired many researchers and engineers to develop conversational systems. One such example is Eliza, a computer program that through pattern matching and specific phrasing could imitate human-to-human conversations (Weizenbaum, 1966). A more recent chatbot that has performed well in Turing-test-style competitions is Mitsuku (four-time Loebner Prize winner), built in Pandorabots (^3) using the artificial intelligence markup language (AIML). However, chatbots built using AIML have difficulties with maintaining a dialogue for a longer time (Shum et al., 2018) and are not able to extract complex information needed in the debate domain.

Currently, there are many online tools available for building chatbots: Dialogflow, Microsoft Bot Framework (Cortana), IBM Watson Conversation, and many others. Among these, Dialogflow (^4) is a free platform for creating interfaces based on natural language conversations, whose functionality can be expanded by using webhooks (a way to send information between different applications). Both Microsoft Bot Framework and IBM Watson Conversation have a free version that allows only a limited number of API calls per month.
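To make the webhook idea concrete, the sketch below shows the kind of function a webhook endpoint might call for each incoming request. It assumes the Dialogflow ES (v2) JSON layout, in which the request carries `queryResult.queryText` and `queryResult.intent.displayName` and the reply text is returned as `fulfillmentText`. The intent name "argument", the fallback message, and the `answer_argument` callback are hypothetical and only serve the example.

```python
def handle_webhook(payload, answer_argument):
    """Build a webhook reply for one Dialogflow-style request (a plain dict)."""
    query = payload.get("queryResult", {})
    intent = query.get("intent", {}).get("displayName", "")
    user_text = query.get("queryText", "")

    if intent == "argument":  # hypothetical intent name
        # Hand the user's argument to the debate model and return its answer.
        reply = answer_argument(user_text)
    else:
        # Any other intent gets a fixed reply in this sketch.
        reply = "Let's get back to the debate."
    return {"fulfillmentText": reply}
```

In a web framework, the function would receive the parsed JSON body of the POST request and its return value would be serialized back to JSON.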

2.2.5 Evaluation

When it comes to evaluating a chatbot’s performance, the most recent tool is ChatEval (^5), which includes evaluation datasets with both human-annotated and automated baselines (Sedoc et al., 2018). The Turing test can be used to evaluate how human-

(^3) https://pandorabots.com/docs/
(^4) https://dialogflow.com/
(^5) https://chateval.org/


3 First Implementation with Basic Functionalities

“The smart way to keep people passive and obedient is to strictly limit the spectrum of acceptable opinion, but allow very lively debate within that spectrum—even encourage the more critical and dissident views. That gives people the sense that there’s free thinking going on, while all the time the presuppositions of the system are being reinforced by the limits put on the range of the debate.”

— Noam Chomsky (The Common Good (1998))

This chapter describes the first implementation of the ArgueBot platform, the design choices, and how it was tested. The goal of the first implementation was to build the base functionalities for interaction with the user. Henceforward, the ArgueBot chatbot will be referred to as an agent.

3.1 Dataset

The knowledge base for the chatbot consists of the ArguAna Counterargs corpus (Wachsmuth et al., 2018). Table 3.1 lists the 15 topics used in the dataset, which contains 1069 debates with 6779 points and 6753 counterpoints (see an example of how a debate is composed in figure 3.1) distributed between test, training and validation folders. Arguments consist of points with both pro and con stance towards the debate’s statement. Each such point includes a conclusion, premises and an inference within its text, which are not separated or labelled (see chapter 2.1.1). Each debate has an introduction with the relevant information needed to make an argument. The data in the dataset was crawled from idebate.org (^1), an international debate education association for young people that offers debates written by experienced debaters from around the world. The ArguAna Counterargs corpus therefore includes high-quality arguments, strengthened with citations. The downside of the corpus is its formal nature of argumentation, which might differ from the written arguments provided by the user in the chatbot. This corpus was chosen because it includes debate background and arguments with different stances, thereby providing stance labels for each argument and eliminating the problem of stance classification of the existing data.

(^1) https://idebate.org/

Topic                Debates   Points   Counterpoints
Culture                   46      278             278
Digital freedoms          48      341             341
Economy                   95      590             588
Education                 58      382             381
Environment               36      215             215
Free speech debate        43      274             273
Health                    57      334             333
International            196     1315            1307
Law                      116      732             730
Philosophy                50      320             320
Politics                 155      982             978
Religion                  30      179             179
Science                   41      271             269
Society                   75      436             431
Sport                     23      130             130

Training set             644     4083            4065
Validation set           211     1290            1287
Test set                 214     1406            1401
Total                   1069     6779            6753

Tab. 3.1.: Distribution of debates, points, and counters over the topics in the dataset (Wachsmuth et al., 2018)

The first implementation used 12 debates marked as "Junior" from the dataset with the claims: "Ban online gambling", "Ban animal testing", "Kill One to Save Many", "Banning School Uniforms", "Poetry should not be taught in schools", "Raise the school leaving age to 18", "Ban the niqab and other face coverings in schools", "Dreaming of a white Christmas", "Introduce a “fat tax”", "Homework is a waste of time", "Every child should have a mobile phone", "Sponsoring children in developing countries". These debates were designed for a younger audience and included simplified topics with simplified arguments, which aligned with the purpose of the first implementation: creating a platform with some basic functionalities using simplified debates. Each debate included at least six arguments (at least three arguments for and three against the main claim). Each argument included one point and one counterpoint. Each point and counterpoint were generally 4- sentences long each.


Every time the user chooses a new debate topic, the model finds the 100 most used words for that debate from the database and generates a debate object (memory object for the specific user to be used by the model) with response candidates that are also saved into the database. The 100 most used words in the dataset for that debate, henceforth called "debate-specific words", are then sent to the argument entity in Dialogflow through the API. When the user gives input in the form of a chat message, the message is sent to Dialogflow, which detects the intent (context) of that message using the debate-specific words in the argument entity and the sentence composition. If the user input is classified as an argument, it is then further analyzed by the model. The model checks how similar the user argument is to the argument candidates stored in the database and retrieves the appropriate response. If Dialogflow classifies the user input with some other intent, the model replies with a predefined response. Each section in Figure 3.2 marked with a blue rectangle will be described in more detail below.
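Extracting the debate-specific words amounts to a word-frequency count over the arguments of one debate. The following Python sketch shows one way to do this with the standard library; the tokenization rule and the function name `debate_specific_words` are assumptions, and the text does not say whether stop words were filtered out, so none are removed here.

```python
import re
from collections import Counter


def debate_specific_words(arguments, top_n=100):
    """Return the top_n most frequent words across all arguments of a debate."""
    counts = Counter()
    for text in arguments:
        # assumed tokenization: lower-cased runs of letters and apostrophes
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return [word for word, _ in counts.most_common(top_n)]
```

The resulting list is what would be sent to the argument entity for the chosen debate.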

Fig. 3.2.: Architecture of first implementation

3.2.1 Pre-Processing

Pre-processing included removing information within brackets, such as citations and explanations. Additional information for the debate backgrounds that explained the nature of the debate was also removed. These were removed by using regular expressions.
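A minimal sketch of such a clean-up step is shown below. The exact regular expressions used in the thesis are not given, so the pattern here is an assumption: it removes any non-nested span in round or square brackets and then collapses leftover whitespace.

```python
import re


def strip_bracketed(text):
    """Remove (...) and [...] spans, such as citations, from an argument text."""
    # Drop each bracketed span together with the whitespace in front of it.
    cleaned = re.sub(r"\s*(\[[^\]]*\]|\([^)]*\))", "", text)
    # Collapse any runs of whitespace left behind.
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```

Nested brackets are not handled by this pattern and would need a small parser instead.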

The debate names were changed through a written script from, for example, "This house Would Ban School Uniforms - Junior" to "banning school uniforms - Junior".


The "This House" wording format belongs to the British Parliamentary debate style that is a default format for many university societies. British Parliament consists of "Houses", thus "this house.." represents a motion to be discussed in the debate.

The name change included tokenizing the name, removing the first two tokens if they were "this" and "house", checking the tense of the verb and changing it to the present participle ("-ing") form. Tokenizing and verb-checking were implemented by using the spaCy library (^2). The arguments were then vectorized by using spaCy’s GloVe vectors model package "en_vectors_web_lg" and transformed into strings to save space. The use of these vectors will be further explained in the next section.
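The renaming steps can be approximated without spaCy, as in the sketch below. It is a simplified stand-in rather than the thesis script: whitespace splitting replaces spaCy tokenization, a small spelling heuristic replaces the verb analysis, and dropping a leading modal such as "would" is an assumption inferred from the example name. All function names are hypothetical.

```python
import re

MODALS = {"would", "should", "will", "believes"}  # assumed list of words to drop
VOWELS = "aeiou"


def to_present_participle(verb):
    """Naive '-ing' form; covers common spelling rules only."""
    if verb.endswith("ie"):
        return verb[:-2] + "ying"
    if verb.endswith("e") and not verb.endswith("ee"):
        return verb[:-1] + "ing"
    one_syllable = len(re.findall(r"[aeiou]+", verb)) == 1
    if (len(verb) >= 3 and one_syllable
            and verb[-1] not in VOWELS + "wxy"
            and verb[-2] in VOWELS
            and verb[-3] not in VOWELS):
        return verb + verb[-1] + "ing"  # consonant doubling: ban -> banning
    return verb + "ing"


def rename_debate(name):
    """Turn 'This house Would Ban X - Junior' into 'banning x - Junior'."""
    motion, sep, suffix = name.rpartition(" - ")
    if not sep:  # no ' - ' suffix present
        motion = name
    tokens = motion.lower().split()
    if tokens[:2] == ["this", "house"]:
        tokens = tokens[2:]
    if tokens and tokens[0] in MODALS:
        tokens = tokens[1:]
    if tokens:
        tokens[0] = to_present_participle(tokens[0])
    renamed = " ".join(tokens)
    return f"{renamed} - {suffix}" if sep else renamed
```

Irregular cases (for example very short verbs such as "be") are not covered by the heuristic, which is one reason a linguistic library is the more robust choice.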

The pre-processed debates with their arguments were saved to an SQLite database (^3) to reduce the computing time for the model and make the retrieval process easier.
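Storing the vectors as strings in SQLite could look like the following sketch, which uses Python's built-in `sqlite3` module. The table layout, column names, and the space-separated vector encoding are assumptions made for illustration; the thesis does not describe its schema.

```python
import sqlite3


def vector_to_string(vector):
    """Encode a vector as space-separated numbers (assumed storage format)."""
    return " ".join(f"{x:.6g}" for x in vector)


def string_to_vector(text):
    return [float(x) for x in text.split()]


def create_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS arguments ("
        "id INTEGER PRIMARY KEY, debate TEXT NOT NULL, stance TEXT NOT NULL, "
        "text TEXT NOT NULL, vector TEXT NOT NULL)"
    )
    return conn


def save_argument(conn, debate, stance, text, vector):
    cur = conn.execute(
        "INSERT INTO arguments (debate, stance, text, vector) VALUES (?, ?, ?, ?)",
        (debate, stance, text, vector_to_string(vector)),
    )
    conn.commit()
    return cur.lastrowid


def load_candidates(conn, debate):
    """Return {argument id: vector} for every stored argument of one debate."""
    rows = conn.execute("SELECT id, vector FROM arguments WHERE debate = ?", (debate,))
    return {arg_id: string_to_vector(vec) for arg_id, vec in rows}
```

Because the vectors are computed once and stored, the model only has to parse strings at query time instead of re-running the vectorizer on every argument.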

3.2.2 Model for data analysis

The model has two main purposes: handling the debate object (memory object to be used by the model) for each user, and handling the analysis of the user input.

The user-input handler used the spaCy library to vectorize the input. It used cosine similarity to compare the vectorized user input to all the argument candidates for the chosen debate. It then retrieved the id of the argument that had the highest similarity and sent it to the debate model. The cosine similarity between two vectors is a measure that calculates the cosine of the angle between these vectors projected in a multi-dimensional space. Given two vectors $\vec{a}$ and $\vec{b}$, their cosine similarity is

$$\cos\phi = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \times \|\vec{b}\|}$$

where $\vec{a}$ and $\vec{b}$ are multi-dimensional vectors over the term set $T = \{t_1, \ldots, t_m\}$ and each dimension represents a word with its weight in the sentence. When the weights are non-negative, the cosine similarity is a number between 0 and 1 (Huang, 2008).
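The retrieval step can be sketched in plain Python as follows. The vectors here are ordinary lists of floats standing in for the spaCy sentence vectors, and the function names are illustrative rather than taken from the thesis code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(user_vector, candidates):
    """Return the id of the candidate argument closest to the user input.

    candidates maps argument ids to their vectors.
    """
    return max(candidates, key=lambda arg_id: cosine_similarity(user_vector, candidates[arg_id]))
```

Since the measure depends only on the angle, a long argument and a short user message can still score highly if they point in the same direction in the embedding space.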

The handler then used the NLTK VADER library (^4) to classify the stance of the user input through sentiment analysis. The polarity of the user input (whether it has positive, neutral, or negative sentiment) was used to classify whether it was for or against the main claim of the debate. Positive sentiment was understood as a "pro" stance, negative sentiment as a "con" stance and neutral sentiment as an undefined stance.
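The mapping from sentiment to stance can be written as a small function over VADER's compound score, which ranges from -1 to 1. The cut-off of 0.05 is the threshold commonly recommended for VADER and is an assumption here, since the thesis does not state which threshold was used; the function name is also hypothetical.

```python
def stance_from_compound(compound, threshold=0.05):
    """Map a VADER compound sentiment score to a debate stance.

    With NLTK installed and the 'vader_lexicon' resource downloaded, the score
    can be obtained with:
        from nltk.sentiment.vader import SentimentIntensityAnalyzer
        compound = SentimentIntensityAnalyzer().polarity_scores(text)["compound"]
    """
    if compound >= threshold:
        return "pro"        # positive sentiment: in favour of the main claim
    if compound <= -threshold:
        return "con"        # negative sentiment: against the main claim
    return "undefined"      # neutral sentiment: stance cannot be determined
```

This mapping also makes the limitation of the approach visible: a user who supports a ban in negatively worded language would be labelled "con" even though the stance is "pro".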

(^2) https://spacy.io/
(^3) https://www.sqlite.org/index.html
(^4) https://www.nltk.org/

