A Framework for the Design and Evaluation of Machine Learning Applications

Northwestern University
Machine Learning Impact Initiative
September 2021

This document was compiled by Kristian J. Hammond, Ryan Jenkins, Leilani H. Gilpin, and Sarah Loehr with assistance from Mohammed A. Alam, Alexander Einarsson, Andong L. Li Zhao, Andrew R. Paley, and Marko Sterbentz. The content reflects materials and meetings that were held as part of the Machine Learning Impact Initiative in 2020 and 2021, with the participation of a network of researchers and practitioners. See the Machine Learning Impact Initiative Summary Report for a full list of all who participated and engaged in these processes. This work was supported through a gift from Underwriters Laboratories.

Table of Contents

Introduction
Motivation and Context
Framework Structure
    Features and Facts
    Evaluation
Framework Components
    Data
    Algorithmic Choices
    Interaction
    Evaluation
        Goals
        Values
Applying the Framework
    Data Issues
    Algorithmic Issues
    Interaction Issues
Goals and Values Alignment Issues
Research Vision, Roadmap and Next Steps
Citations

hoc analysis. The framework could allow us to ask the questions necessary to identify the gaps in research and the techniques that need to be developed in order to avoid, prevent, mitigate, and evaluate the issues regarding the viability of using a proxy.

Motivation and Context

Efforts to guide the ethical development of artificial intelligence have proliferated at an astonishing pace over the last few years — various organizations have published at least 96 reports, guidelines, sets of best practices, and so on since 2018 alone, and another 21 before that (Zhang et al. 2021: 130). While these reports have been produced by governments, professional organizations, NGOs, private corporations, universities, think tanks, and others, they are largely top-down (Allen, Smit, and Wallach 2005): they put forth general principles at the highest level which can arch over the development and deployment of artificial intelligence in any domain. Common themes among these guidelines are fairness, accountability, transparency, and explainability (the so-called "FATE" concepts); justice, human rights, and so on. This discussion has spawned additional work across disciplines to investigate the nature of these values and operationalize them. With the introduction of the European Union's General Data Protection Regulation (GDPR), for example, which guarantees a citizen's right to explanation (Goodman and Flaxman 2017), there has been a flurry of activity exploring the nature of explanation and the nature of this purported duty held by governments, corporations, or others towards data subjects (Kaminski 2019; Selbst and Powles 2018).

As the development and refinement of AI techniques continues apace, identifying these overarching values and investigating their nature is clearly important. But it comes with several costs which are also becoming clearer. First, what these frameworks boast in generality they sacrifice in power and capability for action guidance. We are not the only ones to share this view. See, for example, Zhang et al., who bemoan that "the vague and abstract nature of those principles fails to offer direction on how to implement AI-related ethics guidelines" (2021: 129). For one thing, these principles require a tremendous amount of work to operationalize, and they have led to disagreements even at the technical level around what measures of success might be appropriate for judging, for example, the fairness of a model (Alikhademi et al. 2021) or its accountability (Wieringa 2020).

Second, these approaches are also ignorant of the context of particular deployments. This is just what it is to say that these principles are maximally general; in fact, we believe they are general to a fault. The fact that they are agnostic about the domains in which artificial intelligence is deployed is an additional obstacle to their operationalization. There are significant and reasonable disagreements between domains about the nature and relevance of the FATE and other concepts.
Similarly, the importance of each of these considerations might differ from one domain to another. If a model is opaque (i.e., inscrutable to human users), this might be unproblematic if the model is used to recommend ads to a user — but it could be a conclusive reason to reject its use in the context of banking.

On the other end of the spectrum, there is a vast and growing literature that examines and critiques specific instances of artificial intelligence: e.g., advertising recommendation systems online (Rodgers 2021); facial recognition in law enforcement (Brey 2004; Raji et al. 2020; Selinger and Leong 2021); prediction models in finance (Max, Kriebitz, and Von Websky 2020; Davis, Kumiega, and Van Vliet 2013), and so on. This "bottom-up" work is also important, but it suffers from weaknesses that are the inverse of the top-down approach. This literature provides some of the most precise critiques and useful action guidance, but the utility of these insights is limited because they are not portable. It is difficult to generalize these findings for AI broadly, and often even for other applications in the same domain. Moreover, it would be onerous to examine the impacts of every single instance of the use of machine learning in society.

Spurred by these observations, we are attempting to thread the needle by developing what we characterize as a middle-out approach. This approach takes social domains as the appropriate level of analysis, identifies the individual goals and values of those domains, and then explores how particular implementations of machine learning are liable to interact with those goals and values to produce positive or negative human impacts. Our approach seeks to balance the benefits of both generality and action guidance while acknowledging the context-sensitivity of different values in AI ethics.

Of course, a full evaluation of the human impacts of machine learning systems ought to include some reference to their broader context, since AI systems are but one part of a sociotechnical system (see van de Poel 2020; Kroes et al. 2006). The ultimate consequences of ML systems will be the outcomes of the interactions between human behavior, AI systems, and the norms of the institutions in which they are embedded. This underscores the importance of working at the level of domains, since domains provide natural boundaries for evaluating the impacts of a system as a function of the relevant goals and values and its embedded use.

Framework Structure

As described above, this Framework for the evaluation of the impact of systems based on machine learning divides the task into two phases: extraction of the facts related to the design and development of systems, and the subsequent examination of how, given those facts, the system impacts the goals and values associated with a particular domain or field of use. (See "Appendix A – Machine Learning Evaluation Framework: Questions.")

Our starting point is the flow of processing that designers and developers go through in developing Machine Learning systems. Starting with data, in the different forms they take, the ML process then involves algorithmic choices, decisions about functionality, and considerations of the domain in which a system will be deployed. In looking at this process, we want to decouple this last step from the earlier ones in that the domain-level considerations are more squarely aligned with evaluation.
The core facts of a system remain the same regardless of where it is deployed, but its impact and evaluation are defined by the domain in which it is used. (For more detailed discussion of the machine learning pipeline and development process, see the "Machine Learning Overview and Tutorial" as well as "Appendix B – Machine Learning Algorithm Compendium.") This division allows us to establish the facts using agreed-upon methods. It also isolates where areas of agreement or disagreement exist and reduces the complexity of the process by separating fact from evaluation.

Features and Facts

As we consider applications, we need to tease out the features that are specific to the application, such as data sourcing and quality, the nature of the algorithms used, and the dynamic of user interactions with the resulting system; these considerations parallel the high-level ML development process. For any given ML application, designers and developers need to move through the same three core phases:

1. Data are gathered, cleaned, normalized, and harmonized against the task at hand and desired learning outcomes.
2. Specific algorithms are selected, specific features of the data are selected, and the model is iteratively trained and tested.
3. User interactions are designed and developed to facilitate functionality and usability.

In each of these phases, designers, developers, and product managers make decisions that impact the performance of the resulting system. The goal of the Framework is to develop a set of questions for each phase that uncovers the basic facts of how the system was built and what its expected performance will be.

Evaluation

The process of fact gathering results in a functional description of both the features of an application and the decisions that were made in the process of developing it. With this description in hand, the second stage of the process, domain-level evaluation, can proceed.

Framework Components

In each of the core component areas, the framework process utilizes a set of questions to interrogate the choices, facts, assumptions, features, constraints, and methodologies that were used to develop and provide the structure of the system. (To review the detailed set of Framework questions by component area, see "Appendix A – Machine Learning Evaluation Framework: Questions.")

Interaction

Once a model has been produced, it is incorporated into a larger system. The design of the human-computer interaction with the machine learning model has tremendous impact on the ways in which the system can be guided and the ways in which the system guides (Inkpen et al. 2019), which ultimately influence the outcomes and the impacts that result. As we examine interactions, we need to ask questions aimed at uncovering the ways in which model results are interpreted and guide user decision-making. While these questions require an examination of how humans interact with these systems, they are still in the realm of establishing the facts of the matter. These key questions help to ascertain the likely outcomes of the system in practice, in context. What is the core functionality of the system? What does it do (e.g., categorization, recommendation, decision support, prediction, diagnosis)? Who are the users and what are their skills (Vredenburg et al. 2002)?
Are they to judge, or even correct, the outputs of the model (Amershi et al. 2014) – or is there a danger of over-trusting those outputs (Kirkpatrick, Hahn, and Haufler 2017)? And what is the role of the user in the system and the nature of the handoff between the machine and the human who is utilizing it? The goal is not to determine whether the interactions are appropriate but to understand exactly what they are (Inkpen et al. 2019). An interaction involving a handoff from machine to user when the machine is unable to make a decision is a fact of the matter. Whether that handoff can be managed by a user in a given circumstance is a matter of evaluation.

Evaluation

The facts related to the data, the algorithm, and the system interaction are input to the evaluation process. Evaluation is done within the context of the domain in which the application will be utilized. Analyzing an algorithm against the goals and values of a domain equips us with a set of analytical tools to judge the deployment of machine learning as appropriate or inappropriate and to articulate the tradeoffs, benefits, and drawbacks of its human impact. The domains provide the core goals and values against which to compare the functional facts, and the context for reasoning about tradeoffs. This allows the assessment to move beyond generic issues of fairness, accountability, and transparency to consider the specific impact as defined by specific domain-level goals. Additional, more general societal goals are drawn in using the same structure and mechanisms that the various domains employ themselves.

In evaluating the propriety of a machine learning system in the context of its domain goals and values, there are six key elements to consider: the Primary Goals of the application, impact on Secondary Goals, impact on Implicit or Background Goals, possible Negative Impact on Individuals, possible Negative Impact on Groups, and possible Negative Impact over Time.

Goals

Institutions, professions, and (loosely) social contexts — what we are clustering together as "domains" — all have goals (Walzer 2008). The goal of a domain is the contribution it makes to society, or what those inside the domain are trying to accomplish. Much like specifying requirements during the standard engineering process, we suggest viewing these goals as impact requirements that must be met for a system to be acceptable (Van de Poel 2013; Richardson 1997). We defined a goal as "an outcome we hope to accomplish in an institution, profession, or social context." Here, "goal" is used aspirationally rather than descriptively. We are not trying to describe people's actual motivations, in that they might be motivated by fame, reputation, money, vengeance, or any number of personal aims outside of the domain. We are interested in augmenting and catalyzing the positive contributions that these domains make to society. And each of these domains, such as journalism, medicine, or business, has some positive, characteristic benefit that it supposes itself to make to society. Even domain practitioners might have difficulty articulating the goals and values of their domain, let alone considering the impact of an automated system with regard to them. With that in mind, we developed several prompts to help practitioners identify the goals of their domain. Ideally, these prompts would converge on one or a set of general answers.
In some cases, there may be empirical evidence that these goals are endorsed by the profession, e.g., statements from professional organizations or professional codes of ethics. To identify the goals of a domain, it is helpful to ask:

• Why do people choose to go into this field over others?
• What are people within this domain hoping to contribute to society? What do they take their raison d'être to be?
• How do the people working in this domain praise themselves, e.g., in their advertisements, award ceremonies, or public statements?
• What is the benefit that is peculiar to this institution, that is not provided by other institutions in society?
• What benefits do consumers, users, or broader society expect these institutions to furnish? What is the point of these domains in the eyes of outsiders?

This list is valuable for identifying the core goals and values of a domain and establishing a set of target requirements. These requirements can then be used to test the utility of an application by considering how well the facts and performance of the system satisfy them.

With the facts of a system and the domain-level requirements in hand, we can identify potential misalignment between the goals of a domain and the functional performance of a machine learning system. For example, consider the core purpose of the application and whether the model output could be optimizing for a proxy of what is desired. Might that proxy fail to reflect the genuine goal of the domain? And when a domain has multiple goals, consider the possibility that implementing a model to optimize for one goal could undermine the peripheral goals of the domain (Mesthene 1997); for example, by efficiently selecting applicants for higher education but reducing the diversity of the students selected. Are there goals associated with the task that are different from the goals that the system is focused on? Are there any goals outside of the focus area of the application that are impacted by its utilization? Similarly, some effects of a model might not manifest in the immediate term, but only over time. Could using this system as a long-term policy distort the functioning of the institution or domain it is deployed into?

Values

There is some nuance to the distinction between goals and values. We defined values as "an aspect of our activity within a domain that we wish to promote or preserve; features or qualities of our actions that merit attention while we are pursuing our goals." This is broadly consonant with other discussions of values in the technology ethics literature (see Van de Poel 2013, especially pp. 262 and following, and the other authors cited there, e.g., Anderson 1993 and Dancy 2005). In our usage, a value is a feature of our actions that is important to us. For example, a teacher might have the goal of spurring her students' interest in her field, but she might value honesty in doing so. Valuing honesty means that certain ways of spurring her students' interest, e.g., lying, misleading, or acting in bad faith, are unacceptable. Values can be thought of as providing constraints that rule out certain methods of accomplishing our goals, or reasons that count in favor of certain methods over others. To identify the values of a domain, it is helpful to consider what kinds of actions are criticized or punished, either legally or by the censure of one's colleagues. What kind of behavior is seen as unbecoming?

To put these definitions together: goals define what an application is designed to accomplish; values define the issues that need to be considered as the application is doing so.
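As a purely illustrative sketch of how goals and values can be treated as impact requirements alongside the established facts of a system, consider the following; the structure and every name in it are our own illustration rather than part of the Framework's formal apparatus.

```python
from dataclasses import dataclass, field

@dataclass
class Requirement:
    """A domain goal or value, phrased as an impact requirement."""
    name: str      # e.g., "diagnostic accuracy" or "do no harm"
    kind: str      # "goal" or "value"
    area: str      # fact area it interrogates: "data", "algorithm", or "interaction"
    question: str  # the evaluation question posed against those facts

@dataclass
class SystemFacts:
    """Functional facts established during the fact-gathering phase."""
    data: dict = field(default_factory=dict)         # provenance, coverage, segmentation
    algorithm: dict = field(default_factory=dict)    # model family, features, metrics
    interaction: dict = field(default_factory=dict)  # users, handoffs, framing of outputs

def worksheet(facts: SystemFacts, requirements: list[Requirement]) -> list[str]:
    """Pair each domain-level requirement with the established facts it interrogates.
    The result is a prompt sheet for human evaluators, not an automated verdict."""
    return [
        f"[{r.kind}] {r.name}: {r.question} | facts on record: {getattr(facts, r.area)}"
        for r in requirements
    ]

# Hypothetical medical-imaging example.
facts = SystemFacts(data={"sites": ["hospital A"], "split": "random 80/20"})
reqs = [
    Requirement("diagnostic accuracy", "goal", "data",
                "Do reported metrics hold across sites and patient populations?"),
    Requirement("do no harm", "value", "interaction",
                "What happens to patients when the model is wrong or defers?"),
]
for row in worksheet(facts, reqs):
    print(row)
```

The point of the sketch is only that the requirements and the facts live in one place and are paired explicitly; whether the facts actually satisfy a requirement remains a judgment made within the domain.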
These domain-level requirements include goals and values that go well beyond the application. A system that is able to use images to provide on-point identification of renal tumors has a set of very specific application-level requirements related to accuracy and precision. But the goals associated with medicine in general, such as "Do no harm," and societal goals driven by fairness and access are also part of the consideration of the goals and values of the domain.

There is one issue worth discussing here, which is a cluster of concerns about the theoretical viability of attributing goals and values to institutions. First, as one participant at our 2020 workshop put it: "People have goals; domains do not have goals." Second, we might worry that people within a domain have different goals (e.g., within the film industry, compare the goals of producers, directors, and writers). Third, individuals within a domain might have different goals than the goals we attribute to the domain itself. The people inside an institution who are answering phones or processing invoices might not have any

methods available to evaluators were extensive manual testing of input and output behaviors (Larson et al. 2016). In this instance, the testing was necessary in that the system itself had massive impact on human life. In other areas, such as product recommendations, the need for this kind of deep dive may be less pressing. But in both deployment instances the facts regarding algorithmic decisions and transparency remain the same.

Similarly, as systems are trained, designers and developers make decisions about which features of a data set should be included (Chandrashekar and Sahin 2014), which data sets are going to be part of the training (Roh, Heo, and Whang 2019), and how the data are segmented into training and testing corpora (Reitermanova 2010). Each of these decision areas impacts the use of systems employing the resulting model, potentially to ill effect. For example, feature selection, used to reduce training time and remove redundant or irrelevant features, can end up removing informative independent variables and, in turn, skewing results. In medicine, the removal of metadata identifying the sources of examples can weaken results in that different hospitals may have different diagnostic thresholds or techniques, which makes transferring learning from one data set to another difficult. In a recent Q&A session, Andrew Ng, a noted ML leader, commented on this issue:

"When we collect data from Stanford Hospital, then we train and test on data from the same hospital, indeed, we can publish papers showing [the algorithms] are comparable to human radiologists in spotting certain conditions. It turns out [that when] you take that same model, that same AI system, to an older hospital down the street, with an older machine, and the technician uses a slightly different imaging protocol, that data drifts to cause the performance of AI system to degrade significantly." (Q&A session of May 3, 2021, reported by IEEE Spectrum, https://spectrum.ieee.org/andrew-ng-xrays-the-ai-hype, accessed 9/2/2021.)
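The gap Ng describes can often be surfaced before deployment with a simple cross-site check. The sketch below is illustrative only: it assumes scikit-learn, and the two data sets are synthetic stand-ins for records from the hospital that supplied the training data ("site A") and from a different hospital ("site B").

```python
# Minimal cross-site evaluation sketch (assumes scikit-learn; data are synthetic stand-ins).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_a, y_a = rng.normal(size=(2000, 16)), rng.integers(0, 2, 2000)          # stand-in for site A
X_b, y_b = rng.normal(0.5, 1.3, size=(800, 16)), rng.integers(0, 2, 800)  # shifted stand-in for site B

# Hold out the site A test data once, before any model tuning, so iterative
# experimentation cannot quietly select an "easy" test corpus.
X_train, X_test, y_train, y_test = train_test_split(
    X_a, y_a, test_size=0.2, random_state=0, stratify=y_a)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Report performance both on held-out data from the training site and on the other site;
# a large gap between the two is exactly the fact an evaluator needs on record.
auc_same_site = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
auc_other_site = roc_auc_score(y_b, model.predict_proba(X_b)[:, 1])
print(f"Site A (held out): {auc_same_site:.2f}   Site B: {auc_other_site:.2f}")
```

Holding out the site A test split once, before any tuning, also guards against the iterative re-segmentation problem discussed below, in which the test corpus quietly drifts toward the easiest cases.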
Likewise, efforts to remove metadata such as race or ethnicity from examples, in service of removing bias, can introduce error into the model. A study published in 2020 showed that including race data in medical records and using it to correct for existing inequalities improved overall performance and significantly reduced bias (Allen et al. 2020).

Testing segmentation – dividing a data set into a corpus for training and a corpus for testing – is often done iteratively with the goal of developing the most accurate results. Unfortunately, this can result in testing corpora that are populated by the clearest of cases, precisely because they provide the clearest results. In applications that have significant edge cases – medicine, the law, hiring decisions – this results in systems that miss all but the obvious decisions. By surfacing the techniques used to determine the training and test sets, we surface the areas in which there may be problems.

Likewise, as developers train systems they make decisions about which data sets to include or exclude. The effect is that data providing appropriate coverage of a space may be removed from training, or data that would have augmented the features of a core set may never be incorporated. In many cases, this is intended to remove outliers or noise in the data, but it has the outcome of narrowing the scope of effectiveness of the model. Specific drug treatment policies at one hospital, if removed from the data to generalize the corpus, can have the effect of making the model inapplicable at other hospitals where the policies are different (Futoma et al. 2020). As with many decisions, this may be less about the immediate impact and more about understanding that the features excluded may have been able to provide more nuance or precision; their exclusion, and the reasons why, are part of the landscape of facts that will figure in both evaluating systems and subsequently correcting them.

Interaction Issues

The dynamic of how systems interact with users is often ignored in the consideration of their impact. But the reality is that the details of interaction between intelligent systems and human users can both introduce problems into systems that are otherwise benign and compensate for biases that might be inherent in the system. For real-time systems in which control is shared between a human user and a system, we must understand the nature of the handoff and how context is maintained. For example, when we look at self-driving cars that maintain control under most circumstances but hand that control over to the driver under difficult conditions, the question of how to communicate (or maintain) a driver's understanding of the situation is paramount. Having drivers reestablish context is simply not feasible in real-time (and often emergency) situations and, as we have seen, can lead to deadly outcomes (Griggs and Wakabayashi 2018). In evaluating such systems, we must first establish the mechanisms for the trade-off (time frame, context sharing, warning mechanisms) and then consider the ways in which user attention can be maintained during the process.

A set of somewhat more subtle issues arises from the loss of skills that comes with reliance on automated systems. As was discovered with autopilot systems, human pilots' skills tended to grow stale because of lack of engagement.
This has resulted in situations in which the machine ceded control of the plane to human users whose skills were not sharp enough to respond appropriately (Oliver, Calvard, and Potočnik 2017). For the aviation industry, this problem is mitigated through requirements for ongoing training, but it is not clear (and it remains an open research question) how to maintain those skills for every driver sitting in the seat of a self-driving car as such cars roll out.

For decision-support systems, we need to ask a different set of questions, centered on the role of the support itself. In medicine, the view is that diagnostic systems provide "suggestions" that a physician incorporates into a broader diagnostic picture. Unfortunately, physicians are human, and their reliance on the machine's suggestions can drift toward seeing the machine's diagnosis as ground truth. As with the handoff of control in self-driving cars, an approach to the problem from the perspective of policy alone is not enough. We must consider how to either enforce that policy or provide mechanisms at the interaction level to explicitly integrate the machine's output with other features of the case.

In a similar vein, the framing of results shifts how users incorporate them. The same underlying technology can be used to suggest possible solutions that a user incorporates into their decision-making process, or to critique solutions proposed by the user. In the former, the machine's solution becomes the starting point that may or may not be modified; it becomes the default answer. In the latter, the user's solution becomes the default that may or may not be changed in light of the machine's comments. No matter how accurate the automated solution is, the role that it plays in the decision dynamic can change how it impacts the decision-making process and its outcomes.

Decisions about interactions can have dramatic consequences but are often overlooked in the deployment of ML systems. The framing of questions as either opt-in or opt-out – for example, in organ donation questions during driver's license renewal – can determine outcomes more than almost any other feature. In the organ donor case, the question framed as an opt-in ("Check here if you want to be an organ donor.") nets on the order of 42% participation, while the same question framed as an opt-out ("Check here if you do not want to be an organ donor.") nets 82% participation (Johnson and Goldstein 2003). The dynamics of these interactions are crucial in determining outcomes and require us to ask how such decisions are managed.

User attitudes toward machine recommendations are shaped by the level of understanding provided to users (e.g., explanations, alternatives, trade-offs) as well as by the level of control they are given. In the former case, answers that are simply that, answers with no rationale, can either be ignored or adopted on faith. With explanations, users are presented with more information about the basis of the recommendation, which can then be used to evaluate it. In the latter case, the ability to change inputs or context creates a different relationship between the machine and its users. The outputs become responses to hypotheticals rather than unquestioned answers. The experience provokes more rather than less thought.
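To make the interaction-level difference concrete, a minimal sketch of a decision-support view might return ranked alternatives with their scores and allow a user-initiated what-if query, rather than a single unexplained answer. The sketch assumes a trained scikit-learn-style classifier exposing predict_proba and classes_; the function names are hypothetical and not part of the Framework.

```python
import numpy as np

def support_view(model, x, top_k=3):
    """Return the top-k candidate labels with their scores, not just the argmax."""
    probs = model.predict_proba(np.asarray(x, dtype=float).reshape(1, -1))[0]
    order = np.argsort(probs)[::-1][:top_k]
    return [(model.classes_[i], float(probs[i])) for i in order]

def what_if(model, x, feature_index, new_value, top_k=3):
    """Re-score the same case with one input changed, at the user's request."""
    x_mod = np.asarray(x, dtype=float).copy()
    x_mod[feature_index] = new_value
    return support_view(model, x_mod, top_k)
```

Whether such output is then presented as the default answer or as a critique of the user's own proposal is precisely the kind of interaction decision discussed above, and it belongs among the facts gathered for evaluation.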
There is an irony here. While industry has developed a set of what tend to be called "dark patterns" – interactions that are designed to addict or manipulate users (e.g., teaser recommendations, gamification of decision-making, framing) – consideration of the impact of these sorts of interaction design decisions is kept at arm's length by the human-computer interaction community. As a result, they are used, primarily by organizations that are attempting to use them for commercial ends. Regardless of intent, however, understanding the design and consequences of these interactions is crucial to considerations of human health and safety.

Goals and Values Alignment Issues

It is helpful to approach the language of goals and values by examining several contentious examples of the use of machine learning. For the first example, again consider the case of COMPAS. ProPublica reported that a company, Northpointe, had developed an algorithm called "Correctional Offender Management Profiling for Alternative Sanctions" (COMPAS) for assessing the risk that someone convicted of a crime would reoffend, based on 100+ factors (Angwin et al. 2016). This algorithm was used during parole hearings to help parole boards decide whether to release a convict eligible for parole. ProPublica showed that this model was biased in a specific way: the algorithm tended to overestimate the risk of black convicts reoffending and underestimate the risk of white convicts reoffending. Thus, the system seemed biased in precisely the way that the criminal justice system has been historically biased against people of color. If ProPublica's analysis is correct — which is controversial (Corbett-Davies et al. 2019) — this algorithm is clearly problematic because it is unfair. However, imagine that the predictions that COMPAS yielded were perfect, i.e., that it could perfectly predict whether someone who is up for parole would commit a crime if they were released from jail.

Research Vision, Roadmap and Next Steps

research problems that are necessary in order to further operationalize the design, development, and evaluation of AI systems from the perspective of human health and safety. (See the "Roadmap for Research on the Human Impact of Machine Learning Applications" document for the consolidated results of these efforts.)

Both the Framework and the Roadmap are intended to evolve. In their current states, they provide us with a starting point for working through approaches and research that will allow us to develop a set of practices aimed at improving the effectiveness of ML applications and assuring that they are designed to function without threatening human health and safety. At its base, the Framework provides a foundational structure for the design and evaluation of Machine Learning applications. It also provides us with a touchpoint for defining a Research Roadmap for the work needed to operationalize it. Utilizing the key questions and concerns of each component, we established the core platform and used it to define a central set of important research problems. As we move forward, we have three primary thrusts for future work on the Framework itself: application, refinement, and road mapping.

Application: While the Framework has theoretical validity, it needs to be tested using real-world applications.
The testing that was done at the workshop level gave us important initial information; the next step is to disseminate the model and track how it is used and how effective it is in guiding the evaluation process.

Refinement: As with any approach to best practice, the Framework remains a work in progress. Our goal in the near term is to continue to refine the model in response to any issues that arise in its utilization. Identifying these issues can result in either refinement or transformation of the Framework itself or the establishment of a research plan to develop approaches that are needed to fill in information gaps.

Road Mapping: While some issues that arise in the application of the Framework may lead to refinement of the model, often the result is the realization that certain core facts associated with a system – either while being evaluated or developed – might not be readily available. Methods for uncovering these facts are now possible research problems that can be investigated by the research community.

As we move this approach forward, the goal is to provide the community with an evolving set of tools that both evaluators and developers can use to operationalize assessment of concerns about the impact of Machine Learning systems on human health and safety.

Citations

Adadi, Amina, and Mohammed Berrada. "Peeking inside the black-box: a survey on explainable artificial intelligence (XAI)." IEEE Access 6 (2018): 52138-52160.

Ali, Shawkat, and Kate A. Smith. "On learning algorithm selection for classification." Applied Soft Computing 6(2) (2006): 119-138.

Alikhademi, Kiana, Emma Drobina, Diandra Prioleau, Brianna Richardson, Duncan Purves, and Juan E. Gilbert. "A review of predictive policing from the perspective of fairness." Artificial Intelligence and Law (2021). https://doi.org/10.1007/s10506-021-09286-4

Allen, Angier, Samson Mataraso, Anna Siefkas, Hoyt Burdick, Gregory Braden, R. Phillip Dellinger, Andrea McCoy, et al. "A Racially Unbiased, Machine Learning Approach to Prediction of Mortality: Algorithm Development Study." JMIR Public Health and Surveillance 6(4) (2020): e22400. doi:10.2196/22400

Allen, Colin, Iva Smit, and Wendell Wallach. "Artificial morality: Top-down, bottom-up, and hybrid approaches." Ethics and Information Technology 7.3 (2005): 149-155.

Amershi, Saleema, Maya Cakmak, William Bradley Knox, and Todd Kulesza. "Power to the people: The role of humans in interactive machine learning." AI Magazine 35(4) (2014): 105-120. https://doi.org/10.1609/aimag.v35i4.2513

Anderson, Elizabeth. Value in Ethics and Economics. Cambridge, MA: Harvard University Press, 1993.

Angwin, Julia, Jeff Larson, Surya Mattu, and Lauren Kirchner. "Machine Bias." ProPublica, May 23, 2016. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing. Accessed 28 May 2021.

Baier, Lucas, Fabian Jöhren, and S. Seebacher. "Challenges in the deployment and operation of machine learning in practice." ECIS (2019).

Boehm, Matthias, Arun Kumar, and Jun Yang. Data Management in Machine Learning Systems. Synthesis Lectures on Data Management 11(1) (2019).

Brey, Philip. "Ethical aspects of facial recognition systems in public places." Journal of Information, Communication and Ethics in Society (2004).

Carbone, Anna, Meiko Jensen, and Aki-Hiro Sato. "Challenges in data science: A complex systems perspective." Chaos, Solitons & Fractals 90 (2016): 1-7. https://doi.org/10.1016/j.chaos.2016.04.020
Carvalho, Diogo V., E. M. Pereira, and Jaime S. Cardoso. "Machine Learning Interpretability: A Survey on Methods and Metrics." Electronics 8 (2019): 832.

Castiglioni, Isabella, Davide Ippolito, Matteo Interlenghi, Caterina Beatrice Monti, Christian Salvatore, Simone Schiaffino, Annalisa Polidori, Davide Gandola, Cristina Messa, and Francesco Sardanelli. "Artificial intelligence applied on chest X-ray can aid in the diagnosis of COVID-19 infection: a first experience from Lombardy, Italy." medRxiv (2020).

Chandrashekar, Girish, and Ferat Sahin. "A survey on feature selection methods." Computers & Electrical Engineering 40(1) (2014): 16-28.

Corbett-Davies, Sam, Emma Pierson, Avi Feller, and Sharad Goel. "A Computer Program Used for Bail and Sentencing Decisions Was Labeled Biased against Blacks. It's Actually Not That Clear." Washington Post, April 18, 2019. https://www.washingtonpost.com/news/monkey-cage/wp/2016/10/17/can-an-algorithm-be-racist-our-analysis-is-more-cautious-than-propublicas/. Accessed 5/28/2021.

Dancy, J. "Should we pass the buck?" In T. Rønnow-Rasmussen and M. J. Zimmerman (Eds.), Recent Work on Intrinsic Value. Dordrecht: Springer (2005): 33-44.

Danks, David, and Alex John London. "Algorithmic Bias in Autonomous Systems." In IJCAI, vol. 17 (2017): 4691-4697.

Davis, Michael, Andrew Kumiega, and Ben Van Vliet. "Ethics, finance, and automation: A preliminary survey of problems in high frequency trading." Science and Engineering Ethics 19.3 (2013): 851-874.

DeGrave, Alex J., Joseph D. Janizek, and Su-In Lee. "AI for radiographic COVID-19 detection selects shortcuts over signal." Nature Machine Intelligence 3 (2021): 610-619. https://doi.org/10.1038/s42256-021-00338-7

Fayyad, Usama, Gregory Piatetsky-Shapiro, and Padhraic Smyth. "From data mining to knowledge discovery in databases." AI Magazine 17(3) (1996): 37. https://doi.org/10.1609/aimag.v17i3.1230

Futoma, Joseph, Morgan Simons, Trishan Panch, Finale Doshi-Velez, and Leo Anthony Celi. "The myth of generalisability in clinical research and machine learning in health care." The Lancet Digital Health 2(9) (2020): e489-e492. https://doi.org/10.1016/S2589-7500(20)30186-2

Goodman, Bryce, and Seth Flaxman. "European Union regulations on algorithmic decision-making and a 'right to explanation'." AI Magazine 38(3) (2017): 50-57.

Griggs, Troy, and Daisuke Wakabayashi. "How a Self-Driving Uber Killed a Pedestrian in Arizona." The New York Times. https://www.nytimes.com/interactive/2018/03/20/us/self-driving-uber-pedestrian-killed.html. Accessed 9/2/2021.

Selbst, Andrew, and Julia Powles. "'Meaningful Information' and the Right to Explanation." Conference on Fairness, Accountability and Transparency. PMLR (2018).

Selinger, Evan, and Brenda Leong. "The ethics of facial recognition technology." Forthcoming in The Oxford Handbook of Digital Ethics (ed. Carissa Véliz) (2021).

Seyyed-Kalantari, Laleh, Guanxiong Liu, Matthew McDermott, Irene Y. Chen, and Marzyeh Ghassemi. "CheXclusion: Fairness gaps in deep chest X-ray classifiers." In BIOCOMPUTING 2021: Proceedings of the Pacific Symposium (2020): 232-243.

Simon, Jonathan. Poor Discipline. University of Chicago Press, 1993.

Sokolova, Marina, and Guy Lapalme. "A systematic analysis of performance measures for classification tasks." Information Processing & Management 45(4) (2009): 427-437. https://doi.org/10.1016/j.ipm.2009.03.002
Suresh, Harini, and J. Guttag. "A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle." arXiv preprint arXiv:1901.10002 (2019).

Van de Poel, Ibo. "Translating values into design requirements." In Philosophy and Engineering: Reflections on Practice, Principles and Process. Dordrecht: Springer (2013): 253-266.

Van de Poel, Ibo. "Embedding Values in Artificial Intelligence (AI) Systems." Minds and Machines 30.3 (2020): 385-409.

Vredenburg, Karel, Ji-Ye Mao, Paul W. Smith, and Tom Carey. "A survey of user-centered design practice." Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '02) (2002): 471-478. https://doi.org/10.1145/503376.503460

Walzer, Michael. Spheres of Justice: A Defense of Pluralism and Equality. Basic Books, 2008.

Wang, Linda, Zhong Qiu Lin, and Alexander Wong. "COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images." Scientific Reports 10, no. 1 (2020): 1-12.

Wehbe, Ramsey M., Jiayue Sheng, Shinjan Dutta, Siyuan Chai, Amil Dravid, Semih Barutcu, Yunan Wu, et al. "DeepCOVID-XR: an artificial intelligence algorithm to detect COVID-19 on chest radiographs trained and tested on a large US clinical data set." Radiology 299, no. 1 (2021): E167-E176.

Wieringa, Maranke. "What to account for when accounting for algorithms: A systematic literature review on algorithmic accountability." Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (2020).

Wilcox, Clair. "Parole: Principles and Practice." Am. Inst. Crim. L. & Criminology 20 (1929): 345.

Zhang, Daniel, Saurabh Mishra, Erik Brynjolfsson, John Etchemendy, Deep Ganguli, Barbara Grosz, Terah Lyons, et al. "The AI Index 2021 Annual Report." AI Index Steering Committee, Human-Centered AI Institute, Stanford University, Stanford, CA (2021).