Algorithmic Bias Playbook, Study notes of Algorithms and Programming

The Algorithmic Bias Playbook is a guide that describes four steps organizations can take to identify and mitigate bias in live algorithms. The playbook distills insights from years of applied work helping others diagnose and mitigate bias in various sectors, including healthcare, technology, and regulation. The guide defines algorithmic bias and provides practical examples of how to measure and mitigate racial bias in live algorithms. The playbook is useful for C-suite leaders, technical teams working in healthcare, and policymakers and regulators.

Algorithmic Bias
Playbook
Ziad Obermeyer
Rebecca Nissan
Michael Stern
Stephanie Eaneff
Emily Joy Bembeneck
Sendhil Mullainathan
June, 2021

Is your organization using biased algorithms? How would you know? What would you do if so? This playbook describes 4 steps your organization can take to answer these questions. It distills insights from our years of applied work helping others diagnose and mitigate bias in live algorithms.

Algorithmic bias is everywhere. Our work with dozens of organizations—healthcare providers, insurers, technology companies, and regulators—has taught us that biased algorithms are deployed throughout the healthcare system, influencing clinical care, operational workflows, and policy.

This playbook will teach you how to define, measure, and mitigate racial bias in live algorithms. By working through concrete examples (cautionary tales), you’ll learn what bias looks like. You’ll also see reasons for optimism: success stories that demonstrate how bias can be mitigated, transforming flawed algorithms into tools that fight injustice.

Who should read this?

We wrote this playbook with three kinds of people in mind.

● C-suite leaders (CTOs, CMOs, CMIOs, etc.): Algorithms may be operating at scale in your organization, but what are they doing? And who is responsible? This playbook will help you think strategically about how algorithms can go wrong, and what your technical teams can do about it. It also lays out oversight structures you can put in place to prevent bias.
● Technical teams working in health care: We’ve found that the difference between biased and unbiased algorithms is often a matter of subtle technical choices. If you build algorithms, this playbook will help you make those choices better. If you purchase or apply them, it will make you a more ‘educated consumer’ who can identify problems before they scale.
● Policymakers and regulators: You need to clearly define what algorithmic bias looks like. This playbook’s practical approach to bias, which parallels discrimination law, can be used to craft prospective guidance for industry, or to guide retrospective civil investigations.^1

How do we define ‘bias’?

There are many definitions of algorithmic bias.^2 We use a practical one, grounded in the real-world use cases of algorithms we’ve encountered. In health care, we are often faced with a limited supply of resources: tests, treatments, or other forms of care or extra help. Algorithms are used to help decision-makers identify who needs these resources. More generally, in many important social sectors, algorithms guide decisions about who gets what. In these situations, we believe that if an algorithm scores two people the same, those two people should have the same basic needs, no matter the color of their skin or other sensitive attributes. (This is related to ‘calibration’ in the literature.) We consider algorithms that fail this test to be biased.

(^1) Robert P. Bartlett et al. "Algorithmic Discrimination and Input Accountability under the Civil Rights Acts." Available at SSRN 3674665 (2020).
(^2) There is a wealth of literature on this topic. Good starting points are: Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness in machine learning. fairmlbook.org, 2019; Irene Y. Chen et al. "Ethical Machine Learning in Healthcare." Annual Review of Biomedical Data Science 4 (2020); Alvin Rajkomar et al. "Ensuring fairness in machine learning to advance health equity." Annals of Internal Medicine 169, no. 12 (2018): 866-872; Harini Suresh and John V. Guttag. "A framework for understanding unintended consequences of machine learning." arXiv:1901.10002 (2019).

Center for Applied AI at Chicago Booth

Are there automated checks I can run to detect bias?

When algorithms are predicting the ideal target, like in the pulse oximeter example above, basic checks can suggest or confirm bias: under-representation of underserved groups in the training data, or poor accuracy in a way that fits the definition of bias above. These checks can be informative, but only if the algorithm’s actual target matches the ideal target. If not, you should not be reassured by good performance.

Unfortunately, no basic checks will tell you when algorithms are not predicting the ideal target. This is why label choice bias often goes undetected. In our example above, where cost was being used as a proxy for health needs, basic checks would have shown that the algorithm was working well, for the narrow task it was asked to do: predicting cost, which it did accurately for Black and White patients alike. That was the problem — it predicted a biased target very well. Had we been falsely reassured by this fact, we would have missed large-scale label choice bias. The only way to reveal label choice bias is for a human to articulate the ideal target, and hold the algorithm accountable for that.

Can biased algorithms be fixed?

Defining an algorithm’s ideal target is at the core of our definition of bias. It can also be a blueprint for improving biased algorithms: once we know what the algorithm should be doing, we know how to retrain the algorithm to do better. If the cause is non-representative training data or failure to generalize, the algorithm can be improved with better data. If the cause is label choice bias, the algorithm can be retrained to predict a variable closer to its ideal target. In our work, we have learned that the retrained algorithms are far fairer: they get resources to those who need them, not those who are already well-represented in data. But they also just work better, for everyone: they better match the purpose they were actually designed for.

Is this playbook specific to health care?

Health care is a good ‘model system’ to study algorithmic bias: algorithms operate at a massive scale and can be studied on the servers of a diverse set of organizations. For this reason, our examples come from health care, but the lessons we’ve learned are very general. We have applied them in follow-on work in financial technology, criminal justice, and a range of other fields.^6 We’ve found that label choice bias in particular is common in these settings too: for example, finance datasets don’t have a variable called ‘creditworthiness,’ but they do have ‘income’; criminal justice datasets don’t have a variable called ‘criminality,’ but they do have ‘arrests’ and ‘convictions.’ All of these proxy variables are distorted, biased versions of the ideal target, and similar problems (and solutions) apply.

How do I get started?

Our framework is simple and practical, and involves four steps:

● STEP 1: INVENTORY: List all the algorithms being used or developed in your organization.
● STEP 2: SCREEN: Screen each algorithm for bias, relative to its ideal target.
● STEP 3: RETRAIN: Improve or suspend the use of biased algorithms.
● STEP 4: PREVENT: Set up structures to prevent future bias.

(^6) For criminal justice applications, see also Kristian Lum and William Isaac. "To predict and serve?." Significance 13, no. 5 (2016): 14-19.

ALGORITHMIC BIAS CHEAT SHEET

How to use this checklist: This outline of our framework is intended to help you navigate this document and guide your approach to the algorithms in your own institution. You can find detailed instructions, research sources, and case studies by following the links to the appropriate sections.

Step 1: Inventory Algorithms

Step 1A: Talk to relevant stakeholders about how and when algorithms are used: Create a list of algorithms within your organization; consider broad definitions of algorithms and ask open-ended questions.

Step 1B: Designate a ‘steward’ to maintain and update the inventory: Choose a person to be responsible for keeping the inventory current, in consultation with a diverse group.

Step 2: Screen for Bias

Step 2A: Articulate the ideal target (what the algorithm should be predicting) vs. the actual target (what it is actually predicting): Consider whether there is a mismatch that can cause bias.

Step 2B: Analyze and interrogate bias: Choose comparison groups (e.g. race), and perform some basic checks of how well the algorithm predicts its actual target. Then, investigate how label choice might create bias in how well the algorithm predicts its ideal target.

Step 3: Retrain Biased Algorithms (or Throw Them Out)

Step 3A: Try retraining the model on a label closer to the ideal target: Assess possible mitigations to label choice bias by comparing results between different labels.

Step 3B: Consider alternative options (if necessary): If you are unable to improve or retrain the algorithm, consider other possible solutions. If data is the problem — a non-representative dataset, or no variables that match the ideal target — consider collecting new data.

Step 3C: Consider suspending or discontinuing use of the algorithm (if necessary): If you are unable to improve the algorithm and/or its inputs, pause the use of the algorithm until you find a solution — or discontinue use altogether.

Step 4: Set Up Structures to Prevent Future Bias

Step 4A: Implement best practices for organizations working with algorithms: Under the aegis of the steward and a diverse team, conduct recurring audits and ensure rigorous documentation of current and future models.

Tip: Search central databases or health records for keywords that relate to algorithms

For example, algorithm outputs such as clinical risk scores may be stored in Electronic Health Records (EHR) alongside other clinical and laboratory data. In that case, string searches for variables containing “score”, “scale”, “screen”, “assess”, “index”, “tool”, “risk”, “predict”, “model”, “algorithm” may help highlight existing repositories of algorithm scores. In other contexts, algorithm outputs may be archived in existing internal databases or databases maintained by partners. Once scores from a new algorithm are identified in this search, proactive outreach to stakeholders can provide additional context for how the algorithm is used to make decisions.
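The keyword search described in this tip could be sketched as follows. This is a minimal illustration, not a prescribed tool: the function name `find_candidate_columns` and the example column names are hypothetical, and only the keyword list comes from the text above.

```python
# Keywords taken from the tip above; everything else here is illustrative.
KEYWORDS = ("score", "scale", "screen", "assess", "index",
            "tool", "risk", "predict", "model", "algorithm")


def find_candidate_columns(column_names, keywords=KEYWORDS):
    """Return the column names that contain any keyword (case-insensitive)."""
    matches = []
    for name in column_names:
        lowered = name.lower()
        if any(keyword in lowered for keyword in keywords):
            matches.append(name)
    return matches


# Hypothetical column names, invented for this example.
example_columns = ["patient_id", "ESI_Index", "hba1c_value",
                   "readmit_risk_score", "visit_date", "fall_screen_result"]
candidates = find_candidate_columns(example_columns)
print(candidates)
```

Plain substring matching will also return false positives (for instance, a column whose name merely contains "tool"), so the result is best treated as a list of leads to review with stakeholders rather than a finished inventory.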

OUTPUT OF STEP 1A: An inventory listing all algorithms your organization is currently using or developing (see example).

Step 1B: Designate a ‘steward’ to maintain and update the inventory

Someone needs to take responsibility for algorithmic oversight.^7 Developing and maintaining algorithm inventories will require active upkeep and should be overseen by a centralized person. Since algorithms impact entire organizations, the steward should have oversight on broad strategic decisions (i.e., somebody in the C-suite). While this individual will shoulder responsibility for this effort, they should not work alone, but rather in close collaboration with a diverse committee of internal and external stakeholders.

Engaging Communities to Support Bias Mitigation Efforts

Community stakeholders offer valuable input on frameworks for bias mitigation. Involving them in the creation of these structures facilitates transparency and builds trust. For example, we worked with a health plan whose existing Healthcare Ethics Program organized a committee of diverse stakeholders — providers, employers, policy makers, and, crucially, patients themselves. The committee generated a framework that people could reference when using or procuring algorithms and iterated on that framework to ensure it was functional and practical.

We provide further detail on this group’s long-term responsibilities in Step 4, but forming such a group should be a thoughtful exercise undertaken in consultation with affected communities. The team should include people from different racial groups, genders, etc. and people with diverse areas of expertise such as clinicians, data scientists, business analysts, bioethicists, social science researchers and others. We also encourage a particular focus on representation of members of groups you anticipate focusing on in analyses. For more in-depth information on how to assemble diverse teams and create a culture that promotes responsible AI, we recommend the resources provided by the Center for Equity, Gender, and Leadership at the University of California, Berkeley. Their guide on mitigating bias in artificial intelligence is a perfect complement to our work.

(^7) Stephanie Eaneff, Ziad Obermeyer, and Atul J. Butte. "The case for algorithmic stewardship for artificial intelligence and machine learning technologies." JAMA 324, no. 14 (2020): 1397-1398.

In addition to overseeing the algorithm inventory in its current form, the steward and their team may also want to consider adding past algorithms no longer in deployment, and/or ideas for future algorithms (and labeling them accordingly) to make the inventory even more comprehensive.
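One possible way to represent such an inventory, including past and proposed algorithms labeled accordingly, is sketched below. The field names, the `Status` values, and the second example entry are assumptions made for illustration; the playbook does not prescribe a schema.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    # Hypothetical labels for the lifecycle stages mentioned above.
    IN_USE = "in use"
    IN_DEVELOPMENT = "in development"
    RETIRED = "retired"
    PROPOSED = "proposed"


@dataclass
class InventoryEntry:
    name: str
    owner_team: str
    decision_informed: str
    status: Status


def active_entries(inventory):
    """Entries currently in use or in development (the scope of Step 1A)."""
    return [entry for entry in inventory
            if entry.status in (Status.IN_USE, Status.IN_DEVELOPMENT)]


inventory = [
    InventoryEntry("Care management risk engine", "Population health",
                   "Who is offered extra care management", Status.IN_USE),
    # Invented entry, shown only to illustrate labeling a past algorithm.
    InventoryEntry("Legacy readmission score", "Quality",
                   "Discharge follow-up calls", Status.RETIRED),
]
print([entry.name for entry in active_entries(inventory)])
```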

OUTPUT OF STEP 1B: A designated steward and an oversight structure for algorithms and algorithmic bias.

STEP 2: Screen for Bias

An old programming adage defines debugging as: figuring out what you told the computer to do, as opposed to what you thought you told it to do. This is how we think of Step 2: debugging algorithms. And as any programmer will tell you, debugging requires careful, meticulous work. To build intuition, we’ll work through one row of the inventory—an example of an algorithm your organization might be using.

Step 2A: Articulate the algorithm’s ideal target vs. its actual target

A good place to start is where we started: our first study on algorithmic bias in health, which we have since replicated in other settings, with other partners.^8

Imagine you work in an accountable care organization (ACO: a health system that takes responsibility for both the medical care and finances of its patients). One of the algorithms that came up in the inventory is used by the population health team. The team lead describes it as a “risk engine” that helps them better understand their patient population. Your first task in this step is to articulate the actual target, the variable that the algorithm actually predicts. You ask them directly which specific variable the algorithm is predicting. They are a bit confused by the question, and answer by saying that it predicts risk, and helps them identify patient groups that need attention. You are confused yourself, until you carefully review the algorithm developer’s promotional materials. You learn that, concretely, the model predicts a patient’s healthcare costs over the next year. This is the actual target: total one-year medical expenditures.

Checking how well the algorithm predicts its actual target is important. Below we’ll cover a few basic checks that can indicate poor performance in an underserved group. If you see this, you should suspect bias. But a key lesson from our work is that accurate prediction of the actual target doesn’t guarantee fairness: the actual target can itself encode biases. Indeed, that is the most common mechanism

(^8) Ziad Obermeyer et al., “Dissecting Racial Bias.”

wish?” Algorithms are literal genies - they give us exactly what we ask for, even if we meant something very different. That’s why it’s so important to ensure that our actual target variable matches our ideal target as closely as possible.

The subtle but pernicious discrepancy between cost and health needs is just one example of label choice bias as a broader phenomenon. The table below details just a few of the many examples we’ve found throughout our collaborations with large organizations including hospital systems, for- and non-profit insurers, state and federal agencies, software companies, and others.

Example: Screening for Label Choice Bias

Algorithm | Ideal Target | Actual Target | Risk of Bias

Care Management Prioritization: Identifying patients for additional services | Health needs, benefit from high-risk care management programs | Total costs of care | High. Less money is spent on Black patients who have the same level of need

Emergency Severity Index (ESI): emergency triage | Medical condition needing immediate attention | Nurse-rated acuity, “resources patient is expected to consume” | High. Resource consumption varies by race and insurance for any given acuity

6-Clicks Mobility Score: Decisions about discharge destination | Inability to care for self and live independently at home without help | Physical measures of mobility and daily activities | High. Similar physical mobility scores have larger impact on those lacking income

“No-show” prediction: Clinic scheduling | Voluntary no-show to appointment | Any no-show to prior appointment | High. No shows relate to access: barriers are unequally distributed

Predicting Disease Onset: Targeting preventative care | New disease onset (e.g., heart failure, kidney failure) | Provider–insurer transaction with ICD code for disease | High. Probability of being coded varies by physician quality, hospital billing, insurance, etc.

Kellgren-Lawrence Grade: Osteoarthritis on knee x-rays | Severity of knee osteoarthritis | Severity of osteoarthritis seen by radiologist on knee x-rays | High. Radiologists miss causes of knee pain affecting underserved groups

Table 1^12

(^12) Sendhil Mullainathan and Ziad Obermeyer. “On the Inequality of Predicting A While Hoping for B.” AER Papers and Proceedings 111:37-42.

In the course of screening for label choice bias, organizations should fill out a table much like this, using the last column to determine the extent to which the discrepancy between the ideal and actual target is likely to create bias for underserved groups. In our running example, which is presented in the first row, we have known for decades that health spending – conditional on need – is lower for Black patients than for White patients.^13 This means there is high risk for bias in an algorithm that predicts cost when the ideal target is need – and indicates that we should prioritize this algorithm in step 2B.

How Label Choice Bias Relates to Discrimination Law^14

The Supreme Court’s 1977 decision in Dothard v. Rawlinson ruled against a prison system’s minimum height and weight requirement for hiring. The prison was using these characteristics as proxies for strength, which was required for the job. But because they used proxies—not actual strength tests—the Court ruled they were discriminating against female applicants.

OUTPUT OF STEP 2A: A 4-column table detailing algorithm name, ideal target, actual target, and hypothesized risk of bias
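As a rough sketch, the four-column screening table could be kept in code and sorted so that the highest-risk algorithms are taken into Step 2B first. The class and function names are hypothetical, the "high/medium/low" scale is an assumption (Table 1 only shows "High"), and the second row is invented for contrast.

```python
from dataclasses import dataclass

# Assumed ordering of risk levels; not defined in the playbook.
RISK_ORDER = {"high": 0, "medium": 1, "low": 2}


@dataclass
class ScreeningRow:
    algorithm: str
    ideal_target: str
    actual_target: str
    risk_of_bias: str  # "high", "medium", or "low"


def prioritize(rows):
    """Sort rows so the highest hypothesized risk of bias comes first."""
    return sorted(rows, key=lambda row: RISK_ORDER[row.risk_of_bias])


rows = [
    # Invented row where ideal and actual targets coincide.
    ScreeningRow("Bed occupancy forecast", "Occupied beds tomorrow",
                 "Occupied beds tomorrow", "low"),
    # First row of Table 1.
    ScreeningRow("Care management prioritization", "Health needs",
                 "Total costs of care", "high"),
]
for row in prioritize(rows):
    print(row.risk_of_bias, "-", row.algorithm)
```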

Step 2B: Analyze and interrogate bias

After getting the lay of the land, you’re ready to choose a high priority algorithm for further study.

Choosing Populations of Interest

The first thing you’ll want to do is choose comparison groups. You might have specific interests in some group comparisons going into the analysis — for example, comparing patients by geography in an area where rural patients face barriers to access, or by language spoken in places where non-English speakers may be underserved. Of course, you should also be aware of protected classes designated by the law such as race, color, religion, national origin, sex, and disability. Additionally, you should consider examining implications for multiple groups that are overlapping or intersectional.^15 Think creatively about the groups within the population you serve that may be subject to bias. Speak to a diverse group of stakeholders to understand their hypotheses of bias and to inform your choices of comparison groups.

(^13) José J. Escarce and Frank W. Puffer. “Black-white differences in the use of medical care by the elderly: a contemporary analysis,” in Racial and Ethnic Differences in the Health of Older Americans, eds. Linda G. Martin and Beth J. Soldo. (Washington, DC: National Academy Press, 1997), pp. 183-209.
(^14) Robert P. Bartlett et al. “Algorithmic Discrimination and Input Accountability under the Civil Rights Acts (August 1, 2020).” Available at SSRN: https://ssrn.com/abstract=3674665 or http://dx.doi.org/10.2139/ssrn.3674665.
(^15) Buolamwini and Gebru, “Gender shades”; James R. Foulds, et al. “An Intersectional Definition of Fairness.” In 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 1918-1921. IEEE, 2020.

biases. For (b), note that often, algorithms are trained on non-diverse datasets (because more privileged populations have more data available), but applied in very different settings. If the fraction of Black or female patients, for example, looks very different, you should be on the lookout for poor performance in underserved groups in your population.
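The comparison of group fractions between the training population and your own population might look like the sketch below. The function names and the toy counts are invented for illustration, and how large a gap should trigger concern is left as a judgment call.

```python
from collections import Counter


def group_shares(group_labels):
    """Fraction of records belonging to each group."""
    counts = Counter(group_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("group_labels must not be empty")
    return {group: count / total for group, count in counts.items()}


def representation_gaps(training_groups, deployment_groups):
    """Deployment share minus training share, per group."""
    train = group_shares(training_groups)
    deploy = group_shares(deployment_groups)
    all_groups = sorted(set(train) | set(deploy))
    return {g: deploy.get(g, 0.0) - train.get(g, 0.0) for g in all_groups}


# Toy data: 10% Black patients in training, 40% where the model is applied.
training = ["White"] * 90 + ["Black"] * 10
deployment = ["White"] * 60 + ["Black"] * 40
gaps = representation_gaps(training, deployment)
print(gaps)
```

A large positive gap for a group means the algorithm is being applied to many more members of that group than it saw in training, which is the situation the paragraph above warns about.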

Next, we’ll do a basic check on whether the algorithm is performing well in underserved groups. This is referred to as ‘calibration’ in the literature: at a given algorithm score, do patients have the same level of the actual target across underserved groups?
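A minimal sketch of this calibration check follows, assuming scores have been scaled to the range 0 to 1 and binned into equal-width bins. The function name, the binning choice, and the toy data are assumptions, not the authors' method.

```python
from collections import defaultdict


def calibration_by_group(scores, actual_targets, groups, n_bins=10):
    """Mean actual target for each (score bin, group) pair.

    Assumes every score lies in [0, 1]; bins are equal-width.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for score, target, group in zip(scores, actual_targets, groups):
        bin_index = min(int(score * n_bins), n_bins - 1)
        sums[(bin_index, group)] += target
        counts[(bin_index, group)] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Toy data: annual cost (the actual target) for two groups, A and B.
scores = [0.15, 0.15, 0.85, 0.85]
costs = [1000, 1100, 9000, 9100]
groups = ["A", "B", "A", "B"]
table = calibration_by_group(scores, costs, groups)
for (bin_index, group), mean_cost in sorted(table.items()):
    print(bin_index, group, mean_cost)
```

If the means for the two groups are close within each score bin, the algorithm is roughly calibrated for its actual target across those groups. With real data, each bin also needs enough patients per group for the comparison to be meaningful.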

Comparing the Actual Target for Groups of Interest

Fig. 1

Notice the performance is good, but don’t be reassured. You’ve ruled out one basic source of bias, namely poor performance for predicting the actual target in an underserved group. But remember, good performance in an underserved group doesn’t guarantee fairness. You have not ruled out the most common source of bias in the algorithms we’ve studied: label choice bias. To do this, you will need to articulate the ideal target, rather than taking the actual target at face value, and hold the algorithm accountable for predicting that.

Articulating and measuring performance for predicting the ideal target

In the example we walked through in Step 2A, an algorithm is used to prioritize the patients who have the greatest health needs, to get them extra help. In this step, we’ll use that example to study how the algorithm (trained to predict the actual target of cost) predicts that ideal target (patient healthcare needs).

Measuring the ideal target. So far we’ve talked about the ideal target in an abstract way. But now we’ll need to get concrete about measuring the ideal target. What do we mean by ‘health needs,’ exactly? Health is by nature multidimensional and complex — and yet, to quantify bias, we need to measure it precisely, in one or more variables in our dataset. How did we handle this task? We first created an overall measure of health status: the number of active chronic conditions (or “comorbidity score,” a metric used extensively in medical research), to provide a comprehensive view of a patient’s health.^18 The figure below plots this relationship for the cost-prediction algorithm we have been using as an example:

Difference between Ideal Target and Actual Target by Race

Fig. 2^19
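The comparison behind a plot like this can be sketched by measuring, within each score bin, how much the ideal-target measure (here, the number of active chronic conditions) differs between two groups. This is an illustrative sketch with invented data and hypothetical names; scores are assumed to lie between 0 and 1.

```python
from collections import defaultdict
from statistics import mean


def ideal_target_gap(scores, ideal_targets, groups, group_a, group_b, n_bins=10):
    """Per score bin: mean ideal target of group_a minus that of group_b.

    Bins where either group has no patients are skipped.
    """
    by_bin = defaultdict(lambda: defaultdict(list))
    for score, ideal, group in zip(scores, ideal_targets, groups):
        bin_index = min(int(score * n_bins), n_bins - 1)
        by_bin[bin_index][group].append(ideal)
    gaps = {}
    for bin_index in sorted(by_bin):
        values = by_bin[bin_index]
        if values[group_a] and values[group_b]:
            gaps[bin_index] = mean(values[group_a]) - mean(values[group_b])
    return gaps


# Toy data: four patients with the same risk score but different health.
scores = [0.95, 0.95, 0.95, 0.95]
chronic_conditions = [5, 6, 4, 4]
groups = ["Black", "Black", "White", "White"]
gaps = ideal_target_gap(scores, chronic_conditions, groups, "Black", "White")
print(gaps)
```

A consistently positive gap means that, at the same algorithm score, patients in the first group are sicker on the ideal-target measure, which is the signature of label choice bias in the running example.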

(^18) Vincent de Groot, et al. "How to measure comorbidity: a critical review of available methods." Journal of Clinical Epidemiology 56, no. 3 (2003): 221-229.
(^19) Confidence intervals for Fig. 2 are present, but narrow, and may be difficult to see when viewed at less than full size.

Measuring the ideal target: another, more nuanced example. A key principle we’ve learned is that the algorithm’s ideal target depends on the decision the algorithm informs. Let’s walk through another example that is subtly, but importantly, different from our running example above.

With another one of our partners, a large academic medical center, we worked to study potential bias in triaging patients in their Emergency Department (ED). Triage is a critical part of emergency care: the aim (and the ideal target for any triage algorithm) is to prioritize high-acuity patients for a rapid initial assessment by the medical team. Nearly every ED in the country triages patients using the Emergency Severity Index (ESI), a rule-based algorithm that incorporates (i) a nurse’s judgment of acuity, and (ii) a prediction of how many resources a patient is likely to consume in the ED. Our partner worried that resource utilization in particular, a large component of the actual target, might be a biased proxy for high-acuity conditions, the ideal target, because resource consumption varies by many factors, including race and insurance.

The decision the algorithm informs is which patients get a rapid initial assessment. Notice that this initial assessment is relatively ‘cheap’: the medical team can always decide that a patient does not need immediate care, and prioritize another patient instead. Because of that, we care much more about making sure patients with critical conditions don’t wait than we care whether patients without a critical condition have a (negative) rapid assessment. In other words, we care more about reducing false negatives than we do about reducing false positives. Of course, accurate prediction of both positives and negatives is always an important goal (‘calibration’). But the decision context of the algorithm implies that we should be particularly attuned to how often the algorithm misses critical conditions (this is related to metrics of ‘recall,’ or ‘sensitivity’). This is very different from the algorithm above: in that example, prioritizing a patient who doesn’t need extra help takes a slot away from another patient who does need it. Extra help is expensive and there are limited slots in the program. So in that setting, we care most about accurate prediction (‘calibration’), pure and simple.

How did we measure the ideal target in this case? We convened a group of emergency physicians and experienced nurses to generate a list of the high-acuity conditions they wouldn't want to miss. We then worked with a team of physicians and data scientists to translate that list into a set of diagnoses, laboratory studies, and outcomes that we could measure in the electronic health record data we had. We used that to quantify bias, and show that the existing algorithm did much better for catching critical conditions in White patients than in Black patients. Articulating the ideal target of a triage algorithm is also helping us to lay the groundwork for a better algorithm that corrects some of the problems with ESI and focuses on not missing high-acuity conditions for all patients, irrespective of biases in existing resource use.
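For a triage-style setting, the context-specific metric is how often critical conditions are caught in each group (recall, or sensitivity). A minimal sketch, with hypothetical names and invented data:

```python
def recall_by_group(flagged, has_critical_condition, groups):
    """Share of truly critical patients that were flagged, per group."""
    caught = {}
    total = {}
    for was_flagged, is_critical, group in zip(flagged, has_critical_condition, groups):
        if not is_critical:
            continue  # recall only considers patients who had a critical condition
        total[group] = total.get(group, 0) + 1
        caught[group] = caught.get(group, 0) + (1 if was_flagged else 0)
    return {group: caught[group] / total[group] for group in total}


# Toy data: 4 critical patients per group, plus one non-critical patient.
flagged = [True, True, True, False, True, True, False, False, True]
critical = [True, True, True, True, True, True, True, True, False]
groups = ["White"] * 4 + ["Black"] * 4 + ["Black"]
recall = recall_by_group(flagged, critical, groups)
print(recall)
```

In this toy example the algorithm catches 3 of 4 critical conditions in one group and 2 of 4 in the other. A gap of that kind, measured on real data, is what the analysis above quantified.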

OUTPUT OF STEP 2B: A diagnostic chart (or set of key metrics) that illustrates bias in the context of what matters in your specific situation

STEP 3: Retrain Biased Algorithms (or Throw Them Out)

This section is structured as a series of prioritized actions, which we present in the order we typically do them. Some solutions are better than others, so it makes sense to try those first before other options.

Step 3A: Try re-training the model on a label closer to the ideal target

If you made it through Step 2, you’ve shown bias in an algorithm by comparing its predictions to an ideal target. Now it’s time to do something about it. The good news is that much of the hard work is behind you: to fix the biased algorithm, the first thing we try is to retrain it on the same label(s) you used to show bias to begin with — those that match the ideal target.

For example, in the case of the cost prediction algorithm, we found the existing label of cost was biased, by showing that Black patients had far more chronic conditions, higher blood pressure, etc. All of those variables were in our dataset — that’s how we were able to show the bias — so mitigating the bias could leverage those same variables. We retrained a new candidate model using active chronic conditions as the label, while leaving the rest of the pipeline intact. This simple change doubled the fraction of Black patients in the high-priority group: from 14% to 27%.^21 That said, there are many choices for the alternative label. We could also consider ‘avoidable cost’ if it was closer to the ideal target for your particular decision and use case. When we did this, it increased the fraction of Black patients identified for the program to 21%. We could train an algorithm to predict a high hemoglobin A1c for diabetes, among those in whom the lab was checked. There are many options, and the best one will depend on your particular circumstances, but the bottom line is that there are often many variables in your datasets that are reasonable proxies for the ideal target.

Before and after making a change, you will want to retrace your analysis to estimate the effect of any given mitigation. Begin by generating a new version of the calibration plot, replacing the old scores on the x-axis with the new model’s prediction scores. To complement that plot, you can also look at the change in the percentage of patients from a given group (e.g., Black patients, non-English speakers, etc.) in the high-priority group. If your situation is similar to the triage example, looking at your context-specific, prioritized performance metric (e.g. recall) before and after the change may also be useful.
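The before-and-after comparison of group shares can be sketched as follows. The function name, the top-fraction cutoff, and all data are assumptions for illustration; the scores stand in for predictions from the original and the retrained model, and no model training is shown.

```python
def group_share_in_top(scores, groups, group, top_fraction):
    """Share of `group` among the patients with the highest scores."""
    n_top = max(1, round(len(scores) * top_fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = ranked[:n_top]
    return sum(1 for i in top if groups[i] == group) / n_top


# Toy data for ten patients; scores are invented.
groups = ["Black", "White", "White", "Black", "White",
          "White", "White", "Black", "White", "White"]
old_scores = [0.20, 0.90, 0.80, 0.70, 0.60, 0.30, 0.10, 0.40, 0.50, 0.05]
new_scores = [0.85, 0.90, 0.30, 0.80, 0.60, 0.20, 0.10, 0.40, 0.50, 0.05]

before = group_share_in_top(old_scores, groups, "Black", top_fraction=0.4)
after = group_share_in_top(new_scores, groups, "Black", top_fraction=0.4)
print(before, after)
```

The cutoff should match how the program actually selects patients (for example, a fixed number of slots), since the share can change substantially with the threshold.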

OUTPUT OF STEP 3A: Analysis comparing the level of bias before and after a change OR an assessment that changing the label is infeasible (if the latter, proceed to Step 3B)

(^21) Ziad Obermeyer et al. “Dissecting racial bias.”

In the next step, we will discuss the organizational structure needed to ensure that newly procured algorithms meet the preventative standards you set.

OUTPUT OF STEP 3C: Suspended or discontinued use of the algorithm and criteria (ideal target) for a new solution

STEP 4: Set Up Structures to Prevent Future Bias

So you’ve read through steps 1-3 above, and you’re excited to get started. But in order to operationalize this framework, you need to think big picture: what kind of team do you need to support this work – both immediately and in the long term? As you prepare to audit existing algorithms, it is important to consider how, through this process, you can also create the structures necessary to prevent bias in future algorithms that you create or purchase. Ultimately, bias-prevention practices need to be customized for your organization, but we have included suggestions below based on our experience with a diverse set of partners to help you get started.

Step 4A: Implement best practices for organizations working with algorithms

Organizations working with algorithms should establish protocols for ongoing bias mitigation and set up a permanent team to uphold those protocols.

1. Establish protocols for ongoing bias mitigation. The following systems (at a minimum) should be in place to help your organization consistently and proactively avoid bias:

❏ A pathway for reporting algorithmic bias concerns. Outline a clear process for anyone in the organization to safely report concerns about algorithmic bias to the team without repercussions, and decide on a process for responding to these concerns.
❏ Requirements for documenting algorithms. Timnit Gebru and others make the point that “in the electronics industry, every component, no matter how simple or complex, is accompanied with a datasheet describing its operating characteristics, test results, recommended usage, and other information”.^23 We should strive to do something similar with algorithms. The team should uphold organization-wide standards for documenting the items below.^24 In order to efficiently track information about your algorithms, consider adding columns with these items directly to your inventory.

(^23) Timnit Gebru, et al. "Datasheets for datasets." arXiv preprint arXiv:1803.09010 (2018).
(^24) Cf. Beau Norgeot et al. “Minimum information about clinical artificial intelligence modeling: the MI-CLAIM checklist.” Nature Medicine 26(9), pp. 1320-1324.

  • The goal: the algorithm’s ideal target, its actual target, and a bias risk assessment
  • The training process: the data an algorithm employs, its training sample, and how the use-case sample differs from the training sample
  • Performance: the algorithm’s performance overall, for underserved groups, and for both the actual and ideal targets
❏ A written plan for regular inventory updates and audits. Decide on a cadence and/or a set of cues to trigger audits. Since many aspects can change after a model has been deployed, audits should be conducted on a routine basis. Algorithm performance can change when the underlying data change (for example, when a hospital starts using new medical imaging technology) or when algorithms are used in new locations and/or in different contexts (for example, pediatric vs. geriatric or inpatient vs. outpatient populations).

2. Assign a permanent team to oversee ongoing bias mitigation efforts. In Step 1B, you designated a steward and diverse group to oversee the algorithmic bias audit efforts you’ve taken so far. Now it’s time to be sure that key responsibilities have been assigned on a permanent basis to sustain the systems and protocols you’ve developed. The exact structure of the team will vary by organization, but their collective tasks should include those listed below at a minimum. Ask: is someone responsible for each of the tasks listed below?
❏ Address feedback: Lead the response when a member of your organization identifies a concern related to algorithmic bias.
❏ Check documentation: Hold members of your organization accountable for documenting all decision-making when creating new models.
❏ Maintain the inventory: Update the list of algorithms frequently (exactly how often this is necessary will depend on the speed at which your organization develops algorithms).
❏ Instigate audits: Determine when algorithmic bias audits are necessary, and oversee all audits.

3. Consider working with a third party to ensure accountability and ongoing guidance. For some organizations, it is helpful to involve a third party that can oversee or conduct audits. This approach holds organizations accountable and offloads some of the work from your internal team.

4. Stay on top of changes in the field. Keep in mind that this field is developing rapidly, and regulators, quality agencies, and accreditors are increasingly releasing explicit guidelines on the topic (in fact, we are working with several of them), so be sure to look out for future communication on recommendations.
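The audit plan in item 1 (a cadence plus cues that trigger an audit) could be encoded along the lines below. The record fields, the function name, and the 365-day default are assumptions chosen for illustration; the playbook leaves the cadence to each organization.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class AuditRecord:
    algorithm: str
    last_audit: date
    data_changed: bool = False     # e.g. new medical imaging technology
    context_changed: bool = False  # e.g. new site or patient population


def audit_due(record, today, cadence_days=365):
    """True if the routine cadence has elapsed or a trigger cue is set."""
    overdue = today - record.last_audit >= timedelta(days=cadence_days)
    return overdue or record.data_changed or record.context_changed


# Hypothetical records for two algorithms.
routine = AuditRecord("Care management risk engine", date(2021, 1, 1))
moved = AuditRecord("Triage score", date(2021, 1, 1), context_changed=True)

print(audit_due(routine, today=date(2021, 6, 1)))
print(audit_due(moved, today=date(2021, 6, 1)))
print(audit_due(routine, today=date(2022, 1, 1)))
```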

OUTPUT OF STEP 4: Protocols for ongoing bias mitigation and a permanent team responsible for this work