Training Policymakers in Econometrics
By Sultan Mehmood, Shaheen Naseer and Daniel L. Chen^1
August 2021
The credibility revolution triggered a paradigm shift in economics. This paper examines
its causal effects on deputy ministers in a “mastering metrics” training program. We separated
the demand for econometrics training from its impact with a simplified Becker-DeGroot-Marschak
mechanism. Policymakers could choose a high or low probability for randomly
receiving a popular econometrics or a self-help placebo book. After receiving the book,
policymakers participated in an intense training workshop that included watching lecture
videos made by the authors of the book, summarizing each chapter, discussing, presenting, and
applying the book’s concepts in their policymaking. Three results emerge. First, we document
large persistent effects. After six months, treated individuals' ratings on the importance of
quantitative analysis increase by 50%. Treated individuals' performance in national research
methods and public policy exams improves by 0.5-0.8 sigma. Text analysis of their writings
reflects an increase in the perceived importance of causal inference. Second, treated individuals’
willingness-to-pay for commissioning Randomized Control Trials using public funding
increases by 300% and decreases by 50% for correlational studies. Third, treated ministers are
twice as likely to choose a policy for which there is RCT evidence. We use click behavior as a
behavioral proxy of IV defiers. Few defiers are observed, and they are less affected by
treatment. Overall, we provide experimental evidence that training policymakers in the school
of thought associated with the credibility revolution increases demand and responsiveness to
causal evidence. (JEL D72, D78, O17, O18)
Keywords: randomized evaluations, policy, credibility revolution, paradigm shifts.
(^1) New Economic School (E-mail: [email protected]). Lahore School of Economics (E-mail: [email protected]) and Toulouse School of Economics (E-mail:
[email protected]). We thank the Government of Pakistan for their cooperation and access to these policymakers. Daniel L. Chen acknowledges IAST funding from the
French National Research Agency (ANR) under the Investments for the Future (Investissements d’Avenir) program, grant ANR-17-EUR-0010. This research has benefited
from financial support of the research foundation TSE-Partnership and ANITI funding. We thank Josh Angrist, Peter Hull, Ben Olken, Michael Kremer, Mathias Sutter and
Gautam Rao for their helpful comments and suggestions. Mohammad Ahmed Nasif provided excellent research assistance.


“RCTs can play an important role in the rigorous evaluation of how policies actually work in practice. Theory is often ambiguous on the effects of policy intervention. Thus, trials can help shed light on the overall effect of policy interventions.” Deputy Minister in Pakistan (after our workshops)

1. Introduction

Over the last half century, empirical economics has gone through a paradigm shift (Angrist and Pischke 2010). The credibility revolution, with its careful attention to causality, has presented itself as a new paradigm for “taking the con out of econometrics” (Leamer 1983). We study the causal effects of a paradigm shift in science (Kuhn 1962) on practitioners, namely policymakers, using training in the paradigm as the instrument. Policymakers demand and even respond to causal evidence (Hjort et al. 2021), but they are unlikely to distinguish between different types of evidence and change their policy choices in response to new evidence. A consensus seems to be emerging in the literature that policymakers are highly averse to shifting their beliefs and engage in motivated reasoning to justify their initial policy choices (Baekgaard et al. 2019; Banuri et al. 2019; Vivalt and Coville 2021; Lu and Chen 2021). Sticking to priors and being inattentive to evidence may stymie the implementation of good policies that might otherwise spur economic development (Kremer, Rao, and Schilbach 2019).

How can policymakers be made more receptive to evidence? Will training them in concepts associated with the credibility revolution make them more likely to shift their beliefs? Will it induce them to change their policy choices?

To address these questions, we conducted a randomized trial. We identified the causal effects of the credibility revolution among deputy ministers in Pakistan using an instrument: Mastering ’Metrics: The Path from Cause to Effect, a prominent summary of the credibility revolution (Angrist and Pischke 2014). The government of Pakistan considers these deputy ministers the “key wheels on which the entire engine of the state runs” (Government of Pakistan, 2019).
We studied the impact of training causal thinking in a policy decision involving deworming, a policy that shares many essential characteristics with other development policies as policymakers aim to make a decision in light of its potential consequences. In this context, we experimentally modified individuals’ causal thinking and

The first essay was to summarize every chapter of their assigned book, while the second essay involved discussing how the materials would apply to their career. The essays were graded and rated in a competitive manner. Writers of the top essays were given monetary vouchers and received peer recognition from their colleagues (via commemorative shields and a presentation and discussion of their essays in a workshop within the treatment arm). Deputy ministers in each treatment group also participated in a Zoom session to present and discuss the lessons and applications of their assigned book in a structured discussion. The last stage of our experiment was a suite of measurements of attitude, behavior, and decisions in a framed field experiment.

Because we embedded our analyses in administrative data, we had essentially no attrition. Our collaboration with the training academy gave us direct access to administrative baseline measures of ability and background. Importantly, we observed balance on pretreatment quantitative ability as measured by mathematics scores in the entry examinations of the deputy ministers, obtained from the Federal Public Service Commission (FPSC). Likewise, we observed balance on demographics and on pretreatment writing and interview assessments. Finally, we obtained data on the ministers’ regular policy assessments from the training academy. These were conducted 4–6 months following our workshop and scored deputy ministers in national research methods and policy assessments.

Our first main finding is that training causal thinking shifts policy attitudes. We conducted a survey of policy attitudes on the importance of causal inference several months after treatment assignment and performed a textual analysis of the high-stakes writing assignment. We find substantial effects.
While attitudes on the importance of qualitative evidence are unaffected, treated individuals' ratings of the importance of quantitative evidence in making policy decisions increase by 35% after reading the book and completing the writing assignment, and by 50% after attending the lecture, presenting, discussing and participating in the workshop. We also find that deputy ministers randomly assigned to causal training place a higher perceived value on causal inference, quantitative data, and randomized control trials. Metrics training increases how policymakers rate the importance of quantitative evidence in policymaking by about 1 full standard deviation. In the writing assignment and demand assessment, treated deputy ministers also showed an increased desire to run a randomized evaluation before rolling out a policy. In the text of their writings, the treated policymakers discussed their understanding of concepts such as “selection bias”, “correlation is not causation” and “randomized evaluations allow for apples to apples comparisons”. When

asked what actions to undertake before rolling out a new policy, they were more likely to choose to run a randomized trial, with an effect size of 0.33 sigma after completing the book and writing assignment (partial training) and 0.44 sigma after attending the lecture, presentation, discussion and workshop (full training). We also observe substantial performance improvements in scores on national research methods and public policy assessments. These regular assessments were not specially requested by the research team, so performance improvements are unlikely to be due to experimenter demand.

Our second main result emerges from a framed field experiment designed to measure deputy ministers’ willingness-to-pay for evidence and changes in their policy decisions in response to evidence. We measured ministers’ willingness-to-pay for three sources of information: RCTs, correlational data, and expert bureaucrat advice. We elicited willingness-to-pay for correlational data and senior bureaucrats’ advice because these two alternative sources of information are the status quo that deputy ministers use to inform their policy decisions. We observed that treated deputy ministers were much more willing to spend out of pocket (50% more) and from public funds (300% more) for RCTs and less willing to pay for correlational data (50% less). Demand for senior bureaucrats’ advice is unaffected. This indicates that econometrics training increased demand for causal evidence among deputy ministers involved in high-stakes policymaking.

In this framed field experiment, we also studied the impact of causal thinking in a policy decision involving deworming. First, we elicited initial beliefs about the efficacy of deworming on long-run labor market outcomes. Then, the deputy ministers were asked to choose between implementing a deworming policy and a policy to build computer labs in schools.
This scenario was particularly realistic, as these policy choices were an actual decision facing ministers during the timeframe of our study. Next, we provided a signal: a summary of a recently published randomized evaluation on the long-run impacts of deworming (Kremer et al. 2021). After this signal, we asked the same deputy ministers about their post-signal beliefs and asked them to make the policy choice again. From this experiment, we observe that only those assigned to receive training in causal thinking showed a shift in their beliefs about the efficacy of deworming: the treated ministers became more likely to choose deworming as a policy after receiving the RCT evidence signal. The magnitudes are substantial: trained deputy ministers doubled their likelihood of choosing deworming, from 40% to 80%. Notably, this shift occurs only for those ministers who previously believed the impacts of deworming were lower than the effects

pretreatment mathematics assessment scores. Together, they indicate that small or idiosyncratic samples assigned to treatment or control are unlikely to explain our results. In addition, we observed variation in the data that is inconsistent with experimenter demand, since not everyone in the treatment group responded positively to information: only those individuals whose priors are below the signal value of a 13% impact change their policy choices. Experimenter demand is also not reflected in the assessment of teamwork or in attitudes on other sources of evidence, since our treatment had no effect on any of these outcomes. Finally, we also bound experimenter demand using a methodology proposed by De Quidt et al. (2018). We suggested that the subjects choose the ICT policy, so any shift in the direction of the deworming policy would be the opposite of experimenter demand.

The administrative data also included a suite of behavioral data in the field, for example, a choice of field visits to orphanages and volunteering in low-income schools. This allowed us to assess potential crowd-out of prosociality, an oft-raised concern about the teaching of neoclassical economics (Selten and Ockenfels 1998; Frank et al. 1993; Rubinstein 2006; Ifcher 2018). We detected no evidence of econometrics training crowding out prosocial behavior: orphanage field visits, volunteering in low-income schools and language associated with compassion, kindness and social cohesion are not significantly impacted. Scores on teamwork assessments as a proxy of soft skills were also unaffected (Deming and Weidmann, 2021).

Our paper contributes to three key literatures. First, our study pivots the literature on how and why paradigm shifts occur in science (Kuhn 1962) to study their consequences. We studied one of the most prominent schools of thought in empirical economics: the credibility revolution (Angrist and Pischke 2010).
To the best of our knowledge, we are the first to study the causal effects of paradigm shifts using a field experiment with high-stakes decision-makers. Economists, in contrast to philosophers, historians, and sociologists (Kuhn 1962; Shapin 1982; Merton 1973; Foucault 1970), have devoted little attention to paradigm shifts (see Azoulay et al. 2019 for a notable exception). We randomly assigned a book associated with the paradigm and showed its teachings to be highly transmissible via a training workshop. Mastering ’Metrics provides a concatenation of the school of thought associated with the credibility revolution and offers, in five short chapters, a set of principles for policymakers to abide by. This highlights how sparse thinking and parsimony may be important for influencing human thinking (Gabaix 2014).

Second, our study on econometrics literacy adds to the expansive literature on economics and financial literacy (Lusardi and Mitchell 2014). Recent work attributes up to 40% of inequality in end-of-life wealth to financial literacy through the mediating channel of financial decision-making (Lusardi, Michaud, and Mitchell 2017). Economics training also impacts high-stakes decisions of policymakers and explains up to 30% of the recent shift towards economic conservatism in the American judiciary (Ash, Chen, and Naidu 2021). Our study is closest to an RCT of eight hours of financial literacy training that impacts the economic preferences of adolescents (Sutter, Weyland, Untertrifaller, and Froitzheim 2020) and an RCT that included two hours of financial literacy training that impacted those who had low levels of financial literacy (Cole, Sampson, and Zia 2011). We, however, study the impact of econometric literacy training in causal thinking on the attitudes and behavior of adults who make policy decisions.

Third, we contribute to the new and vibrant literature on the behavioral economics of development and growth (Kremer, Rao, and Schilbach 2019). We show that a key factor in demand for and responsiveness to rigorous evidence on the effects of policies (Hjort et al., 2021) is an understanding and appreciation of causal evidence. This, in turn, may promote the implementation of good policies that might otherwise have high rates of return for economic growth. By shaping deputy ministers’ causal thinking with a scalable basic econometrics training and measuring its consequences, we show the key role that developing causal thinking plays when evaluating evidence. In our experiment, policymakers without training in causal inference were unresponsive to causal evidence. While many training studies focus on lay populations, we examine high-stakes decision-makers like central bankers (Malmendier et al.,

  1. and judges (Chen et al., 2016). We trained deputy ministers’ causal thinking and estimated the impact on attitudes and subsequent demand for evidence and policy choices.

The rest of the paper is organized as follows. Section II provides the background and details on the experimental set-up. Section III describes the data and empirical specification, while Section IV presents the main results. Section V conducts a heterogeneity analysis. Section VI discusses a series of sensitivity tests. A final section concludes.

ideal, regressions as comparison of means, instrumental variables, difference-in-differences and regression discontinuity designs with particular focus on public policy applications. The book is written for undergraduates and is particularly appropriate for our policymakers since all of them hold at least a bachelor's degree. The second book is a popular self-help book emphasizing “personal transformation” and serves as our placebo.^3

November 2020: Assignment of treatment. On 10 November, the director of the Academy sent an official email to all deputy ministers asking them to complete an assignment associated with the designated book. All the deputy ministers in the cohort sent a confirmation message that they would complete the assignment within the deadline. The mandatory nature of the workshop and the close collaboration of the director and the staff at the Academy implied we had about 90% take-up of our intervention. We randomly assigned the book through a lottery where the person who chose either of the books had a certain probability of actually being assigned that book.^4 That is, the participants were randomly assigned either the metrics or the self-help book, but conditional on their choice. They were then requested to complete two open-ended assignments related to the contents of the respective books:

“Main Task 1: After reading the assigned book, we request you provide a chapter-by-chapter summary of the whole book of around 1500 words (+/- 100 words). Main Task 2: After reading the assigned book, we request you provide an analysis of how you would apply the lessons learned from the book in your job. This again should be around 1500 words (+/- 100 words).”

All assignments were submitted by 10 December 2020 (the set deadline). The full detailed transcript of the message by the director detailing their assignment tasks can be found in Table A1 of Appendix A.

(^3) For the table of contents of both books, see Figure B1 of Appendix B.
(^4) Specifically, a person choosing the metrics book had a 60% probability of being randomly assigned the metrics training, while a person choosing the placebo book had an 85% probability of being randomly assigned the placebo training. Shipment of the books caused these probabilities to differ.
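The lottery in footnote 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the seed and the simulated cohort are hypothetical, and only the two probabilities (60% and 85%) come from the text.

```python
import random

# Probability of receiving the book one chose (figures from footnote 4).
P_RECEIVE_CHOSEN = {"metrics": 0.60, "placebo": 0.85}


def assign_book(choice, rng):
    """Randomly assign a book conditional on the participant's stated choice."""
    other = "placebo" if choice == "metrics" else "metrics"
    return choice if rng.random() < P_RECEIVE_CHOSEN[choice] else other


def run_lottery(choices, seed=0):
    """Return (choice, assignment) pairs for a list of stated choices."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return [(c, assign_book(c, rng)) for c in choices]


# Hypothetical cohort: 10,000 simulated participants per stated choice.
draws = run_lottery(["metrics"] * 10_000 + ["placebo"] * 10_000)
share_metrics = sum(a == "metrics" for c, a in draws if c == "metrics") / 10_000
share_placebo = sum(a == "placebo" for c, a in draws if c == "placebo") / 10_000
print(round(share_metrics, 2), round(share_placebo, 2))
```

Because assignment is random only conditional on the stated choice, any comparison of assigned groups needs to hold the choice fixed, which is why the choice indicator enters the analysis as a randomization stratum.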

March 2021: Attitude Survey, lecture, presentation, discussion and workshop. On 10 March 2021, in collaboration with the Academy, we organized two Zoom sessions, one for the randomly assigned metrics group and the other for the placebo self-help group. First, we ran an ‘endline’ survey before the lectures and discussion, in which we elicited participants’ attitudes towards quantitative and qualitative evidence, randomized evaluations and causal inference. This gives us outcomes to assess the impact of the metrics book and writing assignment tasks 4 months following the assignment of the books (partial treatment). These writing assignments were high-stakes not only because the overall grade became part of the permanent record at the Academy and may influence the ministers’ future career trajectories, but also because we distributed commemorative shields, often accompanied by peer recognition, and monetary gift vouchers to the top 6 performers.

After conducting the endline survey on attitudes, we announced the first three positions for both groups and distributed the commemorative shields and gift vouchers for a luxury department store. The 1st position received a monetary voucher of USD 150, the 2nd position received a USD 100 voucher, and the 3rd position received a USD 80 voucher. The placebo group also received the vouchers, and hence we had 6 winners. These winners also gave a 30-minute presentation summarizing key lessons of the respective books and how the training will inform their policymaking. This was followed by 30-minute video lectures delivered by the authors of the books to the respective randomly assigned groups. The group assigned Mastering ’Metrics attended the video lecture by Joshua Angrist and the group assigned Mindsight attended the video lecture by Daniel Siegel. A structured discussion of 30 minutes for both arms followed. In particular, we asked participants the following questions: (a) What do you think is the main point of the lecture?
(b) How can you apply the concepts learned in this lecture to your job? This part of our engagement with the ministers concluded by asking the same questions on attitudes towards quantitative and qualitative evidence, randomized evaluations and causal inference. This allowed us to assess the short-run impact of our complete metrics training, i.e., essays summarizing the book, essays applying the lessons to policy, attending the video lecture, receiving commemorative shields and gift vouchers, and presentation and discussion of key lessons learned. Table B2 in Appendix B presents screenshots of the commemorative shields and gift vouchers distributed to the deputy ministers.

May 2021: Initial Beliefs, Post-Signal Beliefs, Willingness-to-Pay and Project Choice. On 16 May 2021, about 6 months following the book assignments, we elicited policymakers’ initial beliefs on policy impact and policy choices. Specifically, we elicited their

A. For the whole set-up and summary of the complete experimental design, see Figure B2 in Appendix B.

COVID-19 and Consequences for Our Design. At the Academy, the officers typically reside on site in Lahore for the entire period. The cohort we studied, however, was instructed to remain in their home cities due to the COVID-19 pandemic. The training, therefore, took place online. The Academy has strict training protocols that do not allow for random assignment by experimenters on this “elite group” of public officials. However, these procedures applied only to on-site training, so the unique circumstances arising from the COVID-19 pandemic provided us an unusual opportunity to randomly assign training at the individual level. The combination of the Academy’s express instructions that the participants may not share or discuss our workshop material with their peers, the geographical dispersion of the ministers due to the pandemic at the time of the training, and the non-shareability of the link likely reduced treatment contamination. It should be noted, moreover, that any treatment contamination would only mean that our estimates understate the true effects.

III. Data and Empirical Specification

Data. The data were collected from about 200 deputy ministers entering service in a single year. The entry year is anonymized to protect their identity. The close collaboration with the Academy implied we had about 90% take-up of our intervention. The administrative data on individual policymakers' characteristics were obtained from the administrative records of the Academy. We used these in our balance check over demographics and as control variables in our regressions. The outcomes of field visits to orphanages, volunteering at low-income schools, teamwork, and national public policy and research methods assessments were also obtained from the Academy.
The pretreatment mathematics, written and interview assessment scores of the ministers were obtained from Pakistan’s Federal Public Service Commission (FPSC) that administers the entry examinations for these elite policymakers.^6 Data on WTP, attitudes and beliefs were collected by our research team under the auspices of the Federal Government of Pakistan.

(^6) The FPSC is a statutory body of the Government of Pakistan, constituted at the time of independence in 1947. It obtains its jurisdiction from the Constitution of Pakistan and its responsibilities include recruiting elite policy advisors and administering their entry examinations and assessments.

Empirical Specification. The impact of metrics training can be evaluated in a simple regression framework. For each individual-level outcome, the estimation equation is:

$$Y_i = \alpha + \beta\,\text{Metrics Assigned}_i + \mathbf{X}_i \mu + \epsilon_i \qquad (1)$$

where $Y_i$ is the respective outcome for policymaker $i$; this includes attitudes, assessment scores, WTP and policy choices. $\text{Metrics Assigned}_i$ is a dummy equal to one if the policymaker is randomly assigned to metrics training. $\mathbf{X}_i$ is a vector of individual-level controls, which includes written test scores, interview test scores, gender, birth in political capitals, asset ownership, income before joining service, age, prior education, foreign visits and occupational designation dummies. Importantly, the list of explanatory variables also includes our randomization stratum, Metrics Chosen. This is a dummy variable equal to one if the policymaker chooses the metrics book. We cluster standard errors at the individual level since that is our level of randomization. $\beta$ is our main coefficient of interest and estimates the causal effect of metrics training conditional on the policymaker's book choice.

Balance and Attrition. Table 1 reports the results of the balance check on those randomly assigned to metrics treatment. Differences between the treatment and placebo groups are small in magnitude and statistically insignificant, suggesting that the randomization was effective at creating balance. Salient to note are the policymakers’ pretreatment written, interview and mathematics assessments (Table 1, Columns 9, 10 and 11). Since the policymakers obtained these scores before the metrics training, the similarity of test scores across written and interview assessments suggests that those assigned the metrics training are likely balanced in their academic and interpersonal ability. Most important to note is the balance on pretreatment scores on the mathematics assessment. This indicates our sample is also balanced in quantitative ability.
The close collaboration with the training Academy and the director resulted in take-up of our intervention of about 90%. There is, however, a possibility of differential attrition with respect to our treatment. This is unlikely, because in Table B3 of Appendix B we find that there is essentially no effect of metrics training on attrition.
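Equation (1) can be illustrated with a small, self-contained least-squares sketch. Everything below is an assumption made for illustration: the data are made up and noise-free, the only control is the Metrics Chosen stratum, and the solver is a plain normal-equations routine rather than the software the authors used. The full control vector and the standard errors are omitted.

```python
def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                f = A[r][col]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


# Hypothetical participants: (metrics_assigned, metrics_chosen).
sample = [(1, 1), (0, 1), (1, 0), (0, 0), (1, 1), (0, 0)]
# Made-up outcome with alpha = 1.0, beta = 0.5 and a stratum effect of 0.3.
y = [1.0 + 0.5 * a + 0.3 * c for a, c in sample]
X = [[1.0, float(a), float(c)] for a, c in sample]

alpha_hat, beta_hat, stratum_hat = ols(X, y)
print(alpha_hat, beta_hat, stratum_hat)
```

Here `beta_hat` plays the role of $\beta$ in equation (1): the difference in outcomes between assigned and non-assigned participants, holding the stated book choice fixed.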

budget constraints rendering randomized evaluations unfeasible). For a raw comparison of means across treatment and placebo groups, see Figure B3 in Appendix B. In Table 2, we also report results of metrics training on attitudes about the importance of RCTs in policymaking with partial and full training. The dependent variable is constructed based on the following scenario: “ You are in charge of assigning people to a public policy program and before rolling it out, you want to learn if the policy is effective, what you would do? ” One of the options is to “Run a randomized control trial”, while other options are unrelated or inconsistent with the main message of the book, such as “Compare two groups of people who had previously benefited most from the policy with those that did not?” and “Survey if there is demand for the policy”. The dependent variable takes the value of one if the policymaker answered “Run a randomized trial” and zero for all other options. The results of estimating equation (1) with this dependent variable are reported in Table 2 (Columns 5 and 6). We observe that the group assigned the metrics book tasks (partial training) is about 15 percentage points more likely to choose randomized evaluation before rolling out a public policy relative to the placebo group; with full training this effect increases to about 20 percentage points, or a 55% increase over the placebo mean. Taken together, these results indicate that months after the training, treated policymakers’ perceived importance of quantitative evidence and randomized evaluations increased, while we observe no effect on the importance of qualitative evidence.

Why did the policymakers demand randomized evaluations? What explains policymakers attaching greater importance to quantitative analysis and randomized evaluations? Here we present evidence that the results are likely driven by the fact that policymakers learn new knowledge about causal inference and selection issues.
In the last two columns of Table 2, we elicit beliefs on why randomized evaluations are important for policymaking. Specifically, we continue with the earlier question and ask: “Continuing with the previous example, why does the previous answer make sense?” One of the options to the above question is “Because comparisons in a RCT are apples to apples comparisons” while other options are unrelated to use of randomization to circumvent selection issues. For instance, “People's feelings are an important determinant whether the public policy will work”, “Survey

methods are known to produce causal effects”, “Comparing two groups of non-randomly selected people allows us to infer causality” are the other options. The dependent variable takes the value of one if the policymaker chooses “Because comparisons in a RCT are apples to apples comparisons” and zero for all other options. Columns 7 and 8 of Table 2 report the results on metrics training for the partial and full training. Subjects assigned metrics training are 15 percentage points more likely to answer that randomized evaluations are “Apple to apple comparisons”, suggesting that they understand that random assignment of subjects to the control and treatment groups solves the selection problem by “comparing apples to apples”. This is a likely mechanism for why our treated group has a higher perceived importance of quantitative evidence and randomized evaluations.

In Figure 1, we report all results of Table 2, standardized to mean zero and standard deviation one. This includes beliefs of policymakers on quantitative and qualitative evidence, as well as the importance of RCTs in policymaking. The coefficient estimates and confidence intervals associated with the metrics assigned variable are reported in the figure, with equation (1) estimated with all individual-level baseline controls. The group assigned metrics training sees about a 0.85–1.32 standard deviation increase in the rating assigned to quantitative evidence, a 0.33–0.44 standard deviation increase in requesting a randomized evaluation to evaluate the effectiveness of public policy, and about a 0.30 standard deviation increase in answering that randomized evaluations allow for apple to apple comparisons, relative to the placebo group. We find no effect of metrics training on beliefs about qualitative evidence, however.

These results are corroborated by textual analysis of the ministers’ high-stakes assignments. The analysis of their writings reveals that the metrics assigned group likely learned causal inference concepts.
In Figure 2, we observe that the treated group shows a large increase in the use of the following phrases: “Causal inference is important”, “Correlation is not causation”, “Quantitative Evidence” and “Observational studies are not apple to apple comparisons”. Specifically, the group assigned the metrics book is about 40 percentage points more likely to use phrases associated with “causal inference is important”, a doubling of phrase usage over the placebo mean, and 10 percentage points more likely to use phrases similar to “Observational studies are not apple to apple comparisons”, a 33% increase over the placebo mean. This suggests that metrics training shifted policymakers' beliefs towards the paradigm associated with the credibility revolution.
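The percentage-point and percent figures for phrase usage can be reconciled with simple arithmetic. The sketch below backs out the placebo mean implied by each pair of numbers; the function name is hypothetical and the implied means are back-of-the-envelope values, not figures reported in the text.

```python
def implied_baseline(pp_change, relative_change):
    """Baseline (in percent) implied by an absolute change in percentage points
    and the same change expressed as a fraction of the baseline."""
    return pp_change / relative_change

# 40 percentage points described as a doubling (a 100% increase)
causal_baseline = implied_baseline(40.0, 1.00)

# 10 percentage points described as a 33% increase
observational_baseline = implied_baseline(10.0, 0.33)
```

Under these assumptions, the implied placebo means are 40 percent for the “causal inference is important” phrases and roughly 30 percent for the “Observational studies are not apple to apple comparisons” phrases.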

public policy assessments. The treated policymakers score about 0.5σ higher in national public policy and 0.8σ higher in the research methods assessments. This strongly suggests a substantial impact of our treatment on their regular policy assessments that take place at the Academy, one that is not solicited by the research team. Scores on teamwork assessments, however, are unaffected (Table 3, Columns 5 and 6), suggesting that the metrics training did not crowd out quality of team decisions, a critical soft skill in effective policymaking (Deming and Weidmann, 2021).

C. Does Metrics Training Crowd out Prosocial Behavior?

Prosocial Behavior in the Field. — A large body of evidence documents that economics training may make individuals less prosocial. Individuals trained in neoclassical economic concepts are more likely to free ride, less likely to donate or cooperate (see, e.g., Marwell and Ames, 1981; Frey and Meier, 2003; Bauman and Rose, 2011). We present evidence that the metrics training program does not come at the expense of reduced prosociality. We measure prosocial behavior by field measures such as visits to orphanages and volunteering in impoverished schools, as well as language use in their writings. The field measures are obtained from the Academy on the policymakers’ “syndicate field trip” workshops. The policymakers undertook two field trips, one 4 months and the other 6 months following the training. In the first, they were provided a choice to either visit a prominent orphanage (Dar-ul-Aman) or attend lectures on a specific government program from a senior bureaucrat. In the second syndicate field trip, the policymakers are asked to choose between volunteering to teach in any impoverished government school that falls under the government’s Progressive Education Network (PEN) or once again choosing to attend a lecture on government programs from a senior public official. Figure 2 presents these results.
From the top of Figure 2, we can observe that metrics training is unlikely to come at the expense of reduced field visits to orphanages or volunteering in low-income schools. Treated ministers are neither less likely to visit orphanages nor less likely to volunteer in impoverished schools.

Prosocial Language. — These field results are corroborated by analyzing language use in the policymakers’ writing assignments. We find that the metrics training program does not come at the expense of a reduction in the use of prosocial language. The dependent variable is the Soft-Cosine Measure (SCM) representing the similarity of the policymakers’ writings with specific phrases related to prosocial behavior, with higher values representing greater similarity.^7 Prosocial phrases were chosen based on recent work showing that these phrases are correlated with field measures of prosocial behavior, such as blood donations, of these deputy ministers (Mehmood, Naseer and Chen, 2021). Specifically, in Figure 2, we find that words associated with prosociality are unaffected by our treatment; if anything, the metrics training is likely to reduce the use of “them” and “I”, words associated with a lack of social cohesion.

D. Treatment Effect on WTP for Evidence

Effect of Metrics Training on WTP. — In May 2021, 6 months following the metrics training workshop, all policymakers’ initial beliefs on the effect of deworming were elicited. This was followed by a “signal” on causal evidence regarding the effect of deworming on various outcomes, including income. We also elicited WTP from both private and public funds for three pieces of information: (1) results from an RCT on the impact of deworming on long-run income; (2) correlational data on incomes of schools with and without a deworming program; (3) advice from senior public officials on the impact of deworming policy. The latter two choices are status quo sources of information available to the ministers. The WTP was elicited both before and after the signal. In particular, the following signal was revealed:

Recent randomized evaluation finds deworming impacts on economic outcomes up to 20 years later. Individuals who received deworming experience up to 3 additional years of schooling, 14% increases in consumption expenditure, 13% increases in hourly earnings, 9% in non-agricultural work hours (Source: PNAS, 2021).

We found that the policymakers underestimated the long-run impact of deworming relative to the impact from the randomized evaluation results presented in the signal. Figure 3 reports distributions of initial and post-signal beliefs for the placebo and metrics-assigned groups.
The ministers’ initial belief about the impact of deworming on long-run income is about 5% (for both treated and placebo groups). Nevertheless, as can be observed from Figure 3, the group

^7 It is a continuous variable with 0 denoting no similarity and 1 indicating a perfect match with the phrase. SCM is a machine learning textual analysis algorithm that compares similarity between words and detects similarity between phrases even when they have no words in common (using pre-trained word embeddings). It is shown to outperform many of the state-of-the-art methods in semantic text similarity tasks and is widely used commercially, e.g., by Google Translate (for more details, see, for instance, Sidorov et al., 2014).
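The Soft-Cosine Measure described in the footnote can be sketched as follows. The four-word vocabulary, the two-dimensional embeddings, and the example phrases are hypothetical and chosen only for illustration; the authors' analysis relies on pre-trained word embeddings, and this is not their pipeline.

```python
import numpy as np

# Hypothetical 2-d word embeddings (assumption: toy values, not pre-trained vectors).
EMB = {
    "help": np.array([1.0, 0.1]),
    "assist": np.array([0.9, 0.2]),
    "poor": np.array([0.1, 1.0]),
    "needy": np.array([0.2, 0.9]),
}
VOCAB = sorted(EMB)

def bag_of_words(text):
    """Count vector over VOCAB; tokens outside the vocabulary are ignored."""
    tokens = text.lower().split()
    return np.array([tokens.count(w) for w in VOCAB], dtype=float)

def similarity_matrix():
    """Word-to-word similarity: cosine of the embeddings, clipped to [0, 1]."""
    E = np.array([EMB[w] for w in VOCAB])
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    return np.clip(E @ E.T, 0.0, 1.0)

def soft_cosine(a, b, S):
    """Soft cosine: (a' S b) / (sqrt(a' S a) * sqrt(b' S b))."""
    den = np.sqrt(a @ S @ a) * np.sqrt(b @ S @ b)
    return float(a @ S @ b / den) if den > 0 else 0.0

S = similarity_matrix()
x = bag_of_words("help poor")
y = bag_of_words("assist needy")

plain = soft_cosine(x, y, np.eye(len(VOCAB)))  # ordinary cosine: no shared words
soft = soft_cosine(x, y, S)                    # soft cosine: credits related words
```

With an identity similarity matrix the measure reduces to the ordinary cosine, which is zero for two phrases with no words in common. With the embedding-based matrix, the same two phrases receive a positive score, which is the property the footnote highlights.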