The Pitfalls and Possibilities of the Measurement Revolution for National Security
“It was impossible to create good metrics. We tried using troop numbers trained, violence levels, control of territory and none of it painted an accurate picture,” said one former NSC official. Said another: “There was no head turning movement or measurement on how things are going to be improving.”
Petabytes of data were collected throughout the war in Afghanistan, yet as the recently published Afghanistan Papers highlight, they rarely informed strategy. Instead, conflicting priorities and changing benchmarks of success ruled the day. And even when leaders did settle on metrics, such as the number of Taliban attacks, their interpretation was often tailored to match a desired high-level message, rather than being based on clear-eyed, consistent arguments about what different trends meant about the underlying political process. Realizing the measurement revolution’s potential to enable better security policy does not require a military led by statisticians. It requires a military led by thinkers.
The trifecta of big data, the “Internet of Things,” and machine learning creates tremendous potential for quantifying human behavior. From tracking the spread of diseases to measuring refugee integration, data that were impossible to gather even a decade ago can now be used to inform policy decisions with great precision. The paralyzing issue for today’s policy leaders is how to figure out which data-driven claims are credible and which are not. Nowhere is this more true than in national security policy, where hard-to-interpret data abound and the stakes couldn’t be higher.
To see the measurement revolution’s promise, consider some hard security policy questions.
Do countering violent extremism programs work, and if so, where should they be targeted? Recent work leveraging social media and high-resolution data on program administration can help answer both questions. New research by Tamar Mitts geolocates roughly 35,000 Twitter users in the United States who followed one or more Islamic State propaganda accounts and parses their tweets from 2014 to 2016 to identify which tweets explicitly express pro-ISIL sentiment. She finds that those living in areas where the Department of Homeland Security held community engagement events posted less content sympathetic to ISIL and followed fewer propaganda accounts in the period after the event compared to the period before. No similar change happened in places where Homeland Security did not hold events. Of course, we’d like to target effective countering violent extremism programming at communities that in fact have a significant pro-ISIL presence. Here Mitts again provides helpful evidence. A new paper uses geolocated Twitter data to show that pro-ISIL sentiment increases following anti-Muslim protests in Europe and does so more strongly in regions with more far-right voters.
How about whether decision-makers should target aid in conflict zones at microenterprises or larger firms? This is not merely a development question; getting economies growing again is widely viewed as important for long-term stability. A recent study uses three years’ worth of cellphone data to assess how the war affected firm-level economic activity in Afghanistan. Not surprisingly, the authors find that companies avoid conflict-prone areas. One major violent event in a district is associated with a 6 percent reduction in the number of firms operating in that district in the next month, and the effect persists for six months. But only firms with more than 12 employees are able to adjust in this way; there is no statistical relationship between violence in one month and the level of activity among smaller firms in the next month. In Afghanistan, at least, this evidence suggests policies to reduce the impact of conflict on the economy should target larger enterprises.
Is the American defense community building the capacity to spot such learning opportunities and ask the right questions?
The United States spends billions of dollars every year to ensure that its forces have great equipment. And there are entire training bases, such as the Joint Readiness Training Center, devoted to preparing the force to make tactical and operational decisions under pressure in sensitive circumstances. Army infantry train in the most realistic settings money can buy, practicing interacting with local civilians, coordinating supporting fires, and helping wounded comrades under fire. Air Force, Marine, and Navy fighter pilots spend hundreds of hours learning to operate highly technical systems under tremendous physical stress, including exercises such as Red Flag, which can involve hundreds of aircraft and more than 10,000 airmen, sailors, and soldiers. These kinds of exercises train structured responses, everything from the kind of immediate, nearly automatic reactions needed to handle battlefield problems to the complex managerial challenges staffs face in coordinating the actions of dozens of subordinate units based on information from hundreds of sensors and intelligence platforms, all in the face of complex logistical considerations.
Despite that prodigious investment in training to solve immediate and near-term problems, military education systems do very little to systematically train defense leaders on how to use evidence to inform longer-term decisions. The Army War College curriculum, for example, teaches necessary subjects such as “Strategic Leadership” and “Theory of War and Strategy,” but does not provide instruction on which kinds of data should inform which decisions. The professional military education system has no equivalent to the University of Washington’s “Calling Bullshit: Data Reasoning in a Digital World,” which systematically takes students through common mistakes such as assuming correlation implies causation, failing to consider base rates, and scaling data graphics in deceptive ways. Without education on how to use data to inform the big picture, modern technology has produced what Peter Singer calls “tactical generals,” leaders pulled by technology to micromanage at the tactical level, leaving few thinking about how the profusion of information could be used to learn and plan at the operational or strategic level.
This is an unfortunate state of affairs, as botching a few key principles can cause even the most astute leader to arrive at the wrong conclusion.
How Evidence Goes Wrong
Consider a few concrete questions. What drives suicide bombings? Will small-scale aid packages help establish stability in counter-insurgency campaigns? Will additional funding to airport security reduce the incidence of terrorism? In each case, an intuitive and superficially sensible evidence-based approach to the question leads to the wrong conclusion.
If you want to understand what motivates suicide terrorism, at first blush it seems sensible to look for commonalities among groups that use suicide bombings. That is, after all, the kind of thing people tend to do when they think about “lessons learned.” One prominent study did this and concluded that suicide terrorism tends to occur in conflicts involving foreign occupation by a democracy. But just focusing on the suicide terrorists was a mistake. To figure out what distinguishes groups that turn to suicide terrorism from groups that do not, you have to compare those two types of groups to one another. And studies that do so find no association between foreign occupation and suicide terrorism.
How individuals make inferences matters because those inferences drive strategy. In the suicide bombing case, different conclusions could be drawn from different datasets. For example, in the early 1980s the only groups using suicide bombings were Hezbollah and Amal, fighting in the Lebanese civil war, so one might have concluded that suicide bombings were used by Lebanese Shiite Muslim groups fighting occupation. But by the late 1980s, the Tamil Tigers had adopted the tactic, so one might reasonably have concluded the common factors were Lebanese Shiite Muslim groups fighting occupation or groups combining socialist ideology with Tamil nationalism. As suicide terrorism spread, by 2003 at least seven more groups were using the tactic, including Hamas, the Kurdistan Workers’ Party, Palestinian Islamic Jihad, Jemaah Islamiyah, al-Qaida, Egyptian Islamic Jihad, and several Kashmiri rebel groups. One might then have concluded that the common factor was groups fighting directly or indirectly against occupation by a U.S.-allied country. Finally, by 2016, one would have been forced to add the Pakistani Taliban, various Chechen groups, al-Qaida in Iraq and its successors the Islamic State of Iraq and ISIL, Jabhat al-Nusra, the Free Syrian Army, and al-Qaida in the Arabian Peninsula to the list. The common conditions would then have to encompass fighting a U.S.-, Saudi-, or Iranian-supported regime, as well as engaging in factional competition against other Sunni Islamist groups.
This is an example of a more general mistake. If you want to know whether two features of the world (say, occupation and suicide terrorism) are correlated (i.e., tend to occur together), you can’t just look at cases where one of those features occurs. You have to compare the frequency of occupations in conflicts with and without suicide terrorism.
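The logic of that comparison can be made concrete with a toy calculation. The counts below are entirely hypothetical, chosen only to show how looking solely at cases where the outcome occurred can mislead:

```python
# Hypothetical conflict records: (had_foreign_occupation, had_suicide_terrorism).
# These counts are invented for illustration, not real data.
conflicts = [
    (True, True), (True, True), (True, False), (True, False),
    (True, False), (False, True), (False, True), (False, False),
]

# Occupation rate among conflicts WITH suicide terrorism...
with_st = [occ for occ, st in conflicts if st]
rate_with = sum(with_st) / len(with_st)

# ...and among conflicts WITHOUT it -- the comparison group that
# outcome-only "lessons learned" exercises never construct.
without_st = [occ for occ, st in conflicts if not st]
rate_without = sum(without_st) / len(without_st)

# Looking only at the suicide-terrorism cases, occupation appears in half
# of them, tempting an inference. The comparison shows occupation is
# actually MORE common where suicide terrorism did not occur.
print(rate_with, rate_without)  # 0.5 0.75
```

The point is not the particular numbers but the structure: without the second conditional frequency, the first one is uninterpretable.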
And failure to appreciate this point doesn’t just affect academic studies; it can also undermine the efficacy of American defense institutions. For instance, think about the practice of performing a postmortem following some operational failure. It is natural to ask what rules were not followed or what warning signs were ignored during the failed mission. But if those postmortem procedures don’t compel leaders to ask whether those same rules were broken or warning signs brushed aside during previous successful missions, then they institutionalize the mistake of not comparing and invite leaders to jump to the wrong conclusions.
Even when individuals do compare, things can go wrong. Think about trying to assess the efficacy of small-scale aid spending in insecure environments, a topic one of us has studied extensively. We can compare across Iraq’s 104 districts and ask whether places where the United States devoted more money to small-scale aid experienced less insurgency. The answer turns out to be no. Districts with more small-scale aid projects experienced more insurgent violence, not less. But does that mean small-scale aid is counterproductive? As anyone who directed those projects will tell you, smart military leaders directed money to places where they faced bigger problems — for instance, to districts where the people were more firmly opposed to the new Shiite-led government. So, the positive correlation between aid and insurgent violence doesn’t necessarily reflect the counterproductive effects of aid spending. Instead, aid spending chased insurgency. A more clear-minded comparison can help untangle this question. We can account for the underlying level of insurgent support in a district by comparing changes in aid spending and changes in insurgent violence within districts, from one period to the next, instead of comparing levels across districts. Consistent with the concern that the positive correlation between spending and violence didn’t reflect the true causal relationship, when you compare changes, you discover that increases in aid spending from month to month are actually associated with decreases, not increases, in violence.
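A deliberately stylized sketch can show why comparing changes rather than levels matters. Every number and functional form below is invented: baseline insurgent support drives both where aid goes and how much violence occurs, so aid and violence correlate positively in levels even though (by construction here) aid reduces violence:

```python
# Stylized model (all numbers invented): district baseline insurgent
# support `s` drives both aid allocation and violence; aid itself
# lowers violence. Support confounds the cross-sectional comparison.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

months = range(4)
support = [1, 5, 10]  # three districts with different baseline support
aid = {s: [s * (1 + 0.1 * t) for t in months] for s in support}    # aid chases support
violence = {s: [10 * s - 2 * a for a in aid[s]] for s in support}  # support raises, aid lowers violence

# Comparing LEVELS across districts: aid and violence move together.
levels_r = pearson(
    [a for s in support for a in aid[s]],
    [v for s in support for v in violence[s]],
)

# Comparing month-to-month CHANGES within districts: the causal sign emerges.
changes_r = pearson(
    [aid[s][t] - aid[s][t - 1] for s in support for t in range(1, 4)],
    [violence[s][t] - violence[s][t - 1] for s in support for t in range(1, 4)],
)

print(round(levels_r, 2), round(changes_r, 2))
```

Here the positive levels correlation is pure confounding; differencing within districts strips out each district’s fixed level of insurgent support, which is the logic behind the within-district comparison described above.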
Accuracy in tactical assessments only really matters, though, if those assessments are linked to the broader mission you are trying to achieve. During the late 1960s and early 1970s, a rash of airline hijackings led the United States to require metal detectors at all major airports. This was the first step down the road to owning toiletries that fit only in a quart-size bag. The number of hijackings dropped quickly, from an average of almost 20 per quarter before metal detectors were installed to fewer than 10 per quarter after. Big counter-terrorism success, right? Well, maybe.
Let’s stop and think about whether we’ve measured the counter-terrorism mission properly. If the counter-terrorism mission is to stop hijackings, then this seems like evidence of a clear win. But what if the mission is broader — not just stopping hijackings, but terrorist attacks more broadly? Then, by looking only at the effect of the policy on hijackings, we haven’t quite measured the mission. And, indeed, it turns out that the reduction in hijackings was almost perfectly offset by an increase in other kinds of hostage takings by terrorists, who likely decided that if the United States was going to protect airplanes, they would attack other targets instead.
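Measured against the narrow metric, the policy looks like a win; measured against the mission, it may be a wash. A toy tally makes the point (the figures below are hypothetical, in the spirit of the hijacking studies, not the actual counts):

```python
# Hypothetical quarterly incident counts before and after metal detectors.
before = {"hijackings": 20, "other_hostage_takings": 5}
after = {"hijackings": 9, "other_hostage_takings": 16}

# Narrow metric: hijackings prevented. Looks like a clear success.
hijacking_drop = before["hijackings"] - after["hijackings"]

# Mission metric: total terrorist incidents. The substitution to other
# targets offsets the gain entirely in this constructed example.
total_change = sum(after.values()) - sum(before.values())

print(hijacking_drop, total_change)  # 11 0
```

The exercise is trivial arithmetic, but it is exactly the arithmetic that gets skipped when the easily measured outcome stands in for the mission.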
We can see a similar example in the war on drugs. Successful U.S. efforts to shut down drug transshipment through the Caribbean led drug traffickers to move their operations to Central America and Mexico, with no long-term reduction in drugs flowing to the United States, but with devastating consequences for those countries.
Questions to Avoid Common Mistakes
A firmer understanding of a few key evidence-based principles would add tremendous value to the defense educational framework. Leaders, especially at the senior level, can begin by asking their team, and themselves, four questions when trying to use evidence to make better decisions:
- Have you established a relationship? If you haven’t compared cases where the outcome in question did and did not occur, the answer is no. Correlation requires variation.
- Have you determined causality? Once we observe a correlation between two features of the world, we need to ask what it means. Could the relationship between two variables not be causal at all, but reflect the effect of some third variable on both of them, as with the robust correlation between ice cream consumption and murder rates over time (both influenced by temperature)?
- Have you measured the mission correctly? It’s often easy to conflate our actual goals with things that are easily measured. But that can lead to mistakes. We have to pay close attention to whether the metrics underlying our data analysis track our mission. When they don’t, we have to think carefully about what we can and cannot conclude from the evidence.
- Can I extrapolate from my evidence to my problem? Once leaders have measured their mission, established a relationship, and convinced themselves the evidence suggests a causal relationship between a policy action and an outcome in some place, they still need to ask whether the evidence they have is the right evidence for the problem they are working on. For example, does Mitts’ evidence on the efficacy of countering violent extremism in the United States imply similar programs would work in Europe? What features of the problem are similar and different in the United States and Europe that might make extrapolation more or less reasonable?
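The third-variable trap behind the second question can be demonstrated in a few lines. The data below are fabricated for illustration: temperature drives both ice cream sales and murders, producing a strong raw correlation that vanishes (here, by construction, even reverses) once temperature is held fixed:

```python
# Fabricated monthly data: temperature drives both series; their
# remaining "noise" components are unrelated to each other by design.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

temp = [10, 10, 20, 20, 30, 30]
ice_cream = [t + d for t, d in zip(temp, [1, -1, 1, -1, 1, -1])]
murders = [0.5 * t + d for t, d in zip(temp, [-1, 1, -1, 1, -1, 1])]

# Raw correlation: strongly positive, entirely through temperature.
r_raw = pearson(ice_cream, murders)

# Hold temperature fixed: compare deviations from each temperature
# group's mean (a simple way of "controlling for" the confounder).
def demean_by(group, xs):
    out = []
    for g in sorted(set(group)):
        vals = [x for gg, x in zip(group, xs) if gg == g]
        m = sum(vals) / len(vals)
        out += [x - m for gg, x in zip(group, xs) if gg == g]
    return out

r_within = pearson(demean_by(temp, ice_cream), demean_by(temp, murders))
print(round(r_raw, 2), round(r_within, 2))
```

The raw correlation is real, in the sense that the two series do move together; what it is not is evidence that one causes the other.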
Our ability to collect data has vastly improved in recent years. But to reap the national security benefits of this data revolution, our ability to think clearly about how to use evidence to make better decisions has to keep up. The good news is that, in our experience creating and teaching an executive education course on leading evidence-based decisions, leaders can acquire the key conceptual tools needed to navigate today’s information-rich environment without devoting years to becoming technical data analysts.
Like all important skills, however, evidence-based decision-making doesn’t come naturally. It takes careful training and practice. The United States should reform its defense education system to prepare leaders to understand common conceptual errors, ask the critical questions, and retain the healthy level of skepticism necessary to use evidence effectively to make better decisions. This means bringing short courses on leveraging evidence into the curriculum at many levels, from the service academies through the National Defense University. Applying some basic principles can help leaders to filter through the noise and think clearly in a data-driven age. We are at a pivotal point in history. It is vital that our leaders’ education keeps pace with innovation.
Ethan Bueno de Mesquita is the Sydney Stein Professor and Deputy Dean at the Harris School of Public Policy at the University of Chicago. He is co-creator of Leading Evidence-Based Decisions and the author of Political Economy for Public Policy.
Liam Collins is the Executive Director of the Madison Policy Forum and the Viola Foundation. He is a retired Special Forces colonel and former Director of the Combating Terrorism Center and the Modern War Institute at West Point. He is co-creator of Leading Evidence-Based Decisions.
Kristen G. DeCaires is Program Manager for the Empirical Studies of Conflict Project (ESOC) at Princeton University. Prior to ESOC, she served in various public health initiatives and research administration programs. DeCaires conducted field research and program evaluations in the U.S. and Myanmar for refugee populations, emergency response, and maternal child health projects.
Jacob N. Shapiro is Professor of Politics and International Affairs at Princeton University, where he directs the Empirical Studies of Conflict Project. He is co-creator of Leading Evidence-Based Decisions, author of The Terrorist’s Dilemma: Managing Violent Covert Organizations, and co-author of Small Wars, Big Data: The Information Revolution in Modern Conflict.