See You in a Month: AI’s Long Data Tail


This submission is in response to the search for ideas from the National Security Commission on Artificial Intelligence. It addresses item 3b — the infrastructure needed to sustain leadership in artificial intelligence, and item 3d — how data should be collected, stored, protected, and shared.


If you are an analyst — it does not matter what kind — then like me, you eventually become begrudgingly resigned to spending most of your time preparing for, rather than actually conducting, analysis. I sometimes share a dark joke with my fellow analysts. When asked how long it will take to provide an answer to a question, my response is, “About two days. But I’ll see you in a month because it will take me 28 days to find, beg for access to, and clean the data I need to answer your question.”

The typical analyst spends as much as 80 percent of his or her time discovering, accessing, and cleaning data before conducting any analysis. I have revalidated this percentage in discussions with intelligence analysts, program analysts, cost analysts, and operations research analysts, including those who leverage machine learning as their workhorse solution. Some senior policymakers are aware of this dirty little secret, and yet, the problem remains stubbornly persistent. Why? And more importantly, why should we care about solving it?



We should care now more than ever because artificial intelligence, or AI, can become an indispensable capability for achieving military and economic superiority. But taking full advantage of this amazing technology to enhance national security requires a focus not simply on machine learning algorithms, but also on the data to make them useful. In the national security realm, most analysts serve decision-makers who turn analytical answers into competitive advantage. At its core, AI is also about answering questions, often in order to gain competitive advantage. Forget about the science fiction of The Terminator. That sort of General AI is not coming to the battlefield (or anywhere else) any time soon. The engineering fact of Narrow AI, however, is here today in the form of data-hungry machine learning algorithms.

To fully appreciate the need to focus on the data, consider a mid-nineteenth-century historical analogy: The move from muzzle-loading rifled muskets such as the Springfield to magazine-fed repeating rifles like the Spencer increased an infantryman’s rate of fire by a factor of ten.

Just as infantrymen wield rifles loaded with ammunition, AI warriors wield algorithms loaded with data. Will their data have to be prepared in a time-consuming manner akin to the loading of the Springfield with Minié ball, paper powder cartridge, percussion cap, and ramrod, or will their data be as immediately accessible and usable as the seven rounds in the Spencer’s magazine?

If AI warriors are to immediately and rapidly “fire” their weapons instead of constantly reloading them, data must be plentiful, discoverable, accessible, and of the highest possible quality.

Monetary Costs and Opportunity Costs

A common misperception among many military operators is that data is ubiquitous, of high quality, and free. It rarely is. They see what companies like Google and Facebook do with data. Understandably, they expect equally powerful outcomes from military data. There is a fundamental difference, though, in the data streams available to each. These companies set up systems where unpaid humans automatically and effortlessly generate (not collect or record) data about themselves, their family, and their friends just by using apps that they believe enrich their daily lives. Developers then monetize that data by writing code to extract correlations and insights. The infrastructure developed to “instrument” these humans was expensive at first, but with the sharp downward trends in the price of data storage, transmission, and retrieval, the marginal cost for each new exploitable data point approaches zero.

Not so with much of the data available to industries outside of Silicon Valley that were not “born digital,” or for the data available to military decision-makers. True, some military systems do provide large-scale machine-generated data sets, such as radar returns and video feeds, but because the potential utility of this data is sometimes not understood, it either is not saved or is not held very long. The preponderance of data held by corporations outside of Silicon Valley and by the public sector (including the military) is not saved in a readily discoverable or accessible manner.

Military data is often collected by fallible, fatigued, and error-prone human beings. The United States military’s recent difficulties with improvised explosive devices provide a pointed example. Explosive ordnance disposal technicians and weapons technical intelligence specialists on the battlefield encounter these devices, often while accompanying tactical units with military objectives to achieve. Once a technician has completed a potentially life-threatening procedure — sometimes under fire — to disarm the bomb, the device becomes a treasure trove of exploitable data. But because a patrol leader often needs to move on to the primary objective of the mission, there may be just enough time to bag and tag the physical evidence, and not enough time to record the data that describes the device and its emplacement. Instead, that recording has to wait until the end of a twelve-hour patrol in hundred-degree heat, when the only thing standing between the exhausted collector and his bed is writing a report about the explosive devices encountered on patrol.

With this in mind, I never disparaged the quality or completeness of these reports when I was working to improve the data needed for the forensic and biometric exploitation of these devices. Companies like Google and Facebook do not have this sort of problem.

Human data collectors rarely see the negative consequences of poor data quality — at least not directly, or immediately. The result is often data sets too small to be useful to machine learning algorithms, and of dubious quality. The government compounds these data quality problems with data accessibility problems. Military data is often sequestered in undiscoverable and inaccessible data storage devices and hoarded by data stewards who tend to believe that nothing good can come from granting third-party access to the data in their care.

Because of this human involvement, much military data is expensive to generate, record, store, clean, and share. The herculean task of finding, accessing, and making this data useful falls to the analyst — a task that consumes the bulk of the analyst’s productive time. Public-private partnerships to advance military AI adoption are a great idea, but both sides need to realize that significant differences in data collection environments will impede the adoption of private-sector solutions by the military. Adopting commercial technology in the sometimes rugged and austere military environment already requires significant adaptation. The difference between commercial and military data environments will likely add to the adaptation measures the military will have to endure.

Unenforceable Policies and Laws

If AI depends on data, what can be done to make more data discoverable, accessible, and of the highest possible quality? Department of Defense data would be in great shape right now if the department enforced its existing data policy. After all, that policy already requires data to be visible, accessible, understandable, trustworthy, and interoperable to the maximum extent possible.

Legacy systems are exempt from these requirements unless they receive modernization funding, so it would be reasonable to conclude that newly fielded or modernized systems comply with the data policy. Few do. Program managers sometimes invoke (or abuse) urgent capability acquisition authorities to justify noncompliance with acquisition policy requirements. I have seen noncompliant systems deployed to active combat zones with no accountability or negative repercussions for their program managers. One system had data so deficient that, according to a report by the Center for Naval Analyses, it “could increase risk for existing and future missions.”

While the existing policy contains at least one enforcement provision, it provides no enforcement mechanism. Policy requires that the Defense Comptroller prohibit the execution of funds for noncompliant programs, but no procedure exists to make these programs known to the comptroller. Occasionally, I ask my colleagues whether they have ever seen a program defunded for noncompliance. Thus far, no one has.

The policy describes data requirements qualitatively, without clear standards for what it means to be compliant. How much visibility or trustworthiness meets the standard? If a speed limit sign says “slow” instead of 55 miles per hour, how fast can you go? If given a speeding ticket, would you fight it in court?

Beyond Department of Defense policies, the Information Quality Act of 2001 applies to the entire federal government. This sometimes controversial law, with recently updated implementation guidelines, requires data disseminated to the public by the federal government to exhibit utility, objectivity, and integrity. Parties negatively affected by data that does not meet these standards can request corrections. Unfortunately, the law only covers data available to the public, which is limited based on law, policy, data rights, and security classifications. Circumventing the law is easy because of the discretion given in defining covered data and affected parties. One rather egregious example suffices to demonstrate this: The Department of Homeland Security simply declared that it “does not currently produce or sponsor the distribution of influential scientific information,” so the law does not apply. While the law applies only to external data, such a dismissive declaration leaves a quality-conscious analyst to wonder whether the Department of Homeland Security is equally dismissive of the utility, objectivity, and integrity of its internal data — data collected to ensure the safety and security of every American citizen.

These laws and policies long predate the recent fascination with AI. Congress and the Department of Defense have long recognized the value of data quality and information sharing. AI technology advancements have greatly increased that value, elevating existing laws and policies from critical operational enablers to national security imperatives.

What Congress Can Do

Congress should make the data quality provisions of the Information Quality Act applicable to all federal government data, not just publicly disseminated data. The Office of Management and Budget guidelines should provide clear, binding, and quantifiable definitions of “utility,” “objectivity,” and “integrity.” Compliance then becomes measurable, and thus, enforceable.
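The difference between a qualitative standard and a quantifiable one can be made concrete. The short Python sketch below shows how a numeric threshold turns a compliance question into a mechanical check; the record fields, the 90 percent completeness threshold, and the records themselves are invented purely for illustration, not drawn from any actual policy or data set.

```python
# Illustrative only: a quantified data-quality standard makes compliance
# a mechanical check instead of a judgment call. All fields, thresholds,
# and records below are hypothetical.

RECORDS = [
    {"device_id": "IED-001", "location": "38.9N 77.0W", "disarm_method": "manual"},
    {"device_id": "IED-002", "location": None, "disarm_method": "robot"},
    {"device_id": None, "location": "34.5N 69.2E", "disarm_method": None},
]

REQUIRED_FIELDS = ["device_id", "location", "disarm_method"]
COMPLETENESS_THRESHOLD = 0.90  # hypothetical: 90% of required fields populated

def completeness(records, fields):
    """Fraction of required fields that are populated across all records."""
    total = len(records) * len(fields)
    filled = sum(1 for r in records for f in fields if r.get(f) is not None)
    return filled / total if total else 0.0

score = completeness(RECORDS, REQUIRED_FIELDS)
compliant = score >= COMPLETENESS_THRESHOLD
print(f"completeness = {score:.2f}, compliant = {compliant}")
```

With a rule like this in place, “trustworthy to the maximum extent possible” stops being a matter of interpretation: a program either meets the number or it does not, and the comptroller has something to enforce.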

Congress should also define the terms “influential scientific information” and “affected parties” to make congressional intentions clear, thereby preventing circumvention of the law by discretionary interpretation.

Congress should recognize the clear differences between data held across the many different departments and agencies of the federal government. It should direct the Office of Management and Budget’s Office of Information and Regulatory Affairs to allow data quality definitions unique to each department and agency, but require that these “local” definitions also be clear, binding, and quantifiable to the maximum extent possible.

Congress should direct the Government Accountability Office to study enforcement mechanisms for existing data quality and information sharing laws and policies. Its report should identify enforcement provisions, mechanisms, tools, and resources, or the lack thereof, along with any best practices for enforcing those laws and policies.

Congress should act on the resulting report to make all policies and laws enforceable.

Enforcement of these policies and laws does not guarantee that the data needed to harness AI for national security purposes will suddenly become fully discoverable, accessible, and of high quality — for almost two decades, these rules were “in force but not enforced.” But the existing climate of disregarding and sometimes flouting the rules does guarantee that without realistic penalties for undesirable behavior, the behavior will continue.

No Data, No AI

Data quality is not free. Data-centric policies, laws, and even strategies defined a plausible way forward almost twenty years ago. Despite incredible technological progress since then, little has changed on the to-do list. The current AI strategy mostly reiterates and refines the benefits of data quality and information sharing that we are already aware of. One more unenforceable policy, law, framework, or strategy document will not help. If you are drafting such a document, and are not including enforcement mechanisms, put down your pen.

American leaders recognize, as do America’s adversaries, that mastery of AI confers leadership status in economic matters. If, as the National Security Commission on AI contends, AI mastery is a prerequisite to future American national security, then the United States as a nation ignores data discoverability, accessibility, quality, and quantity at its peril.



Dr. Michael Stumborg is a principal research scientist with the Center for Naval Analyses. The views expressed here are his own and are not necessarily those of any department or agency of the federal government, or the Center for Naval Analyses.

Image: Pexels (Photo by Florian Weihmann)