Trust and Tech: AI Education in the Military

Joseph O. Chapa

The average life expectancy of a non-instrument-trained pilot in instrument conditions is 178 seconds — or so says a 1954 study that pilots have referenced ever since. The point of the aphorism is that once a pilot enters the clouds, instinct is of very little help in keeping the airplane upright. Learning to use the instruments when you can’t see outside is crucial to flying safely in the weather. But pilots learn not just to interpret the information displayed on the instruments but also to trust that information. Every instrument-rated pilot is familiar with the phenomenon: I pass from blue sky into the clouds and transition from using the horizon to determine my aircraft’s orientation to looking at the instruments in the cockpit. The vestibular system in my ear sends my brain compelling signals suggesting that I’m in a banked turn to the right. My natural inclination is to roll left to level the wings. But my training has taught me differently. I instead look to the attitude indicator. If it tells me that my wings are, in fact, level, then I trust the attitude indicator and fight the urge to bank the airplane. To do otherwise is literally to take my life into my hands while flying blind.

The rapid development of AI technology, and its employment within the military, raises questions about trust and technology like the ones pilots have wrestled with for decades. Though the questions are new, they have some similarities with those raised when flying by instruments became possible nearly a century ago. At that time, the educational resources that could have helped pilots with those questions were available to the U.S. Army Air Corps, but the service did not avail itself of them. Similarly, the resources for providing operators with a baseline education in AI, an education that would help address these questions about trust even if it would not resolve them completely, are already available in industry today. Though developing organic educational content tailored to national defense is important, it will also take time. In the short term, the Department of Defense should buy licenses for commercially available, online, modular curricula to bring its workforce into the age of AI.


Trusting the Instruments

Is it trustworthy? That’s the first question we should ask about any AI application. If it is, we have to ask a second question: Will its operators actually trust it? The congressionally mandated National Security Commission on AI, led by former Google CEO Eric Schmidt and former Deputy Secretary of Defense Robert Work, put the challenge this way in a draft of its forthcoming report: “Ultimately, [those charged with using AI tools] need to formulate an educated answer to this question: In the given circumstances, how much confidence in the machine is enough confidence?” This question predates AI and, in fact, was asked at the birth of instrument flying.

Until the 1930s, flying in the clouds posed a considerable problem for pilots. In Timothy Schultz’s telling, the received wisdom was that experienced pilots with good instincts and well-functioning vestibular systems could fly straight and level even in the clouds. Of course, we know now that this was not so. The lucky ones made it through the weather right side up and lived to become experienced pilots. The unlucky ones didn’t. Many pilots even came to believe that the clouds were made of magnetic fields because, even though they “knew” that they were maintaining a constant heading, when they entered the clouds, the magnetic compass “began to rotate.”

Mechanical flight instruments would eventually solve this problem. In 1929, then-Lt. Jimmy Doolittle flew an entire flight — including the landing — with his canopy totally covered by a hood, so that he could see only the instruments inside his cockpit and nothing of the outside world. Doolittle would go on to fame in aviation history by leading the 1942 carrier-launched B-25 raid on Tokyo that now bears his name. But in 1929, Doolittle was putting his Ph.D. in aeronautical engineering to work in ways that would have far more profound implications for the future of aviation. Doolittle showed that the instruments were trustworthy, and instrument-rated pilots ever since have learned to trust them.

Applying these lessons to the AI challenge looks simple. Find some smart, technical people who can produce AI solutions; have them test those solutions; and if those experts say the solutions are trustworthy, then the problem is solved. But the AI challenge, like the flight-by-instruments challenge a century before, is more complicated than that.

Faults in the System

An important element in a person’s ability to trust technology is learning to recognize a fault or a failure. For instance, the airspeed and altitude indications in the cockpit are fed by sensors on the outside of the aircraft, known as the pitot tube and static ports. A blocked or frozen pitot tube will cause counterintuitive effects in the airspeed indication. Depending on the particular system and the specific kind of blockage, the airspeed indicator may stop responding to changes in airspeed, no matter how fast or slow the aircraft goes, yet rise and fall with changes in altitude. Now imagine flying in the weather with a blocked pitot tube. If not properly trained, pilots will either fail to notice the fault, or notice it but fail to identify it correctly and respond to it safely.

As AI applications become embedded in military systems, they will present a similar hurdle. First, a relatively small number of professionals — software developers, data scientists, ethicists, and others — will have to develop AI solutions and demonstrate that they are trustworthy. This will take place largely in the various phases of AI testing (though in the case of machine learning applications, algorithms will need to be examined across the lifecycle any time new training data is introduced).

But it’s the people who will ultimately operate these systems who will have to learn to identify faults and to react to those faults safely. And, just as the student pilot has to learn enough about the pitot-static system to interpret unexpected indications correctly, so too will the operators of the AI system have to learn enough about the technology behind the user interface to be able to interpret information correctly and react safely.

This is a difficult challenge with AI because, sort of like the pitot-static system, when AI fails, it often does so in counterintuitive ways: It’s not always obvious that it has failed, or why. There are numerous examples from the tech industry, and those who follow AI closely will already be familiar with many of them. Here it might be useful to look at one example.

AI researchers at the University of Virginia and the University of Washington used a well-known set of labeled images called imSitu to train an image recognition algorithm. Each of the 125,000 images in the set included, among other things, a label for the place (e.g., “kitchen”), an agent (e.g., “man” or “woman”), and an activity (e.g., “cooking”). After training their image-classifier algorithm on a subset of the images, the researchers then applied the algorithm to the rest of the set to see how closely its classification matched the human labels. The algorithm performed well in general but failed in one specific area: In images in which the human labelers had identified a man cooking in a kitchen, the algorithm often labeled the image as a woman cooking in a kitchen. The algorithm labeled only 16 percent of the images of people cooking in a kitchen as men even though the human labels identified men in the kitchen 30 percent of the time. Bias in the training data led to bias in the classifier.

We might be tempted to call this error a simple case of “garbage in, garbage out”: A more representative training data set would have produced a more accurate result. The training data set was biased in that it didn’t show an equal number of men and women cooking in kitchens. Though this much is true — a more representative training data set really would have helped — there is more to the failure than that. The algorithm didn’t just adopt or incorporate the bias in the training data. It amplified the bias. Because of the bias in the training data, this machine-learning application learned that women are more likely than men to be in rooms that look to you and me like kitchens. But even this description anthropomorphizes the algorithm. More precisely, the algorithm learned that the features of an image that would cause you and me to think of concepts such as “kitchen” and “cooking” are correlated with the label “woman” more tightly than they are correlated with the label “man.” The algorithm used the features that signal to us “kitchen” as a proxy for the features that ought to signal the label “woman.” Ultimately, it learned (incorrectly) that 84 percent of the time, identifying the features of a kitchen is sufficient to predict that the agent in the image is a woman. The researchers discovered the error and corrected it, but there are at least two elements of this example that should affect how the military prepares for more widespread use of AI.
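The gap between those two percentages is what makes this amplification rather than mere inheritance, and it can be checked with a few lines of arithmetic. The sketch below, written in Python purely for illustration, compares the share of “man” labels in the human annotations with the share in the model’s predictions; the function name, the toy label lists, and the 30 and 16 percent figures come from the description above, not from the researchers’ actual code or the imSitu tooling.

```python
# Illustrative only: compare the gender skew in training labels with the skew
# in a model's predictions to see whether the bias was amplified.
# The 30/16 split mirrors the example above; it is not the study's real data.

def share_of_men(agents):
    """Fraction of cooking-in-a-kitchen examples whose agent label is 'man'."""
    return sum(1 for agent in agents if agent == "man") / len(agents)

# Hypothetical agent labels for images annotated as someone cooking in a kitchen.
human_labels = ["man"] * 30 + ["woman"] * 70        # 30 percent men in the human labels
model_predictions = ["man"] * 16 + ["woman"] * 84   # 16 percent men in the model's output

train_share = share_of_men(human_labels)            # 0.30
predicted_share = share_of_men(model_predictions)   # 0.16

# If the model merely reproduced the training data's skew, these two shares
# would match. A predicted share even further from the truth means the model
# amplified the bias rather than simply inheriting it.
print(f"share of men in human labels:      {train_share:.2f}")
print(f"share of men in model predictions: {predicted_share:.2f}")
print(f"additional skew toward 'woman':    {train_share - predicted_share:+.2f}")
```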

First, the researchers knew what to look for because they are highly trained experts in the field of machine learning. To refer back to the flying analogy, they are more Jimmy Doolittle than undergraduate pilot training student.

Second, the seriousness of the risk that accompanies failures in AI depends upon the context. Suppose a machine-learning system is trained to recognize different kinds of receipts, pull the relevant data from each receipt, and log it in a database, saving time on processing travel vouchers. The cost of failure is relatively low. It might mean the office loses some of the potential manpower savings because human agents have to conduct quality assurance or because some users have to input the data manually after the fact. But now imagine a case in which a machine-learning system is trained on old promotion data to rank servicemembers in order of merit before a promotion board. The old promotion data is likely to include biases. The consequences of failure here are much higher: The people who ought to be promoted might not be, and servicemembers might be disadvantaged based on factors that ought to be irrelevant to promotion.

Or, imagine a machine-learning system that is employed in combat to increase a commander’s situational awareness of civilian as opposed to military vehicle traffic. Even if that algorithm has no direct connection to the kill chain, the commander’s understanding of the world will be at least partly shaped by what it reports. Ultimately, the cost of failure of the machine-learning system might be very high. And like the pilot flying on instruments, the operator’s ability to recognize faults and mitigate risk will depend on the operator’s education and training in the relevant technology. Will the commander know enough about the machine learning behind the traffic app’s user interface to recognize if it’s malfunctioning?

Developing an AI-Ready Workforce

This is a workforce development challenge, and it demands a workforce development solution. Just as enterprising instructor pilots had to develop new ways of teaching student pilots how to use and when to trust (and when not to trust) their instruments, enterprising professionals from across the Department of Defense will need to develop new ways of teaching military practitioners how to use and when to trust AI and machine-learning applications. They don’t all need to be Jimmy Doolittle Ph.D.s, though the department certainly needs some of those too. Instead, the military needs professionals who are experts in their own career fields and in the tactical employment of their own weapons systems but who also understand AI and machine learning well enough to identify its best uses — and especially to teach others how to mitigate risks in their specific areas of expertise.

In the face of the workforce development challenge confronting the department, the Joint Artificial Intelligence Center, the department’s lead for AI, has published an AI education strategy that identifies six categories of employees, each of which will need a different kind of AI education. In the Joint AI Center’s terms, these categories are those who will “lead AI,” “drive AI,” “create AI,” “facilitate AI,” “embed AI,” and “employ AI.” The operators I referred to above — those managing the travel voucher system, those managing promotion systems, and the commander monitoring vehicle traffic — will fall into the “employ AI” category. In fact, according to the strategy, so will most Department of Defense employees.

What exactly does an employee need to know to employ AI safely and effectively? The Joint AI Center has offered broad outlines but no specifics. To employ AI, a person must be able to “interpret AI application output to inform decision making; understand AI concepts and recognize potential future applications; and demonstrate proficiency related to engaging with and interpreting AI application.” This description bears a striking resemblance to technical proficiency requirements for pilots during World War II: By 1943, “all students in basic [flight training] had to take and pass an instrument flight check ride.” To fly on instruments safely and effectively, pilots would have to interpret instruments to inform decision-making, understand and recognize applications, and demonstrate proficiency in engaging with and interpreting the instruments.

Though these broad brushstrokes are helpful, there is much left to learn about what a pilot, a surface warfare officer, or a logistics officer needs to know about AI to meet this standard. The Air Force Accelerator at MIT is developing a preliminary “employ AI” training program for the Air Force and Space Force that will be launched later this year to answer some of these questions.

To address this problem further, we might look once again to the instrument flying analogy.

At present, education in AI-related fields is available to the Department of Defense, and yet, the department lags behind its industry counterparts. Similarly, while the Army Air Corps had the technology to develop its workforce in instrument flying during the interwar years, it lacked the institutional will. Meanwhile, as early as 1932, “all commercial pilots had to undergo an extensive instrument training program and pass a rigid check flight before they were licensed.” One pilot recounts an experience from pilot training in the fall of 1941 — just a few months before the United States would enter World War II — in which he asked his instructor about the instruments on the dashboard. The instructor said, “Don’t mess with those things … they’re for airline pilots.” The result for the Air Corps at the beginning of the war was devastating. As one U.S. bomber pilot in England wrote to a pilot trainee in the United States, “For God’s sake, get all the instrument flying you can. It’s the difference between life and death over here.”

The Air Corps’ response to the challenge came through Col. Joe Duckworth. Duckworth had logged a staggering 12,000 flight hours and gained extensive instrument experience as a pilot in the civilian world. He used that civilian expertise to change the way the Air Corps taught pilots to fly.

The defense workforce’s position with respect to AI today is not unlike the position of the Air Corps’ workforce with respect to instrument flying in the 1930s and early 1940s: The necessary training exists, but it resides largely outside the department. In 1940, Duckworth began to change the Air Corps by applying tools already available in the civilian aviation industry, using them to teach airmen how to use the new technology, when to trust it, and when not to.

Today, according to numerous studies and subject matter experts, expertise in AI — and especially in AI education — resides largely outside the Defense Department. There are commercial education programs that can be tailored to the Defense Department workforce if the department, or the several services, are willing to buy the licenses. In fact, some of these commercial programs are the very same ones that companies in Silicon Valley use to train their own employees. Amazon has committed $700 million to train 100,000 employees worldwide in AI-related fields. AT&T is likewise spending $1 billion on its company-wide reskilling program. These industry workforce development tools are as available to the Defense Department as they are to tech companies.

If, as the Joint AI Center has suggested, Defense Department employees in the “employ AI” group need a working knowledge of AI and must possess the ability to engage with and interpret AI tools, perhaps the department should start where leaders in the tech industry have started — licensing online education.

The National Security Commission on AI insists that, to be an AI-ready force by 2025, warfighters will need “baseline digital literacy and access to the digital infrastructure and software required for ubiquitous AI integration in training, exercises, and operations.” The baseline digital literacy requirement can be met in the near term with these off-the-shelf industry tools.

The Air Force is currently running a pilot program called “Digital University” to begin to make these courses available to members of the Air Force and Space Force. The Joint AI Center has likewise committed funds to run its own training pilots through Digital University. But it remains to be seen whether there is sufficient institutional will to continue to fund these efforts after the pilot programs conclude. If the Defense Department fails to make AI education available to its workforce, it will risk repeating the fate of the 1930s Air Corps, which had the right technology but failed to prepare its workforce to employ it. As the National Security Commission on AI puts it, “The artificial intelligence (AI) competition will not be won by the side with the best technology. It will be won by the side with the best, most diverse and tech-savvy talent.” Developing organic training and education within the department and capitalizing on industry capabilities are not mutually exclusive propositions. Start with what’s available.


Joseph O. Chapa is an officer in the U.S. Air Force and holds a Ph.D. in philosophy from the University of Oxford. He is a senior pilot and currently works on artificial intelligence issues on the Air Staff. The views expressed are those of the author and do not necessarily reflect those of the U.S. Air Force, the Department of Defense, or the U.S. government.

Image: Jacobs School of Engineering, UC San Diego