Chaos Engineering for National Defense: Embracing Infrastructure Complexity for Mission Assurance

Infrastructure failure is inevitable. National defense missions rely upon infrastructure — such as water, electricity, communications, and logistics — and those systems are guaranteed to fail under sufficient stress. Hazards range from weather, cyber attack, and equipment faults to assault, disinformation, and curious animals. When infrastructure fails, mission systems and personnel will scramble to adapt — especially during critical activities, such as a mobility surge. At the same time, infrastructure operators will rush to recover their systems. These periods of disruption are characterized by surprise — surprise in the disruption itself, surprise in second- and third-order effects, surprise in emergent requirements, and surprise in the level of confusion or lack of control. For national defense missions, this chaos introduces risk when capabilities are most essential.

Eliminating failures is an impossible goal — threats are too numerous, interdependencies are too vast, and systems are too complex. More importantly, in national defense missions, the enemy gets a vote. Checklists and compliance standards may catch human errors but are ill-equipped to prevent all surprise or mitigate its effects. Limited in scope and imagination, existing Department of Defense mission assurance programs leave national defense missions vulnerable to the uncertainty and unavoidable surprise following infrastructure failures.

How can this be addressed? Chaos engineering for national defense embraces the inevitability of infrastructure failure and ensuing chaos. Chaos engineering itself is not novel — it is a mature concept within software development. In software, developers use chaos engineering to explore and understand system behavior under stress. But national defense infrastructure cannot directly implement practices from software engineering. Physical systems provide unique challenges, defense objectives may not have clear metrics, and, critically, people don’t die when Netflix fails.

But the Department of Defense can adapt software practices and their underlying philosophy. Chaos engineering for national defense prompts mission owners and infrastructure operators to pursue knowledge of how systems and people behave under stress. Better informed organizations can then calibrate their confidence and mitigate revealed vulnerabilities. The Department of Defense — and related organizations — should develop new structures, competencies, and techniques to implement chaos engineering in pursuit of mission assurance during inevitable infrastructure disruptions.

Complexity in Infrastructure

In the broadest sense, infrastructure describes systems that support other activities. It includes physical systems (e.g., energy, water, communications) and non-physical “soft infrastructure” (e.g., business rules, contracting and acquisition, healthcare). An expansive definition is important: Following a disruption, system behavior is determined both by interdependent physical components and by human behavior shaped by accrued experience, training, and management.

Infrastructure and mission systems form a complicated web of interdependencies. Consider ballistic missile defense: Geographically separated subsystems each depend on local infrastructure, and those systems, in turn, support other local missions. Within this web, disruptions often propagate outside their originating systems. This is clear after seven years of energy resilience readiness exercises. Although constructed around power outages, these events consistently extend beyond the electrical system itself, providing critical recommendations to mission owners and their systems. In modern conflict, adversaries can, and likely will, target critical national infrastructure systems. The Department of Defense should evolve mission assurance to holistically encompass infrastructure dynamics and disruptions that affect multiple systems and locations simultaneously.

Relationships between missions and infrastructure are complicated, and their dynamics are complex, especially when disrupted. Disruption of complex systems introduces distinct characteristics, which can challenge intuition and traditional practices. Among these characteristics are nonlinearity, limited control, and unanticipated behavior. Small errors can have huge effects. For example, the 2003 Northeast blackout traced back to failures in an alarm system, yet obvious fixes may not work because operators can misinterpret alarms. Disruptions can also propagate in unexpected ways. During Hurricane Sandy, rising floodwaters led to internet disruptions despite functional backup generators. Framing infrastructure as complex is not new — Charles Perrow described “interactive complexity” in 1984. What has changed is the acceleration of forces contributing to complexity. Interdependencies continue to multiply, people cannot track the status of those interdependencies, mission requirements change faster than infrastructure can adapt, and threats to infrastructure have expanded.

In practice, how does this complexity affect personnel and organizations? When something goes wrong, complexity translates into surprise. Surprise can emerge from many directions, including the environment (e.g., weather extremes), system behavior (e.g., network traffic limitations), or mission requirements (e.g., COVID-19 support). While surprise is inevitable within new situations, there is an important distinction between situational and fundamental surprise. Situational surprise describes unlikely, but foreseeable, events — “it was bound to happen eventually” — while fundamental surprise contradicts one’s mental model — “I didn’t know this was a possibility.” If an organization does not have the resources or ability to adapt, fundamental surprise can be crippling. Furthermore, installations should expect adversaries to create fundamental, not situational, surprise. Yet traditional mission assurance processes are focused on managing situational surprise, leaving systems vulnerable to fundamental surprise.

How do organizations prepare for fundamental surprise? Threat-based analyses are insufficient, precisely because fundamental surprise results from an event that was not imagined. Comprehensive hazard enumeration is impossible, and complexity prevents discovery of all potential system behavior. Fortunately, this challenge provides an opportunity. For example: The water supply can fail for many reasons — burst pipes, contamination, cyber attack, or something unanticipated. So how will organizations adapt, regardless of the underlying cause of failure? Approaching resilience and mission assurance this way is liberating — it expands analyses beyond probabilities to possibilities. Chaos engineering was specifically designed to investigate system behavior under such possibilities.

What is Chaos Engineering?

Chaos engineering was developed at Netflix to improve the reliability of its cloud services and enable its mission to continue when software services failed. Distributed computing is inherently complex. Systems rely on the interaction between many services on many servers in many geographic locations. Each constituent piece of software code is tested in its development, but engineers cannot anticipate how software will interact once deployed. Despite its best efforts in software design, Netflix’s mission was at risk from local failures, such as power loss at a single server.

In its earliest form, an automated tool — Chaos Monkey — disabled a random Netflix service within the company’s distributed architecture. Disruptions were introduced to systems “in production,” meaning while actively used by a small subset of customers. This allowed engineers to observe real-world system behavior and discover unknown interdependencies and failure mechanisms. In other words, they exposed themselves to fundamental surprise. Armed with new knowledge, engineers updated their mental maps of interdependencies and revised their code to mitigate system-wide impacts. Netflix has since refined its approach, and chaos engineering has been adopted throughout the online systems industry.
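The core mechanism is small enough to sketch. The example below is a minimal, illustrative stand-in for a Chaos-Monkey-style tool, not Netflix’s actual implementation; the instance inventory and the terminate() action are hypothetical placeholders.

```python
import random

# Hypothetical inventory of redundant service instances running "in production."
SERVICE_INSTANCES = [
    "playback-api-1a",
    "playback-api-1b",
    "recommendations-2a",
]

def terminate(instance: str) -> None:
    # Placeholder for the real action (e.g., stopping a virtual machine or container).
    print(f"Terminating {instance} ...")

def chaos_monkey(opt_out: frozenset = frozenset()) -> str:
    # Disable one randomly chosen instance so engineers can observe how
    # the rest of the system absorbs the loss.
    candidates = [i for i in SERVICE_INSTANCES if i not in opt_out]
    victim = random.choice(candidates)
    terminate(victim)
    return victim

if __name__ == "__main__":
    chaos_monkey()
```

The value comes not from the disruption itself but from watching how the surrounding system absorbs the loss of the chosen instance.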

Chaos engineering embraces complexity and explores it proactively. Casey Rosenthal, John Allspaw, and Nora Jones, early pioneers of the concept, describe chaos engineering as “the facilitation of experiments to uncover systemic weakness.” As experiments, each event tests a specific hypothesis on disrupted system behavior, and the hypothesis is evaluated through a defined performance metric (Netflix uses stream starts per second). Each experiment examines disrupted system behavior alongside software engineers’ mental models, building knowledge of the system’s actual capabilities.

The concept of chaos engineering — stress testing through experimentation to allow understanding and resilience improvements — can be extended to other complex systems. Such efforts are still relatively new and limited, with cyber security being the most significant example. Moreover, extending chaos engineering to new domains presents clear challenges. Unlike distributed computing, infrastructure cannot take advantage of A/B testing — experimenting on a subset of users — or system rollback — quickly undoing a change. And many missions cannot be described with aggregated metrics, unlike how “stream starts per second” summarizes millions of Netflix users. Finally, distributed computing often focuses on maintaining day-to-day performance, whereas national defense missions are generally focused on readiness for the surge of activity if conflict arises. These challenges are not insurmountable, but they prevent wholesale adoption of commercial chaos engineering practices. Instead, the Department of Defense should develop new structures, competencies, and methods of evaluation to establish chaos engineering for national defense.

Chaos Engineering for National Defense

Infrastructure, like software, has an inherent complexity, and chaos engineering for national defense, like its software counterpart, seeks to discover how the system works under unexpected stresses. To be clear: The purpose of chaos engineering is, in the words of Rosenthal and Jones, “to uncover the chaos inherent in complex systems, not to cause it.” As in the software world, chaos engineering for national defense needs to be organized around deliberate, structured experimentation. Rephrased and reframed for infrastructure, the four experimental steps are: scope the experiment, generate a hypothesis, execute the experiment, and translate the results.

The first step, scope the experiment, determines the systems of interest, identifies the stakeholders, and defines the perturbations that will take place. Of particular interest are the connections (and gaps) between systems and organizations. Systems outside one’s span of control are likely to be opaque, and mistaken expectations can trigger cascading impacts — for example, assuming connection to backup power. Disruptions should extend to recovery activities and include the consideration of possible failure mechanisms. The process of scoping experiments, in itself, has value: What are stakeholders’ concerns and are they well understood? Experiments should also vary in size and level of effort. For example, a small-scale experiment could focus on a communication node, its backup generator, and the response process.
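As one way to make scoping concrete, the sketch below captures a small-scale experiment as a simple data structure; the systems, stakeholders, and boundaries listed are hypothetical illustrations, not a prescribed format.

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentScope:
    # What will be disrupted, who is affected, and where the boundaries sit.
    systems_of_interest: list
    stakeholders: list
    perturbations: list
    recovery_activities: list
    out_of_scope: list = field(default_factory=list)

# A small-scale example: one communication node and its backup power.
comm_node_outage = ExperimentScope(
    systems_of_interest=["communication node", "backup generator"],
    stakeholders=["network operations", "facility maintenance", "mission owner"],
    perturbations=["cut utility power to the node"],
    recovery_activities=["technician dispatch", "generator start and load transfer"],
    out_of_scope=["adjacent substations", "tenant mission systems"],
)
```

Writing the scope down in this form also surfaces boundaries and exclusions that stakeholders might otherwise assume silently.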

In contrast to traditional mission assurance processes, experiments extend beyond past performance and threat intelligence. Experiments should also incorporate randomness — as with Chaos Monkey — and corner cases where expected operational limits are reached. This provides a broad mandate, resolved only through continuous experimentation across a variety of scopes and scenarios.

The second step, generate a hypothesis, distinguishes chaos engineering from traditional mission assurance approaches. Those most familiar with the system must attempt to predict the system’s behavior. Hypotheses should specify the operating environment, mission requirements, performance measures, and criteria for confirmation or rejection. Omitting technical criteria here for brevity, a hypothesis might be stated as: Given steady-state communications load in the middle of the night, when power is lost and the backup generator fails to start, any throughput degradation will be identified and resolved before network traffic is throttled.
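To illustrate how such a statement becomes testable, the sketch below restates that hypothesis with an explicit performance measure and a rejection criterion; the metric name and threshold are assumed placeholders rather than real requirements.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    given: str             # operating environment and mission requirements
    when: str              # the injected disruption
    then: str              # predicted system behavior
    metric: str            # defined performance measure
    reject_if_below: float # criterion for confirmation or rejection

# The example hypothesis above, restated in testable form.
backup_power_hypothesis = Hypothesis(
    given="steady-state communications load in the middle of the night",
    when="utility power is lost and the backup generator fails to start",
    then="throughput degradation is identified and resolved before traffic is throttled",
    metric="fraction of offered network traffic carried",
    reject_if_below=0.95,
)

def evaluate(hypothesis: Hypothesis, observed: float) -> str:
    # Confirmation or rejection is decided by the defined performance measure.
    return "confirmed" if observed >= hypothesis.reject_if_below else "rejected"
```

Forcing a hypothesis into this given/when/then form is one way to expose the implicit assumptions discussed next.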

Like scoping, the process of hypothesizing provides value in and of itself. Stakeholders may confront otherwise implicit assumptions about the operating environment or mission requirements. Hypotheses, when written out, might be immediately dismissed as wishful thinking. The process itself promotes curiosity and provides insights in the pursuit of mission assurance.

The third step, execute the experiment, tests the hypothesis. Just as software is tested in production, events should seek maximum realism. This specifically includes no-notice experiments. The disruption events are investigations, not practice, and an experiment is concluded when the hypothesis is confirmed or rejected. To prevent impacts from propagating, an experiment should end if rejection is imminent or if the event escalates beyond its anticipated scope. If failures do occur, adopting the software approach of learning from incidents allows responses to shift from specific fixes toward broader gap mitigation that enhances mission assurance. In sum: Experiments are executed to gain specific knowledge, not for training value or evaluation.
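A hypothetical execution loop is sketched below to show how these stopping rules might be encoded; the measurement and escalation checks are placeholders for whatever instrumentation and observers a real event would use.

```python
import time

ABORT_FLOOR = 0.80       # halt early if performance collapses (rejection imminent)
REJECT_THRESHOLD = 0.95  # criterion carried over from the hypothesis
DURATION_SECONDS = 600   # planned length of the disruption window

def measure_performance() -> float:
    # Placeholder: return the defined metric, e.g., fraction of traffic carried.
    return 0.97

def escalated_beyond_scope() -> bool:
    # Placeholder: observers report impacts outside the experiment's boundaries.
    return False

def run_experiment() -> str:
    start = time.monotonic()
    worst = 1.0
    while time.monotonic() - start < DURATION_SECONDS:
        worst = min(worst, measure_performance())
        # End the event before impacts propagate beyond the planned scope.
        if worst < ABORT_FLOOR or escalated_beyond_scope():
            return "aborted"
        time.sleep(5)
    return "confirmed" if worst >= REJECT_THRESHOLD else "rejected"
```

The guardrail thresholds illustrate the principle that an experiment ends as soon as it has answered its question or threatens to exceed its scope.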

Unlike compliance-based mission assurance, a rejected hypothesis is not a failure or a discrepancy. Surprise allows a better understanding of system behavior, “revealing the way the world actually exists and how it differs from how we expect it to exist.” Every experiment provides knowledge or confidence; every experiment is a success.

In the final step, translate the results, chaos engineers review the event and establish next steps. If the hypothesis is confirmed, stakeholders have justified confidence in their system and can focus attention on other concerns. If the hypothesis is rejected, stakeholders have better information on vulnerabilities or flaws within their mental models. Results can be useful beyond the event’s participants.

With a better understanding of disrupted system behavior, the risk of fundamental surprise is reduced. While experiments themselves require time, effort, and resources, the end result “swaps uncontrolled risk for controlled risk.” Experiments can highlight immediate and inexpensive improvements to mission assurance, as well as more extensive capability gaps that must be weighed against mission risk and cost.

Establishing and Empowering Chaos Engineers

The cloud computing industry provides an example, but not a blueprint. So what’s next? Like chaos engineering itself, implementing chaos engineering for national defense would involve experimentation. If adopted, the concept will evolve over time, but there are three immediate actions to quickly improve mission assurance.

Action One: Introduce a new culture for mission assurance. With the right mindset and authority, anyone can be a chaos engineer. The foundational concept — experimentation — is so flexible that it can be applied without formal training or organizational structure. At the local level, stakeholders should seek opportunities to learn about their systems under stress. A commander could ask for an upcoming maintenance-driven power outage to be kept secret from their unit, with the hypothesis that an unexpected outage will not prevent the unit from executing the day’s mission. In general, leaders should explore “latent uncertainty in the thing you believe to be absolutely true.” Even without a formal structure, grassroots experimentation will improve local mission assurance.

Action Two: Establish and empower installation-level organizations. To fully establish chaos engineering for national defense, new skillsets are needed within empowered organizations. Within the Department of Defense, each service has its own approach to managing infrastructure and running installations, and installation-level organizations are evolving for future conflict (for example, the Air Force’s new distinction between combat wings and air base wings). Each installation already has requirements for readiness, plans, and mission assurance, all with associated offices and responsibilities. At the installation level, an office of chaos engineering would augment those efforts with its core philosophy: Experiment to learn about system behavior under stress.

At a minimum, an installation-level office of chaos engineering should span a variety of installation support activities (e.g., electrical systems, communications, logistics). Organizational seams and functional stovepipes create blind spots, so an office of chaos engineering must be deliberately cross-functional (e.g., not a separate chaos engineering effort for public works alone). As a cross-functional team, the office would scope experiments, support local execution, and communicate results. For example, repeating the same experiment has diminishing value, so the office would analyze and recommend high-impact events to installation leaders.

With the goal of building new knowledge, an installation-level office of chaos engineering should have a cooperative — not adversarial — relationship with local missions and organizations. To be useful, experimental results should inform performance standards for policy, training, and evaluation. For example, building codes require generators to be periodically tested. Should technician response times be similarly evaluated? What about response times at night? What about response times for backup battery systems on critical communication nodes? An office of chaos engineering establishes local-level knowledge of system behavior and vulnerabilities, and that information informs local decisions. If technician response times are unacceptable, rewrite standby procedures. If backup battery systems — generally not managed as real property — are underperforming, invest in maintenance or practice mission transfer procedures.

Action Three: Synchronize installation efforts through higher-level staffs. With the right mindset, anyone can be a chaos engineer. But higher-level staffs are needed to guide local efforts, collect best practices, and synthesize results. Within a single installation, infrastructure is dynamic and interdependent; at the national level, installations are dynamic and interdependent. A staff-level office of chaos engineering focuses on inter-installation complexity and seeks knowledge of system behavior beyond that of local leaders.

A staff-level office would also provide the center of gravity for building the skills of nascent chaos engineers. Candidates should have broad knowledge of infrastructure but, more importantly, the ability to effectively communicate with stakeholders, even when identifying uncomfortable truths. Adopting a philosophy from safety systems, an ideal chaos engineer is a concerned outsider who knows the inside. Chaos engineers should develop skills in selecting scenarios. As with distributed computing, it is impossible to test all potential disruptions, and expertise is needed to decide which experiments not to run. Personnel serving as educators, evaluators, and emergency managers can leverage their experience to build offices of chaos engineering.

Finally, chaos engineers will develop insights into how systems behave — and fail — during contingencies. Rosenthal and Jones described how the Netflix chaos engineering team developed a unique set of skills: “You naturally start to develop a specific expertise in how it’s all wired together, without having ‘deep’ expertise in any particular area of the system.” There is tremendous value in the ability to uncover, explore, and communicate modes of system failure. This expertise is a final benefit of adopting chaos engineering for national defense: a cadre of curious and empowered thinkers, all dedicated to understanding infrastructure contingencies and how to enable mission success.

Conclusion

Current approaches to mission assurance are ill-equipped to address surprise during degraded operations and contingency scenarios. Chaos engineering for national defense provides an alternative vision and a way to understand and manage the inherent complexity of infrastructure under stress. To prepare for when operational capability is most essential — when “the balloon goes up” — the Department of Defense can minimize the risk of fundamental surprise through deliberate experimentation. In preparing for future conflicts, complexity cannot be eliminated — so it must be embraced.

Nicholas Judson, Ph.D., is the associate group leader of the energy systems group at MIT Lincoln Laboratory, where he has pioneered the initiation and evolution of Department of Defense energy resilience readiness exercises.

Lt. Col. Craig Poulin, Ph.D., P.E., is an Air Force civil engineer officer. He holds a Ph.D. in interdisciplinary engineering from Northeastern University, where he researched infrastructure resilience modeling and served as a military fellow with MIT Lincoln Laboratory. Formerly the commander, 801st RED HORSE Training Squadron, he is currently the director, Department of Engineering Management at the Civil Engineer School, Air Force Institute of Technology.

The views expressed are those of the authors and do not reflect the official policy or position of the Air Force, Department of Defense, or the U.S. government.

Image: Staff Sgt. George Davis