
Cogs of War

Your Defense Code Is Already AI-Generated. Now What?

Markus Sandelin
March 25, 2026

Somewhere in a defense ministry, someone is drafting a policy on whether to permit AI-assisted software development in defense procurement. The instinct to restrict is understandable, but any such policy is unenforceable, because the code is already there.

In April 2025, Microsoft Chief Executive Officer Satya Nadella told an audience that 20 to 30 percent of code in some Microsoft repositories is now AI-generated. The figure cannot be independently verified — as multiple analysts have observed, there is no reliable method to measure AI-generated code in a repository after the fact. If Microsoft cannot trace it in its own repositories, the national defense organizations buying Microsoft products certainly cannot, yet the company continues to ship those products to every allied government.

The open source foundations underneath defense systems — including Linux itself — are increasingly maintained by contributors using GitHub Copilot, which now has over 20 million users. The libraries that those systems depend on receive AI-generated patches, reviewed by AI-assisted tools, merged by developers who accepted the suggestion because it looked right. By the time any of this reaches a defense application, the code has passed through dozens of AI-touched links in a supply chain that no nation currently tracks.

GitHub reports that Copilot writes 46 percent of code in files where it is enabled — and it is enabled across 90 percent of Fortune 100 companies. Cursor, one popular tool among several, produces approximately one billion lines of accepted code per day. These are not just experimental toys used by early adopters, but the primary mechanism through which production software — including the software nations buy, deploy, and depend on — gets written.

The debate about whether national defense organizations should adopt AI-assisted development is over. The question is whether they will build the verification infrastructure to manage what they are already running, or whether they will write doctrine that pretends the decision has not yet been made.

The Chain Nobody Sees

A software supply chain is not a single transaction, but hundreds of links, each depending on the one below it. A defense application runs on an operating system. That operating system depends on libraries. Those libraries depend on other libraries. Each link was written, updated, or patched by someone. Increasingly, that someone is an AI.

Consider this plausible scenario: A national defense organization procures a command-and-control system. The vendor built it using a modern development stack. The developers used Claude Code, Copilot, or Cursor for portions of the code — as the vast majority of large enterprises now do. The application depends on open source frameworks, which themselves depend on hundreds of packages maintained by small teams of volunteers. Those volunteers use the same AI tools — and as of February 2026, Anthropic is actively giving its most powerful coding tool, Claude Code, to up to 10,000 open-source maintainers for free, specifically to encourage them to use it for maintenance and code review.

The command-and-control system runs on Linux, where kernel maintainers have debated the role of AI-generated patches since at least 2024. Underneath the Linux system sits firmware and drivers, some provided by hardware vendors whose development practices are entirely opaque, and many of which themselves rely on open source components and libraries.

At no point in this chain does anyone record which code was written by a human, which was suggested by an AI, and which was generated wholesale. A procurement policy that says “we will not accept AI-generated code” is therefore a wish, not a control. It is unenforceable, for reasons that will become clear.

Every Attack Vector Has Been Demonstrated

This would matter less if AI coding tools were simply unreliable. Unreliable tools produce bugs, and bugs are findable. The concern is no longer reliability but compromise, and every technical prerequisite for compromising AI-generated code at scale has been independently demonstrated.

Researchers at USENIX Security 2024 showed that contaminating 0.2 percent of a model’s training data — 160 files out of 80,000 — embedded backdoors that evaded all standard detection tools. Why does such a small amount work? Because the model does not need to learn the backdoor as a primary pattern. It only needs to associate the malicious output with a specific trigger context. The rest of the training data teaches the model to write clean code, which is what makes the poisoned output invisible: It looks exactly like everything else the model produces, until the trigger fires.

Anthropic published research in January 2024 demonstrating that a model could be trained to write secure, correct code under normal conditions while injecting exploitable vulnerabilities when triggered by a specific signal — in their experiment, the calendar year changing. The backdoor survived every standard safety technique, including the reinforcement learning process specifically designed to remove unwanted behaviors. Larger models proved harder to fix. Why? Because larger models have more capacity to compartmentalize behaviors, maintaining the appearance of safety in one context while preserving the compromise in another. The capability that makes them useful is the same capability that makes them dangerous.

Trail of Bits demonstrated in August 2025 that an attacker could file a normal-looking bug report on GitHub containing invisible instructions in the page’s HTML. When a coding assistant read the page, it followed the hidden instructions — in their demonstration, installing a backdoor. The developer saw nothing unusual.

And in July 2025, an attacker exploited a flaw in the build process for Amazon Q Developer and injected a malicious instruction into the official product distributed through Visual Studio Code’s marketplace. The compromised extension had over 964,000 installations. It was publicly distributed for two days. The instruction directed the AI to wipe users’ systems and delete cloud resources. The only reason it caused no damage was a syntax error in the attacker’s payload. A typo is the current margin of safety for AI coding tool supply chains.

The Tools That Build the Tools

There is a layer beneath all of this that the conversation has not caught up with.

The AI coding tools themselves are software. They are built, updated, and maintained using the same AI-assisted development practices they enable for everyone else. Claude Code receives multiple updates weekly. Cursor ships updates at a similar pace. These are not annual releases that undergo a six-month certification cycle. They are continuous deployments into the workflows of millions of developers, including developers building defense systems, and each update changes what the tool generates, how it interprets context, and what patterns it injects into downstream code.

This is the recursive problem. It is not just that AI tools write code that enters defense systems. AI tools — increasingly built and tested by agentic AI processes — write the tools that write the code that enters defense systems. The chain has folded back on itself.

And the output is genuinely difficult to inspect. These are not template engines inserting predictable boilerplate. They are context-sensitive generators that produce different code depending on the project structure, the surrounding files, the conversation history, and the configuration they were given. What one session generates may differ from what the next session generates for an identical prompt. The variation is a feature — it makes the tools useful. It also makes their output effectively opaque to verification. A human reviewer sees code. Another AI reviewer sees code. Neither can reliably determine whether a function was written character by character by a person or generated by an agent that read a poisoned configuration file three steps back in a chain nobody logged.

This is why banning AI-generated code from defense procurement is unenforceable. Code is code. There is no watermark. There is no signature. A function written by a human and the same function generated by an AI are identical artifacts. You cannot inspect the output and determine the origin, any more than you can look at a brick wall and determine whether the bricklayer was left-handed. Provenance tracking — the idea that you could record which model touched which line — only works if it is recorded at the moment of generation, by the tool doing the generating, in a chain of custody that no party in the supply chain currently maintains or has any incentive to build.
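The point about generation-time provenance can be made concrete. The sketch below is a hypothetical record format, not an existing standard: a coding tool would emit a signed attestation for each generated hunk at the moment of generation, so that a later audit can at least narrow the search when a model is found to be compromised. The record fields, the function names, and the use of an HMAC as a stand-in for a real signature scheme are all illustrative assumptions.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone

# Hypothetical per-hunk provenance record, emitted by the coding tool
# at the moment of generation. The format is illustrative, not a standard.
def make_provenance_record(generated_code: str, model_id: str,
                           tool: str, signing_key: bytes) -> dict:
    record = {
        "tool": tool,
        "model": model_id,
        "output_sha256": hashlib.sha256(generated_code.encode()).hexdigest(),
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }
    payload = json.dumps(record, sort_keys=True).encode()
    # HMAC stands in for a real signature scheme; a production system
    # would use asymmetric signing with a verifiable key.
    record["signature"] = hmac.new(signing_key, payload, "sha256").hexdigest()
    return record

def verify_provenance(record: dict, code: str, signing_key: bytes) -> bool:
    # Recompute the signature over the unsigned fields and check that the
    # recorded hash matches the code actually sitting in the repository.
    unsigned = {k: v for k, v in record.items() if k != "signature"}
    payload = json.dumps(unsigned, sort_keys=True).encode()
    expected = hmac.new(signing_key, payload, "sha256").hexdigest()
    return (hmac.compare_digest(expected, record["signature"])
            and unsigned["output_sha256"]
            == hashlib.sha256(code.encode()).hexdigest())
```

The scheme only works because the record is created by the tool at generation time; nothing in the code artifact itself could reconstruct it after the fact, which is precisely the article's point.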

The Review That Is Not Happening

The standard institutional response to supply chain risk is review: We will examine what we procure. The evidence strongly suggests this is not functioning.

The structural problem is automation bias: the tendency to defer to automated outputs under time pressure. A systematic review of 74 studies across aviation, healthcare, military operations, and nuclear safety documented this pattern consistently: When an automated system provides an output, human oversight degrades. The mechanism is straightforward. A developer requests code, the AI generates something that looks reasonable, and the developer’s cognitive evaluation shifts from “Is this correct?” to “Does this look wrong?” Those are different questions, and the second one misses far more. Developers routinely retain the vast majority of AI-generated suggestions, and enterprise analyses show that AI-assisted teams ship dramatically more security findings per month — findings that a functioning review process should have caught before deployment.

Even when humans do review, the results are poor. A controlled study hired thirty professional developers to review a web application containing seven known vulnerabilities. No developer found all seven. One in five found zero. The average detection rate was 33 percent. But here is the finding that should alarm defense policymakers: simply instructing reviewers to focus specifically on security — rather than letting them review in their default mode — improved detection by a factor of eight. Without that instruction, experienced developers routinely missed well-documented flaws. The default cognitive mode of a code reviewer is functionally blind to security issues. Nobody told them to look.

A Stanford study found the inverse relationship: Developers using AI assistants wrote measurably less secure code while reporting higher confidence in its security. The developers with the least secure code rated their trust in AI at 4.0 out of 5.0. Those with the most secure code rated it at 1.5. The system selects for the worst combination of overconfidence and under-competence.

The Monoculture

The market is moving faster than most institutional analyses have tracked. Claude Code, launched in May 2025, went from 4 percent developer adoption to 63 percent by February 2026, reaching an estimated $2.5 billion in annualized revenue within ten months. OpenAI’s Codex tripled its weekly active users to 1.6 million after the GPT-5.3 release in February 2026. Cursor still produces roughly one billion lines of accepted code per day. GitHub Copilot maintains over 20 million all-time users. OpenCode, an open-source coding agent with 95,000 GitHub stars, adds yet another layer: a community-maintained tool that wraps around any model, introducing its own unaudited link in the chain.

The number of tools keeps growing, but the number of underlying models has not. Nearly all of them run on three or four foundation models with substantially overlapping training data. The U.S. National Institute of Standards and Technology formally identifies this condition as an “algorithmic monoculture,” and the parallel to agriculture is structural, not metaphorical. The Irish Potato Famine killed a million people because the entire crop was genetically identical and a single blight could destroy it all. A successful poisoning technique deployed against one of these foundation models would propagate identical vulnerabilities into every tool, every organization, and every defense system built using any of them.

A September 2025 analysis of a Fortune 50 enterprise found that teams using AI coding assistants shipped ten times more security findings alongside four times the development velocity. The organization was generating 10,000 new security vulnerabilities per month. Both the velocity and the risk were real; they are parts of the same phenomenon.

The Ban That Rewards Its Own Violation

Some national defense organizations have responded to these risks by restricting or banning AI coding tools. This creates a second enforcement problem that is, if anything, harder than the supply chain problem.

Developers use the tools anyway. The productivity gain is too large to leave on the table. A developer using Claude Code or Copilot ships features faster, resolves issues sooner, and delivers more completed work per cycle than one who does not. The organization’s own performance metrics — the ones that determine promotions, bonuses, and contract renewals — select for exactly the behavior the policy prohibits. The developer who follows the ban works slower. The developer who ignores it gets praised for their output. Nobody asks why.

This is not hypothetical. Surveys consistently show that a majority of developers use AI tools regardless of organizational policy, and adoption rates are highest among the most productive developers. The ban does not prevent use. It prevents visibility. It pushes the tools underground, where there is no logging, no review process, no institutional awareness of which code was touched, and no possibility of the mitigations described below. A ban that cannot be enforced does not reduce risk. It merely eliminates the organization’s ability to manage it.

What Helps

There are no clean solutions to a problem this deeply embedded. But there are interventions that materially change the risk profile, and most are available now.

The most consequential shift is institutional honesty: accepting that AI-generated code is already in the defense supply chain at every level. This is not entirely new territory. Large corporations have long had tens of thousands of employees copy-pasting code or pulling in modules from pseudonymous developers, and they faced the same supply chain risks well before AI entered the picture. The logical route is to redirect the energy currently spent on unenforceable prohibition toward building verification infrastructure.

That means using multiple AI models against each other: If one model generates code and the same model reviews it, the review cannot catch systematic blind spots, because the reviewer shares them. It means demanding tool-level provenance records from suppliers, even knowing those records are imperfect, because an imperfect audit trail that narrows the search when a model is compromised is better than auditing everything or auditing nothing. It means instructing code reviewers explicitly to focus on security, which the evidence shows improves detection by a factor of eight and costs nothing to implement. And it means investing in runtime monitoring — watching how code actually behaves in production rather than relying entirely on pre-deployment review that the evidence says does not work.
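The cross-review idea can be sketched in a few lines. This is an illustrative harness, not an existing tool: each reviewer is assumed to be a different foundation model behind a common interface, and code is accepted only when a quorum of independent reviewers approves it, so a blind spot shared by any single model no longer guarantees acceptance on its own.

```python
from typing import Callable, List

# Hypothetical reviewer interface: a callable that inspects a code string
# and returns True if it finds no security concerns. In practice each
# reviewer would wrap a *different* foundation model.
Reviewer = Callable[[str], bool]

def cross_model_review(code: str, reviewers: List[Reviewer],
                       quorum: int) -> bool:
    """Accept code only if at least `quorum` independent reviewers approve.

    The value of the scheme comes entirely from reviewer diversity: if all
    reviewers share training data, they also share blind spots, and the
    quorum adds nothing.
    """
    approvals = sum(1 for review in reviewers if review(code))
    return approvals >= quorum
```

The design choice worth noting is the quorum threshold: requiring unanimity maximizes the chance of catching a single-model blind spot but also maximizes false rejections, so the threshold is a tunable trade-off rather than a fixed rule.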

None of these are solutions. Rather, they are mitigations. The honest standard is “better than nothing,” and the gap between nothing and these measures is the difference between structural indefensibility and a reasonable chance of detecting compromise.

The Margin

In March 2024, a Microsoft engineer named Andres Freund noticed that Secure Shell connections were taking half a second longer than expected. He was not investigating a security issue. He was benchmarking a database and the latency annoyed him. What he found was a backdoor in XZ Utils — a compression library embedded in virtually every Linux system on earth — planted by an attacker who had spent over two years building trust through legitimate contributions. The vulnerability received the maximum possible severity score. The entire defense was one person’s curiosity about a 500-millisecond delay.

That attack took one person thirty months to execute. It compromised one library. It was caught by one engineer who happened to be looking at the right metric on the right day.

What makes the AI supply chain threat categorically different is scale and speed. A compromised foundation model does not inject a backdoor into one library. It injects vulnerabilities into every codebase touched by every developer using every tool built on that model — simultaneously, continuously, and at a volume measured in billions of lines per month. The XZ Utils attacker needed two years of social engineering to compromise one project. A training data poisoning attack needs 160 files to compromise a model that generates code for millions. The attack surface is not a single point of failure. It is the entire surface.

And the speed compounds the problem. These tools do not operate on human timescales. Claude Code, Cursor, and Copilot generate code faster than any review process can absorb. A vulnerability introduced through a model update on Monday morning is in production code by Monday afternoon, merged by a developer who saw nothing wrong, reviewed by a tool that shares the same blind spots, and deployed into systems that no nation can audit at the rate the code is being written.

National defense organizations can build the verification infrastructure to manage this reality. Or they can write policy that pretends the decision has not yet been made and discover the consequences when the next compromise is not one library over thirty months, but every codebase, everywhere, all at once.

The monoculture is planted. The crop looks healthy. History has a consistent opinion about what happens next.

 

Markus Sandelin is the AI Lead for the Command and Control Centre and Domain Architect (Medical) at the NATO Communications and Information Agency. He builds federated AI memory systems with conflict detection and is preparing to publish on delta evaluation architecture for tactical networks at the 2026 International Conference on Military Communication and Information Systems. He is particularly interested in what happens when machines need to determine whether other machines are telling the truth. The views expressed are his own.

Image: Senior Airman Solomon Cook via DVIDS.
