The FSJ interviewed a military technology journalist and expert on foreign policy to learn about the risks and rewards of AI in our industry.
Foreign Service Journal: What risks related to AI are we not seeing sufficiently? What are the consequences if we fail to account for those risks?
Kelsey D. Atherton: Automation makes what it automates opaque and then invisible. When the product works as intended, from automating a summary of an email to successfully navigating a vehicle down a city street, we don’t notice if there are mistakes in the process. But when accidents happen or errors emerge, the opacity of their creation, and the lack of awareness of the error on the part of the human employing the automated process, carry outsized risk. One of the more novel threats is unexplained emergent behavior, where an automated process not only fails but fails differently than a human would be expected to fail.
Safeguards, safe evaluation of responses under adversarial conditions via red-teaming, and tools for forensic investigation are all needed to manage, reduce, and mitigate the odds of both expected and unanticipated error. Without those doctrines and practices of accountability, we can expect AI-led automation to fail in novel ways and to catch us by surprise, without adequate preparation or guidance on how to proceed.
FSJ: What is the Pentagon doing on AI that is novel and innovative? Where have they failed? What lessons could the State Department learn from our colleagues in uniform?
KDA: Before I left the military technology beat full time in November 2023, the most compelling use of AI was profoundly boring—automated predictive maintenance and assessment, like using robots and data collection to predict where ships are rusting and need extra help.
One of the flashier ideas explored is synthetic training data: using AI to generate and iterate novel battlefield scenarios that are important for a machine to recognize but for which real examples are likely to be scarce, or certainly scarce at high fidelity.
I think the biggest lesson from the Pentagon is to look at where off-the-shelf AI is already capable (e.g., data extraction, summary, coding) while staying within the bounds of what a human can review, to ensure the process is right. This is the boring work of logistics and personnel management, ensuring that systems are sustained and repaired before they break, which is equally important for everything from aircraft carriers to embassies.
I would also include a major word of caution, especially on experimenting with AI just to say you’ve done it. It remains to be seen if and when the Pentagon’s full embrace of AI tools leads to errors that reflect known errors in the civilian versions of the tools, like racial bias in word association or underemphasizing information based on where in a document it’s presented, but I would say that’s a “when,” not an “if.”
FSJ: How might AI change the job of defense attachés, political officers, and arms-control diplomats overseas?
KDA: AI might lead defense attachés, political officers, or arms-control diplomats to believe they have more information, and more accurate information, in hand than they actually do. The authoritative tone of AI reinforces this. While enterprise versions of AI products promise fewer errors, the genuine limits of large language models (LLMs) mean that much of the text analysis AI is doing is simple word association rather than processing and summarizing text the way a diplomat might.
With generative AI, especially as the fidelity of images improves, it is becoming easier to produce convincing fakes of everything from satellite footage to videos and photos. People working with sensitive information must learn to identify “tells” in AI imagery or other data and should become familiar with ways to verify images as genuine versus generated or modified.
There is also a danger in relying on LLMs as a translation and summarization tool. With automated communication tools in abundance, there’s a real danger of meaning being lost in translation, and an added danger of the tool being corrupted to intentionally produce miscommunication.
FSJ: One of the most important things we do as American diplomats is speak on behalf of the U.S. government overseas. Our authenticity is what gives us our credibility. However, it’s increasingly easy for anyone (diplomats and government officials included) to use AI to churn out content nonstop. Will such AI-generated content overwhelm audiences and lead to a loss of authentic voices? What implications could that have for governments and authorities who need to cut through the noise to reach audiences on topics involving safety and security?
KDA: AI slop, and the fondness for AI-slop imagery as the in-house style of the Trump administration specifically, risks drowning out real information and authentic human experience. While AI slop will likely always be some part of the information ecosystem now, the best way to communicate is still in person. For events, it’s helpful to create and store your own recordings as a check against AI distortions. And generally, when it comes to meeting people where they are, “touch grass diplomacy” (getting out in the real world and meeting your interlocutors face-to-face) can be a breath of fresh air.
FSJ: A plethora of AI services and platforms is out there, and it can be hard to keep track of which AI platform, account, or service is needed for a specific task. How are the best private companies mitigating this AI traffic jam?
KDA: I think the best way to manage competing services is to incorporate them with in-house IT and have program managers track them and check in with staff on whether the AI tools are delivering the capability promised, or whether they’re just another box that needs checking and interferes with existing processes. A good starting point would be to document how processes are currently done before implementing an AI tool, and then check in three to six months after adoption to see what, if anything, has changed.
FSJ: What’s the deal with Claude? Can you explain the significance of the disagreement between the Pentagon and Claude’s creator, Anthropic?
KDA: As best I understand it, Claude is the name of the reclusive AI firm Anthropic’s AI tool, a sort of high-end cousin to ChatGPT or their gutter relative Grok. All three are built on large language models and neural networks, where iterative training and inference based on word association play out in a functional “black box” until the program spits out results that effectively match patterns, often to the point of impressing people as though they were interacting with a sentient being. (In this instance, I’d argue the human users are failing the Turing Test more than the AI tools are passing it.)
As for Anthropic, Claude, and the Pentagon: Claude is aimed at enterprise users, businesses, and bureaucracies, including the Pentagon. As noted in a February 27 response to Secretary Pete Hegseth designating Anthropic a “supply chain risk,” the impasse was reached after a request by Anthropic that their model not be used for “the mass domestic surveillance of Americans and fully autonomous weapons.” Notably, other lawful uses were still permitted by the company. As The Wall Street Journal reported on February 28, Central Command “uses the tool for intelligence assessments, target identification and simulating battle scenarios even as tension between the company and Pentagon ratcheted up.”
This debate cuts to the heart of autonomy and culpability in action. One of the long-feared and warned-about dangers of AI tools in military use specifically is that AI functions as a permission structure, possibly untraceable and likely unaccountable, for generating targets and effectively signing death warrants. Target selection is an inevitable part of war, and the laws of war account for humans operating under orders, issuing bad orders, and the flaws, hazards, and limits of bad intelligence.
But the danger of AI in targeting isn’t just hypothetical: As +972 Magazine reported in April 2024, Israel used an AI target generation tool called Lavender to increase the tempo of authorized targets and attacks in Gaza, an automated process that substituted inferred machine judgment for the difficult and verifiable work of other intelligence gathering. The danger is real and realized: “The result, the sources testify, was that the role of human personnel in incriminating Palestinians as military operatives was pushed aside, and AI did most of the work instead. … Lavender—which was developed to create human targets in the current war—has marked some 37,000 Palestinians as suspected ‘Hamas militants,’ most of them junior, for assassination (the Israeli Defense Forces spokesperson denied the existence of such a kill list in a statement to +972 and Local Call).”
This is likely a long digression, but Anthropic was perfectly fine working within a broad set of lawful bounds for the military, and drew a line, I think, at least as much out of reputational risk and culpability for undeniable war crimes as anything else. This is a defense contractor looking to outlast the present administration and avoid being jailed for aiding and abetting its crimes, not necessarily a stalwart defender of human interests in the face of a new conflict. Given that the secretary has gone on to say the war will not be conducted with “stupid rules of engagement,” anyone hedging their bets on the potential for future consequences from these actions is likely to want to distance themselves from Operation Epic Fury.