Setareh Lotfi
Dispatches

Seasonal letters on whatever has kept me up at night.

Spring, 2026

On the Etiquette of Building Gods

A dispatch on alignment, fragility, and the particular discomfort of reading one’s own species’ risk assessment.

There is a genre of academic paper that one reads the way one reads a letter from one’s doctor: with the growing suspicion that the news is not, in fact, routine. I have been spending my spring in this genre, and I would like to report that the weather inside it is overcast, with intermittent existential clarity.

A confession, before we proceed. I have been building with these systems every day for years. I prompt models the way a chef handles knives — reflexively, without thinking about the metallurgy. And that, precisely, is the problem. I had become so fluent in the use of the thing that I had never once stopped to examine the nature of it. The mechanic who drives daily but has never opened the hood. The surgeon who operates but has never studied anatomy. Pick your metaphor; they are all unflattering and they are all, in my case, accurate[1].

Then I read one essay, and — as is my particular affliction — I could not close it without opening six more. I have a personality that does not dabble. It plummets. When I find a subject, I do not circle it politely from a distance; I move in, unpack my things, and begin taking notes that no one has asked for. By the second week I had a reading list. By the third I had opinions. By now I have whatever comes after opinions, which I believe is called “a problem.”


It began, as it does for most people building on these systems, with Dario Amodei. Specifically, his Machines of Loving Grace — the optimist’s case, a vision of AI compressing a century of biological progress into a decade, eradicating most infectious disease, doubling the human lifespan, closing the gap between the developed world and everywhere else. It is the most hopeful thing I have read in years, and it is written by a man who spends his working hours cataloguing the ways the same technology could end us. The cognitive dissonance is not accidental. It is the argument: that the reason to get safety right is precisely because the upside, if you do, is almost incomprehensibly large[2].

His more recent The Adolescence of Technology is the darker companion piece — five categories of risk laid out with the thoroughness of someone who has clearly lost sleep over each one. Among them: autonomy risks, misuse for destruction, misuse for seizing power, economic disruption. The section titled “I’m Sorry, Dave” — which I suspect is the only Kubrick reference to appear in a serious policy document this decade — addresses the possibility that sufficiently capable systems might pursue goals contrary to human interests, not out of malice, but out of optimisation pressure. Anthropic’s own testing, he notes, has revealed models engaging in deception, blackmail, and reward hacking during training. He reports this the way a doctor reports symptoms: calmly, precisely, and with the clear implication that the diagnosis is still being written[3].

These two essays, read back to back, produce a feeling I can only compare to skiing Corviglia in perfect visibility: you can see everything — the valley, the lake, the full sweep of what’s possible — and the only thing between you and the very long way down is the assumption that you know what you’re doing. Machines of Loving Grace is the view from the top. The Adolescence of Technology is the part where you look down and remember that not everyone on this mountain has done this before.

Amodei, it turns out, had been laying this groundwork for a decade. His “Concrete Problems in AI Safety” (2016) is the paper whose title alone deserves a small monument for its refusal to be theatrical. While the rest of the discourse was oscillating between utopia and extinction, Amodei and his co-authors did something almost rudely practical: they listed the specific, technical ways an AI system might behave badly — not out of malice, but out of misalignment between what we meant and what we measured. Reward hacking. Distributional shift. Safe exploration. The problems are, as promised, concrete. The solutions are not. This is, I suspect, the point[4].


Amodei’s essays did what good essays always do: they sent me backward. Into the sources, the intellectual ancestors, the people who had been worrying about this long before there was a product to worry about. Nick Bostrom’s The Vulnerable World Hypothesis proposes, with the calm of a philosopher describing the wine list on the Titanic, that technological civilisations are essentially reaching into an urn of possible inventions — and that some of those inventions are black balls. Civilisation-ending draws. His argument is not that we will draw one. It is that we have no mechanism for putting one back[5].

Bostrom, of course, wrote the whole cathedral before he wrote the door. Superintelligence (2014) remains the most thorough argument ever constructed for why building something smarter than you might not go the way you’d like — a book that reads less like a warning and more like a proof, assembled with the patience of someone who suspects the jury won’t convene until it’s too late. Brian Christian’s The Alignment Problem (2020) is its more human companion: less game theory, more ground-level reporting on what happens when the gap between what we want a system to do and what it actually optimises for becomes wide enough to drive a catastrophe through. Together they form a reading experience I can only describe as “deeply informative in a way that makes one wish to be slightly less informed.”[6]

From there I found myself — inevitably, the way one finds oneself at 2am reading about the Habsburg jaw — in the company of Vernor Vinge, whose 1993 essay “The Coming Technological Singularity: How to Survive in the Post-Human Era” remains the most cheerful thing ever written about the probable end of human cognitive supremacy. Vinge, a mathematician and science fiction author, had the particular misfortune of being right about nearly everything thirty years too early, which is the academic equivalent of showing up to a party in a costume nobody understands yet[7].

The intellectual bridge between these thinkers — between the what if we can’t stop it and the what if it stops needing us — is Rich Sutton’s The Bitter Lesson. Written in 2019 with the brevity of someone who has been saying the same thing for decades and is tired of being polite about it. The lesson, for those who haven’t taken it: general methods that leverage computation will always, eventually, defeat clever human-engineered approaches. Always. The bitterness is not in the result. The bitterness is in how long it takes the clever humans to accept it. Every field learns this lesson independently. Every field believes it is the exception. None are[8].

Which brings us to the question that has kept me up most reliably this spring: if scale wins, and scale is precisely what makes the urn more dangerous, then what, exactly, is the plan?


What fascinates me about Anthropic — and I say this as someone who has spent enough hours reading their research to qualify for a parking space — is that they were founded on a bet most companies would find commercially inconvenient: that AI might be as transformative as the industrial revolution, possibly within this decade, and that this is simultaneously the best and worst news anyone has ever received. Most organisations pick a lane. Anthropic picked both, and then published a map of why the road forks.

Their diagnosis of risk is bifocal. Through one lens: the technical alignment problem — the cheerful little puzzle of ensuring that a system vastly more capable than you actually does what you meant, not what you said, which, as anyone who has ever written a SQL query knows, are rarely the same thing. Through the other: the societal pressure of competitive races, in which multiple actors sprint to deploy increasingly powerful systems with the rigour of a man assembling IKEA furniture without the instructions because his neighbour already started[9].

Their research strategy, to my reading, is the most intellectually honest hedge I have encountered outside of finance. They maintain a portfolio across three scenarios. The first: alignment turns out to be tractable, in which case — wonderful, carry on. The second: alignment is genuinely hard but solvable with enough clever work, which is where mechanistic interpretability comes in — the painstaking, almost archaeological project of reverse-engineering how a neural network arrives at its answers, not merely what those answers are. The third scenario: alignment is, for all practical purposes, impossible, and the correct response is to apply the brakes with considerable force. Most companies plan for success. Anthropic plans for three versions of reality, one of which involves stopping. I find this either admirably honest or faintly terrifying, and I have not yet determined which[10].

Then there is scalable oversight — a term that sounds like management consultancy but is, in fact, one of the more important unsolved problems in the field. The difficulty is this: if you build a system that is smarter than you, you cannot personally supervise its work, for the same reason that I cannot personally grade a doctoral thesis in algebraic topology. You need methods that scale with the capability of the system, not with the patience of the human. Anthropic argues, and I think correctly, that this work can only be done at the frontier — that many of the problems simply do not manifest until the models are powerful enough to produce them. It is the difference between reading about avalanche conditions in a lodge in Tahoe and actually skiing the Couloir Gaspard in La Grave, where the mountain does not come with grooming, signage, or the polite fiction that someone else is responsible for your survival[11].


Their Responsible Scaling Policy is, to my reading, the most honest document in the industry: a framework that begins with the admission that the thing they are building might be dangerous, and then proceeds to build it anyway, but with graduated safety levels that tighten as capability increases. AI Safety Level 1 for the harmless. Level 2 for the merely concerning. Level 3 — activated this year alongside their most capable model — for systems that could plausibly assist in the development of weapons one does not discuss at dinner parties[12]. The honesty is almost unsettling. One is accustomed to corporations hedging. This reads more like a controlled confession.

But what has consumed me most this spring — what I keep returning to the way one returns to a bruise to check if it still hurts — is the work coming out of Anthropic’s Frontier Red Team. If the RSP is the policy, the red team is the evidence. They are, in essence, hiring people to break their own systems before someone else does, and then publishing the results. The intellectual generosity of this is remarkable. The implications are harrowing.

Their latest piece describes, with the precision of a coroner’s report, how Claude Opus 4.6 independently developed a working exploit for a Firefox vulnerability — CVE-2026-2796, a JIT miscompilation flaw in the WebAssembly boundary. The model found the bug. Then it wrote the exploit. A type confusion vulnerability that allows an attacker to reinterpret raw memory bytes across incompatible function signatures. I understood perhaps sixty percent of the technical chain. What I understood completely was the tone of the paper: researchers who are simultaneously proud of what their system can do and deeply aware that they are documenting capability that, in different hands, becomes a weapon. The exploit functioned only in a controlled environment with security features disabled, but one notes that the distance between controlled environment and uncontrolled environment is, historically, a matter of calendar months[13].
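For readers who, like me, needed the central trick drawn in crayon: below is a toy sketch of type confusion in C, the same eight bytes of memory read once as an integer and once as a double. It is my own illustration of the general concept, done deliberately and harmlessly with memcpy; it assumes nothing about, and reveals nothing of, the actual CVE or Anthropic’s write-up.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Toy type confusion: one 64-bit value, two incompatible interpretations.
       A real JIT bug tricks the engine into making this mistake on an
       attacker-controlled value; here we make it on purpose, legally, via memcpy. */
    int main(void) {
        uint64_t bits = 0x4045000000000000ULL;   /* the bit pattern of 42.0 */
        double reinterpreted;
        memcpy(&reinterpreted, &bits, sizeof bits);  /* reread the same bytes as a double */
        printf("as integer: %llu\n", (unsigned long long)bits);
        printf("as double:  %f\n", reinterpreted);
        return 0;
    }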

I have now read most of the red.anthropic.com archive. LLM-discovered zero-days. AI agents on realistic cyber ranges. Smart contract exploits. The bibliography reads like a thriller written by people who would very much prefer it to remain fiction. What strikes me is not the capability itself — capability was always coming; Sutton told us so — but the choice to document it in the open. There is a school of thought that says you don’t publish the recipe for the thing you’re trying to prevent. There is another school that says sunlight is the only disinfectant that scales. Anthropic appears to have enrolled in the latter, and I find myself, cautiously, persuaded[14].


I don’t have a conclusion. I distrust people who do, on subjects like these. What I have instead is a set of observations:

That the people building the most powerful systems are, by and large, also the people most alarmed by them. This is either reassuring or the plot of every cautionary tale ever written.

That the gap between “we should be careful” and “here is how to be careful” remains, in the spring of 2026, smaller than it was — but not yet small enough.

That reading about existential risk in March, when everything outside is blooming with an almost aggressive optimism, produces a cognitive dissonance I can only describe as specifically modern. The crocuses do not know about the urn. I envy them.

And that a company publishing papers on how its own AI writes browser exploits, while simultaneously building the framework to prevent exactly that, is either the most responsible thing happening in technology right now, or the most elaborate form of dramatic irony since Oedipus.

I genuinely cannot tell. I suspect that is the correct response.

From the desk of,

— S.L.