America’s AI chokepoint is shifting to model evaluation

Written by Silvia Pavelli

For most of the past three years, the politics of artificial intelligence have revolved around two obvious bottlenecks: the chips that make frontier training possible and the cloud infrastructure that turns those chips into industrial-scale computing power. The new agreements announced by NIST’s Center for AI Standards and Innovation suggest that Washington is trying to occupy a third and potentially more subtle chokepoint: the narrow zone between a model’s completion and its public release.

NIST said on May 5 that CAISI signed new agreements with Google DeepMind, Microsoft, and xAI that will allow the government to conduct pre-deployment evaluations, post-deployment assessments, and targeted research on frontier systems. The agency also said it has already completed more than 40 evaluations, including on unreleased state-of-the-art models, and that the agreements support testing in classified environments. That may sound administrative. It is not. It signals that the U.S. government is trying to move from commenting on frontier AI risk to embedding itself in the release process.

This matters because evaluation is where technical uncertainty becomes political judgment. Training a model is one problem. Deploying it into the world is another. Between those two steps sits a difficult set of questions: how capable is the model at cyber offense, biological misuse, safeguards evasion, or strategic deception? How reliable are the guardrails? Which failure modes are known, and which only appear under adversarial pressure? If government evaluators can examine systems before the public sees them, then the United States gains leverage not by banning research outright but by shaping the conditions under which frontier capability becomes deployable power.

| Layer of frontier AI governance | What dominated the last cycle | What the CAISI agreements suggest now |
| --- | --- | --- |
| Hardware | Export controls, access to advanced chips, compute concentration | Hardware still matters, but it is no longer the only strategic gate |
| Cloud and distribution | Hyperscaler partnerships and model hosting power | Infrastructure remains decisive, yet release governance adds a new control point |
| Safety oversight | Voluntary principles and abstract commitments | Evaluation is becoming operational, repeatable, and partly institutionalized |
| State capacity | Broad policy debates about AI risk | Government is moving into the technical workflow of model assessment itself |

The NIST language is revealing. CAISI says the agreements were renegotiated to reflect its directives from the secretary of commerce and America’s AI Action Plan, and that it will serve as industry’s primary point of contact within the U.S. government for testing, collaborative research, and best-practice development related to commercial AI systems. In plain English, Washington is trying to consolidate an interface with the major model labs. That alone is strategically important. Bureaucratic fragmentation weakens oversight; a designated evaluation node creates the possibility of continuity, institutional memory, and shared testing doctrine.

Microsoft’s companion post reinforces that interpretation. The company says its new arrangements with CAISI in the United States and the UK AI Security Institute are designed to advance the science of AI testing and evaluation through collaborative work on adversarial assessments, shared frameworks, datasets, and workflows for safety, security, and robustness risks. Microsoft also frames national-security testing as something that cannot be done adequately by firms acting alone. That is a notable admission. It suggests the frontier labs increasingly understand that their legitimacy will depend not only on their internal red teams but also on external evaluators with state authority and security expertise.

The geopolitical implication is easy to miss. Much of the global AI race is usually described as a competition over who can build the best model first. But if the United States can institutionalize a privileged evaluation layer for leading domestic or allied labs, then it gains something almost as valuable as raw model leadership: an information advantage about capability trajectories and a normative advantage in defining what counts as acceptable release behavior. In that scenario, the real state asset is not only compute. It is pre-deployment visibility.

That, in turn, changes the relationship between frontier labs and the government. The old story cast regulation as an external force that might arrive late and crudely, after commercial incentives had already outrun public safeguards. The CAISI model points to something more integrated. Labs still build the systems, but the state tries to establish a recurring right to inspect, measure, and stress-test them before they cross into wide deployment. This is softer than licensing every model and narrower than full command-and-control regulation. But strategically it may be more durable, because it targets the stage where release decisions are actually made.

There are limits to this approach. The agreements are voluntary. Their effectiveness depends on the quality of the testing methodologies, the willingness of companies to provide sufficiently unsandboxed access, and the government’s ability to keep pace with fast-moving model architectures. A chokepoint is only real if the actors on both sides continue to treat it as consequential. Moreover, institutional access to leading American labs does not automatically solve the international problem. Chinese frontier labs, open-weight ecosystems, and smaller model developers may not fit into the same governance channel.

Even so, the strategic direction is unmistakable. CAISI says evaluators from across government may participate and that the agency’s TRAINS task force convenes interagency expertise around national-security concerns. That means AI oversight is no longer being framed as a purely consumer-protection or ethics exercise. It is being built as part of the national-security apparatus. The center of gravity is moving from public messaging about responsible AI toward a technical regime for measuring what frontier systems can do before the market absorbs them.

The larger lesson for investors, policymakers, and foreign governments is that AI power will not be determined solely by model quality or semiconductor supply. It will also be shaped by who can build credible institutions at the point where capability becomes permission. In the emerging U.S. approach, evaluation is starting to look less like an academic afterthought and more like strategic infrastructure.

If that holds, then America’s most important AI chokepoint may no longer be only the foundry or the cloud region. It may be the evaluation layer where the state, the lab, and the release decision meet — and where the future terms of frontier AI competition are quietly being written.

Silvia Pavelli

Silvia Pavelli is an Italian journalist and AI correspondent based in Rome. She covers how artificial intelligence is reshaping business, policy, and everyday life across Europe. When she's not chasing a story, she's probably arguing about espresso.