Beyond benchmarks: how we might know if AI is really about to change everything
Written by Andreas Mogensen and Nick Stockton, based on Rob Wiblin’s interview with Ajeya Cotra on The 80,000 Hours Podcast.
What will society look like in 2050? Ajeya Cotra, a technical staff member at METR and former senior advisor at Coefficient Giving, has a stark answer: “I think there’s a pretty good chance that by 2050 the world will look as different from today as today does from the hunter-gatherer era.”
Her expectation is that, by the early 2030s, we’ll see AI systems that can outperform the best human experts at any task that can be done remotely from a computer. Physical-world automation could close the loop within a year or two after that, at which point progress would be limited only by physical constraints.
Even if you don’t buy Cotra’s timeline, her projections are worth taking seriously. She recently finished third among forecasters trying to predict how AI capabilities would advance over the course of 2025. If she’s right and we are about to enter a regime of explosive growth, it would be ideal to have access to a body of leading indicators that can alert us ahead of time, so that we aren’t caught off guard.
In a recent interview with Rob Wiblin on The 80,000 Hours Podcast, Cotra laid out the indicators she thinks matter most — observed productivity gains, automated decision-making inside companies, and rigorous testing of AI’s real-world impacts. She also noted the challenges that are likely to arise in gathering and communicating that data in a way that can steer public debate. This post summarises the key ideas from their discussion.
Why benchmarks fail
Progress in AI capabilities is often measured using benchmarks like SWE-bench Pro, which evaluates a model’s expertise in software engineering. However, benchmarks have a number of important limitations. As Cotra notes, “benchmarks are sort of clean and contained, and the real world is messy and open-ended.” As a result, benchmarks are liable to overstate how well models can automate real-life work tasks. Benchmarks also tend to get maxed out and replaced, meaning they’re poorly suited to serve as a stable yardstick for measuring progress over time.
Cotra also notes that it’s up to AI companies when they announce benchmark scores, which they typically do when releasing a new model to the public. That means they could be sitting on models that have made significant capability gains and are on the verge of kick-starting a rapid positive feedback loop wherein better and better AI models iteratively create even better models. Cotra points to the scenario depicted in AI 2027, where the leading company ends up so far ahead of its competitors that it can simply release products marginally better than the state of the art while keeping its real internal frontier secret. In that world, a dangerous takeoff could already be underway long before the public sees evidence of it.
To address this problem, Cotra suggests that companies should ideally “release their highest internal benchmark score at some calendar time cadence.” For example, they might report updated benchmark scores for key capabilities like cybersecurity and software engineering every three months. This would allow the public to more systematically track progress in AI capabilities, rather than relying on companies to announce new results when it suits them.
Measuring what actually matters
More importantly, we need to look beyond benchmarks. Cotra says, “we need some kind of real-world measure before we can start sounding the alarm. And the ultimate real-world measure is actually just observed productivity.” She especially highlights the importance of understanding how improved AI capabilities impact productivity within leading AI companies. After all, that’s where we’ll need to look for evidence that an automated R&D feedback loop might be taking off.
Whereas CEOs at AI companies may brag about what percentage of their code is written by AI, Cotra thinks it’s more important to look at “what fraction of pull requests to your internal codebase were mostly written by AI and mostly reviewed by AI — so humans are not involved for the most part in both sides of this equation.” A pull request is basically a request to merge some newly written bit of code into an existing software project, which needs to be reviewed and approved by others working on the project. More generally, Cotra thinks it’s important to look at “how much higher-level decision-making authority is being given to the AIs in practice inside the companies.”
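To make the metric concrete, here is a minimal Python sketch of how such a fraction could be computed. The record fields (`ai_authored_share`, `ai_reviewed_share`) and the 0.5 cut-off for "mostly" are assumptions made for illustration; neither Cotra nor any company has specified them, and the sample data is invented.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    # Hypothetical fields: share of the work done by AI, from 0.0 to 1.0.
    ai_authored_share: float
    ai_reviewed_share: float


def mostly_ai_fraction(prs, threshold=0.5):
    """Fraction of PRs where AI did more than `threshold` of BOTH writing and reviewing."""
    if not prs:
        return 0.0
    hits = sum(
        1
        for pr in prs
        if pr.ai_authored_share > threshold and pr.ai_reviewed_share > threshold
    )
    return hits / len(prs)


# Invented sample: three PRs are mostly AI-written, but only two are also mostly AI-reviewed.
prs = [
    PullRequest(0.9, 0.8),
    PullRequest(0.9, 0.1),
    PullRequest(0.2, 0.9),
    PullRequest(0.7, 0.6),
]
print(mostly_ai_fraction(prs))  # 0.5
```

In this invented sample, a headline figure based on authorship alone would be 75%, while the stricter measure Cotra describes comes out at 50%, because it also requires that humans are mostly absent from review.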
Another way to directly measure AI’s impact on productivity is through randomised controlled trials (RCTs). So far, RCTs have produced mixed results. METR’s initial productivity uplift study from 2025 found that AI actually slowed down experienced software developers. A more recent METR RCT points in the opposite direction, though methodological issues limit confidence in the results. Rather than treating these mixed findings as discouraging, Cotra argues they’re exactly why we need more RCTs. “I really want to get a lot more evidence about that of all kinds, like big uplift RCTs, and it would be great if companies were into internally conducting RCTs on their own rollouts of internal products.”
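As a rough illustration of what an uplift RCT estimates, the sketch below compares mean task completion times between a group working without AI and a group working with it. The function name and the numbers are made up for this example and are not METR's data or method; a real study would use more careful statistics, including confidence intervals.

```python
from statistics import mean


def estimated_speedup(control_hours, treatment_hours):
    """Ratio of mean task time without AI to mean task time with AI.

    Above 1.0 suggests AI sped developers up; below 1.0 suggests it slowed them down.
    """
    if not control_hours or not treatment_hours:
        raise ValueError("both groups need at least one observation")
    return mean(control_hours) / mean(treatment_hours)


# Invented completion times in hours, for illustration only.
control = [4.0, 6.0, 5.0]    # tasks randomly assigned to "no AI"
treatment = [5.0, 7.0, 6.0]  # tasks randomly assigned to "AI allowed"
print(round(estimated_speedup(control, treatment), 2))  # 0.83
```

A ratio below 1.0, as in this invented sample, would correspond to a slowdown; with small samples, the ratio can easily land on either side of 1.0 by chance, which is one reason larger trials are more informative.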
Another resource Cotra points to as particularly promising is the Longitudinal Expert AI Panel (LEAP), run by the Forecasting Research Institute. LEAP asks 100–200 AI experts, economists, and superforecasters to make granular predictions about where AI will be in six months, a year, and five years. It covers not just benchmark scores, but things like whether companies will report slowing hiring because of AI, or whether AI will be able to plan a real-world event. Cotra thinks having people make predictions, explain the worldviews behind them, and then check who was right over time might be the most flexible forecasting tool currently available.
An unsolved problem: balancing transparency and confidentiality
One challenge for Cotra’s proposals is that companies may not want to share data about pull requests, productivity uplift, and other important metrics, such as internal incidents involving model misalignment. She and Rob discussed different ways to address the concerns AI companies might have about sharing sensitive data.
One possibility Cotra raises is that the companies could report their individual data to a third-party aggregator, which would then publish an anonymised industry-wide figure. This could allow us to know what’s happening at the frontier, without being able to tie individual results to particular companies. However, Cotra doesn’t think this is likely to provide true anonymity for AI companies, “because there are few enough of them that people would be able to guess.”
Another option would be to empower a government agency to collect this kind of information while keeping it confidential, much as a central bank might be empowered to compel financial institutions to provide information on a confidential basis to inform its assessments of systemic risk. A similar idea is discussed in our recent interview with Dean Ball.
Cotra sees the appeal of this idea, but is sceptical. She thinks public disclosure is likely more useful than confidential reporting to a government body, because the indicators worth tracking are still evolving and interpreting them will require an open scientific conversation. A small, technically understaffed agency, she argues, would struggle to read the evidence quickly enough, or sound an alarm credibly enough, for the wider public and policymakers to respond in time. By way of comparison, Cotra asks us to imagine how hard it would have been for governments to enact the necessary measures to contain the spread of COVID-19 if they were the only ones who had access to the key epidemiological data.
Summing up
The broader lesson here is less about any specific metric and more about a stance. We should treat benchmarks as one input among many. We should watch observable productivity inside the companies and across the AI stack. We need to fund the rigorous empirical work that benchmarks can’t substitute for. And finally, we should push for the disclosure regime that would make all of it legible in time to matter.
The full interview with Cotra is below; you can find the transcript and links to learn more on the 80,000 Hours episode page.
If you’d like to learn more about how you can use your career to reduce risks from transformative AI, look no further than our in-depth guide.

