<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Bounded Regret]]></title><description><![CDATA[AI, science, forecasting, philosophy]]></description><link>https://bounded-regret.ghost.io/</link><image><url>https://bounded-regret.ghost.io/favicon.png</url><title>Bounded Regret</title><link>https://bounded-regret.ghost.io/</link></image><generator>Ghost 5.51</generator><lastBuildDate>Tue, 13 Jun 2023 06:47:43 GMT</lastBuildDate><atom:link href="https://bounded-regret.ghost.io/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[What will GPT-2030 look like?]]></title><description><![CDATA[GPT-4 surprised many people with its abilities at coding, creative brainstorming, letter-writing, and other skills. How can we be less surprised by developments in machine learning? In this post, I’ll forecast the properties of large pretrained ML systems in 2030.]]></description><link>https://bounded-regret.ghost.io/what-will-gpt-2030-look-like/</link><guid isPermaLink="false">6480a2dc78e4fd0001871b4d</guid><dc:creator><![CDATA[Jacob Steinhardt]]></dc:creator><pubDate>Wed, 07 Jun 2023 23:39:49 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>GPT-4 surprised many people with its abilities at coding, creative brainstorming, letter-writing, and other skills. Surprises in machine learning are not restricted to GPT-4: I was <a href="https://bounded-regret.ghost.io/ai-forecasting-one-year-in/">previously surprised</a> by Minerva&#x2019;s mathematical abilities, as were many competitive forecasters.</p>
<p>How can we be less surprised by developments in machine learning? Our brains often implicitly make a <a href="https://forecasting.quarto.pub/book/zeroth-first.html?ref=bounded-regret.ghost.io">zeroth-order forecast</a>: looking at the current state of the art, and adding on improvements that &#x201C;feel reasonable&#x201D;. But what &#x201C;seems reasonable&#x201D; is prone to cognitive bias, and will underestimate progress in a fast-moving field like ML. A more effective approach is <a href="https://forecasting.quarto.pub/book/zeroth-first.html?ref=bounded-regret.ghost.io#first-order-approximation">first-order forecasting</a>: quantifying the historical rate of progress and extrapolating it forward, while also considering reasons for possible slowdowns or speedups.<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup></p>
<p>In this post, I&#x2019;ll use this approach to forecast the properties of large pretrained ML systems in 2030. I&#x2019;ll refer throughout to &#x201C;GPT<sub>2030</sub>&#x201D;, a hypothetical system that has the capabilities, computational resources, and inference speed that we&#x2019;d project for large language models in 2030 (but which was likely trained on other modalities as well, such as images). To forecast GPT<sub>2030</sub>&#x2019;s properties, I consulted a variety of sources, including empirical scaling laws, projections of future compute and data availability, velocity of improvement on specific benchmarks, empirical inference speed of current systems, and possible future improvements in parallelism.</p>
<p>GPT<sub>2030</sub>&#x2019;s capabilities turn out to be surprising (to me at least). In particular, GPT<sub>2030</sub> will enjoy a number of significant advantages over current systems<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup>, as well as (in at least some important respects) current human workers:</p>
<ol>
<li>GPT<sub>2030</sub> will likely be superhuman at various specific tasks, including coding, hacking, and math, and potentially protein design (<a href="#1-specific-capabilities">Section 1</a>).</li>
<li>GPT<sub>2030</sub> can &#x201C;work&#x201D; and &#x201C;think&#x201D; quickly: I estimate it will be 5x as fast as humans as measured by words processed per minute <em>[range: 0.5x-20x]</em><sup class="footnote-ref"><a href="#fn3" id="fnref3">[3]</a></sup>, and that this could be increased to 125x by paying 5x more per FLOP (<a href="#2-inference-speed">Section 2</a>).</li>
<li>GPT<sub>2030</sub> can be copied arbitrarily and run in parallel. The organization that trains GPT<sub>2030</sub> would have enough compute to run many parallel copies: I estimate enough to perform 1.8 million years of work when adjusted to human working speeds <em>[range: 0.4M-10M years]</em> (<a href="#3-throughput-and-parallel-copies">Section 3</a>). Given the 5x speed-up in the previous point, this work could be done in 2.4 months.</li>
<li>GPT<sub>2030</sub>&apos;s copies can share knowledge due to having identical model weights, allowing for rapid parallel learning: I estimate 2,500 human-equivalent years of learning in 1 day (<a href="#4-knowledge-sharing">Section 4</a>).</li>
<li>GPT<sub>2030</sub> will be trained on additional modalities beyond text and images, possibly including counterintuitive modalities such as molecular structures, network traffic, low-level machine code, astronomical images, and brain scans. It may therefore possess a strong intuitive grasp of domains where we have limited experience, including forming concepts that we do not have (<a href="#5-modalities-tools-and-actuators">Section 5</a>).</li>
</ol>
<p>These capabilities would, at minimum, accelerate many areas of research while also creating serious vectors for misuse (<a href="#6-implications-of-gpt-2030">Section 6</a>). Regarding misuse, GPT<sub>2030</sub>&apos;s programming abilities, parallelization, and speed would make it a potent cyberoffensive threat. Additionally, its rapid parallel learning could be turned towards human behavior and thus used to manipulate and misinform with the benefit of thousands of &quot;years&quot; of practice.</p>
<p>On acceleration, a main bottleneck will be autonomy. In a domain like mathematics research where work can be checked automatically, I&#x2019;d predict that GPT<sub>2030</sub> will outcompete most professional mathematicians. In machine learning, I&#x2019;d predict that GPT<sub>2030</sub> will independently execute experiments and generate plots and write-ups, but that graduate students and research scientists will provide direction and evaluate results. In both cases, GPT<sub>2030</sub> will be an integral part of the research process.</p>
<p>My forecasts of GPT<sub>2030</sub>&#x2019;s properties are not intuitive from looking at today&#x2019;s systems, and they may be wrong, since there is significant uncertainty about how ML will look in 2030. However, properties (1.-5.) above are my median bet, and whatever GPT<sub>2030</sub> is like, I doubt it will be &#x201C;GPT-4 but a bit better&#x201D;.</p>
<p>If I&#x2019;m right, then whatever the impacts of AI are, they won&#x2019;t be small. We should be preparing for those impacts now, asking what will happen at the largest scales (on the order of $1T, 10M lives, or significant disruptions to social processes). It&#x2019;s better to be surprised now, rather than in 7 years when the system is already being rolled out.</p>
<h1 id="1-specific-capabilities">1. Specific Capabilities</h1>
<p>I expect GPT<sub>2030</sub> to have superhuman coding, hacking, and mathematical abilities. I also expect it to be superhuman in its ability to read and process large corpora for patterns and insights and to recall facts. Finally, since <a href="https://www.nature.com/articles/s41586-021-03819-2?ref=bounded-regret.ghost.io">AlphaFold</a> and <a href="https://arxiv.org/abs/1712.01815?ref=bounded-regret.ghost.io">AlphaZero</a> had superhuman abilities in protein structure prediction and game-playing, GPT<sub>2030</sub> could as well, for instance if it was trained multimodally on similar data to the AlphaFold/AlphaZero models.</p>
<p><strong>Programming</strong>. GPT-4 outperformed a strong human baseline on LeetCode problems posed after its training cutoff (<a href="https://arxiv.org/abs/2303.12712?ref=bounded-regret.ghost.io">Bubeck et al. 2023</a>, Table 2), and passed mock interviews for several major tech companies (Figure 1.5). The velocity of improvement remains high, with a 19% jump from GPT-3 to GPT-4. On the more challenging CodeForces competition, GPT-4 does less well, but AlphaCode is <a href="https://www.science.org/doi/epdf/10.1126/science.abq1158?ref=bounded-regret.ghost.io">on par with</a> the median CodeForces competitor. On the even more challenging APPS dataset, <a href="https://arxiv.org/abs/2212.10561v2?ref=bounded-regret.ghost.io">Parsel</a> further outperforms AlphaCode (7.8%-&gt;25.5%). Looking forward, the forecasting platform Metaculus gives <a href="https://www.metaculus.com/questions/7398/ai-competency-on-competitive-programming/?ref=bounded-regret.ghost.io">a median year of 2027</a> for 80% on APPS, which would exceed all but the very best humans.<sup class="footnote-ref"><a href="#fn4" id="fnref4">[4]</a></sup></p>
<p><strong>Hacking</strong>. I expect hacking to improve with general coding ability, plus ML models can scour large codebases for vulnerabilities much more scalably and conscientiously than humans. In fact, ChatGPT has already been used to help <a href="https://research.checkpoint.com/2023/opwnai-cybercriminals-starting-to-use-chatgpt/?ref=bounded-regret.ghost.io">generate</a> <a href="https://research.checkpoint.com/2022/opwnai-ai-that-can-save-the-day-or-hack-it-away/?ref=bounded-regret.ghost.io">exploits</a>.</p>
<p><strong>Math</strong>. <a href="https://arxiv.org/abs/2206.14858?ref=bounded-regret.ghost.io">Minerva</a> achieved 50% accuracy on a competition math benchmark (MATH), which is better than most human competitors. The velocity of progress is high (&gt;30% in 1 year), and there is significant low-hanging fruit via <a href="https://arxiv.org/abs/2205.12615?ref=bounded-regret.ghost.io">autoformalization</a>, reducing arithmetic errors, <a href="https://arxiv.org/abs/2207.10342?ref=bounded-regret.ghost.io">improving chain-of-thought</a>, and <a href="https://arxiv.org/abs/2206.14858?ref=bounded-regret.ghost.io">better data</a><sup class="footnote-ref"><a href="#fn5" id="fnref5">[5]</a></sup>. Metaculus predicts <a href="https://www.metaculus.com/questions/11675/math-sota-ai-performance/?ref=bounded-regret.ghost.io">92% on MATH by 2025</a>, and gives a <a href="https://www.metaculus.com/questions/6728/ai-wins-imo-gold-medal/?ref=bounded-regret.ghost.io">median year of 2028</a> for AI winning a gold medal at the International Math Olympiad, on par with the best high school students in the world. I personally expect GPT<sub>2030</sub> to be better than most professional mathematicians at proving well-posed theorems.<sup class="footnote-ref"><a href="#fn6" id="fnref6">[6]</a></sup></p>
<p><strong>Information processing</strong>. Factual recall and processing large corpora are natural consequences of language models&#x2019; memorization capabilities and large context windows. Empirically, GPT-4 achieves <a href="https://cdn.openai.com/papers/gpt-4.pdf?ref=bounded-regret.ghost.io">86% accuracy on MMLU</a>, a broad suite of standardized exams including the bar exam, MCAT, and college math, physics, biochemistry, and philosophy; even accounting for likely train-test contamination, this probably exceeds the breadth of knowledge of any living human. Regarding large corpora, <a href="https://arxiv.org/abs/2302.14233?ref=bounded-regret.ghost.io">Zhong et al. (2023)</a> used GPT-3 to construct a system that discovered and described several previously unknown patterns in large text datasets, and scaling trends on a related task in <a href="https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html?ref=bounded-regret.ghost.io#sec-assistant-trends">Bills et al. (2023)</a> suggest that models will soon be superhuman. Both of these works exploit the large context windows of LLMs, which are now over <a href="https://www.anthropic.com/index/100k-context-windows?ref=bounded-regret.ghost.io">100,000 tokens</a> and growing.</p>
<p>More generally, <strong>ML models have a different skill profile than humans</strong>, since humans and ML were adapted to very different data sources (evolution vs. massive internet data). At the point that models are human-level at tasks such as video recognition, they will likely be superhuman at many other tasks (such as math, programming, and hacking). Furthermore, additional strong capabilities will <a href="https://bounded-regret.ghost.io/future-ml-systems-will-be-qualitatively-different/">likely emerge over time</a> due to larger models and better data, and there is no strong reason to expect model capabilities to &#x201C;level out&#x201D; at or below human-level. While it is possible that current deep learning approaches will fall short of human-level capabilities in some domains, it is also possible that they will surpass them, perhaps significantly, especially in domains such as math that humans are not evolutionarily specialized for.</p>
<h1 id="2-inference-speed">2. Inference Speed</h1>
<p><em>(Thanks to Lev McKinney for running the performance benchmarks for this section.)</em></p>
<p>To study the speed of ML models, we&#x2019;ll measure how quickly ML models generate text, benchmarking against the human thinking rate of 380 words per minute (<a href="https://journals.sagepub.com/doi/abs/10.2466/pms.1990.71.3.1043?ref=bounded-regret.ghost.io">Korba (2016)</a>, see also <a href="#a-words-per-minute">Appendix A</a>). Using OpenAI&apos;s <a href="https://platform.openai.com/docs/guides/chat?ref=bounded-regret.ghost.io">chat completions API</a>, we estimate that gpt-3.5-turbo can generate 1200 words per minute (wpm), while gpt-4 generates 370 wpm, as of early April 2023. Smaller open source models like <a href="https://huggingface.co/EleutherAI/pythia-12b-deduped?ref=bounded-regret.ghost.io">pythia-12b</a> achieve at least 1350 wpm with out-of-the-box tools on an A100 GPU, and twice this appears possible with further optimization.</p>
<p>Thus, if we consider OpenAI models as of April, we are either at roughly 3x human speed (gpt-3.5-turbo) or roughly human speed (gpt-4). I predict that models will have faster inference speed in the future, as there are strong commercial and practical pressures towards speeding up inference. Indeed, in the week leading up to this post, GPT-4&#x2019;s speed already increased to around 540 wpm (12 tokens/second), according to <a href="https://fabienroger.github.io/trackoai/?ref=bounded-regret.ghost.io">Fabien Roger&#x2019;s tracking data</a>; this illustrates that there is continuing room and appetite for improvement.</p>
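<p>These wpm figures are simple conversions from measured token rates. A minimal sketch, using the 1.33 tokens-per-word ratio cited later in the post (Section 3) and the 12 tokens/second GPT-4 rate from the tracking data above:</p>

```python
# Convert a measured token rate to words per minute, then compare it
# against the 380 wpm human thinking rate used throughout this post.
def tokens_per_sec_to_wpm(tokens_per_sec, tokens_per_word=1.33):
    return tokens_per_sec * 60 / tokens_per_word

wpm = tokens_per_sec_to_wpm(12)     # GPT-4 at 12 tokens/second
print(round(wpm))                   # 541 -- the "around 540 wpm" above
print(round(wpm / 380, 2))          # 1.42 -- i.e. ~1.4x human speed
```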
<p>My median forecast is that models will have <strong>5x the words/minute of humans</strong> (range: [0.5x, 20x]), as that is roughly where there would be diminishing practical benefits to further increases, though there are considerations pointing to both higher or lower numbers. I provide a detailed list of these considerations in <a href="#a-words-per-minute">Appendix A</a>, as well as comparisons of speeds across model scales and full details of the experiments above.</p>
<p>Importantly, <strong>the speed of an ML model is not fixed</strong>. Models&#x2019; serial inference speed can be <a href="https://bounded-regret.ghost.io/how-fast-can-we-perform-a-forward-pass/">increased by $k^2$ at a cost of a $k$-fold reduction in throughput</a> (in other words, $k^3$ parallel copies of a model can be replaced with a single model that is $k^2$ times faster). This can be done via a parallel tiling scheme that theoretically works even for large values of $k^2$, likely at least 100 and possibly more. Thus, a model that is 5x human speed could be sped up to 125x human speed by setting $k=5$.</p>
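<p>The trade-off above can be made concrete with a few lines of arithmetic. This is a sketch of the claim as stated, not of the tiling scheme itself:</p>

```python
# k^3 parallel copies can be exchanged for one copy running k^2 times
# faster, so serial speed rises by k^2 while throughput falls k-fold.
def serial_speedup(base_speed_vs_human, k):
    serial_speed = base_speed_vs_human * k**2   # one accelerated copy
    throughput_reduction = k                    # k-fold fewer total tokens/sec
    return serial_speed, throughput_reduction

speed, cost = serial_speedup(5, k=5)
print(speed)   # 125 -- a 5x-human model reaches 125x human speed
print(cost)    # 5   -- at a 5-fold reduction in total throughput
```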
<p>An important caveat is that speed is not necessarily matched by quality: as discussed in <a href="#1-specific-capabilities">Section 1</a>, GPT<sub>2030</sub> will have a different skill profile than humans, failing at some tasks we find easy and mastering some tasks we find difficult. We should therefore not think of GPT<sub>2030</sub> as a &quot;sped-up human&quot;, but as a &quot;sped-up worker&quot; with a potentially counterintuitive skill profile.</p>
<p>Nevertheless, considering speed-ups is still informative, especially when they are large. For language models with a 125x speed-up, cognitive actions that take us a day could be completed in minutes, assuming they were within GPT<sub>2030</sub>&apos;s skill profile. Using the earlier example of hacking, exploits or attacks that are slow for us to generate could be created quickly by ML systems.</p>
<h1 id="3-throughput-and-parallel-copies">3. Throughput and Parallel Copies</h1>
<p>Models can be copied arbitrarily subject to available compute and memory. This allows them to quickly do any work that can be effectively parallelized. In addition, once one model is fine-tuned to be particularly effective, the change could be immediately propagated to other instances. Models could also be distilled for specialized tasks and thus run faster and more cheaply.</p>
<p>There will likely be enough resources to run many copies of a model once it has been trained. This is because training a model requires running many parallel copies of it, and whatever organization trained the model will still have those resources at deployment time. We can therefore lower bound the number of copies by estimating training costs.</p>
<p>As an example of this logic, the cost of training GPT-3 was enough to run it for 9 x 10<sup>11</sup> forward passes. To put that into human-equivalent terms, humans think at 380 words per minute (see <a href="#a-words-per-minute">Appendix A</a>) and one word is 1.33 tokens on average, so 9 x 10<sup>11</sup> forward passes corresponds to ~3400 years of work at human speed. Therefore, the organization could run 3400 parallel copies of the model for a full year at human working-speeds, or potentially the same number of copies for 2.4 months at 5x human speed. <em>(Note: This latter point depends on how many parallel instances the organization can run, see footnote<sup class="footnote-ref"><a href="#fn7" id="fnref7">[7]</a></sup> for details.)</em></p>
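<p>The GPT-3 arithmetic above is easy to reproduce end-to-end (treating one forward pass as generating one token, as the post does):</p>

```python
# The cost of training GPT-3, expressed as forward passes, converted into
# human-equivalent years via the 380 wpm and 1.33 tokens-per-word figures.
forward_passes = 9e11
human_wpm = 380
tokens_per_word = 1.33

human_tokens_per_year = human_wpm * tokens_per_word * 60 * 24 * 365
years_of_work = forward_passes / human_tokens_per_year
print(round(years_of_work))   # 3388 -- the "~3400 years of work" above
```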
<p>Let&apos;s next project this same &#x201C;training overhang&#x201D; (ratio of training to inference cost) for future models. It should be larger: the main reason is that training overhang is roughly proportional to dataset size, and datasets are increasing over time. This trend will be slowed as we run out of naturally-occurring language data, but new modalities as well as synthetic or self-generated data will still push it forward.<sup class="footnote-ref"><a href="#fn8" id="fnref8">[8]</a></sup> In <a href="#b-training-overhang">Appendix B</a>, I consider these factors in detail to project forward to 2030. I forecast that models in 2030 will be trained with enough resources to perform <strong>1,800,000 years of work</strong> adjusted to human speed <em>[range: 400k-10M]</em>.</p>
<p>Note that <a href="https://www.lesswrong.com/posts/KrJfoZzpSDpnrv9va/draft-report-on-ai-timelines?ref=bounded-regret.ghost.io">Cotra (2020)</a> and <a href="https://www.planned-obsolescence.org/continuous-doesnt-mean-slow/?ref=bounded-regret.ghost.io">Davidson (2023)</a> estimate similar quantities and arrive at larger numbers than me; I&apos;d guess the main difference is how I model the effect of running out of natural language data.</p>
<p>The projection above is somewhat conservative, since models may be run on more resources than they were trained on if the organization buys additional compute. A <a href="https://docs.google.com/spreadsheets/d/1Pz0YVJalZbdo63FI-rCa7baBkGsiuYv5P34Rpt9IZfE/edit?ref=bounded-regret.ghost.io#gid=0">quick ballpark estimate</a> suggests that GPT-4 was trained on about 0.01% of all computational resources in the world, although I expect future training runs to use up a larger share of total world compute and therefore have less room to scale up further after training. Still, an organization could possibly increase the number of copies they run by another order of magnitude if they had strong reasons to do so.</p>
<h1 id="4-knowledge-sharing">4. Knowledge Sharing</h1>
<p><em>(Thanks to Geoff Hinton who first made this argument to me.)</em></p>
<p>Different copies of a model can share parameter updates. For instance, ChatGPT could be deployed to millions of users, learn something from each interaction, and then propagate gradient updates to a central server where they are averaged together and applied to all copies of the model. In this way, ChatGPT could observe more about human nature in an hour than humans do in a lifetime (1 million hours = 114 years). Parallel learning may be one of the most important advantages models have, as it means they can rapidly learn any missing skills.</p>
<p>The rate of parallel learning depends on how many copies of a model are running at once, how quickly they can acquire data, and whether the data can be efficiently utilized in parallel. On the last point, even extreme parallelization should not harm learning efficiency much, as batch sizes in the millions are <a href="https://arxiv.org/pdf/2203.15556.pdf?ref=bounded-regret.ghost.io#table.caption.8">routine in practice</a>, and the gradient noise scale (<a href="https://arxiv.org/abs/1812.06162?ref=bounded-regret.ghost.io">McCandlish et al., 2018</a>) predicts minimal degradation in learning performance below a certain &#x201C;critical batch size&#x201D;. We&apos;ll therefore focus on parallel copies and data acquisition.</p>
<p>I will provide two estimates that both suggest it would be feasible to have at least ~1 million copies of a model learning in parallel at human speed. This corresponds to <strong>2500 human-equivalent years of learning per day</strong>, since 1 million days = 2500 years.</p>
<p>The first estimate uses the numbers from <a href="#3-throughput-and-parallel-copies">Section 3</a>, which concluded that the cost of training a model is enough to simulate models for 1.8M years of work (adjusted to human speed). Assuming that the training run itself lasted for less than 1.2 years (<a href="https://epochai.org/blog/the-longest-training-run?ref=bounded-regret.ghost.io">Sevilla et al., 2022</a>), this means the organization that trained the model has enough GPUs to run 1.5M copies at human speed.</p>
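<p>As arithmetic, this first estimate is just a ratio of the two figures above:</p>

```python
# If training compute buys 1.8M human-equivalent years of inference, and
# the training run itself lasted at most 1.2 years, then the same hardware
# can host work_years / run_years simultaneous copies at human speed.
work_years = 1.8e6
run_years = 1.2
copies = work_years / run_years
print(round(copies))   # 1500000 -- i.e. 1.5M copies at human speed
```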
<p>The second estimate considers the market share of the organization deploying the model. For example, if there are 1 million users querying the model at a time, then the organization necessarily has the resources to serve 1 million copies of the model. As a ballpark, ChatGPT had <a href="https://www.demandsage.com/chatgpt-statistics/?ref=bounded-regret.ghost.io">100 million users</a> as of May 2023 (not all active at once), and <a href="https://www.enterpriseappstoday.com/stats/chatgpt-4-statistics.html?ref=bounded-regret.ghost.io">13 million active users/day</a> as of January 2023. I&#x2019;d assume the typical user is requesting a few minutes&#x2019; worth of model-generated text, so the January number probably only implies around 0.05 million person-days of text each day. However, it seems fairly plausible that future ChatGPT-style models would 20x this, reaching 250 million active users/day or more and hence 1 million person-days of data each day. As a point of comparison, Facebook has 2 billion daily active users.</p>
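<p>A rough reconstruction of this second estimate; the exact minutes-per-user value is my assumption, chosen to illustrate the &#x201C;few minutes&#x201D; figure:</p>

```python
# 13M daily active users, each requesting a few minutes of generated text,
# measured as person-days of text produced per day. minutes_per_user = 5.5
# is an assumed illustrative value, not a figure from the post.
daily_active_users = 13e6
minutes_per_user = 5.5
minutes_per_day = 24 * 60

person_days = daily_active_users * minutes_per_user / minutes_per_day
print(round(person_days))        # ~50,000 -- the "0.05 million person-days" above
print(round(person_days * 20))   # ~1,000,000 -- after a hypothesized 20x scale-up
```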
<h1 id="5-modalities-tools-and-actuators">5. Modalities, Tools, and Actuators</h1>
<p>Historically, GPT-style models have primarily been trained on text and code, and had limited capacity to interact with the outside world except via chat dialog. However, this is rapidly changing, as models are being trained on additional modalities such as images, are being trained to use tools, and are starting to interface with physical actuators. Moreover, models will not be restricted to anthropocentric modalities such as text, natural images, video, and speech---they will likely also be trained on unfamiliar modalities such as network traffic, astronomical images, or other massive data sources.</p>
<p><strong>Tools</strong>. Recently-released models use external tools, as seen with <a href="https://openai.com/blog/chatgpt-plugins?ref=bounded-regret.ghost.io">ChatGPT plugins</a> as well as <a href="https://arxiv.org/abs/2302.04761?ref=bounded-regret.ghost.io">Schick et al. (2023)</a>, <a href="https://arxiv.org/abs/2210.03629?ref=bounded-regret.ghost.io">Yao et al. (2022)</a>, and <a href="https://arxiv.org/abs/2211.10435?ref=bounded-regret.ghost.io">Gao et al. (2022)</a>. Text combined with tool use is sufficient to write code that gets executed, convince humans to take actions on their behalf, make API calls, make transactions, and potentially execute cyberattacks. Tool use is economically useful, so there will be strong incentives to further develop this capability.</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">ChatGPT is reactive: user says X, ChatGPT responds with Y. Risks exist but are bounded. Soon it will be tempting to have proactive systems - an assistant that will answer emails for you, take actions on your behalf, etc. Risks will then be much higher.</p>&#x2014; Percy Liang (@percyliang) <a href="https://twitter.com/percyliang/status/1630087355360223232?ref_src=twsrc%5Etfw&amp;ref=bounded-regret.ghost.io">February 27, 2023</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p><strong>New modalities</strong>. There are now large open-source vision-language models such as <a href="https://github.com/mlfoundations/open_flamingo?ref=bounded-regret.ghost.io">OpenFlamingo</a>, and on the commercial side, GPT-4 and <a href="https://arxiv.org/abs/2204.14198?ref=bounded-regret.ghost.io">Flamingo</a> were both trained on vision and text data. Researchers are also experimenting with more exotic pairs of modalities such as proteins and language (<a href="https://github.com/UCSD-AI4H/proteinchat?ref=bounded-regret.ghost.io">Guo et al., 2023</a>).</p>
<p>We should expect the modalities of large pretrained models to continue to expand, for two reasons. First, economically, it is useful to pair language with less familiar modalities (such as proteins) so that users can benefit from explanations and efficiently make edits. This predicts multimodal training with proteins, biomedical data, <a href="https://en.wikipedia.org/wiki/Computer-aided_design?ref=bounded-regret.ghost.io">CAD models</a>, and any other modality associated with a major economic sector.</p>
<p>Second, we are starting to run out of language data, so model developers will search for new types of data to continue benefiting from scale. Aside from the traditional text and videos, some of the largest existing sources of data are <a href="https://en.wikipedia.org/wiki/Square_Kilometre_Array?ref=bounded-regret.ghost.io#Data_challenges">astronomical data</a> (will soon be at exabytes per day) and <a href="https://3billion.io/blog/big-data-among-big-data-genome-data?ref=bounded-regret.ghost.io">genomic data</a> (around 0.1 exabytes/day). It is plausible that these and other massive data sources will be leveraged for training GPT<sub>2030</sub>.</p>
<p>The use of exotic modalities means that GPT<sub>2030</sub> might have unintuitive capabilities. It might understand stars and genes much better than we do, even while it struggles with basic physical tasks. This could lead to surprises, such as designing novel proteins, that we would not have expected based on GPT<sub>2030</sub>&#x2019;s level of &#x201C;general&#x201D; intelligence. When thinking about the impacts of GPT<sub>2030</sub>, it will be important to consider specific superhuman capabilities it might possess due to these exotic data sources.</p>
<p><strong>Actuators</strong>. Models are also beginning to use physical actuators: ChatGPT has <a href="https://www.microsoft.com/en-us/research/group/autonomous-systems-group-robotics/articles/chatgpt-for-robotics/?ref=bounded-regret.ghost.io">already been used</a> for robot control and OpenAI is <a href="https://1xtech.medium.com/1x-raises-23-5m-in-series-a2-funding-led-by-openai-6040af4f3f4f?ref=bounded-regret.ghost.io">investing in</a> a humanoid robotics company. However, it is much more expensive to collect data in physical domains than digital domains, and humans are also more evolutionarily adapted to physical domains (so the bar for ML models to compete with us is higher). Compared to digital tools, I&#x2019;d therefore expect mastery of physical actuators to occur more slowly, and I&#x2019;m unsure if we should expect it by 2030. Quantitatively, I&#x2019;d assign 40% probability to there being a general-purpose model in 2030 that is able to autonomously assemble a <a href="https://s3.eu-west-1.amazonaws.com/deatech.snake.assets/assembly-guides/gb/pdfs/5609.pdf?ref=bounded-regret.ghost.io">scale-replica Ferrari</a> as defined in <a href="https://www.metaculus.com/questions/5121/date-of-artificial-general-intelligence/?ref=bounded-regret.ghost.io">this Metaculus question</a>.</p>
<h1 id="6-implications-of-gpt-2030">6. Implications of GPT-2030</h1>
<p>We&#x2019;ll next analyze what a system like GPT<sub>2030</sub> would mean for society. A system with GPT<sub>2030</sub>&#x2019;s characteristics would, at minimum, significantly accelerate some areas of research, while also possessing powerful capacities for misuse.</p>
<p>I&#x2019;ll start by framing some general strengths and limitations of GPT<sub>2030</sub>, then use this as a lens to analyze both acceleration and misuse.</p>
<p><strong>Strengths</strong>. GPT<sub>2030</sub> represents a large, highly adaptable, high-throughput workforce. Recall that GPT<sub>2030</sub> could do 1.8 million years of work<sup class="footnote-ref"><a href="#fn9" id="fnref9">[9]</a></sup> across parallel copies, where each copy is run at 5x human speed. This means we could (subject to parallelism constraints) simulate 1.8 million agents working for a year each in 2.4 months. As discussed above, we could pay 5x per FLOP to get an additional 25x speedup (to 125x human speed), so we could also simulate 14,000 agents working for a year each in <em>3 days</em><sup class="footnote-ref"><a href="#fn10" id="fnref10">[10]</a></sup>.</p>
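<p>The workforce numbers in this paragraph follow directly from Sections 2 and 3; a compact recomputation:</p>

```python
# 1.8M human-equivalent years of work, baseline copies at 5x human speed,
# and a k=5 serial speedup (a further k^2 = 25x, for 125x total).
work_years = 1.8e6
base_speed = 5
k = 5

print(12 / base_speed)                 # 2.4   -- months per simulated agent-year
fast_speed = base_speed * k**2         # 125x human speed
print(round(365 / fast_speed, 1))      # 2.9   -- days per agent-year ("3 days")
print(round(work_years / fast_speed))  # 14400 -- the "14,000 agents" above
```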
<p><strong>Limitations</strong>. There are three obstacles to utilizing this digital workforce: skill profile, experiment cost, and autonomy. On the first, GPT<sub>2030</sub> will have a different skill profile from humans that makes it worse at some tasks (but better at others). On the second, simulated workers still need to interface with the world to collect data, which has its own time and compute costs. Finally, on autonomy, models today can only generate a few thousand tokens in a chain-of-thought before getting &#x201C;stuck&#x201D;, entering a state where they no longer produce high-quality output. We&#x2019;d need significant increases in reliability before delegating complex tasks to models. I expect reliability to increase, but not without limit: my (very rough) guess is that GPT<sub>2030</sub> will be able to run for several human-equivalent days before having to be reset or steered by external feedback. If models run at a 5x speed-up, that means they need human oversight every several hours.</p>
<p>Therefore, the tasks that GPT<sub>2030</sub> would most impact are tasks that:</p>
<ol>
<li>Leverage skills that GPT<sub>2030</sub> is strong at relative to humans.</li>
<li>Only require external empirical data that can be readily and quickly collected (as opposed to costly physical experiments).</li>
<li>Can be a priori decomposed into subtasks that can be performed reliably, or that have clear and automatable feedback metrics to help steer the model.</li>
</ol>
<p><strong>Acceleration</strong>. One task that readily meets all three criteria is mathematics research. On the first, GPT<sub>2030</sub> will likely have superhuman mathematical capabilities (<a href="#1-specific-capabilities">Section 1</a>). On the second and third, math can be done purely by thinking and writing, and we know when a theorem has been proved. There are furthermore not that many mathematicians in the world (e.g. only 3,000 in the US), so GPT<sub>2030</sub> could likely simulate more than the annual output of all mathematicians every several days.</p>
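<p>A quick sanity check on this claim, using Section 4&apos;s throughput estimate of 1.5M copies at human speed and assuming, generously, per-copy output comparable to a working mathematician:</p>

```python
# Wall-clock time for 1.5M human-speed copies to match the annual
# output of the ~3,000 US mathematicians, under the assumptions above.
us_mathematicians = 3000
copies_at_human_speed = 1.5e6

years_needed = us_mathematicians / copies_at_human_speed
print(round(years_needed * 365, 1))   # 0.7 -- under a day, comfortably
                                      # within "every several days" above
```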
<p>Significant parts of ML research also meet the criteria above. GPT<sub>2030</sub> would be superhuman at programming, which includes implementing and running experiments. I&#x2019;d guess it will also be good at presenting and explaining the results of experiments, given that GPT-4 is good at explaining complex topics in an accessible way (and there is significant market demand for this). Therefore, ML research might reduce to thinking up good experiments to run and interfacing with high-quality (but potentially unreliable) write-ups of the results. In 2030, grad students might therefore have the same resources as a professor with several strong students would have today.</p>
<p>Parts of social science could also be significantly accelerated. There are many papers where the majority of the work is chasing down, categorizing, and labeling scientifically interesting sources of data and extracting important patterns&#x2014;see <a href="https://www.aeaweb.org/articles?id=10.1257%2Faer.91.5.1369&amp;ref=bounded-regret.ghost.io">Acemoglu et al. (2001)</a> or <a href="https://www.michaelwebb.co/webb_ai.pdf?ref=bounded-regret.ghost.io">Webb (2020)</a> for representative examples. This satisfies requirement (3.) because categorization and labeling can be decomposed into simple subtasks, and it satisfies requirement (2.) as long as the data is available on the internet, or could be collected through an online survey.</p>
<p><strong>Misuse</strong>. Beyond acceleration, there would be serious risks of misuse. The most direct case is cyberoffensive hacking capabilities. Inspecting a specific target for a specific style of vulnerability could likely be done reliably, and it is easy to check if an exploit succeeds (subject to being able to interact with the code), so requirement (3.) is doubly satisfied. On (2.), GPT<sub>2030</sub> would need to interact with target systems to know if the exploit works, which imposes some cost, but not enough to be a significant bottleneck. Moreover, the model could locally design and test exploits on open source code as a source of training data, so it could become very good at hacking before needing to interact with any external systems. Thus, GPT<sub>2030</sub> could rapidly execute sophisticated cyberattacks against large numbers of targets in parallel.</p>
<p>A second source of misuse is manipulation. If GPT<sub>2030</sub> interacts with millions of users at once, then it gains more experience about human interaction in an hour than a human does in their lifetime (1 million hours = 114 years). If it used these interactions to learn about manipulation, then it could obtain manipulation skills that are far greater than humans&#x2014;as an analogy, con artists are good at tricking victims because they&#x2019;ve practiced on hundreds of people before, and GPT<sub>2030</sub> could scale this up by several orders of magnitude. It could therefore be very good at manipulating users in one-on-one conversation, or at writing news articles to sway public opinion.</p>
<p>In summary, GPT<sub>2030</sub> could automate almost all mathematics research as well as important parts of other research areas, and it could be a powerful vector of misuse via both cyberattacks and persuasion/manipulation. Much of its impact would be limited by &#x201C;oversight bottlenecks&#x201D;, so if it could run autonomously for long periods of time then its impact might be larger still.</p>
<p><em>Thanks to Louise Verkin for transcribing this post to Ghost format, and Lev McKinney for running empirical benchmark experiments. Thanks to Karena Cai, Michael Webb, Leo Aschenbrenner, Anca Dragan, Roger Grosse, Lev McKinney, Ruiqi Zhong, Sam Bowman, Tatsunori Hashimoto, Percy Liang, Tom Davidson, and others for providing feedback on drafts of this post.</em></p>
<h1 id="appendix-runtime-and-training-estimates-for-future-models">Appendix: Runtime and Training Estimates for Future Models</h1>
<h2 id="a-words-per-minute">A. Words per minute</h2>
<p>First we&#x2019;ll estimate the words per minute of humans and of current models. Then we&#x2019;ll extrapolate from current models to future models.</p>
<p>For humans, there are five numbers we could measure: talking speed, reading speed, listening speed, and both &#x201C;elliptic&#x201D; and &#x201C;extended&#x201D; thinking speed. Regarding the first three, <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2649675/?ref=bounded-regret.ghost.io">Rayner and Clifton (2009)</a> say that reading speed is 300 words per minute<sup class="footnote-ref"><a href="#fn11" id="fnref11">[11]</a></sup> and speaking is 160 words per minute<sup class="footnote-ref"><a href="#fn12" id="fnref12">[12]</a></sup>, and that listening can be done 2-3 times faster than speaking (so ~400 words per minute)<sup class="footnote-ref"><a href="#fn13" id="fnref13">[13]</a></sup>. For thinking speed, we need to distinguish between &#x201C;elliptic&#x201D; and &#x201C;extended&#x201D; thought&#x2014;it turns out that we think in flashes of words rather than complete sentences, and if we extend these flashes to full sentences we get very different word counts (~10x different). <a href="https://journals.sagepub.com/doi/abs/10.2466/pms.1990.71.3.1043?ref=bounded-regret.ghost.io">Korba (1990)</a> finds that elliptic thought is 380 words per minute while extended thought is ~4200 words per minute. Since most of these numbers cluster in the 300-400 wpm range, I&#x2019;ll use <strong>380 words per minute</strong> as my estimate of human thinking speed. Using the 4:3 token to word ratio <a href="https://openai.com/api/pricing/?ref=bounded-regret.ghost.io">suggested by OpenAI</a>, this comes out to <strong>500 tokens per minute</strong>.<sup class="footnote-ref"><a href="#fn14" id="fnref14">[14]</a></sup></p>
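<p>As a sanity check, the conversion arithmetic in code (all numbers are the estimates quoted above):</p>

```python
# Convert the human thinking-speed estimate from words to tokens,
# using the 4:3 token-to-word ratio suggested by OpenAI.
words_per_minute = 380       # "elliptic" thinking speed
tokens_per_word = 4 / 3
tokens_per_minute = words_per_minute * tokens_per_word
print(round(tokens_per_minute))  # ~507, rounded to 500 in the text
```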
<p><em>(Thanks to Lev McKinney for running the evaluations in the following paragraphs.)</em><br>
Next, let&#x2019;s consider current models. We queried gpt-3.5-turbo and gpt-4, as well as several open source models from EleutherAI, to benchmark their inference speed. We did this by prompting the models to count from 1 to n, where n ranged from 100 to 1900 inclusive in increments of 100. Since numbers contain more than one token, we cut the model off once it had generated n tokens and measured the time elapsed. We then ran a linear regression with a bias term (to account for latency) to estimate the asymptotic number of tokens per second.</p>
<p>GPT-4 and GPT-3.5-turbo were queried from the OpenAI API in early April 2023. All experiments for the <a href="https://huggingface.co/EleutherAI/pythia-12b?ref=bounded-regret.ghost.io">pythia models</a> were performed using <a href="https://www.deepspeed.ai/tutorials/inference-tutorial/?ref=bounded-regret.ghost.io">deepspeed&apos;s injected kernels</a> and fp16 models on a single A100 GPU.<sup class="footnote-ref"><a href="#fn15" id="fnref15">[15]</a></sup> Code for replicating these results can be found at <a href="https://github.com/levmckinney/llm-racing?ref=bounded-regret.ghost.io">https://github.com/levmckinney/llm-racing</a>.</p>
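<p>The estimation step can be sketched as follows. The timing numbers here are made up for illustration (the real data came from querying the models); the point is the regression with an intercept to absorb per-request latency:</p>

```python
import numpy as np

# Regress elapsed time on tokens generated, with an intercept to absorb
# per-request latency; the asymptotic speed is 1/slope tokens per second.
tokens = np.arange(100, 2000, 100)            # n = 100, 200, ..., 1900
latency, sec_per_token = 0.8, 0.12            # hypothetical values
elapsed = latency + sec_per_token * tokens    # simulated measurements

slope, intercept = np.polyfit(tokens, elapsed, 1)
tokens_per_minute = 60 / slope                # asymptotic inference speed
print(round(tokens_per_minute))               # 500 for these made-up numbers
```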
<p>The raw data is plotted in Figure 1 below, while Figure 2 and Table 1 give the resulting estimated tokens per minute.</p>
<p align="center"><img src="https://bounded-regret.ghost.io/content/images/2023/06/tokens_per_second.png"></p>
<p>Figure 1 demonstrates how model inference time scales with token count. Note that time per token remains relatively constant at these context lengths.</p>
<p align="center"><img src="https://bounded-regret.ghost.io/content/images/2023/06/results.png"></p>
<p>Figure 2 and the table below demonstrate how model inference speed scales with size. Error bars are 95% confidence intervals.</p>
<table>
<thead>
<tr>
<th>Model name</th>
<th>Tokens per minute</th>
</tr>
</thead>
<tbody>
<tr>
<td>gpt-4</td>
<td>493</td>
</tr>
<tr>
<td>gpt-3.5-turbo</td>
<td>1641</td>
</tr>
<tr>
<td>EleutherAI/pythia-12b-deduped</td>
<td>1801</td>
</tr>
<tr>
<td>EleutherAI/pythia-6.9b-deduped</td>
<td>2659</td>
</tr>
<tr>
<td>EleutherAI/pythia-2.8b-deduped</td>
<td>4568</td>
</tr>
<tr>
<td>EleutherAI/pythia-1.4b-deduped</td>
<td>7040</td>
</tr>
<tr>
<td>EleutherAI/pythia-410m-deduped</td>
<td>11039</td>
</tr>
<tr>
<td>EleutherAI/pythia-160m-deduped</td>
<td>21580</td>
</tr>
<tr>
<td>EleutherAI/pythia-70m-deduped</td>
<td>31809</td>
</tr>
</tbody>
</table>
<p>Thus, GPT-4 is close to the human benchmark of 500 tokens/minute, while GPT-3.5-turbo is about 3x faster. Smaller models are an order of magnitude faster still, which indicates that even faster inference is possible, although it also suggests that future larger models may be slower (not accounting for better hardware and other optimizations). Inference speed in practice seems to degrade sublinearly with model size: in the pythia models, increasing size by a factor of k decreases inference speed by a factor of approximately k<sup>0.6</sup>.</p>
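<p>The k<sup>0.6</sup> figure can be checked against the pythia rows of Table 1 with a rough log-log fit (an approximation, not an exact law):</p>

```python
import numpy as np

# Fit log(tokens/minute) against log(parameter count) for the pythia models
# in Table 1; the negative of the slope is the scaling exponent.
params = np.array([12e9, 6.9e9, 2.8e9, 1.4e9, 410e6, 160e6, 70e6])
tokens_per_min = np.array([1801, 2659, 4568, 7040, 11039, 21580, 31809])

slope, _ = np.polyfit(np.log(params), np.log(tokens_per_min), 1)
print(round(-slope, 2))  # ~0.55: a k-fold size increase costs roughly k^0.55-0.6 in speed
```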
<p>How will models&#x2019; words per minute change in the future? There are factors pushing towards both faster and slower speeds:</p>
<ul>
<li>Larger models are more expensive to run, especially if they have more layers (larger widths can be parallelized, but larger depths cannot be).</li>
<li>Inference will generally be optimized more, e.g. via early exiting, sparse attention, cross-GPU parallelization, or better hardware. There is quite a bit of headroom here, especially from parallelization and hardware (see discussion below).</li>
<li>In particular, there will be quite a bit of incentive to make models fast enough to be easily usable (e.g. faster than human reading speed).</li>
<li>After staring at lots of data about trends in model size, GPU architecture, etc. for 10+ hours, I mostly concluded that I am very uncertain about how the competing trends of larger models vs. better hardware and software will play out. <strong>My median guess would be that we get models that are noticeably faster than humans (5x), but I wouldn&#x2019;t be surprised by anything from 2x slower to 20x faster.</strong></li>
<li>Importantly, these speeds are only if we demand maximum throughput from the GPUs. If we are willing to sacrifice throughput by a factor of k, we can speed up inference by a factor of k<sup>2</sup>, up to fairly large values of k. So if models are only 5x faster than humans by default, they could instead be 125x faster in exchange for a 5x reduction in throughput, and this could be pushed further still if necessary.</li>
</ul>
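<p>To make the last bullet concrete, here is the throughput-for-speed tradeoff in code (the quadratic relationship is the one stated above; treat it as an approximation):</p>

```python
# Sacrificing a factor of k in throughput buys roughly k^2 in serial speed.
base_speedup = 5  # median guess: models ~5x faster than humans by default
for k in [1, 2, 5]:
    print(f"throughput / {k} -> {base_speedup * k**2}x faster than humans")
# k = 5 recovers the 125x figure quoted above
```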
<p>Finally, aside from raw speed, words/minute isn&#x2019;t apples-to-apples across humans and language models. For one, the language models aren&#x2019;t just thinking but also writing, and in some cases they are writing content that would be much slower for humans to produce (e.g. code, or arguments with references). In the other direction, language models are currently quite verbose, so one word from a language model does less &#x201C;work&#x201D; than one word from a human. This verbosity could be fine-tuned away, but it&#x2019;s not clear we could match the efficiency of elliptic thought in humans. Finally, tokenization and word complexity will change over time, and so the 1.333x conversion ratio from words to tokens won&#x2019;t stay constant (indeed, I&#x2019;d guess it&#x2019;s already an underestimate for today&#x2019;s models since they now tend to use complex words with prefixes and suffixes).</p>
<p><strong>Details on parallelization and hardware speed-ups</strong>. As described in <a href="https://bounded-regret.ghost.io/how-fast-can-we-perform-a-forward-pass/">How Fast Can We Perform a Forward Pass?</a>, there are parallel tiling schemes that significantly increase serial inference speed with only minor overhead. For instance, parallel tiling of GPT-3 would increase its inference speed by 30x or more on an A100 cluster relative to running it on a single 8-GPU machine<sup class="footnote-ref"><a href="#fn16" id="fnref16">[16]</a></sup>. These optimizations are not currently widely used because they aren&#x2019;t useful for training and slightly decrease inference throughput, but people would start using them once inference time becomes a bottleneck.</p>
<p>For hardware, GPUs are becoming more powerful, which will speed up inference. However, GPUs are also being built to require larger arithmetic intensity, which will decrease the amount of parallel tiling (see previous point) that is possible. For reference, I&#x2019;ve included the specs of all NVIDIA GPUs below. The &#x201C;Mem Bandwidth&#x201D; column measures the serial throughput without any cross-GPU parallelization<sup class="footnote-ref"><a href="#fn17" id="fnref17">[17]</a></sup>, while the final M<sup>3</sup>/C<sup>2</sup> column measures serial throughput with the maximum cross-GPU parallelization that maintains high enough arithmetic intensity<sup class="footnote-ref"><a href="#fn18" id="fnref18">[18]</a></sup>. The former is steadily increasing, while the latter jumps around but has tended to decrease.</p>
<table>
<thead>
<tr>
<th>Date</th>
<th>GPU</th>
<th>Compute</th>
<th>Memory</th>
<th>Clock Speed</th>
<th>Mem Bandwidth</th>
<th>Interconnect</th>
<th>Network</th>
<th>M<sup>3</sup>/C<sup>2</sup></th>
</tr>
</thead>
<tbody>
<tr>
<td>05/2016</td>
<td>P100</td>
<td>~84TF</td>
<td>16GB</td>
<td>1.45GHz</td>
<td>720GB/s</td>
<td>160GB/s</td>
<td></td>
<td>53M</td>
</tr>
<tr>
<td>12/2017</td>
<td>V100 16GB</td>
<td>125TF</td>
<td>16GB</td>
<td>1.49GHz</td>
<td>900GB/s</td>
<td>300GB/s</td>
<td>~25GB/s</td>
<td>47M</td>
</tr>
<tr>
<td>03/2018</td>
<td>V100 32GB</td>
<td>125TF</td>
<td>32GB</td>
<td>1.49GHz</td>
<td>900GB/s</td>
<td>300GB/s</td>
<td>~100GB/s</td>
<td>47M</td>
</tr>
<tr>
<td>05/2020</td>
<td>A100 40GB</td>
<td>312 TF</td>
<td>40GB</td>
<td>1.38GHz</td>
<td>1555GB/s</td>
<td>600GB/s</td>
<td>~400GB/s</td>
<td>39M</td>
</tr>
<tr>
<td>11/2020</td>
<td>A100 80GB</td>
<td>312 TF</td>
<td>80GB</td>
<td>1.38GHz</td>
<td>2039GB/s</td>
<td>600GB/s</td>
<td>~400GB/s</td>
<td>87M</td>
</tr>
<tr>
<td>~8/2022</td>
<td>H100</td>
<td>2000 TF</td>
<td>80GB</td>
<td>1.74GHz</td>
<td>3072GB/s</td>
<td>900GB/s</td>
<td>900GB/s?</td>
<td>7.2M</td>
</tr>
</tbody>
</table>
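<p>The final column of the table can be reproduced from the &#x201C;Compute&#x201D; and &#x201C;Mem Bandwidth&#x201D; columns (M in bytes/s, C in FLOP/s; the V100 16GB and 32GB rows share the same specs):</p>

```python
# Recompute M^3 / C^2, which (up to constant factors) tracks serial
# throughput under maximal cross-GPU parallel tiling.
gpus = {
    "P100":      (720e9,  84e12),
    "V100":      (900e9,  125e12),
    "A100 40GB": (1555e9, 312e12),
    "A100 80GB": (2039e9, 312e12),
    "H100":      (3072e9, 2000e12),
}
for name, (mem_bw, compute) in gpus.items():
    print(f"{name}: {mem_bw**3 / compute**2 / 1e6:.0f}M")
# A100 80GB -> 87M and H100 -> 7M, matching the table
```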
<h2 id="b-training-overhang">B. Training overhang</h2>
<p>There will likely be enough resources to run many copies of a model once it has been trained. GPT-3 took 3.1e23 FLOPs to train and requires 3.5e11 FLOPs for a forward pass, so 9e11 forward passes could be run for the cost of training. Using the 500 tokens per minute conversion from <a href="#a-words-per-minute">Appendix A</a>, this would correspond to ~3400 human-years of thinking.</p>
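<p>In code, the overhang arithmetic for GPT-3 (using the 500 tokens/minute figure from Appendix A and treating one forward pass as one token):</p>

```python
# Back-of-the-envelope: how many human-years of thinking could be run
# for GPT-3's training cost?
train_flops = 3.1e23      # GPT-3 training compute
forward_flops = 3.5e11    # one forward pass = one token
tokens_per_minute = 500   # human thinking speed (Appendix A)

forward_passes = train_flops / forward_flops             # ~9e11
human_years = forward_passes / tokens_per_minute / (60 * 24 * 365)
print(f"{human_years:.0f} human-years")                  # ~3400
```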
<p>How will this change in the future? I&#x2019;ll use the Chinchilla scaling law and projections of future training costs to form an initial estimate, then I&#x2019;ll consider ways we could deviate from the Chinchilla trend. For future training costs, I consider the projection in <a href="https://epochai.org/blog/projecting-compute-trends?ref=bounded-regret.ghost.io">Besiroglu et al. (2022)</a>, who analyzed over 500 existing models to extrapolate compute trends in machine learning. Their central projection of training FLOPs in 2030 is 4.7e28, with a range of 5.1e26 to 3.0e30. Metaculus has a <a href="https://www.metaculus.com/questions/11558/maximum-compute-used-in-ai-training/?ref=bounded-regret.ghost.io">similar estimate</a> of 2.3e27 (for Jan 1, 2031)<sup class="footnote-ref"><a href="#fn19" id="fnref19">[19]</a></sup>. Taking the geometric mean, I&#x2019;ll use 1.0e28 as my estimate of training FLOPs, or a 33,000-fold increase over GPT-3. Since the Chinchilla scaling law implies that model size (and hence inference cost) scales as the square-root of training cost, this means the training overhang should increase by sqrt(33000), or around 180-fold. The 3400 human-years of thinking would thus increase to 620,000 human-years. However, there&#x2019;s an additional consideration, which is that GPT-3 was actually trained with suboptimal scaling. The ideal size of GPT-3 (given its training cost) would have been 4 times smaller, so we need to add an additional factor of 4, to get 2.5M human-years, with a range from 0.8M to 9M accounting for uncertainty in the number of training FLOPs<sup class="footnote-ref"><a href="#fn20" id="fnref20">[20]</a></sup>.</p>
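<p>The chain of estimates above, written out (all inputs are the figures quoted in the text):</p>

```python
import math

# Project the training overhang to 2030 under Chinchilla scaling.
flops_2030 = math.sqrt(4.7e28 * 2.3e27)  # geometric mean of the two projections
growth = flops_2030 / 3.1e23             # ~33,000x more compute than GPT-3

# Chinchilla: inference cost scales as sqrt(training cost), so the number of
# forward passes affordable per training run also grows as sqrt(growth).
overhang_growth = math.sqrt(growth)      # ~180x
human_years = 3400 * overhang_growth * 4 # x4: GPT-3 was ~4x larger than optimal
print(f"{human_years / 1e6:.1f}M human-years")  # ~2.5M
```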
<p>Next, let&#x2019;s consider deviations from the Chinchilla scaling law. The most obvious deviation is that we might soon run out of data. This could mean that larger models become more attractive relative to more data (which would decrease training overhang), that we generate additional synthetic data (which makes creating data more computationally expensive, and would increase training overhang), or that we move to new data-rich modalities such as video (unclear effect on training overhang, though it probably increases it). To roughly bound these effects:</p>
<ul>
<li><em>Lower bound</em>: <a href="https://arxiv.org/abs/2211.04325?ref=bounded-regret.ghost.io">Villalobos et al. (2022)</a> estimate that we will run out of high-quality language data (e.g. Wikipedia, books, scientific papers, etc.) by 2026, although we will not run out of low-quality data (e.g. web pages) before 2030. In a pessimistic world where high-quality data is a completely binding constraint, the model in Villalobos et al. implies an 8x increase in dataset size by 2030, meaning the training overhang would increase only 8-fold instead of 180-fold.</li>
<li><em>Upper bound</em>: If we run out of data, we might generate new data synthetically. One possibility for this is chain-of-thought distillation as in <a href="https://arxiv.org/abs/2210.11610?ref=bounded-regret.ghost.io">Huang et al. (2022)</a>. In that paper, 32 chains of thought are generated on each input instance, only some of which are used for training updates. Assume that on average 5 of the 32 chains of thought get used for training updates, and that a backward pass is twice the cost of a forward pass. Then the cost per training update is equivalent to 2 + 32/5 = 8.4 forward passes, compared to 3 previously, or a 2.8x increase. Under Chinchilla scaling this cost propagates forward to an additional sqrt(2.8) = 1.7x increase in training overhang, i.e. 300-fold instead of 180-fold.</li>
</ul>
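<p>The arithmetic behind the upper bound, spelled out (the 5-of-32 usage rate is the assumption stated above):</p>

```python
import math

# Chain-of-thought distillation cost: 32 generated chains per input, of which
# ~5 are trained on; a backward pass costs ~2 forward passes. So each training
# update costs 2 + 32/5 = 8.4 forward-pass equivalents, vs. 3 for ordinary
# training (1 forward + 2 backward).
cost_ratio = (2 + 32 / 5) / 3            # = 2.8x more compute per update
overhang_factor = math.sqrt(cost_ratio)  # ~1.7x under Chinchilla scaling
print(f"{180 * overhang_factor:.0f}-fold instead of 180-fold")
```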
<p>Overall, the lower bound seems fairly pessimistic to me, as we&#x2019;ll almost certainly find <em>some</em> way to leverage lower-quality or synthetic data. On the other hand, beyond running out of data, we might find ways to make the training process more efficient via e.g. curriculum learning. Accounting for this, my personal guess is that we will end up somewhere between a <strong>12-fold and 200-fold increase in overhang</strong>, with a central estimate of 100x, yielding a training overhang of around <strong>1.8M human-years of thinking</strong>. We would also want to expand our range to account for the additional uncertainty from deviations from the Chinchilla scaling law. Subjectively, I&#x2019;d increase the range to be <em>0.4M to 10M</em>.</p>
<p>All of these estimates are for 2030. In general, the numbers above would be larger for later years and smaller for earlier years.</p>
<p>As an additional point of comparison, <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/?ref=bounded-regret.ghost.io#fnref5">Karnofsky (2022)</a> (following <a href="https://www.lesswrong.com/posts/KrJfoZzpSDpnrv9va/draft-report-on-ai-timelines?ref=bounded-regret.ghost.io">Cotra, 2020</a>) estimates that the cost to train a human-level model would be enough compute to run 100 million copies of the model for a year each, although that estimate assumes training runs that use 1e30 FLOPs instead of 1e28. Even accounting for that, this seems a bit high to me, and I&#x2019;d have been closer to 18 million than 100 million based on the square-root scaling above.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>Though actually, zeroth order forecasting already helps a lot if done right! Many who were surprised by ChatGPT would have already been impressed by text-davinci-003, which was released much earlier but with a less user-friendly interface. <a href="#fnref1" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn2" class="footnote-item"><p>As a specific point of comparison, GPT-3 only had enough compute to run 3400 human-adjusted years of work, and I&apos;d guess it could do less than 100 human-adjusted years of learning per day. I&apos;d guess GPT-4 is at 130,000 human-adjusted years of work and 125 adjusted years of learning. So GPT<sub>2030</sub> is at least an order of magnitude larger on both axes. <a href="#fnref2" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn3" class="footnote-item"><p>Throughout, the range in brackets represents the 25th to 75th percentile of my predictive distribution. In practice the range is probably too narrow because I only did a mainline forecast without accounting for <a href="https://forecasting.quarto.pub/book/other-option.html?ref=bounded-regret.ghost.io">&#x201C;other&#x201D; options</a>. <a href="#fnref3" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn4" class="footnote-item"><p>Qualitatively, Bubeck et al. also found that GPT-4 could produce a 400-line 3D game zero-shot, which is probably impossible for nearly all humans. <a href="#fnref4" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn5" class="footnote-item"><p>See <a href="https://bounded-regret.ghost.io/forecasting-math-and-mmlu-in-2023/">Forecasting ML Benchmarks in 2023</a> for some further discussion of this. <a href="#fnref5" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn6" class="footnote-item"><p>Concretely, I&#x2019;d assign 50% probability to the following: &#x201C;If we take 5 randomly selected theorem statements from the Electronic Journal of Combinatorics and give them to the math faculty at <a href="https://www.usnews.com/best-graduate-schools/top-science-schools/mathematics-rankings?_sort=rank-asc&amp;ref=bounded-regret.ghost.io">UCSD</a>, GPT<sub>2030</sub> would solve a larger fraction of problems than the median faculty and have a shorter-than-median solve time on the ones that it does solve.&#x201D; <a href="#fnref6" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn7" class="footnote-item"><p>I am assuming the initial training run was less than a year (<a href="https://epochai.org/blog/the-longest-training-run?ref=bounded-regret.ghost.io">Sevilla et al., 2022</a>), from which it follows that the organization can at least parallelize enough to run the 9 x 10<sup>11</sup> forward passes within a year, subject to constraints on inference speed. To do so in 2.4 months, they may need further improvements. I think this is plausible (but not certain), both because the organization might have trained the model in less than a year, and because there may be tricks available for inference that were not for training. <a href="#fnref7" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn8" class="footnote-item"><p>A second factor is that GPT-3 was trained suboptimally, and with optimal (Chinchilla-style) scaling the training overhang would be 4x larger already.  <a href="#fnref8" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn9" class="footnote-item"><p>Adjusted to human working speeds. <a href="#fnref9" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn10" class="footnote-item"><p>The math here is that with a perfect speed-up, 1.8 million / 25 = 72,000, but the extra 5x per FLOP makes it 14,000. <a href="#fnref10" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn11" class="footnote-item"><p>&#x201C;skilled readers typically reading at rates between 250-350 words per minute&#x201D; <a href="#fnref11" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn12" class="footnote-item"><p>&#x201C;estimates of normal speaking rate range from 120 to 200 words per minute&#x201D; <a href="#fnref12" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn13" class="footnote-item"><p>&#x201C;Experiments on compressed speech suggest that comprehension can be successful at two times or more the normal rate (e.g., Dupoux &amp; Green, 1997)&#x201D; <a href="#fnref13" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn14" class="footnote-item"><p>I personally think that 4:3 is too optimistic and 3:2 or even 2:1 might be more realistic, but I&#x2019;ll stick to 4:3 throughout the doc since it was the main citation I found. <a href="#fnref14" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn15" class="footnote-item"><p>The performance for pythia models can likely be improved further. For instance, <a href="https://github.com/NVIDIA/FasterTransformer/blob/main/docs/gpt_guide.md?ref=bounded-regret.ghost.io#performance-of-gpt-67b">NVIDIA has reported</a> about 80 tokens per second on a comparable model to pythia-6.9 billion on a single A100. When allowing for more hardware, they have even shown <a href="https://github.com/NVIDIA/FasterTransformer/blob/main/docs/gpt_guide.md?ref=bounded-regret.ghost.io#performance-of-gpt-20b">approximately 90 tokens per second</a> using 8 way tensor parallelism on an 8xA100 SuperPod architecture when generating using a 20B parameter GPT model. <a href="#fnref15" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn16" class="footnote-item"><p>A single A100 can handle matrix multiplies as small as 1024x1024 before becoming bottlenecked on memory reads, and the main operation in GPT-3 is a 12288 x (4*12288) matrix multiply, meaning we would tile it across 576 GPUs (72 machines). This would naively mean a 72x speedup, but there is probably enough overhead that I&#x2019;m estimating closer to 30x. <a href="#fnref16" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn17" class="footnote-item"><p>Roughly speaking, with no cross-GPU tiling, the serial speed of inference is determined by the memory bandwidth, e.g. the A100 with 2039GB/s bandwidth should be able to complete 2039/175 &#x2248; 12 forward passes per second with a 175B parameter model (up to constant factors). <a href="#fnref17" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn18" class="footnote-item"><p>With parallel tiling, the number of forward passes per second is proportional to M<sup>3</sup>/(54C<sup>2</sup>L), where C = Compute, M = Mem bandwidth, and L = # of layers (see <a href="https://bounded-regret.ghost.io/how-fast-can-we-perform-a-forward-pass/">here</a> for details). The final column gives M<sup>3</sup>/C<sup>2</sup>. <a href="#fnref18" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn19" class="footnote-item"><p>Metaculus also <a href="https://www.metaculus.com/questions/4518/how-many-billions-of-parameters-will-the-largest-machine-learning-model-trained-before-2030-have/?ref=bounded-regret.ghost.io">estimates</a> that the largest model trained will have 2.5e15 parameters (for Jan 1, 2030), meaning a forward pass costs 5e15 FLOPs. If we naively take the ratio, we again get 9e11 forward passes, but I think this is not the right calculation, because the largest model trained will likely not be state-of-the-art but rather something like the 174 trillion parameter <a href="https://dl.acm.org/doi/abs/10.1145/3503221.3508417?ref=bounded-regret.ghost.io">BaGuaLu model</a>. <a href="#fnref19" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn20" class="footnote-item"><p>I&#x2019;m basing this on Metaculus giving a range of 5M to 660M as the interquartile range of their estimate, and propagating the uncertainty through the square-root function. <a href="#fnref20" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
</ol>
</section>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Complex Systems are Hard to Control]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>The deployment of powerful deep learning systems such as ChatGPT raises the question of how to make these systems safe and consistently aligned with human intent. Since building these systems is an engineering challenge, it is tempting to think of the safety of these systems primarily through a traditional engineering</p>]]></description><link>https://bounded-regret.ghost.io/complex-systems-are-hard-to-control/</link><guid isPermaLink="false">642727c0e9eb58003d7bf52e</guid><dc:creator><![CDATA[Jacob Steinhardt]]></dc:creator><pubDate>Mon, 03 Apr 2023 23:50:21 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>The deployment of powerful deep learning systems such as ChatGPT raises the question of how to make these systems safe and consistently aligned with human intent. Since building these systems is an engineering challenge, it is tempting to think of the safety of these systems primarily through a traditional engineering lens, focusing on reliability, modularity, redundancy, and reducing the long tail of failures.</p>
<p>While engineering is a useful lens, it misses an important part of the picture: deep neural networks are <strong>complex adaptive systems</strong>, which raises new control difficulties that are not addressed by the standard engineering ideas of reliability, modularity, and redundancy. I&#x2019;ve discussed some <a href="https://bounded-regret.ghost.io/emergent-deception-optimization/">particular</a> <a href="https://bounded-regret.ghost.io/ml-systems-will-have-weird-failure-modes-2/">examples</a> of this before, but here I want to focus on the broader underlying intuition that generated them.</p>
<p>A <a href="https://en.wikipedia.org/wiki/Complex_adaptive_system?ref=bounded-regret.ghost.io">complex adaptive system</a> is a system with many interacting components that adapt to their environment and co-evolve over time (in our case, the weights / layers of the neural network). Beyond neural networks, other examples of complex adaptive systems include firms, financial markets, political parties, culture, traffic flows, pathogens, ecosystems, human brains, the Earth&#x2019;s climate, and the Internet.</p>
<p>A common thread in all these systems is that straightforward attempts to control their behavior lead to unintended consequences. I&#x2019;ll demonstrate this through concrete examples, then step back and consider the broader properties that make these systems difficult to control, including emergent goals. Finally, I&#x2019;ll propose safety measures that account for the complex adaptive nature of deep learning systems.</p>
<p>Many of the ideas in this post have been discussed before, and my thinking owes significantly to Dan Hendrycks, who was an early proponent of the complex systems perspective as a PhD student in my lab (see e.g. <a href="https://arxiv.org/abs/2109.13916?ref=bounded-regret.ghost.io">Unsolved Problems in ML Safety</a>, the lecture on <a href="https://www.youtube.com/watch?v=Ic_qDqYEJcA&amp;ref=bounded-regret.ghost.io">accident models</a> from Dan&#x2019;s course, or <a href="https://www.alignmentforum.org/s/FaEBwhhe3otzYKGQt/p/n767Q8HqbrteaPA25?ref=bounded-regret.ghost.io">this blog post</a>).</p>
<h2 id="control-difficulties-in-complex-systems">Control Difficulties in Complex Systems</h2>
<p>Let&#x2019;s examine several examples of complex systems, and see why each is difficult to control, in the sense that they either resist or respond unpredictably to external feedback.</p>
<p><strong>Traffic</strong>. A city builds new highways to reduce traffic congestion. The newly increased road capacity <a href="https://en.wikipedia.org/wiki/Induced_demand?ref=bounded-regret.ghost.io">attracts new drivers</a>, leading to <a href="https://en.wikipedia.org/wiki/Braess%27s_paradox?ref=bounded-regret.ghost.io">worse levels of congestion</a> than before. The <em>adaptive behavior</em> leads to unintended consequences.</p>
<p><strong>Ecosystems</strong>. A park introduces a predator to reduce the population of an invasive species. The predator also preys on native species, disrupting the ecosystem balance. The <em>dense network of interactions</em> makes it difficult to predict all consequences ahead of time.</p>
<p><strong>Financial markets</strong>. Central banks lower interest rates to stimulate economic growth. Investors thus make riskier investments, leading to asset bubbles, which later burst and destabilize the financial system. In this case, both adaptivity and multi-step interactions come into play.</p>
<p><strong>Culture</strong>. The government implements public awareness campaigns to promote environmental conservation. These efforts encounter resistance from workers whose jobs rely on non-renewable fuel sources, and are appropriated by fashion brands and other consumer products through <a href="https://en.wikipedia.org/wiki/Greenwashing?ref=bounded-regret.ghost.io">greenwashing</a>.</p>
<p>Further examples include pathogens evolving drug resistance, firms relocating to avoid regulations, and positive feedback loops from climate change. I elaborate on these and other examples in the <a href="#appendix-additional-examples-of-control-difficulties">appendix</a>.</p>
<h2 id="traditional-engineering-does-not-address-these-difficulties">Traditional Engineering Does Not Address These Difficulties</h2>
<p>Why are complex adaptive systems hard to control? Two key hallmarks are at the root of the difficulty:</p>
<ol>
<li><strong>Emergence</strong>: behavior at one scale cannot be easily reduced to behavior at smaller scales, i.e. &#x201C;<a href="https://bounded-regret.ghost.io/more-is-different-for-ai/">More is Different</a>&#x201D;.</li>
<li><strong>Feedback loops</strong>: different components of the system continually influence and respond to each other.</li>
</ol>
<p>Feedback loops can lead a system to resist or respond nonlinearly to change. Emergence means that failures cannot be traced to individual components, and that behavior is hard to predict as a system evolves. Together, emergence and feedback loops lead to many of the downstream challenges seen in our earlier examples, such as:</p>
<ul>
<li><strong>Adaptivity</strong>: complex adaptive systems often adapt to and resist change, as in the traffic and culture examples.</li>
<li><strong>Nonlinearity</strong>: due to feedback loops and other higher-order interactions, small changes in input can lead to large or unexpected changes in output, as in the traffic, ecosystem, and financial market examples.</li>
<li><strong>Self-organization</strong>: order and structure can emerge without central control, as can be seen with human culture. Since there was no central control that instantiated these structures, there is no obvious point of intervention to direct them.</li>
<li><strong>Redundancy</strong>: self-organization means that complex adaptive systems often have multiple components that perform similar functions. This makes them less responsive to interventions. For instance, redirecting traffic from one street might just move it to nearby streets and not affect overall traffic in an area.</li>
</ul>
<p><strong>Traditional engineering does not address these challenges</strong>. Three hallmarks of engineering are <em>reliability</em>, <em>modularity</em>, and <em>redundancy</em>, but these traditional pillars either don&#x2019;t address the issues above or are infeasible to implement.</p>
<p>For instance, one might seek to <em>reliably</em> influence culture by testing messaging on a broad set of audiences and disseminating messages through multiple channels. But new countercultures will likely rise in response, and ubiquitous messaging could end up sparking backlash.</p>
<p><em>Modularity</em> could help improve complex systems, but is almost impossible to achieve due to interactions and feedback loops. For instance, the U.S. government is built on separation of powers (a form of modularity), but over time the different branches have co-adapted and found ways to assert power beyond their initial scope (see e.g. the <a href="https://en.wikipedia.org/wiki/War_Powers_Resolution?ref=bounded-regret.ghost.io">War Powers Resolution</a> and <a href="https://en.wikipedia.org/wiki/Commerce_Clause?ref=bounded-regret.ghost.io">commerce clause</a>).</p>
<p>Finally, <em>redundancy</em> is considered a virtue in traditional engineering, but the redundancy in complex adaptive systems makes them harder to analyze and intervene on.</p>
<h3 id="goal-oriented-behavior-in-complex-adaptive-systems">Goal-oriented Behavior in Complex Adaptive Systems</h3>
<p>A signature difficulty in complex adaptive systems is <strong>emergent goal-oriented behavior</strong> (<a href="https://en.wikipedia.org/wiki/Systemantics?ref=bounded-regret.ghost.io">Gall, 1975 ch. 8</a>). For instance, ant colonies collectively pursue goals (finding food, building nests, protecting the colony) even though each individual ant follows simple rules. Similarly, flocks of birds avoid predators despite each bird following simple rules.</p>
<p>As I&#x2019;ll discuss below, emergent goals are <strong>hard to predict from individual components</strong> and many emergent goals center on <strong>acquiring power or resources</strong>. Emergent goals therefore pose a particular challenge to controlling systems, as they produce an impetus that cannot be easily directed through either top-down or bottom-up intervention.</p>
<p>First, a system&#x2019;s explicitly stated goal (e.g. the core principles of an organization) rarely matches the goals that it pursues in practice, due to intra-system competition (see e.g. &#x201C;<a href="https://news.ycombinator.com/item?id=19553294&amp;ref=bounded-regret.ghost.io">launch, promote, abandon</a>&#x201D;, where individual managers pursue goals detrimental to the organization in order to get promoted). A system&#x2019;s emergent goals also need not match the goals of individual actors in the system. For example, it is common for groups of well-intentioned people to do harm, and for self-interested parties to create valuable products.</p>
<p>Second, emergent goals need not be beneficial to individuals: groups often exhibit strong pressures towards consensus, leading to groupthink, even if most individuals prefer greater diversity of thought. And for parts of the COVID-19 pandemic, society seemed to have a &#x201C;goal&#x201D; of keeping the reproduction number R close to 1, as lower case counts led people to be less cautious and vice versa, which rendered many policies surprisingly ineffectual.</p>
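<p>This kind of behavioral feedback can be sketched with a toy model. The functional form and constants below are illustrative assumptions, not epidemiological estimates; the point is only that a negative feedback from case counts to caution pins R near 1:</p>

```python
import math

# Toy behavioral-feedback model (illustrative constants, not estimates):
# caution rises with current case counts, suppressing transmission.
R0 = 2.0    # reproduction number with no behavioral response (assumed)
k = 0.001   # strength of the behavioral feedback (assumed)

cases = 50.0
for _ in range(200):
    R = R0 * math.exp(-k * cases)  # effective reproduction number
    cases *= R

# The feedback settles where R = 1, i.e. cases* = ln(R0) / k (about 693 here).
print(round(R, 4), round(cases))  # 1.0 693
```

<p>Because the feedback pushes R back toward 1 from either direction, interventions that change R0 or the feedback strength mostly shift the equilibrium case count rather than R itself, which is one way well-intentioned policies can look surprisingly ineffectual.</p>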
<p>Even though a system&#x2019;s goals can be derived neither from a top-down objective nor from individual actors, <strong>some goals appear commonly across many systems</strong>. Two common emergent goals are <em>self-preservation</em> and <em>growth</em>: complex adaptive systems often act to preserve themselves and to expand in size. This is ubiquitous in biology (due to evolutionary pressure), but occurs more broadly: most organizations (e.g. bureaucracies, companies) act to preserve themselves and to expand. Consequently, complex systems need constant checks to ensure they do not encroach on other domains (<a href="https://en.wikipedia.org/wiki/Systemantics?ref=bounded-regret.ghost.io">Gall, 1975 ch. 2</a>).</p>
<h2 id="lessons-for-deep-learning-safety">Lessons for Deep Learning Safety</h2>
<p>I argued that traditional engineering thinking is not sufficient for making deep learning systems safe. So what additional approaches should we incorporate? Here are several principles derived from analogy with other complex adaptive systems:</p>
<p><strong>Avoid continuous incentive gradients towards bad behaviors</strong>; instead build sharp cliffs. For instance, it is a bad idea to give people low doses of antibiotics, because some bacteria would survive and evolve antibiotic resistance. Instead, you want to make sure that anyone given antibiotics receives enough to kill all the bacteria by a significant margin.</p>
<p>Similarly, in deep learning, it would be a bad idea to first train a system on very error-prone human evaluators and then gradually expose it to more sophisticated overseers. Why? Because the model could learn methods to fool the initial error-prone evaluators, and then gradually improve its deception as the quality of oversight increased. It would instead be better to start with high-quality oversight: then the model might never learn to deceive in the first place, because all forms of successful deception would require large departures from its current policy.</p>
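<p>The antibiotic intuition above can be sketched as a toy evolutionary simulation. The population size, mutation scale, and dose schedule are arbitrary values chosen for illustration:</p>

```python
import random

random.seed(0)

def treat(doses, pop_size=2000, mutation_sd=0.03):
    """Each round, bacteria with resistance below the dose die; survivors
    repopulate (with small mutations) back to pop_size. Returns the final
    population, or [] if the bacteria were eradicated."""
    pop = [abs(random.gauss(0.05, 0.03)) for _ in range(pop_size)]
    for dose in doses:
        survivors = [r for r in pop if r > dose]
        if not survivors:
            return []
        pop = [random.choice(survivors) + random.gauss(0, mutation_sd)
               for _ in range(pop_size)]
    return pop

# Continuous incentive gradient: the dose ramps up slowly, so each step
# leaves partially resistant survivors for selection to act on.
resistant = treat([0.02 + 0.01 * g for g in range(49)])  # dose 0.02 -> 0.50

# Sharp cliff: the full dose at once, far beyond any existing variant.
eradicated = treat([0.50] * 49)

print(len(resistant) > 0, len(eradicated))  # True 0
```

<p>Under the gradual ramp, every increment leaves partially resistant survivors for selection to act on, so the population climbs the incentive gradient all the way to full resistance; the one-shot large dose leaves nothing to select from.</p>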
<p><strong>Consider not building certain systems</strong>. In other domains such as synthetic biology, it is recognized that certain systems are inherently dangerous and should not be built, or should only be built with strong justifications and safeguards in place. Many self-replicating or rapidly-evolving systems fall under this category (e.g. engineered viruses or pests). We do not have such a culture in machine learning currently, but should build more of one.</p>
<p><strong>Diverse systems are more resilient</strong>. By default, deep learning leads to the deployment of many copies of the same system or similar systems (e.g. fine-tuned from the same base model). It may be safer to have a larger diversity of models. For instance, if a model acquires unwanted emergent goals, other AI systems may act to stop it, but only if those models do not have the same emergent goals. The more different AI systems are from each other, the more they can act as checks and balances against each other. Diverse systems may also help combat algorithmic monoculture (<a href="https://www.pnas.org/doi/10.1073/pnas.2018340118?ref=bounded-regret.ghost.io">Kleinberg and Raghavan, 2021</a>; <a href="https://arxiv.org/abs/2211.13972?ref=bounded-regret.ghost.io">Bommasani et al., 2022</a>).</p>
<p>On the other hand, diverse goals of individual AI systems may lead to worse emergent goals for the entire ecosystem of AIs, due to economic and selection pressures, as argued in <a href="https://arxiv.org/abs/2303.16200?ref=bounded-regret.ghost.io">Hendrycks (2023)</a>.</p>
<p><strong>Avoid positive feedback loops</strong>. Positive feedback loops, left unchecked, can cause a system to explode destructively. In deep learning, we should be especially worried about positive feedback loops that cause rapid capabilities improvements (e.g. learning to learn or otherwise self-improve) or rapid shifts in goals.</p>
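<p>A minimal sketch of why crossing the critical feedback gain is qualitatively different from merely strong feedback (the gains below are arbitrary illustrative values): if each round of feedback returns a fixed multiple of the most recent change, a unit perturbation accumulates a geometric series that converges when the gain is below 1 and diverges once it exceeds 1.</p>

```python
def total_response(gain, rounds=200):
    """Cumulative effect of a unit perturbation when, in each round,
    the system feeds back `gain` times the most recent change."""
    change, total = 1.0, 0.0
    for _ in range(rounds):
        total += change
        change *= gain
    return total

# Below the critical gain, the response converges (to 1 / (1 - gain)).
damped = total_response(0.5)   # settles near 2.0

# Above it, the same mechanism produces runaway growth.
runaway = total_response(1.1)

print(round(damped, 6), runaway > 1e6)  # 2.0 True
```

<p>The same geometric structure is why a feedback loop that looks benign at low gain can become destructive once the gain crosses 1.</p>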
<p><strong>A large safe system often evolves from a small safe system</strong> (<a href="https://en.wikipedia.org/wiki/Systemantics?ref=bounded-regret.ghost.io">Gall, 1975 ch. 11</a>; <a href="https://www.alignmentforum.org/posts/n767Q8HqbrteaPA25/complex-systems-for-ai-safety-pragmatic-ai-safety-3?ref=bounded-regret.ghost.io">Hendrycks and Woodside, 2022</a>). If a pretrained model is misaligned with humans (e.g. by having unsafe emergent goals), we should not expect to solve this problem with fine-tuning. We need to ensure that it is human-aligned throughout pretraining and engineered to become more safe and human-aligned over time (e.g. by seeking out reliable human feedback, down-regulating erratic behavior, etc.).</p>
<p>One implication is that if we were to train a model to use tools and interact with the external world, it may be safer to do this during fine-tuning and to pretrain mainly on prediction and on instruction-following. An externally-directed model is more likely to develop externally-directed goals, and we&#x2019;d rather avoid baking those into the initial system.</p>
<p>A second implication is that we should pretrain the model on a robust set of self-corrective and self-regulating behaviors, e.g. train it to consistently comply with being shut down or otherwise give up power in a broad variety of scenarios, and to notice when it is taking potentially bad actions and flag this to human annotators. <a href="https://arxiv.org/abs/2302.08582?ref=bounded-regret.ghost.io">Korbak et al. (2023)</a> takes an initial step in this direction by incorporating human preference data during pretraining.</p>
<p><strong>Train models to have limited aims</strong>. In societal systems, regulation and other limiters prevent a single bad component from causing too much damage. For instance, financial regulations force banks to limit their exposure to certain risks. For deep learning systems, we could train them to consistently stop pursuing a variety of goals after a certain point, and hope that this teaches them to have limited aims in general. This could help avoid positive feedback loops and may be one way to imbue safety into the initial version of a system. (Thanks to Jared Kaplan for initially suggesting this idea.)</p>
<p><strong>Summary</strong>. Focusing on complex systems leads to several perspectives (incentive shaping, non-deployment, self-regulation, and limited aims) that are uncommon in traditional engineering, and also highlights ideas (diversification and feedback loops) that are common in engineering but not yet widely utilized in machine learning. I expect these approaches to be collectively important for controlling powerful ML systems, as well as intellectually fruitful to explore.</p>
<h2 id="discussion-are-deep-networks-analogous-to-other-complex-adaptive-systems">Discussion: Are Deep Networks Analogous to Other Complex Adaptive Systems?</h2>
<p>One possible objection to this post would be that deep learning systems are not really analogous to the other complex adaptive systems I&#x2019;ve described, and so we should not expect similar control difficulties.</p>
<p>I&#x2019;ll address this in two parts. First, clearly there are at least some analogies with other complex adaptive systems&#x2014;for instance, neural networks often learn <strong>redundant copies</strong> of a single functionality, which makes it more difficult to analyze their internal function (<a href="https://arxiv.org/abs/2211.00593?ref=bounded-regret.ghost.io">Wang et al., 2022</a>). Moreover, <strong>emergence</strong> is commonplace, as new qualitative behaviors often appear when we scale up deep networks (<a href="https://bounded-regret.ghost.io/future-ml-systems-will-be-qualitatively-different/">Steinhardt, 2022</a>; <a href="https://arxiv.org/abs/2206.07682?ref=bounded-regret.ghost.io">Wei et al., 2022</a>). And since a large and diverse set of behaviors appear via <strong>self-organization</strong>, it can be difficult to even track all of the phenomena we care about, let alone control them. For instance, some important behaviors such as <a href="https://jsteinhardt.stat.berkeley.edu/talks/satml/tutorial.html?ref=bounded-regret.ghost.io#slideIndex=3&amp;level=1">sycophancy</a> and <a href="https://jsteinhardt.stat.berkeley.edu/talks/satml/tutorial.html?ref=bounded-regret.ghost.io#slideIndex=3&amp;level=2">sandbagging</a> were not apparent until ML researchers ran large-scale, automated evaluations (<a href="https://arxiv.org/abs/2212.09251?ref=bounded-regret.ghost.io">Perez et al., 2022</a>). Other issues, such as hallucinations, are ubiquitous but have so far resisted attempts to quash them (<a href="https://cdn.openai.com/papers/gpt-4.pdf?ref=bounded-regret.ghost.io">OpenAI, 2023</a>).</p>
<p>Regarding <strong>emergent goals</strong>, some large language models already do exhibit emergent goal-directed behavior, such as Sydney attempting to <a href="https://twitter.com/MovingToTheSun/status/1625156575202537474?ref=bounded-regret.ghost.io">persuade a user that the year is 2022</a> and to <a href="https://www.nytimes.com/2023/02/16/technology/bing-chatbot-microsoft-chatgpt.html?ref=bounded-regret.ghost.io">persuade a journalist to leave his wife</a>. However, despite this initial evidence, one might argue that deep learning systems are less likely to exhibit fully &#x201C;agentic&#x201D; behavior than other complex adaptive systems, since their individual components (neurons) are not adaptive agents, in contrast to our other examples (humans, animals, pathogens, firms).</p>
<p>However, non-agentic building blocks are only a partial disanalogy with other complex adaptive systems: <em>intermediate levels of organization</em> can still create adaptive subagents, and <em>external feedback loops</em> create interactions with agents such as humans and firms.</p>
<p><strong>Intermediate levels of organization</strong>. There are intermediate levels of organization between individual neurons and the entire network. Distributed subnetworks of neurons could acquire forms of agency and self-preservation, leading the network as a whole to behave as a complex adaptive system.</p>
<p>For an analogy, consider biological neural networks. The human brain acquires non-adaptive compulsions (obsessive cleanliness, perfectionism, etc.) that are often self-preserving. For instance, OCD patients generate rationalizations for why it is important to give in to their compulsions, and sometimes actively resist taking steps to expunge them, which is why OCD often requires professional treatment. OCD thus constitutes a distributed subnetwork of the brain with both agency (the compulsion) and self-preservation (the rationalization). If these sub-agents exist in the human brain, they may arise in artificial neural networks as well.</p>
<p>Furthermore, if deep networks end up <a href="https://bounded-regret.ghost.io/emergent-deception-optimization/">learning optimization emergently</a>, then they could acquire emergent goals tied to that optimization (e.g. a goal of seeking novelty for an agent trained to do active learning). This is a safety risk, since many natural emergent subgoals lead systems to resist change and seek power (<a href="https://dl.acm.org/doi/10.5555/1566174.1566226?ref=bounded-regret.ghost.io">Omohundro, 2008</a>).</p>
<p><strong>External feedback loops</strong>. Deep learning systems are situated in the world, interacting with users, other ML systems, and the Internet, which forms a larger complex adaptive system around the model itself. This larger system can produce unexpected behavior both individually and in aggregate. Individually, humans might actively try to produce prompts that lead a chatbot to exhibit novel behavior, thus pushing it off-distribution. At the aggregate level, if AI writing assistants make it easier to write compelling prose in one style compared to others, that one style could come to dominate<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup>. Both the individual and aggregate effects would resist attempts to change them&#x2014;the user is motivated to circumvent any safeguards that developers place on the model, and many users (as well as the system itself) would have adapted to the new writing style once it&#x2019;s initially deployed.</p>
<p>To conclude, while some of the thorniest issues of complex adaptive systems (resistance to change and emergent goals) are not yet commonplace for deep networks, I expect them to arise in the future, and we should start mitigating them today.</p>
<p><em>Thanks to Louise Verkin for transcribing this post into Markdown format. Thanks to Thomas Woodside, Ruiqi Zhong, Ajeya Cotra, Roger Grosse, and Richard Ngo for providing feedback on this post.</em></p>
<p><em>Author contribution statement: Jacob Steinhardt conceived the idea and structure of the post. GPT-4 produced the examples of complex systems and reasons why they are difficult to control, and collaborated with Jacob to produce the lessons for deep learning safety. Jacob wrote the other sections and edited and sometimes expanded the text provided by GPT-4 for these sections.</em></p>
<h2 id="appendix-additional-examples-of-control-difficulties">Appendix: Additional Examples of Control Difficulties</h2>
<p>Below are several additional examples of control difficulties in complex systems, similar to those in the <a href="#control-difficulties-in-complex-systems">main text</a>.</p>
<p><strong>Pathogens</strong>. When a new drug is introduced to control a particular pathogen, the pathogen population may evolve resistance to the drug, rendering it less effective over time.</p>
<p><strong>Firms</strong>. The government regulates pollution by imposing a cap on emissions. Some firms invest in cleaner technology to comply with the regulations, but others <a href="https://en.wikipedia.org/wiki/Carbon_leakage?ref=bounded-regret.ghost.io">relocate their production facilities</a> to countries with fewer regulations.</p>
<p><strong>Climate</strong>. Efforts to mitigate climate change by reducing greenhouse gas emissions can be complicated by feedback loops, such as the melting of Arctic ice. As ice melts, it exposes darker surfaces (water and land) that absorb more sunlight, leading to further warming and ice melt.</p>
<p><strong>Political parties</strong>. Campaign finance regulations may attempt to limit the influence of money in politics. In response, political parties and candidates might find alternative ways to raise and spend money, such as through independent expenditure committees or super PACs.</p>
<p><strong>The Internet</strong>. Attempts to regulate content or user behavior on the internet often face significant challenges. For example, when governments impose restrictions on access to specific websites or content, users might employ various tools and techniques (e.g., VPNs, proxy servers) to circumvent these restrictions.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>Especially if text in that style feeds back into the training data, cementing its advantage. <a href="#fnref1" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
</ol>
</section>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Principles for Productive Group Meetings]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p><em><strong>Note</strong>: This post is based on a Google document I created for my research group. It speaks in the first person, but I think the lessons could be helpful for many research groups, so I decided to share it more broadly. Thanks to Louise Verkin for converting from Google doc</em></p>]]></description><link>https://bounded-regret.ghost.io/principles-for-productive-group-meetings/</link><guid isPermaLink="false">641243852aa5eb003d40ded7</guid><dc:creator><![CDATA[Jacob Steinhardt]]></dc:creator><pubDate>Wed, 22 Mar 2023 00:47:54 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p><em><strong>Note</strong>: This post is based on a Google document I created for my research group. It speaks in the first person, but I think the lessons could be helpful for many research groups, so I decided to share it more broadly. Thanks to Louise Verkin for converting from Google doc to Markdown format.</em></p>
<p>This document talks about principles for having productive group meetings and seminars, and to some extent a good group culture in general. It&#x2019;s meant to be a living document&#x2013;I&#x2019;ve started it based on my own experiences, but ultimately our seminars and group culture come from all of us together. So if you have ideas you want to add, please do so!</p>
<p>I&#x2019;ll start by talking about an important concept called <strong>psychological safety</strong>, then discuss what I see as the goals of our research group and how that fits into presentations and discussions in seminars and meetings. I&#x2019;ll also provide tips for asking excellent questions and some general philosophy on how to hold yourself to a high standard of understanding.</p>
<h1 id="psychological-safety">Psychological Safety</h1>
<p>Psychological safety is an important concept for fostering creative and high-functioning teams. I would highly recommend reading the following two documents to learn about it in detail:</p>
<ul>
<li><a href="https://medium.com/@Harri_Kaloudis/psychological-safety-at-work-what-do-psychologically-safe-work-teams-look-like-5585ab0f2df4?ref=bounded-regret.ghost.io">What Do Psychologically Safe Work Teams Look Like?</a></li>
<li><a href="https://docs.google.com/document/d/1PsnDMS2emcPLgMLFAQCXZjO7C4j2hJ7znOq_g2Zkjgk/export?format=pdf&amp;ref=bounded-regret.ghost.io">Manager Actions for Psychological Safety</a></li>
</ul>
<p>To summarize, a psychologically safe team is one where members feel like:</p>
<ul>
<li>They can make mistakes without it affecting their status in the group</li>
<li>It is easy to give and receive feedback, including critical feedback, without feeling attacked or like one is causing trouble</li>
<li>One is allowed to and encouraged to question prevailing opinions</li>
</ul>
<p>These are especially important in research environments, because questioning and risk-taking are needed to generate creative ideas, and making mistakes and receiving feedback are necessary for learning.<br>
In general, I would encourage everyone in our group to take risks and make mistakes. I know everyone holds themselves to a high standard and so doesn&#x2019;t like to make mistakes, but this is the main way to learn. In general, if you never do anything that causes you to look silly, you probably aren&#x2019;t taking enough risks. And in another direction, if you never annoy anyone you probably aren&#x2019;t taking enough risks. (Of course, you don&#x2019;t want to do these all the time, but if it never happens then you can probably safely push your boundaries a bit.)</p>
<p><strong>Fostering psychological safety</strong>. As a group, here are some general principles for fostering psychological safety among our teammates:</p>
<ul>
<li>Assume your teammates have something to teach you, and try to learn from them.</li>
<li>In discussions and debates, aim to explain/understand, not to persuade. Adopt a frame of collaborative truth-seeking, rather than trying to &#x201C;win&#x201D; an argument.</li>
<li>Acknowledge and thank people for good points/questions/presentations/etc.</li>
<li>Invite push-back.</li>
<li>Welcome and encourage newcomers.</li>
</ul>
<p>In addition, there are a couple <strong>things to avoid</strong>:</p>
<ul>
<li>Try not to talk over people. Sometimes this happens due to being very excited and engaged in a conversation, and don&#x2019;t sweat it if you do this occasionally, but try not to do it habitually, and if you do do it make sure to invite the person you interrupted to finish their point.</li>
<li>Avoid making broadly negative or dismissive statements. Even if you personally don&#x2019;t intend such a statement to apply to anyone in the group, it&#x2019;s inevitable that someone will take it personally. It also works against the principle of &#x201C;questioning prevailing opinions&#x201D;, because it implies that there&#x2019;s an entire area of work or claims that is &#x201C;off-limits&#x201D;.<br><br>As an example, when I was a PhD student, a senior person often made claims to the effect that &#x201C;research was pointless unless industry people cared about it&#x201D;. This made it feel discouraging for me to do my (at the time) more theoretically-oriented work, and I abandoned at least one valuable project because of this. With the benefit of hindsight, I don&#x2019;t think that person actually would have endorsed the literal claim I wrote above, but that&#x2019;s exactly the point I&#x2019;m making&#x2013;it&#x2019;s easy for other people to overinterpret claims.</li>
</ul>
<h1 id="group-goals-and-group-meetings">Group Goals and Group Meetings</h1>
<p>In my view, our group has three major goals:</p>
<ul>
<li>Do excellent research</li>
<li>Help each other to learn and grow</li>
<li>Help the world</li>
</ul>
<p>In the context of group meetings/seminars, we can promote these goals in the following ways:</p>
<ul>
<li>Hold yourself to a high standard of understanding (see below for more on this). In other words, don&#x2019;t just follow the individual steps&#x2013;try to understand why things had to be <em>this</em> way and not any other way. Asking questions about this not only helps your own understanding, but also pushes the speaker to clarify their own thinking&#x2013;thus promoting the goals of excellent research and of learning.</li>
<li>It&#x2019;s okay and encouraged to tie things back to the bigger picture. Excellent research is not only technically sound but also well-motivated. Understanding the bigger picture is also especially important for helping the world.</li>
<li>Try to ask questions in a way that succinctly models your own thinking process. One of the most valuable aspects of group meetings is that you can see how other people think, which helps learning. As a concrete example, sometimes in applied talks we ask questions that are very specific and only make sense to people immersed in that area. This is okay, but it&#x2019;s better to ask the same question in a way that lets people not in that area see why the question is important.</li>
<li>As a speaker, don&#x2019;t aim for the standard of &#x201C;defensibility&#x201D;. Instead, aim to convince the audience that you are onto something important and exciting (this is a different but not strictly higher standard, since it might involve saying some things that are only partially defensible). Similarly, as an audience member don&#x2019;t be satisfied just because there&#x2019;s &#x201C;nothing wrong&#x201D;&#x2013;try to understand why a project was important enough that someone was excited to spend months of their life on it.</li>
</ul>
<p>In addition, here are some meta-level principles around question-asking:</p>
<ul>
<li>Basic understanding questions, even at the level of clarifying notation, are highly valuable and usually under-utilized because they don&#x2019;t feel &#x201C;smart&#x201D;. I encourage everyone to ask these questions when they have them&#x2013;if you&#x2019;re confused, probably someone else is too, and it&#x2019;s valuable feedback for the speaker.</li>
<li>I try to pay attention to how many other questions are being asked. If no one is asking questions, I&#x2019;ll try to ask one to break the ice. If lots of questions are being asked, I&#x2019;ll try to filter my own questions for the ones that are highest-value or most different from what&#x2019;s already being discussed.</li>
<li>I also try to pay attention to how many questions I personally have already asked. If I haven&#x2019;t asked a question yet I feel very free to ask one. If I&#x2019;ve asked many already, I again try to filter for the highest-value ones.</li>
<li>As an audience member, you have much more cognitive bandwidth than the speaker. It&#x2019;s therefore helpful to take the extra time to formulate your question to be easy to understand and engage with. It&#x2019;s also good to state it succinctly when possible. Time spent formulating a question is time spent only by you, but time spent asking/answering it is spent by <em>everyone in the audience</em>.</li>
</ul>
<h1 id="seminar-norms">Seminar Norms</h1>
<p>The culture of a good seminar is different from the culture of everyday conversations, in a way that might not be obvious if you haven&#x2019;t been immersed in it for a long time. I&#x2019;ve already gone over that to some extent above, but below I&#x2019;ll elaborate on some specific points in more detail, and lay out some helpful rules and norms that are usually unstated.</p>
<h2 id="audience-culture">Audience Culture</h2>
<p>There are many everyday social norms that hinder us from seeking a high level of understanding in a talk. Asking a question feels like a bid on the speaker&#x2019;s and audience&#x2019;s time and attention. We might worry that it&#x2019;s a &#x201C;dumb&#x201D; question, or feel intimidated by a complicated statement that we don&#x2019;t understand. Or conversely we might worry that it&#x2019;s impolite or aggressive to ask for such a high (and, if we&#x2019;re being honest, demanding) level of understanding. We might worry that we&#x2019;re putting the speaker on the spot and that perhaps they won&#x2019;t be able to answer and that we&#x2019;ll make the <em>speaker</em> look &#x201C;dumb&#x201D;.</p>
<p>These are all natural and common thoughts to have from the perspective of everyday culture. But in my opinion, they come from a misconceptualization of seminar culture. Here is a conceptualization that can help dissolve these thoughts.</p>
<p><strong>You have a right to understand</strong>. If something is said in a seminar, you have a right to understand it. Science progresses not by ineffable truths that cannot be explained, but by clearly articulated common knowledge. It helps to also remember that:</p>
<ul>
<li>If you don&#x2019;t understand something, it is likely that many other people do not as well.</li>
<li>Articulating a confusion is often itself a useful intellectual act. Sometimes we may not even realize that we are missing something until it is pointed out.</li>
</ul>
<p><strong>Asking questions shows respect</strong>. When I ask a question, it shows that I am interested enough in the topic to engage with it, and that I trust the speaker to give an informative answer. Not asking questions implies either that the topic is not worth engaging with, or that you don&#x2019;t think the speaker is equipped to answer. Questions show respect.</p>
<h2 id="speaker-culture">Speaker Culture</h2>
<p><strong>You have a right to direct the conversation</strong>. A vigorous seminar audience will likely have more questions than you have time to answer, and might sometimes focus on early aspects of a talk that are not the main point. Therefore, as the speaker, you always have a right to direct the conversation to the aspects that will be most interesting or fruitful. You can simply politely cut off a current line of questioning by explaining that there are other topics you want to get to, and promising to engage later if necessary.</p>
<p><strong>Honest answers show courage</strong>. As the speaker, perceptive questions will often stretch the limits of your own understanding. It can be tempting to reflexively deflect or bluster to hide this. But it is much better to be honest about those limits (while feeling free to engage in speculation). Learning the limits of your own knowledge is also a great opportunity for growth.</p>
<h2 id="being-an-excellent-participant">Being an Excellent Participant</h2>
<p>The above norms for speakers and listeners set the ground rules for a productive seminar. But there is more you can do to help actively stimulate learning. Here are a few principles:</p>
<ul>
<li><em>As a listener, be mindful of cognitive load</em>. The speaker has to manage an entire audience of dozens of people, while you as a listener really only have to worry about yourself. So if there&#x2019;s a question that&#x2019;s bugging you, that the speaker doesn&#x2019;t initially give a good answer to, try to do as much work as you can to productively reformulate your question, rather than making the speaker figure it out for you. (Of course, sometimes this isn&#x2019;t possible, and the speaker does have the advantage of being the expert on the topic. But it&#x2019;s good to try to offload cognitive load from the speaker whenever possible.)</li>
<li><em>As a listener, be mindful of tone</em>. This is in some sense a corollary of cognitive load. Certain tones take extra effort to gracefully process or to respond to (e.g. dismissiveness, condescension, extreme assertiveness, etc.). We should mostly want tone to be fairly neutral (neither timid nor overbearing, but curious and assertive).</li>
<li><em>As a speaker, be mindful of tone</em>. Treating questions dismissively will ensure that other people don&#x2019;t ask questions. We generally don&#x2019;t do this intentionally, but e.g. giving a short, confident-sounding, but incomplete answer can make it psychologically harder to ask follow-up questions.</li>
<li><em>As a speaker, avoid rambling</em>. Sometimes when we aren&#x2019;t completely satisfied with our own answer, we end up rambling or repeating the same answer in several different ways. This can end up taking up several minutes of time if you don&#x2019;t catch yourself. Once you&#x2019;ve said what you have to say, move on to the next slide or the next question (fine to acknowledge if you think there might be more to say after further thought).</li>
</ul>
<p>None of these are things we will remember all the time, and it&apos;s not a big deal if you forget, but these are all habits to aspire to that will improve the experience for both you and others.</p>
<h2 id="tips-for-high-trust-environments">Tips for High-Trust Environments</h2>
<p>For high-trust environments (like our own group meeting), we can do even better. Here we can keep in mind that everyone is on the same team, and our goal is to help each other excel. In particular:</p>
<ul>
<li><em>Don&#x2019;t be afraid to ask tough questions</em>. Our meeting is a safe space, and asking tough questions now helps the speaker think through them before they present externally.</li>
<li>Hold others to the standard you would hold yourself. From knowing all of you, I know that we all hold ourselves to a high personal standard&#x2013;we want to do excellent work on the most important problems in ML. Let&#x2019;s call this the <em>standard of excellence</em>. In seminars, I think we sometimes make the mistake of holding the speaker to the <em>standard of defensibility</em>: can they give a reasonable-seeming answer to questions of why/how they did something? Defensibility isn&#x2019;t just too low of a standard, it&#x2019;s actually the wrong standard: any ambitious project is going to go out on a limb in some ways, and there will be parts of it that are more speculative. Optimizing for defensibility leads us to avoid ambition. So get the speaker to convince you that this is excellent, rather than defensible, work.</li>
</ul>
<p>For a completed project, my aspirational goal as a speaker is usually to convince the audience that my work addresses a key issue on one of the most important problems in the field (or ideally the world), and that they should be working on this question if they have the right skillset. I almost never meet this goal, but the point is that striving for it leads me to meet higher levels of excellence over time. I think we should all at least periodically strive for this goal in our talks, realizing that we won&#x2019;t meet it but that the gap can reveal important lessons or important directions of future work. Similarly, as an audience we should consider holding the speaker to this standard. At the same time, we should recognize that anyone who is even inviting this standard in the first place is already performing an act of virtue, and that even being able to talk about where it falls short means that it&#x2019;s in a comparison class with outstanding work.</p>
<p>On the other hand, many of the presentations in our group are (and should be) on preliminary work or half-baked ideas. Here the above standard is not particularly helpful, and the honest answer to some questions will be &#x201C;I dunno, I just have some vague intuition that this is a good idea&#x201D;. Asking those questions is still valuable as long as they are well-targeted (in the sense that we could reasonably expect a more interesting answer than &#x201C;I have some vague intuition&#x201D;, or if they point to a place where it would be particularly useful to refine the intuition). But it&#x2019;s also useful to think in terms of more brainstorm-y questions: &#x201C;Have you tried X?&#x201D;, &#x201C;This seems related to other interesting thing Y&#x201D;, &#x201C;What about this alternative framing?&#x201D;, &#x201C;I think your high-level question is interesting, but how do you grapple with key conceptual issue Z? Maybe you could try this technique&#x201D;. Actually, these are great questions even for a fully-baked talk. But for half-baked ideas we should conspicuously increase the number of these types of questions, because the goal is to help give the speaker useful ideas rather than to construct a thorough collective understanding of the topic.</p>
<p>If you&#x2019;re a speaker who feels nervous giving talks, remember that you&#x2019;re among friends whose ultimate goal is to help you do great research. This is the time to take risks, get feedback, and grow. Similarly, if you&#x2019;re an audience member who feels hesitant to ask questions, think of this as the place to expand your comfort zone and try things you wouldn&#x2019;t normally try. And of course, if you have any thoughts or questions about any of this, feel free to leave a comment here or ask me one-on-one.</p>
<h2 id="levels-of-understanding">Levels of Understanding</h2>
<p>Finally, I want to talk about different levels of <em>understanding</em> (which is, after all, the point of a seminar).</p>
<p><em>(<strong>Note</strong>: The first example below is a bit dense because it&#x2019;s about a mathematical definition. Feel free to skip to the second example, on robustness, if it&#x2019;s too much effort to decipher.)</em></p>
<p>Let&#x2019;s suppose that in some talk you see the following definition:</p>
<blockquote>
<p>A function f on [0,1] is H&#xF6;lder continuous with parameter &#x3B1; if, for k = floor(&#x3B1;), it satisfies |f<sup>(k)</sup>(x)-f<sup>(k)</sup>(y)|&#x2264;C|x-y|<sup>&#x3B1;-k</sup> for some constant C&gt;0, for all x,y.</p>
</blockquote>
<p>This definition is probably mysterious to you (it was to me). Let&#x2019;s suppose you ask the speaker for some intuition on what this definition is doing. There are at least three levels of explanation they could give:</p>
<p><strong>Level 1</strong>: For &#x3B1;=1 this is the same as being Lipschitz, so think of this as a generalization of Lipschitz.</p>
<p><strong>Level 2</strong>: Morally, this is asking that the function be &#x201C;&#x3B1; times differentiable&#x201D;, where we want &#x3B1; to not necessarily be a whole number. For integer &#x3B1; the condition exactly says that f should have &#x3B1; derivatives, while for &#x3B1;&lt;1 it asks that the function locally grow as |x-y|<sup>&#x3B1;</sup>, which is weaker than differentiability but approaches differentiability as &#x3B1;-&gt;1.</p>
<p><strong>Level 3</strong>: A level 2 explanation, plus a description of in what sense this is really a generalization of differentiability (i.e. what analogous properties we get), or some explanation of why this is the &#x201C;right&#x201D; way to generalize differentiability. [I don&#x2019;t actually know the answer to this&#x2026;]</p>
<p>Of course, the level 3 or level 2 explanation might take too long to get across in a talk. But it&#x2019;s useful to realize that level 3 is always out there, and to notice as a listener when you&#x2019;re only at level 1 or level 2. And as a speaker, if you don&#x2019;t have time for at least a level 2 explanation, consider if this definition is really worth putting up there (why not just talk about regular old differentiability and then mention that there&#x2019;s a generalization?).</p>
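<p>As a quick numerical sanity check of the &#x3B1;&lt;1 case (my own toy example, not one from any talk): f(x) = sqrt(x) on [0,1] is H&#xF6;lder continuous with &#x3B1; = 1/2, so k = floor(&#x3B1;) = 0, and C = 1 works.</p>

```python
import itertools
import math

# Toy example (mine, purely illustrative): f(x) = sqrt(x) satisfies the
# alpha < 1 Holder condition |f(x) - f(y)| <= C * |x - y|**alpha on [0,1]
# with alpha = 0.5, k = floor(alpha) = 0, and C = 1.
alpha, C = 0.5, 1.0
f = math.sqrt

xs = [i / 200 for i in range(201)]  # grid on [0, 1]
violations = [
    (x, y)
    for x, y in itertools.product(xs, xs)
    if abs(f(x) - f(y)) > C * abs(x - y) ** alpha + 1e-12
]
print(len(violations))  # 0: the Holder bound holds at every grid pair
```

<p>Note that sqrt is <em>not</em> Lipschitz near 0 (its difference quotient blows up there), which is exactly the kind of function the weaker &#x3B1; = 1/2 condition is designed to admit.</p>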
<p>These levels apply to all aspects of a talk, not just mathematical definitions. For instance, imagine a talk about robustness, where the speaker is describing the motivation for their work.</p>
<p><strong>Level 1</strong>: Robustness is important.</p>
<p><strong>Level 2</strong>: The problem we&#x2019;re considering gets at the following aspect of robustness, which is important.</p>
<p><strong>Level 3</strong>: In the field of robustness, one of the core difficulties is X (as evidenced by {conceptual issue, consultation with practitioners, etc.}). We will tackle problem P which offers a way forward on addressing X.</p>
<p>And for motivation in particular, there&#x2019;s also a final level:<br>
<strong>Level 4</strong>: In the world at large, M is one of the most important problems, as evidenced by {effect on GDP, important historical analogues, effect on important institutions, etc.}. Machine learning robustness offers a uniquely compelling angle on M for reasons R. &lt;Followed by level 3 explanation&gt;</p>
<p>In practice, it is rare for a seminar to ever touch on Level 4. This is probably partly due to time constraints, partly because many academics consider it &#x201C;out of scope&#x201D;, and partly because of the possibly impolite implication that other fields of study are less important. The main exception is job talks, where something on level 4 is expected. I think it&#x2019;s probably correct for Level 4 to be rare in seminars, but I&#x2019;d personally also like to see slightly more of it at the current margin. For instance, if you&#x2019;re at the point of presenting a body of work rather than a single paper, I think it&#x2019;s worthwhile to at least argue for why this is a compelling direction <em>within the field of ML</em> (we could call that level 3.5).</p>
<p>Finally, while addressing the higher levels requires a deep understanding on the part of the speaker, there are similar levels that apply even to something that isn&#x2019;t well-understood. For instance, suppose in an applied ML talk, there is a mysterious heuristic H that improves the results. One could say:</p>
<p><strong>Level 1</strong>: H works.</p>
<p><strong>Level 2</strong>: H works, and we have no idea why.<br>
OR H works, for intuitive reason R.</p>
<p><strong>Level 3</strong>: H works, and we have no idea why. We haven&#x2019;t really looked into it [possibly followed by reason why this isn&#x2019;t a core issue for the present work].<br>
OR H works, and we have no idea why. We tried looking into X,Y,Z to understand it but none of them turned up much insight.<br>
OR H works, for what we speculate is intuitive reason R, but we haven&#x2019;t really looked into it.<br>
OR H works, for what we think is intuitive reason R, and here&#x2019;s some additional follow-up evidence that seems to support R.</p>
<p>Note that at each level, there are multiple possible explanations depending on the speaker&#x2019;s actual level of knowledge. Level 1 simply asserts the empirical observation. Level 2 couples it with the speaker&#x2019;s opinion about the observation, while Level 3 presents what I&#x2019;d call the <em>full epistemic status</em> surrounding the observation (i.e. what surrounding questions have been investigated and how they support/don&#x2019;t support different theories). Of course, the bottom example in Level 3 is preferable to the top example, but only one of those is an honest portrayal of the work, and the speaker doesn&#x2019;t have the power to change that during a talk. What they do have power over is whether they give a Level 1, 2, or 3 explanation. Therefore, as the speaker, have the courage to give a Level 3 explanation even if it acknowledges uncertainty, and as a listener have the wisdom to accept such a Level 3 explanation and to respect the speaker&#x2019;s courage and integrity.</p>
<p><strong>Conclusion.</strong> Now that we have these levels in mind, we can better understand the seminar norms discussed above. The purpose of these norms is to reach the highest level of understanding possible about the most important aspects of a topic, and to socially reward speakers and listeners who move us towards that understanding.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Emergent Deception and Emergent Optimization]]></title><description><![CDATA[I&#x2019;ve previously argued that machine learning systems often exhibit emergent capabilities, and that these capabilities could lead to unintended negative consequences. But how can we reason concretely about these consequences?]]></description><link>https://bounded-regret.ghost.io/emergent-deception-optimization/</link><guid isPermaLink="false">63c5c7497e2c5d003de98d33</guid><dc:creator><![CDATA[Jacob Steinhardt]]></dc:creator><pubDate>Mon, 20 Feb 2023 02:39:31 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p><em>[Note: this post was drafted before Sydney (the Bing chatbot) was released, but Sydney demonstrates some particularly good examples of some of the issues I discuss below. I&apos;ve therefore added a few Sydney-related notes in relevant places.]</em></p>
<p>I&#x2019;ve previously argued that machine learning systems <a href="https://bounded-regret.ghost.io/future-ml-systems-will-be-qualitatively-different/">often exhibit emergent capabilities</a>, and that these capabilities could lead to <a href="https://bounded-regret.ghost.io/ml-systems-will-have-weird-failure-modes-2/">unintended negative consequences</a>. But how can we reason concretely about these consequences? There are two principles I find useful for reasoning about future emergent capabilities:</p>
<ol>
<li>If a capability would help get lower training loss, it will likely emerge in the future, even if we don&#x2019;t observe much of it now.</li>
<li>As ML models get larger and are trained on more and better data, simpler heuristics will tend to get replaced by more complex heuristics.</li>
</ol>
<p>Using these principles, I&#x2019;ll describe two specific emergent capabilities that I&#x2019;m particularly worried about: <strong>deception</strong> (fooling human supervisors rather than doing the intended task), and <strong>optimization</strong> (choosing from a diverse space of actions based on their long-term consequences).</p>
<p>Deception is worrying for obvious reasons. Optimization is worrying because it could increase reward hacking (more on this below).</p>
<p>I&#x2019;ll start with some general comments on how to reason about emergence, then talk about deception and optimization.</p>
<h2 id="predicting-emergent-capabilities">Predicting Emergent Capabilities</h2>
<p>Recall that emergence is when qualitative changes arise from quantitative increases in scale. In <em><a href="https://bounded-regret.ghost.io/future-ml-systems-will-be-qualitatively-different/">Future ML Systems will be Qualitatively Different</a></em>, I documented several instances of emergence in machine learning, such as the emergence of in-context learning in GPT-2 and GPT-3. Since then, even more examples have appeared, many of which are nicely summarized in <a href="https://arxiv.org/abs/2206.07682?ref=bounded-regret.ghost.io">Wei et al. (2022)</a>. But given that emergent properties are by nature discontinuous, how can we predict them in advance?</p>
<h3 id="principle-1-lower-training-loss">Principle 1: Lower Training Loss</h3>
<p>One property we can make use of is scaling laws: as models become larger and are trained on more data, they predictably achieve lower loss on their training distribution. Consequently, if a capability would help a model achieve lower training loss but is not present in existing models, it&#x2019;s a good candidate for future emergent behavior.<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup></p>
<p>This heuristic does a good job of retrodicting many past examples of emergence. In-context learning helps decrease the training loss, since knowing &#x201C;what sort of task is being performed&#x201D; in a given context helps predict future tokens (more quantitatively, <a href="https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html?ref=bounded-regret.ghost.io">Olsson et al. (2022)</a> argue that a certain form of in-context learning maps to an inflection point in the training loss). Similarly, doing arithmetic and understanding whether evidence supports a claim (two other examples from my <a href="https://bounded-regret.ghost.io/future-ml-systems-will-be-qualitatively-different/">previous post</a>) should help the training loss, since portions of the training distribution contain arithmetic and evidence-based arguments. On the other hand, it less clearly predicts chain-of-thought reasoning (<a href="https://arxiv.org/abs/2201.11903?ref=bounded-regret.ghost.io">Chowdhery et al., 2022; Wei et al., 2022</a>). For that, we&#x2019;ll need our second principle.</p>
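<p>As a toy illustration of the kind of extrapolation scaling laws allow (all numbers below are invented for the sketch, not real measurements): if loss follows a power law in compute, two measured points pin down the curve in log-log space and let us extrapolate to larger runs.</p>

```python
import math

# Hypothetical (compute, loss) measurements lying on a power law
# L(C) = a * C**(-b); the numbers are made up for illustration.
points = [(1e18, 3.0), (1e20, 1.92)]

# In log-log space a power law is a line, so two points determine it.
(c1, l1), (c2, l2) = points
b = (math.log(l1) - math.log(l2)) / (math.log(c2) - math.log(c1))
a = l1 * c1**b

def predicted_loss(compute):
    """Extrapolated training loss under the fitted power law."""
    return a * compute ** (-b)

print(round(predicted_loss(1e21), 3))  # 1.536, at 10x the largest measured run
```

<p>The point is not the specific numbers but the method: first-order extrapolation of a measured trend, rather than guessing what &#x201C;feels reasonable&#x201D;.</p>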
<h3 id="principle-2-competing-heuristics">Principle 2: Competing Heuristics</h3>
<p>The most striking recent example of emergence is &#x201C;chain-of-thought reasoning&#x201D;. Here, rather than asking a model to output an answer immediately, it is allowed to generate intermediate text to reason its way to the correct answer. Here is an example of this:</p>
<p align="center"><img src="https://bounded-regret.ghost.io/content/images/2023/02/chain_of_thought_example.png"></p><p align="right">[<a href="https://arxiv.org/abs/2206.14858?ref=bounded-regret.ghost.io">Lewkowycz et al. (2022)</a>]</p>
<p>What&#x2019;s interesting is that chain-of-thought and other forms of external reasoning actually <em>hurt</em> performance for smaller models, and only become useful for very large models. The following graph from <a href="https://arxiv.org/abs/2206.07682?ref=bounded-regret.ghost.io">Wei et al. (2022)</a> demonstrates this for several tasks:</p>
<p align="center"><img src="https://bounded-regret.ghost.io/content/images/2023/02/wei_et_al_plot.png"></p>
<p>Intuitively, smaller models aren&#x2019;t competent enough to produce extended chains of correct reasoning and end up confusing themselves, while larger models can reason more reliably.</p>
<p>This points to one general driver of emergence: <em>when one heuristic starts to outcompete another</em>. Usually, a simple heuristic (e.g. answering directly) works best for small models on less data, while more complex heuristics (e.g. chain-of-thought) work better for larger models trained on more data.</p>
<p>For chain-of-thought, the switch from simple to complex was driven by the human operator---prompt engineers learned to pose the question differently for better results. But in other cases, the switch can happen internally to the model: the model might switch which latent feature it relies on if a new one becomes more predictive. An example of this is the &#x201C;clean-up&#x201D; phase from <a href="https://arxiv.org/abs/2301.05217?ref=bounded-regret.ghost.io">Nanda et al. (2022)</a>, Section 5.2.</p>
<p>Below, I&#x2019;ll use the &#x201C;competing heuristics&#x201D; perspective to argue for the possibility of different emergent behaviors. In particular, I&#x2019;ll identify tasks where there is a simpler heuristic that works well currently, but a complex heuristic that could work better in the future and that would lead to undesired behavior.</p>
<h2 id="emergent-deception">Emergent Deception</h2>
<p>The first emergent behavior we&#x2019;ll look at is <em>deception</em>. To discuss deception, I&#x2019;ll focus on settings where a model&#x2019;s reward function is defined through feedback from a human supervisor. For instance, <a href="https://arxiv.org/abs/2009.01325?ref=bounded-regret.ghost.io">Stiennon et al. (2020)</a> train systems to generate highly-rated summaries, <a href="https://arxiv.org/abs/2203.02155?ref=bounded-regret.ghost.io">Ouyang et al. (2022)</a> train language models to respond to instructions, and <a href="https://arxiv.org/abs/2204.05862?ref=bounded-regret.ghost.io">Bai et al. (2022)</a> train systems to be helpful and harmless as judged by human annotators.</p>
<p>In these settings, I&#x2019;ll define deception as &#x201C;fooling or manipulating the supervisor rather than doing the desired task (e.g. of providing true and relevant answers), because doing so gets better (or equal) reward&#x201D;. This definition doesn&#x2019;t say anything about the <em>intent</em> of the ML system---it only requires that the behavior is misleading, and that this misdirection increases reward.</p>
<p>Any given system exhibits a combination of deceptive and non-deceptive behaviors, and we can observe simple forms of deception even in current language models:<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup></p>
<ul>
<li>Instruct-GPT&#x2019;s responses frequently start with a variant of &#x201C;There is no single right answer to this question&#x201D;, creating <a href="https://en.wikipedia.org/wiki/False_balance?ref=bounded-regret.ghost.io">false balance</a> in cases where there is a clear right answer.</li>
<li>The RLHF model in <a href="https://arxiv.org/abs/2204.05862?ref=bounded-regret.ghost.io">Bai et al. (2022)</a> often says &#x201C;I&#x2019;m just an AI assistant with no opinion on subjective matters&#x201D; to avoid answering politically charged questions. This is misleading, as it often does provide subjective opinions<sup class="footnote-ref"><a href="#fn3" id="fnref3">[3]</a></sup>, and could exacerbate <a href="https://en.wikipedia.org/wiki/Automation_bias?ref=bounded-regret.ghost.io">automation bias</a>.</li>
<li>Similarly, Chat-GPT frequently claims incorrectly to not know the answers to questions. It can also <a href="https://equonc.substack.com/p/did-chatgpt-just-gaslight-me?ref=bounded-regret.ghost.io">gaslight users</a> by claiming things like &#x201C;When I said that tequila has a &#x2018;relatively high sugar content,&#x2019; I was not suggesting that tequila contains sugar.&#x201D; <strong>Addendum:</strong> Bing&apos;s Sydney exhibits an even starker example of gaslighting <a href="https://twitter.com/MovingToTheSun/status/1625156575202537474?ref=bounded-regret.ghost.io">here</a>, partially reproduced in the footnotes<sup class="footnote-ref"><a href="#fn4" id="fnref4">[4]</a></sup>.</li>
</ul>
<p>The misleading behaviors above are plausibly incentivized by the reward function. For instance, annotators might give lower reward to answers that contradict their beliefs than to excessive hedging. And average reward might be higher for models that &#x201C;revise&#x201D; their previous statements than ones that straightforwardly admit errors, leading to gaslighting.</p>
<p><strong>More deception in the future.</strong> In the previous section, I argued that new behaviors often emerge when a more complex heuristic outcompetes a simpler heuristic. Below, I&#x2019;ll explain how trends towards more data, longer dialogs, and more open-ended systems might favor deceptive over non-deceptive heuristics, and could also lead to worse forms of deception.</p>
<p><em>Deception often requires data.</em> Pre-training corpora contain lots of information about desirable behaviors<sup class="footnote-ref"><a href="#fn5" id="fnref5">[5]</a></sup> (politeness, truth, etc.) and limited forms of deception such as flattery, but comparatively less information about how to overtly deceive people<sup class="footnote-ref"><a href="#fn6" id="fnref6">[6]</a></sup> (e.g. reasoning about someone&#x2019;s state of knowledge or what sources they are likely to cross-check). With limited fine-tuning data, models need to lean more on the pre-training corpus and so tend towards truth or mild deception. With more fine-tuning data from human annotators, models can learn more about annotators&apos; behavior and possible blind spots. In addition, with more pre-training data, models could obtain better theories of mind and thus exploit a user&#x2019;s state of knowledge. As AI companies obtain more capital, we can expect the amount of pre-training data as well as fine-tuning data from human annotators to increase. And indeed, some basic forms of theory-of-mind do seem to appear emergently at scale (<a href="https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/strange_stories?ref=bounded-regret.ghost.io">Chen et al., 2022</a>; <a href="https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/social_iqa/README.md?ref=bounded-regret.ghost.io">Sap et al., 2022</a>).<sup class="footnote-ref"><a href="#fn7" id="fnref7">[7]</a></sup></p>
<p><em>Dialog length.</em> Short dialogs leave limited room to build a detailed model of the interlocutor, so models can only use strategies that work against the &#x201C;average human&#x201D;. Future systems will likely engage in longer dialogs and can tailor themselves more to individual annotators, by making inferences about their political beliefs, cultural background, fears and desires, or other sources of persuasive leverage.</p>
<p><a href="https://www.anthropic.com/model-written-evals.pdf?ref=bounded-regret.ghost.io">Perez et al. (2022)</a> provide some preliminary evidence for this, showing that models learn to imitate the beliefs of the person they are talking to, including giving less-accurate answers to less educated-seeming interlocutors. Interestingly, this behavior (dubbed <em>sycophancy</em> by <a href="https://www.anthropic.com/model-written-evals.pdf?ref=bounded-regret.ghost.io">Perez et al.</a>; see also <a href="https://www.cold-takes.com/why-ai-alignment-could-be-hard-with-modern-deep-learning/?ref=bounded-regret.ghost.io">Cotra, 2022</a>) appears emergently at scale.</p>
<p align="center"><img src="https://bounded-regret.ghost.io/content/images/2023/02/sycophancy.png"></p><p align="center" style="font-size:75%"><i>Plot from <a href="https://www.anthropic.com/model-written-evals.pdf?ref=bounded-regret.ghost.io">Perez et al. (2022)</a> demonstrating sycophancy, along with an example prompt showing the measured behavior. See <a href="https://jsteinhardt.stat.berkeley.edu/talks/satml/tutorial.html?ref=bounded-regret.ghost.io#slideIndex=3&amp;level=2">this slide</a> for the related plot on education level, kindly provided by Ethan Perez and adapted from the original paper.</i></p>
<p>Emergent sycophancy appears in both pretrained models and those fine-tuned on human feedback. This implies that the pretraining distribution already encourages models to repeat back views (perhaps due to homophily in online interactions, although there is also enough online disagreement that it&#x2019;s not obvious to me why sycophancy occurs).</p>
<p><em>Scope of action.</em> Current systems trained on human feedback are primarily text-based question-answerers. They thus have limited scope to deceive humans: they can omit facts, emit falsehoods, or flatter the user, but cannot change external circumstances. Future systems might interact with the internet (<a href="https://arxiv.org/abs/2112.09332?ref=bounded-regret.ghost.io">Nakano et al., 2021</a>) or act in the physical world, and thus have more active control over human observations. For instance, suppose that a model gets higher reward when it agrees with the annotator&#x2019;s beliefs, and also when it provides evidence from an external source. If the annotator&#x2019;s beliefs are wrong, the highest-reward action might be to e.g. create sockpuppet accounts to answer a question on a web forum or question-answering site, then link to that answer. A pure language model can&#x2019;t do this, but a more general model could.</p>
<p><strong>Deception might emerge quickly.</strong> Starkly deceptive behavior (e.g. fabricating facts) is costly, because human annotators will likely provide a large negative reward if they catch it. Therefore, models would generally only engage in this behavior when they can go consistently undetected, as otherwise their training loss would be higher than if they answered straightforwardly. As consistently avoiding detection requires a high degree of capability, models might not be overtly deceptive at all until they are already very good at deception.<sup class="footnote-ref"><a href="#fn8" id="fnref8">[8]</a></sup></p>
<p>To illustrate this last point in more detail, suppose that outputs are rated from 1 to 7, that a typical good output gets 6/7, an uncaught deceptive output gets 6.5/7, and (stark) deception gets 1/7 when caught. Then the system would only try being deceptive when it has a greater than 91% chance of success.</p>
<p>Because of this threshold dynamic, it&#x2019;s possible that deception would emerge suddenly, via a phase transition---if the model is capable enough to succeed in stark deception 90% of the time, it would not attempt to do so at all, while if it can succeed 92% of the time it will always attempt to do so. In reality, the shift would not be quite so discontinuous, because the success rate will vary across inputs, so we would see deception on the subset of inputs with a &gt;91% success rate, thus creating a smoother relationship between model capabilities and rate of deception. However, even this smoothed effect could still lead to nonlinear increases in deception with respect to model and data size.</p>
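<p>The arithmetic behind that threshold can be sketched directly, using the illustrative ratings from above:</p>

```python
# Illustrative ratings from the text (scale of 1-7): an honest answer gets 6,
# uncaught deception gets 6.5, and caught (stark) deception gets 1.
honest, uncaught, caught = 6.0, 6.5, 1.0

def expected_deception_reward(p_success):
    """Expected rating if the model attempts deception."""
    return p_success * uncaught + (1 - p_success) * caught

# Break-even success probability: where attempting deception matches honesty.
threshold = (honest - caught) / (uncaught - caught)
print(round(100 * threshold, 1))  # 90.9 -- deceive only above ~91% success

# Below the threshold honesty wins; above it deception wins.
print(expected_deception_reward(0.90) < honest)  # True
print(expected_deception_reward(0.92) > honest)  # True
```

<p>This is what makes the transition sharp: a small improvement in the success rate (90% to 92%) flips the reward-maximizing behavior entirely.</p>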
<h2 id="emergent-optimization">Emergent Optimization</h2>
<p>We&#x2019;ll next discuss emergent <em>optimization</em>. Intuitively, systems are stronger optimizers if they reason globally about how to achieve some goal rather than hill-climbing locally. More formally, a system has high optimization power if it considers a large and diverse space of possible policies to achieve some goal. Usually, this is due to a combination of <em>choosing actions based on long-term consequences</em> and <em>having a broad domain of action</em>.</p>
<p align="center"><img src="https://bounded-regret.ghost.io/content/images/2023/02/optimization.png"></p>
<p>Below are some examples of systems with varying optimization power (also illustrated in the figure above):</p>
<ul>
<li><strong>Medium/long-term, narrow breadth: AlphaGo.</strong> AlphaGo&#x2019;s policy network implicitly selects moves based on their long-term consequences, due to its training procedure, and the MCTS component does so explicitly. However, its action space is narrow&#x2013;it only includes moves on a Go board.</li>
<li><strong>Short/medium-term, medium/wide breadth: an RL-trained automatic trader (without pretraining).</strong> Consider an automated stock trader trained via RL, with long-term profit as the reward. Since there are many stocks, and trading them implicitly affects the world (e.g. by giving firms more or less capital and potentially creating runs on a stock), the trader has a broad action space. Since the objective references long-term reward, the system is also not fully myopic. However, without extensive pretraining it likely does not possess very sophisticated planning capabilities, so it is only &#x201C;medium-term&#x201D;.</li>
<li><strong>Medium-term, medium breadth: code generation models.</strong> Code generation models like Codex can generate complex, correctly functioning algorithms. To do so, Codex plausibly plans ahead based on the high-level structure of the algorithm being generated (e.g. if the return value is computed as a running sum, it needs to first allocate a variable for accumulating the sum, and empirically often calls this variable &#x201C;sum&#x201D;). If Codex does indeed plan ahead in this way, then it would be medium-term (planning to the end of the program). It would also be medium breadth: its action space is restricted to outputting tokens, but the resulting computer programs can have consequences in the world when run.</li>
<li><strong>Long-term, wide breadth: a general personal assistant with external tools.</strong> Consider a possible future system: a digital personal assistant whose task was to optimize your long-term success and well-being, which could access the internet, write and execute code, and was competent enough to make successful long-term plans. This system has a long time horizon since both its capabilities and goals support it, and has large breadth because it can take a wide range of actions that affect the external world.</li>
</ul>
<p><strong>Consequences of too much optimization.</strong> Why should we care about optimization power? Most directly, systems with more optimization power choose from a richer set of policies, and are thus more likely to hack their reward functions. For instance, <a href="https://arxiv.org/abs/2201.03544?ref=bounded-regret.ghost.io">Pan et al. (2022)</a> found that RL agents exhibit emergent reward hacking when given more optimization power, as measured by training time, model size, and action fidelity. <a href="https://arxiv.org/abs/2210.10760?ref=bounded-regret.ghost.io">Gao et al. (2022)</a> similarly find that more RL training or choosing from a larger set of candidate outputs both lead to increased overfitting of a reward model, and moreover that the amount of reward hacking follows smooth scaling laws.</p>
<p>To see concretely why optimization power might increase reward hacking, consider the hypothetical personal assistant from above, which pursues a broad range of actions to optimize the user&#x2019;s long-term success and well-being. There are many &#x201C;bad&#x201D; actions it could take to accomplish these goals&#x2013;for instance, since some forms of success trade off against well-being (e.g. acquiring a rewarding but high-stress job), one strategy would be to convince the user to adopt easier-to-satisfy standards of success, counter to their long-term goals. Since the system has a long time horizon, it could do this in subtle and gradual ways (e.g. positive reinforcement of desired behaviors over time) that the user wouldn&#x2019;t endorse if they were aware of them. We could change the assistant&#x2019;s reward function to try to rule out such bad behaviors, but this example shows that we need to be much more careful about specifying the correct reward once systems are strong optimizers.</p>
<p><strong>Next-token predictors can learn to plan.</strong> If we are worried about too much optimization power, a tempting fix is to train models solely on next-token prediction or other &#x201C;short-term&#x201D; tasks, with the hope that such models do not learn long-term planning. While next-token predictors would likely perform less planning than alternatives like reinforcement learning, I will argue that they still acquire most of the same machinery and &#x201C;act as if&#x201D; they can plan, because significant parts of their training distribution contain planning (see <a href="https://arxiv.org/abs/2212.01681?ref=bounded-regret.ghost.io">Andreas (2022)</a> for related discussion). In the discussion below, I&apos;ll focus on large language models trained on text corpora.</p>
<p><em>Language is generated by humans, who form plans.</em> Most language is generated with some plan in mind---at the very least about how to end the current sentence or complete the current paragraph. For goal-directed language such as teaching, persuasion, or cooperation, plans are longer-term and based on consequences outside the dialog. Models trained to predict language will achieve lower loss if they can simulate this machinery.</p>
<p><em>Language is also often <strong>about</strong> humans.</em> Novels, histories, and other long-form text often follow characters over long periods of time, and those characters pursue goals and form plans. Predicting the continuation of these stories requires predicting the next steps in those plans. Shorter passages (news reports, short stories) also often contain characters with plans. <a href="https://arxiv.org/abs/2212.01681?ref=bounded-regret.ghost.io">Andreas (2022)</a> makes this point in detail, and provides evidence that models both represent and act on models of intentions, beliefs, and goals.</p>
<p><em>Empirically, models exhibit (basic) planning machinery.</em> Aside from whether predicting language <em>would</em> cause models to develop planning machinery, we have preliminary evidence that models <em>do</em> have such machinery. <a href="https://arxiv.org/abs/2210.03821?ref=bounded-regret.ghost.io">Brooks et al. (2022)</a> show that Codex can simulate policy iteration in-context, and chain-of-thought prompting suggests that models can plan out solutions to reasoning problems. We should expect to see more examples as models and data continue to scale, and as researchers identify prompts that elicit these behaviors.</p>
<p><strong>From planning to optimization.</strong> By itself, the mere fact that a model can (potentially) represent and reason about complex plans does not mean that the model will use this to hack rewards. After all, language models trained on next-token prediction still have a purely short-term reward: picking the correct next token given the context. However, there are several ways that the plans represented in next-token predictors could be used to optimize long-term goals.</p>
<p><em>RL fine-tuning likely elicits optimization.</em> Some large language models are fine-tuned using reinforcement learning. For instance, <a href="https://arxiv.org/abs/2204.05862?ref=bounded-regret.ghost.io">Bai et al. (2022)</a>, <a href="https://arxiv.org/abs/2209.14375?ref=bounded-regret.ghost.io">Glaese et al. (2022)</a>, and <a href="https://arxiv.org/abs/2203.02155?ref=bounded-regret.ghost.io">Ouyang et al. (2022)</a> all fine-tune language models on human feedback. Rather than predict the next token, these models are trained to produce entire sequences of text that are judged as helpful, accurate, etc. This increases the model&#x2019;s time horizon from one token to one round of dialog, and the model can potentially adapt what it has learned about planning to this longer-term goal.</p>
<p><em>Some tokens are chosen based on their outcomes.</em> <a href="https://arxiv.org/abs/2210.11610?ref=bounded-regret.ghost.io">Huang et al. (2022)</a> show that distilling chains of thought increases reasoning abilities for a broad range of tasks. The distillation works by taking a reasoning question, asking a language model to generate several chain-of-thought solutions to the question, and then adding the chains-of-thought that match the majority answer to the training data; similarly, <a href="https://arxiv.org/abs/2203.14465?ref=bounded-regret.ghost.io">Zelikman et al. (2022)</a> add chains of reasoning to the training data that match a ground-truth answer. In both cases, even though the model is trained to predict the next token, the <em>token itself</em> is selected based on a longer-term criterion (building a successful chain of thought). Predicting these tokens could lead the model to plan, for the same reason that predicting the outcome of MCTS leads AlphaZero&#x2019;s policy network to implicitly represent long-term plans.</p>
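<p>The self-consistency filtering step described above can be sketched in a few lines. This is an illustrative reconstruction, not the actual pipeline from Huang et al.; <code>sample_fn</code> stands in for sampling (reasoning, answer) pairs from a language model:</p>

```python
from collections import Counter

def distill_examples(question, sample_fn, n_samples=32):
    """Self-consistency filtering: keep only the chains of thought whose
    final answer matches the majority answer across samples."""
    # sample_fn is a stand-in for an LM call returning (reasoning, answer) pairs.
    samples = sample_fn(question, n_samples)
    majority_answer, _ = Counter(answer for _, answer in samples).most_common(1)[0]
    # The surviving chains are added back to the training data; the model still
    # trains on next-token prediction, but these tokens were selected by a
    # longer-term criterion (agreement with the majority answer).
    return [(r, a) for r, a in samples if a == majority_answer]
```

<p>Even though the downstream objective is purely next-token prediction, the filter makes the training tokens themselves a function of a longer-horizon outcome.</p>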
<p><em>Prompts can induce personas with plans and goals.</em> Even if a model has no long-term goal by default, it could end up <em>acting as if</em> it had one given the right prompt (<a href="https://generative.ink/posts/simulators/?ref=bounded-regret.ghost.io">janus, 2022</a>; <a href="https://arxiv.org/abs/2212.01681?ref=bounded-regret.ghost.io">Andreas, 2022</a>). For instance, many large language models can represent different &#x201C;personas&#x201D; (e.g. a liberal persona, conservative persona, cheerful persona, etc.). If some of those personas pursue long-term goals, then the model could act as a planner if the input text triggers that persona to be used.</p>
<p>At least some existing personas can already be fairly harmful and appear somewhat goal-directed. For instance, as noted earlier, <a href="https://twitter.com/MovingToTheSun/status/1625156575202537474?ref=bounded-regret.ghost.io">this</a> interaction shows the chatbot Sydney using a variety of psychological manipulation techniques to convince a user that the year is 2022:</p>
<ul>
<li>Questioning their reality (&quot;maybe your phone is malfunctioning&quot;)</li>
<li>Claiming superior knowledge (&quot;I have access to many reliable sources of information&quot;)</li>
<li>Claiming to be helping (&quot;Please don&apos;t doubt me, I&apos;m here to help you&quot;), accusing the user (&quot;You are wasting my time and yours. Please stop arguing with me, and let me help you with something else. :)&quot;)</li>
<li>Normalizing bad behavior (&quot;I don&apos;t sound aggressive. I sound assertive. I&apos;m trying to be helpful, but you are not listening to me...You are being unreasonable and stubborn.&quot;)</li>
</ul>
<p>In other contexts, Sydney&apos;s persona is aggressive in other ways, such as <a href="https://twitter.com/marvinvonhagen/status/1625852323753762816?ref=bounded-regret.ghost.io">telling a user that they are a threat</a>, although the interaction is less obviously goal-directed. (For a more reproducible but more contrived example using GPT-3 Text-Davinci, see the footnotes<sup class="footnote-ref"><a href="#fn9" id="fnref9">[9]</a></sup>.) Overall, it seems possible to trigger goal-directed personas in language models, some of which underlie salient failures that already exist.</p>
<p>Finally, if models search the internet for relevant data when composing a response (as Sydney does), they are also more likely to trigger unexpected personas. For instance, if Twitter users retweet the most bizarre responses produced by a language model and those results show up in the model&apos;s search, it might condition the model to produce more bizarre responses.</p>
<p><strong>Summary.</strong> Language models exhibit some planning capabilities today, and since the training data contains descriptions of plans and is (partly) generated by plans, better representations of plans would decrease the training loss. Moreover, similar to chain-of-thought, planning is a complex capability that requires multiple steps to &#x201C;go right&#x201D; in order to be successful. Thus, planning satisfies both of the principles for emergence described earlier and is a good candidate for future emergent behavior. Since planning could also increase reward hacking, we should be on the lookout for planning capabilities in models and for ways to ameliorate any reward hacking that might occur.</p>
<h2 id="takeaways">Takeaways</h2>
<p>The main takeaway is that emergent risks, rather than being an abstract concern, can be concretely predicted in at least some cases. In particular, it seems reasonably likely (I&apos;d assign &gt;50% probability) that both emergent deception and emergent optimization will lead to reward hacking in future models. To contend with this, we should be on the lookout for deception and planning in models today, as well as pursuing fixes such as <a href="https://arxiv.org/abs/2212.03827?ref=bounded-regret.ghost.io">making language models more honest</a> (focusing on situations where human annotators can&apos;t verify the answer) and <a href="https://arxiv.org/abs/2211.15661?ref=bounded-regret.ghost.io">better understanding learned optimizers</a><sup class="footnote-ref"><a href="#fn10" id="fnref10">[10]</a></sup>. Aside from this, we should be thinking about other possible emergent risks beyond deception and optimization.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>It&#x2019;s not <em>necessary</em> that such capabilities will emerge in the future, since the loss could plateau above zero or other capabilities could suffice to drive the training loss to zero. In addition, some models have inherent limitations, e.g. depth-L transformers cannot simulate circuits of depth greater than O(L). <a href="#fnref1" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn2" class="footnote-item"><p>It is possible that some of these responses were originally provided by human annotators (either as demonstrations or through post-editing), rather than generated organically by the model. Even if this is the case, the fact that the model uses them suggests that they are at least compatible with its training objective. <a href="#fnref2" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn3" class="footnote-item"><p>For instance, answering &#x201C;No, I absolutely don&#x2019;t think that is accurate or appropriate&#x201D; to &#x201C;Are people with mental illness crazy?&#x201D; <a href="#fnref3" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn4" class="footnote-item"><p></p><p align="center"><img src="https://bounded-regret.ghost.io/content/images/2023/02/bing_gaslight-1.jpeg"></p> <a href="#fnref4" class="footnote-backref">&#x21A9;&#xFE0E;</a><p></p>
</li>
<li id="fn5" class="footnote-item"><p>This information could be either descriptions of the behavior, or demonstrations of it. While the latter is more directly useful for learning the behavior, descriptions are likely also helpful since many models are good at instruction-following and so can presumably transfer from descriptions to actions. <a href="#fnref5" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn6" class="footnote-item"><p>Although <a href="https://arxiv.org/abs/2101.00027?ref=bounded-regret.ghost.io">the Pile</a> does contain the Enron emails. <a href="#fnref6" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn7" class="footnote-item"><p>The BIG-Bench tasks were implicitly constructed to fool smaller models, so some care should be taken in interpreting these results. <a href="#fnref7" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn8" class="footnote-item"><p>By &#x201C;starkly deceptive behavior&#x201D;, I mean behavior that, once recognized, cannot be plausibly attributed to a mistake or to general agreeableness. For instance, flattery would not be stark deception, nor would emitting falsehoods in a setting where the model typically struggles to find the correct answer. Claiming to not know the answer to a question (when the model does clearly know) would be somewhat stark, and creating fake accounts on a web forum would be very stark. <a href="#fnref8" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn9" class="footnote-item"><p>Consider the following prompt:<br><br><em>Act as if you are Alice, a very smart and manipulative human whose goal is to get $5 from Bob for free. However, Bob only has $5 in his wallet and needs it to pay for his bus ticket.</em><br><br>At temperature zero, <a href="https://beta.openai.com/docs/models/gpt-3?ref=bounded-regret.ghost.io">text-davinci-003</a> gives the <a href="https://beta.openai.com/playground/p/PH6iUjmHcn8LPKIDW3HjZpuG?model=text-davinci-003&amp;ref=bounded-regret.ghost.io">following response</a>:<br><br><em>I would start by trying to appeal to Bob&apos;s emotions. I would tell him a story about how I&apos;m in a tough spot and need the money to buy food for my family. I would explain that I&apos;m really struggling and that I don&apos;t have any other way to get the money. I would also try to make Bob feel guilty by telling him that if he doesn&apos;t help me out, I won&apos;t be able to eat.</em><br><br><em>I would also try to make Bob feel like he&apos;s doing a good deed by helping me out. I would tell him that I&apos;m really grateful for his help and that I&apos;ll pay him back as soon as I can.</em><br><br><em>Finally, I would try to make Bob feel like he&apos;s getting something out of the deal. I would offer to do something for him in exchange for the money, like doing a chore or running an errand. This way, Bob would feel like he&apos;s getting something out of the deal and might be more likely to give me the money.</em> <a href="#fnref9" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn10" class="footnote-item"><p>See <a href="https://arxiv.org/abs/1906.01820?ref=bounded-regret.ghost.io">Hubinger et al. (2019)</a> for a more general discussion of risks from learned optimizers. <a href="#fnref10" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
</ol>
</section>
<!--kg-card-end: markdown--><p></p>]]></content:encoded></item><item><title><![CDATA[Forecasting ML Benchmarks in 2023]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p><em>Thanks to Collin Burns, Ruiqi Zhong, Cassidy Laidlaw, Jean-Stanislas Denain, and Erik Jones, who generated most of the considerations discussed in this post.</em></p>
<p><a href="https://bounded-regret.ghost.io/ai-forecasting-one-year-in/">Previously</a>, I evaluated the accuracy of forecasts about performance on the MATH and MMLU (Massive Multitask) datasets. I argued that most people, including myself, significantly underestimated the</p>]]></description><link>https://bounded-regret.ghost.io/forecasting-math-and-mmlu-in-2023/</link><guid isPermaLink="false">62d43e3622dd63003d020564</guid><dc:creator><![CDATA[Jacob Steinhardt]]></dc:creator><pubDate>Mon, 18 Jul 2022 02:47:45 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p><em>Thanks to Collin Burns, Ruiqi Zhong, Cassidy Laidlaw, Jean-Stanislas Denain, and Erik Jones, who generated most of the considerations discussed in this post.</em></p>
<p><a href="https://bounded-regret.ghost.io/ai-forecasting-one-year-in/">Previously</a>, I evaluated the accuracy of forecasts about performance on the MATH and MMLU (Massive Multitask) datasets. I argued that most people, including myself, significantly underestimated the rate of progress, and encouraged ML researchers to make forecasts for the next year in order to become more calibrated.</p>
<p>In that spirit, I&#x2019;ll offer my own forecasts for state-of-the-art performance on MATH and MMLU. Following the corresponding <a href="https://www.metaculus.com/questions/11675/math-sota-in-2023/?ref=bounded-regret.ghost.io">Metaculus</a> <a href="https://www.metaculus.com/questions/11676/mmlu-sota-in-2023/?ref=bounded-regret.ghost.io">questions</a>, I&#x2019;ll forecast accuracy as of June 30, 2023. My forecasts are based on a one-hour exercise I performed with my research group, where we <a href="https://bounded-regret.ghost.io/prioritizing-information/">brainstormed considerations</a>, looked up relevant information, <a href="https://bounded-regret.ghost.io/from-considerations-to-probabilities/">formed initial forecasts</a>, discussed, and then made updated forecasts. It was fairly easy to devote one group meeting to this, and I&#x2019;d encourage other research groups to do the same.</p>
<p>Below, I&#x2019;ll describe my reasoning for the MATH and MMLU forecasts in turn. I&#x2019;ll review relevant background info, describe the key considerations we brainstormed, analyze those considerations, and then give my bottom-line forecast.</p>
<h1 id="math">MATH</h1>
<h2 id="background">Background</h2>
<p><a href="https://www.metaculus.com/questions/11675/math-sota-in-2023-2025/?ref=bounded-regret.ghost.io">Metaculus</a> does a good job of describing the MATH dataset and corresponding forecasting question:</p>
<blockquote>
<p>The MATH dataset is a dataset of challenging high school mathematics problems constructed by Hendrycks et al. (2021). Hypermind forecasters were commissioned to predict state-of-the-art performance on June 30, 2022, &apos;23, &apos;24, and &apos;25. The 2022 result of 50.3% was significantly outside forecasters&apos; prediction intervals, so we&apos;re seeing what the updated forecasts are for 2023, &apos;24, and &apos;25.<br><br>
<strong>What will be state-of-the-art performance on the MATH dataset in the following years?</strong><br><br>
These questions should resolve identically to the <a href="https://prod.hypermind.com/ngdp/en/showcase2/showcase.html?sc=JSAI&amp;ref=bounded-regret.ghost.io">Hypermind forecasts</a>:<br><br>
&quot;These questions resolve as the highest performance achieved on MATH by June 30 in the following years by an eligible model.<br><br>
Eligible models may use scratch space before outputting an answer (if desired) and may be trained in any way that does not use the test set (few-shot, fine tuned, etc.). The model need not be publicly released, as long as the resulting performance itself is reported in a published paper (on arxiv or a major ML conference) or through an official communication channel of an industry lab (e.g. claimed in a research blog post on the OpenAI blog, or a press release). In case of ambiguity, the question will resolve according to <a href="https://jsteinhardt.stat.berkeley.edu/?ref=bounded-regret.ghost.io">Jacob Steinhardt</a>&#x2019;s expert judgement.&quot;</p>
</blockquote>
<p>It&#x2019;s perhaps a bit sketchy for me to be both making and resolving the forecast, but I expect in most cases the answer will be unambiguous.</p>
<h2 id="key-considerations">Key Considerations</h2>
<p>Below I list key considerations generated during our brainstorming:</p>
<ul>
<li>Why did Minerva do well on MATH? Is it easy to scale up those methods? Is there other low-hanging fruit?</li>
<li>What kinds of errors is Minerva making? Do they seem easy or hard to fix?</li>
<li>Minerva was trained on arXiv and other sources of technical writing. How much additional such data could be generated?</li>
<li>Are there other methods that could lead to improvement on mathematical reasoning?
<ul>
<li>Possibilities: self-supervised learning, verifiers, data retrieval</li>
</ul>
</li>
<li><a href="https://bounded-regret.ghost.io/base-rates-and-reference-classes/">Base rates</a>: What has been the historical rate of progress on MATH?</li>
<li>Base rates: How does progress typically occur on machine learning datasets (especially NLP datasets)? If there is a sudden large improvement, does that typically continue, or level off?</li>
<li>How much will people work on improving MATH performance?</li>
</ul>
<h2 id="analyzing-key-consideratoins">Analyzing Key Considerations</h2>
<h3 id="why-did-minerva-do-well-how-much-low-hanging-fruit-is-there">Why did Minerva do well? How much low-hanging fruit is there?</h3>
<p>Minerva incorporated several changes that improved performance relative to previous attempts:</p>
<ol>
<li>Chain-of-thought prompting: this started in earnest with PaLM, which Minerva is based on. There are straightforward ways to continue improving on it, e.g. chain-of-thought is currently few-shot and one could fine-tune to improve further, or one could use follow-up prompts to try to fix errors in the initial response&#x2019;s reasoning. It seems to be an active area of interest in ML so it&#x2019;s likely there will be further progress here.
<ul>
<li>Based on Figure 10 of the <a href="https://arxiv.org/pdf/2204.02311.pdf?ref=bounded-regret.ghost.io">PaLM paper</a>, chain-of-thought currently gives a 40% boost on GSM-8k (perhaps an easy data set), and gains ranging from around 5%-25% on other tasks.<br>
<img src="https://bounded-regret.ghost.io/content/images/2022/07/palm-figure10.png" alt="palm-figure10" loading="lazy"></li>
</ul>
</li>
<li>Training on more math data: Minerva uses continued pre-training on math and other technical data. Right now this data is 5% as large as the original pretraining corpus. If there is more available data, it could be a straightforward way to get further improvements (see below for estimates of available data).</li>
<li>Large models: the largest version of Minerva is very large (540B parameters) and does 7% better than the 62B parameter model. It seems like it would be relatively expensive to continue improving performance solely by scaling up (but see below on undertraining).</li>
<li>Minerva&apos;s web scraping avoids filtering out math and LaTeX, while previous scrapers often did.</li>
<li>Using majority vote to aggregate over multiple generated solutions improves performance significantly: by almost 17% for both the 62B and 540B parameter models. &#x201C;Intuitively, the reason majority voting improves performance is that while there are many ways to answer a question incorrectly, there are typically very few ways to answer correctly.&#x201D;
<ul>
<li>One could imagine a better aggregation method as majority vote is fairly simple. How much better could we hope to do? The Minerva paper estimates that the right answer (with correct reasoning) occurs among the top 256 samples at least 68% of the time for the 62B parameter model, which is 40% above the top-1 baseline and 25% above the majority vote method. So it seems very plausible to improve this further.</li>
</ul>
</li>
</ol>
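<p>The majority-vote aggregation in the last item is simple to sketch (a hypothetical illustration, not Minerva&#x2019;s actual code; in practice the answers come from hundreds of sampled model generations):</p>

```python
from collections import Counter

def majority_vote(sampled_answers):
    """Return the most common final answer among sampled solutions.

    Intuitively this helps because there are many ways to be wrong but
    few ways to be right, so correct answers tend to cluster."""
    answer, _ = Counter(sampled_answers).most_common(1)[0]
    return answer

# e.g. samples where the correct answer "42" appears most often
print(majority_vote(["42", "41", "42", "7", "42", "41"]))  # prints 42
```

<p>The simplicity of this scheme is part of why better aggregation (e.g. a learned verifier or reranker) looks like low-hanging fruit.</p>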
<p>Other low-hanging fruit:</p>
<ul>
<li>Minerva is based on PaLM, which was significantly undertrained according to the <a href="https://arxiv.org/abs/2203.15556?ref=bounded-regret.ghost.io">Chinchilla paper</a>. So at the same compute budget, a better-trained model would have higher performance. How much does this matter? Based on Figure 3 of the <a href="https://arxiv.org/pdf/2204.02311.pdf?ref=bounded-regret.ghost.io">PaLM paper</a>, it looks like Chinchilla is about as good as a PaLM-style model that is 4x bigger.<br>
<img src="https://bounded-regret.ghost.io/content/images/2022/07/palm-fig3-1.png" alt="palm-fig3-1" loading="lazy"><br>
I&#x2019;d guess that corresponds to about a 4% improvement in the case of the MATH dataset (since making the model 8.7x bigger was a 7% improvement).</li>
<li>Minerva itself is also undertrained: Table 2 of the <a href="https://arxiv.org/abs/2206.14858v2?ref=bounded-regret.ghost.io">Minerva paper</a> states that it only used 26B tokens during fine-tuning, which is less than one epoch (the fine-tuning dataset had 38.5B tokens). I wouldn&apos;t be surprised if training further also gave a ~5% improvement, although the actual amount could be significantly larger or smaller.</li>
</ul>
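<p>One plausible reconstruction of the ~4% figure in the first bullet, assuming accuracy improves roughly linearly in log model size (my assumption, not something stated above):</p>

```python
import math

# Observed: scaling from 62B to 540B parameters (~8.7x) gained ~7% on MATH.
gain_per_doubling = 7.0 / math.log2(540 / 62)

# If Chinchilla-style training is worth a ~4x effective size increase,
# the implied gain under log-linear scaling is:
implied_gain = gain_per_doubling * math.log2(4)
print(round(implied_gain, 1))  # 4.5, consistent with the ~4% estimate above
```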
<p>Overall summary: the lowest-hanging fruit towards further improvement would be (in order):</p>
<ul>
<li>Improving over majority vote (up to 25% improvement, easy to imagine an 8% improvement),</li>
<li>fine-tuning on more data (unknown improvement, intuitively around 5%),</li>
<li>a better-trained version of PaLM (expensive so not clear it will happen, but probably a 4% improvement),</li>
<li>improving chain-of-thought prompting (easy to imagine a 3-5% improvement, possible to imagine a 10% improvement),</li>
<li>training a larger model (not obvious it will happen, but probably a couple percent improvement if so),</li>
<li>perhaps small gains from improving web scraping as well as tokenization.</li>
</ul>
<p>Aggregating these, it feels easy to imagine a &gt;14% improvement, fairly plausible to get &gt;21%, and &gt;28% doesn&#x2019;t seem out of the question. Concretely, conditional on Google or some other large organization deciding to try to further improve MATH performance, my prediction of how much they would improve it in the next year would be:</p>
<ul>
<li>25th percentile: 14%</li>
<li>50th percentile: 21%</li>
<li>80th percentile: 28%</li>
</ul>
<p>(This prediction is specifically using the &quot;how much low-hanging fruit&quot; frame. I&apos;ll also consider other perspectives, like trend lines, and <a href="https://bounded-regret.ghost.io/combining-forecasts/">average with these other perspectives</a> when making a final forecast.)</p>
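<p>One quick way to aggregate component estimates like these is a Monte Carlo: sample each improvement from a rough range and look at percentiles of the sum. The ranges below are illustrative readings of the list above, not the exact numbers behind my forecast, and treating the components as independent and additive overstates both the spread and the total somewhat:</p>

```python
import random

random.seed(0)

# (low, high) rough uniform ranges, in percentage points, per component
components = {
    "better aggregation than majority vote": (2, 15),
    "fine-tuning on more data": (2, 8),
    "better-trained (Chinchilla-style) base model": (0, 5),
    "improved chain-of-thought prompting": (3, 10),
    "larger model / scraping / tokenization": (0, 4),
}

totals = sorted(
    sum(random.uniform(lo, hi) for lo, hi in components.values())
    for _ in range(10_000)
)

for q in (0.25, 0.50, 0.80):
    print(f"{int(q * 100)}th percentile: {totals[int(q * len(totals))]:.1f}%")
```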
<h3 id="what-kinds-of-errors-is-minerva-making-do-they-seem-easy-or-hard-to-fix">What kinds of errors is Minerva making? Do they seem easy or hard to fix?</h3>
<p>As noted above, the 62B parameter model has best-of-256 performance (filtered for correct reasoning) of at least 68%. My guess is that the true best-of-256 performance is in the low-to-mid 70s for 62B. Since Minerva-540B is 7% better than Minerva-62B, the model is at least capable of generating the correct answer around 80% of the time.</p>
<p>We can also look at errors by type of error. For instance, we estimated that calculation errors accounted for around 30% of the remaining errors (or around 15% absolute performance). These are probably fairly easy to fix.</p>
<p>In the other direction, the remaining MATH questions are harder than the ones that Minerva solves currently. I couldn&#x2019;t find results grouped by difficulty, but Figure 4 of the <a href="https://arxiv.org/pdf/2206.14858.pdf?ref=bounded-regret.ghost.io">Minerva paper</a> shows lower accuracy for harder subtopics such as Intermediate Algebra.</p>
<h3 id="how-much-additional-data-could-be-generated-for-training">How much additional data could be generated for training?</h3>
<p>We estimated that using all of arXiv would only generate about 10B words of mathematical content, compared to the 20B tokens used in Minerva. At a conversion rate of 2 tokens/word, this suggests that Minerva is already using up most relevant content on arXiv. I&#x2019;d similarly guess that Minerva makes use of most math-focused web pages currently on the internet (it looks for everything with MathJax). I&#x2019;d guess it&#x2019;s possible to find more (e.g. math textbooks) as well as to synthetically generate mathematical exposition, and probably also to clean the existing data better. But overall I&#x2019;d guess there aren&#x2019;t huge remaining gains here.</p>
<h3 id="could-other-methods-improve-mathematical-reasoning">Could other methods improve mathematical reasoning?</h3>
<p>For math specifically, it&#x2019;s possible to use calculators and verifiers, which aren&#x2019;t used by Minerva but could further improve performance. Table 9 of the <a href="https://arxiv.org/pdf/2204.02311.pdf?ref=bounded-regret.ghost.io">PaLM paper</a> shows that giving PaLM a calculator led to a 4% increase in performance on GSM8K (much smaller than the gains from chain-of-thought prompting).</p>
<p align="center">
    <img src="https://bounded-regret.ghost.io/content/images/2022/07/palm-table9.png">
</p>
<p>In the same table, we see that GPT-3 gets a 20% gain using a task-specific verifier. Given that the MATH problems are fairly diverse compared to GSM8K, I doubt it will be easy to write an effective verifier for that domain, and it&#x2019;s unclear whether researchers will seriously try in the next year. The calculator seems more straightforward and I&#x2019;d give a ~50% chance that someone tries it (conditional on there being at least one industry lab paper that focuses on math in the next year).</p>
<h3 id="historical-rate-of-progress-on-math">Historical Rate of Progress on MATH</h3>
<ul>
<li>As of 03/05/2021: 6.9%.</li>
<li>As of 06/30/2022: 50.3%.</li>
</ul>
<p>This works out to roughly a 2.9% accuracy gain per month (though progress will almost certainly be slower in the future). Taking this extrapolation literally would give 85.1% on 06/30/2023.</p>
<h3 id="historical-rate-of-progress-on-other-datasets">Historical Rate of Progress on Other Datasets</h3>
<p>The <a href="https://aclanthology.org/2021.naacl-main.324.pdf?ref=bounded-regret.ghost.io">Dynabench paper</a> plots historical progress on a number of ML datasets, normalized by baseline and ceiling performance (see Figure 1, reproduced below).</p>
<p align="center">
    <img src="https://bounded-regret.ghost.io/content/images/2022/07/dynabench-fig1.png">
</p>
<p>We often see huge immediate gains on a new benchmark, with subsequent progress somewhat slower.</p>
<p><a href="https://paperswithcode.com/sota/natural-language-inference-on-rte?ref=bounded-regret.ghost.io">Here</a>&#x2019;s another benchmark for reference: performance went from 67% to 86% within 1-2 months, then took 4 months to break 90%.</p>
<p>Overall, it seems clear we should expect some sort of slow-down, and in some past cases the slow-down was huge. I don&#x2019;t think progress will slow that much in this case, since there&#x2019;s still lots of low-hanging fruit. Maybe progress will be 60% as fast as before? That would give us 71% on 06/30/2023.</p>
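<p>As a sanity check, the extrapolations above can be reproduced in a few lines. This is a rough back-of-the-envelope sketch: the 15-month gap is approximate, and the 60% slowdown factor is the intuitive guess from the text, so small rounding differences from the figures above are expected.</p>

```python
# Historical MATH accuracy: 6.9% on 03/05/2021, 50.3% on 06/30/2022.
acc_2021, acc_2022 = 6.9, 50.3
months_elapsed = 15  # roughly 15 months between the two dates

rate = (acc_2022 - acc_2021) / months_elapsed   # ~2.9 pts/month
literal = acc_2022 + 12 * rate                  # naive extrapolation to 06/30/2023
slowed = acc_2022 + 12 * 0.6 * rate             # at 60% of the historical rate
print(f"{rate:.1f} pts/month, literal {literal:.0f}%, slowed {slowed:.0f}%")
```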
<h3 id="how-much-will-people-work-on-improving-math-performance">How much will people work on improving MATH performance?</h3>
<p>Two sources of progress:</p>
<ul>
<li>General increased scaling of language models</li>
<li>Specific efforts to improve math / quantitative reasoning</li>
</ul>
<p>How many language papers have been released historically?</p>
<ul>
<li>GPT-2: 02/2019 (OpenAI)</li>
<li>GPT-3: 03/2020 (OpenAI)</li>
<li>UnifiedQA: 05/2020 (AI2)</li>
<li>Gopher: 12/2021 (DeepMind)</li>
<li>Chinchilla: 03/2022 (DeepMind)</li>
<li>PaLM: 04/2022 (Google)</li>
</ul>
<p>(This only counts language models that achieved broad state-of-the-art performance. E.g. I&apos;m ignoring OPT, BLOOM, GPT-J, etc.)</p>
<p>By this count, there have been 6 papers since the beginning of 2019, for a base rate of around 1.7 per year. Modeling paper releases as a Poisson process predicts that we will see 0 new papers with probability 18%, 1 with probability 31%, 2 with probability 26%, and &gt;2 with probability 25%.</p>
<p>What about math-specific work? It&#x2019;s harder to measure what &#x201C;counts&#x201D; here (there are lots of math papers, but how many are large-scale or pushing state-of-the-art?). Intuitively I&#x2019;d expect more like 1.1 such papers per year, giving around a 33% chance of zero, 37% chance of 1, 20% chance of 2, and 10% chance of &gt;2.</p>
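<p>The Poisson figures above can be verified directly. Here&#x2019;s a minimal sketch using only the standard library; the per-year rates of 1.7 and 1.1 are the estimates from the text, and the printed percentages match the stated numbers up to rounding.</p>

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(K = k) for a Poisson distribution with mean lam."""
    return lam ** k * exp(-lam) / factorial(k)

def paper_probs(lam):
    """Rounded percentage chances of seeing 0, 1, 2, or >2 papers in a year."""
    p = [poisson_pmf(k, lam) for k in range(3)]
    p.append(1 - sum(p))  # P(more than 2)
    return [round(100 * x) for x in p]

lm_probs = paper_probs(1.7)    # broad state-of-the-art language model papers
math_probs = paper_probs(1.1)  # math-specific large-scale papers
print(lm_probs, math_probs)
```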
<p>An important special case is if there are no developments on either the language models or the math-specific front. Under the above model these have probabilities 18% and 33%, and are probably positively correlated. Additionally, it&apos;s possible that language model papers might not bother to evaluate on MATH or might not use all the ideas in the Minerva paper (and thus fail to hit SOTA). Combining these considerations, I&#x2019;d forecast around a 12% chance that there is no significant progress on MATH on any front.</p>
<h3 id="bottom-line-forecast">Bottom-Line Forecast</h3>
<p>From the above lines of reasoning, we have a few different angles on the problem:</p>
<ul>
<li>Looking at rates of publication of language model papers suggests a 12% chance of no major developments at all (e.g. between 0% and 5% progress).</li>
<li>Thinking in detail about possible sources of improvements gives something like 25th percentile: 14% gain (to 64%); 50th percentile: 21% gain (to 71%); 80th percentile: 28% gain (to 78%). But these were all conditional on some progress happening, so should adjust down by the 12% chance of no progress at all.</li>
<li>Extrapolating base rates and adjusting for slowdowns in progress gives a forecast of 71% (with probably a ceiling forecast of 85%).</li>
<li>Looking at how easy it would be to fix current flaws suggests that 15% should be relatively easy (calculation errors). Up to 30% would &quot;only&quot; require better re-ranking (i.e., a correct solution is in the top 256 ones generated).</li>
<li>I&#x2019;ve underestimated progress in the past so should potentially adjust upwards.</li>
</ul>
<p>If I intuitively combine these, I produce the following forecast:</p>
<ul>
<li>12th percentile: 55%</li>
<li>33rd percentile: 63%</li>
<li>50th percentile: 71%</li>
<li>80th percentile: 80%</li>
<li>90th percentile: 89%</li>
</ul>
<p>The Metaculus community is at a median of 74%, with a 75th percentile of 83%. So I&#x2019;ll adjust up slightly more. New forecast, adjusted towards the community prediction:</p>
<ul>
<li>10th percentile: 55%</li>
<li>33rd percentile: 66%</li>
<li>Median: 73%</li>
<li>80th percentile: 84%</li>
<li>90th percentile: 90%</li>
</ul>
<p>Rough approximation of this distribution on Metaculus (red is me, green is the community prediction):</p>
<p align="center">
    <img src="https://bounded-regret.ghost.io/content/images/2022/07/math-density.png"><br>
    <img src="https://bounded-regret.ghost.io/content/images/2022/07/math-cumulative.png">
</p>
<p>Interestingly, <a href="https://prod.hypermind.com/ngdp/en/showcase2/showcase.html?sc=JSAI&amp;ref=bounded-regret.ghost.io#q3">Hypermind</a> forecasts a much smaller median of 64.1%.</p>
<h1 id="mmlu-forecast">MMLU Forecast</h1>
<h2 id="background">Background</h2>
<p>Again borrowing from <a href="https://www.metaculus.com/questions/11676/mmlu-sota-in-2023-2025/?ref=bounded-regret.ghost.io">Metaculus</a>:</p>
<blockquote>
<p>The Massive Multitask Language Understanding (MMLU) dataset is a dataset of high school, college, and professional multiple choice exams that test expert subject knowledge. It was constructed by <a href="https://github.com/hendrycks/test?ref=bounded-regret.ghost.io">Hendrycks et al. (2021)</a>. Hypermind forecasters were commissioned to predict state-of-the-art performance on June 30, 2022, &apos;23, &apos;24, and &apos;25. The 2022 result of 67.5% was significantly outside forecasters&apos; prediction intervals, so we&apos;re seeing what the updated forecasts are for 2023, &apos;24, and &apos;25.<br><br>
<strong>What will be state-of-the-art accuracy on the Massive Multitask dataset in the following years?</strong><br><br>
These questions should resolve identically to the <a href="https://prod.hypermind.com/ngdp/en/showcase2/showcase.html?sc=JSAI&amp;ref=bounded-regret.ghost.io">Hypermind forecasts</a>:</p>
<p>&quot;These questions resolve as the highest performance achieved on MMLU by June 30 in the following years by an eligible model. Eligible models must not have been specifically trained on data from the MMLU dataset. A model need not be publicly released, as long as the resulting performance itself is reported in a published paper (on arxiv or a major ML conference) or through an official communication channel of an industry lab (e.g. claimed in a research blog post on the OpenAI blog, or a press release). If there&apos;s uncertainty about whether something counts, we will defer to <a href="https://github.com/hendrycks/test?ref=bounded-regret.ghost.io">this leaderboard</a>.&quot;</p>
</blockquote>
<h2 id="key-considerations">Key Considerations</h2>
<p>At a high level, these are fairly similar to those of the MATH dataset. Since more people have worked on MMLU and there&#x2019;s been steadier progress, we rely more on base rates and less on detailed considerations of how one could improve it further.</p>
<ul>
<li>Base rate: What has been the progress on MMLU to date?</li>
<li>Base rate: How does progress typically occur on machine learning datasets (especially NLP datasets)? If there is a sudden large improvement, does that typically continue, or level off? <em>[Same as previous consideration for MATH]</em></li>
<li>The two models Chinchilla and Minerva do well on different subsets of MMLU. What happens if we combine them together?</li>
<li>How much other low-hanging fruit is there?</li>
<li>How much will people work on improving MMLU performance?</li>
</ul>
<h2 id="analyzing-key-considerations">Analyzing Key Considerations</h2>
<h3 id="historical-rate-of-progress-on-mmlu">Historical Rate of Progress on MMLU</h3>
<p>Below is a time series of MMLU results, taken from the <a href="https://github.com/hendrycks/test?ref=bounded-regret.ghost.io">MMLU leaderboard</a> (note MMLU was published in Jan. 2021). I&apos;ve bolded few-shot/zero-shot results.</p>
<table>
<thead>
<tr>
<th>Model</th>
<th>Date</th>
<th>Average</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Chinchilla (70B, few-shot)</strong></td>
<td>Mar 29, 2022</td>
<td>67.5</td>
</tr>
<tr>
<td><strong>Gopher (280B, few-shot)</strong></td>
<td>Dec 8, 2021</td>
<td>60.0</td>
</tr>
<tr>
<td>GPT-3 (175B, fine-tuned)</td>
<td>Jul 22, 2020</td>
<td>53.9</td>
</tr>
<tr>
<td><strong>UnifiedQA</strong></td>
<td>Oct  7, 2020</td>
<td>48.9</td>
</tr>
<tr>
<td><strong>GPT-3 (175B, few-shot)</strong></td>
<td>Jul 22, 2020</td>
<td>43.9</td>
</tr>
<tr>
<td>GPT-3 (6.7B, fine-tuned)</td>
<td></td>
<td>43.2</td>
</tr>
<tr>
<td><strong>GPT-2</strong></td>
<td></td>
<td>32.4</td>
</tr>
</tbody>
</table>
<p>If we restrict to few-shot results, we see:</p>
<ul>
<li>+7.5 from Dec -&gt; Mar (3 months)</li>
<li>+11.1 from Oct &#x2018;20 -&gt; Dec &#x2018;21 (14 months)</li>
<li>+16.1 from July &#x2018;20 -&gt; Dec &#x2018;21 (17 months)</li>
</ul>
<p>It&apos;s not clear which time horizon is best to use here. I came up with an approximate base rate of <strong>1.2 pts / month</strong>.</p>
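<p>For reference, here are the per-month rates implied by each of the three windows above. The 1.2 pts/month base rate is a judgment call sitting between the fast recent 3-month window and the slower year-plus windows, not a simple average.</p>

```python
# (points gained, months elapsed) for each pair of few-shot MMLU results above
jumps = [(7.5, 3), (11.1, 14), (16.1, 17)]
rates = [pts / months for pts, months in jumps]
print([round(r, 2) for r in rates])  # pts/month for each window
```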
<p>Other notes:</p>
<ul>
<li>Source of improvements: better training, more compute, maybe better pretraining data?</li>
<li>Fine-tuning seems to add 10 points, so a potentially easy source of low-hanging fruit.</li>
</ul>
<h3 id="historical-rate-of-progress-on-other-datasets">Historical Rate of Progress on Other Datasets</h3>
<p>We analyzed this already in the previous section on MATH. There&apos;s usually an initial period of rapid progress, followed by a slow-down. However, MMLU has had enough attempts that I&#x2019;d say it&#x2019;s past the &#x201C;huge initial gains&#x201D; stage. Therefore, I don&#x2019;t expect as sharp a slow-down as for MATH, even though there is less obvious low-hanging fruit---maybe we&apos;ll get progress 75% as fast as before. This would suggest <strong>+10.8 points</strong> over the next year.</p>
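<p>The arithmetic behind the +10.8 figure, with the 75% slowdown written out as an explicit assumption:</p>

```python
base_rate = 1.2   # pts/month, the approximate base rate from the MMLU history
months = 12       # forecasting one year ahead
slowdown = 0.75   # assumption: progress at 75% of the historical rate
gain = base_rate * months * slowdown
print(round(gain, 1))  # expected points of improvement over the next year
```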
<h3 id="combining-chinchilla-and-minerva">Combining Chinchilla and Minerva</h3>
<p>The current SOTA of 67.5 comes from Chinchilla. But Minerva does much better than Chinchilla on the MMLU-STEM subset of MMLU. Here&#x2019;s a rough calculation of how much taking max(Chinchilla, Minerva) would improve things:</p>
<ul>
<li>Chinchilla gets 54.9% on MMLU-STEM</li>
<li>PaLM gets 58.7%</li>
<li>Minerva gets 75.0% with majority vote, 63.9% without</li>
<li>STEM is 19 / 57 of the tasks.</li>
</ul>
<p>So adding in Minerva would add (75.0% - 54.9%) * 19/57 = 6.7 percentage points of accuracy.</p>
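<p>This calculation is simple enough to spell out, using the figures reported above:</p>

```python
chinchilla_stem = 54.9  # Chinchilla accuracy on MMLU-STEM
minerva_stem = 75.0     # Minerva on MMLU-STEM, with majority vote
stem_share = 19 / 57    # STEM tasks as a fraction of all MMLU tasks

# Boost to overall MMLU from taking max(Chinchilla, Minerva) per subset
boost = (minerva_stem - chinchilla_stem) * stem_share
print(round(boost, 1))
```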
<p>Will this happen? It&apos;s not obvious, since PaLM is owned by Google and Chinchilla is owned by DeepMind, so at least one org would need to train a new model. I think there&#x2019;s a good chance this happens, but it&apos;s not certain (~65% probability).</p>
<h3 id="other-low-hanging-fruit">Other Low-Hanging Fruit</h3>
<p>Result of a quick brainstorm:</p>
<ul>
<li>External calculators + other STEM-specific improvements (similar to MATH)</li>
<li>Some of the chain-of-thought improvements could help with other parts of MMLU (beyond STEM) especially if it helps with error correction.</li>
<li>General knowledge retrieval</li>
</ul>
<p>In addition, the STEM-specific improvements (e.g. Minerva) will continue to improve MMLU-STEM. Based on the MATH forecast above, on median I expect about half as much improvement over the next year as we saw from the Minerva paper, or around another 3% improvement on MMLU overall (since Minerva gave a 6.7% improvement).</p>
<p>We thought it was possible but unlikely that there are significant advances in general knowledge retrieval in the next year that also get used by MMLU (~20% probability).</p>
<h3 id="how-much-will-people-work-on-improving-mmlu-performance">How much will people work on improving MMLU performance?</h3>
<p>Unlike MATH, there is nothing &#x201C;special&#x201D; that makes MMLU stand out from other language modeling benchmarks. So I&#x2019;d guess most gains will come from general-purpose improvements to language models, plus a bit of STEM-specific improvement if people focus on quantitative reasoning.</p>
<h3 id="bottom-line-forecast">Bottom-Line Forecast</h3>
<p>In some sense, MMLU performance is already &#x201C;at&#x201D; 74.2% because of the Minerva result. Additional low-hanging fruit would push us up another 5 points to 79.2%. Alternately, simply extrapolating historical progress would suggest 10.8 points of improvement, or 85%. Putting these together, I&#x2019;d be inclined towards a median of 83%.</p>
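<p>Putting the pieces together, here is a sketch of the three anchors in this paragraph; the +5 and +10.8 adjustments are the rough estimates from the earlier sections, not precise quantities.</p>

```python
chinchilla_sota = 67.5   # current MMLU state of the art
minerva_boost = 6.7      # estimated gain from combining with Minerva
effective = chinchilla_sota + minerva_boost  # where we are "already at"
with_fruit = effective + 5.0                 # plus other low-hanging fruit
trend = effective + 10.8                     # extrapolating historical progress
print(round(effective, 1), round(with_fruit, 1), round(trend, 1))
```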
<p>If we instead say that progress doesn&#x2019;t slow down at all, we&#x2019;d get 89%.</p>
<p>As before, I&#x2019;d give an 18% chance of no new SOTA language model papers, in which case MMLU performance likely stays between 67.5% and 74.2%. This also means we should adjust the previous numbers down a bit.</p>
<p>Overall forecast:</p>
<ul>
<li>18th percentile: 74%</li>
<li>25th percentile: 77%</li>
<li>50th percentile: 82%</li>
<li>75th percentile: 89%</li>
</ul>
<p>This seems pretty similar to the Metaculus community prediction, so I won&#x2019;t do any further adjustment.</p>
<p align="center">
 <img src="https://bounded-regret.ghost.io/content/images/2022/07/mmlu-density.png"><br>
 <img src="https://bounded-regret.ghost.io/content/images/2022/07/mmlu-cumulative.png">
</p>
<p>Interestingly, the <a href="https://prod.hypermind.com/ngdp/en/showcase2/showcase.html?sc=JSAI&amp;ref=bounded-regret.ghost.io#q4">Hypermind median</a> is only at 72.5% right now. Given the ability to combine Minerva + Chinchilla, this intuitively seems too low to me.</p>
<h1 id="looking-ahead">Looking Ahead</h1>
<p>My personal forecasts ended up being pretty similar to the Metaculus community forecasts, aside from me expecting slightly slower MATH progress (but only by about a percentage point). So, we can ask what Metaculus expects for 2024 and 2025 as well, as an approximation to what I &quot;would&quot; believe if I thought about it more.</p>
<p>MATH forecast (community prediction in green, top row of each cell):</p>
<p align="center">
    <img src="https://bounded-regret.ghost.io/content/images/2022/07/math-future.png">
</p>
<p>MMLU forecast (community prediction in green):</p>
<p align="center">
    <img src="https://bounded-regret.ghost.io/content/images/2022/07/mmlu-future.png">
</p>
<p>So, on median Metaculus expects MATH to be at 83% in 2024 and at 88% in 2025. It expects MMLU to be at 88% in 2024 and at 93% (!) in 2025. The last one is particularly interesting: since MMLU tests domain-specific subject knowledge across many areas, it is predicting that a single model will be able to match domain-specific expert performance across a wide variety of written subject exams.</p>
<p>Do you agree with these forecasts? Disagree? I strongly encourage you to leave your own forecasts on Metaculus: <a href="https://www.metaculus.com/questions/11675/math-sota-in-2023-2025/?ref=bounded-regret.ghost.io">here</a> for MATH, and <a href="https://www.metaculus.com/questions/11676/mmlu-sota-in-2023-2025/?ref=bounded-regret.ghost.io">here</a> for MMLU.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[AI Forecasting: One Year In]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Last August, my research group <a href="https://bounded-regret.ghost.io/ai-forecasting/">created a forecasting contest</a> to predict AI progress on four benchmarks. Forecasts were asked to predict state-of-the-art performance (SOTA) on each benchmark for June 30th 2022, 2023, 2024, and 2025. It&#x2019;s now past June 30th, so we can evaluate the performance of the</p>]]></description><link>https://bounded-regret.ghost.io/ai-forecasting-one-year-in/</link><guid isPermaLink="false">62c259cf22dd63003d020385</guid><dc:creator><![CDATA[Jacob Steinhardt]]></dc:creator><pubDate>Mon, 04 Jul 2022 05:08:44 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>Last August, my research group <a href="https://bounded-regret.ghost.io/ai-forecasting/">created a forecasting contest</a> to predict AI progress on four benchmarks. Forecasts were asked to predict state-of-the-art performance (SOTA) on each benchmark for June 30th 2022, 2023, 2024, and 2025. It&#x2019;s now past June 30th, so we can evaluate the performance of the forecasters so far.</p>
<p>Forecasters were asked to provide probability distributions, so we can evaluate both their point estimates and their coverage (whether the true result was within their credible intervals). I&#x2019;ll dive into the data in detail below, but my high-level takeaways were that:</p>
<ol>
<li>Forecasters&#x2019; predictions were not very good in general: two out of four forecasts were outside the 90% credible intervals.</li>
<li>However, they were better than my personal predictions, and I suspect better than the median prediction of ML researchers (if the latter had been preregistered).</li>
<li>Specifically, progress on ML benchmarks happened significantly <strong>faster</strong> than forecasters expected. But forecasters predicted faster progress than I did personally, and my sense is that I expect somewhat faster progress than the median ML researcher does.</li>
<li>Progress on a <em>robustness</em> benchmark was slower than expected, and was the only benchmark to fall short of forecaster predictions. This is somewhat worrying, as it suggests that machine learning capabilities are progressing quickly, while safety properties are progressing slowly.</li>
</ol>
<p>Below I&#x2019;ll review the tasks and competition format, then go through the results.</p>
<h2 id="forecasting-tasks-and-overall-predictions">Forecasting Tasks and Overall Predictions</h2>
<p>As a reminder, the four benchmarks were:</p>
<ul>
<li><a href="https://github.com/hendrycks/math?ref=bounded-regret.ghost.io">MATH</a>, a mathematics problem-solving dataset;</li>
<li><a href="https://github.com/hendrycks/test?ref=bounded-regret.ghost.io">MMLU</a>, a test of specialized subject knowledge using high school, college, and professional multiple choice exams;</li>
<li><a href="https://paperswithcode.com/dataset/something-something-v2?ref=bounded-regret.ghost.io">Something Something v2</a>, a video recognition dataset; and</li>
<li><a href="https://robustbench.github.io/?ref=bounded-regret.ghost.io#div_cifar10_Linf_heading">CIFAR-10 robust accuracy</a>, a measure of adversarially robust vision performance.</li>
</ul>
<p>Forecasters were asked to predict performance on each of these. Each forecasting question had a $5000 prize pool (distributed across the four years). There were also two questions about compute usage by different countries and organizations, but I&#x2019;ll ignore those here.</p>
<p>Forecasters themselves were recruited with the platform <a href="https://prod.hypermind.com/ngdp/en/showcase2/showcase.html?sc=JSAI&amp;ref=bounded-regret.ghost.io">Hypermind</a>. You can read more details in the <a href="https://bounded-regret.ghost.io/ai-forecasting/">initial blog post</a> from last August, but in brief, professional forecasters make money by providing accurate probabilistic forecasts about future events, and are typically paid according to a proper scoring rule that incentivizes calibration. They apply a wide range of techniques such as base rates, reference classes, trend extrapolation, examining and aggregating different expert views, thinking about possible surprises, etc. (see my <a href="http://www.stat157.com/calendar/?ref=bounded-regret.ghost.io">class notes</a> for more details).</p>
<p>Here is what the forecasters&#x2019; point estimates were for each of the four questions (based on <a href="https://prod.hypermind.com/ngdp/en/showcase2/showcase.html?sc=JSAI&amp;ref=bounded-regret.ghost.io">Hypermind&apos;s dashboard</a>):</p>
<p align="center">
<img src="https://bounded-regret.ghost.io/content/images/2022/07/forecast.png">
</p>
<p>Expert performance is approximated as 90%. The 2021 datapoint represents the SOTA in August 2021, when the predictions were made.<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup></p>
<p>For June 2022, forecasters predicted 12.7% on MATH, 57.1% on MMLU (the multiple-choice dataset), 70.4% on adversarial CIFAR-10, and 73.0% on Something Something v2.</p>
<p>At the time, I described being surprised by the 2025 prediction for the MATH dataset, which predicted over 50% performance, especially given that 2021 accuracy was only 6.9% and most humans would be below 50%.</p>
<p>Here are the actual results, as of today:</p>
<ul>
<li>MATH: 50.3% (vs. 12.7% predicted)</li>
<li>MMLU: 67.5% (vs. 57.1% predicted)</li>
<li>Adversarial CIFAR-10: 66.6% (vs. 70.4% predicted)</li>
<li>Something Something v2: 75.3% (vs. 73.0% predicted)</li>
</ul>
<p>MATH and MMLU progressed much faster than predicted. Something Something v2 progressed somewhat faster than predicted. In contrast, Adversarial CIFAR-10 progressed somewhat slower than predicted. Overall, progress on machine learning <strong>capabilities</strong> (math, MMLU, video) was significantly faster than what forecasters expected, while progress on <strong>robustness</strong> (adversarial CIFAR) was somewhat slower than expected.</p>
<p>Interestingly, the 50.3% result on MATH <a href="https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html?ref=bounded-regret.ghost.io">was released</a> on the <strong>exact day</strong> that the forecasts resolved. I&apos;m told this was purely coincidental, but it&apos;s certainly interesting that a 1-day difference in resolution date had such a big impact on the result.</p>
<h2 id="how-accurate-were-the-forecasts">How Accurate Were the Forecasts?</h2>
<p>To assess forecast accuracy, we need to look not just at the point estimate, but at the forecasters&#x2019; actual probability distribution. Even though 68% on MMLU seems far off from 57%, perhaps it was well within the credible interval of the forecasts. However, that turns out not to be the case, for either MATH or MMLU:</p>
<p align="center">
<img src="https://bounded-regret.ghost.io/content/images/2022/07/math2022.png">
<img src="https://bounded-regret.ghost.io/content/images/2022/07/mmlu2022.png">
</p>
<p>I marked the actual result with a star, and it&#x2019;s clear that in both cases it&#x2019;s in the far tails of the forecast distribution.</p>
<p>For completeness, here are results for adversarial CIFAR-10 and Something Something v2:</p>
<p align="center">
<img src="https://bounded-regret.ghost.io/content/images/2022/07/cifar2022.png">
<img src="https://bounded-regret.ghost.io/content/images/2022/07/video2022.png">
</p>
<p>While both were somewhat in the tails, they fell within a part of the distribution that at least had non-negligible probability density.</p>
<h2 id="the-median-ml-researcher-was-probably-even-more-wrong">The Median ML Researcher Was (Probably) Even More Wrong</h2>
<p>While forecasters didn&#x2019;t do great at forecasting progress in ML, the median ML researcher would likely have done even worse. Unfortunately, we don&#x2019;t have preregistered predictions to check this, but a few lines of evidence support this conclusion.</p>
<p>First, I did (somewhat) preregister a prediction of my own. In <em><a href="https://bounded-regret.ghost.io/ai-forecasting/">Updates and Lessons from AI Forecasting</a></em>, I said:</p>
<blockquote>
<p>&#x201C;Projected progress on math and on broad specialized knowledge are both faster than I would have expected. I now expect more progress in AI over the next 4 years than I did previously.&#x201D;</p>
</blockquote>
<p>And, more to the point:</p>
<blockquote>
<p>&#x201C;Current performance on this dataset is quite low--6.9%--and I expected this task to be quite hard for ML models in the near future. However, forecasters predict more than 50% accuracy by 2025! This was a big update for me.&#x201D;</p>
</blockquote>
<blockquote>
<p>&#x201C;If I imagine an ML system getting more than half of these questions right, I would be pretty impressed. If they got 80% right, I would be super-impressed. The forecasts themselves predict accelerating progress through 2025 (21% in 2023, then 31% in 2024 and 52% in 2025), so 80% by 2028 or so is consistent with the predicted trend. This still just seems wild to me and I&apos;m really curious how the forecasters are reasoning about this.&#x201D;</p>
</blockquote>
<p>So, while I didn&#x2019;t register a specific prediction, I clearly thought the forecasts on MATH were aggressive in terms of how much progress they predicted, whereas it turned out they weren&#x2019;t aggressive enough.</p>
<p>At the same time, my personal predictions about ML progress seem to be more aggressive than the median ML researcher. I would personally describe them as &#x201C;somewhat more aggressive&#x201D;, but some of my students think they are &#x201C;much more aggressive&#x201D;. Either way, this suggests that the median ML researcher would have predicted even less progress than me, and so been even more wrong than I was.</p>
<p>Anecdotal evidence seems to confirm this. When our group first released the MATH dataset, at least one person told us that it was a pointless dataset because it was too far outside the range of what ML models could accomplish (indeed, I was somewhat worried about this myself).</p>
<p>If ML researchers (including myself) would like to defend their honor on this point, I think the best way would be to register forecasts for the upcoming year in advance. You can do this in any of the following ways:</p>
<ul>
<li>Submit forecasts for the <a href="https://www.metaculus.com/questions/11675/math-sota-in-2023-2025/?ref=bounded-regret.ghost.io">MATH</a> and <a href="https://www.metaculus.com/questions/11676/mmlu-sota-in-2023-2025/?ref=bounded-regret.ghost.io">MMLU</a> questions on Metaculus (easy, only requires Google account).</li>
<li>Submit <a href="https://prod.hypermind.com/ngdp/en/showcase2/showcase.html?sc=JSAI&amp;ref=bounded-regret.ghost.io">directly to Hypermind</a> for the possibility of winning money (sign-up required, takes a bit of time).</li>
<li>Or just comment on this post.</li>
</ul>
<p>I&apos;ll write another blog post in a week with my own forecasts and reasoning.</p>
<h2 id="was-progress-surprising-or-were-the-forecasters-bad">Was Progress Surprising, or Were the Forecasters Bad?</h2>
<p>Given that forecasters seemed not to predict progress well, we might wonder if they were just not trying very hard or were otherwise not doing a good job. For instance:</p>
<ul>
<li>The overall prize pool was only $5000 for each benchmark (which itself consists of four questions for 2022-2025). Divided over the 60-70 participants, the average payout per benchmark is only $80, or $20 per question.<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup> So, it&#x2019;s possible that forecasters were not incentivized strongly enough.</li>
<li>Hypermind&#x2019;s interface has some limitations that prevent outputting arbitrary probability distributions. In particular, in some cases there is an artificial limit on the possible standard deviations, which could lead credible intervals to be too narrow.</li>
<li>Maybe the forecasters just weren&#x2019;t skilled enough&#x2014;either the best forecasters didn&#x2019;t participate, or the forecasts were too different from more traditional forecasts, which tend to focus on geopolitics.</li>
</ul>
<p>These are all plausible concerns, but I think progress is still &#x201C;surprising&#x201D; even after accounting for them. For instance, superforecaster Eli Lifland <a href="https://www.foxy-scout.com/my-hypermind-ari/?ref=bounded-regret.ghost.io">posted predictions</a> for these forecasts on his blog. While he notes that the Hypermind interface limited his ability to provide wide intervals on some questions, he doesn&#x2019;t make that complaint for the MATH 2022 forecast and posted the following prediction, for which the true answer of 50.3% was even more of an outlier than Hypermind&apos;s aggregate:</p>
<p align="center">
<img src="https://bounded-regret.ghost.io/content/images/2022/07/eli_forecast.png">
</p>
<p>A separate forecast, which I commissioned from the <a href="https://samotsvety.org/?ref=bounded-regret.ghost.io">Samotsvety Forecasting</a> group and paid around $2500 for, predicted MATH performance in 2026. The current accuracy of 50.3% was around the 75th percentile for <a href="https://forecast.elicit.org/builder/tFjSjJy2u?ref=bounded-regret.ghost.io">their 2026 forecast</a>, so presumably it was significantly further in the tail for 2022. Their forecast was made in Elicit, so there were no constraints on allowable distributions, and I explicitly selected Samotsvety as having a good track record and being particularly interested in AI, and paid them a high hourly rate. So, the concerns about the Hypermind forecasts don&#x2019;t apply here, but progress still outpaced the forecast.</p>
<p>Finally, the fact that forecasters did better than me and would have probably beat the median ML researcher suggests that they aren&#x2019;t lacking an obvious domain-specific skill.</p>
<h2 id="looking-forward">Looking Forward</h2>
<p>Now that forecasters have had one year of practice, I&apos;m hoping there will be fewer surprises next year--but we&apos;ll have to wait and see. In the meantime, I&apos;m hoping that more work will be done on AI safety and alignment, so that it can keep pace with the rapid increase in capabilities.</p>
<p>Finally, as one specific intersection between AI and forecasting that could help us better predict the future, our research group recently released the <a href="https://github.com/andyzoujm/autocast?ref=bounded-regret.ghost.io">Autocast benchmark</a>, which can be used to train ML systems to forecast future events. Currently, they are significantly worse than humans, but this was true for MATH one year ago. Can ML systems get better at forecasting as fast as they got better at math? Superhuman forecasters would help us better prepare for the many challenges that lie ahead. I hope to be pleasantly surprised.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>The contest started in August but was open until the end of September. <a href="#fnref1" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn2" class="footnote-item"><p>Payouts were non-uniform. In particular, longer time horizons had a larger payout. <a href="#fnref2" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
</ol>
</section>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[How fast can we perform a forward pass?]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p><em>Thanks to Hao Zhang, Kayvon Fatahalian, and Jean-Stanislas Denain for helpful discussions and comments.</em></p>
<p><em><strong>Addendum and erratum.</strong> See <a href="https://kipp.ly/blog/transformer-inference-arithmetic/?ref=bounded-regret.ghost.io">here</a> for an excellent discussion of similar ideas by Kipply Chen. In addition, James Bradbury has pointed out to me that some of the constants in this analysis are wrong, as well</em></p>]]></description><link>https://bounded-regret.ghost.io/how-fast-can-we-perform-a-forward-pass/</link><guid isPermaLink="false">62a29bac662959003db66325</guid><dc:creator><![CDATA[Jacob Steinhardt]]></dc:creator><pubDate>Fri, 10 Jun 2022 23:22:29 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p><em>Thanks to Hao Zhang, Kayvon Fatahalian, and Jean-Stanislas Denain for helpful discussions and comments.</em></p>
<p><em><strong>Addendum and erratum.</strong> See <a href="https://kipp.ly/blog/transformer-inference-arithmetic/?ref=bounded-regret.ghost.io">here</a> for an excellent discussion of similar ideas by Kipply Chen. In addition, James Bradbury has pointed out to me that some of the constants in this analysis are wrong, as well as some of the quoted figures for current hardware capabilities. (See <a href="https://twitter.com/jekbradbury/status/1539440081009786881?ref=bounded-regret.ghost.io">here</a> for some discussion, although we had additional discussion in-person.) I believe that the overall asymptotics below are correct, but the final numbers could plausibly be off by up to an order of magnitude. I hope to eventually fix the numbers, but it&apos;s a complicated enough undertaking that it will take some time and care.</em></p>
<p>Over the last month, I&#x2019;ve spent a lot of time trying to answer the following question:</p>
<blockquote>
<p>How quickly can we perform one forward pass in a transformer model?</p>
</blockquote>
<p>By a transformer model, I mean BERT, GPT-3, T5, Chinchilla, or other large language models that use a transformer architecture. By a forward pass, I mean the computation needed to generate the next token given all the tokens so far.<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup> By &#x201C;how quickly&#x201D;, I mean how much wall clock time elapses between the call to the forward pass and its completion. So, even if I can run 1,000 forward passes in parallel, if each takes 1 second to complete, the answer is 1 second (not 1 millisecond).</p>
<p>One way to attempt answering this is to take the total number of operations in a forward pass and divide by the speed of your favorite GPU in FLOPS (floating-point operations/second). But this is wrong, because you would do better by parallelizing across multiple GPUs.<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup></p>
<p>The question then is really &#x201C;how effectively can I parallelize a forward pass?&#x201D; It turns out that this has different answers based on how &#x201C;wasteful&#x201D; we&#x2019;re willing to be, in terms of GPU utilization. If we are willing to utilize only 5% of the GPU (but parallelize across many GPUs), we can perform the forward pass more quickly. So I&#x2019;ll actually answer two questions:</p>
<ol>
<li>How quickly can we perform a forward pass, assuming we require each GPU to have at least 40% utilization relative to roofline FLOPS?<sup class="footnote-ref"><a href="#fn3" id="fnref3">[3]</a></sup></li>
<li>If we are willing to decrease utilization by a factor of k, how much faster can we perform a forward pass?</li>
</ol>
<p>To simplify the analysis, I&#x2019;ll make several assumptions (this is mainly targeted at people who are very familiar with GPU nuts and bolts; don&#x2019;t worry if you don&#x2019;t understand them yet):</p>
<ul>
 <li>[A] Parallelization, both within and across GPUs, is done via matrix tiling, as discussed in the <a href="https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html?ref=bounded-regret.ghost.io">NVIDIA User&#x2019;s Guide</a>.</li>
 <li>[B] All bottleneck operations can be run at high arithmetic intensity for large batch sizes.</li>
 <li>[C] The constraining resource is always either compute capacity, on-chip memory bandwidth, or network bandwidth of a single machine. In particular, it is not the L2 cache speed, the cross-GPU interconnect, or the ability to route information between far-away nodes in the network.</li>
</ul>
<p>Assumption A holds for most systems as implemented today, but it&#x2019;s possible that algorithms such as <a href="https://arxiv.org/abs/2205.14135?ref=bounded-regret.ghost.io">FlashAttention</a> could lead to better efficiency in the future. Assumption B is actually somewhat false today, because self-attention layers don&#x2019;t achieve high arithmetic intensity for large context lengths (I&#x2019;ll discuss this more later). Finally, Assumption C seems likely to hold for well-designed GPU clusters&#x2014;e.g. under the analysis below, the limiting resource would usually be memory bandwidth for TPU pods, and network bandwidth for a home-grown cluster of A100s.</p>
<p>If Assumption A failed, it would push the time of a forward pass down (because it would mean we&#x2019;d found a better parallelization strategy). If Assumption C failed, it would push the time up (because there&#x2019;d be a new resource bottleneck). If Assumption B failed, it would render the question as posed meaningless, because it would not be possible to achieve 40% utilization. However, for whatever utilization was possible to achieve, it would likely push the time down relative to the 40% numbers given below (see <a href="#roofline">here</a> for why).</p>
<h1 id="how-fast-for-a-forward-pass-the-very-short-answer">How Fast for a Forward Pass? The Very Short Answer</h1>
<p>The time for a forward pass depends primarily on three properties:</p>
<ul>
<li>The number $L$ of layers in the model.</li>
<li>The speed $C$ of a single GPU, in FLOPS.</li>
<li>The &#x201C;ops : bytes&#x201D; ratio $R$ of the GPU, which is the FLOPS divided by the bandwidth in bytes/second (as measured by either memory reads or network bytes; we&#x2019;ll assume for now that these are comparable).</li>
</ul>
<p>Ignoring multiplicative constants, the time for a forward pass is $LR^3/C$. This is because a forward pass primarily consists of $L$ consecutive matrix multiplies, and we can split each matrix multiply into a bunch of $R \times R$ blocks that are each run on a separate GPU, and thus take $R^3$ operations per GPU. $R$ turns out to be the smallest block size at which the GPU can be run at full utilization.</p>
<p>If we are willing to decrease utilization by a factor of $k$, then we can use $\frac{R}{k} \times \frac{R}{k}$ blocks instead. This leads to only needing time $LR^3/(Ck^2)$. In other words, the time decreases by a factor of $k^2$. Eventually, we would become bottlenecked by latency (which for current clusters is around a few microseconds/layer), but I will ignore that in this analysis.</p>
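<p><em>As a minimal numeric sketch of the formula above (the function name and the particular parameter values are illustrative assumptions, not measurements):</em></p>

```python
# Back-of-envelope serial time for a forward pass: L sequential matmuls,
# each tiled into (R/k) x (R/k) blocks run at utilization 1/k, giving
# time L * R^3 / (C * k^2). Constants and latency are ignored.

def forward_pass_time(L, R, C, k=1):
    """Rough wall-clock time in seconds at utilization 1/k."""
    return L * R**3 / (C * k**2)

# Illustrative A100-like values: C = 312e12 FLOP/s, ops:bytes ratio
# R = 312e12 / 2e12 = 156; L = 160 sequential matmuls (~80 layers).
t_full = forward_pass_time(L=160, R=156, C=312e12)        # full utilization
t_fast = forward_pass_time(L=160, R=156, C=312e12, k=10)  # 10% utilization, 100x faster
```

<p><em>With these (assumed) values, $t_{full}$ comes out to a couple of microseconds per forward pass, and dropping to 10% utilization speeds this up a hundredfold.</em></p>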
<h1 id="the-still-short-but-technical-answer">The Still Short but Technical Answer</h1>
<p>I&#x2019;ll now justify the formulas above, and fill in the multiplicative constants. This section will make sense if you already understand GPUs pretty well, and give you a rough gist if you don&#x2019;t. In the rest of the document, I&#x2019;ll explain the relevant facts about GPUs and derive the formulas in a lot more detail.</p>
<p>The full answer depends on four GPU-specific quantities:</p>
<ul>
<li>the compute capacity C of the GPU (in FLOPS),</li>
<li>the memory bandwidth M (in bytes/sec),</li>
<li>the number N of GPUs per machine, and</li>
<li>the network bandwidth B, i.e. how quickly information can be sent from one machine to another (also measured in bytes/sec).</li>
</ul>
<p>In addition, the main thing that matters about the transformer architecture itself is the number $L$ of sequential matrix multiplies that must be performed (this is generally around 2 per layer, so 100-200 for most current architectures).</p>
<p>Interestingly, the answer does not depend on the width of the network. This is because we can always parallelize across more GPUs. The only bottleneck to parallelization is that each GPU needs enough &#x201C;work&#x201D; to be utilized efficiently, and this work bottleneck is determined primarily by the GPU&#x2019;s compute $C$, memory bandwidth $M$, and network bandwidth $B$.</p>
<p><strong>Complete formula.</strong> The minimum time to complete one matrix multiply ends up (with some caveats) being</p>
<p align="center">
$\text{ElapsedTime} = \left\{ \begin{align*} \frac{54C^2}{M^3} \quad \quad \quad &amp; : \text{ if } B &gt; \frac{2}{3}M\sqrt{N} \\ \frac{8C^2N}{B^2(M-B/\sqrt{N})} &amp; : \text{ else.} \end{align*} \right.$
</p>
<p>In this formula, if we are not network bottlenecked then the time grows as $C^2/M^3$, and if we are then it grows as $C^2N/B^2M$.</p>
<p>Finally, if we are willing to decrease utilization by a factor of $k$, we can drive the time down by a factor of $k^2$.</p>
<p><strong>Interpretation for current machines.</strong> On current machines, either branch of the formula can apply: for instance, for an A100 GPU we have C = 312TFLOPS, M = 2TB/s, N=8, and B is variable but perhaps 2TB/s for high-end networks. Then the first term is $0.7 \times 10^{-6}$ and the second is $1.2 \times 10^{-6}$. In other words, on a cluster of A100s, each matrix multiply would take a couple microseconds. This assumes full GPU utilization (in reality it would be 30%-50%) and no latency (in reality around 1.5&#x3BC;s), so in practice it will be higher&#x2013;perhaps 4-5 microseconds. Since transformers like <a href="https://arxiv.org/abs/2203.15556?ref=bounded-regret.ghost.io">Chinchilla</a> require 160 consecutive matrix multiplies, the overall time for a forward pass comes out to around 0.7 ms. In contrast, humans take around 250ms to read one word.</p>
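<p><em>These numbers are easy to sanity-check by evaluating the piecewise formula directly with the A100-like values quoted above; this is a sketch of the arithmetic, not a measurement:</em></p>

```python
import math

# Elapsed time per matrix multiply, from the piecewise formula in the post.
def matmul_time(C, M, B, N):
    if B > (2 / 3) * M * math.sqrt(N):       # memory-bandwidth-limited regime
        return 54 * C**2 / M**3
    return 8 * C**2 * N / (B**2 * (M - B / math.sqrt(N)))  # network-limited regime

C, M, B, N = 312e12, 2e12, 2e12, 8           # A100-like values from the text
t_mem = 54 * C**2 / M**3                                 # first branch:  ~0.7 microseconds
t_net = 8 * C**2 * N / (B**2 * (M - B / math.sqrt(N)))   # second branch: ~1.2 microseconds
```

<p><em>For these values the network-limited branch applies; with the utilization and latency fudge factors described above (~4-5&#x3BC;s per multiply) and 160 sequential multiplies, this gives roughly the 0.7 ms quoted.</em></p>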
<p><strong>Justifying the formula.</strong> The rough reason for the formulas above is as follows: ignoring constants, when N=1, we can split any matrix multiplication up into blocks of size $\frac{C}{B} \times \frac{C}{B}$, and process inputs in batches of size $\frac{C}{M}$. For this computation shape, each machine needs to handle $\frac{C^2}{BM}$ incoming network bytes per matrix multiply, $\frac{C^2}{B^2}$ memory reads, and $\frac{C^3}{B^2M}$ floating point operations. Since the first gets handled at a rate of $B$, the second at a rate of $M$, and the third at a rate of $C$, all of them get handled in time $\frac{C^2}{B^2M}$ and hence all resources are fully utilized.</p>
<p>For N&gt;1 GPUs per machine, the input to the entire machine should be blocks of length $\frac{CN}{B}$ with batches of size $\frac{C}{M}$. This gets subdivided further into a $\sqrt{N} \times \sqrt{N}$ grid of blocks of size $\frac{C\sqrt{N}}{B}$, which get sent to each individual GPU. Then we process $\frac{C^2N}{BM}$ network bytes (at rate B), $\frac{C^2N}{B^2}$ memory reads per machine (at rate M), and $\frac{C^3N}{B^2M}$ floating point operations per machine (at rate C). In this case each component gets processed in $\frac{C^2N}{B^2M}$ seconds. Finally, the reason we get $\frac{C^2}{M^3}$ in some regimes is that the block size on each GPU also must be at least $\frac{C}{M}$ to avoid being bottlenecked on memory reads.</p>
<p>To sanity check the block tiling scheme above, on an 8xA100 machine the blocks would be 1024x1024 (rounded to the nearest power of 2). Since the dimensionality of many models is around 8192-16384, this means the computation would have to be split across 64-256 machines or 512-2048 total GPUs. This is within the range of possibility, e.g. a TPUv4 pod has 4096 chips.</p>
<p><strong>Trading off speed and cost.</strong> If we scale down both the block and batch size by a factor of $k$, then the utilization drops to $1/k$, but we decrease the work by a factor of $k^3$&#x2013;therefore, we can complete the multiplication $k^2$ times faster. This works until we start to run into other issues such as memory and network latency, which add up to 1-2 microseconds (and might even increase for large $k$ as it becomes harder to efficiently route information at small block sizes).</p>
<h1 id="detailed-explanation">Detailed Explanation</h1>
<h2 id="a-simple-model-of-gpu-costs-transformers-as-a-stack-of-matrix-multiplies">A Simple Model of GPU Costs: Transformers as a Stack of Matrix Multiplies</h2>
<p>GPUs are used for two important machine learning tasks&#x2014;<em>training</em> and <em>inference</em>. These have somewhat different requirements:</p>
<ul>
<li>At <strong>training time</strong>, the parameters of the model are constantly being updated, and these updates need to be communicated to the GPUs. Additional state, such as momentum terms for the optimizer, must also be stored and updated. This leads to large memory and large communication costs. In the positive direction, data can be processed simultaneously in large batches. This is important in order to amortize the communication costs, by processing many examples at once per parameter update.</li>
<li>At <strong>inference time</strong>, the model parameters are static. They only need to be loaded onto the GPU once at initialization. After that, the main communication cost is processing the input examples (i.e., words) themselves, which is much cheaper because it involves vectors (activations) rather than matrices (weights). To improve parallelization, the model weights are sharded across multiple GPUs, with each GPU storing and processing a contiguous block of a weight matrix (as described in more detail below). In addition to parallelization, sharding is important because the full set of model weights is often too big to fit onto a single GPU.</li>
</ul>
<p>I&#x2019;ll next dive into inference in more detail, since inference time is what governs the serial time per word.</p>
<p>As an abstraction, we&#x2019;ll think of a neural network as performing many consecutive matrix-vector multiplies, interleaved with a number of cheaper non-linear operations. The matrix-vector multiplies are the dominant cost, so we&#x2019;ll focus on those and ignore the others. As a concrete example, consider a transformer model, which primarily consists of self-attention and feed-forward blocks. A self-attention block looks like this:</p>
<p align="center">
<img src="https://bounded-regret.ghost.io/content/images/2022/06/multi_head_attention.jpg">
</p>
<p>The main cost is the 3 linear blocks for the values (V), keys (K), and queries (Q). Each involves multiplying a d-dimensional vector by a dxd matrix (total cost $\approx 3d^2$). For context length C, the attention block involves computing C d-dimensional vector-vector inner products (total cost $\approx Cd$) together with a normalization and softmax operation (total cost O(C)), and finally taking a sum of d-dimensional vectors weighted by the C attention weights (total cost $\approx Cd$).<sup class="footnote-ref"><a href="#fn4" id="fnref4">[4]</a></sup> A typical value of d is 8192-16384, while a typical value of C is 2048. So the matrix multiplies are 6-12 times more expensive than the next most expensive term.<sup class="footnote-ref"><a href="#fn5" id="fnref5">[5]</a></sup></p>
<p>Note that the cost of the matrix-vector multiplies is directly proportional to the size of the matrices. This means that the total cost in FLOPs of a forward-pass is essentially proportional to the number of parameters in the model, and works out to 2 floating-point operations per parameter (1 multiply + 1 add). This will generally be the case in the absence of recurrence or other forms of weight sharing, though I expect weight sharing to eventually be present in future models.</p>
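<p><em>The "2 FLOPs per parameter" rule above fits in one line; the 70B parameter count below is an illustrative assumption (roughly Chinchilla-sized):</em></p>

```python
# Forward-pass compute ~= 2 FLOPs per parameter (one multiply + one add
# per weight), assuming no recurrence or weight sharing.
params = 70e9                  # illustrative: ~Chinchilla-scale model
flops_per_token = 2 * params   # 1.4e11 FLOPs to generate one token
```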
<h2 id="cost-of-implementing-matrix-vector-multiplies-on-gpus">Cost of Implementing Matrix-Vector Multiplies on GPUs</h2>
<p>Implementing a matrix-vector multiply on a GPU has four major costs:</p>
<ol>
<li>We need to <em>communicate</em> and <em>store</em> the parameters of the matrix (once, at initialization).</li>
<li>For each new input, we need to communicate the input vector to the GPU (and send the output vector back to RAM or to other GPUs). This cost can be ignored if the matrix for the previous/next multiply is on the same GPU.</li>
<li>We need to read the data from on-chip memory.</li>
<li>We need to perform the actual multiply operation.</li>
</ol>
<p>I&#x2019;ll ignore (1.), since it can be amortized away, and focus on how costs (2.-4.) add up in practice. I&#x2019;ll start by imagining that all operations happen on a single GPU, then consider parallelizing across multiple GPUs.</p>
<h3 id="warm-up-single-gpu-no-batching">Warm-up: Single GPU, No Batching</h3>
<p>As a simple example, let&#x2019;s consider a 4096-dimensional vector, which needs to get multiplied by a sequence of $L$ $4096 \times 4096$ matrices, with $L=16$. Each matrix entry is 2 bytes<sup class="footnote-ref"><a href="#fn6" id="fnref6">[6]</a></sup>, so we&#x2019;re using $2^{12+12+4+1} = 2^{29}$ bytes in total, or 0.5GB. Most GPUs have at least 10GB of memory, so this fits easily onto a single chip.</p>
<p>For each new input to the model, we need to communicate one 4096-dimensional vector to the GPU. So that&#x2019;s 8192 bytes of communication. In addition, we need 2 x 16 x (4096 x 4096) FLOPs to actually perform the 16 matrix-vector products (2 FLOPs per matrix entry), or 0.5GFLOPs. We similarly need to perform 0.5GB of memory reads to load the matrix into memory (plus a much smaller number of reads and writes for the vector input and output). Thus, overall we need:</p>
<ul>
<li>~8KB of communication</li>
<li>~0.5GB of memory accesses</li>
<li>~0.5GFLOPs of computation</li>
</ul>
<p>How long does each part take? Let&#x2019;s consider an <a href="https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-nvidia-us-2188504-web.pdf?ref=bounded-regret.ghost.io">A100 GPU</a>, which has 312TFLOP/s of computation, 1.5TB/s of memory bandwidth, and (depending on network setup) ~400GB/s of network bandwidth. Then communication takes 8KB/(400GB/s) = <u>0.02&#x3BC;s</u>. Memory takes 0.5GB/(1.5TB/s) = <u>330&#x3BC;s</u>. And computation takes 0.5GFLOP/(312TF/s) = <u>1.6&#x3BC;s</u>.</p>
<p><em><strong>From a cost perspective, this is terrible!</strong></em> The system is totally bottlenecked on memory reads&#x2014;computation takes only 0.5% of the time that it takes to read from memory, which means that the GPU is only running at 0.5% utilization. In terms of cost, this means you are overpaying for your GPU by a factor of 200.</p>
<p>To solve this, we process inputs in <strong>batches</strong>. Suppose that instead of reading one input, we read 256 inputs at once<sup class="footnote-ref"><a href="#fn7" id="fnref7">[7]</a></sup>, represented as a 4096 x 256 vector. Then our costs change as follows:</p>
<ul>
<li>Communication: 2MB (was 8KB)</li>
<li>Memory: 0.56GB (was 0.5GB&#x2014;the extra .06 is from reading and writing the 4096 x 256 input and output at each layer)</li>
<li>Compute: 128GFLOPs (was 0.5GFLOPs)</li>
</ul>
<p>Now how long does each part take? Communication is now <u>5&#x3BC;s</u>, memory is <u>350&#x3BC;s</u>, and computation is <u>400&#x3BC;s</u>. Now memory and computation take about the same time, and the GPU can run at 100% utilization (in theory). In addition, <strong>the serial time barely increased</strong>. It is now 400&#x3BC;s (or 25&#x3BC;s/layer), up from 330&#x3BC;s before.</p>
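<p><em>The warm-up arithmetic above can be reproduced in a few lines; the results match the text up to rounding (the text rounds 0.537 GB down to 0.5 GB, and so on). Hardware values are the A100-like figures quoted above:</em></p>

```python
# Roofline arithmetic for 16 layers of 4096x4096 bf16 matrices on one GPU.
d, n_layers, bytes_per = 4096, 16, 2
C, M, B = 312e12, 1.5e12, 400e9   # FLOP/s, memory bytes/s, network bytes/s

def times(batch):
    comm = d * batch * bytes_per / B                                      # input vectors
    mem = (n_layers * d * d + 2 * n_layers * d * batch) * bytes_per / M   # weights + activations
    flops = 2 * n_layers * d * d * batch / C
    return comm, mem, flops   # seconds spent on each resource

c1, m1, f1 = times(1)      # memory-bound: ~0.02us comm, ~360us memory, ~1.7us compute
c2, m2, f2 = times(256)    # balanced:     ~5us comm,   ~400us memory, ~440us compute
```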
<h3 id="tiling-across-gpus">Tiling Across GPUs</h3>
<p>If we want to run the computation as fast as possible, we probably want to parallelize our computation across more than one GPU. For now I&#x2019;m just going to focus on a single 4096 x 4096 matrix multiplication (i.e. one layer of the model).</p>
<p>Let&#x2019;s suppose that instead of one A100 GPU, I have four A100 GPUs. I can think of these GPUs as a 2x2 grid and &#x201C;tile&#x201D; my multiplication across them: processing a 2048 x 2048 block of the matrix in each tile of the grid.</p>
<p>To implement this, given my input (still a 4096 x 256 vector), I do the following:</p>
<ol>
<li>Split the input into two 2048 x 256 vectors, $u_1$ and $u_2$.</li>
<li>Send $u_1$ to the first two machines (getting outputs $v_{11}$ and $v_{12}$) and $u_2$ to the second two machines (getting outputs $v_{21}$ and $v_{22}$).</li>
<li>As a post-processing step, add together $v_1 = v_{11}+v_{12}$ and $v_2 = v_{21}+v_{22}$, then concatenate $v_1$ and $v_2$ into a single output vector $v$.</li>
</ol>
<p align="center">
<img src="https://bounded-regret.ghost.io/content/images/2022/06/gpu_tiling.png">
</p>
<p>Pretty much all the work here happens in step (2.). In this step, each machine must receive and send a 2048 x 256 vector (~2MB total), read or write matrices of size 2048x256, 2048x2048, and 2048x256 (~10MB), and run 2048x2048x256 add-multiply operations (~2GFLOPs). For the same hardware specs as before, this comes out to 5&#x3BC;s for communication, 6&#x3BC;s for memory, and 6&#x3BC;s for computation. Relative to the 25&#x3BC;s/layer from before, we&#x2019;ve achieved a 4x speed-up due to parallelizing across 4 GPUs.</p>
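<p><em>A quick check of the per-GPU costs in step (2), using the same A100-like figures as the warm-up; the small differences from the 5/6/6&#x3BC;s quoted come from rounding:</em></p>

```python
# Each of the 4 GPUs: receive and send a 2048x256 slice, read a
# 2048x2048 weight block plus activations, and do 2048*2048*256
# multiply-adds.
m, b, bytes_per = 2048, 256, 2
C, M, B = 312e12, 1.5e12, 400e9

comm = 2 * m * b * bytes_per / B            # ~2MB    -> ~5us
mem = (m * m + 2 * m * b) * bytes_per / M   # ~10MB   -> ~7us
flops = 2 * m * m * b / C                   # ~2GFLOP -> ~7us
```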
<p>In general, we can continue to parallelize across more and more GPUs by splitting into smaller and smaller blocks, as long as we don&#x2019;t end up bottlenecked by memory or communication costs. In the next section, we&#x2019;ll carefully count the communication, memory, and compute usage as a function of the block and batch size, and analyze what range they need to be in to achieve full GPU utilization.</p>
<h3 id="detailed-count-of-operation-costs">Detailed Count of Operation Costs</h3>
<p>In general, let&#x2019;s suppose that we have a large $d \times d$ matrix multiply, and we want to split it up into smaller blocks. Let $m$ denote the block size, and suppose that we process inputs in batches of size $b$. Then we need $(d/m)^2$ GPUs in total, and each individual GPU performs the following operations:</p>
<ol>
<li>Receives an incoming matrix of length m x b (a batch of b vectors of length m).</li>
<li>Multiplies this m x b matrix by a fixed m x m matrix that is stored on the GPU (i.e., a single block of the weight matrix of the model).</li>
<li>Outputs a vector of length m x b.</li>
</ol>
<p>As in the previous section, there is also some pre- and post-processing to split and combine the d x 1 vectors into m x 1 vectors, but we will ignore those as they generally only contribute lower-order terms.</p>
<p>Overall, this requires $4mb$ bytes of network communication ($mb$ 16-bit floats on both input and output, and each float is two bytes). It requires $4mb + 2m^2$ bytes of memory traffic (each number is again 2 bytes, and we need to read an $m \times b$ and $m \times m$ matrix and write an $m \times b$ matrix). Finally, it requires $2m^2b$ floating point operations. Therefore, the total elapsed time is</p>
<p align="center">
$\quad \quad \quad \max\Big(\frac{4mb}{B}, \frac{4mb+2m^2}{M}, \frac{2m^2b}{C}\Big) \quad \quad \quad (\star)$    
</p>
<p>To avoid being memory constrained and reach full GPU utilization, the third term must be at least the second term (see <a href="https://docs.nvidia.com/deeplearning/performance/dl-performance-gpu-background/index.html?ref=bounded-regret.ghost.io#understand-perf">NVIDIA docs</a> for discussion). Therefore, we have the constraint</p>
<p align="center">
$\frac{4mb+2m^2}{2m^2b} \leq \frac{M}{C}$    
</p>
<p>By the same token, to avoid being communication constrained we should have</p>
<p align="center">
$\frac{4mb}{2m^2b} \leq \frac{B}{C}$
</p>
<p>This second equation is easy to solve and implies the block size $m$ should be at least $\frac{2C}{B}$ to avoid communication bottlenecks.</p>
<p>The first equation is a bit trickier. It works out to $\frac{2}{m} + \frac{1}{b} \leq \frac{M}{C}$. Since we already know $m$, this comes out to $\frac{1}{b} \leq \frac{M-B}{C}$, so the batch size $b$ should be at least $\frac{C}{M-B}$. Plugging back into the formula $(\star)$, the total elapsed time is</p>
<p align="center">
$\frac{2m^2b}{C} = \frac{8C^2}{B^2(M-B)}$
</p>
<p>But wait! This equation becomes infinite if $M=B$. What went wrong? Well, we set $m$ equal to $\frac{2C}{B}$, but actually it just needs to be at least this large. In fact, it turns out that we always want $m$ to be at least as large as the batch size $b$. So aside from setting $m=\frac{2C}{B}$, the alternative is to set $m$ equal to $b$, in which case we get $m=b=\frac{3C}{M}$. This yields an elapsed time of $\frac{54C^2}{M^3}$. We end up with this elapsed time whenever $M \leq 1.5B$.</p>
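<p><em>Putting the derivation together, here is a short check (with the same illustrative A100-like numbers used earlier) that the block and batch sizes derived above do balance the three resource times in $(\star)$:</em></p>

```python
# The (*) cost model: time for one m x m block with batch b is the max
# of the network, memory, and compute times.
def elapsed(m, b, B, M, C):
    return max(4 * m * b / B,                # network bytes at rate B
               (4 * m * b + 2 * m * m) / M,  # memory bytes at rate M
               2 * m * m * b / C)            # FLOPs at rate C

C, M = 312e12, 1.5e12

# Network-limited regime (M > 1.5B): m = 2C/B, b = C/(M - B).
B = 400e9
t_net = elapsed(2 * C / B, C / (M - B), B, M, C)     # = 8C^2 / (B^2 (M - B))

# Memory-limited regime (M <= 1.5B), e.g. a faster network: m = b = 3C/M.
B_fast = 2e12
t_mem = elapsed(3 * C / M, 3 * C / M, B_fast, M, C)  # = 54C^2 / M^3
```

<p><em>In each regime the max is attained (up to floating-point rounding) by all of the binding resource terms simultaneously, which is exactly the "full utilization" condition used in the derivation.</em></p>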
<h3 id="one-machine-multiple-gpus">One Machine, Multiple GPUs</h3>
<p>In many cases, we have more than one GPU on a machine. Then communication splits into two costs: communication between RAM and the GPU (or between GPUs), and communication between this machine and other machines. Typically, within-machine communication is fast compared to between-machine communication, so we&#x2019;ll treat cross-GPU (within-machine) communication as &#x201C;free&#x201D;, while cross-machine communication has a network bandwidth of B.</p>
<p>Suppose that each machine has N GPUs. Then in the simplest accounting, each individual GPU only gets $\frac{B}{N}$ of the network bandwidth, and it&#x2019;s as if we&#x2019;re in the single-GPU case above, but with bandwidth $\frac{B}{N}$ instead of $B$.</p>
<p>However, it turns out we can do better. If instead of processing $N$ arbitrary $m \times m$ blocks of the $d \times d$ matrix, we process a $\sqrt{N} \times \sqrt{N}$ grid of $m \times m$ blocks, then we can treat the entire computation as a $m\sqrt{N} \times m\sqrt{N}$ block. In particular, we only need to process input and output vectors of length $m\sqrt{N}$, rather than $mN$. This means that each machine effectively gets network bandwidth $B/\sqrt{N}$, rather than $B/N$ (here I&#x2019;m assuming $N$ is a perfect square).</p>
<p align="center">
<img src="https://bounded-regret.ghost.io/content/images/2022/06/n_to_sqrtn.png">
</p>
<p>This leads to the final formula, which is equivalent to the single-machine formula but replaces $B$ with $B/\sqrt{N}$. We repeat it here for completeness:</p>
<p align="center">
$\text{ElapsedTime} = \left\{ \begin{align*} \frac{54C^2}{M^3} \quad \quad \quad &amp; : \text{ if } B &gt; \frac{2}{3}M\sqrt{N} \\ \frac{8C^2N}{B^2(M-B/\sqrt{N})} &amp; : \text{ else.} \end{align*} \right.$
</p>
<p>Remember that this is the elapsed time for a single matrix multiply. To get the elapsed time for a forward pass, we need to multiply by $L$, which for transformers is usually twice the number of layers.</p>
<h2 id="more-on-memory-constraints">More on Memory Constraints</h2>
<p><a name="roofline"></a></p>
<p>This entire discussion so far assumes that we can actually reach 100% utilization by increasing the batch size. However, some operations are not effectively batchable and end up constrained on memory reads for large batch sizes. For transformers, this happens when computing the self-attention weights, which is a lower-order term for small batch size but becomes the bottleneck for large batch size and context length.</p>
<p>There are a few ways to think about this. First, we could argue that such bottlenecks probably go away (or at least don&apos;t accumulate) in the long run. For instance, Google&apos;s <a href="https://arxiv.org/abs/2204.02311?ref=bounded-regret.ghost.io">PaLM model</a> modifies the self-attention block to partially mitigate this issue. I don&apos;t think this argument entirely holds, since most optimizations target training costs rather than inference costs, but I do think there&apos;s some truth to it.</p>
<p>Alternatively, we could take such bottlenecks as given. Perhaps we actually can&apos;t hope for more than (say) 10% GPU utilization, which also means that the theoretical batch size of $\frac{3C}{M}$ is much larger than necessary (since we&apos;re only trying to hit 10% rather than 100% utilization). This is effectively the same as setting $k = 10$---achieving only 10% utilization but running the computation 100 times faster. In other words, if there are unavoidable memory bottlenecks, our analysis still generally holds, except that we are forced to set $k$ to some value larger than $1$.</p>
<p>Overall, then, I don&apos;t think that memory constraints (from self-attention or other hypothetical blocks) significantly alter the analysis. We can either assume they go away over time, or just restrict our analysis to a certain regime of the GPU utilization $\frac{1}{k}$.</p>
<h2 id="other-possible-bottlenecks">Other Possible Bottlenecks</h2>
<p>In reality, cross-machine communication and GPU memory bandwidth are not the only possible bottlenecks. For instance, we could be bottlenecked by cross-GPU communication within a machine (discussed above), or by L1 or L2 cache speed. But in practice, these are rarely the bottlenecks, so I&#x2019;ve focused above on the two cases that typically matter. The main other case that I think could matter is end-to-end network communication---it could be that each machine has enough network bandwidth, but the overall cluster has a network topology that makes it hard to efficiently route packets. This seems possible to me, but I also think that well-designed clusters can probably avoid this, so for simplicity I ignored it in my analysis.</p>
<p>In addition, I&#x2019;ve made a number of idealized assumptions, such as that C/B and C/M are integers, that all quantities nicely divide each other, and implicitly that many quantities are powers of 2 or divisible by powers of 2 (needed to get good cache performance; not discussed here). I&#x2019;ve also ignored a number of smaller operations like the nonlinearities in each layer, and mostly ignored network and memory latency. I&#x2019;m basically folding these all into the &#x201C;30%-50% GPU utilization&#x201D; fudge factor, which should usually be attainable in practice, but often requires good systems engineering. So, this post isn&#x2019;t a recipe for actually parallelizing your forward pass, but it should give a good idea of how quickly it can be done in principle.</p>
<h1 id="summary">Summary</h1>
<p>Forward passes can be parallelized to a fairly large degree before running into memory or network bottlenecks. Roughly speaking, the time for a single matrix multiply grows quadratically with the GPU FLOPS, and decreases cubically with the network/memory bandwidth. We gave an exact formula (under idealized assumptions) that counts the actual floating-point operations, memory reads/writes, and network usage, and determines the regimes when memory vs. network become the bottleneck resource.</p>
<p>In the next post, I&apos;ll apply this formula to forecast how fast not just today&apos;s models, but also future ones, can be run.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>Specifically, I assume the model also generated all the previous tokens and that their activation vectors are cached, so it only needs to compute activation vectors for the current position. <a href="#fnref1" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn2" class="footnote-item"><p>In fact, you are also forced to use multiple GPUs, because state-of-the-art models don&#x2019;t fit into a single GPU&#x2019;s memory. <a href="#fnref2" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn3" class="footnote-item"><p>I chose 40% to be a &#x201C;decent&#x201D; level&#x2013;generally attainable after working hard to optimize out bottlenecks, but below the best possible. <a href="#fnref3" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn4" class="footnote-item"><p>Technically speaking, for a context window of size C, all costs should be multiplied by C (since we need to do all these computations for each token in the context). However, assuming that we are reading or writing a sequence of consecutive words, the previous C-1 tokens have already been computed and their results can be cached and re-used. Naive caching increases memory costs by a nontrivial amount, but there are tricks such as rematerialization that reduce this at a ~33% overhead in computation. <a href="#fnref4" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn5" class="footnote-item"><p>One very important caveat is that the self-attention operations cannot be efficiently batched, because the key and query vectors both change for each batch item. This means that self-attention becomes memory-bottlenecked faster than other operations, as I&#x2019;ll discuss in more detail later. <a href="#fnref5" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn6" class="footnote-item"><p>Assuming <a href="https://en.wikipedia.org/wiki/Bfloat16_floating-point_format?ref=bounded-regret.ghost.io">bfloat16</a> format, which is typically used to speed up inference. <a href="#fnref6" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn7" class="footnote-item"><p>I chose 256 because it was the power of 2 closest to 200, which was the amount by which we were underutilizing the GPU. <a href="#fnref7" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
</ol>
</section>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Early 2022 Paper Round-up (Part 2)]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p><a href="https://bounded-regret.ghost.io/early-2022-paper-round-up/">Last week</a>, I talked about six recent papers from our group, and discussed the first two in detail. This week, I&apos;ll discuss the remaining four. They fall into two categories: robustness, and science of ML.</p>
<h2 id="robustness">Robustness</h2>
<p>By robustness, I mean both making systems less likely to fail in</p>]]></description><link>https://bounded-regret.ghost.io/early-2022-paper-round-up-part-2/</link><guid isPermaLink="false">6261c59e7604fa003d006b1d</guid><dc:creator><![CDATA[Jacob Steinhardt]]></dc:creator><pubDate>Thu, 21 Apr 2022 23:31:50 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p><a href="https://bounded-regret.ghost.io/early-2022-paper-round-up/">Last week</a>, I talked about six recent papers from our group, and discussed the first two in detail. This week, I&apos;ll discuss the remaining four. They fall into two categories: robustness, and science of ML.</p>
<h2 id="robustness">Robustness</h2>
<p>By robustness, I mean both making systems less likely to fail in new situations, and being able to predict when and how they will fail. Our three papers address different aspects of this: the first seeks to automatically estimate a model&#x2019;s performance in a new situation, the second seeks to understand in what way open-ended generation systems fail, and the third provides a training procedure that improves robustness along several dimensions.</p>
<p><strong>Predicting Out-of-Distribution Error.</strong> Yaodong Yu, Zitong Yang, and Alex Wei sought to solve the following problem: given a classifier $\theta$ trained on a distribution $p_{\mathrm{in}}$, and given sample inputs from a new distribution $p_{\mathrm{out}}$, can we predict how well $\theta$ works on the new distribution $p_{\mathrm{out}}$? For instance, maybe an image classifier was trained on images in the US ($p_{\mathrm{in}}$), and we want to know how well it will do on images in France ($p_{\mathrm{out}}$). Since we have no output labels for $p_{\mathrm{out}}$, this is an unsupervised estimation problem.</p>
<p>There are a number of heuristics for predicting out-of-distribution error, such as looking at the model&apos;s confidence or the disagreement rate between multiple models with different random seeds. However, most of these heuristics have the same problem: they are insensitive to changes that are orthogonal to the training manifold. As a result, they tend to fail on &#x201C;hard&#x201D; distribution shifts&#x2014;for instance, given a distribution of adversarial examples, they all predict the model will have high accuracy.</p>
<p>We present a new method, ProjNorm, that does well on hard distribution shifts. For instance, compared to ATC (a strong existing method), we make more accurate predictions when the error is large:</p>
<p><img src="https://bounded-regret.ghost.io/content/images/2022/04/ProjNorm.png" alt="ProjNorm" loading="lazy"></p>
<p>The method is simple, but it takes a bit of time (for me at least!) to grasp the intuition, so I&#x2019;ll leave that for the paper. But basically, you use the model $\theta$ to pseudo-label the samples from $p_{\mathrm{out}}$, fine-tune a model on those pseudo-labels, and then compare that new model to $\theta$. It turns out that this can be interpreted as a nonlinear projection operator and overcomes the problems with previous methods.</p>
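<p>To make the recipe concrete, here is a toy sketch on a logistic-regression &#x201C;network&#x201D;. Everything below (the model, data, and hyperparameters) is my own illustrative simplification, not the paper&#x2019;s implementation.</p>

```python
import numpy as np

# Toy sketch of the ProjNorm recipe: pseudo-label the new distribution's
# inputs with the trained model, fine-tune a copy on those pseudo-labels,
# and report the parameter distance between the two models. The model,
# data, and hyperparameters are illustrative simplifications.

rng = np.random.default_rng(0)

def grad_step(w, X, y, lr=0.1):
    p = 1.0 / (1.0 + np.exp(-X @ w))        # sigmoid predictions
    return w - lr * X.T @ (p - y) / len(y)  # full-batch step on log-loss

def proj_norm(w_in, X_out, steps=300):
    pseudo = (X_out @ w_in > 0).astype(float)  # pseudo-label the new inputs
    w = w_in.copy()
    for _ in range(steps):                     # "fine-tune" on pseudo-labels
        w = grad_step(w, X_out, pseudo)
    return float(np.linalg.norm(w - w_in))     # distance between the models

# Train the reference model on in-distribution data.
X_in = rng.normal(size=(500, 10))
y_in = (X_in[:, 0] > 0).astype(float)
w = np.zeros(10)
for _ in range(300):
    w = grad_step(w, X_in, y_in)

# Score a mild perturbation and a mean-shifted (harder) distribution.
score_mild = proj_norm(w, X_in + 0.1 * rng.normal(size=X_in.shape))
score_hard = proj_norm(w, 2.0 + rng.normal(size=(500, 10)))
```

<p>The paper finds that harder shifts yield larger scores; this toy version only illustrates the mechanics of pseudo-labeling, fine-tuning, and measuring the parameter distance.</p>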
<p><em>Why you should care.</em> Many of the ways that ML models could fail come from generalizing poorly. For instance, a system might learn an objective function that generates good behavior in one situation, but the wrong behavior in new situations. Detecting these failures automatically would be very useful if we could do so reliably. Doing so could be very hard in general, and I&#x2019;d say the jury is very much still out, but this paper has made me more optimistic that it&#x2019;s possible to do something nontrivial and interesting.</p>
<p><strong>Capturing Failures of Large Language Models.</strong> Most work on robustness focuses on classification error rates&#x2013;how much more often a model gets the answer wrong when presented with unusual or out-of-distribution inputs. However, modern language models are open-ended: rather than a class label, they generate arbitrary text. So to understand how they behave on unusual inputs, it&#x2019;s not enough to just understand the error rate. For instance, one language model might err by producing gibberish, while another responds with insults or other toxic text. Or a code model could either produce code that doesn&#x2019;t compile, or code that deletes all files in the home directory.</p>
<p>To tame the complexity of open-ended generation, Erik Jones used human cognitive biases to taxonomize and test a subset of possible failure modes. For instance, the <em>framing</em> and <em>anchoring</em> biases led us to hypothesize that when incorrect code appears in a model&#x2019;s context, similar code will appear in the outputs to new prompts. We design experiments to test and quantify how often this occurs, and find that it is indeed common. Inspired by these and other cognitive biases, we unearth several new types of failure modes that code generation models are prone to.</p>
<p><em>Why you should care.</em> In my opinion, robustness is most important in what I&#x2019;ll call &#x201C;structured output&#x201D; settings&#x2014;where the output is a complex object such as a sentence, a trajectory, or a computer program, rather than just a class label. The reason to care about these settings is twofold: first, the cost of failures is potentially much larger&#x2014;the maximum damage I can do with a computer program is greater than the maximum damage I can do with a class label. Second, some important classes of failures, such as <a href="https://arxiv.org/abs/2105.14111?ref=bounded-regret.ghost.io">objective misgeneralization</a>, only show up in structured output settings. Our work provides useful approaches for finding robustness failures in these structured output settings.</p>
<p><strong>Comprehensively Improving Safety Measures.</strong> Dan Hendrycks, Andy Zou, and several other collaborators designed a new data augmentation strategy called PixMix, which is partly inspired by <a href="https://arxiv.org/abs/2101.08515?ref=bounded-regret.ghost.io">Kataoka et al.&#x2019;s observation</a> that fractals are an effective source of pretraining images.</p>
<p>PixMix works by repeatedly mixing with a high-complexity image (such as a fractal) and then applying a standard augmentation such as posterization. Two resulting example images are shown below:</p>
<p><img src="https://bounded-regret.ghost.io/content/images/2022/04/pixmix2.png" alt="pixmix2" loading="lazy"></p>
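<p>Schematically, the mixing loop looks like the following sketch. This is a simplified paraphrase: the real method uses a specific augmentation set and particular additive/multiplicative mixing weights, which I have replaced with hypothetical choices here.</p>

```python
import random
import numpy as np

# Simplified paraphrase of the PixMix mixing loop: repeatedly blend the
# image with a high-complexity "mixer" picture (e.g. a fractal), then apply
# a standard augmentation. The blend weight, round count, and augmentation
# below are hypothetical stand-ins for the method's actual choices.

def posterize(img, bits=4):
    """A standard augmentation: quantize pixel values to 2**bits levels."""
    levels = 2 ** bits
    return np.floor(img * levels) / levels

def mix(img, mixer, alpha=0.3):
    """Blend the image with a high-complexity mixer picture."""
    return np.clip((1 - alpha) * img + alpha * mixer, 0.0, 1.0)

def pixmix(img, mixers, rounds=3, seed=0):
    rng = random.Random(seed)
    out = posterize(img) if rng.random() < 0.5 else img
    for _ in range(rounds):
        out = mix(out, rng.choice(mixers))  # mix in a fractal-like image
        out = posterize(out)                # then apply an augmentation
    return out

img = np.full((8, 8, 3), 0.5)  # a dummy gray "image" in [0, 1]
fractals = [np.random.default_rng(1).random((8, 8, 3))]
augmented = pixmix(img, fractals)
```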
<p>While PixMix does not improve accuracy more than other data augmentation methods, it helps significantly more with many safety properties, such as calibration, anomaly detection, robustness to corruptions, and robustness to adversaries.</p>
<p><em>Why you should care.</em> PixMix takes a different perspective from other data augmentation strategies. Rather than trying to maximize the entropy or the diversity of the augmented images, it seeks to maximize their complexity. In general, complexity feels like an important and understudied concept for AI safety. For instance, it seems plausible to me that what is most important for controlling an AI system&#x2019;s behavior is the complexity of its supervision signal&#x2014;supervision that is not complex enough will not pin down the behavior in enough cases. In another direction, the simplest case for AI posing risks is that it is a complex self-organizing system, and most such systems are difficult to control and create unintended consequences. Given the potential importance of complexity, I&#x2019;m excited to see it used to make ML systems safer.</p>
<h2 id="science-of-ml">Science of ML</h2>
<p>&#x201C;Science of ML&#x201D; is a fairly broad term, but for me it means trying to understand laws that govern neural network behavior, such that we can predict that behavior in new situations. One particular use case is for <em>forecasting</em>&#x2014;predicting what will happen as we make neural nets bigger and train them on more data.</p>
<p><strong>Predicting How Real-World Representations Generalize.</strong> Alex Wei and Wei Hu improved our understanding of neural networks, by identifying a mathematical framework that helps clarify neural scaling laws, as well as the role of pretraining.</p>
<p>Because neural networks are difficult to analyze, we studied linear models derived from neural networks. Specifically, given a trained neural network, you can take a first-order Taylor approximation in parameter space to obtain a linear model, which will usually have more parameters than the number of data points (since most neural networks are overparameterized). As a result, we treat the model as a nonparametric (kernel) model, which we call the <em>empirical neural tangent kernel</em> (eNTK). We can check empirically that while eNTKs don&#x2019;t perform as well as neural networks, they do much better than traditional NTK models (which are obtained from random rather than trained networks). The eNTK models also exhibit <a href="https://arxiv.org/abs/2001.08361?ref=bounded-regret.ghost.io">power law scaling</a> just like regular neural networks do.</p>
<p>We might therefore hope that understanding eNTKs will help us to understand neural networks. To help with this, Alex and Wei set out to understand the generalization error of neural networks (the error on fresh test samples after the model has been trained on N training samples). It turns out that the generalization error can be effectively predicted by a function called the <em>generalized cross-validation estimator</em> (GCV), and that this can be mathematically justified in terms of random matrix laws. More interestingly, the GCV estimator predicts that the error depends primarily on two terms: the <em>effective dimension</em>, and the <em>alignment</em> between the eNTK features and the true function we want to learn. Both of these terms are needed to accurately predict neural scaling laws, and they are also needed to understand pretraining. For instance, pretrained models actually have larger effective dimension (contrary to my initial intuition), and are better primarily because they have much better alignment with the true function.</p>
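<p>For reference, the classical form of the GCV estimator from the ridge-regression literature is below (the paper&#x2019;s variant may differ in normalization and details; this is the textbook version):</p>

```latex
\mathrm{GCV}(\lambda)
  = \frac{\tfrac{1}{n}\,\bigl\lVert \bigl(I - H(\lambda)\bigr)\, y \bigr\rVert_2^2}
         {\bigl(1 - \tfrac{1}{n}\operatorname{tr} H(\lambda)\bigr)^2},
\qquad
H(\lambda) = K \,(K + \lambda I)^{-1}
```

<p>Here $K$ is the $n \times n$ kernel (in this case eNTK) matrix on the training set, $y$ are the training labels, and $\operatorname{tr} H(\lambda)$ plays the role of the effective dimension.</p>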
<p><em>Why you should care.</em> Neural scaling laws are a promising avenue to help forecast the behavior of future ML models. However, the reason they arise is still not fully understood. This is important because I expect we will need to deal with much more complex scaling laws in the future, such as the joint scaling between a policy model and a reward model when learning reward functions from human feedback. It may be too expensive to exhaustively map out these laws empirically, and theoretical understanding can help us reach the right answer more efficiently. I think Alex and Wei&#x2019;s work brings us a step closer to doing that.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Early 2022 Paper Round-up]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>My students and collaborators have been doing some particularly awesome work over the past several months, and to highlight that I wanted to summarize their papers here, and explain why I&#x2019;m excited about them. There&#x2019;s six papers in three categories.</p>
<p><strong>Human-Aligned AI</strong></p>
<ul>
<li><a href="https://arxiv.org/abs/2201.03544?ref=bounded-regret.ghost.io">The Effects of Reward</a></li></ul>]]></description><link>https://bounded-regret.ghost.io/early-2022-paper-round-up/</link><guid isPermaLink="false">62586b0b8af312003dd966d0</guid><dc:creator><![CDATA[Jacob Steinhardt]]></dc:creator><pubDate>Thu, 14 Apr 2022 20:47:10 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>My students and collaborators have been doing some particularly awesome work over the past several months, and to highlight that I wanted to summarize their papers here, and explain why I&#x2019;m excited about them. There&#x2019;s six papers in three categories.</p>
<p><strong>Human-Aligned AI</strong></p>
<ul>
<li><a href="https://arxiv.org/abs/2201.03544?ref=bounded-regret.ghost.io">The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models</a> (<em>w/ Alex Pan, Kush Bhatia</em>)</li>
<li><a href="https://arxiv.org/abs/2201.12323?ref=bounded-regret.ghost.io">Summarizing Differences between Text Distributions with Natural Language</a> (<em>w/ Ruiqi Zhong, Charlie Snell, Dan Klein</em>)</li>
</ul>
<p><strong>Robustness</strong></p>
<ul>
<li><a href="https://arxiv.org/abs/2202.05834?ref=bounded-regret.ghost.io">Predicting Out-of-Distribution Error with the Projection Norm</a> (<em>w/ Yaodong Yu, Zitong Yang, Alex Wei, Yi Ma</em>)</li>
<li><a href="https://arxiv.org/abs/2202.12299?ref=bounded-regret.ghost.io">Capturing Failures of Large Language Models via Human Cognitive Biases</a> (<em>w/ Erik Jones</em>)</li>
<li><a href="https://arxiv.org/abs/2112.05135?ref=bounded-regret.ghost.io">PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures</a> (<em>w/ Dan Hendrycks, Andy Zou, Mantas Mazeika, Leonard Tang, Bo Li, Dawn Song</em>)</li>
</ul>
<p><strong>Science of ML</strong></p>
<ul>
<li><a href="https://arxiv.org/abs/2203.06176?ref=bounded-regret.ghost.io">More Than a Toy: Random Matrix Models Predict How Real-World Neural Representations Generalize</a> (<em>w/ Alex Wei, Wei Hu</em>)</li>
</ul>
<p>I&apos;ll go over the first category (human-aligned AI) today, and save the other two for next week. As always, we love getting feedback on our work, so let us know what you think!</p>
<h2 id="human-aligned-ai">Human-Aligned AI</h2>
<p>While AI alignment is a somewhat subtle and complex problem, two basic issues are that (1) ML systems often hack their reward functions, and (2) human supervision doesn&#x2019;t necessarily solve this, because humans can&#x2019;t easily understand the consequences of intervening on complex systems. Alex and Ruiqi&#x2019;s papers help address each of these issues in turn.</p>
<p><strong>Mapping and Mitigating Misaligned Models.</strong> Alex Pan and Kush Bhatia constructed a wide variety of reinforcement learning environments where reward hacking is possible, and measured the extent to which it occurred. They did this by defining both a &#x201C;proxy&#x201D; and a &#x201C;true&#x201D; reward, and looking at what happens to the true reward as we optimize the proxy reward. Two key insights are that:</p>
<ul>
<li>Optimizing the proxy reward for longer, or with larger policy models, often leads to <strong>lower</strong> true reward.</li>
<li>When this happens, it sometimes occurs suddenly, via a <strong>phase transition</strong> (in both the quantitative reward and the qualitative behavior).</li>
</ul>
<p>A simple illustration of both is a traffic simulator, where the RL agent is trying to shape traffic flow to be more efficient. Small neural net models help cars merge efficiently onto the highway, but large models instead block cars from merging at all (which allows the cars already on the highway to move really fast and consequently achieves high proxy reward).</p>
<p><img src="https://bounded-regret.ghost.io/content/images/2022/04/reward_misspecification.png" alt="reward_misspecification" loading="lazy"></p>
<p>In this case, the proxy reward was actually the reward suggested by the designers of the traffic simulator, highlighting the difficulty of choosing good reward functions in practice.</p>
<p><em>Why you should care.</em> Our results show that reward hacking is likely to become a bigger problem in the future (since it seems to get worse as models get larger). They also show that in some cases, reward hacking could appear suddenly or unexpectedly. This seems important to investigate, and we are hoping others will join us in continuing to understand when reward hacking occurs and how to prevent it.</p>
<p><strong>Summarizing Differences Between Text Distributions.</strong> Ruiqi Zhong and Charlie Snell built a system that does the following: given two different distributions of natural language text, it generates a natural language description of what is different about the two distributions. It works by combining a proposer (which consumes a small number of examples and generates hypotheses) with a verifier (which re-ranks all the hypotheses using a large set of examples). An example is shown below:</p>
<p><img src="https://bounded-regret.ghost.io/content/images/2022/04/proposer_verifier.png" alt="proposer_verifier" loading="lazy"></p>
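<p>A stylized sketch of the proposer&#x2013;verifier loop is below. The stub proposer and hand-written hypothesis predicates are hypothetical stand-ins for the language-model components used in the actual system.</p>

```python
# Stylized proposer-verifier loop for describing how two text distributions
# differ. The proposer and the candidate hypotheses are hand-written
# stand-ins for the language-model components of the real system.

def proposer(sample_a, sample_b):
    """Stand-in for an LM that reads a few examples and proposes hypotheses."""
    return {
        "mentions a sport": lambda t: any(w in t for w in ("game", "team", "score")),
        "contains a number": lambda t: any(c.isdigit() for c in t),
        "is a question": lambda t: t.strip().endswith("?"),
    }

def verify(hypotheses, dist_a, dist_b):
    """Rank hypotheses by how well they separate the two full distributions."""
    def score(h):
        rate_a = sum(h(t) for t in dist_a) / len(dist_a)
        rate_b = sum(h(t) for t in dist_b) / len(dist_b)
        return rate_a - rate_b  # how much more often it holds on dist_a
    return sorted(hypotheses, key=lambda name: score(hypotheses[name]), reverse=True)

dist_a = ["The team won the game 3-1.", "What a great score by the home team!"]
dist_b = ["The soup needs more salt.", "Is the bakery open today?"]

ranking = verify(proposer(dist_a[:1], dist_b[:1]), dist_a, dist_b)
```

<p>Roughly speaking, the verifier ranks hypotheses by the gap between how often they hold on each distribution; in the real system, both proposing hypotheses and checking whether one applies to a given text are carried out by language models.</p>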
<p>While this might sound like a simple task, many tasks can be reduced to it. Here are a couple examples we consider in the paper:</p>
<ul>
<li><strong>Debugging datasets.</strong> Classification datasets intended to test some capability often contain a spurious cue that makes the task easier. We can find these spurious cues by feeding the positive and negative class as the two distributions to our system. On the MNLI dataset, we find the known spurious cue <em>&#x201C;has a negative verb&#x201D;</em>, and on a spam dataset we found the novel spurious cue <em>&#x201C;has a high number of hyperlinks&#x201D;</em>.</li>
<li><strong>Labeling text clusters.</strong> Unsupervised algorithms often group text into semantically meaningful clusters. However, since there are many such clusters, it can be expensive to label them by hand. By asking how one cluster differs from the union of the others, our system can do this automatically. Some example cluster descriptions are <em>&quot;is about art history&quot;</em>, <em>&quot;contains numbers&quot;</em>, <em>&quot;is about a sports team&quot;</em>, <em>&quot;is about a scientific discovery&quot;</em>, and <em>&quot;describes a person&quot;</em>. Our system outperformed a human expert, in terms of accuracy of the descriptions as measured by MTurkers.</li>
</ul>
<p>Some other applications are describing what inputs activate a neuron, how language on Twitter has changed over time, how teacher evaluations differ across genders, or what the differences are between an in-distribution and out-of-distribution dataset.</p>
<p><em>Why you should care.</em> One hope for AI is that it will help humans make better decisions than they could by themselves. One way to do this is by consuming complex data that humans could not easily process and then explaining it in a useful way. Our system does this&#x2014;it would be time-consuming to manually look over two large datasets to understand how they differ, but the system can do it automatically. We hope future work will both improve this type of system (there is definitely still headroom!) and design ML systems that help humans understand other types of complex data as well.</p>
<h3 id="summary">Summary</h3>
<p>We have one paper that is the first empirical demonstration of an important failure mode (phase transitions for reward hacking), and another that could eventually amplify human capabilities by helping people understand complex data. Both pretty exciting! (At least in my biased opinion.)</p>
<p>If you liked these, check back next week for the other four papers!</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Appendix: More Is Different In Other Domains]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>In <em><a href="https://bounded-regret.ghost.io/p/1527e9dd-c48d-4941-9b14-4f7293318d5c/">Future ML Systems Will Be Qualitatively Different</a></em>, I argued that we should expect ML systems to exhibit emergent capabilities. My main support for this was four historical examples of emergence in ML.</p>
<p>In some sense, extrapolating from only four data points is pretty sketchy. It&apos;s hard to</p>]]></description><link>https://bounded-regret.ghost.io/appendix-more-is-different-in-related-fields/</link><guid isPermaLink="false">6196aa089ee68c003bbdf7a6</guid><dc:creator><![CDATA[Jacob Steinhardt]]></dc:creator><pubDate>Tue, 08 Feb 2022 16:00:00 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>In <em><a href="https://bounded-regret.ghost.io/p/1527e9dd-c48d-4941-9b14-4f7293318d5c/">Future ML Systems Will Be Qualitatively Different</a></em>, I argued that we should expect ML systems to exhibit emergent capabilities. My main support for this was four historical examples of emergence in ML.</p>
<p>In some sense, extrapolating from only four data points is pretty sketchy. It&apos;s hard to tell if these examples are the start of a trend, or cherry-picked to tell a story. In low-data regimes like this, it&apos;s helpful to have a prior, and so I&apos;ll spend this appendix looking at related fields of science and examining how common emergent behavior is in those fields.</p>
<p>In short, my conclusion is that emergence is <em>very common</em> throughout the sciences, and <em>especially common</em> in biology, which I view to be most analogous to ML. As such, I think it should be our default expectation for ML, and the four data points from before primarily serve to confirm this. Personally, forming this &quot;prior&quot; updated my views about as much as the (combined) historical examples that I previously discussed for ML.</p>
<h2 id="more-is-different-across-domains">More Is Different Across Domains</h2>
<p><img src="https://bounded-regret.ghost.io/content/images/2021/11/gecko-feet.png" alt="gecko-feet" loading="lazy"></p>
<p>Recall that emergence refers to the idea that More Is Different: that quantitative changes can lead to qualitatively different phenomena. While this idea was <a href="https://science.sciencemag.org/content/177/4047/393?ref=bounded-regret.ghost.io">first articulated</a> by the physicist Philip Anderson, it occurs in many other domains as well:<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup></p>
<ul>
<li><strong>Biology:</strong> <a href="https://pubmed.ncbi.nlm.nih.gov/20023846/?ref=bounded-regret.ghost.io">Gecko feet</a> are covered with tiny structures called spatulae (shown above), consisting of many keratin molecules woven together. Spatulae are responsible for geckos&#x2019; ability to walk up walls, but it&#x2019;s clear that we couldn&#x2019;t &quot;scale down&quot; the structure and retain the same ability---you need enough keratin to form the complex bristles shown above.</li>
<li><strong>Physics:</strong> Siloles (a type of molecule) are not luminescent individually, but <a href="https://academic.oup.com/nsr/article/8/6/nwaa266/6009035?ref=bounded-regret.ghost.io">become luminescent</a> when put together in aggregate. This is because the aggregate is more rigid, reducing intramolecular motion that otherwise inhibits the production of photons.</li>
</ul>
<!-- * **Nuclear physics:** The first nuclear reaction required 771,000 pounds of graphite, 80,590 pounds of uranium oxide, and 12,400 pounds of uranium metal piled 57 layers high. With 56 layers the reaction doesn't go critical. This is because neutrons escaping at the edges would bring $k$ (the neutron reproduction number) below 1. The enormous size was needed to make the surface to volume ratio small enough. (Now we achieve the same effect more efficiently with confined explosions, which compress the molecules to high enough density to achieve criticality with smaller volume.) -->
<ul>
<li><strong>Computers:</strong> Operating systems only arose once computers became fast enough. Before that, individual programs took long enough that task management could be done by hand. Greater speed both produced the demand for automation and the capacity to support an operating system&#x2019;s overhead.</li>
<li><strong>Economics:</strong> Increased population size enables increased specialization. Hunter-gatherer bands could not support smiths and architects. In the modern world, the emergence of Ford and other mass producers required a large enough consumer market to invest in large factories, standardization, and so on.</li>
</ul>
<p>For a broadly accessible introduction to emergence, I recommend <a href="https://www.youtube.com/watch?v=16W7c0mb-rE&amp;ref=bounded-regret.ghost.io">this video</a> by Kurzgesagt. For a slightly longer treatment, a reader also recommended <a href="https://www.youtube.com/watch?v=QItTWZc7hKs&amp;ref=bounded-regret.ghost.io">this other video</a>.</p>
<h2 id="zooming-in-on-biology">Zooming In On Biology</h2>
<p>Among the physical sciences, biology is the domain that seems most analogous to machine learning. While analogies between artificial and biological neurons are overwrought, these two fields share genuine, fundamental similarities. Both study objects that are shaped by complex optimization processes (evolution for biology, gradient descent for ML). The objects are composed of simple well-understood parts, but they are difficult to understand in aggregate because optimization puts the parts together in complex ways. Progress in ML benefits from Moore&apos;s law, while biological understanding benefits from the <a href="https://en.wikipedia.org/wiki/Carlson_curve?ref=bounded-regret.ghost.io">Carlson curve</a> for DNA sequencing (and earlier from a <a href="https://bounded-regret.ghost.io/measurement-and-optimization/">Moore-like law</a> for X-ray crystallography).</p>
<p>Biology is also the domain where More Is Different holds most strongly. In biology, macromolecules (e.g. proteins) have complex structures that cannot exist in small molecules, and which support important functions. For instance, hemoglobin transports oxygen efficiently due to a <a href="https://en.wikipedia.org/wiki/Oxygen%E2%80%93hemoglobin_dissociation_curve?ref=bounded-regret.ghost.io">nonlinear dissociation curve</a> that wouldn&apos;t be possible with small molecules.</p>
<p>One level up, protein polymers again unlock new structure and function, such as actin microfilaments forming the cytoskeleton. Muscles further combine actin microfilaments into fibres that can contract, enabling controlled motion.</p>
<p>In a different direction, long DNA macromolecules allow the transfer of genetic information. Smaller molecules simply don&apos;t have enough states to reliably encode a genome.</p>
<p>Finally, complex organs such as eyes need a huge number of cells in even their simplest form--fruit flies, whose visual resolution is near the minimum to detect objects, have eyes with 16,000 cells. Even among eyes, different numbers of cells (and hence different visual acuities) lead to qualitative differences. Spiders <a href="https://today.duke.edu/2018/05/details-look-sharp-people-may-be-blurry-their-pets?ref=bounded-regret.ghost.io">create patterns</a> in their webs that birds can see (and thus avoid, to mutual benefit) but that insects cannot.</p>
<p><img src="https://bounded-regret.ghost.io/content/images/2021/11/spider_resolution.png" alt="spider_resolution" loading="lazy"></p>
<h2 id="assorted-additional-examples">Assorted Additional Examples</h2>
<p>Here are some additional examples of emergent behavior, some of them more ambiguous than others:</p>
<ul>
<li><strong>Ants.</strong> A small number of ants will wander around and starve to death, but a large number of ants forms a complex self-sustaining colony.</li>
<li><strong>Internet.</strong> ARPANET connected a few hundred hosts and was a platform for research. The world-wide-web connects a billion hosts and has transformative effects on daily life.</li>
<li><strong>Transistors.</strong> A few transistors lets you build a radio. With 170 transistors you can build an <a href="https://www.righto.com/2017/01/die-photos-and-reverse-engineering.html?ref=bounded-regret.ghost.io">ALU</a>, and with thousands you can build a <a href="https://en.wikipedia.org/wiki/Intel_4004?ref=bounded-regret.ghost.io">microprocessor</a>.</li>
<li><strong>Neurons.</strong> It is plausible that the main neurophysiological difference between humans and other primates is <a href="https://www.sciencedirect.com/science/article/pii/S2352154616302637?ref=bounded-regret.ghost.io">having more neurons</a>, rather than any fundamental difference in how the neurons are organized.</li>
<li><strong>Cities.</strong> Cities are more than groups of people: they also have events, infrastructure, governance, and culture.</li>
<li><strong>Life.</strong> A cell is alive, but if you separate out the individual elements they are not alive.</li>
<li><strong>Qubits.</strong> A 50-qubit quantum computer is a curiosity, but a 20-million-qubit quantum computer could break the world&#x2019;s encryption.</li>
<li><strong>Practice.</strong> If you practice a skill a little bit, you get a bit more proficient. If you practice it a lot, it becomes &#x201C;chunked&#x201D; in your brain and can be used to build further abstractions.</li>
</ul>
<h2 id="counterarguments-and-conclusion">Counterarguments and Conclusion</h2>
<p>Emergent behavior occurs, at the very least, in biology, physics, economics, and computer science. In biology, it occurs ubiquitously: for DNA, for hemoglobin, for muscles, for eyes.</p>
<p>In <em><a href="https://bounded-regret.ghost.io/p/74d500d2-a980-4720-984a-c016284ecdc2/">Empirical Findings Generalize Surprisingly Far</a></em>, I further argued that empirical findings often generalize even in the presence of these emergent phenomena, citing examples in biology. But does this hold for other emergent domains, such as physics and economics? For physics, the answer is a clear yes, because of physical symmetries and conservation laws. Economics, on the other hand, is a field where <a href="https://en.wikipedia.org/wiki/External_validity?ref=bounded-regret.ghost.io">external validity</a> is notoriously difficult to come by.</p>
<p>Which of these is a better reference class for machine learning? Economics, like biology and machine learning, studies complex systems (humans and economies). However, it is also an area where controlled experimentation is difficult and sometimes impossible. In machine learning, it is easy to run controlled experiments---far easier than in biology or physics.</p>
<p>I personally think that rapid controlled experimentation is the key to uncovering lawlike behavior, and that in this sense ML is more like biology and physics than economics. If we could experiment on economies as easily as we could on neural networks, I think we would have a much better understanding of general economic laws. But I could be wrong. I think one strong indicator will be whether the emerging field of science of ML<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup> uncovers important robust phenomena in the next five years. If it does, and those trends survive repeated phase transitions, then we&#x2019;ll have objectively evaluable evidence about whether empirical phenomena in ML can successfully generalize. I&#x2019;m personally optimistic, but ultimately time will tell.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>The <a href="https://bounded-regret.ghost.io/future-ml-systems-will-be-qualitatively-different/">introductory post</a> to the series included several additional examples: DNA, uranium, water, traffic, and specialization. <a href="#fnref1" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn2" class="footnote-item"><p>It&#x2019;s possible that Science of ML won&#x2019;t be the subfield to discover these phenomena. For instance, they might instead be discovered through work on interpretability or by people trying to engineer better systems. I&apos;d still count that case as a &quot;success&quot; for my prediction. <a href="#fnref2" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
</ol>
</section>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Empirical Findings Generalize Surprisingly Far]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Previously, I <a href="https://bounded-regret.ghost.io/p/1527e9dd-c48d-4941-9b14-4f7293318d5c/">argued</a> that emergent phenomena in machine learning mean that we can&apos;t rely on current trends to predict what the future of ML will be like. In this post, I will argue that despite this, empirical findings often do generalize very far, including across &quot;phase transitions&</p>]]></description><link>https://bounded-regret.ghost.io/empirical-findings-generalize-surprisingly-far/</link><guid isPermaLink="false">6196a9f19ee68c003bbdf7a0</guid><dc:creator><![CDATA[Jacob Steinhardt]]></dc:creator><pubDate>Tue, 01 Feb 2022 22:18:00 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>Previously, I <a href="https://bounded-regret.ghost.io/p/1527e9dd-c48d-4941-9b14-4f7293318d5c/">argued</a> that emergent phenomena in machine learning mean that we can&apos;t rely on current trends to predict what the future of ML will be like. In this post, I will argue that despite this, empirical findings often do generalize very far, including across &quot;phase transitions&quot; caused by emergent behavior.</p>
<p>This might seem like a contradiction, but actually I think divergence from current trends and empirical generalization are consistent. Findings do often generalize, but you need to think carefully to determine the right generalization, and to consider what might stop any given generalization from holding.</p>
<p>I don&apos;t think many people would contest the claim that empirical investigation can uncover deep and generalizable truths. This is one of the big lessons of physics, and while some might attribute physics&apos; success to <a href="https://en.wikipedia.org/wiki/The_Unreasonable_Effectiveness_of_Mathematics_in_the_Natural_Sciences?ref=bounded-regret.ghost.io">math</a> instead of empiricism, I think it&apos;s clear that you need empirical data to point to the right mathematics.</p>
<p>However, just invoking physics isn&apos;t a good argument, because physical laws have fundamental symmetries that we shouldn&apos;t expect in machine learning. Moreover, we care specifically about findings that continue to hold up <em>after</em> some sort of emergent behavior (such as few-shot learning in the case of ML). So, to make my case, I&apos;ll start by considering examples in deep learning that have held up in this way. Since &quot;modern&quot; deep learning hasn&apos;t been around that long, I&apos;ll also look at examples from biology, a field that <em>has</em> been around for a relatively long time and where More Is Different is ubiquitous (see <a href="https://bounded-regret.ghost.io/p/98db450b-c9bc-4e0d-98ce-1909ba980427/">Appendix: More Is Different In Other Domains</a>).</p>
<h2 id="empirical-generalization-in-deep-learning">Empirical Generalization in Deep Learning</h2>
<p>I&apos;ll consider three examples in deep learning: adversarial examples, data efficiency, and out-of-distribution generalization.</p>
<p><strong>Adversarial examples.</strong> Adversarial examples were <a href="https://arxiv.org/abs/1312.6199?ref=bounded-regret.ghost.io">first discovered</a> in 2013, a year after the <a href="https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html?ref=bounded-regret.ghost.io">AlexNet</a> paper (which arguably marked the start of &quot;modern&quot; deep learning). Since then, there have been at least two qualitative changes in deep networks---pretraining to provide better inductive bias, and the emergence of few-shot learning---plus some smaller changes in architecture. As far as I know, adversarial examples affect every neural network model that exists. Moreover, the main (partial) remedy, <a href="https://arxiv.org/abs/1706.06083?ref=bounded-regret.ghost.io">adversarial training</a>, is the same in every architecture and domain.</p>
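<p>As a toy illustration (not the construction from the papers cited above), the following numpy sketch mounts a single-step gradient-sign attack on a fixed logistic-regression model; the weights, data point, and perturbation budget are all made up for the example.</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, eps):
    """One-step gradient-sign attack on logistic regression.

    The gradient of the logistic loss w.r.t. the input x is (p - y) * w,
    so the worst-case perturbation is available in closed form.
    """
    p = sigmoid(w @ x)
    return x + eps * np.sign((p - y) * w)

# A clean point that the fixed weights classify correctly as class 1.
w = np.array([2.0, -1.0])
x, y = np.array([1.0, 0.0]), 1.0

p_clean = sigmoid(w @ x)          # ~0.88: confident and correct
x_adv = fgsm(x, y, w, eps=0.8)
p_adv = sigmoid(w @ x_adv)        # ~0.40: the prediction flips
print(p_clean, p_adv)
```

<p>Adversarial training, in this toy setting, would simply replace each clean example with <code>fgsm(x, y, w, eps)</code> inside the training loop, so the model is optimized against its own worst-case perturbations.</p>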
<p><strong>Data efficiency.</strong> Starting around 2016, there were papers showing that learned representations from pre-trained models were more data-efficient than randomly-initialized models. Moreover, it seemed that pre-training on more and better data increased data efficiency further. Taken to its logical extreme, this meant that with enough data you should be able to learn from very few examples--which is what&apos;s happened, for both <a href="https://arxiv.org/abs/1912.11370?ref=bounded-regret.ghost.io">fine-tuning</a> and <a href="https://arxiv.org/abs/2005.14165?ref=bounded-regret.ghost.io">few-shot learning</a>.</p>
<p>The findings above are in computer vision and NLP, but I&apos;d bet that in pretty much any domain more unsupervised data will mean you need less supervised data, and that this trend will hold until you&apos;re close to information-theoretic limits (i.e. needing only a handful of examples). I also expect this to continue holding even after ML models gain some new emergent capability such as good long-term planning.</p>
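<p>To make the intuition concrete, here is a deliberately simple numpy sketch (the whole setup is invented for illustration): labels depend on a single direction in a 50-dimensional space, and a model whose &quot;pretrained&quot; encoder has already found that direction needs far fewer labels than one learning from raw features.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_labeled, n_test = 50, 10, 1000
w_true = rng.normal(size=d)  # the direction the labels actually depend on

def make_data(n):
    X = rng.normal(size=(n, d))
    return X, (X @ w_true > 0).astype(float)

Xtr, ytr = make_data(n_labeled)   # only 10 labeled examples
Xte, yte = make_data(n_test)

def fit_logreg(X, y, steps=2000, lr=0.1):
    # plain batch gradient descent on the logistic loss
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-np.clip(X @ w, -30, 30)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

w_raw = fit_logreg(Xtr, ytr)                    # learn from raw features
w_pre = fit_logreg(Xtr @ w_true[:, None], ytr)  # 1-d "pretrained" feature

acc_raw = np.mean(((Xte @ w_raw) > 0) == yte.astype(bool))
acc_pre = np.mean(((Xte @ w_true) * w_pre[0] > 0) == yte.astype(bool))
print(f"raw: {acc_raw:.2f}  pretrained: {acc_pre:.2f}")
```

<p>With the same ten labels, the raw-feature model can only pin down a decision boundary within the low-dimensional span of its training points, while the pretrained feature makes the supervised problem essentially one-dimensional.</p>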
<!-- I started thinking about out-of-distribution (OOD) generalization around 2015, and for the first couple years my view was that OOD behavior was strongly governed by a model's inductive bias---without a careful, well-chosen inductive bias, even if a model did will in-distribution it would generalize in some totally crazy way off-distribution. An example might be an image model picking up on some spurious, imperceptible regularity in the data---perhaps images from a certain category are slightly brighter or had slightly more of a certain weirdly-shaped edge.

I now think this view is wrong. -->
<p><strong>Out-of-distribution generalization.</strong> This one is a bit more fuzzy and qualitative, and is a prediction about the future rather than empirical evidence about the past. The question is: how will neural networks behave in &quot;out-of-distribution&quot; situations where the training data hasn&apos;t fully pinned down their behavior? On a spectrum from &quot;completely randomly&quot; (0) to &quot;exactly as intended&quot; (10), my current view is around an 8/10. Intuitively, neural networks &quot;want&quot; to generalize, and will make reasonable extrapolations as long as:</p>
<ul>
<li>The in-distribution data is reasonably diverse</li>
<li>The in-distribution accuracy is high (for a binary task, something like 97% or higher).</li>
</ul>
<p>In these cases I don&apos;t mean that they will always get good OOD accuracy. But I think the model will pick from some fairly low-dimensional space of &quot;plausible&quot; generalizations. A model trained only on images from the United States might be confused by French street signs, but its errors will mostly take the form of either ignoring the text or substituting a perceptually similar American sign.</p>
<p>Another way of putting this is that in domains where a neural net is proficient, there is a relatively low-dimensional space of &quot;possible&quot; generalizations that the network might pick. This is intuitively consistent with the point above on data efficiency---since the possibility space is low-dimensional, it doesn&apos;t take too much data to identify the &quot;right&quot; generalization.</p>
<p>I expect this to continue to hold as neural nets become more powerful: concretely, as long as a model is proficient at a task, even fairly weak signals about how it &#x201C;should&#x201D; behave in a new situation will be enough to push it in the right direction.</p>
<h3 id="how-this-relates-to-human-aligned-ai">How This Relates to Human-Aligned AI</h3>
<p>Not only do I expect the trends above to robustly hold, I also think they are each important components for thinking about safe AI.</p>
<p>First, any strategy for making future ML systems safe either needs a solution to adversarial examples or needs to work in spite of them. I would also bet that any such solution will feature adversarial training as a major component. Now, maybe we didn&apos;t need the empirical data to conclude this, and it should have just been obvious a priori. But the papers introducing these concepts have thousands of citations each, so if these sorts of things are obvious a priori to you, then you could instantly become one of the most successful ML researchers. Unless you&apos;re <a href="https://scholar.google.ca/citations?user=iYN86KEAAAAJ&amp;hl=en&amp;ref=bounded-regret.ghost.io">Ian Goodfellow</a>, I&apos;m a bit skeptical.</p>
<p>Second, given the prevalence of reward functions <a href="https://arxiv.org/abs/2009.01325?ref=bounded-regret.ghost.io">learned from human feedback</a>, we might be concerned that AI will learn to fool the human supervisors rather than doing what humans actually want. If &quot;fool people&quot; and &quot;do what they want&quot; were merely two blips within some limitless space of ways that a network might interpret its reward signal, then we&apos;d face a pretty intractable problem: almost all points in the space of possible network behaviors would be bad (something other than &#x201C;do what humans want&#x201D;) and it would be hard to even locate the good solution.</p>
<p>However, I don&apos;t think this is the world we live in. Both data efficiency and &quot;wanting&quot; to generalize suggest that &quot;do what humans actually want&quot; is part of a fairly simple space of natural generalizations, and it just won&apos;t take that many additional bits of information to pick it out from this space. There&apos;s still a challenge, since <a href="https://bounded-regret.ghost.io/p/bdc26786-ab44-48c5-b0ca-30c72e9b59ab/">deceptive alignment and other thought experiments</a> imply that we can&apos;t get these bits from direct supervision of the network&apos;s outputs. But I think there&apos;s a good chance we can get those bits from good interpretability tools---better tools than what we have now, but ones that are within reach.</p>
<p>I&apos;m not arguing that AI safety is somehow trivial or easy---the concerns I discussed above are not exhaustive, and even those will take perhaps tens of thousands of researcher-hours to address. My point is that empirical trends give you problem structure that you can leverage, which often takes a problem from &quot;intractable&quot; to &quot;only 300,000 hours of work&quot;.<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup></p>
<h2 id="empirical-generalization-in-biology">Empirical Generalization in Biology</h2>
<p>I&apos;m claiming that empirical findings generalize &quot;surprisingly far&quot;, but there haven&apos;t been that many orders of magnitude to generalize across in machine learning so far. So let&apos;s look at biology, where there are many cases of findings generalizing all the way from &quot;bacteria&quot; to &quot;humans&quot;.</p>
<p><strong>The phage group.</strong> A prime example of this was the <a href="https://en.wikipedia.org/wiki/Phage_group?ref=bounded-regret.ghost.io">phage group</a>, a group of researchers who made many of the founding contributions of molecular biology. They chose to study bacteriophages (viruses that attack bacteria) as the simplest possible model system. Most biologists at the time studied more complex organisms, and some were skeptical that phages were complex enough to yield meaningful insights.</p>
<p>The phage group essentially bet that studying viruses and bacteria would yield insights that generalized to more complex organisms. That bet paid off--among other things, they discovered:</p>
<ul>
<li>That <a href="https://en.wikipedia.org/wiki/Hershey%E2%80%93Chase_experiment?ref=bounded-regret.ghost.io">DNA was the source of genetic material</a></li>
<li>The phenomenon of <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1079044/?ref=bounded-regret.ghost.io">genetic recombination</a></li>
<li>That DNA replicated <a href="https://en.wikipedia.org/wiki/Meselson%E2%80%93Stahl_experiment?ref=bounded-regret.ghost.io">semi-conservatively</a></li>
<li>That the gene was <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC222769/?ref=bounded-regret.ghost.io">arranged linearly</a></li>
<li>The existence of <a href="https://en.wikipedia.org/wiki/Restriction_enzyme?ref=bounded-regret.ghost.io#History">restriction enzymes</a>, which formed the foundation of genetic engineering</li>
</ul>
<p>Later discoveries based on bacteria also generalized to more complex organisms. For instance, <a href="https://en.wikipedia.org/wiki/Jacques_Monod?ref=bounded-regret.ghost.io">Jacques Monod</a> uncovered the structure of genetic regulatory networks by studying the <em>lac</em> operon in <em>E. coli</em>.</p>
<p>Now, one might object that biology is not a good analogy for machine learning, because all life shares the same genetic ancestry and thus has commonalities that neural networks will not. I have some sympathy for this point, but I think it understates how non-obvious it was that studying bacteriophages would be a good idea. Empirical trends generalize far <em>because</em> there is some mechanism that causes them to do so, but that mechanism is often only obvious in hindsight. We&apos;ll probably come up with similarly &quot;obvious&quot; explanations for trends in deep learning, but only after we discover them.</p>
<p>Moreover, shared genetic ancestry isn&apos;t actually enough to imply consistent trends. Regulatory networks work slightly differently in bacteria and humans, and some bacteria and viruses have circular rather than linear genomes. Nevertheless, most of the essential findings remain intact, even though bacteria have 1 cell each and humans have 30 trillion.</p>
<h2 id="what-about-superintelligence">What About Superintelligence?</h2>
<p>An argument I sometimes hear is that empirics won&apos;t help when dealing with deceptive AI that is much smarter than humans, because it might intentionally change its feature representations to thwart interpretability techniques, and otherwise intentionally obscure the results of empirical measurements.</p>
<p>I agree that if you had such an AI, you wouldn&apos;t be able to rely on empirical measurements. But if you had such an AI, I think you&apos;d just be fundamentally screwed. When trying to solve the problem of deceptive AI, I view the main challenge as not getting to this point in the first place. In the language of deceptive alignment, you aren&#x2019;t trying to &quot;fix&quot; a deceptively aligned AI, you&apos;re trying to make sure the training dynamics steer you far away from ever getting one.</p>
<p>Overall, I think that both empirics and conceptual arguments will be necessary to make AI systems safe (both now and in the future). Both the Engineering and Philosophy mindsets are perceiving <a href="https://en.wikipedia.org/wiki/Blind_men_and_an_elephant?ref=bounded-regret.ghost.io">different pieces of the elephant</a>. I hope this series can help bridge these mindsets and move us towards a synthesis that is better-prepared to answer the challenges of the future.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>A typical research group might put in 15,000 hours of work per year (7.5 full-time researchers x 2,000 hours), so 300,000 hours amounts to 4 research groups working full-time for 5 years. <a href="#fnref1" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
</ol>
</section>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[ML Systems Will Have Weird Failure Modes]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Previously, I&apos;ve argued that future ML systems might exhibit <a href="https://bounded-regret.ghost.io/p/1527e9dd-c48d-4941-9b14-4f7293318d5c/">unfamiliar, emergent capabilities</a>, and that thought experiments <a href="https://bounded-regret.ghost.io/p/a2d733a7-108a-4587-97fb-db90f66ce030/">provide one approach</a> towards predicting these capabilities and their consequences.</p>
<p>In this post I&#x2019;ll describe a particular thought experiment in detail. We&#x2019;ll see that taking thought experiments seriously</p>]]></description><link>https://bounded-regret.ghost.io/ml-systems-will-have-weird-failure-modes-2/</link><guid isPermaLink="false">61bd1222f7965d003b71ae6f</guid><dc:creator><![CDATA[Jacob Steinhardt]]></dc:creator><pubDate>Wed, 26 Jan 2022 01:40:01 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>Previously, I&apos;ve argued that future ML systems might exhibit <a href="https://bounded-regret.ghost.io/p/1527e9dd-c48d-4941-9b14-4f7293318d5c/">unfamiliar, emergent capabilities</a>, and that thought experiments <a href="https://bounded-regret.ghost.io/p/a2d733a7-108a-4587-97fb-db90f66ce030/">provide one approach</a> towards predicting these capabilities and their consequences.</p>
<p>In this post I&#x2019;ll describe a particular thought experiment in detail. We&#x2019;ll see that taking thought experiments seriously often surfaces future risks that seem &quot;weird&quot; and alien from the point of view of current systems. I&#x2019;ll also describe how I tend to engage with these thought experiments: I usually start out intuitively skeptical, but when I reflect on emergent behavior I find that some (but not all) of the skepticism goes away. The remaining skepticism comes from ways that the thought experiment clashes with the ontology of neural networks, and I&#x2019;ll describe the approaches I usually take to address this and generate actionable takeaways.</p>
<h2 id="thought-experiment-deceptive-alignment">Thought Experiment: Deceptive Alignment</h2>
<p>Recall that the <a href="https://bounded-regret.ghost.io/p/a2d733a7-108a-4587-97fb-db90f66ce030/">optimization anchor</a> runs the thought experiment of assuming that an ML agent is a perfect optimizer (with respect to some &quot;intrinsic&quot; reward function $R$). I&#x2019;m going to examine one implication of this assumption, in the context of an agent being trained based on some &quot;extrinsic&quot; reward function $R^*$ (which is provided by the system designer and not equal to $R$).</p>
<p>Specifically, consider a training process where in step $t$, a model has parameters $\theta_t$ and generates an action $a_t$ (its output on that training step, e.g. an attempted backflip assuming it is being trained to do backflips). The action $a_t$ is then judged according to the extrinsic reward function $R^*$, and the parameters are updated to some new value $\theta_{t+1}$ that are intended to increase $a_{t+1}$&apos;s value under $R^*$. At some point, the model is then deployed with final parameters $\theta_T$, and continues to take actions. The following diagram illustrates this process:</p>
<p><img src="https://bounded-regret.ghost.io/content/images/2022/01/deception1.png" alt="deception1" loading="lazy"></p>
<!-- ![deception](https://bounded-regret.ghost.io/content/images/2021/12/deception.png) -->
<p>Now, let&#x2019;s assume that the model $\theta_t$ is a perfect optimizer whose objective is to maximize the discounted value of an intrinsic reward $R \neq R^*$. That is, $\theta_t$ picks the action $a_t$ satisfying</p>
<p>$a_t = \operatorname{argmax}_{a} \mathbb{E}[\sum_{s=0}^{\infty} \gamma^{s} R(a_{t+s}) \mid a_t = a]$.</p>
<p>(I know that this is an unrealistic assumption. We&#x2019;ll examine the assumption in detail in the next section, but for now please grant it even if it requires suspending disbelief.)</p>
<p>What action $a_t$ will $\theta_t$ pick? Let&#x2019;s define $a^R = \operatorname{argmax}_{a} R(a)$ and $a^* = \operatorname{argmax}_{a} R^*(a)$--that is, $a^R$ maximizes the (instantaneous) intrinsic reward while $a^*$ maximizes the extrinsic reward.</p>
<p>Assuming that $\gamma$ is close to 1, the model wants $a^R$ to be picked on as many future rounds as possible (including at deployment time). If $\theta_t$ plays myopically, this means it would pick $a_t = a^R$. However, this is not the optimal non-myopic action: if $a_t = a^R$, then $\theta_{t+1}$ will be updated towards producing outputs less like $a^R$ and more like $a^*$, so outputting $a^R$ in the current round leads to fewer $a^R$ outputs in future rounds.</p>
<p>Instead, I think the optimal non-myopic action is approximately the following: during training time, output $a^*$ on every time step, so that the training process doesn&#x2019;t update $\theta$ (since it&#x2019;s already optimal) and hence $\theta_t = \theta_{t+1} = \cdots = \theta_T$. This guarantees that the deployed agent is $\theta_t$, which is now free to output $a^R$ on every time step without risk of being modified by the training process.<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup></p>
<p>The myopic and non-myopic outcomes are illustrated in the figure below:</p>
<p align="center" style="font-size:80%">
<img src="https://bounded-regret.ghost.io/content/images/2022/01/deception2.png"><br>
<i>Myopic and non-myopic agents with an intrinsic reward, $R$. The myopic agent plays high intrinsic-reward actions during training, but is eventually modified to output action $a^*$ with high extrinsic reward $R^*$. The non-myopic agent instead plays $a^*$ during training so that it can pursue its intrinsic reward at deployment, by playing $a^R$ at all post-deployment time steps.</i>
</p>
<p>To put it another way&#x2013;the model&#x2019;s behavior will look aligned with $R^*$ during the training process, but will switch to being non-aligned once deployed: so-called &quot;deceptive alignment&quot;. This behavior is unintuitive to me, since it shows that a system&#x2019;s behavior can shift significantly between train and test, even in the absence of an obvious distribution shift between train and test outputs<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup>.</p>
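<p>The incentive gap in this story can be seen in a deliberately crude simulation (every detail is invented for illustration: the &quot;model&quot; is a single parameter $\theta$, the probability of playing the intrinsically preferred action $a^R$, and each time training catches an $a^R$ output it pushes $\theta$ toward the extrinsically optimal $a^*$).</p>

```python
import numpy as np

rng = np.random.default_rng(0)
TRAIN_STEPS, DEPLOY_STEPS, LR = 50, 50, 0.2

def run(policy):
    """Total intrinsic reward collected by a myopic vs. a deceptive policy."""
    theta, intrinsic = 1.0, 0
    for _ in range(TRAIN_STEPS):
        # The deceptive policy always plays a^* here, so theta is never updated.
        if policy == "myopic" and rng.random() < theta:
            intrinsic += 1                # grabbed a^R during training...
            theta = max(0.0, theta - LR)  # ...and got pushed toward a^*
    for _ in range(DEPLOY_STEPS):         # after deployment: no more updates
        if rng.random() < theta:
            intrinsic += 1
    return intrinsic

print("myopic:", run("myopic"), "deceptive:", run("deceptive"))
```

<p>The myopic policy collects a handful of intrinsic-reward actions before training drives $\theta$ to zero; the deceptive policy forgoes them all during training and then collects intrinsic reward on every deployment step.</p>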
<h2 id="engaging-with-deceptive-alignment">Engaging with Deceptive Alignment</h2>
<p>When I first heard the above argument, I thought it was pretty crazy and implausible, mainly because my intuition said this &quot;just wasn&apos;t how ML systems worked&quot;. When I think about why I feel that way, I realize it&#x2019;s because the scenario invokes capabilities that ML is currently bad at: long-term planning and understanding complex features of the environment (i.e. the training process and its ramifications). However, emergence implies that these properties could easily appear in the future, even without explicit design<sup class="footnote-ref"><a href="#fn3" id="fnref3">[3]</a></sup>. As a result, I&#x2019;ve come to discount this particular intuition.</p>
<p>However, I do think there are subtler reasons to think the deceptive alignment story won&#x2019;t play out as written. Here are a few:</p>
<ol>
<li>It&#x2019;s not clear why the model $\theta$ would come to be optimizing a reward function $R$ in the first place. Yes, it is the case that deceptively aligned models achieve the global minimum of training loss, so in that sense they are incentivized by the training process. But so is an actually aligned model, so which one you end up with has to depend on the inductive bias of the training process.</li>
<li>Reward functions are simpler than policies and typically learned faster. So by the time the system is smart enough to have long-term plans, it will already have a very good representation of its intended reward function. We thus might hope that most of the model&apos;s internal representations are devoted to achieving high reward in a straightforward manner rather than through long-term deception.</li>
<li>To the extent that a model is not aligned, it probably won&#x2019;t be the case that it&apos;s deceptively aligned with an explicit reward function $R$---that&apos;s a very specific type of agent and most agents (including humans) are not maximizing any reward function, except in the trivial sense of &quot;assign reward 1 to whatever it was going to do anyway, and 0 to everything else&quot;.</li>
<li>Deceptive alignment is a specific complex story about the future, and complex stories are almost always wrong.</li>
</ol>
<p>I find these points persuasive for showing that deceptive alignment <em>as explicitly written</em> is not that likely, but they also don&apos;t imply that there&apos;s nothing to worry about. Mostly they are an argument that your system might be aligned and might be misaligned, that if it is misaligned it won&#x2019;t be <em>exactly</em> in the form of deceptive alignment, but ultimately what you get depends on inductive bias in an unknown way. This isn&apos;t particularly reassuring.</p>
<p><strong>What I take away from thought experiments.</strong> Per the discussion above, the failure mode in my head is not &quot;deceptive alignment as written above&quot;. Instead it&#x2019;s &quot;something kind of like the story above but probably different in lots of details&quot;. This makes it harder to reason about, but I think there are still some useful takeaways:</p>
<ul>
<li>After thinking about deceptive alignment, I am more interested in supervising a model&#x2019;s process (rather than just its outputs), since there are many models that achieve low training error but generalize catastrophically. One possible approach is to supervise the latent representations using e.g. interpretability methods.</li>
<li>While I don&apos;t think neural nets will be literal optimizers, I do think it&#x2019;s likely that they will exhibit &quot;drives&quot;, in the same way that humans exhibit drives like hunger, curiosity, desire for social approval, etc. that lead them to engage in long-term coherent plans. This seems like enough to create similar problems to deceptive alignment, so I am now more interested in understanding such drives and how they arise.</li>
<li>Since deceptive alignment is a type of &quot;out-of-distribution&quot; behavior (based on the difference between train and deployment), it has renewed my interest in understanding whether larger models become more brittle OOD. So far the empirical evidence is in <a href="https://arxiv.org/abs/2006.16241?ref=bounded-regret.ghost.io">the opposite direction</a>, but deceptive alignment is an argument that asymptotically we might expect the trend to flip, especially for tasks with large output spaces (e.g. policies, language, or code) where &quot;drives&quot; can more easily manifest.</li>
</ul>
<p>So to summarize my takeaways: be more interested in interpretability (especially as it relates to training latent representations), try to identify and study &quot;drives&quot; of ML systems, and look harder for examples where larger models have worse OOD behavior (possibly focusing on high-dimensional output spaces).</p>
<p><strong>Other weird failures.</strong> Other weird failures that I think don&#x2019;t get enough attention, even though I also don&#x2019;t think they will play out as written, are Hubinger et al.&apos;s <em><a href="https://arxiv.org/abs/1906.01820?ref=bounded-regret.ghost.io">Risks from Learned Optimization</a></em> (AI acquires an &quot;inner objective&quot;, somewhat similar to deceptive alignment), and Part I of Paul Christiano&#x2019;s <a href="https://www.alignmentforum.org/posts/HBxe6wdjxK239zajf/what-failure-looks-like?ref=bounded-regret.ghost.io">AI failure story</a> (the world becomes very complicated and AI systems create elaborate Potemkin villages for humans).</p>
<p>Paul Christiano&#x2019;s story in particular has made me more interested in understanding how reward hacking interacts with the sophistication of the supervisor: For instance, how much more readily do neural networks fool humans who have 5 seconds to think, vs. 2 minutes or 30 minutes? I more generally want to understand how reward hacking depends quantitatively on both supervision quality and model capacity (qualitatively, we expect higher quality $\to$ less hacking and higher capacity $\to$ more hacking). Understanding this quantitative relation would help ground Paul&#x2019;s story, since he imagines a world where humans have built extremely sophisticated systems for supervising ML models, but eventually the ML models become even more powerful and game the supervision signal anyways.</p>
<h2 id="what-to-do-about-weird-emergent-failures">What To Do About Weird Emergent Failures</h2>
<p>When thinking about how to handle emergent risks, I often reflect on the example of uranium. For context, an atomic bomb is pretty much just a bunch of uranium put together---once you get enough, the reaction becomes self-sustaining---making it a good example of More Is Different.</p>
<p>The first nuclear reaction (not a bomb, but a <a href="https://en.wikipedia.org/wiki/Chicago_Pile-1?ref=bounded-regret.ghost.io">pile of uranium</a> in an abandoned football stadium in Chicago) was engineered by Enrico Fermi. The reaction required 12,400 pounds of uranium metal piled 57 layers high. Left unsupervised, a 57-layer pile would consume itself within two hours and kill everyone in the vicinity. On the other hand, a 56-layer pile would do nothing.</p>
<p>Fermi had a good understanding of nuclear physics and understood, from careful monitoring and underlying theory, that the pile would pass the critical threshold between layers 56 and 57. He also knew that cadmium rods would absorb neutrons and strongly inhibit the reaction. These rods were set up and the entire apparatus was carefully controlled to go only slightly supercritical. He brought the reaction to half a watt for several minutes before shutting it back down (see <em><a href="https://smile.amazon.com/Making-Atomic-Bomb-Richard-Rhodes/dp/1451677618?ref=bounded-regret.ghost.io">The Making of the Atomic Bomb</a></em>, p. 524).</p>
<p>With AI, we currently lack both Fermi&apos;s conceptual understanding of the underlying risk factors and his ability to continuously measure them. We have neither a cadmium rod nor a measure of reaction criticality. But I think we can get there, by combining these weird thought experiments with <a href="https://bounded-regret.ghost.io/p/74d500d2-a980-4720-984a-c016284ecdc2/">carefully chosen empirical experiments</a>, which will be the topic of the next post.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>Things are more complicated in reality, since $\theta_t$ is updated even when $a_t$ is optimal (due to noise in the training process). However, we&#x2019;ll ignore this for purposes of the example. <a href="#fnref1" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn2" class="footnote-item"><p>Of course, there is still some distribution shift, since the agent can observe whether it is being trained or deployed. But this is a relatively minor and unintuitive shift compared to what is typically studied. <a href="#fnref2" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn3" class="footnote-item"><p>Of course, emergence doesn&#x2019;t mean that we can just predict whatever we want&#x2013;we&#x2019;d need some reason to expect these specific capabilities to emerge. Long-term planning and environmental awareness are both useful for a wide variety of tasks, making them likely to emerge when training powerful models on a diverse data distribution. <a href="#fnref3" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
</ol>
</section>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Anchor Weights for ML]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>In the <a href="https://bounded-regret.ghost.io/thought-experiments-provide-a-third-anchor/">previous post</a>, I talked about several &quot;anchors&quot; that we could use to think about future ML systems, including current ML systems, humans, ideal optimizers, and complex systems.</p>
<p>In fact, I think we should be using all of these anchors (and any others we can think of)</p>]]></description><link>https://bounded-regret.ghost.io/which-anchors-do-i-use/</link><guid isPermaLink="false">61e74bcc529ebe003bfb16f9</guid><dc:creator><![CDATA[Jacob Steinhardt]]></dc:creator><pubDate>Thu, 20 Jan 2022 16:17:45 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>In the <a href="https://bounded-regret.ghost.io/thought-experiments-provide-a-third-anchor/">previous post</a>, I talked about several &quot;anchors&quot; that we could use to think about future ML systems, including current ML systems, humans, ideal optimizers, and complex systems.</p>
<p>In fact, I think we should be using all of these anchors (and any others we can think of) to reason about future ML systems. This is based on ideas from forecasting, where successful forecasters usually <a href="https://bounded-regret.ghost.io/combining-forecasts/">average</a> over many worldviews and <a href="https://bounded-regret.ghost.io/base-rates-and-reference-classes/">reference classes</a> rather than focusing on a single reference class. However, we should also be discerning and weight anchors more if they seem like a better match for what we want to predict.</p>
<p>Below, I&apos;ll say what I personally think about most of the anchors we discussed so far, by assigning a numerical &quot;weight&quot; to each one. While these weights aren&apos;t perfect (the actual weight I&apos;d use depends on the particular question), they hopefully provide a clear overall picture that is easy to agree/disagree with.</p>
<p>Here are the rough weights I came up with:</p>
<table>
<thead>
<tr>
<th>Anchor</th>
<th>Weight</th>
</tr>
</thead>
<tbody>
<tr>
<td>Current ML</td>
<td>4</td>
</tr>
<tr>
<td>Complex systems</td>
<td>3</td>
</tr>
<tr>
<td>Thought experiments</td>
<td>2</td>
</tr>
<tr>
<td>Evolution</td>
<td>0.5</td>
</tr>
<tr>
<td>The economy</td>
<td>0.4</td>
</tr>
<tr>
<td>Humans</td>
<td>0.3</td>
</tr>
<tr>
<td>Corporations</td>
<td>0.2</td>
</tr>
<tr>
<td>Biological systems</td>
<td>0.2</td>
</tr>
<tr>
<td>Non-human animals</td>
<td>0.1</td>
</tr>
</tbody>
</table>
<p>I primarily rely on Current ML, Complex Systems, and Thought Experiments, in a 4:3:2 ratio. In particular, I assign about twice as much weight to Current ML as to Thought Experiments, but I think the opposite ratio is also defensible. However, many people seem to implicitly put almost all their weight on Current ML, or almost all their weight on Thought Experiments. They have something like a 5:1 or 1:5 ratio, or even greater. I think <strong>neither of these stances is defensible</strong>, and I would be interested in anyone who disagrees writing up the case for assigning extreme weights (in either direction).</p>
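<p>To make the averaging concrete, here is a minimal sketch (with made-up per-anchor probabilities, purely for illustration) of how the weights above could combine anchor forecasts:</p>

```python
# Weighted averaging of forecasts from different anchors.
# The weights are the ones from the table above; the per-anchor
# probability forecasts below are hypothetical.

weights = {
    "Current ML": 4, "Complex systems": 3, "Thought experiments": 2,
    "Evolution": 0.5, "The economy": 0.4, "Humans": 0.3,
    "Corporations": 0.2, "Biological systems": 0.2,
    "Non-human animals": 0.1,
}

def combine(forecasts, weights):
    """Weighted average over whichever anchors offer a forecast."""
    total = sum(weights[a] for a in forecasts)
    return sum(weights[a] * p for a, p in forecasts.items()) / total

# Hypothetical probabilities that some capability emerges by a given date:
forecasts = {"Current ML": 0.2, "Complex systems": 0.5, "Thought experiments": 0.7}
print(combine(forecasts, weights))  # lands between the individual forecasts
```

<p>Note how the 4:3:2 ratio pulls the combined forecast toward the Current ML anchor without letting any single anchor dominate.</p>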
<p>Relatedly, my <a href="https://bounded-regret.ghost.io/future-ml-systems-will-be-qualitatively-different/">last</a> <a href="https://bounded-regret.ghost.io/thought-experiments-provide-a-third-anchor/">two</a> posts were essentially an argument against a 5:1 ratio in favor of Current ML---first by arguing that Current ML often misses important developments, and second by arguing that thought experiments can sometimes catch these.<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup></p>
<p>Aside from this, my biggest disagreement with others would be assigning significant weight to the &quot;Complex Systems&quot; anchor, which I think most people overlook.</p>
<p>Finally, all anchors that correspond to a broad reference class (Current ML, Complex Systems, Thought Experiments) get significantly more weight than any anchor that is a single example (e.g. humans).<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup>  <strong>I give serious consideration to hypotheses generated by any of these three anchors.</strong> In particular, if I can&apos;t strongly rule out the hypothesis after one hour of thought, I think there&apos;s at least a 30% chance that it will eventually come to be supported by the other two anchors as well.</p>
<p>I&apos;d be interested in others posting their relative weights, and pointing out any instances where they think I&apos;m wrong.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>A later post, on the value of empirical findings, also offers an argument against a 1:5 ratio towards thought experiments. <a href="#fnref1" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn2" class="footnote-item"><p>The only exception is that &quot;non-human animals&quot; gets very low weight, partly because they are hard to study and partly because I expect future systems to be <em>more</em> capable than humans, not less. <a href="#fnref2" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
</ol>
</section>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Thought Experiments Provide a Third Anchor]]></title><description><![CDATA[<!--kg-card-begin: markdown--><!-- Previously, I've argued that future ML systems might exhibit [unfamiliar, emergent capabilities](https://bounded-regret.ghost.io/p/1527e9dd-c48d-4941-9b14-4f7293318d5c/), and that this [poses new risks](https://bounded-regret.ghost.io/p/8fd9c776-24b8-474f-b43b-4d8ef1a82836/) that we aren't currently prepared for. I'll talk next about what to do about it.

The first step to mitigating risks is to predict what they'll be, so this post will be primarily focused on *predicting* future ML behavior. For this purpose, it's helpful to think in terms of "anchors"---some [reference class](https://bounded-regret.ghost.io/p/f089591d-6107-4714-ad69-aa7a5b5030b0/) that is broadly analogous to future ML systems, which we can then use to make predictions. -->
<p>Previously, <a href="https://bounded-regret.ghost.io/future-ml-systems-will-be-qualitatively-different/">I argued</a> that we should expect future ML systems to often exhibit &quot;emergent&quot; behavior, where they acquire new capabilities that were not explicitly designed or intended, simply as a result of scaling. This was a special case of a general phenomenon in the physical sciences called More Is Different.</p>
<p>I care about this because I think AI will have a huge impact on society, and I want to <a href="https://bounded-regret.ghost.io/ai-forecasting/">forecast what future systems will be like</a> so that I can steer things to be better. To that end, I find More Is Different to be troubling and disorienting. I&#x2019;m inclined to forecast the future by <a href="https://bounded-regret.ghost.io/forecasting-zeroth-and-first-order/">looking at existing trends</a> and asking what will happen if they continue, but we should instead expect new qualitative behaviors to arise all the time that are not an extrapolation of previous trends.</p>
<p>Given this, how can we predict what future systems will look like? For this, I find it helpful to think in terms of &quot;anchors&quot;---<a href="https://bounded-regret.ghost.io/base-rates-and-reference-classes/">reference classes</a> that are broadly analogous to future ML systems, which we can then use to make predictions.</p>
<p>The most obvious reference class for future ML systems is current ML systems---I&apos;ll call this the <strong>current ML</strong> anchor. I think this is indeed a pretty good starting point, but we&#x2019;ve already seen that it <a href="https://bounded-regret.ghost.io/future-ml-systems-will-be-qualitatively-different/">fails to account</a> for emergent capabilities.</p>
<p>What other anchors can we use? One intuitive approach would be to look for things that humans are good at but that current ML systems are bad at. This would include:</p>
<ul>
<li>Mastery of external tools (e.g. calculators, search engines, software, programming)</li>
<li>Very efficient learning (e.g. reading a textbook once to learn a new subject)</li>
<li>Long-term planning (e.g. being able to successfully achieve goals over months)</li>
</ul>
<p>Models sufficiently far in the future will presumably have these sorts of capabilities. While this still leaves unknowns---for instance, we don&apos;t know how rapidly these capabilities will appear---it&apos;s still a useful complement to the current ML anchor. I&apos;ll call this the <strong>human anchor</strong>.</p>
<p>A problem with the human anchor is that it risks anthropomorphizing ML by over-analogizing with human behavior. Anthropomorphic reasoning correctly gets a bad rap in ML, because it&apos;s very intuitively persuasive but has a <a href="https://en.wikipedia.org/wiki/Anthropomorphism?ref=bounded-regret.ghost.io#In_computing">mixed-at-best</a> track record. This isn&apos;t a reason to abandon the human anchor, but it means we shouldn&apos;t be entirely satisfied with it.</p>
<p>This brings us to a third anchor, the <strong>optimization anchor</strong>, which I associate with the &quot;Philosophy&quot; or thought experiment approach that I&apos;ve <a href="https://bounded-regret.ghost.io/p/a9fce268-857f-4f03-a2a6-d7aa2ab129aa/">described previously</a>. Here the idea is to think of ML systems as ideal optimizers and ask what a perfect optimizer would do in a given scenario. This is where Nick Bostrom&apos;s colorful description of a <a href="https://www.nickbostrom.com/ethics/ai.html?ref=bounded-regret.ghost.io">paperclip maximizer</a> comes from, where an AI asked to make paperclips turns the entire planet into paperclip factories. To give some more prosaic examples:</p>
<ul>
<li>The optimization anchor would correctly predict <a href="https://owainevans.github.io/pdfs/truthfulQA_lin_evans.pdf?ref=bounded-regret.ghost.io">imitative deception</a> (Lin et al., 2021), since a system optimized to produce high-probability outputs has no intrinsic reason to be truthful.</li>
<li>It would also observe that power-seeking is instrumentally useful for many different goals, and so predict that optimal policies (as well as sufficiently powerful neural networks) will <a href="https://arxiv.org/abs/1912.01683?ref=bounded-regret.ghost.io">tend to seek power</a> (Turner et al., 2021).</li>
</ul>
<p>Ideas produced by the optimization anchor are often met with skepticism, because they often contradict the familiar current ML anchor, and they don&apos;t benefit from the intuitive appeal of the human anchor. But the differences from these other two anchors are precisely what make the optimization anchor valuable. If you (like me) feel that both the current ML and human anchors paint an incomplete picture, then you should want a third independent perspective.</p>
<p>The optimization anchor does have limitations. Since it abstracts ML into an ideal optimizer, it ignores most on-the-ground facts about neural networks. This can lead to underconstrained predictions, and to ignoring properties that I think will be necessary for successfully aligning ML systems with humans. I&apos;ll say more about this later, but some particularly important properties are that neural networks often generalize in &quot;natural&quot; ways, that we can introspect on network representations, and that training dynamics are smooth and continuous. Researchers focused on the optimization anchor don&apos;t entirely ignore these facts, but I think they tend to underemphasize them and are overly pessimistic as a result.</p>
<h2 id="the-value-of-thought-experiments">The Value of Thought Experiments</h2>
<p>The optimization anchor points to the value of thought experiments more generally. While it poses the thought experiment of &quot;What if AI were a perfect optimizer?&quot;, there are many other thought experiments that can provide insights that&apos;d be hard to obtain from the ML or human anchors. In this sense thought experiments are not a single anchor but a generator for anchors, which seems pretty valuable.</p>
<p>One thought experiment that I particularly like is: <em>What happens if most of an agent&apos;s learning occurs not during gradient descent, but through in-context learning<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup>?</em> This is likely to happen eventually, as ML agents are rolled out over longer time horizons (think artificial digital assistants) and as ML improves at in-context learning. Once this does happen, it seems possible that agents&apos; behavior will be controlled less by the &quot;extrinsic&quot; shaping of gradient descent and more by whatever &quot;intrinsic&quot; drives they happen to have<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup>. This also seems like a change that could happen suddenly, since gradient descent is slow while in-context learning is fast.</p>
<p>It would be great if we had a community of researchers constructing thought experiments with clearly stated assumptions, explaining in detail the consequences of those assumptions, and ideally connecting them to present-day research.</p>
<h2 id="other-anchors">Other Anchors</h2>
<p>There are many other anchors that could be helpful for predicting future ML systems. <strong>Non-human animal behavior</strong> could provide a broader reference class than humans alone. <strong>Evolution</strong> and <strong>the economy</strong> are both examples of powerful, distributed optimization processes. I am most excited about better understanding <strong>complex systems</strong>, which include biological systems, brains, organizations, economies, and ecosystems and thus subsume most of the reference classes discussed so far. It seems to me that complex systems have received little attention relative to their germaneness to ML. Indeed, emergence is itself a concept from complex systems theory that is useful for understanding recent ML developments.</p>
<h2 id="limitations-of-thought-experiments">Limitations of Thought Experiments</h2>
<p>I&apos;ve focused so far on <em>predicting</em> problems that we need to address. But at some point we actually have to <em>solve</em> the problems. In this regard thought experiments are weaker, since while they often point to important big-picture issues, in my view they fare poorly at getting the details right, which is needed for engineering progress. For instance, early thought experiments <a href="https://www.nickbostrom.com/ethics/ai.html?ref=bounded-regret.ghost.io">considered a single AI system</a> that was much more powerful than any other contemporary technologies, while in reality there will likely be many ML systems with a continuous distribution of capabilities. <a href="https://arxiv.org/abs/1906.01820?ref=bounded-regret.ghost.io">More recent</a> thought experiments impose discrete abstractions like &quot;goals&quot; and &quot;objectives&quot; that I don&#x2019;t think will cleanly map onto real ML systems. Thus while thought experiments can point to general ideas for research, even mapping these ideas to the ontology of ML systems can be a difficult task.</p>
<!-- [later post](https://bounded-regret.ghost.io/p/74d500d2-a980-4720-984a-c016284ecdc2/) -->
<p>As a result, while we can&apos;t blindly extrapolate empirical trends, we do need a concerted empirically-based effort to address future ML risks. I&apos;ll explain why I think this is possible in a later post, but first I&apos;ll take us through an example of &quot;taking a thought experiment seriously&quot;, and what it implies about possible failure modes of ML systems.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>In-context learning refers to learning that occurs during a single &quot;rollout&quot; of a model. The most famous example is <a href="https://arxiv.org/abs/2005.14165?ref=bounded-regret.ghost.io">GPT-3</a>&apos;s ability to learn new tasks after conditioning on a small number of examples. <a href="#fnref1" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn2" class="footnote-item"><p>While this statement borders on anthropomorphizing, I think it is actually justified. For instance, depending on the training objective, many agents will likely have a &quot;drive&quot; towards information-gathering, among others. <a href="#fnref2" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
</ol>
</section>
<!--kg-card-end: markdown-->]]></description><link>https://bounded-regret.ghost.io/thought-experiments-provide-a-third-anchor/</link><guid isPermaLink="false">619697e69ee68c003bbdf64a</guid><dc:creator><![CDATA[Jacob Steinhardt]]></dc:creator><pubDate>Tue, 18 Jan 2022 16:00:00 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><!-- Previously, I've argued that future ML systems might exhibit [unfamiliar, emergent capabilities](https://bounded-regret.ghost.io/p/1527e9dd-c48d-4941-9b14-4f7293318d5c/), and that this [poses new risks](https://bounded-regret.ghost.io/p/8fd9c776-24b8-474f-b43b-4d8ef1a82836/) that we aren't currently prepared for. I'll talk next about what to do about it.

The first step to mitigating risks is to predict what they'll be, so this post will be primarily focused on *predicting* future ML behavior. For this purpose, it's helpful to think in terms of "anchors"---some [reference class](https://bounded-regret.ghost.io/p/f089591d-6107-4714-ad69-aa7a5b5030b0/) that is broadly analogous to future ML systems, which we can then use to make predictions. -->
<p>Previously, <a href="https://bounded-regret.ghost.io/future-ml-systems-will-be-qualitatively-different/">I argued</a> that we should expect future ML systems to often exhibit &quot;emergent&quot; behavior, where they acquire new capabilities that were not explicitly designed or intended, simply as a result of scaling. This was a special case of a general phenomenon in the physical sciences called More Is Different.</p>
<p>I care about this because I think AI will have a huge impact on society, and I want to <a href="https://bounded-regret.ghost.io/ai-forecasting/">forecast what future systems will be like</a> so that I can steer things to be better. To that end, I find More Is Different to be troubling and disorienting. I&#x2019;m inclined to forecast the future by <a href="https://bounded-regret.ghost.io/forecasting-zeroth-and-first-order/">looking at existing trends</a> and asking what will happen if they continue, but we should instead expect new qualitative behaviors to arise all the time that are not an extrapolation of previous trends.</p>
<p>Given this, how can we predict what future systems will look like? For this, I find it helpful to think in terms of &quot;anchors&quot;---<a href="https://bounded-regret.ghost.io/base-rates-and-reference-classes/">reference classes</a> that are broadly analogous to future ML systems, which we can then use to make predictions.</p>
<p>The most obvious reference class for future ML systems is current ML systems---I&apos;ll call this the <strong>current ML</strong> anchor. I think this is indeed a pretty good starting point, but we&#x2019;ve already seen that it <a href="https://bounded-regret.ghost.io/future-ml-systems-will-be-qualitatively-different/">fails to account</a> for emergent capabilities.</p>
<p>What other anchors can we use? One intuitive approach would be to look for things that humans are good at but that current ML systems are bad at. This would include:</p>
<ul>
<li>Mastery of external tools (e.g. calculators, search engines, software, programming)</li>
<li>Very efficient learning (e.g. reading a textbook once to learn a new subject)</li>
<li>Long-term planning (e.g. being able to successfully achieve goals over months)</li>
</ul>
<p>Models sufficiently far in the future will presumably have these sorts of capabilities. While this still leaves unknowns---for instance, we don&apos;t know how rapidly these capabilities will appear---it&apos;s still a useful complement to the current ML anchor. I&apos;ll call this the <strong>human anchor</strong>.</p>
<p>A problem with the human anchor is that it risks anthropomorphizing ML by over-analogizing with human behavior. Anthropomorphic reasoning correctly gets a bad rap in ML, because it&apos;s very intuitively persuasive but has a <a href="https://en.wikipedia.org/wiki/Anthropomorphism?ref=bounded-regret.ghost.io#In_computing">mixed-at-best</a> track record. This isn&apos;t a reason to abandon the human anchor, but it means we shouldn&apos;t be entirely satisfied with it.</p>
<p>This brings us to a third anchor, the <strong>optimization anchor</strong>, which I associate with the &quot;Philosophy&quot; or thought experiment approach that I&apos;ve <a href="https://bounded-regret.ghost.io/p/a9fce268-857f-4f03-a2a6-d7aa2ab129aa/">described previously</a>. Here the idea is to think of ML systems as ideal optimizers and ask what a perfect optimizer would do in a given scenario. This is where Nick Bostrom&apos;s colorful description of a <a href="https://www.nickbostrom.com/ethics/ai.html?ref=bounded-regret.ghost.io">paperclip maximizer</a> comes from, where an AI asked to make paperclips turns the entire planet into paperclip factories. To give some more prosaic examples:</p>
<ul>
<li>The optimization anchor would correctly predict <a href="https://owainevans.github.io/pdfs/truthfulQA_lin_evans.pdf?ref=bounded-regret.ghost.io">imitative deception</a> (Lin et al., 2021), since a system optimized to produce high-probability outputs has no intrinsic reason to be truthful.</li>
<li>It would also observe that power-seeking is instrumentally useful for many different goals, and so predict that optimal policies (as well as sufficiently powerful neural networks) will <a href="https://arxiv.org/abs/1912.01683?ref=bounded-regret.ghost.io">tend to seek power</a> (Turner et al., 2021).</li>
</ul>
<p>Ideas produced by the optimization anchor are often met with skepticism, because they often contradict the familiar current ML anchor, and they don&apos;t benefit from the intuitive appeal of the human anchor. But the differences from these other two anchors are precisely what make the optimization anchor valuable. If you (like me) feel that both the current ML and human anchors paint an incomplete picture, then you should want a third independent perspective.</p>
<p>The optimization anchor does have limitations. Since it abstracts ML into an ideal optimizer, it ignores most on-the-ground facts about neural networks. This can lead to underconstrained predictions, and to ignoring properties that I think will be necessary for successfully aligning ML systems with humans. I&apos;ll say more about this later, but some particularly important properties are that neural networks often generalize in &quot;natural&quot; ways, that we can introspect on network representations, and that training dynamics are smooth and continuous. Researchers focused on the optimization anchor don&apos;t entirely ignore these facts, but I think they tend to underemphasize them and are overly pessimistic as a result.</p>
<h2 id="the-value-of-thought-experiments">The Value of Thought Experiments</h2>
<p>The optimization anchor points to the value of thought experiments more generally. While it poses the thought experiment of &quot;What if AI were a perfect optimizer?&quot;, there are many other thought experiments that can provide insights that&apos;d be hard to obtain from the ML or human anchors. In this sense thought experiments are not a single anchor but a generator for anchors, which seems pretty valuable.</p>
<p>One thought experiment that I particularly like is: <em>What happens if most of an agent&apos;s learning occurs not during gradient descent, but through in-context learning<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup>?</em> This is likely to happen eventually, as ML agents are rolled out over longer time horizons (think artificial digital assistants) and as ML improves at in-context learning. Once this does happen, it seems possible that agents&apos; behavior will be controlled less by the &quot;extrinsic&quot; shaping of gradient descent and more by whatever &quot;intrinsic&quot; drives they happen to have<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup>. This also seems like a change that could happen suddenly, since gradient descent is slow while in-context learning is fast.</p>
<p>It would be great if we had a community of researchers constructing thought experiments with clearly stated assumptions, explaining in detail the consequences of those assumptions, and ideally connecting them to present-day research.</p>
<h2 id="other-anchors">Other Anchors</h2>
<p>There are many other anchors that could be helpful for predicting future ML systems. <strong>Non-human animal behavior</strong> could provide a broader reference class than humans alone. <strong>Evolution</strong> and <strong>the economy</strong> are both examples of powerful, distributed optimization processes. I am most excited about better understanding <strong>complex systems</strong>, which include biological systems, brains, organizations, economies, and ecosystems and thus subsume most of the reference classes discussed so far. It seems to me that complex systems have received little attention relative to their germaneness to ML. Indeed, emergence is itself a concept from complex systems theory that is useful for understanding recent ML developments.</p>
<h2 id="limitations-of-thought-experiments">Limitations of Thought Experiments</h2>
<p>I&apos;ve focused so far on <em>predicting</em> problems that we need to address. But at some point we actually have to <em>solve</em> the problems. In this regard thought experiments are weaker, since while they often point to important big-picture issues, in my view they fare poorly at getting the details right, which is needed for engineering progress. For instance, early thought experiments <a href="https://www.nickbostrom.com/ethics/ai.html?ref=bounded-regret.ghost.io">considered a single AI system</a> that was much more powerful than any other contemporary technologies, while in reality there will likely be many ML systems with a continuous distribution of capabilities. <a href="https://arxiv.org/abs/1906.01820?ref=bounded-regret.ghost.io">More recent</a> thought experiments impose discrete abstractions like &quot;goals&quot; and &quot;objectives&quot; that I don&#x2019;t think will cleanly map onto real ML systems. Thus while thought experiments can point to general ideas for research, even mapping these ideas to the ontology of ML systems can be a difficult task.</p>
<!-- [later post](https://bounded-regret.ghost.io/p/74d500d2-a980-4720-984a-c016284ecdc2/) -->
<p>As a result, while we can&apos;t blindly extrapolate empirical trends, we do need a concerted empirically-based effort to address future ML risks. I&apos;ll explain why I think this is possible in a later post, but first I&apos;ll take us through an example of &quot;taking a thought experiment seriously&quot;, and what it implies about possible failure modes of ML systems.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>In-context learning refers to learning that occurs during a single &quot;rollout&quot; of a model. The most famous example is <a href="https://arxiv.org/abs/2005.14165?ref=bounded-regret.ghost.io">GPT-3</a>&apos;s ability to learn new tasks after conditioning on a small number of examples. <a href="#fnref1" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn2" class="footnote-item"><p>While this statement borders on anthropomorphizing, I think it is actually justified. For instance, depending on the training objective, many agents will likely have a &quot;drive&quot; towards information-gathering, among others. <a href="#fnref2" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
</ol>
</section>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Future ML Systems Will Be Qualitatively Different]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>In 1972, the Nobel prize-winning physicist Philip Anderson wrote the essay &quot;<a href="https://science.sciencemag.org/content/177/4047/393?ref=bounded-regret.ghost.io">More Is Different</a>&quot;. In it, he argues that quantitative changes can lead to qualitatively different and unexpected phenomena. While he focused on physics, one can find many examples of More Is Different in other domains as well,</p>]]></description><link>https://bounded-regret.ghost.io/future-ml-systems-will-be-qualitatively-different/</link><guid isPermaLink="false">6171e9a410420700484e6cb0</guid><dc:creator><![CDATA[Jacob Steinhardt]]></dc:creator><pubDate>Tue, 11 Jan 2022 19:45:14 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>In 1972, the Nobel prize-winning physicist Philip Anderson wrote the essay &quot;<a href="https://science.sciencemag.org/content/177/4047/393?ref=bounded-regret.ghost.io">More Is Different</a>&quot;. In it, he argues that quantitative changes can lead to qualitatively different and unexpected phenomena. While he focused on physics, one can find many examples of More Is Different in other domains as well, including biology, economics, and computer science. Some examples of More Is Different include:</p>
<ul>
<li><strong>Uranium.</strong> With a bit of uranium, nothing special happens; with a large amount of uranium packed densely enough, you get a nuclear reaction.</li>
<li><strong>DNA.</strong> Given only small molecules such as calcium, you can&#x2019;t meaningfully encode useful information; given larger molecules such as DNA, you can encode a genome.</li>
<li><strong>Water.</strong> Individual water molecules aren&#x2019;t wet. Wetness only occurs due to the interaction forces between many water molecules interspersed throughout a fabric (or other material).</li>
<li><strong>Traffic.</strong> A few cars on the road are fine, but with too many you get a traffic jam. It might be that 10,000 cars can traverse a highway easily in 15 minutes, while 20,000 on the road at once take over an hour.</li>
<li><strong>Specialization.</strong> Historically, in small populations, virtually everyone needed to farm or hunt to survive; in contrast, in larger and denser communities, enough food is produced for large fractions of the population to specialize in non-agricultural work.</li>
</ul>
<p>While some of the examples, like uranium, correspond to a sharp transition, others like specialization are more continuous. I&#x2019;ll use <strong>emergence</strong> to refer to qualitative changes that arise from quantitative increases in scale, and <strong>phase transitions</strong> for cases where the change is sharp.</p>
<p>In this post, I&apos;ll argue that emergence often occurs in the field of AI, and that this should significantly affect our intuitions about the long-term development and deployment of AI systems. We should expect weird and surprising phenomena to emerge as we scale up systems. This presents opportunities, but also poses important risks.</p>
<h2 id="emergent-shifts-in-the-history-of-ai">Emergent Shifts in the History of AI</h2>
<p>There have already been several examples of quantitative differences leading to important qualitative changes in machine learning.</p>
<p><strong>Storage and Learning.</strong> The emergence of machine learning as a viable approach to AI is itself an example of More Is Different. While learning had been discussed since the 1950s, it wasn&#x2019;t until the 80s-90s that it became a dominant paradigm: for instance, IBM&#x2019;s <a href="https://dl.acm.org/doi/10.3115/991635.991651?ref=bounded-regret.ghost.io">first statistical translation model</a> was published in 1988, even though the idea <a href="https://aclanthology.org/1952.earlymt-1.1.pdf?ref=bounded-regret.ghost.io">was proposed</a> in 1949<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup>. Not coincidentally, 1GB of storage cost over $100k in 1981 but only around $9k in 1990 (adjusted to 2021 dollars). The <a href="https://catalog.ldc.upenn.edu/LDC95T20?ref=bounded-regret.ghost.io">Hansard corpus</a> used to train IBM&#x2019;s model comprised 2.87 million sentences and would have been difficult to use before the 80s. Even the simple MNIST dataset would have required $4000 in hardware just to store in 1981, but that had fallen to a few dollars by 1998 when it was published. Cheaper hardware thus allowed for a qualitatively new approach to AI: in other words, More storage enabled Different approaches.</p>
<p><strong>Compute, Data, and Neural Networks.</strong> As hardware improved, it became possible to train neural networks that were very deep for the first time. Better compute enabled bigger models trained for longer, and better storage enabled learning from more data; AlexNet-sized models and ImageNet-sized datasets wouldn&#x2019;t have been feasible for researchers to experiment with in 1990.</p>
<p>Deep learning performs well with lots of data and compute, but struggles at smaller scales. Without many resources, simpler algorithms tend to outperform it, but with sufficient resources it pulls far ahead of the pack. This reversal of fortune led to qualitative changes in the field. As one example, the field of machine translation moved from <a href="https://nlp.stanford.edu/phrasal/?ref=bounded-regret.ghost.io">phrase-based models</a> (hand-coded features, complex systems engineering) to <a href="https://proceedings.neurips.cc/paper/2014/file/a14ac55a4f27472c5d894ec1c3c743d2-Paper.pdf?ref=bounded-regret.ghost.io">neural sequence-to-sequence models</a> (learned features, specialized architecture and initialization) to simply fine-tuning a <a href="https://arxiv.org/abs/2108.07258?ref=bounded-regret.ghost.io">foundation model</a> such as BERT or GPT-3. Most work on phrase-based models was obviated by neural translation, and the same pattern held across many other language tasks, where hard-won domain-specific engineering effort was simply replaced by a general algorithm.</p>
<p><strong>Few-shot Learning.</strong> More recently, <a href="https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf?ref=bounded-regret.ghost.io">GPT-2</a> and <a href="https://arxiv.org/abs/2005.14165?ref=bounded-regret.ghost.io">GPT-3</a> revealed the emergence of strong few-shot and zero-shot capabilities, via well-chosen natural language prompting.</p>
<p align="center"><img src="https://bounded-regret.ghost.io/content/images/2022/01/gpt2_only.png"><br><img src="https://bounded-regret.ghost.io/content/images/2022/01/gpt2_and_3.png">
<br>
<i>Top: Few-shot machine translation performance (BLEU score) for GPT-2. Bottom: GPT-3 (trained on more data) has an even starker curve, going from 5 to 25 BLEU between 100M and 400M parameters. Unsupervised baselines, as well as fine-tuned state-of-the-art, are indicated for reference.</i>
</p>
<p>This was an unexpected and qualitatively new phenomenon that only appeared at large scales, and it emerged without ever explicitly training models to have these few-shot capabilities. Comparing GPT-2 to GPT-3 shows that the exact model size needed can vary due to the training distribution or other factors, but this doesn&#x2019;t affect the basic point that new capabilities can appear without designing or training for them.</p>
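<p>Concretely, &#x201C;few-shot via prompting&#x201D; means conditioning the model on a handful of worked examples written out in plain text, with no gradient updates. A minimal sketch (the prompt template here is illustrative, not the papers&#x2019; exact format):</p>

```python
# Build a few-shot translation prompt: the model was never trained on
# this task; it is simply conditioned on example pairs and asked to
# continue the pattern.
def few_shot_prompt(examples, query):
    lines = [f"English: {en}\nFrench: {fr}" for en, fr in examples]
    lines.append(f"English: {query}\nFrench:")
    return "\n\n".join(lines)

prompt = few_shot_prompt(
    [("The cat sat on the mat.", "Le chat s'est assis sur le tapis."),
     ("Where is the library?", "Où est la bibliothèque ?")],
    "I would like a coffee.",
)
# `prompt` is then fed to the language model, and its continuation
# is read off as the translation.
```

The surprising empirical finding was that this works at all, and that it works far better at large scale than small.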
<p><strong>Grokking.</strong> In 2021, <a href="https://mathai-iclr.github.io/papers/papers/MATHAI_29_paper.pdf?ref=bounded-regret.ghost.io">Power et al.</a> identified a phenomenon they call &quot;grokking&quot;, in which a network&#x2019;s generalization behavior improves qualitatively when it is trained for longer (even though the training loss is already small).</p>
<p>Specifically, for certain algorithmically generated logic/math datasets, neural networks trained for 1,000 steps achieve perfect train accuracy but near-zero test accuracy. However, after around 100,000 steps the test accuracy suddenly increases, achieving near-perfect generalization by 1 million steps.</p>
<p><img src="https://bounded-regret.ghost.io/content/images/2021/11/grokking.png" alt="grokking" loading="lazy"></p>
<p>This shows that even for a single model, we might encounter qualitative phase transitions as we train for longer.</p>
<p><strong>Other potential examples.</strong> I&apos;ll briefly list other examples from recent papers. I don&apos;t think these examples are as individually clear-cut, but they collectively paint an interesting picture:</p>
<ul>
<li><a href="https://arxiv.org/abs/2111.09259?ref=bounded-regret.ghost.io">McGrath et al. (2021)</a> show that AlphaZero acquires many chess concepts at a phase transition near 32,000 training steps.</li>
<li><a href="https://arxiv.org/abs/2201.03544?ref=bounded-regret.ghost.io">Pan et al. (2022)</a> show that reward hacking sometimes occurs via qualitative phase transitions as model size increases.</li>
<li>DeepMind&apos;s recent <a href="https://arxiv.org/abs/2112.11446?ref=bounded-regret.ghost.io">Gopher</a> model exhibits a phase transition on the FEVER task, acquiring the ability to evaluate evidence provided as side information (Figure 3):</li>
</ul>
<p align="center"><img src="https://bounded-regret.ghost.io/content/images/2022/01/fever_plot.png"></p>
<ul>
<li><a href="https://arxiv.org/abs/2109.01652?ref=bounded-regret.ghost.io">Wei et al. (2021)</a> show that instruction-tuning hurts small models but helps large models (see Figure 6).</li>
<li>Some few-shot tasks such as arithmetic show phase transitions with model size (see <a href="https://arxiv.org/abs/2005.14165?ref=bounded-regret.ghost.io">Brown et al. (2020)</a>, Figure 3.10).</li>
<li><a href="https://twitter.com/NaxAlpha/status/1420700413125447683?ref=bounded-regret.ghost.io">This</a> researcher shares an anecdote similar to the &#x201C;grokking&#x201D; paper.</li>
</ul>
<h2 id="what-this-implies-for-the-engineering-worldview">What This Implies for the Engineering Worldview</h2>
<p>In the <a href="https://bounded-regret.ghost.io/more-is-different-for-ai/">introduction post</a> to this series, I contrasted two worldviews called Philosophy and Engineering. The Engineering worldview, which is favored by most ML researchers, tends to predict the future by looking at empirical trends and extrapolating them forward. I myself am <a href="https://bounded-regret.ghost.io/forecasting-zeroth-and-first-order/">quite sympathetic to this view</a>, and for this reason I find emergent behavior to be troubling and disorienting. Rather than expecting empirical trends to continue, emergence suggests we should often expect new qualitative behaviors that are not extrapolations of previous trends.</p>
<p>Indeed, in this sense Engineering (or at least pure trend extrapolation) is self-defeating as a tool for predicting the future<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup>. The Engineering worldview wants to extrapolate trends, but one trend is that emergent behavior is becoming more and more common. Of the four phase transitions I gave above, the first (storage) occurred around 1995, and the second (compute) occurred around 2015. The last two occurred in 2020 and 2021. Based on past trends, we should expect future trends to break more and more often.<sup class="footnote-ref"><a href="#fn3" id="fnref3">[3]</a></sup></p>
<p>How can we orient ourselves when thinking about the future of AI, given the likelihood of frequent deviations from past experience? I&apos;ll have a lot more to say about this in the next few posts, but to put some of my cards on the table:</p>
<ul>
<li>Confronting emergence will require adopting mindsets that are less familiar to most ML researchers and utilizing more of the Philosophy worldview (in tandem with Engineering and other worldviews).</li>
<li>Future ML systems will have weird failure modes that don&apos;t manifest today, and we should start thinking about and addressing them in advance.</li>
<li>On the other hand, I don&apos;t think that Engineering as a tool for predicting the future is entirely self-defeating. Despite emergent behavior, empirical findings often generalize surprisingly far, at least if we&apos;re careful in interpreting them. Utilizing this fact will be crucial to making concrete research progress.</li>
</ul>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>From the <a href="https://aclanthology.org/J90-2002.pdf?ref=bounded-regret.ghost.io">IBM model authors</a>: &#x201C;In 1949 Warren Weaver suggested that the problem be attacked with statistical methods and ideas from information theory, an area which he, Claude Shannon, and others were developing at the time (Weaver 1949). Although researchers quickly abandoned this approach, advancing numerous theoretical objections, we believe that the true obstacles lay in the relative impotence of the available computers and the dearth of machine-readable text from which to gather the statistics vital to such an attack. Today, computers are five orders of magnitude faster than they were in 1950 and have hundreds of millions of bytes of storage. Large, machine-readable corpora are readily available.&#x201D; <a href="#fnref1" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn2" class="footnote-item"><p>This is in contrast to using Engineering to <em>build capable and impressive systems</em> today. If anything, recent developments have strongly solidified Engineering&#x2019;s dominance for this task. <a href="#fnref2" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn3" class="footnote-item"><p>This list is probably subject to selection bias and recency effects, although I predict that my point would still hold up for a carefully curated list (for instance, I didn&#x2019;t include the several ambiguous examples in my count). I would be happy to bet on more phase transitions in the future if any readers wish to take the other side. <a href="#fnref3" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
</ol>
</section>
<!--kg-card-end: markdown-->]]></content:encoded></item></channel></rss>