Measuring and Forecasting Risks from AI

As part of my work for Open Philanthropy, I recently wrote a call for grant proposals on measuring and forecasting risks from future AI systems. I personally think this is a promising area that could absorb $10 million or more in high-quality research projects, and I want to use this post to explain why I'm excited about it.

I've written before about why I think measurement is a highly underrated endeavor that we need much more of in the field of machine learning. But beyond this general need, measurement is especially useful for mitigating risks. This is because few people want to build AI that causes harm, but right now it's unclear how large the risks from AI are or what shape they will take. Operationalizing and measuring risks makes it much easier to mobilize resources to reduce them. And to the extent that some actors do want to cause harm (or are indifferent to it), measuring those harms creates political pressure to change course.

Measurement is also particularly helpful for thinking about future risks. Currently many future risks seem fairly speculative to the typical ML researcher---including things like deception, treacherous turns, or extreme forms of reward hacking. While I think people are wrong to confidently write these off, I doubt that more compelling conceptual arguments will make much headway in this debate. It will be far more convincing to systematically generate concrete examples of these behaviors, and show empirically that they grow with model capacity. This will also be the first step towards mitigating them.

In the remainder of this post, I want to clear up two common confusions that I run into when talking to people about measurement.

Measurement Isn't Just Benchmarks

Data sets have recently gotten a lot of attention in machine learning. For instance, Chris Ré and Karan Goel have a nice blog post on data-centric AI, and many conferences are starting to implement a datasets track. A lot of this was initially kicked off by the massive success of the ImageNet benchmark, and by many later dataset contributions that showed that ImageNet wasn't a singular phenomenon and that datasets systematically helped spur important progress.

Given this background, there are two common but understandable misconceptions: that datasets are all about benchmarks, and that measurement is all about datasets.

Datasets are more than benchmarks. Historically, ML has been obsessed with achieving state-of-the-art accuracy on benchmarks, and most datasets in machine learning were primarily used as benchmarks to compare different methods. But datasets can be used to understand machine learning in other ways as well. For instance, Power et al. (2021) use algorithmically generated datasets to discover a "grokking" phenomenon wherein neural networks suddenly learn to generalize correctly after many training steps. Similarly, a recent ICLR submission uses several RL environments to discover a phase transition where agents suddenly learn to overfit or "hack" their reward function. And several out-of-distribution datasets primarily demonstrated how models pick up on spurious cues. In none of these cases was the primary point to evaluate and improve state-of-the-art accuracy.

Measurement is not just datasets. Datasets certainly provide one powerful way to understand models, since in the general case using a dataset simply means providing some inputs to a neural network and seeing what happens. However, just as studying the behavior of an organism offers only partial insight, studying the input-output behavior of a neural network has its limits (despite also having clear successes so far).

Beyond input-output behavior, we can also visualize features of neural networks, compute the influence of training data, or check what happens when we randomize or combine different networks. These richer forms of measurement are more nascent but could be very valuable if fully developed, which we can do by better understanding their strengths and limitations. I've particularly appreciated some of the work by Julius Adebayo and his collaborators on measuring the sensitivity of different interpretability methods. Our own group has also started to ground similarity metrics by evaluating their correlation with downstream functionality.
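To make the idea of grounding a similarity metric a bit more concrete, here is a minimal sketch of one way such a check could look: compute a rank correlation between the metric's dissimilarity scores and some measure of functional difference across pairs of networks. This is an illustration under my own assumptions, not necessarily the procedure used in the work mentioned above. The choice of Spearman correlation, the names `dissimilarity` and `accuracy_gap`, and all of the numbers are hypothetical placeholders.

```python
def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical placeholder values, one entry per pair of networks.
# dissimilarity: score assigned by the similarity metric under study.
# accuracy_gap: difference in downstream accuracy between the two networks.
dissimilarity = [0.05, 0.20, 0.10, 0.40, 0.30]
accuracy_gap = [0.01, 0.06, 0.08, 0.15, 0.09]

rho = spearman(dissimilarity, accuracy_gap)
print(f"rank correlation with downstream functionality: {rho:.2f}")
```

A metric whose scores track functional differences should yield a correlation well above zero. A metric that does not is measuring something other than what we care about.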

We Can Measure "Far-Off" Risks

A separate objection to measurement is that many of the biggest risks from AI can't be measured until they happen (or are about to happen), and that long-tail risks are long-tailed precisely because they break from empirical trends. Some concrete forms of this worry might be:

Reward hacking will be kept under control as long as AI systems aren't much more capable than humans, but at some point they'll be smart enough to fool humans consistently, and we won't have many "warning shots" before this happens.
A misaligned AI system might strategically follow what humans want (because it knows it will get shut off if it does not), until it is able to seize enough resources to avoid shut-off.
From a broader perspective, there are black swan events all the time, such as the 2001 dot-com burst, the 2008 financial crisis, and the 2016 U.S. election. It doesn't seem like we have a good track record of handling these, so why assume we will in the future?

Regarding the first two examples, I think there are many ways to create and study analogous phenomena today. For instance, we could look at reward hacking for weak human supervisors (such as time-constrained crowd annotators). In addition, two of the papers I cited above specifically show that ML systems can undergo sudden changes, and offer ways to study this. This also shows that measurement is not the same as empirical trend extrapolation---sometimes measurements can even demonstrate that a hoped-for trend is unreliable.
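As a toy illustration of the difference between measurement and trend extrapolation, the sketch below fits a straight line to a metric measured on small systems and compares the extrapolated value with a direct measurement at a larger scale. Everything here is an assumption made for illustration: `hack_rate` is a made-up logistic curve standing in for a sudden phase transition, and the capacity values, threshold, and sharpness are arbitrary rather than taken from any real experiment.

```python
import math


def hack_rate(capacity, threshold=6.0, sharpness=4.0):
    """Hypothetical metric: fraction of episodes that show reward hacking.

    A logistic curve in capacity stands in for a sudden phase transition.
    """
    return 1.0 / (1.0 + math.exp(-sharpness * (capacity - threshold)))


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Measure the metric on small systems only, then fit a linear trend.
small_capacities = [1.0, 2.0, 3.0, 4.0]
slope, intercept = fit_line(
    small_capacities, [hack_rate(c) for c in small_capacities]
)

# Compare the extrapolated trend with a direct measurement at larger scale.
large_capacity = 8.0
extrapolated = slope * large_capacity + intercept
measured = hack_rate(large_capacity)

print(f"extrapolated: {extrapolated:.4f}, measured: {measured:.4f}")
```

In this toy setting the fitted trend predicts almost no reward hacking at the larger scale, while the direct measurement shows it happening nearly all the time. The trend alone would have been misleading, and it is the additional measurement that reveals this.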

For the final example, I agree that we have a bad track record at long-tail events. But my impression is that approaches that incorporated systematic measurements did better, not worse, than alternatives. For instance, most polls showed that the 2016 U.S. election was close, even though most pundits wrote Trump off. And bad financial instruments underlying the 2008 crisis started to show measurable problems by 2005: "By December 2005, subprime mortgages that had been issued just six months earlier were already showing atypically high delinquency rates" [source].

I don't think measurement is a panacea. But it has a great track record at revealing and mitigating risks, and we should be using it far more than we do.

Jacob Steinhardt
