Long-Term and Short-Term Challenges to Ensuring the Safety of AI Systems


There has been much recent discussion about AI risk, meaning specifically the potential pitfalls (both short-term and long-term) that AI with improved capabilities could create for society. Discussants include AI researchers such as Stuart Russell and Eric Horvitz and Tom Dietterich, entrepreneurs such as Elon Musk and Bill Gates, and research institutes such as the Machine Intelligence Research Institute (MIRI) and Future of Humanity Institute (FHI); the director of the latter institute, Nick Bostrom, has even written a bestselling book on this topic. Finally, ten million dollars in funding have been earmarked towards research on ensuring that AI will be safe and beneficial. Given this, I think it would be useful for AI researchers to discuss the nature and extent of risks that might be posed by increasingly capable AI systems, both short-term and long-term. As a PhD student in machine learning and artificial intelligence, this essay will describe my own views on AI risk, in the hopes of encouraging other researchers to detail their thoughts, as well.

For the purposes of this essay, I will define “AI” to be technology that can carry out tasks with limited or no human guidance, “advanced AI” to be technology that performs substantially more complex and domain-general tasks than are possible today, and “highly capable AI” to be technology that can outperform humans in all or almost all domains. As the primary target audience of this essay is other researchers, I have used technical terms (e.g. weakly supervised learning, inverse reinforcement learning) whenever they were useful, though I have also tried to make the essay more generally accessible when possible.


I think it is important to distinguish between two questions. First, does artificial intelligence merit the same degree of engineering safety considerations as other technologies (such as bridges)? Second, does artificial intelligence merit additional precautions, beyond those that would be considered typical? I will argue that the answer is yes to the first, even in the short term, and that current engineering methodologies in the field of machine learning do not provide even a typical level of safety or robustness. Moreover, I will argue that the answer to the second question in the long term is likely also yes --- namely, that there are important ways in which highly capable artificial intelligence could pose risks which are not addressed by typical engineering concerns.

The point of this essay is not to be alarmist; indeed, I think that AI is likely to be net-positive for humanity. Rather, the point of this essay is to encourage a discussion about the potential pitfalls posed by artificial intelligence, since I believe that research done now can mitigate many of these pitfalls. Without such a discussion, we are unlikely to understand which pitfalls are most important or likely, and thus unable to design effective research programs to prevent them.

A common objection to discussing risks posed by AI is that it seems somewhat early on to worry about such risks, and the discussion is likely to be more germane if we wait to have it until after the field of AI has advanced further. I think this objection is quite reasonable in the abstract; however, as I will argue below, I think we do have a reasonable understanding of at least some of the risks that AI might pose, that some of these will be realized even in the medium term, and that there are reasonable programs of research that can address these risks, which in many cases would also have the advantage of improving the usability of existing AI systems.

Ordinary Engineering

There are many issues related to AI safety that are just a matter of good engineering methodology. For instance, we would ideally like systems that are transparent, modular, robust, and work under well-understood assumptions. Unfortunately, machine learning as a field has not developed very good methodologies for obtaining any of these things, and so this is an important issue to remedy. In other words, I think we should put at least as much thought into building an AI as we do into building a bridge.

Just to be very clear, I do not think that machine learning researchers are bad engineers; looking at any of the open source tools such as Torch, Caffe, MLlib, and others make it clear that many machine learning researchers are also good software engineers. Rather, I think that as a field our methodologies are not mature enough to address the specific engineering desiderata of statistical models (in contrast to the algorithms that create them). In particular, the statistical models obtained from machine learning algorithms tend to be:

  1. Opaque: Many machine learning models consist of hundreds of thousands of parameters, making it difficult to understand how predictions are made. Typically, practitioners resort to error analysis examining the covariates that most strongly influence each incorrect prediction. However, this is not a very sustainable long-term solution, as it requires substantial effort even for relatively narrow-domain systems.
  2. Monolithic: In part due to their opacity, models act as a black box, with no modularity or encapsulation of behavior. Though machine learning systems are often split into pipelines of smaller models, the lack of encapsulation can make these pipelines even harder to manage than a single large model; indeed, since machine learning models are by design optimized for a particular input distribution (i.e. whatever distribution they are trained on), we end up in a situation where “Changing Anything Changes Everything” [1].
  3. Fragile: As another consequence of being optimized for a particular training distribution, machine learning models can have arbitrarily poor performance when that distribution shifts. For instance, Daumé and Marcu [2] show that a named entity classifier with 92% accuracy on one dataset drops to 58% accuracy on a superficially similar dataset. Though such issues are partially addressed by work on transfer learning and domain adaptation [3], these areas are not very developed compared to supervised learning.
  4. Poorly understood: Beyond their fragility, understanding when a machine learning model will work is difficult. We know that a model will work if it is tested on the same distribution it is trained on, and have some extensions beyond this case (e.g. based on robust optimization [4]), but we have very little in the way of practically relevant conditions under which a model trained in one situation will work well in another situation. Although they are related, this issue differs from the opacity issue above in that it relates to making predictions about the system's future behavior (in particular, generalization to new situations), versus understanding the internal workings of the current system.

That these issues plague machine learning systems is likely uncontroversial among machine learning researchers. However, in comparison to research focused on extending capabilities, very little is being done to address them. Research in this area therefore seems particularly impactful, especially given the desire to deploy machine learning systems in increasingly complex and safety-critical situations.

Extraordinary Engineering

Does AI merit additional safety precautions, beyond those that are considered standard engineering practice in other fields? Here I am focusing only on the long-term impacts of advanced or highly capable AI systems.

My tentative answer is yes; there seem to be a few different ways in which AI could have bad effects, each of which seems individually unlikely but not implausible. Even if each of the risks identified so far are not likely, (i) the total risk might be large, especially if there are additional unidentified risks, and (ii) the existence of multiple "near-misses" motivates closer investigation, as it may suggest some underlying principle that makes AI risk-laden. In the sequel I will focus on so-called “global catastrophic” risks, meaning risks that could affect a large fraction of the earth’s population in a material way. I have chosen to focus on these risks because I think there is an important difference between an AI system messing up in a way that harms a few people (which would be a legal liability but perhaps should not motivate a major effort in terms of precautions) and an AI system that could cause damage on a global scale. The latter would justify substantial precautions, and I want to make it clear that this is the bar I am setting for myself.

With that in place, below are a few ways in which advanced or highly capable AI could have specific global catastrophic risks.

Cyber-attacks. There are two trends which taken together make the prospect of AI-aided cyber-attacks seem worrisome. The first trend is simply the increasing prevalence of cyber-attacks; even this year we have seen Russia attack Ukraine, North Korea attack Sony, and China attack the U.S. Office of Personnel Management. Secondly, the “Internet of Things” means that an increasing number of physical devices will be connected to the internet. Assuming that software exists to autonomously control them, many internet-enabled devices such as cars could be hacked and then weaponized, leading to a decisive military advantage in a short span of time. Such an attack could be enacted by a small group of humans aided by AI technologies, which would make it hard to detect in advance. Unlike other weaponizable technology such as nuclear fission or synthetic biology, it would be very difficult to control the distribution of AI since it does not rely on any specific raw materials. Finally, note that even a team with relatively small computing resources could potentially “bootstrap” to much more computing power by first creating a botnet with which to do computations; to date, the largest botnet has spanned 30 million computers and several other botnets have exceeded 1 million.

Autonomous weapons. Beyond cyber-attacks, improved autonomous robotics technology combined with ubiquitous access to miniature UAVs (“drones”) could allow both terrorists and governments to wage a particularly pernicious form of remote warfare by creating weapons that are both cheap and hard to detect or defend against (due to their small size and high maneuverability). Beyond direct malicious intent, if autonomous weapons systems or other powerful autonomous systems malfunction then they could cause a large amount of damage.

Mis-optimization. A highly capable AI could acquire a large amount of power but pursue an overly narrow goal, and end up harming humans or human value while optimizing for this goal. This may seem implausible at face value, but as I will argue below, it is easier to improve AI capabilities than to improve AI values, making such a mishap possible in theory.

Unemployment. It is already the case that increased automation is decreasing the number of available jobs, to the extent that some economists and policymakers are discussing what to do if the number of jobs is systematically smaller than the number of people seeking work. If AI systems allow a large number of jobs to be automated over a relatively short time period, then we may not have time to plan or implement policy solutions, and there could then be a large unemployment spike. In addition to the direct effects on the people who are unemployed, such a spike could also have indirect consequences by decreasing social stability on a global scale.

Opaque systems. It is also already the case that increasingly many tasks are being delegated to autonomous systems, from trades in financial markets to aggregation of information feeds. The opacity of these systems has led to issues such as the 2010 Flash Crash and will likely lead to larger issues in the future. In the long term, as AI systems become increasingly complex, humans may lose the ability to meaningfully understand or intervene in such systems, which could lead to a loss of sovereignty if autonomous systems are employed in executive-level functions (e.g. government, economy).

Beyond these specific risks, it seems clear that, eventually, AI will be able to outperform humans in essentially every domain. At that point, it seems doubtful that humanity will continue to have direct causal influence over its future unless specific measures are put in place to ensure this. While I do not think this day will come soon, I think it is worth thinking now about how we might meaningfully control highly capable AI systems, and I also think that many of the risks posed above (as well as others that we haven’t thought of yet) will occur on a somewhat shorter time scale.

Let me end with some specific ways in which control of AI may be particularly difficult compared to other human-engineered systems:

  1. AI may be "agent-like", which means that the space of possible behaviors is much larger; our intuitions about how AI will act in pursuit of a given goal may not account for this and so AI behavior could be hard to predict.
  2. Since an AI would presumably learn from experience, and will likely run at a much faster serial processing speed than humans, its capabilities may change rapidly, ruling out the usual process of trial-and-error.
  3. AI will act in a much more open-ended domain. In contrast, our existing tools for specifying the necessary properties of a system only work well in narrow domains. For instance, for a bridge, safety relates to the ability to successfully accomplish a small number of tasks (e.g. not falling over). For these, it suffices to consider well-characterized engineering properties such as tensile strength. For AI, the number of tasks we would potentially want it to perform is large, and it is unclear how to obtain a small number of well-characterized properties that would ensure safety.
  4. Existing machine learning frameworks make it very easy for AI to acquire knowledge, but hard to acquire values. For instance, while an AI’s model of reality is flexibly learned from data, its goal/utility function is hard-coded in almost all situations; an exception is some work on inverse reinforcement learning [5], but this is still a very nascent framework. Importantly, the asymmetry between knowledge (and hence capabilities) and values is fundamental, rather than simply a statement about existing technologies. This is because knowledge is something that is regularly informed by reality, whereas values are only weakly informed by reality: an AI which learns incorrect facts could notice that it makes wrong predictions, but the world might never “tell” an AI that it learned the “wrong values”. At a technical level, while many tasks in machine learning are fully supervised or at least semi-supervised, value acquisition is a weakly supervised task.

In summary: there are several concrete global catastrophic risks posed by highly capable AI, and there are also several reasons to believe that highly capable AI would be difficult to control. Together, these suggest to me that the control of highly capable AI systems is an important problem posing unique research challenges.

Long-term Goals, Near-term Research

Above I presented an argument for why AI, in the long term, may require substantial precautionary efforts. Beyond this, I also believe that there is important research that can be done right now to reduce long-term AI risks. In this section I will elaborate on some specific research projects, though my list is not meant to be exhaustive.

  1. Value learning: In general, it seems important in the long term (and also in the short term) to design algorithms for learning values / goal systems / utility functions, rather than requiring them to be hand-coded. One framework for this is inverse reinforcement learning [5], though developing additional frameworks would also be useful.
  2. Weakly supervised learning: As argued above, inferring values, in contrast to beliefs, is an at most weakly supervised problem, since humans themselves are often incorrect about what they value and so any attempt to provide fully annotated training data about values would likely contain systematic errors. It may be possible to infer values indirectly through observing human actions; however, since humans often act immorally and human values change over time, current human actions are not consistent with our ideal long-term values, and so learning from actions in a naive way could lead to problems. Therefore, a better fundamental understanding of weakly supervised learning --- particularly regarding guaranteed recovery of indirectly observed parameters under well-understood assumptions --- seems important.
  3. Formal specification / verification: One way to make AI safer would be to formally specify desiderata for its behavior, and then prove that these desiderata are met. A major open challenge is to figure out how to meaningfully specify formal properties for an AI system. For instance, even if a speech transcription system did a near-perfect job of transcribing speech, it is unclear what sort of specification language one might use to state this property formally. Beyond this, though there is much existing work in formal verification, it is still extremely challenging to verify large systems.
  4. Transparency: To the extent that the decision-making process of an AI is transparent, it should be relatively easy to ensure that its impact will be positive. To the extent that the decision-making process is opaque, it should be relatively difficult to do so. Unfortunately, transparency seems difficult to obtain, especially for AIs that reach decisions through complex series of serial computations. Therefore, better techniques for rendering AI reasoning transparent seem important.
  5. Strategic assessment and planning: Better understanding of the likely impacts of AI will allow a better response. To this end, it seems valuable to map out and study specific concrete risks; for instance, better understanding ways in which machine learning could be used in cyber-attacks, or forecasting the likely effects of technology-driven unemployment, and determining useful policies around these effects. It would also be clearly useful to identify additional plausible risks beyond those of which we are currently aware. Finally, thought experiments surrounding different possible behaviors of advanced AI would help inform intuitions and point to specific technical problems. Some of these tasks are most effectively carried out by AI researchers, while others should be done in collaboration with economists, policy experts, security experts, etc.

The above constitute at least five concrete directions of research on which I think important progress can be made today, which would meaningfully improve the safety of advanced AI systems and which in many cases would likely have ancillary benefits in the short term, as well.

Related Work

At a high level, while I have implicitly provided a program of research above, there are other proposed research programs as well. Perhaps the earliest proposed program is from MIRI [6], which has focused on AI alignment problems that arise even in simplified settings (e.g. with unlimited computing power or easy-to-specify goals) in hopes of later generalizing to more complex settings. The Future of Life Institute (FLI) has also published a research priorities document [7, 8] with a broader focus, including non-technical topics such as regulation of autonomous weapons and economic shifts induced by AI-based technologies. I do not necessarily endorse either document, but think that both represent a big step in the right direction. Ideally, MIRI, FLI, and others will all justify why they think their problems are worth working on and we can let the best arguments and counterarguments rise to the top. This is already happening to some extent [9, 10, 11] but I would like to see more of it, especially from academics with expertise in machine learning and AI [12, 13].

In addition, several specific arguments I have advanced are similar to those already advanced by others. The issue of AI-driven unemployment has been studied by Brynjolfsson and McAfee [14], and is also discussed in the FLI research document. The problem of AI pursuing narrow goals has been elaborated through Bostrom’s “paperclipping argument” [15] as well as the orthogonality thesis [16], which states that beliefs and values are independent of each other. While I disagree with the orthogonality thesis in its strongest form, the arguments presented above for the difficulty of value learning can in many cases reach similar conclusions.

Omohundro [17] has argued that advanced agents would pursue certain instrumentally convergent drives under almost any value system, which is one way in which agent-like systems differ from systems without agency. Good [18] was the first to argue that AI capabilities could improve rapidly. Yudkowsky has argued that it would be easy for an AI to acquire power given few initial resources [19], though his example assumes the creation of advanced biotechnology.

Christiano has argued for the value of transparent AI systems, and proposed the “advisor games” framework as a potential operationalization of transparency [20].


To ensure the safety of AI systems, additional research is needed, both to meet ordinary short-term engineering desiderata as well as to make the additional precautions specific to highly capable AI systems. In both cases, there are clear programs of research that can be undertaken today, which in many cases seem to be under-researched relative to their potential societal value. I therefore think that well-directed research towards improving the safety of AI systems is a worthwhile undertaking, with the additional benefit of motivating interesting new directions of research.


Thanks to Paul Christiano, Holden Karnofsky, Percy Liang, Luke Muehlhauser, Nick Beckstead, Nate Soares, and Howie Lempel for providing feedback on a draft of this essay.


[1] D. Sculley, et al. Machine Learning: The High-Interest Credit Card of Technical Debt. 2014. [2] Hal Daumé III and Daniel Marcu. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, pages 101–126, 2006. [3] Sinno J. Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010. [4] Dimitris Bertsimas, David B. Brown, and Constantine Caramanis. Theory and applications of robust optimization. SIAM Review, 53(3):464–501, 2011. [5] Andrew Ng and Stuart Russell. Algorithms for inverse reinforcement learning. In International Conference in Machine Learning, pages 663–670, 2000. [6] Nate Soares and Benja Fallenstein. Aligning Superintelligence with Human Interests: A Technical Research Agenda. 2014. [7] Stuart Russell, Daniel Dewey, and Max Tegmark. Research priorities for robust and beneficial artificial intelligence. 2015. [8] Daniel Dewey, Stuart Russell, and Max Tegmark. A survey of research questions for robust and beneficial AI. 2015. [9] Paul Christiano. The Steering Problem. 2015. [10] Paul Christiano. Stable self-improvement as an AI safety problem. 2015. [11] Luke Muehlhauser. How to study superintelligence strategy. 2014. [12] Stuart Russell. Of Myths and Moonshine. 2014. [13] Tom Dietterich and Eric Horvitz. Benefits and Risks of Artificial Intelligence. 2015. [14] Erik Brynjolfsson and Andrew McAfee. The second machine age: work, progress, and prosperity in a time of brilliant technologies. WW Norton & Company, 2014. [15] Nick Bostrom (2003). Ethical Issues in Advanced Artificial Intelligence. Cognitive, Emotive and Ethical Aspects of Decision Making in Humans and in Artificial Intelligence. [16] Nick Bostrom. "The superintelligent will: Motivation and instrumental rationality in advanced artificial agents." Minds and Machines 22.2 (2012): 71-85. [17] Stephen M. Omohundro (2008). The Basic AI Drives. Frontiers in Artificial Intelligence and Applications (IOS Press). [18] Irving J. Good. "Speculations concerning the first ultraintelligent machine." Advances in computers 6.99 (1965): 31-83. [19] Eliezer Yudkowsky. "Artificial intelligence as a positive and negative factor in global risk." Global catastrophic risks 1 (2008): 303. [20] Paul Christiano. Advisor Games. 2015.

Jacob Steinhardt

Jacob Steinhardt


Sign in to join the conversation.