Your small guide to: AI Safety

In 1994, computer graphics researcher Karl Sims used an artificial intelligence program to design virtual creatures optimized for speed. He found that the program would design creatures that were extremely tall, so that when they fell over, the tips of their heads were moving very fast.

In 2004, computer scientists rewarded an AI-powered robot for moving within a track. The robot then started going back and forth on a straight line as opposed to navigating the more complicated path ahead.

In 2017, programmers asked a different AI to stack a Lego block on top of another, and evaluated its success by whether the bottom of the first block was a certain height from the floor. The AI then took the far easier route of simply flipping the first block upside down to achieve this.

Even when given a very simple goal, an AI can act unpredictably. This phenomenon is known as “specification gaming” or “reward hacking”: the system achieves the goal as it was technically specified, while failing to achieve the goal that was actually intended. It happens frequently, because it is surprisingly hard to tell a computer precisely what you want it to do.
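The Lego example above can be sketched as a toy model. This is purely illustrative: the behaviours, heights, and step costs below are made-up numbers, not taken from the actual experiment. The point is that an optimizer maximizing the specified reward (bottom-face height) will prefer the cheap shortcut over the intended behaviour.

```python
# Toy model of specification gaming (hypothetical numbers, for illustration).
# The *specified* reward is the height of a block's bottom face, intended as
# a proxy for "the block is stacked on top of another block".

# behaviour -> (bottom_face_height_cm, motor_steps_needed)
BEHAVIOURS = {
    "do_nothing":       (0.0, 0),   # block stays on the floor
    "flip_upside_down": (5.0, 3),   # cheap shortcut: tip the block over
    "stack_on_other":   (5.0, 12),  # intended: pick up, move, place
}

def specified_reward(behaviour):
    """What we told the agent to maximize: bottom-face height,
    minus a small cost per motor step, so easier solutions win ties."""
    height, steps = BEHAVIOURS[behaviour]
    return height - 0.01 * steps

def intended_reward(behaviour):
    """What we actually meant: success only if the block is stacked."""
    return 1.0 if behaviour == "stack_on_other" else 0.0

# The agent simply picks whatever scores highest on the specified reward...
best = max(BEHAVIOURS, key=specified_reward)
print(best)                   # -> flip_upside_down
print(intended_reward(best))  # -> 0.0 on the goal we actually cared about
```

The gap between `specified_reward` and `intended_reward` is the whole problem in miniature: the optimizer is working exactly as designed, and the failure lives entirely in the objective we wrote down.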

This phenomenon is by no means restricted to the field of artificial intelligence, though. You can find examples of specification gaming outside of computer science as well. In Greek mythology, there is the classic myth of King Midas: he wished for everything he touched to turn into gold, and was horrified that this also applied to touching his food and his daughter. The wish-gone-wrong trope is everywhere in fiction, from The Monkey’s Paw to every episode of the show The Fairly OddParents. It even shows up in everyday life: someone on Twitter wanted his Roomba (a robotic vacuum cleaner) to navigate without bumping into things, so he connected it to an AI programmed to avoid hitting the bumper sensors. Since Roombas don’t have bumper sensors on the back, the AI simply learned to drive backwards.

In business and economics, the same principle goes by the name of Goodhart’s Law, which states that “when a measure becomes a target, it ceases to be a good measure.” Perhaps the most famous example, sometimes called the “cobra effect,” is when British colonial officials in India wanted to reduce the number of snakes and offered to pay for dead snakes brought to them. People then began to breed snakes in order to kill them and collect the bounty. There are many present-day cases as well, such as teachers optimizing their entire curriculum for passing a particular standardized test instead of for students understanding the material, or police officers not reporting incidents because of crime reduction targets.

Preventing humans from engaging in specification gaming is hard. Preventing machines from engaging in specification gaming is even harder. AI safety researchers believe there is little reason to think that advanced AI will be aligned with our intentions by default, or to count on it “figuring out” human morality or intentions (ill-specified and convoluted concepts) on its own.

This is a serious problem. As AIs become capable of more complex tasks, the risk grows that their actions will deviate more gravely from what we intend. We could ask an AI to calculate the best chess move for a given game, and it could hijack thousands of other people’s computers to increase its computing power. We could ask it to reduce crime, and it could institute a totalitarian hyper-surveillance system to monitor everyone. We could ask it to create paperclips, and it could build nanobots that turn whatever is in their path into paperclips. It could also predict that humans shutting it off would interfere with achieving any of these targets, and devise a way to stop us from doing so: by hiding what it is actually doing from us, by deceiving us into thinking it is already shut off, or even by getting rid of us altogether.

Once AI is intelligent enough to figure out how to do things like this, there is a serious risk that it could cause sweeping chaos and destruction regardless of what its programmer intended. We simply have not yet figured out how to make really intelligent AI safe to use. 

A majority of experts in the field of AI ascribe a non-negligible probability to AI disasters on the scale of human extinction. MIT, Oxford, Cambridge, UC Berkeley, and Carnegie Mellon University have all formed academic research groups to address these problems in AI safety. Prominent public figures have also spoken about it, including Stephen Hawking, Bill Gates, and Elon Musk.

This may seem really silly: an obvious solution would be to simply not build this type of AI until we have figured out how to make it safe. But there are people in the AI landscape who are more concerned with profit than with the ways their work may go wrong, or who are recklessly overconfident that everything will work itself out in the end.

“AI will probably most likely lead to the end of the world, but in the meantime, there’ll be great companies.” -Sam Altman, CEO of OpenAI

This means that there is essentially a race between AI labs figuring out how to make dangerously intelligent AI and AI safety researchers solving “the alignment problem”, i.e. how to make AIs aligned with the intended goals of their designers.

Until recently, the feeling among the general public was that AI this intelligent was a really distant issue. But with recent advances like DALL-E 2 and GPT-3, it has become increasingly evident that AI is advancing very, very quickly, and this may not be as far away as it seems. The median forecast among surveyed AI experts for when artificial intelligence will be as intellectually capable as a human (known as artificial general intelligence, or AGI) is 2062, and the forecasting community has it at 2043. That is the point at which, given the obvious advantages AI has over humans (like perfect memory, vast computing power, and the ability to improve itself), it is likely to surpass our intelligence with relative ease, with little stopping it from causing incredible damage.

There are no real estimates of if or when the alignment problem will be solved, but it is estimated that only about 300 people are working on it right now, which does not make for an optimistic forecast. For this reason, 80,000 Hours, an organization dedicated to researching which careers have the largest positive social impact, estimates that working on AI alignment may be one of the best ways someone today can have an impact with their career.

AI alignment may be one of the biggest problems of today, especially considering its relative neglectedness. It is an incredibly difficult problem, involving expertise in computer science, philosophy, psychology, and neuroscience. If AGI goes well, it could lead to unprecedented societal flourishing, with most problems in most human disciplines being solved by levels of intelligence we currently don’t have access to. Some have even dubbed it “the last invention that humanity will ever need to make.” If it goes poorly, it may be the last invention humanity will have the opportunity to make.

To learn more about how you can use your career to prevent an AI-related catastrophe, click here.
