Stuart Russell: Human Compatible
I am reading the book Human Compatible by Stuart Russell, which I recently ordered along with Algorithms to Live By (Brian Christian, Tom Griffiths) and The Precipice (Toby Ord). I have encountered all of these authors and books in one way or another through the podcast 80000 Hours. The podcast is all about how to make the most of your career in terms of doing something good in the world (or stopping something bad). This is the worldview associated with the Effective Altruism movement.
Stuart Russell is an AI researcher and one of the authors of one of the most widely used textbooks on Artificial Intelligence in university courses. In Human Compatible, subtitled “AI and the Problem of Control”, he presents his views on what he has come to believe is a major problem with the field, one that may lead to true existential risks for humanity. I have read the first chapter. Basically, the problem he describes is that we have defined the aim of the field as building intelligent machines, where intelligence is defined like this:
Machines are intelligent to the extent that their actions can be expected to achieve their objectives.
Instead, he believes we should aim for this:
Machines are beneficial to the extent that their actions can be expected to achieve our objectives.
I.e., their objectives should be aligned with ours.
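To make the distinction concrete for myself, here is a toy sketch (entirely my own, not from the book, and every action and score in it is made up): an agent that is “intelligent” in the first sense picks whatever best achieves the objective it was given, while the second definition judges the same actions against what we actually want.

```python
# Toy illustration of the two definitions above. All actions and scores
# are hypothetical, just to show how the two objectives can come apart.

# Candidate actions for an imagined household robot.
actions = [
    "fetch coffee carefully",
    "fetch coffee fast, knocking things over",
    "do nothing",
]

# The objective the designers wrote down: "get coffee as quickly as possible".
machine_objective = {
    "fetch coffee carefully": 0.7,
    "fetch coffee fast, knocking things over": 1.0,  # fastest, so it scores highest
    "do nothing": 0.0,
}

# What the humans actually want: coffee, but without any damage along the way.
human_objective = {
    "fetch coffee carefully": 1.0,
    "fetch coffee fast, knocking things over": 0.2,  # coffee arrives, but at a cost
    "do nothing": 0.3,
}

# "Intelligent" (first definition): the action expected to achieve *its* objective.
intelligent_choice = max(actions, key=machine_objective.get)

# "Beneficial" (second definition): the action judged against *our* objective.
beneficial_choice = max(actions, key=human_objective.get)

print("Intelligent (its objective):", intelligent_choice)
print("Beneficial (our objective): ", beneficial_choice)
```

The point of the toy example is just that nothing in the first definition forces the two choices to coincide; that is the gap the book is about.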
My reflection, after reading just this first chapter, is this: is it really possible to control the AI’s objectives in this way? If we truly build super-human intelligence, won’t it also gain the ability to set its own objectives, similar to how people are able to set their own objectives? And I don’t mean that in the sense that we “truly” are able to do so, having actual free will and so on – just in the everyday sense that I may come to the conclusion that I should spend my life improving the world by following the principles of Effective Altruism, or that I should withdraw from my ordinary life and become a Zen monk, or that I should become a suicide bomber, or someone who acts on the idea that in order to save life on planet Earth, we need to erase humanity.
As the designers of the AI, we could of course try to limit what kinds of objectives the AI is able to set. Similar to how there are aspects of the human world we simply have no control over (that there are three dimensions of space, that 2 + 2 = 4, etc.), we would set the outer limits of the world of the AI. But… first, can we really do that? And secondly, what about bad actors?
These are some questions I hope to get wiser about as I continue reading the book!