Joachim Kennedy

Human Alignment Problems

I’ve been reading recently about the Alignment Problem, the concept that superintelligent AIs may not (probably won’t) share human interests, like keeping current humans alive and preventing human extinction. Along with factory farming, it’s up there on the list of “things I feel like I should care about based on who I normally agree with, but don’t”. The idea is that, even if we try to make an AI that efficiently creates paperclips, if it is sufficiently intelligent, it may realize that in order to maximize the number of paperclips created, it needs to melt down other things that are made of metal, like cars and I-beams. Then, if humans try to stop it, it will need to kill those humans. And prevent the other humans from turning it off or cutting power. Even a seemingly innocuous goal could cause devastating effects for humans, given a sufficiently intelligent AI. The AI’s interests don’t necessarily align with humans’ interests, ergo “alignment” problem. A bit like calling the threat of nuclear obliteration the Explosion Problem.

But perhaps that’s because the concept of alignment problems is not unique to superintelligences. In reinforcement learning, AIs are allowed to explore and are rewarded when they do things closer to the desired behavior. If you’re trying to get into underground Roomba Racing, you could hard-code the race course into your Roomba, but then it might struggle at the championships, where there’s a different course. You wouldn’t want to have to hard-code a new course every time. So instead, using reinforcement learning, you train it for a while by letting it explore, rewarding it when it goes faster and penalizing it when it bumps into things. An alignment problem would occur if it learned to go backwards because it only has bump sensors on the front (or, more likely, it would just drive in small circles or spin in place, depending on how it senses its own speed).
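(For the curious, here’s a minimal sketch in Python of how that Roomba reward could be gamed. The sensor names and penalty value are made up for illustration; this isn’t how any real Roomba is trained.)

```python
# A toy reward function for the hypothetical racing Roomba.
# `speed` and `front_bump` are imagined sensor readings.

def reward(speed: float, front_bump: bool) -> float:
    """Reward going fast, penalize collisions the robot can actually detect."""
    r = abs(speed)      # "faster is better" -- direction accidentally ignored
    if front_bump:      # only the front has bump sensors...
        r -= 10.0
    return r

# The misaligned policy: drive backwards at full tilt. Rear collisions are
# never sensed, so the reward above never penalizes them.
print(reward(speed=-1.0, front_bump=False))  # 1.0  -- looks great to the learner
print(reward(speed=+1.0, front_bump=True))   # -9.0 -- honest forward driving gets punished
```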

By and large, I stand with the skeptics as to whether Machine Learning has much to teach us about human cognition, mostly because brains are a lot mushier than computer chips. But if you strip away a lot of the details, I think a comparison can be enlightening. After all, humans were the first reinforcement learners, bumbling around in the darkness until they found rewards and then exploiting those reward sources. I won’t speculate on whether it’s possible for there to be misalignment between humans and nature, but there are certainly many examples of people within systems being misaligned with the intentions of the system.

My favorite example of this is how people interact with The Algorithm, a term I use in the broadest sense to describe how sites decide which content to present to users. At a first approximation, The Algorithm ensures that content that generates a lot of user interaction is promoted and shown to others. The assumption is that people will interact with content in normal, human ways, watching things that are interesting to them, commenting when they have something to say, etc. But no, this first approximation of The Algorithm is unmysterious enough that people often instead act in specific ways that they know The Algorithm rewards. I’m sure there are other factors here, but I think of this as the connecting thread between repetitive Reddit comments, YouTube SNL compilations that “butter people’s eggrolls”, and Kpop stans who are indistinguishable from bots (in that case to promote their idols).
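(If it helps, here’s a toy sketch of that first approximation: rank posts purely by raw interaction counts, no matter why the interaction happened. The post names, counts, and weights are all invented; any real ranking system is far more complicated.)

```python
from dataclasses import dataclass

@dataclass
class Post:
    title: str
    views: int
    comments: int
    shares: int

def engagement_score(p: Post) -> float:
    # Any interaction counts toward promotion, regardless of why it happened.
    return p.views + 5 * p.comments + 10 * p.shares

posts = [
    Post("thoughtful essay", views=200, comments=3, shares=1),
    Post("bait that begs for replies", views=150, comments=40, shares=12),
]

# The bait wins: 470 vs. 225. Acting the way The Algorithm rewards pays off.
for p in sorted(posts, key=engagement_score, reverse=True):
    print(p.title, engagement_score(p))
```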

I don’t want to make it seem like this is a new problem which has arisen with the Internet; the Internet is just rife with examples. Another example, which better matches the scale of the AI Alignment Problem, is Capitalism, which we can think about above the level of the individual. (I’m trying my hardest to not be insufferable about it, but yes, I am reading Gödel, Escher, Bach.) At the level of the individual, Capitalism rewards the accrual of money, which presents its own alignment problems, but at the system level, Capitalism promotes free market economics, meaning that anything that can be sold will be, at some “equilibrium” price. This is a really great feature because, if you can ignore market failures, it means you’re never getting ripped off. You can buy anything at the cheapest price possible. Just like the paperclip AI, a “successful” societal implementation of Capitalism doesn’t account for the future existence of humans or the planet they live on. And that’s not even to mention manufactured desire.


For the nonradicals among you, another example could be a blogger who pats himself on the back every week for getting a post out. This system rewards a certain type and length of post (i.e., shorter ones about inconsequential semantic issues) and discourages longer, more in-depth posts. I grant that it’s harder to devise a system that promotes putting out those longer pieces (without falling into perfectionist traps). I probably won’t change anything soon, but I’m thinking about it.

