Tuesday, April 12, 2016

Self-aware robots and Newcomb's paradox

In my last blog post, I discussed Newcomb's paradox and made some jokes about introducing new Swedish personal pronouns for one-boxers and two-boxers.

For those of you who didn't read that, Newcomb's problem lets you take the contents of either or both of two boxes: a small one which is known to contain $\$$1000, and a large one which contains either nothing or $\$$1,000,000. The catch is that a superintelligent being capable of simulating your brain has prepared the boxes, and put a million dollars in the large box only if it predicts that you will not take the small one.
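To make the payoff structure concrete, here is a minimal sketch in Python. The function name, the string labels, and the simplification that the predictor fills the large box exactly when it predicts one-boxing are my own illustration, not part of the original formulation.

    # Hypothetical sketch: dollars received, given the player's choice and
    # the predictor's guess. The large box is filled exactly when the
    # prediction is "one-box".
    def payoff(choice: str, prediction: str) -> int:
        large = 1_000_000 if prediction == "one-box" else 0
        small = 1_000
        if choice == "one-box":      # take only the large box
            return large
        if choice == "two-box":      # take both boxes
            return large + small
        return 0                     # take neither box

With a predictor that never errs, only the matching cases ever occur: payoff("one-box", "one-box") is 1,000,000, while payoff("two-box", "two-box") is 1,000.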

Newcomb's paradox seems to be a purely hypothetical argument about a situation that obviously cannot occur: even a superintelligent alien or AI will never be able to predict people's choices with that accuracy. Some claim that such prediction is fundamentally impossible because of quantum randomness (though nobody I talked to dismissed it by appealing to free will...).

In this post I argue that the Newcomb experiment is not only feasible, but might very well be significant to the design of intelligent autonomous robots (at least in the context of what Nick Bostrom imagines they could do in this TED talk). Finally, as requested by some readers of my previous post, I will reveal my own Newcombian orientation.

So here's the idea: instead of thinking of ourselves as the subject of the experiment, we can put ourselves in the role of the predictor. The subject will then have to be something that we can simulate with perfect accuracy. It could be a computer program or an abstract model written down on paper, but one way or another we must be able to compute its decision.

This looks simple enough. Just like a chess program chooses among the available move options and gets abstractly punished or rewarded for its decisions, we might let a computer program play Newcomb's game, trying to optimize its payoff. But that model is too simple to be of any interest. If the Newcomb game is all there is, then the experiment will favor one-boxers, punish two-boxers, and nothing more will come out of it.
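As a toy illustration of that simple setting, here is a sketch in Python, reusing the payoff function sketched above (the names are mine), in which the "predictor" simulates the subject simply by running its code:

    # The agent is an ordinary, deterministic function, so "simulating its
    # brain" is just calling it. The predictor and the subject therefore
    # perform the same computation and always agree.
    def run_newcomb(agent) -> int:
        prediction = agent()   # the predictor's simulation of the agent
        choice = agent()       # the agent's actual decision
        return payoff(choice, prediction)

    def one_boxer():
        return "one-box"

    def two_boxer():
        return "two-box"

    print(run_newcomb(one_boxer))  # 1000000
    print(run_newcomb(two_boxer))  # 1000

For deterministic agents like these, prediction and decision coincide by construction, so the game mechanically rewards one-boxing code and punishes two-boxing code, and that is all there is to it.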

It becomes more interesting in the context of autonomous agents with general artificial intelligence. A robot moving around in the physical world under the control of an internal computer will have to be able to understand not only its environment, but also its own role in that environment. An intriguing question is how an autonomous artificial intelligence would perceive itself. Should it somehow be aware that it is a robot governed by a computer program?

One thing that makes a computerized intellect very different from a human is that it might be able to "see" its own program in a way that humans can't. There is no way a human brain could understand in every detail how a human brain works, and it's easy to fall into the trap of thinking that this is some sort of deep truth about how things must be. But an artificial intelligence, even a powerful one, need not have a large source code. Perhaps its code would be only a few thousand lines, maybe a few million. Its memory will almost certainly be much larger, but will have a simple organization.

So an AI could very well be able to load its source code into its memory, inspect it, and in some sense know that "this is me". The actual computations that the program is able to perform (like building a neural network based on everything on the internet) could require several orders of magnitude more memory than is taken up by the source code. Still, an AI might very well have a basic organization simple enough for complete introspection.

So should the robot be "self-aware"? The first answer that comes to mind might be yes, because that seems to make it more intelligent, and perhaps better at interacting with people. After all, we require it to have a basic understanding of other physical objects and agents, so why should it have a blind spot stopping it from understanding itself?

But suppose the robot is asked to make a decision in order to optimize something. Then it had better believe that there are different options. If a self-driving car becomes aware of itself and of the role of its program, then it might (correctly!) deduce that if it chooses to crash the car, then that choice was implicitly determined by its code, and therefore that's what it was supposed to do. And by the way, don't tell me it could reprogram itself, because it couldn't (although Eliezer Yudkowsky at Less Wrong seems to have a different opinion). Whatever it was programmed to do doesn't count as reprogramming itself. Besides, it doesn't need to, because of Turing universality.

It's not that the AI would think that crashing the car was better than any other option. The problem is that if it becomes aware of being a deterministic process, then the concept of having different options loses its meaning. It's true that this argument doesn't provide a reason for crashing the car rather than doing something else, but I'm not sure I would feel safe in a vehicle driven by an AI that starts chatting about free will versus determinism.

Perhaps an agent with self-awareness and reasoning capabilities must be programmed with a counterfactual idea of having "free will" in order to prevent this sort of disaster? Spooky...

There are other problems too with self-aware algorithms. One is the old "Gödelization" trick. Suppose an AI has the task of inspecting computer programs and warning if there is a possibility that they might "freeze" by going into an infinite computation. This is a famous unsolvable problem (the halting problem, and yes, it is unsolvable, not just unsolved), but we can instruct the AI that it has the right to answer "I don't know" if it can't figure out the answer, just as long as it is always right when it does answer.
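To be concrete about what is being asked, here is a sketch of the required interface (not of a general solution, which the halting problem rules out); the function name and the answer strings are placeholders of my own choosing.

    # The contract: an answer other than "don't know" must be correct.
    # This placeholder trivially satisfies the contract by always giving up;
    # the hard (and, in full generality, impossible) part is to answer
    # "safe" often enough to be useful while never being wrong.
    def ai_says_safe(program_source: str) -> str:
        # Inspect the program's source and return "safe", "unsafe",
        # or "don't know" about whether it can ever freeze.
        return "don't know"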

This seems fine, but a tiny amount of self-awareness is all it takes to goof up. Suppose the internal organization of the AI is such that it might first prove that program X never freezes, then go on to do something else for a while, and finally when giving the answer, remember only that "I proved a moment ago that program X will never freeze". Then it seems reasonable for it to trust its own previous conclusion and answer that program X never freezes, rather than going through the proof again. After all, if it can't ever trust that what it did a moment ago was correct, how could it ever be confident in anything?

So somehow its self-awareness leads it to accept the principle that "When I have proved something, it must be true". And this is already way out on a slippery slope.

Because now the AI is asked to inspect a program that we might call Perfect Simulator. Perfect Simulator is basically a so-called universal Turing machine, capable of simulating any other computer program. It takes input in two parts, which we can write as $(A, B)$. Part $A$ is interpreted as a computer program written in some universal programming language (it doesn't matter which), and part $B$ is what we can call auxiliary input. Perfect Simulator then simulates what the program encoded in part $A$ would have done when run with $(A, B)$ as input. You might have thought it should simulate what program $A$ does when run with input $B$, because that would already make it a universal computer (meaning that it can do whatever any computer program can do, if only it gets the right input). But it's even funnier to let it simulate $A$ run with input $(A, B)$.

In order to make Perfect Simulator safe in the sense of never freezing, we first let it inspect the code in part $A$ using the AI's own method, effectively simulating the AI. This is easy, since the AI's program code is available. If the simulated version of the AI says that program $A$ is fine (will never freeze on any input), then Perfect Simulator goes on to run it with input $(A, B)$; otherwise it stops.
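Here is a sketch of this construction in the same toy Python setting, reusing the placeholder ai_says_safe above and assuming, purely as my own convention, that a program's source code exposes its behaviour as a function main(A, B):

    # Perfect Simulator: certify part A with (a copy of) the AI's own
    # inspection method, then simulate the program encoded by A on the
    # full input (A, B) rather than on B alone.
    def perfect_simulator(a_source: str, b):
        if ai_says_safe(a_source) != "safe":
            return "refused"                    # stop without simulating
        namespace = {}
        exec(a_source, namespace)               # load program A from source
        return namespace["main"](a_source, b)   # run A on (A, B)

Note that if the inspector ever answered "safe" when handed Perfect Simulator's own source code, then calling perfect_simulator on that source code twice over would start simulating itself simulating itself, which is exactly the freeze described below.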

So what happens when the self-aware AI inspects Perfect Simulator? Easy: Perfect Simulator is OK, since it obviously only runs programs that are safe. If it is asked to run a program that wouldn't get the approval of the AI, it stops without simulating it. So provided the AI's method is correct, Perfect Simulator will never freeze. Notice that the AI can see that the method used by Perfect Simulator to inspect programs before running them is the same as the one the AI itself is using.

Being aware of its own correctness, the AI therefore concludes that Perfect Simulator will never freeze. Which is incorrect.

Because now we run Perfect Simulator with two copies of its own source code as input, schematically \[PerfectSimulator(PerfectSimulator, PerfectSimulator).\] Perfect Simulator begins by using the self-aware AI to certify that the program specified in the first part of the input (which happens to be Perfect Simulator itself) is safe. Then it starts simulating what that program (that is, Perfect Simulator) would have done when run with input $(PerfectSimulator, PerfectSimulator)$. And we are back to square one, because Perfect Simulator now starts simulating itself, and in that simulation it starts simulating itself, and so on. In short, it freezes.

If you haven't seen this argument before, it might seem a bit complicated. Today it's considered standard, but it took geniuses like Kurt Gödel and Alan Turing to figure it out in the first place. If you read Swedish you can take a look at a little story I wrote in some of my first blog posts, outlining how the game of rock-paper-scissors could have led to the discovery of Gödel's and Turing's theorems.

The strange moral here is that once the program becomes aware of its own correctness (the fact that it never concludes that a program is safe if it isn't), it becomes incorrect! Also notice that we have no reason to think that the AI would be unable to follow this argument. It is not only aware of its own correctness, but also aware of the fact that if it thinks it is correct, it isn't. So an equally conceivable scenario is that it ends up in complete doubt of its abilities, and answers that it can't know anything.

The Newcomb problem seems to be just one of several ways in which a self-aware computer program can end up in a logical meltdown. Faced with the two boxes, it knows that taking both gives more than taking one, and at the same time that taking one box gives $\$$1,000,000 and taking both gives $\$$1000.

It might even end up taking none of the boxes: it shouldn't take the small one, because that forfeits the million dollars. And taking only the large one will give you a thousand dollars less than taking both, which gives you $\$$1000; that is, it gives you nothing at all. Ergo, there is no way of getting any money!?

The last paragraph illustrates the danger of erroneous conclusions: they diffuse through the system. You can't have a little contradiction in an AI capable of reasoning. If you believe that $0=1$, you will also believe that you are the pope (if $0=1$ then $2=1$, so you and the pope, being two people, are one person).

Roughly a hundred years ago, there was a "foundational crisis" in mathematics, triggered by Russell's paradox. The idea of a mathematical proof had become clear enough that people tried to formalize it and write down the "rules" of theorem proving. But because of the paradoxes of naive set theory, it turned out not to be so easy. Eventually the dust settled, and we got seemingly consistent first-order axiomatizations of set theory as well as various type theories. But if we want robots to be "self-aware" and capable of reasoning about their own computer programs, we might be facing similar problems again.

Finally, what about my own status: do I one-box or two-box? Honestly, I think it depends on the amounts of money. In the standard formulation I would take one box, because I'm rich enough that a thousand dollars wouldn't matter in the long run, but a million would. On the other hand, if the small box instead contains $\$$999,000, I take both boxes, even though the problem is qualitatively the same.

There is a very neat argument in a blog post by Scott Aaronson called "Dude, it's like you read my mind", brought to my attention by Olle Häggström. Aaronson's argument is that in order for a predictor to be able to predict your choice with the required certainty, it would have to simulate you with such precision that the simulated agent would be indistinguishable from you. So indistinguishable that you might as well be that simulation. And if you are, then your choice when facing the (simulated) Newcomb boxes will have causal effects on the payoff of the "real" you. So you should take one box.

Although I like this argument, I don't think it holds. If you are a simulation, then you are a mathematical object (one that might, in particular, have been simulated before on billions of other computers in parallel universes), so the idea of your choice having causal effects on this mathematical object is just as weird as the idea of the "real" you causing things to have already happened. I don't actually dismiss this idea (and a self-aware robot will have to deal with being a mathematical object). I just think that Aaronson's argument fails to get rid of the causality problem in the original "naive" argument for one-boxing.

Moreover, the idea that you might be the predictor's simulation seems to violate the conditions of the problem. If you don't know that the predictor is following the rules, then the problem changes. If, for instance, you can't exclude being in some sort of Truman show where the thousand people subjected to the experiment before you were just actors, then the setup is no longer the same. And if you can't even exclude the idea that the whole universe as you know it might cease to exist once you have made your choice (because it was just a simulation of the "real" you), then it's even worse.

So if I dismiss this argument, why do I even consider taking just one box? "Because I want a million dollars" is not a valid explanation, since the two-boxers too want a million dollars (it's not their fault that they can't get it).

At the moment I don't think I can come up with anything better than:

"I know it's better to take both boxes; I'm just tired of being right all the time. Now give me a million dollars!"

As far as I know, Newcomb's problem is still open. More importantly, and harder to sweep under the rug, so is the quest for a consistent system for a robot reasoning about itself, its options, and its program code, including the consequences of other agents (not just Newcomb's hypothetical predictor) knowing that code.

Should it take one box because that causes it to have a program that the predictor will have rewarded with a million dollars? The form of that argument is suspiciously similar to the argument of the self-driving car that running a red light and hitting another car will cause it to have software that forced it to do so.


Wednesday, April 6, 2016

Newcomb's boxes: Are you an enb or a tvåb?

My good friend and colleague Olle Häggström recently came out as a one-boxer. I am of course talking about Newcomb's paradox, a thought experiment in which you are offered two boxes: a small one with the known content of a thousand kronor, and a large one that contains either a million kronor or nothing.

You may take one box, or both (or neither), and your task is to try to get as much money as possible. Pretty simple, one might think. Taking both must surely be at least as good as taking only the small one, and better than taking none at all. And reasonably, it must also be better to take both than to take only the large one, because whatever the large box contains, you get a thousand kronor more if you take both than if you take only the large one.

But here comes the catch: the boxes have been prepared by a superintelligent being with the ability to analyze your brain and predict your choice. This being has simulated your thought process, and if it has concluded that you will take both boxes, the large box is empty, but if it has predicted that you will take only the large box, it contains a million. It is part of the premises of the thought experiment that the superintelligent being is assumed to have an infallible ability to predict people's choices; for instance, we can assume that you have already seen it make correct predictions of this kind for a thousand other people. Or even that you yourself have been subjected to the experiment before (though then you would have to be a multimillionaire to have enough data, and some of the suspense is lost).

Now it's dead simple again. If you take both boxes, you get a thousand kronor, but if you take only the large one, you get a million. So of course you take just one box. But the contents of the box are already fixed before you make your choice. Either there is a million in it, or there isn't. If there is a million in there, you might as well take both boxes like a real smooth operator. And if there isn't, you can at least guard yourself against the indignity of standing there with nothing but an empty box. So take two. But if you take both, you know you'll only get a thousand kronor. So one box. But your dithering can't have causal effects backwards in time; that would contradict the theory of relativity. So take both. Or maybe there are some strange quantum-mechanical effects that let you navigate between parallel universes through your choice, in which case it's better to steer into a universe where the box contains a million. So take one. But that's just idiotic...

And so it goes on. Incidentally, I had planned to write a blog post about Newcomb's paradox three years ago, but that time it ended up being about three-tailed cats, Gödel's incompleteness theorem, and other things. But now that we are discussing Newcomb's paradox, let's take the opportunity to quote Robert Nozick, who wrote about it in 1969:

"To almost everyone, it is perfectly clear and obvious what should be done. The difficulty is that these people seem to divide almost evenly on the problem, with large numbers thinking that the opposing half is just being silly."

Some people, like Olle and the computer scientist Scott Aaronson, are one-boxers. Others, for instance Isaac Asimov, defy the predictor and take both. And someone, it is still debated who, declares himself a two-boxer with the words "take two" just after John Lennon fumbles the guitar intro to Revolution 1. It doesn't sound like Paul McCartney. Possibly it is the sound engineer Geoff Emerick. And can we really count out Lennon?

My thought is that we can kill two birds with one stone and simultaneously (1) solve the problem of Swedish personal pronouns, and (2) create a natural reason to ask strangers about their stance on Newcomb's paradox, which would give an interesting glimpse into how they think and help us get to know them.

We scrap the silly old "hon" and "han" (after all, "hon" means roughly "darling" in English), and introduce "enb" and "tvåb" instead. These words are short for one-boxer and two-boxer, and are to be used when referring to a person who takes one or two Newcomb boxes, respectively. One can for instance say about Olle that "enb explains his position in a blog post".

This of course means that at any moment you might need to ask people, or otherwise guess, whether they are one-boxers or two-boxers, but as I said, that is only an advantage. Perhaps we could introduce dress codes signalling which kind you are.

To stimulate diversity in philosophical breakfast-table discussions, we could also introduce as a norm, or why not legislate, that two people may become life partners only if one of them is a one-boxer and the other a two-boxer.

But what do we do about people who don't take a clear position? They might not even have heard of Newcomb's paradox. And what do we do when talking about an unspecified person, someone who could be either? Robert Nozick imagined a continuous spectrum between one-boxers and two-boxers, and according to Aaronson there are three kinds of people, not two. One could imagine a third pronoun, "wit", for "Wittgensteinian", someone who claims that the question is meaningless.

But I think that's a non-problem. Surely it's fine to just say "enb or tvåb"? And "wit" more or less means cleverness in English, so that won't do. Besides, Wittgenstein died about ten years before Newcomb formulated his paradox, so what did he have to do with it? I mean wit, no, I mean enb or tvåb...