An Intuitive Explanation of Eliezer Yudkowsky’s Intuitive Explanation of Bayes’ Theorem

by Luke Muehlhauser on December 18, 2010 in Eliezer Yudkowsky, How-To, Math, Resources

Richard Feynman once said that if nuclear war caused the human race to lose all its knowledge and start over from scratch, but he could somehow pass on to them just one piece of information, he would tell them this:

All things are made of atoms – little particles that move around in perpetual motion, attracting each other when they are a little distance apart, but repelling upon being squeezed into one another.

For Feynman, this was the single most helpful and important piece of information we could pass on to a future human race that had lost all other knowledge.

It’s an excellent choice, especially since it entails reductionism.

After giving it some thought, I think maybe the second piece of information I would pass to a new society is Bayes’ Theorem.

Seeing the world through the lens of Bayes’ Theorem is like seeing The Matrix. Nothing is the same after you have seen Bayes.

But I’d rather not just give the equation and then explain its parts, because if you don’t understand the logic behind the equation, it’s hard to know how to apply it correctly. The goal of the tutorial below is not to teach you how to guess the teacher’s password and give the right responses on an exam. No, the goal of the tutorial below is to give you a true understanding of Bayes’ Theorem so that you can apply it correctly in the complexities of real life that exist beyond the exam sheet. By the end of this tutorial you will not just be able to recite Bayes’ Theorem; you will feel it in your bones.

The most popular online tutorial on Bayes’ Theorem, Eliezer Yudkowsky’s “An Intuitive Explanation of Bayes’ Theorem,” opens like this:

Your friends and colleagues are talking about something called “Bayes’ Theorem” or “Bayes’ Rule”, or something called Bayesian reasoning. They sound really enthusiastic about it, too, so you google and find a webpage about Bayes’ Theorem and…

It’s this equation. That’s all. Just one equation. The page you found gives a definition of it, but it doesn’t say what it is, or why it’s useful, or why your friends would be interested in it. It looks like this random statistics thing.

So you came here. Maybe you don’t understand what the equation says. Maybe you understand it in theory, but every time you try to apply it in practice you get mixed up trying to remember the difference between p(a|x) and p(x|a), and whether p(a)*p(x|a) belongs in the numerator or the denominator. Maybe you see the theorem, and you understand the theorem, and you can use the theorem, but you can’t understand why your friends and/or research colleagues seem to think it’s the secret of the universe. Maybe your friends are all wearing Bayes’ Theorem T-shirts, and you’re feeling left out. Maybe you’re a girl looking for a boyfriend, but the boy you’re interested in refuses to date anyone who “isn’t Bayesian”. What matters is that Bayes is cool, and if you don’t know Bayes, you aren’t cool.

Why does a mathematical concept generate this strange enthusiasm in its students? What is the so-called Bayesian Revolution now sweeping through the sciences, which claims to subsume even the experimental method itself as a special case? What is the secret that the adherents of Bayes know? What is the light that they have seen?

Soon you will know. Soon you will be one of us.

Eliezer’s explanation of this hugely important law of probability is probably the best one on the internet, but I fear it may still be too fast-moving for those who haven’t needed to do even algebra since high school. Eliezer calls it “excruciatingly gentle,” but he must be measuring “gentle” on a scale for people who were reading Feynman at age 9 and doing calculus at age 13 like him.

So, I decided to write an even gentler introduction to Bayes’ Theorem. One that is gentle for normal people.

There are times when Yudkowsky introduces new terms without defining or explaining them (“mean revised probability,” for example). Other times, he leaves you with a difficult problem without the resources you need to solve it (for example, the problem stated right before the phrase “mean revised probability”). That is where, I suspect, many non-mathematicians just give up and don’t come back. If you gave up on Yudkowsky’s introduction to Bayes’ Theorem, I hope you’ll try mine, below. It’s much gentler.

Because this article is gentler than Yudkowsky’s, it’s also longer. So I advise you tackle just one section per day. The table of contents to all sections is below.

This introduction also replaces Yudkowsky’s interactive elements with lots of pictures so that you can read it on a mobile device like the Kindle. Here: download the PDF (updated 01/04/2011).

I hope you find it useful!

Here's a map to guide you.

Table of contents:

The Bayesian within you.

You probably already use Bayesian reasoning without knowing it.

Consider an example I adapted from Neil Manson:

You’re a soldier in combat, crouching in a trench. You know for sure there is just one enemy soldier left on the battlefield, about 400 yards away. You also know that if the remaining enemy is a regular army troop, there’s only a small chance he could hit you with one shot from that distance. But if the remaining enemy is a sniper, then there’s a very good chance he can hit you with one shot from that distance. But snipers are rare, so it’s probably just a regular army troop.

You peek your head out of the trench, trying to get a better look.

Bam! A bullet glances off your helmet and you duck down again.

Okay, you think. I know snipers are rare, but that guy just hit me with a bullet from 400 yards away. I suppose it might still be a regular army troop, but there’s a seriously good chance it’s a sniper, since he hit me from that far away.

After a few minutes, you dare to take another look, and peek your head out of the trench again.

Bam! Another bullet glances off your helmet! You duck down again.

Oh shit, you think. It’s definitely a sniper. No matter how rare snipers are, there’s no way that guy just hit me twice in a row from that distance if he’s a regular army troop. He’s gotta be a sniper. I’d better call for support.

If that’s roughly how you’d reason in a situation like that, then congratulations! You already think like a Bayesian, at least some of the time.

But of course it will be helpful to be more precise than this, and it will be helpful to know when and how our reasoning departs from correct Bayesian reasoning. In fact, in the scientific study of human reasoning biases, a bias is defined in terms of a systematic departure from ideal Bayesian reasoning.
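(If you want a preview of what ideal Bayesian reasoning looks like in the trench, here is a minimal Python sketch. Every number in it is invented for illustration: say 2% of enemy soldiers are snipers, a sniper hits you from 400 yards 70% of the time, and a regular troop only 5% of the time. The helper function is just my own packaging of arithmetic we’ll develop step by step later in the tutorial.)

# All of these numbers are made up for illustration.
p_sniper = 0.02          # prior: snipers are rare
p_hit_if_sniper = 0.70   # a sniper usually hits from 400 yards
p_hit_if_regular = 0.05  # a regular troop rarely does

def revise(prior, p_hit_if_sniper, p_hit_if_regular):
    # Revise the probability that the shooter is a sniper after one hit.
    hit_by_sniper = prior * p_hit_if_sniper
    hit_by_regular = (1 - prior) * p_hit_if_regular
    return hit_by_sniper / (hit_by_sniper + hit_by_regular)

after_one_hit = revise(p_sniper, p_hit_if_sniper, p_hit_if_regular)
after_two_hits = revise(after_one_hit, p_hit_if_sniper, p_hit_if_regular)
print(after_one_hit)   # about 0.22: "a seriously good chance it's a sniper"
print(after_two_hits)  # about 0.80: "he's gotta be a sniper"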

So, let us be Accidental Bayesians no more. Let’s learn to be consistent Bayesians.

Even doctors get it wrong.

We begin with a story problem, like the ones you had in high school. But I promise you, learning Bayes’ Theorem will be far more useful than almost anything you learned in high school.

Here’s the problem:

Only 1% of women at age forty who participate in a routine mammography test have breast cancer. 80% of women who have breast cancer will get positive mammographies, but 9.6% of women who don’t have breast cancer will also get positive mammographies. A woman of this age had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?

If you’re struggling to figure out the answer, you’ll be relieved to know that only 15% of doctors give the correct answer.

And no, I didn’t make up that number. See: Casscells et al. (1978), Eddy (1982), Gigerenzer & Hoffrage (1995).

But grab yourself a calculator and see if you can get the right answer. It’s simple math, but it’s tricky.

Okay. You at least gave it a try, right? You won’t learn Bayes’ Theorem just by reading. You can only learn Bayes Theorem by doing. So you really should try all the exercises.

Go ahead. Give it a try.

I’ll still be here when you’re done.

What answer did you get? Most doctors estimate between 70% and 80%, but that’s wildly incorrect.

Let’s try an easier version of the same problem. With this one, nearly half the doctors get the right answer.

100 out of 10,000 women at age forty who participate in routine screening have breast cancer. 80 of every 100 women with breast cancer will get a positive mammography. 950 out of 9,900 women without breast cancer will also get a positive mammography. If 10,000 women in this age group undergo a routine screening, about what fraction of these women with positive mammographies will actually have breast cancer?

Give it a try. What’s the answer?

The answer is 7.8%. Just 7.8% of the women with positive mammographies will have breast cancer!

Here, follow the logic…

Always begin by figuring out what you want to know. In this case, we want to know what fraction (or percentage) of the women with positive mammographies actually have breast cancer.

First, let’s figure out how many women have positive mammographies. That’s the denominator of our fraction.

The story above says that 950 of the 9,900 that do not have breast cancer will have a positive mammography. So that’s 950 women with a positive test result right there.

The story also says that 80 out of the 100 women who do have breast cancer will get a positive test result. So that’s another 80 women, and 950 + 80 = 1,030 women with a positive test result.

Good. We’ve got half our fraction. Now, how do we find the numerator? How many of those 1,030 women with a positive test result actually have breast cancer?

Well, the story says that 80 of the 100 women with breast cancer will get a positive test result, so 80 is our numerator.

The fraction of women with positive test results who actually have breast cancer is 80/1,030, which is a probability of .078, which is 7.8%.
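If you’d like to double-check that arithmetic, here it is as a few lines of Python, using the numbers straight from the story problem:

women_with_cancer = 100       # out of 10,000 women screened
women_without_cancer = 9_900

true_positives = 0.80 * women_with_cancer   # 80 women with cancer test positive
false_positives = 950                       # 950 of the 9,900 women without cancer also test positive

all_positives = true_positives + false_positives   # 1,030 women test positive in all
print(true_positives / all_positives)              # 0.0776..., or about 7.8%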

So if one of these 40-year-old women got a positive mammography, and the doctor knew the above statistics, then the doctor should tell the woman she has only a 7.8% chance of having breast cancer, even though she had a positive mammography. That’s much less stressful for the woman than if the doctor had told her she had a 70%-80% chance of having breast cancer like most doctors apparently would!

Already, we can see that careful reasoning of this sort has real-world consequences. This is not “just math.”

What goes wrong?

Why do even doctors get that kind of problem wrong, so often? The math isn’t hard, so what goes wrong?

The most common mistake is to focus only on the women with breast cancer who get positive results, while ignoring the other important information, such as the original fraction of women who have breast cancer and the fraction of women without breast cancer who get false positives.

But you always need all three pieces of information to get the right answer.

To get a feel for why you always need all three pieces of information, imagine an alternate universe in which only one woman out of a million has breast cancer. And let’s say the mammography test detected breast cancer in 8 out of 10 cases, while giving a false positive only 10% of the time.

Now, I think you can see that in this universe, the initial probability that a woman has breast cancer is so incredibly low that even if a woman gets a positive test result, it’s still almost certainly true that she does not have breast cancer.

Why? Because there are a lot more women getting false positives (10% of 99.9999% of women) than there are getting true positives (80% of 0.0001% of women). So if a woman gets a positive result, it’s almost certainly a false positive, not a true positive.
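Here is that imaginary universe in a few lines of Python. The one-in-a-million prior and the 80%/10% test rates come straight from the paragraph above:

p_cancer = 1 / 1_000_000      # one woman in a million has breast cancer
p_positive_if_cancer = 0.8    # the test catches 8 out of 10 real cases
p_positive_if_healthy = 0.1   # 10% of healthy women get a false positive

true_positive = p_cancer * p_positive_if_cancer
false_positive = (1 - p_cancer) * p_positive_if_healthy
print(true_positive / (true_positive + false_positive))
# about 0.000008, i.e. 0.0008% -- a positive result is still almost certainly a false positive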

An extreme example like this illustrates that the new data you get from the mammography test does not replace the data you had at the outset about how improbable it was that the woman had breast cancer. Instead, you start with the original probability that a woman has breast cancer, and then the new evidence from the mammography test moves that probability in one direction or the other from that starting point, depending on whether the test is positive or negative. In this way, the mammography test slides the probability of the woman having breast cancer in the direction of the test result.

To illustrate this, consider the original problem again. In that story, 1% of 40-year-old women (100 out of 10,000) have breast cancer. 80% of women with cancer (80 out of 100) test positive, and 9.6% of women without cancer (950 out of 9,900) also test positive. When we did the math, we found that a positive test result slides a woman’s chances of having breast cancer from 1% upward to 7.8%.

You can’t replace the original probability with new information. You can only update it with new information, by sliding in one direction or another from the original probability. The original probability still matters, a fact which is obvious when the original probability is really extreme – for example in a universe where only one in every million women has breast cancer.

Remember that we always need all three pieces of information. We need to know the original fraction of women with breast cancer, the fraction of women with breast cancer who get positive test results, and the fraction of women without breast cancer who get positive test results.

To see why that last piece of information matters – the fraction of women without breast cancer who get false positives – consider a new test: mammography*. A mammography* has the same rate of false negatives as before: 20%. But it also has an alarmingly high rate of false positives: 80%!

Here’s the story problem:

1% of women have breast cancer. 80% of women with breast cancer will get a positive test result. 80% of women without breast cancer will also get a positive test result. A woman had a positive mammography*. What is the probability that she actually has breast cancer?

Go ahead; calculate the answer.

Got your answer?

Okay, let’s start by calculating what percentage of women will get a positive test result. 80% of the 1% of women with breast cancer will get a positive result, so that’s 0.8% of women right there. Also, 80% of the 99% of women without breast cancer will get a positive result, so that’s another 79.2%. And since 0.8% + 79.2% = 80%, that means 80% of women will get a positive test result.

Even though only 1% of women actually have breast cancer!

So already you can tell that third piece of information can make a huge difference.

But let’s finish the calculation. What fraction of women with positive mammography* results actually have cancer?

First, how many women will get a positive test result? That’s our denominator.

Well, there are two groups of women who will get a positive mammography* result: those with a positive result who do have cancer (0.8%), and those with a positive result who don’t have cancer (79.2%). Add those together, and our denominator is 80%.

Time to figure out our numerator. Out of those 80% of women who will get a positive result, how many actually have cancer? We already know the answer, because we already know what percentage of women will test positive and have breast cancer: 0.8%. So the fraction of women with positive mammography* results who actually have cancer is 0.8%/80%, which is 1%.

The woman started out with a 1% chance of having breast cancer, and after the test she still has a 1% chance of having breast cancer.

How did that happen? Didn’t the test tell us anything, either way?

Nope.

Why didn’t it tell us anything? Remember, the mammography* test had such a high rate of false positives that a woman was quite literally just as likely to get a positive result if she didn’t have breast cancer as if she did have breast cancer!

If she did have breast cancer, she had an 80% chance of testing positive. And if she didn’t have breast cancer, she also had an 80% chance of testing positive.

And that’s why the test didn’t tell us anything. So we updated her chances of having breast cancer by 0%. She was just as likely to get the same test result either way, so the test didn’t do anything to tell us which possibility was correct.

In such a case, the mammography* test is completely uncorrelated with incidences of breast cancer, because it gives the same results either way. In fact in this case, there’s no reason to call one result “positive” and another result “negative,” since neither result tells you to slide your probability in either direction.

Which means you might just as well have flipped a coin as your “test” for breast cancer. Flipping a coin would have been equally uncorrelated with incidences of breast cancer. If the woman has breast cancer, there’s a 50% chance the coin will turn up heads. If the woman doesn’t have breast cancer, there is also a 50% chance the coin will turn up heads.

Or, you could just as well have used a test that always gave the same result. Let’s say your test was adding-two-plus-two. If a woman had breast cancer, the result of the adding-two-plus-two test would have been 4. And if a woman hadn’t had breast cancer, the result of the adding-two-plus-two test would have been 4.

All these tests are equally worthless, because these tests give the same result the same percentage of the time whether or not the woman has breast cancer. In order for a test to give us information we can use to update the probability that the woman has breast cancer, the test has to be correlated with breast cancer in some way. The test has to be more likely to give some particular result when a woman does have breast cancer than when she doesn’t.

That’s what it means for something to be a “test” for breast cancer.
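To see that contrast in numbers, here is a small Python sketch that runs ordinary mammography, mammography*, and the coin flip through the same calculation. The helper function is just my own packaging of the arithmetic we have been doing by hand:

def p_cancer_given_positive(prior, p_pos_if_cancer, p_pos_if_healthy):
    # Probability of breast cancer given a positive result.
    true_pos = prior * p_pos_if_cancer
    false_pos = (1 - prior) * p_pos_if_healthy
    return true_pos / (true_pos + false_pos)

print(p_cancer_given_positive(0.01, 0.80, 0.096))  # ordinary mammography: about 0.078
print(p_cancer_given_positive(0.01, 0.80, 0.80))   # mammography*: 0.01 -- no update at all
print(p_cancer_given_positive(0.01, 0.50, 0.50))   # coin flip "test": 0.01 -- no update at all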

But remember: probability is in the mind, not in reality. Even a useful mammography test does not actually change whether or not the woman has cancer. She either has cancer or she doesn’t. Reality is not uncertain about whether or not the woman has cancer. We are uncertain about whether or not she has cancer. It is our information, our judgment, that is uncertain, not reality itself.

Some terms and symbols.

The original proportion of women with breast cancer is known as the prior probability. This is the probability that a woman has breast cancer prior to some new evidence we receive.

What about the proportion of women with breast cancer who get a positive test result, and the proportion of women without breast cancer who get a positive test result? These were the two conditions of our story, so these probabilities are called the two conditional probabilities.

Collectively, the prior probability and the conditional probabilities are known as our priors. They are the bits of information we know prior to calculating the result, which is called the revised probability or posterior probability.

What we showed above is that if the two conditional probabilities are the same – if a positive test is 80% likely if the woman has breast cancer, and a positive test result is 80% likely if the woman doesn’t have breast cancer – then the posterior probability equals the prior probability.

Where do we get our priors from? How do we know what the prior probability is, and what the conditional probabilities are?

Well, those are tested against reality like anything else. For example, if you think 100 out of 10,000 women have breast cancer, but the actual number is 500 out of 10,000, then one of your priors is wrong, and you need to do some more research.

There are also a few easy symbols you should know, because it’s common to work out these kinds of story problems with the help of some symbols.

To illustrate, Eliezer tells a story of some plastic eggs:

Suppose that a barrel contains many small plastic eggs. Some eggs are painted red and some are painted blue. 40% of the eggs in the bin contain pearls, and 60% contain nothing. 30% of eggs containing pearls are painted blue, and 10% of eggs containing nothing are painted blue. What is the probability that a blue egg contains a pearl?

Before you go about solving this problem, let’s introduce the symbols I was talking about. To say “the probability that a certain egg contains a pearl is equal to .4,” we write:

p(pearl) = .4

Now, more notation:

p(blue|pearl) = .3

What is that straight line between blue and pearl? It stands for “given that.” And here, the word blue stands for “is blue.” So we can read the above statement as: “The probability that a certain egg is blue, given that it contains a pearl, is .3.”

One more symbol is the tilde: ~

It means “not,” as in:

p(blue|~pearl) = .1

This reads: “The probability that a certain egg is blue, given that it does not contain a pearl, is .1.”

Now we are ready to express our three pieces of information from the story above, but in symbolic form:

p(pearl) = .4

p(blue|pearl) = .3

p(blue|~pearl) = .1

And of course what we’re looking for is:

p(pearl|blue) = ?

You should be able to read those four statements aloud:

The probability that a certain egg contains a pearl is .4.

The probability that a certain egg is blue, given that it contains a pearl, is .3.

The probability that a certain egg is blue, given that it does not contain a pearl, is .1.

The probability that a certain egg contains a pearl, given that it is blue, is… what?

That’s our problem. Stop and try to solve it without peeking below.

What’s the solution? We’re looking for the probability that an egg contains a pearl, given that it is blue. (This is like trying to figure out the probability that a woman has breast cancer given that she had a positive test result.)

40% of the eggs contain pearls, and 30% of those are blue, so 12% of the eggs altogether are blue and contain pearls.

60% of the eggs contain no pearls, and 10% of those are blue, so 6% of the eggs altogether are blue and contain no pearls.

12% + 6% = 18%, so a total of 18% of the eggs are blue.

We already know that 12% of the eggs are blue and contain pearls, so the chance that a blue egg contains a pearl is 12/18 or about 67%.
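Here is the same calculation in Python, with variable names that spell out the symbols we just introduced:

p_pearl = 0.4                # p(pearl)
p_blue_given_pearl = 0.3     # p(blue|pearl)
p_blue_given_no_pearl = 0.1  # p(blue|~pearl)

blue_and_pearl = p_pearl * p_blue_given_pearl              # 12% of all eggs
blue_and_empty = (1 - p_pearl) * p_blue_given_no_pearl     # 6% of all eggs
print(blue_and_pearl / (blue_and_pearl + blue_and_empty))  # 0.666..., or about 67%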

One famous case of a failure to apply Bayes’ Theorem involves a British woman, Sally Clark. After two of her children died of sudden infant death syndrome (SIDS), she was arrested and charged with murdering her children. Pediatrician Roy Meadow testified that the chances that both children died of SIDS were 1 in 73 million. He got this number by squaring the odds of one child dying of SIDS in similar circumstances (1 in 8,500).

Because of this testimony, Sally Clark was convicted. The Royal Statistical Society issued a public statement decrying this “misuse of statistics in court,” but Sally’s first appeal was rejected. She was released after nearly 4 years in a women’s prison where everyone else thought she had murdered her own children. She never recovered from her experience, developed an alcohol dependency, and died of alcohol poisoning in 2007.

The statistical error made by Roy Meadow was, among other things, to fail to consider the prior probability that Sally Clark had murdered her children. While two sudden infant deaths may be rare, a mother murdering her two children is even more rare.

Visualizing probabilities.

It can help to visualize what’s going on here. On Yudkowsky’s page for Bayes’ Theorem, there is an interactive tool that lets you adjust each of the three values independently and see what the result is. But it only works if you have Java, and not on Mac (not on mine, anyway) or on mobile devices like the Kindle, so I’m going to use images here. (Screenshots of Yudkowsky’s interactive tool, in fact.)

First, the original problem:

Suppose that a barrel contains many small plastic eggs. Some eggs are painted red and some are painted blue. 40% of the eggs in the bin contain pearls, and 60% contain nothing. 30% of eggs containing pearls are painted blue, and 10% of eggs containing nothing are painted blue. What is the probability that a blue egg contains a pearl?

The bar at the top, divided between pearl and empty, shows the prior probability that an egg contains a pearl. The probability is 40%, so the division between the two is just left of center. (Center would be 50%.)

The first conditional probability is p(blue|pearl) or “the probability that an egg is blue given that it contains a pearl.” The size of the right-facing arrow reflects the size of this probability.

The second conditional probability is “the probability that an egg is blue given that it does not contain a pearl.” The bottom row shows the probabilities that an egg either does or doesn’t contain a pearl, given that it is blue.

One thing that might be confusing right away is that the bars at the top and bottom of this drawing are not measuring the same collection of eggs, even though they look the same size in the drawing. The top bar is measuring all eggs, both blue and red. But the bottom bar is measuring only blue eggs. But don’t let this confuse you; drawing it this way allows us to most clearly illustrate the effects of each of our three priors: the prior probability, the first conditional probability, and the second conditional probability. At both the top and bottom of the drawing we are looking at the chances an egg has a pearl. It’s just that in the middle we “ran a test” and discovered that the egg we grabbed from the bin was blue, so that eliminated all the red eggs from the situation.

The slant of the line in the middle represents how we should update our probability that an egg contains a pearl after our first test (whether the egg is blue or not). At first, all we know is that if we grab an egg from the bin, it has a 40% chance of containing a pearl. But now let’s say we grab an egg from the bin and see that it is blue. Because we know that the chances it would be blue if it had a pearl were higher than the chances it would be blue if it didn’t have a pearl, we therefore know the egg now has a higher probability of containing a pearl than it did before. So we slide our probability that the egg contains a pearl in the upward direction. That’s why the line in the middle of the drawing is slanted to the right: we shifted up. Knowing how much to shift up our probability requires doing the math, of course.

Now, let’s look at the effect on the posterior probability if the prior probability is different. What if only 10% of all the eggs contained pearls? Now our drawing would look like this:

Our conditional probabilities didn’t change, so the relative slant of the line in the middle didn’t change. That is, the degree by which we have to update our probability that an egg contains a pearl after we discover it is blue – that degree of updating required did not change. However, the prior probability is now much lower, which means the posterior probability is correspondingly lower.

Remember what happened in the above story about women and breast cancer when our story said that only one woman in a million had breast cancer? If we showed that in a diagram like the one above, the slanted line would be slammed up against the left edge of the drawing, and the posterior probability that a woman had breast cancer would be extremely small no matter what the result of the test, given almost any set of conditional probabilities. (You’d have to have a really slanted line of conditional probabilities to update very far from the far left edge of the diagram.)

And what happens if we keep the conditional probabilities locked in place, but jump the prior probability up to 80%?

Now of course, the prior probability is much greater, so the posterior probability is also much greater.

And again, the degree of updating we need to do remains the same, so the line is still slanted mildly to the right. But notice, the exact amount of updating is not the same. The line is not quite as slanted as when the prior probability was 40%, or even as slanted as when the prior probability was 10%. Why is that?

That’s because the amount by which we need to update our probability after discovering the egg is blue depends not just on the difference between the two conditional probabilities (in this case, 30% and 10%), but also on the prior probability. That effect just drops out of the math. And if you think about it, it makes sense. What if the prior probability that an egg contained a pearl was 99.999%, and the conditional probabilities remained the same? If we updated by the same amount as before, then the probability that a blue egg contains a pearl would be greater than 100%! The slanted line would go off the right edge of the drawing!

If that happened, it would mean you were doing the math wrong. As it happens, pushing the prior probability to 99% just makes the amount of updating we need to do very small in absolute terms, because we still need to adjust upward, but the probability that the egg contains a pearl can’t get much higher than it already is:

Now, what if we return the prior probability to its original value of 40%, but change the first conditional probability?

Now the first conditional probability is exerting much more force on the posterior probability, and slanting our line more heavily to the right. That means we have to make a heavier update to our probability that the egg contains a pearl after discovering it is blue.

Why is that? Well, the first conditional probability is p(blue|pearl) or “the probability that an egg is blue, given that it contains a pearl.” What happens if that probability is far larger than the other conditional probability, “the probability that an egg is blue, given that it doesn’t contain a pearl”? If that happens, then there are going to be a lot more eggs that are blue and contain a pearl than eggs that are blue and empty. So once you have discovered the egg you picked from the bin is blue, you know there is now a much better chance than before that the egg you have in your hand contains a pearl… because there are more pearl-containing blue eggs than there are empty blue eggs.

So again, it’s the difference between the two conditional probabilities that determines by how much we need to update our probability as a result of our test. If the difference between the two conditional probabilities is small, we don’t need to update from the prior probability very much. If the difference between the two conditional probabilities is large, we need to update quite a bit from the prior probability.

Just to emphasize that it’s the difference between the two conditional probabilities that determines the degree by which we must update our probability, and not their absolute values, let’s look at what happens if both conditional probabilities are very high, but not very different:

Now the story is that if an egg contains a pearl, it is definitely blue. There are no red eggs with pearls. However, there is also a high degree of “false positives” in our “test,” because it’s also very common for an egg without a pearl to be blue. Because an egg is very likely to be blue whether it contains a pearl or not, the fact that an egg is blue doesn’t tell you much about whether or not it contains a pearl, and so finding that the egg is blue doesn’t allow us to update our probability very much. It’s the difference between the two conditional probabilities that tells us by how much we can update our prior probability.

Now what if the second conditional probability is larger than the first?

Now the second conditional probability is larger than the first one, so the line is slanted left, and we have to update our probability in the opposite direction. Again, this makes sense. Now the story is that “the probability that an egg is blue given that it doesn’t contain a pearl” is larger than “the probability that an egg is blue given that it does contain a pearl,” which means that in this story there are more eggs that are blue and empty than there are eggs that are blue and contain a pearl. So if we’ve grabbed a blue egg, it is less likely to contain a pearl than we thought before we knew it was blue (the prior probability). So we have to update our probability that the egg contains a pearl in the downward direction this time, no matter what the prior probability is.

And what if our two conditional probabilities are the same?

If our two conditional probabilities are the same, then they exert the same amount of force on our required update, which means we don’t update at all. If an egg is just as likely to be blue given that it contains a pearl as it is likely to be blue given that it doesn’t contain a pearl, then there are just as many eggs that are blue and contain a pearl as there are eggs that are blue and empty, and so the fact that the egg we picked is blue doesn’t give us any new information at all about whether or not it contains a pearl. Thus, we’re stuck with no new information, and we can’t update from the prior probability.

This is the case no matter what the conditional probabilities are, as long as they are the same:

The usual mistake in thinking about these kinds of problems is to simply ignore the prior probability and focus on the two conditional probabilities. But now you can see why all three of these pieces of information are required for calculating the posterior probability correctly.
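If you’d like to play with all three numbers the way the diagrams above do, here is a short Python sketch. The helper function just repackages the egg-and-pearl arithmetic we have already done:

def p_pearl_given_blue(p_pearl, p_blue_if_pearl, p_blue_if_empty):
    blue_with_pearl = p_pearl * p_blue_if_pearl
    blue_without_pearl = (1 - p_pearl) * p_blue_if_empty
    return blue_with_pearl / (blue_with_pearl + blue_without_pearl)

# Vary the prior while holding the conditional probabilities at 30% and 10%.
for prior in (0.10, 0.40, 0.80, 0.99999):
    print(prior, p_pearl_given_blue(prior, 0.3, 0.1))

# Equal conditional probabilities: no update at all, whatever their value.
print(p_pearl_given_blue(0.40, 0.3, 0.3))  # 0.40
print(p_pearl_given_blue(0.40, 0.9, 0.9))  # 0.40

# Second conditional probability larger than the first: the update goes downward.
print(p_pearl_given_blue(0.40, 0.1, 0.3))  # about 0.18, down from 0.40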

Yudkowsky explains:

Studies of clinical reasoning show that most doctors carry out the mental operation of replacing the original 1% probability with the 80% probability that a woman with cancer would get a positive mammography.  Similarly, on the pearl-egg problem, most respondents unfamiliar with Bayesian reasoning would probably respond that the probability a blue egg contains a pearl is 30%, or perhaps 20% (the 30% chance of a true positive minus the 10% chance of a false positive).  Even if this mental operation seems like a good idea at the time, it makes no sense in terms of the question asked.  It’s like the experiment in which you ask a second-grader:  “If eighteen people get on a bus, and then seven more people get on the bus, how old is the bus driver?”  Many second-graders will respond:  “Twenty-five.”  They understand when they’re being prompted to carry out a particular mental procedure, but they haven’t quite connected the procedure to reality.  Similarly, to find the probability that a woman with a positive mammography has breast cancer, it makes no sense whatsoever to replace the original probability that the woman has cancer with the probability that a woman with breast cancer gets a positive mammography.  Neither can you subtract the probability of a false positive from the probability of the true positive.  These operations are as wildly irrelevant as adding the number of people on the bus to find the age of the bus driver.

Dumb without Bayes.

A man who knows he has a genetic predisposition to alcoholism can respond to this knowledge by avoiding alcohol more purposefully than he might otherwise. Likewise, if we can understand why our brains handle probabilities poorly, we may be able to plan ahead and counteract the reasoning mistakes our brains lean toward.

So why do human brains, even the brains of trained doctors, usually get these kinds of problems wrong?

Luckily, recent studies have shed some light on the problem.

It turns out that we get the problems right more or less often depending on how the problem is phrased.

The kind of problem we get wrong the most often is phrased in terms of percentages or probabilities: “1% of women…” and so on.

But we do somewhat better when the problem is phrased in terms of frequencies: “1 out of every 100 women have breast cancer…” and “80 out of every 100 women with breast cancer get a positive test result…” Apparently, this phrasing helps us to visualize a single woman in an empty space made to hold 100 women, or 80 women nearly filling a space made to hold 100 women.

We do best of all when the problem is phrased in terms of absolute numbers, which are called natural frequencies: “400 out of 1000 eggs contain pearls…” and “50 out of 400 pearl-containing eggs are blue…” This is closest of all to actually doing the experiment yourself and experiencing how often you pull a blue egg from the bin, and experiencing how often blue eggs contain pearls.

Yudkowsky remarks:

It may seem like presenting the problem in this way is “cheating”, and indeed if it were a story problem in a math book, it probably would be cheating.  However, if you’re talking about real doctors, you want to cheat; you want the doctors to draw the right conclusions as easily as possible.  The obvious next move would be to present all medical statistics in terms of natural frequencies.  Unfortunately, while natural frequencies are a step in the right direction, it probably won’t be enough.  When problems are presented in natural frequencies, the proportion of people using Bayesian reasoning rises to around half.  A big improvement, but not big enough when you’re talking about real doctors and real patients.

A visualization of the eggs and pearls problem in terms of natural frequencies might look like this:

Because we’re looking at absolute numbers now instead of percentages, the bar at the top is bigger than the bar at the bottom, because the collection of all eggs is larger than the collection of just blue eggs.

The top bar looks the same as before, and the middle line has the same kind of slant, but now the bottom bar is much smaller because we’re only looking at blue eggs in the bottom bar. (The bottom bar is centered, as you can see.)

In this kind of visualization, we see how much we’re updating our probability not so much in the slant of the line as in the difference in proportions between the top bar and the bottom bar. In the above example, you can see just by looking that the pearl condition takes up an even larger proportion of the bottom bar than it does in the top bar, which means we are updating our probability upward.

What does this kind of visualization look like if the conditional probabilities are the same?

In this case, we see that the proportions in the bottom and top bars are the same, so we aren’t updating, even though the line is slanted in this kind of visualization.

But the natural frequencies visualization shows something the probabilities visualization does not. The natural frequencies visualization shows that when we decrease two proportions by the same factor, the resulting proportions are the same. Discovering that the egg was blue decreased the number of pearl-carrying eggs we might be looking at, but it decreased the number of empty eggs we might be looking at by the same factor, and that’s why the probability that the egg we grabbed contains a pearl remains the same as before we discovered it was blue.

Now, let’s look at a natural frequencies visualization for the original problem about breast cancer. 1% of women have breast cancer, 80% of those women test positive on a mammography, and 9.6% of women without breast cancer also receive positive mammographies.

You can hardly see the condition on the left at all because the prior probability of breast cancer is so small: only 1%. And even though the mammography is fairly accurate (only a 20% rate of false negatives and a 9.6% rate of false positives), we still don’t have much reason to think the woman has breast cancer after a positive test, because the prior probability is so low. Even after adjusting upward because of the positive test result, the posterior probability that she has breast cancer is still only 7.8%.

Still, how does the test give us useful information? As the above visualization shows, the test eliminates more of the women without breast cancer than with breast cancer. The proportion of the top bar that represents women with breast cancer is small, but the test passes most of this on to the bottom bar, our posterior probability. In contrast, most of the section of the top bar representing women without breast cancer was not passed to the bottom bar. It’s this difference between conditional probabilities that gives us some information with which to update our prior probability to our posterior probability. The evidence of the positive mammography test slides the prior probability of 1% to the posterior probability of 7.8%.

Positive and negative results.

Next, Yudkowsky asks us to imagine a new kind of breast cancer test:

Suppose there’s yet another variant of the mammography test, mammography@, which behaves as follows.  1% of women in a certain demographic have breast cancer.  Like ordinary mammography, mammography@ returns positive 9.6% of the time for women without breast cancer.  However, mammography@ returns positive 0% of the time (say, once in a billion) for women with breast cancer.

Here is the graph:

Okay, this one is easy. If a woman gets a positive result on this test, what do you tell her?

If a woman gets a positive result on the mammography@ test, you tell her: “Congratulations! You definitely don’t have breast cancer.”

Mammography@ isn’t a cancer test; it’s a health test! As the visualization shows, very few women get a positive result from a mammography@ test, but no women with breast cancer get a positive result. So if a woman gets a positive mammography@ result, she definitely doesn’t have breast cancer!

What this shows is that what makes a normal mammography test a positive test for breast cancer (not for health) is not that somebody named the mammography test “positive,” but that the test has a certain kind of probability relation to breast cancer. Normal mammography is a “positive” test for breast cancer because a “positive” result of the test increases the chances that the tested woman has breast cancer. But in the case of mammography@, a “positive” result actually decreases the chances she has breast cancer. So mammography@ is not a positive test for breast cancer, but a positive test for the condition of not-having-breast-cancer.

Yudkowsky concludes:

You could call the same result “positive” or “negative” or “blue” or “red” or “James Rutherford”, or give it no name at all, and the test result would still slide the probability in exactly the same way.  To minimize confusion, a test result which slides the probability of breast cancer upward should be called “positive”.  A test result which slides the probability of breast cancer downward should be called “negative”.  If the test result is statistically unrelated to the presence or absence of breast cancer – if the two conditional probabilities are equal – then we shouldn’t call the procedure a “cancer test”!  The meaning of the test is determined by the two conditional probabilities; any names attached to the results are simply convenient labels.

Now, note that mammography@ is rarely useful. Most of the time, it gives a negative result, which gives very weak evidence that doesn’t allow us to slide our probability (that the woman has cancer) very far away from the prior probability. Only on rare occasions (a positive result) does it give us strong evidence. But when it does give us strong evidence, it is very strong evidence, for it allows us to conclude with certainty that the tested woman does not have breast cancer.
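Here is mammography@ in numbers, treating “once in a billion” as a positive rate of one in a billion for women who do have breast cancer, as in the quoted problem:

p_cancer = 0.01
p_pos_if_cancer = 1e-9    # mammography@ essentially never returns positive for cancer
p_pos_if_healthy = 0.096

# A positive result slides the probability of cancer essentially all the way to zero.
p_positive = p_cancer * p_pos_if_cancer + (1 - p_cancer) * p_pos_if_healthy
print(p_cancer * p_pos_if_cancer / p_positive)        # about 1e-10

# A negative result is only weak evidence: it slides 1% up to about 1.1%.
p_negative = p_cancer * (1 - p_pos_if_cancer) + (1 - p_cancer) * (1 - p_pos_if_healthy)
print(p_cancer * (1 - p_pos_if_cancer) / p_negative)  # about 0.011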

How the quantities relate.

Let’s return to our original mammography story:

Only 1% of women at age forty who participate in a routine mammography test have breast cancer. 80% of women who have breast cancer will get positive mammographies, but 9.6% of women who don’t have breast cancer will also get positive mammographies. A woman of this age had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?

Let’s look at all the different quantities involved (taken from Yudkowsky’s page):

p(cancer) = 0.01 (Group 1: 100 women with breast cancer)
p(~cancer) = 0.99 (Group 2: 9,900 women without breast cancer)
p(positive|cancer) = 80.0% (80% of women with breast cancer have positive mammographies)
p(~positive|cancer) = 20.0% (20% of women with breast cancer have negative mammographies)
p(positive|~cancer) = 9.6% (9.6% of women without breast cancer have positive mammographies)
p(~positive|~cancer) = 90.4% (90.4% of women without breast cancer have negative mammographies)
p(cancer&positive) = 0.008 (Group A: 80 women with breast cancer and positive mammographies)
p(cancer&~positive) = 0.002 (Group B: 20 women with breast cancer and negative mammographies)
p(~cancer&positive) = 0.095 (Group C: 950 women without breast cancer and positive mammographies)
p(~cancer&~positive) = 0.895 (Group D: 8,950 women without breast cancer and negative mammographies)
p(positive) = 0.103 (1,030 women with positive results)
p(~positive) = 0.897 (8,970 women with negative results)
p(cancer|positive) = 7.80% (chance you have breast cancer if your mammography is positive)
p(~cancer|positive) = 92.20% (chance you are healthy if your mammography is positive)
p(cancer|~positive) = 0.22% (chance you have breast cancer if your mammography is negative)
p(~cancer|~positive) = 99.78% (chance you are healthy if your mammography is negative)
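As a check, here is a short Python sketch that rebuilds the whole table above from just the three numbers the story gives us:

p_cancer = 0.01
p_pos_given_cancer = 0.80
p_pos_given_healthy = 0.096

# The four joint probabilities (Groups A, B, C, D).
a = p_cancer * p_pos_given_cancer               # p(cancer&positive)   = 0.008
b = p_cancer * (1 - p_pos_given_cancer)         # p(cancer&~positive)  = 0.002
c = (1 - p_cancer) * p_pos_given_healthy        # p(~cancer&positive)  ~ 0.095
d = (1 - p_cancer) * (1 - p_pos_given_healthy)  # p(~cancer&~positive) ~ 0.895

p_pos = a + c   # p(positive)  ~ 0.103
p_neg = b + d   # p(~positive) ~ 0.897

print(a / p_pos)  # p(cancer|positive)   ~ 0.078
print(c / p_pos)  # p(~cancer|positive)  ~ 0.922
print(b / p_neg)  # p(cancer|~positive)  ~ 0.0022
print(d / p_neg)  # p(~cancer|~positive) ~ 0.9978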

As you might imagine, it can be easy to mix up one of these quantities with another. p(cancer&positive) is the exact same thing as p(positive&cancer), but p(cancer|positive) is definitely not the same thing or the same value as p(positive|cancer). The probability that a woman has cancer given a positive test result is not the same thing as the probability that a woman will get a positive test result given that she has cancer. If you confuse those two, you’ll get the wrong answer! And of course p(cancer&positive) is entirely different from p(cancer|positive). The probability that a woman has cancer and will get a positive test result is not at all the same thing as the probability that a woman has cancer given that she gets a positive test result.

Later, I’ll present Bayes’ Theorem, and if you stick to the formula, you won’t mix these up. But it helps to know what they all mean, and how they relate to each other.

To see how they relate to each other, consider the “degrees of freedom” between them. What the heck are “degrees of freedom”? Wikipedia sayeth unto you:

In statistics, the number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary.

What does that mean? Let’s look at an example. There is only one degree of freedom between p(cancer) and p(~cancer) because if you know one of them, you know the other. Once you know one of the numbers, there is only one value the other can take; it has one degree of freedom. If p(cancer) is 90%, then p(~cancer) is 10%. If p(~cancer) is 45%, then p(cancer) is 55%. There is “nowhere else to go,” because

p(cancer) + p(~cancer) = 100%

And of course, the same goes for positive and negative tests:

p(positive) + p(~positive) = 100%

Another pair of values that has only one degree of freedom between them is p(positive|cancer) and p(~positive|cancer). Given that a woman has cancer (which is true of both those values), then she will either test positive or she will not test positive. There’s no third option, assuming (as our story does) that she is tested. So once you know one of these values, you know the other. If p(positive|cancer) = 20%, then p(~positive|cancer) must be 80%. Why? Because:

p(positive|cancer) + p(~positive|cancer) = 100%

Remember, it helps to always read these statements aloud. The above statement reads: “The probability that a woman tests positive given that she has cancer, plus the probability that she tests negative given that she has cancer, equals 100%.”

And of course the same is true when looking at cancer or not cancer given a positive test result:

p(cancer|positive) + p(~cancer|positive) = 100%

That reads: “The probability that a woman has cancer given that she tests positive, plus the probability that she doesn’t have cancer given that she tests positive, equals 100%.” If you say it out loud, the truth of the equation becomes obvious.

And likewise:

p(positive|~cancer) + p(~positive|~cancer) = 100%

p(cancer|~positive) + p(~cancer|~positive) = 100%

However, consider the relation between p(positive|cancer) and p(positive|~cancer). It could be the case that, as in the original story, p(positive|cancer) was equal to 80%, while p(positive|~cancer) was equal to 9.6%. In other words, it could be that the probability that a woman would test positive given that she has cancer is 80%, while the probability that a woman would test positive given that she does not have cancer is 9.6%. These two values are independent of each other. It could just as well be that the chance of false negative is not 20% but instead 2%, while at the same time the chance of a false positive is still 9.6%. So these two values, p(positive|cancer) and p(positive|~cancer) are said to have two degrees of freedom. Both numbers could be different, independently.

Let’s take a triplet of values and consider the degrees of freedom between them. Our three values are p(positive&cancer), p(positive|cancer), and p(cancer). How many degrees of freedom are there between them? Since there are three values, there could be as many as three degrees of freedom between them. But let’s check.

In this case, we can calculate one of the other values by looking at the other two. In particular:

p(positive&cancer) = p(positive|cancer) × p(cancer)

Why must this be so? If we know the probability that a woman has cancer, and we know the proportion of those women with cancer who will test positive, then that tells us the probability that a woman has cancer and tests positive. We multiply the probability that a woman has cancer by the probability that a woman tests positive given that she has cancer, and that is the probability that a woman has cancer and tests positive.

Because we can use two of the values to calculate the third, there are only two degrees of freedom between these three values: p(positive&cancer), p(positive|cancer), and p(cancer).

The same is true for this arrangement:

p(~positive&cancer) = p(~positive|cancer) × p(cancer)

If we know the probability that a woman has cancer, and we know the proportion of those women with cancer who will not test positive, then multiplying the two gives the probability that a woman has cancer and does not test positive.

Let’s consider another triplet of values: p(positive), p(positive&cancer), and p(positive&~cancer). How many degrees of freedom are there between them? It should be rather obvious that:

p(positive&cancer) + p(positive&~cancer) = p(positive)

Every woman who tests positive either has cancer and tests positive or doesn’t have cancer and tests positive. Those two possibilities add up to account for 100% of the women who test positive. So you can use them to calculate the total percentage of women who test positive for breast cancer. So there are only two degrees of freedom between p(positive), p(positive&cancer), and p(positive&~cancer).

Now, consider this set of four values: p(positive&cancer), p(positive&~cancer), p(~positive&cancer), and p(~positive&~cancer). At first glance, it might look like there are only two degrees of freedom here, because it seems you could calculate all four values from just two others: p(positive) and p(cancer). For example: p(positive&~cancer) = p(positive) × p(~cancer).

But this is actually wrong! Notice that the above equation is only true if p(positive) and p(~cancer) are statistically independent. The above equation is only true if the probability of a woman having cancer has no bearing on her chances of testing positive for cancer. But that’s not the case! According to our story, she is more likely to test positive if she does have breast cancer than if she doesn’t have breast cancer.

But a simpler way of seeing why this is wrong may be to notice that these four values correspond to four groups of different women, and of course there could be different numbers of women in each group. We could have 500 women in the has cancer and tests positive group (group A, let’s call it), 150 women in the has cancer and tests negative group (group B), 50 women in the has no cancer and tests positive group (group C), and 900 women in the has no cancer and tests negative group (group D). And each of these values could be different, independent of all the others.

So now you’re thinking this set of four values has four degrees of freedom, and you’d be right if it wasn’t for the fact that all four add up to 100% of the women. That is, all four of these probabilities must add up to 100%. For example, in the above paragraph I put 500 women in group A, 150 women in group B, 50 women in group C, and 900 women in group D. That makes for a total of 1,600 women. Thus, the probability that a woman belongs to group A is 31.25%. That is, p(A) = 31.25%. Moving on, p(B) = 9.375%, p(C) = 3.125%, and p(D) = 56.25%. And, not surprisingly, those four percentages add up to 100%.
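Spelled out in Python, using the made-up 500/150/50/900 split from the paragraph above:

group_a = 500  # cancer, tests positive
group_b = 150  # cancer, tests negative
group_c = 50   # no cancer, tests positive
group_d = 900  # no cancer, tests negative

total = group_a + group_b + group_c + group_d   # 1,600 women in all
for name, count in (("A", group_a), ("B", group_b), ("C", group_c), ("D", group_d)):
    print(name, count / total)  # 0.3125, 0.09375, 0.03125, 0.5625 -- and they sum to 1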

Because of this, we can use three of the values to calculate the fourth, because we know the total of all of them is going to add up to 100%, and this costs us one degree of freedom:

p(positive&cancer) + p(positive&~cancer) + p(~positive&cancer) + p(~positive&~cancer) = 100%

In fact, once you have all four groups (A = cancer and tests positive, B = cancer and tests negative, C = no cancer and tests positive, and D = no cancer and tests negative), you can easily use them to calculate all the other values. For example:

p(cancer|positive) = A / (A + C)

The probability that a woman has cancer given a positive test result is, of course, the probability that she has cancer and tests positive (group A), divided by the probability that she tests positive (A + C):

p(positive) = A + C

And, the probability that a woman has cancer given that she tests negative is the probability that she has cancer and tests negative (group B), divided by the probability that she tests negative (B + D):

p(cancer|~positive) = B / (B + D)

p(~positive) = B + D

Finally, the probability that a woman has cancer is equal to the probability that she has cancer and tests positive (group A) plus the probability that she has cancer and tests negative (group B):

p(cancer) = A + B

Likewise, the probability that she doesn’t have cancer is equal to the probability that she doesn’t have cancer and tests positive (group C) plus the probability that she doesn’t have cancer and tests negative (group D):

p(~cancer) = C + D

If we translate the letters into probability symbols, we’ve just explained the following equations:

p(cancer|positive) = p(cancer&positive) / [p(cancer&positive) + p(~cancer&positive)]

p(positive) = p(cancer&positive) + p(~cancer&positive)

p(cancer|~positive) = p(cancer&~positive) / [p(cancer&~positive) + p(~cancer&~positive)]

p(~positive) = p(cancer&~positive) + p(~cancer&~positive)

p(cancer) = p(cancer&positive) + p(cancer&~positive)

p(~cancer) = p(~cancer&positive) + p(~cancer&~positive)
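Here are those equations checked against the breast cancer numbers, starting from nothing but the four joint probabilities in the table:

a = 0.008  # p(cancer&positive)
b = 0.002  # p(cancer&~positive)
c = 0.095  # p(~cancer&positive)
d = 0.895  # p(~cancer&~positive)

print(a / (a + c))  # p(cancer|positive)  ~ 0.078
print(a + c)        # p(positive)         = 0.103
print(b / (b + d))  # p(cancer|~positive) ~ 0.0022
print(b + d)        # p(~positive)        = 0.897
print(a + b)        # p(cancer)           = 0.01
print(c + d)        # p(~cancer)          = 0.99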

And since we can calculate all the values we want if we have A, B, C, and D, and since A, B, C, and D have three degrees of freedom, it follows that all 16 values in the problem (see the table above) have three degrees of freedom.

But that should not surprise you, since you already knew that you could solve these types of problems with just three pieces of information: the prior probability and the two conditional probabilities.

Now that you understand the relations between the 16 different quantities in this kind of problem, let’s try another kind of story problem offered by Yudkowsky:

Suppose you have a large barrel containing a number of plastic eggs.  Some eggs contain pearls, the rest contain nothing.  Some eggs are painted blue, the rest are painted red.  Suppose that 40% of the eggs are painted blue, 5/13 of the eggs containing pearls are painted blue, and 20% of the eggs are both empty and painted red.  What is the probability that an egg painted blue contains a pearl?

Try to solve it using the relations we discovered above.

What pieces of information do we have? We know that 40% of the eggs are painted blue:

p(blue) = 40%

We also know that 5/13 of the eggs containing pearls are blue:

p(blue|pearl) = 5/13

And we also know that 20% of the eggs are both empty and red:

p(~blue&~pearl) = 20%

The piece of information we want to solve for is:

p(pearl|blue) = ?

Okay, how do we get that posterior probability? Since you’re new to Bayes’ Theorem, you’re probably not sure what the fastest way to the answer is, so let’s just start filling in the most obvious values we can, among those 16 values of the problem:

  • p(blue) = 40% …given in the story
  • p(~blue) =
  • p(pearl) =
  • p(~pearl) =
  • p(pearl&blue) =
  • p(pearl&~blue) =
  • p(~pearl&blue) =
  • p(~pearl&~blue) = 20% …given in the story
  • p(blue|pearl) = 5/13 …given in the story
  • p(~blue|pearl) =
  • p(blue|~pearl) =
  • p(~blue|~pearl) =
  • p(pearl|blue) = ???
  • p(~pearl|blue) =
  • p(pearl|~blue) =
  • p(~pearl|~blue) =

How do we fill in more of the values? Well, let’s check all the relations between the values we have discovered, and see what we can solve for. Here are the rules we discovered when discussing the breast cancer story problem:

  • p(cancer) + p(~cancer) = 100%
  • p(positive) + p(~positive) = 100%
  • p(positive|cancer) + p(~positive|cancer) = 100%
  • p(cancer|positive) + p(~cancer|positive) = 100%
  • p(positive|~cancer) + p(~positive|~cancer) = 100%
  • p(cancer|~positive) + p(~cancer|~positive) = 100%
  • p(positive&cancer) = p(positive|cancer) × p(cancer)
  • p(~positive&cancer) = p(~positive|cancer) × p(cancer)
  • p(positive&cancer) + p(positive&~cancer) = p(positive)
  • p(positive&cancer) + p(positive&~cancer) + p(~positive&cancer) + p(~positive&~cancer) = 100%
  • p(cancer|positive) = p(cancer&positive) / [p(cancer&positive) + p(~cancer&positive)]
  • p(positive) = p(cancer&positive) + p(~cancer&positive)
  • p(cancer|~positive) = p(cancer&~positive) / [p(cancer&~positive) + p(~cancer&~positive)]
  • p(~positive) = p(cancer&~positive) + p(~cancer&~positive)
  • p(cancer) = p(cancer&positive) + p(cancer&~positive)
  • p(~cancer) = p(~cancer&positive) + p(~cancer&~positive)

But now, let’s translate those rules into talking about blue and pearl, by replacing every occurrence of cancer (what we were trying to detect) with pearl (what we’re now trying to detect), and by replacing every occurrence of positive (our previous test) with blue (our current test):

  • p(pearl) + p(~pearl) = 100%
  • p(blue) + p(~blue) = 100%
  • p(blue|pearl) + p(~blue|pearl) = 100%
  • p(pearl|blue) + p(~pearl|blue) = 100%
  • p(blue|~pearl) + p(~blue|~pearl) = 100%
  • p(pearl|~blue) + p(~pearl|~blue) = 100%
  • p(blue&pearl) = p(blue|pearl) × p(pearl)
  • p(~blue&pearl) = p(~blue|pearl) × p(pearl)
  • p(blue&pearl) + p(blue&~pearl) = p(blue)
  • p(blue&pearl) + p(blue&~pearl) + p(~blue&pearl) + p(~blue&~pearl) = 100%
  • p(pearl|blue) = p(pearl&blue) / [p(pearl&blue) + p(~pearl&blue)]
  • p(blue) = p(pearl&blue) + p(~pearl&blue)
  • p(pearl|~blue) = p(pearl&~blue) / [p(pearl&~blue) + p(~pearl&~blue)]
  • p(~blue) = p(pearl&~blue) + p(~pearl&~blue)
  • p(pearl) = p(pearl&blue) + p(pearl&~blue)
  • p(~pearl) = p(~pearl&blue) + p(~pearl&~blue)

Okay, so which rules can we use with the quantities we know already?

Well, the most obvious ones we can use right away are:

p(blue) + p(~blue) = 100%

and

p(blue|pearl) + p(~blue|pearl) = 100%

That gives us 60% for p(~blue) and 8/13 for p(~blue|pearl).

Here’s another. Remember that:

p(~blue) = p(pearl&~blue) + p(~pearl&~blue)

Well, we know what two of those values are, so:

60% = 20% + p(pearl&~blue)

Which means:

p(pearl&~blue) = 60% – 20% = 40%

So we can add that to our table, too. Now our table looks like this:

  • p(blue) = 40%
  • p(~blue) = 60% …because p(blue) + p(~blue) = 100%
  • p(pearl) =
  • p(~pearl) =
  • p(pearl&blue) =
  • p(pearl&~blue) = 40% …because p(~blue) = p(pearl&~blue) + p(~pearl&~blue)
  • p(~pearl&blue) =
  • p(~pearl&~blue) = 20%
  • p(blue|pearl) = 5/13
  • p(~blue|pearl) = 8/13 …because p(blue|pearl) + p(~blue|pearl) = 100%
  • p(blue|~pearl) =
  • p(~blue|~pearl) =
  • p(pearl|blue) = ???
  • p(~pearl|blue) =
  • p(pearl|~blue) =
  • p(~pearl|~blue) =

Now, what else can we do? Go through the value relations listed above and see if you can find one that you now have enough data to solve for.

Here’s one:

p(pearl|~blue) = p(pearl&~blue) / [p(pearl&~blue) + p(~pearl&~blue)]

Filling that in with the values we already have, we get:

p(pearl|~blue) = 40% / (40% + 20%)

So:

p(pearl|~blue) = 2/3

Which means we can solve for p(~pearl|~blue) also, because:

p(pearl|~blue) + p(~pearl|~blue) = 100%

Therefore:

p(~pearl|~blue) = 1/3

And now our table of values looks like this:

  • p(blue) = 40%
  • p(~blue) = 60%
  • p(pearl) =
  • p(~pearl) =
  • p(pearl&blue) =
  • p(pearl&~blue) = 40%
  • p(~pearl&blue) =
  • p(~pearl&~blue) = 20%
  • p(blue|pearl) = 5/13
  • p(~blue|pearl) = 8/13
  • p(blue|~pearl) =
  • p(~blue|~pearl) =
  • p(pearl|blue) = ???
  • p(~pearl|blue) =
  • p(pearl|~blue) = 2/3 …because p(pearl|~blue) = p(pearl&~blue) / [p(pearl&~blue) + p(~pearl&~blue)]
  • p(~pearl|~blue) = 1/3 …because p(pearl|~blue) + p(~pearl|~blue) = 100%

We are making progress! And here’s another equation we can now solve for:

p(~blue&pearl) = p(~blue|pearl) × p(pearl)

So:

40% = (8/13) × p(pearl)

And therefore:

p(pearl) = 40% / (8/13) = (2/5) / (8/13) = 13/20 = 65%

Now, that we know p(pearl), we can also solve this one from our list of known equations:

p(blue&pearl) = p(blue|pearl) × p(pearl)

So:

p(blue&pearl) = (5/13) × (13/20) = 1/4 = 25%

Updating our table of values, we now have:

  • p(blue) = 40%
  • p(~blue) = 60%
  • p(pearl) = 65% …because p(~blue&pearl) = p(~blue|pearl) × p(pearl)
  • p(~pearl) =
  • p(pearl&blue) = 25% …because p(blue&pearl) = p(blue|pearl) × p(pearl)
  • p(pearl&~blue) = 40%
  • p(~pearl&blue) =
  • p(~pearl&~blue) = 20%
  • p(blue|pearl) = 5/13
  • p(~blue|pearl) = 8/13
  • p(blue|~pearl) =
  • p(~blue|~pearl) =
  • p(pearl|blue) = ???
  • p(~pearl|blue) =
  • p(pearl|~blue) = 2/3
  • p(~pearl|~blue) = 1/3

The last bit is easy, because:

p(blue) = p(pearl&blue) + p(~pearl&blue)

Which gives us:

40% = 25% + p(~pearl&blue)

And therefore:

p(~pearl&blue) = 40% – 25% = 15%

So now we can finally solve for p(pearl|blue), because:

p(pearl|blue) = p(pearl&blue) / [p(pearl&blue) + p(~pearl&blue)]

Which gives us:

p(pearl|blue) = 25% / (25% + 15%)

And therefore:

p(pearl|blue) = 25% / 40%

Which results in:

p(pearl|blue) = 62.5%

In fact, we didn’t need to calculate all the values we did. But it was good practice. :)
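If you would like to double-check that arithmetic by machine, here is a small Python sketch (mine) that follows the same chain of steps using exact fractions:

    # Re-doing the egg/pearl calculation with exact fractions.
    from fractions import Fraction

    p_blue = Fraction(2, 5)               # given: p(blue) = 40%
    p_blue_given_pearl = Fraction(5, 13)  # given: p(blue|pearl)
    p_empty_and_red = Fraction(1, 5)      # given: p(~pearl&~blue) = 20%

    p_red = 1 - p_blue                               # p(~blue) = 60%
    p_pearl_and_red = p_red - p_empty_and_red        # p(pearl&~blue) = 40%
    p_red_given_pearl = 1 - p_blue_given_pearl       # p(~blue|pearl) = 8/13
    p_pearl = p_pearl_and_red / p_red_given_pearl    # p(pearl) = 13/20 = 65%
    p_pearl_and_blue = p_blue_given_pearl * p_pearl  # p(pearl&blue) = 1/4 = 25%

    print(p_pearl, p_pearl_and_blue / p_blue)  # 13/20 and 5/8, i.e. 65% and 62.5%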

But now, let’s check our calculations. Do they make sense? Here’s the original story problem:

Suppose you have a large barrel containing a number of plastic eggs.  Some eggs contain pearls, the rest contain nothing.  Some eggs are painted blue, the rest are painted red.  Suppose that 40% of the eggs are painted blue, 5/13 of the eggs containing pearls are painted blue, and 20% of the eggs are both empty and painted red.  What is the probability that an egg painted blue contains a pearl?

Remember we found that p(pearl) is 13/20. That’s our prior probability: 65%. There’s a 65% chance that an egg has a pearl, even before we run the test of seeing what color it is.

What are our conditional probabilities? One of them was given to us in the story. The probability we would see a blue egg given that it contained a pearl, p(blue|pearl), is 5/13 – which doesn’t reduce nicely to a decimal. The other conditional probability, the probability that we would see a red egg if it contained a pearl, is p(~blue|pearl) = 8/13.

The comparison that matters is between p(blue|pearl) and p(blue|~pearl). An egg with a pearl is blue with probability 5/13 (about 38%), while an empty egg is blue with probability p(~pearl&blue) / p(~pearl) = 15% / 35% (about 43%). So a blue shell is slightly more typical of empty eggs than of pearl eggs. That means that when we discover that the egg we picked from the barrel is blue, it becomes slightly less likely to contain a pearl than was the case before we knew its color. So after running the test and discovering the egg we picked is blue, the probability that our egg contains a pearl slides down just a little.

And hey! That’s just what we see. Our prior probability that our chosen egg contains a pearl was 65%, and according to our calculations the posterior probability that our chosen egg contains a pearl is slightly lower at 62.5%.

So yes, our math seems to fit what we would expect given what we’ve learned about how these types of situations work.

Likelihood ratios.

Having worked a few of these problems now, you might have noticed that strong but rare evidence pushing in one direction must be balanced by weak but common evidence pushing in the opposite direction. This is because, using the breast cancer story problem as an example:

p(cancer) = p(cancer|positive) × p(positive) +  p(cancer|~positive) × p(~positive)

This reads: “The probability that a woman has cancer is equal to [the probability that she has cancer given that she tests positive times the probability that she tests positive] plus [the probability that a woman has cancer given that she tests negative times the probability that she tests negative].”

Thus, if there is rare but strong evidence from one of the conditional probabilities, this must be balanced by common but weak evidence from the other conditional probability, because the two of them must add up to, in this example, p(cancer). Yudkowsky calls this principle the Conservation of Probability.
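Here is a quick numerical check of that identity (my sketch, reusing the mammography numbers: a 1% prior, an 80% hit rate, and a 9.6% false positive rate):

    # Checking Conservation of Probability on the mammography numbers.
    p_cancer = 0.01
    p_pos_given_cancer = 0.80
    p_pos_given_healthy = 0.096

    p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1 - p_cancer)
    p_neg = 1 - p_pos

    p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos        # ~7.8%: rarer result, strong update
    p_cancer_given_neg = (1 - p_pos_given_cancer) * p_cancer / p_neg  # ~0.22%: common result, weak update

    # The weighted average of the two posteriors recovers the prior:
    print(round(p_cancer_given_pos * p_pos + p_cancer_given_neg * p_neg, 10))  # 0.01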

Now, one more term: likelihood ratio. The likelihood ratio compares how likely the test is to come back positive when the condition is present versus when it is absent. Specifically, the likelihood ratio for a positive result is the probability that the test comes back positive given that the condition is present – p(positive|cancer) – divided by the probability that it comes back positive given that the condition is absent – p(positive|~cancer). However, the likelihood ratio for a positive result doesn’t tell us much about what we should do if we get a negative result.

For example, p(pearl|blue) is independent of p(pearl|~blue). Even if we know the likelihood ratio, and therefore know what to do if we get a positive result on our test, that doesn’t tell us what to do with a negative result from our test.
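To see this with the mammography numbers (my sketch): the likelihood ratio for a positive result and the likelihood ratio for a negative result are two different numbers, and knowing one doesn’t give you the other.

    # Likelihood ratios for the mammography test (80% hit rate, 9.6% false positives).
    p_pos_given_cancer = 0.80
    p_pos_given_healthy = 0.096

    lr_positive = p_pos_given_cancer / p_pos_given_healthy              # ~8.33
    lr_negative = (1 - p_pos_given_cancer) / (1 - p_pos_given_healthy)  # ~0.22

    print(round(lr_positive, 2), round(lr_negative, 2))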

To illustrate this further, consider the following problem, again taken from Eliezer:

Suppose that there are two barrels, each containing a number of plastic eggs.  In both barrels, 40% of the eggs contain pearls and the rest contain nothing.  In both barrels, some eggs are painted blue and the rest are painted red.  In the first barrel, 30% of the eggs with pearls are painted blue, and 10% of the empty eggs are painted blue.  In the second barrel, 90% of the eggs with pearls are painted blue, and 30% of the empty eggs are painted blue.  [Assuming you like pearls, would you rather] have a blue egg from the first or second barrel?  Would you rather have a red egg from the first or second barrel?

This time, we need to calculate the probability that a blue egg from the 1st barrel contains a pearl, and compare it to the probability that a blue egg from the 2nd barrel contains a pearl. For the second question, we calculate the probability that a red egg from the 1st barrel contains a pearl, and compare it to the probability that a red egg from the 2nd barrel contains a pearl.

In both barrels, 40% of the eggs contain pearls. So our prior probability, p(pearl), is 40% for either barrel.

And if you’ve made it this far instead of skipping ahead, it might be intuitively obvious to you that we don’t care whether a blue egg comes from the first or second barrel, because:

In the first barrel, p(blue|pearl) / p(blue|~pearl) = 30/10

In the second barrel, p(blue|pearl) / p(blue|~pearl) = 90/30

…which is the same ratio: they both equal exactly three. And since the prior probability – specifically, p(pearl) – is the same for both barrels, and the ratio between the conditional probabilities is the same for both barrels, p(pearl|blue) is going to be the same for both barrels. So would we rather have a blue egg from the first or second barrel? We don’t care: p(pearl|blue) is the same in either case.

But what about a red egg? Would we rather have a red egg from the first or second barrel? That is: for which barrel is p(pearl|~blue) higher?

In the first barrel, 70% of the eggs with pearls are painted red, and 90% of the empty eggs are painted red. But in the second barrel, 10% of the eggs with pearls are painted red, while 70% of the empty eggs are painted red. Here, the ratio between the conditional probabilities is different for the first barrel than for the second barrel. Specifically:

In the first barrel, p(~blue|pearl) / p(~blue|~pearl) = 70/90

In the second barrel, p(~blue|pearl) / p(~blue|~pearl) = 10/70

Since the ratio of the conditional probabilities for barrel #1 is different than the ratio of the conditional probabilities for barrel #2, we can tell that we are going to prefer to get a red from one barrel over the other. And without doing the math, we can tell p(pearl|~blue) is going to be higher for barrel #1, so we’d rather get a red egg from barrel #1 than from barrel #2. The ratio of p(~blue|pearl) to p(~blue|~pearl) is higher for barrel #1 than for barrel #2.
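If you want to see the actual numbers, here is a small sketch (mine) that computes all four posteriors as a joint probability divided by the probability of the observed color:

    # The two-barrel problem: chance of a pearl, given the egg's color.
    def posterior(prior, p_color_given_pearl, p_color_given_empty):
        joint = p_color_given_pearl * prior
        return joint / (joint + p_color_given_empty * (1 - prior))

    prior = 0.40  # p(pearl) is 40% in both barrels

    # Barrel 1: 30% of pearl eggs are blue, 10% of empty eggs are blue.
    print(round(posterior(prior, 0.30, 0.10), 3))  # blue egg, barrel 1: 0.667
    print(round(posterior(prior, 0.70, 0.90), 3))  # red egg,  barrel 1: 0.341

    # Barrel 2: 90% of pearl eggs are blue, 30% of empty eggs are blue.
    print(round(posterior(prior, 0.90, 0.30), 3))  # blue egg, barrel 2: 0.667 (same!)
    print(round(posterior(prior, 0.10, 0.70), 3))  # red egg,  barrel 2: 0.087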

Again, you must be reading these out loud: “The ratio of the probability of drawing a red egg given that it contains a pearl to the probability of drawing a red egg given that it is empty is higher for barrel #1 than for barrel #2, so we’d rather get a red egg from barrel #1 than from barrel #2 (assuming we want a pearl).”

This problem illustrates the fact that p(pearl|blue) and p(pearl|~blue) have two degrees of freedom between them, even when p(pearl) is fixed. For both barrels above, p(pearl) was the same, but that did not mean that the ratio of p(pearl|blue) to p(pearl|~blue) was the same for both barrels, because p(blue) was different between the two barrels. As Yudkowsky puts it:

In the second barrel, the proportion of blue eggs containing pearls is the same as in the first barrel, but a much larger fraction of eggs are painted blue!  This alters the set of red eggs in such a way that the proportions [between the conditional probabilities] do change.

Back to the breast cancer test:

The likelihood ratio of a medical test – the number of true positives divided by the number of false positives – tells us everything there is to know about the meaning of a positive result.  But it doesn’t tell us the meaning of a negative result, and it doesn’t tell us how often the test is useful.  For example, a mammography with a hit rate of 80% for patients with breast cancer and a false positive rate of 9.6% for healthy patients has the same likelihood ratio as a test with an 8% hit rate and a false positive rate of 0.96%.  Although these two tests have the same likelihood ratio, the first test is more useful in every way – it detects disease more often, and a negative result is stronger evidence of health.

The likelihood ratio for a positive result summarizes the differential pressure of the two conditional probabilities for a positive result, and thus summarizes how much a positive result will slide the prior probability…

Of course the likelihood ratio can’t tell the whole story; the likelihood ratio and the prior probability together are only two numbers, while the problem has three degrees of freedom.

Decibels of evidence.

The late great Bayesian master E.T. Jaynes once suggested that evidence should be measured in decibels.

Why decibels?

Decibels measure exponential differences in sound intensity, just like the Richter scale measures exponential differences in the ground shaking produced by earthquakes. On the Richter scale, a magnitude 7 earthquake doesn’t produce merely a bit more ground motion than a magnitude 6 earthquake, but ten times more. And a magnitude 8 earthquake produces 100 times more ground motion than a magnitude 6 earthquake.

Likewise, if the quietest sound we can hear is 0 decibels, then a whisper is about 20 decibels and a normal conversation is about 60 decibels. The normal conversation doesn’t carry three times as much sound intensity as the whisper, but rather 10,000 times as much, because it is 40 decibels louder.

To get the decibels of a sound:

decibels = 10 × log10(intensity)

Allow me a brief aside to make sure we all remember how logarithms work.

I’m sure we all remember how exponents work. If you have a base of 5 and its exponent is 3, that looks like this: 5³. And that’s just a quick way of saying 5 × 5 × 5. This operation of taking a “base” and “raising it to the power of” an exponent is called exponentiation. Well, taking the logarithm of a number is basically the inverse of exponentiation. 5³ asks “What is 5 to the 3rd power?” whereas log₅25 asks “5 to the what power equals 25?” Since 5 to the 2nd power equals 25, log₅25 = 2. If you “evaluate” the expression log₅25 (“logarithm, base 5, of 25”), your answer is the power to which you’d have to raise the base (5, in this case) in order to get the number of which you’re taking the logarithm.

Whenever I see the phrase logₓn, I read it as “x to the what power equals n?” So log₄64 reads “4 to the what power equals 64?” The answer, obviously, is that 4 to the 3rd power equals 64, so log₄64 = 3.
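If you would rather poke at this with code, here is a tiny sketch (mine) using Python’s math module:

    import math

    print(5 ** 3)                   # exponentiation: 125
    print(math.log(25, 5))          # "5 to the what power equals 25?" -> 2.0 (up to float rounding)
    print(math.log(64, 4))          # "4 to the what power equals 64?" -> 3.0 (up to float rounding)
    print(10 * math.log10(10_000))  # a 10,000x intensity ratio is 40 decibels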

If this isn’t clear, watch the short video here.

So anyway:

decibels = 10 × log10(intensity)

…just reads “The decibel measure of a sound is equal to 10 times the log, base 10, of the intensity.”

Understanding logarithms will help us get the feel for what it means to think of evidence in terms of decibels (in terms of exponents).

Back to the medical story. Suppose we start with a 1% prior probability that a woman has breast cancer. Then we administer three different tests for breast cancer, and each test has a different likelihood ratio. The likelihood ratios of the tests are 25:3, 18:1, and 7:2.

If we were to take Jaynes’ advice literally and measure our prior probability in decibels, we’d get:

10 × log10(1/99) = -20 decibels of evidence that a woman has breast cancer

I don’t really care if you can calculate that. I always use a calculator for that kind of thing. What I’m hoping is that you can get the feel of working with “decibels” of evidence.

Now, let’s say we administer the first test, the one with a likelihood ratio of 25/3, and the woman tests positive. This gives us 9 positive decibels of evidence that she has breast cancer, because:

10 × log10(25/3) = +9 decibels of evidence that a woman has breast cancer

Next we administer the second test, and she tests positive again!

10 × log10(18/1) = +13 decibels of evidence that a woman has breast cancer

She also tests positive on the third test:

10 × log10(7/2) = +5 decibels of evidence that a woman has breast cancer

The poor woman started out with a very low probability of having breast cancer, but now she has tested positive on three pretty effective tests in a row. Things are not looking good! She started out with -20 decibels of evidence that she had breast cancer, but the three tests added 27 decibels of evidence (9+13+5) in favor of her having breast cancer, so we now have +7 decibels of evidence that she has breast cancer. On a linear scale, +7 might seem like a small number, but on an exponential scale, +7 decibels of evidence means that there is now an 83% chance that she has cancer!

Notice that +7 decibels of evidence is not as large as -20 decibels of evidence is small. The original -20 decibels of evidence meant it was 99% likely she did not have breast cancer, but +7 decibels of evidence means it is 83% likely she has breast cancer. Of course, +20 decibels of evidence would mean it was 99% likely she had breast cancer.
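Here is the same decibel bookkeeping as a short sketch (mine). Note that the figures above round each test to whole decibels; without rounding, the posterior comes out closer to 84%.

    # Decibels of evidence: prior, three positive tests, posterior.
    import math

    def to_decibels(odds):
        return 10 * math.log10(odds)

    def to_probability(decibels):
        odds = 10 ** (decibels / 10)
        return odds / (1 + odds)

    prior_db = to_decibels(1 / 99)                                        # 1% prior -> about -20 dB
    evidence_db = sum(to_decibels(lr) for lr in (25 / 3, 18 / 1, 7 / 2))  # about +27 dB
    posterior_db = prior_db + evidence_db                                 # about +7 dB

    print(round(prior_db), round(evidence_db), round(posterior_db))  # -20 27 7
    print(round(to_probability(posterior_db), 2))                    # 0.84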

Now that you understand the exponential power that evidence has in probabilistic reasoning, try to estimate the answer to this problem – which I paraphrased from Yudkowsky – without writing out all the math:

In front of you is a bookbag containing 1,000 poker chips.  I started out with two such bookbags, one containing 700 red chips and 300 blue chips, the other containing 300 red chips and 700 blue chips.  I flipped a fair coin to determine which bookbag to show you, so your prior probability that the bookbag in front of you is the mostly-red bookbag is 50%.  Now, you close your eyes and reach your hand into the bag and take out a chip at random. You look at its color, write it down, and then put it back into the bag and mix the chips up with your hand. You do this 12 times, and out of those 12 “samples” you get 8 red chips and 4 blue chips. What is the probability that this is the mostly-red bag?

Stop here and think about the problem in your head and make a rough guess at the answer.

According to a study by Ward Edwards and Lawrence Phillips, most people faced with this problem give an answer between 70% and 80%. Was your estimate higher than that? If so, congratulations! The correct answer is about 97%.

Without doing the math, here’s how your intuition might have arrived at roughly the right answer. Given the proportions in the problem, the likelihood ratio for drawing a red chip is 7/3, while the likelihood ratio for drawing a blue chip is 3/7. Thus each kind of draw pushes our final probability with the same degree of force, but of course drawing a red chip pushes p(mostly-red bag) in the opposite direction from drawing a blue chip.

If you draw one red chip, put it back, and then draw a blue chip, these two pieces of evidence have cancelled each other out – but only because the two likelihood ratios, 7/3 and 3/7, are exact reciprocals of each other. If you draw a red chip and then a blue chip and that’s all the evidence you have, then your probability that the bag in front of you is the mostly-red bag is back to 50%, right where it started.

You drew 12 chips, and got four more red chips than blue chips. That is several “decibels” of evidence in favor of the bag being the mostly-red bag, which is quite a lot of evidence. When you get rid of the red and blue chips that “cancel” each other, every single red chip you have left over pushes your probability that the bag is the mostly-red bag with the strength of a likelihood ratio of 7/3! So even without doing the math you know the final probability that the bag is the mostly-red bag is going to be pretty darned high.

If the likelihood ratio of your positive test is 7/3 and you have four more positive tests than negative ones, it turns out that you can calculate your odds like so:

7⁴:3⁴ = 2401:81

Which is about 30:1, near 97%.
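Here is that calculation as a short sketch (mine):

    # The bookbag problem: 8 red draws and 4 blue draws, likelihood ratio 7/3 per red.
    prior_odds = 1                 # 50:50 prior that this is the mostly-red bag
    lr_red, lr_blue = 7 / 3, 3 / 7

    posterior_odds = prior_odds * lr_red ** 8 * lr_blue ** 4  # same as (7/3)**4
    print(round(posterior_odds, 1))                           # 29.6 (about 30 to 1)
    print(round(posterior_odds / (1 + posterior_odds), 3))    # 0.967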

Behold, Bayes’ Theorem!

Okay. I think you’re starting to get a feel for how these types of probabilities work. Now, let’s work through one last problem:

You are a mechanic for gizmos.  When a gizmo stops working, it is due to a blocked hose 30% of the time.  If a gizmo’s hose is blocked, there is a 45% probability that prodding the gizmo will produce sparks.  If a gizmo’s hose is unblocked, there is only a 5% chance that prodding the gizmo will produce sparks.  A customer brings you a malfunctioning gizmo.  You prod the gizmo and find that it produces sparks.  What is the probability that a spark-producing gizmo has a blocked hose?

So we want to solve for p(blocked|sparks), and we already know:

p(blocked) = 30%

p(~blocked) = 70%

p(sparks|blocked) = 45%

p(sparks|~blocked) = 5%

Remember that:

p(sparks|blocked) × p(blocked) = p(sparks&blocked)

So:

p(sparks&blocked) = 45% × 30% =  13.5%

Also remember that:

p(sparks|~blocked) × p(~blocked) = p(sparks&~blocked)

So:

p(sparks&~blocked) = 5% × 70% = 3.5%

Finally, remember that:

p(blocked|sparks) = p(sparks&blocked) / [p(sparks&blocked) + p(sparks&~blocked)]

And therefore:

p(blocked|sparks) = 13.5% / (13.5% + 3.5%)

And the answer is:

p(blocked|sparks) = 79.4%

Now, if we put the arithmetic you did for this problem into one equation, here’s what you just did:

p(blocked|sparks) = [p(sparks|blocked) × p(blocked)] / [p(sparks|blocked) × p(blocked) + p(sparks|~blocked) × p(~blocked)]

The general form of this is:

p(H|E) = [p(E|H) × p(H)] / [p(E|H) × p(H) + p(E|~H) × p(~H)]

That is Bayes’ Theorem.

And because:

p(E) = p(E|H) × p(H) + p(E|~H) × p(~H)

We can reduce Bayes’ Theorem to the following:

p(H|E) = [p(E|H) × p(H)] / p(E)

That formulation is simpler, so you’ll see it more often, though it doesn’t give as clear a picture of what Bayes’ Theorem does as the earlier formulation. But they’re both correct.
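If you prefer code to symbols, here is the theorem as a small Python function (my sketch, not part of the original post), checked against the two story problems:

    # Bayes' Theorem, with the denominator expanded as p(E|H)p(H) + p(E|~H)p(~H).
    def bayes(p_h, p_e_given_h, p_e_given_not_h):
        p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)  # p(E)
        return p_e_given_h * p_h / p_e

    print(round(bayes(0.30, 0.45, 0.05), 3))   # gizmo problem: 0.794
    print(round(bayes(0.01, 0.80, 0.096), 3))  # mammography problem: 0.078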

Given some hypothesis H that we want to investigate, and an observation E that is evidence about H, Bayes’ Theorem tells us how we should update our probability that H is true, given evidence E.

In the medical example, H is “this woman has breast cancer” and E is a positive mammography test result. Bayes’ Theorem tells us what our posterior probability of the woman having breast cancer (H) is, given a positive mammography test (E).

Yudkowsky concludes:

By this point, Bayes’ Theorem may seem blatantly obvious or even tautological, rather than exciting and new.  If so, this introduction has entirely succeeded in its purpose.

So there you have it. Now you understand the famous theorem from Reverend Thomas Bayes. He is proud of you.

Why is this so exciting?

So why should you care? Why does Bayes’ Theorem matter?

Yudkowsky gives the example of someone who thinks mankind will avoid nuclear war for at least another 100 years. When asked why, he said, “All of the players involved in decisions regarding nuclear war are not interested right now.” But why extend that for 100 years? “Because I’m an optimist,” was the reply.

What is it that makes this kind of thinking irrational? What is it about saying “Because I’m an optimist” that gives us no confidence that the claim is correct? (Maybe the claim is true, but we wouldn’t believe so merely because someone says he’s an optimist.)

Yudkowsky explains:

Other intuitive arguments include the idea that “Whether or not you happen to be an optimist has nothing to do with whether [nuclear] warfare wipes out the human species”, or “Pure hope is not evidence about nuclear war because it is not an observation about nuclear war.”

There is also a mathematical reply that is precise, exact, and contains all the intuitions as special cases.  This mathematical reply is known as Bayes’ Theorem.

For example, the reply “Whether or not you happen to be an optimist has nothing to do with whether nuclear warfare wipes out the human species” can be translated into:

p(you’re an optimist | mankind will avoid nuclear war for another century) = p(you’re an optimist | mankind will not avoid nuclear war for another century)

Yudkowsky continues (he uses ‘A’ for the hypothesis and ‘X’ for the evidence, instead of H and E as I did above):

Since the two probabilities for p(X|A) and p(X|~A) are equal, Bayes’ Theorem says that p(A|X) = p(A); as we have earlier seen, when the two conditional probabilities are equal, the revised probability equals the prior probability.  If X and A are unconnected – statistically independent – then finding that X is true cannot be evidence that A is true; observing X does not update our probability for A; saying “X” is not an argument for A.

In this case, the evidence X (that you’re an optimist) does not make us update our probability for A (that nuclear warfare will wipe out the human species within 100 years).
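A quick numerical illustration (mine, with a made-up prior): when the two conditional probabilities are equal, the posterior equals the prior; when they differ, the probability moves.

    # If p(X|A) equals p(X|~A), observing X leaves p(A) unchanged.
    def posterior(p_a, p_x_given_a, p_x_given_not_a):
        p_x = p_x_given_a * p_a + p_x_given_not_a * (1 - p_a)
        return p_x_given_a * p_a / p_x

    p_a = 0.10  # a made-up prior for the hypothesis A, purely for illustration
    print(round(posterior(p_a, 0.6, 0.6), 3))  # 0.1  -- no update
    print(round(posterior(p_a, 0.6, 0.2), 3))  # 0.25 -- unequal likelihoods move the probability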

But suppose the optimist says: “Ah, but since I’m an optimist, I’ll have renewed hope for tomorrow, I’ll work a little harder at my dead-end job, I’ll pump up the global economy a little, and eventually, through the trickle-down effect, I’ll send a few dollars into the pocket of the researcher who will ultimately find a way to stop nuclear warfare – so you see, the two events are related after all, and I can use one as valid evidence about the other.”

Not so fast:

In one sense, this is correct - any correlation, no matter how weak, is fair prey for Bayes’ Theorem; but Bayes’ Theorem distinguishes between weak and strong evidence.  That is, Bayes’ Theorem not only tells us what is and isn’t evidence, it also describes the strength of evidence.  Bayes’ Theorem not only tells us when to revise our probabilities, but how much to revise our probabilities.  A correlation between hope and [nuclear] warfare may exist, but it’s a lot weaker than the speaker wants it to be; he is revising his probabilities much too far.

Statistical methods are judged against the Bayesian method because, well, Bayesian statistics is as good as it gets: “the Bayesian method defines the maximum amount of mileage you can get out of a given piece of evidence, in the same way that thermodynamics defines the maximum amount of work you can get out of a temperature differential.” You’ll also hear cognitive scientists judging decision-making subjects against Bayesian reasoners, such that cognitive biases are defined in terms of departures from ideal Bayesian reasoning.

Yudkowsky concludes:

The Bayesian revolution in the sciences is fueled, not only by more and more cognitive scientists suddenly noticing that mental phenomena have Bayesian structure in them; not only by scientists in every field learning to judge their statistical methods by comparison with the Bayesian method; but also by the idea that science itself is a special case of Bayes’ Theorem; experimental evidence is Bayesian evidence.  The Bayesian revolutionaries hold that when you perform an experiment and get evidence that “confirms” or “disconfirms” your theory, this confirmation and disconfirmation is governed by the Bayesian rules.  For example, you have to take into account, not only whether your theory predicts the phenomenon, but whether other possible explanations also predict the phenomenon.

Previously, the most popular philosophy of science was probably Karl Popper’s falsificationism - this is the old philosophy that the Bayesian revolution is currently dethroning.  Karl Popper’s idea that theories can be definitely falsified, but never definitely confirmed, is yet another special case of the Bayesian rules; if p(X|A) ~ 1 - if the theory makes a definite prediction – then observing ~X very strongly falsifies A.  On the other hand, if p(X|A) ~ 1,  and we observe X, this doesn’t definitely confirm the theory; there might be some other condition B such that p(X|B) ~ 1, in which case observing X doesn’t favor A over B.  For observing X to definitely confirm A, we would have to know, not that p(X|A) ~ 1, but that p(X|~A) ~ 0, which is something that we can’t know because we can’t range over all possible alternative explanations.  For example, when Einstein’s theory of General Relativity toppled Newton’s incredibly well-confirmed theory of gravity, it turned out that all of Newton’s predictions were just a special case of Einstein’s predictions.

You can even formalize Popper’s philosophy mathematically.  The likelihood ratio for X, p(X|A)/p(X|~A), determines how much observing X slides the probability for A; the likelihood ratio is what says how strong X is as evidence.  Well, in your theory A, you can predict X with probability 1, if you like; but you can’t control the denominator of the likelihood ratio, p(X|~A) - there will always be some alternative theories that also predict X, and while we go with the simplest theory that fits the current evidence, you may someday encounter some evidence that an alternative theory predicts but your theory does not.  That’s the hidden gotcha that toppled Newton’s theory of gravity.  So there’s a limit on how much mileage you can get from successful predictions; there’s a limit on how high the likelihood ratio goes for confirmatory evidence.

On the other hand, if you encounter some piece of evidence Y that is definitely not predicted by your theory, this is enormously strong evidence against your theory.  If p(Y|A) is infinitesimal, then the likelihood ratio will also be infinitesimal.  For example, if p(Y|A) is 0.0001%, and p(Y|~A) is 1%, then the likelihood ratio p(Y|A)/p(Y|~A) will be 1:10000.  -40 decibels of evidence!  Or flipping the likelihood ratio, if p(Y|A) is very small, then p(Y|~A)/p(Y|A) will be very large, meaning that observing Y greatly favors ~A over A.  Falsification is much stronger than confirmation.  This is a consequence of the earlier point that very strong evidence is not the product of a very high probability that A leads to X, but the product of a very low probability that not-A could have led to X.  This is the precise Bayesian rule that underlies the heuristic value of Popper’s falsificationism.

Similarly, Popper’s dictum that an idea must be falsifiable can be interpreted as a manifestation of the Bayesian conservation-of-probability rule; if a result X is positive evidence for the theory, then the result ~X would have disconfirmed the theory to some extent.  If you try to interpret both X and ~X as “confirming” the theory, the Bayesian rules say this is impossible!  To increase the probability of a theory you must expose it to tests that can potentially decrease its probability; this is not just a rule for detecting would-be cheaters in the social process of science, but a consequence of Bayesian probability theory.  On the other hand, Popper’s idea that there is only falsification and no such thing as confirmation turns out to be incorrect.  Bayes’ Theorem shows that falsification is very strong evidence compared to confirmation, but falsification is still probabilistic in nature; it is not governed by fundamentally different rules from confirmation, as Popper argued.

So we find that many phenomena in the cognitive sciences, plus the statistical methods used by scientists, plus the scientific method itself, are all turning out to be special cases of Bayes’ Theorem.  Hence the Bayesian revolution.

Welcome to the Bayesian Conspiracy.

Bonus links:

Bonus problem:

(courtesy of Beelzebub)

Morpheus’s left hand has 7 blue pills and 3 red pills and his right hand has 5 blue pills and 8 red pills. You close your eyes and choose a pill that turns out to be red, but you don’t know which hand you took it from. What is the probability that you took it from his right hand?


Alexandros Marinos December 18, 2010 at 6:23 am

Just linked to this over on LessWrong, keep the great material coming!


Taranu December 18, 2010 at 6:51 am

Thank you Luke for this amazing post. I am one of those people who tried to understand Eliezer’s post but couldn’t follow all the way through. I have started to read yours and I have great expectations since I know how good you are with turning a complicated explanation into a simple one.


Luke Muehlhauser December 18, 2010 at 6:58 am

Taranu,

Please don’t shy from letting me know if you get stuck somewhere. I got stuck on Eliezer’s post originally and it took me 4 hours to push through, which is what inspired me to write this post. So let me know if I can make certain parts clearer.


John D December 18, 2010 at 7:07 am

I haven’t read through all of this but it looks impressive. One obvious point that emerges at a quick glance is that your general form of Bayes’ theorem is wrong. The second set of terms under the line should be: Pr(X| not-A) x Pr(not-A)


Silas December 18, 2010 at 7:11 am

“Eliezer’s explanation of this hugely important law of probability is probably the best one on the internet, but I fear it may still be too fast-moving for those who haven’t needed to do even algeba since high school. Eliezer calls it “excruciatingly gentle,” but he must be measuring “gentle” on a scale for people who were reading Feynman at age 9 and doing calculus at age 13 like him.”


Taranu December 18, 2010 at 7:14 am

Thank you Luke. Will do.


Luke Muehlhauser December 18, 2010 at 7:32 am

Damn. I suck at latex and rendered Bayes’ Equation itself incorrectly the first time around! Make sure to grab the new copy of the PDF if you grabbed it earlier!


Luke Muehlhauser December 18, 2010 at 7:36 am

Silas,

Damn! Where was spell-check on that one, eh?

Thanks. :)


Luke Muehlhauser December 18, 2010 at 7:37 am

BTW, does anyone know how to reduce spacing at a certain point in a Latex equation? The latex for Bayes’ Theorem above is:

p(A|X) = \frac{p(X|A) \times p(A)}{p(X|A \times p(A) + p(X|\sim A) \times p(\sim A)}

Which I popped into this editor to get the PNG.

But to my eye that leaves too much space between the ~ symbol and the thing it is negating. Does anyone know how to fix that?


poban December 18, 2010 at 7:39 am

Thanks Luke for the pdf, hope to finish it off in few days(I am very sluggish reader even though I am very quick on finding words on a dictionary).
I am stuck on “950 out of 9,900 women without breast cancer will also get a positive mammography” I dont know how you got “950 out of 9,900 women without breast cancer will also get a positive mammography”from out of 10,000 40 year old women, 80 out of 100 women aged 40 having cancer have mammography. I dont know if I will understand the statement after I go further.


John D December 18, 2010 at 7:40 am

I think it looks fine. I wouldn’t worry about it.


poban December 18, 2010 at 7:45 am

Sorry for the dumb request, 9,6% of the (total women – women having cancer) also have the mammography.


Luke Muehlhauser December 18, 2010 at 7:51 am

poban,

Not sure if I understand your question, but the 950 is just given in the story, it’s not calculated from something else.

Let me know if you still need clarification…


Luke Muehlhauser December 18, 2010 at 8:01 am

Thanks, Alexandros.


NC December 18, 2010 at 8:09 am

@Luke: http://www.personal.ceu.hu/tex/math.htm#spacing

Also, you missed a closing parenthesis in the denominator. Speaking of which, should you mention that the denominator is just P(X)?


ildi December 18, 2010 at 8:11 am

I suck at latex

(heh heh heh)

does anyone know how to reduce spacing at a certain point in a Latex equation?

(oh)


Patrick December 18, 2010 at 8:11 am

I’m only halfway through at this stage, so I can’t give a full review- no time.

But I’ve found that one really good way to explain Bayes to people who don’t like math is to start by explaining the fallacy of Affirming the Consequent. I explain what it is, get them to see why its a fallacy, then point out that even though this argument is bad, it probably feels like it contains the pieces of a good argument. Then I explain about creating a probability based argument that’s almost exactly like the affirmation of the consequent, and explain that Bayes is the tool we use to make these arguments precise, and to put actual numbers on them if we so choose.


Luke Muehlhauser December 18, 2010 at 9:00 am

NC,

Why do I not check the equation before rendering and uploading it? Stupid…

Thanks for the spacing link! After playing with it, I think I’ll leave it like it is.


Luke Muehlhauser December 18, 2010 at 9:26 am

Patrick,

Thanks.

I will probably add additional tutorials on Bayes later, from many different approaches.


Luke Muehlhauser December 18, 2010 at 9:29 am

NC,

Oh, and as for reducing the denominator, I’ll leave that off of this tutorial just because the lemmas between the reduced version and the conditional probabilities discussed earlier are not as clear.


chrispa December 18, 2010 at 10:31 am

poban,

950 out of 9900 is simply (about) the 9.6% he was talking about in the more complicated example before.


Mocha Tea December 18, 2010 at 2:56 pm

Luke,

Thanks for this great post.
Any math books you can recommend? Especially ones that would help understand Bayes and others.


Luke Muehlhauser December 18, 2010 at 3:16 pm

Well, depends what you want to do with your knowledge. Are you wanting to learn statistics? Or Bayesian data analysis? Or the use of Bayes in decision theory and rationality research?


Mocha Tea December 18, 2010 at 3:50 pm

Bayesian data analysis and Bayes in decision theory.


Luke Muehlhauser December 18, 2010 at 3:58 pm

There are lots of books on Bayesian data analysis but I haven’t read them so I don’t know which ones are good. A good place to start with regard to decision theory and rationality in general (including some parts on Bayes) is Dawes’ Rational Choice in an Uncertain World.


sly December 18, 2010 at 7:11 pm

Thanks Luke! This was a very enjoyable post.


Josh December 18, 2010 at 10:09 pm

Jesus Christ I am seriously going to stop using Bayesian statistics in my research because of this shit.


Josh December 18, 2010 at 10:32 pm

@Mocha,

Here’s a link to a good set of lecture notes on Bayesian statistics. http://www.cs.berkeley.edu/~jordan/courses/260-spring10/lectures/index.html


PorNotP December 19, 2010 at 3:35 am

Here is Hurley’s interactive Bayes treatment. It is in section 9.3, part 10;

http://www.wadsworthmedia.com/philosophy/Learning_Logic/c9.swf

Mediocre, but the interactiveness is at least slightly useful.


Taranu December 19, 2010 at 2:08 pm

Luke, I don’t know if during this break from blogging and podcasting you will read and respond to comments, but even if you don’t do so now I guess you will once the break is over.
I am reading your post on Bayes and I have a question regarding the following lines:
“Why? Because there are a lot more women getting false positives (10% of 99.99999% of women) than there are getting true positive (80% of 0.00001% of women). So if a woman gets a positive result, it’s almost certainly a false positive, not a true positive.”

How did you get 99.99999% and 0.00001%? Whenever I do the calculations I get 99.9999% and 0.0001%.


Peter December 19, 2010 at 3:43 pm

Great post. I think one of the things missing from the ‘why is Bayes so important’ is that it is a consequence of thinking of probability as an extension of logic (where things can be true false AND UNCERTAIN GIVEN THE INFORMATION WE HAVE)
Another nit picky thing is that i believe that the way you posed the casino question makes the 70-80% seem a bit more reasonable because you are drawing from the same bag and replacing what you observed, which means that you are more likely to draw what you put back, rather than a ‘new’ chip. So the information is not independent with each draw


Michael December 19, 2010 at 4:21 pm

Luke,

you need to become a lecturer. You make the unreachable reachable. You are amongst the greatest communicators. This is excellent.


Beelzebub December 19, 2010 at 5:02 pm

This exposition is confusing:

Got your answer?

Okay, let’s start by calculating what percentage of women will get a positive test result. 80% of the 1% of women with breast cancer will get a positive result, so that’s 0.8% of women right there. Also, 80% of the 99% of women without breast cancer will get a positive result, so that’s another 79.2%. And since 0.8% + 79.2% = 80%, that means 80% of women will get a positive test result.

Even though only 1% of women actually have breast cancer!

So already you can tell that third piece of information can make a huge difference.

But let’s finish the calculation. We already know that 1% of the 80% of women who test positive actually have breast cancer. That’s 0.8%. Now, since 0.8% is 1% of 80%, that means that when a woman gets a positive mammography*, she still has exactly a 1% chance of having breast cancer!

She started out with a 1% chance of having breast cancer, and after the test she still has a 1% chance of having breast cancer.

How did that happen? Didn’t the test tell us anything, either way?

Nope.

The only way you know that 1% of the positives actually have cancer is because we’ve already divided the .8% true positives by the 80% (79.2+.8) total positives. You can’t just assume from the problem description that 1% of the positives actually have cancer simply because 1% of the population does. If the test were better then far more than 1% of the positives would be true positives. The way you’ve worded it, it looks like you mean to get the 1% from the problem description, which is wrong.


MarkD December 19, 2010 at 10:29 pm

Fun stuff. Reducing the denominator to the standard form (per other comments) might help align the exposition with other sources.

I’m looking forward to detailed treatments of maximum entropy (since Jaynes is mentioned), AIC, BAC, and Solomonoff in the future.


Luke Muehlhauser December 20, 2010 at 8:01 am

Taranu,

Thanks for the correction.


Luke Muehlhauser December 20, 2010 at 8:04 am

Peter,

I’ve added a clarification re: casino chips. Thanks.


Luke Muehlhauser December 20, 2010 at 8:04 am

Glad you find it approachable, Michael!


Luke Muehlhauser December 20, 2010 at 8:09 am

Beelzebub,

Oops, yes, I see my mistake. Thanks!


Luke Muehlhauser December 20, 2010 at 8:27 am

MarkD,

Yeah, I’m definitely planning to do a tutorial post on Kolmogorov Complexity, at least.


Tyrrell McAllister December 20, 2010 at 8:59 am

BTW, does anyone know how to reduce spacing at a certain point in a Latex equation? The latex for Bayes’ Theorem above is:

p(A|X) = \frac{p(X|A) \times p(A)}{p(X|A \times p(A) + p(X|\sim A) \times p(\sim A)}

Which I popped into this editor to get the PNG.

But to my eye that leaves too much space between the ~ symbol and the thing it is negating. Does anyone know how to fix that?

Replacing \sim with \mathord{\sim} works.


Luke Muehlhauser December 20, 2010 at 10:05 am

Tyrrell,

Great! Thanks.


Beelzebub December 20, 2010 at 2:55 pm

Sorry, I know this is annoying, but switch the “don’t” with the “do” in:

Well, there are two groups of women who will get a positive mammography* result: those with a positive result who don’t have cancer (0.8%), and those with a positive result who do have cancer (79.2%). Add those together, and our denominator is 80%.


Beelzebub December 20, 2010 at 3:26 pm

Okay, time for a little comic relief:

Seeing the world through the lens of Bayes’ Theorem is like seeing The Matrix. Nothing is the same after you have seen Bayes.

Morpheus’s left hand has 7 blue pills and 3 red pills and his right hand has 5 blue pills and 8 red pills. You close your eyes and choose a pill that turns out to be red. What is the probability that you took it from his right hand?


Luke Muehlhauser December 20, 2010 at 3:30 pm

Beelzebub,

Not annoying at all! Thanks again!


Luke Muehlhauser December 20, 2010 at 3:33 pm

Beelzebub,

Nice.


Zeb December 20, 2010 at 7:46 pm

Beelzebub, is it 27%?


Beelzebub December 20, 2010 at 9:04 pm

I actually hadn’t done the calculation, but I just did it, and I think your number is way low. Notice I’m not saying what I got :-)


Patrick December 21, 2010 at 5:13 am

The first question is how we calculate the odds of picking a given pill. Since there are 23 pills, are the odds 1 in 23? Or do you have a 50/50 chance of picking a given hand, and then a random chance of getting any particular pill in that hand? Because in that case you have a .05 chance of picking a given pill from the left hand, and about a .038 chance of picking a given pill in the right hand.

I think the latter option makes more sense (when you reach out to pick a pill, there’s a 50/50 chance of getting a given hand, and then afterward you have a randomized chance of getting a given pill within that hand) but its really a question of how the story problem is intended.


Andrew T December 21, 2010 at 8:14 am

There seems to be an arithmetic error (9 in place of 6) here:

—–
Now, let’s say we administer the first test, the one with a likelihood ratio of 25/3, and the woman tests positive. This gives us 9 positive decibels of evidence that she has breast cancer, because:

10 × log10(25/3) = +6 decibels of evidence that a woman has breast cancer

Next we administer the second test, and she tests positive again!

10 × log10(18/1) = +13 decibels of evidence that a woman has breast cancer

She also tests positive on the third test:

10 × log10(7/2) = +5 decibels of evidence that a woman has breast cancer

The poor woman started out with a very low probability of having breast cancer, but now she has tested positive on three pretty effective tests in a row. Things are not looking good! She started out with -20 decibels of evidence that she had breast cancer, but the three tests added 27 decibels of evidence (6+13+5) in favor of her having breast cancer, so we now have +7 decibels of evidence that she has breast cancer.
—–

Fantastic post.


Luke Muehlhauser December 21, 2010 at 9:07 am

Andrew T,

Thanks. Hopefully I’ll have time to update this before Christmas.

Keep the corrections coming, folks!


Zeb December 21, 2010 at 10:25 am

Patrick, since Beelzebub stated that we picked a red pill, don’t we disregard the blue pills? There are 11 red pills, 3 of them being in the right hand, so the chance of having chosen a right hand pill given that is is red is 3/11, 27%.


Zeb December 21, 2010 at 10:36 am

Wait, nevermind. The chance of choosing right+red is 1/2 (for right) x 3/10 (for red) = 0.15. The chance of choosing left+red is 1/2 (for left) x 8/13 (for red) = 0.31. So the chance of having chosen right, given red, is 0.15/(0.15+0.31) = 0.33.


Beelzebub December 21, 2010 at 11:06 am

You guys aren’t applying Bayes theorem. The hypothesis is H = Rh (right hand), and the event E = red (a red pill was chosen).


Luke Muehlhauser December 21, 2010 at 11:47 am

Andrew T,

Fixed, thanks.


Henry December 21, 2010 at 12:57 pm

Luke — great post, as usual. I worked all of the problems in my head and got them all right, so I just skimmed most of the detail until the bookbag problem, which I realized I got right for the wrong reasons. I reasoned that since the experiments had yielded 67% red chips and this was close to the expectation of 70% red chips if I had the majority-red bag, then I could be roughly 96-97% certain because that was roughly the ratio between 67% and 70%. I suspected this was too easy, so I read the detail and realized that if I had picked 10 chips and 7 had been red, by my reasoning I would have estimated that I had the red-majority bag with 100% certainty, whereas the answer is exactly the same as the given example — 97%. So thanks for that insight.

By the way, I believe you have a couple of typos in the logarithm discussion. Twice 5(superscript)3 is rendered as just 53 (at least it appears that way in my browser). Also, the log5 of 25 is 2, not 3. You probably meant log5 of 125, which does equal 3.


Patrick December 21, 2010 at 1:42 pm

Beezlebub- Don’t we have to work out the method of picking the pill first? Without knowing the method by which the pill selection was randomized, we can’t know the chance of selecting any given pill. And if we don’t know that, we can’t find the chance of P(a red pill is chosen).


Beelzebub December 21, 2010 at 3:11 pm

That’s true. The problem as written could be ambiguous. To be more explicit, I mean that when you close your eyes you choose randomly from all the pills in either of Morpheus’s hands without knowing which hand it was. The effect would be the same as, say Morpheus sneezes and one of the pills from one of his hands (unobserved by you, of course) drops on the floor, and you pick it up. If it’s red, what’s the probability that it came from his right hand?


Luke Muehlhauser December 21, 2010 at 3:11 pm

Henry,

Thanks for the corrections!


Luke Muehlhauser December 21, 2010 at 3:29 pm

Hey everyone; I just updated the PDF of this post with all the latest fixes, so if you’re reading offline, make sure to grab the latest PDF.


Beelzebub December 21, 2010 at 4:17 pm

Actually, Bayes formula can be used to solve either formulation of the problem, except that when you have a 50:50 hand selection probability then you use P(right hand) = P(left hand) = .5, and when you choose randomly from all the pills you must adjust the hand probabilities accordingly. I used Bayes with 50:50 and got the same result as Zeb — except that he switched left and right. The answer I got was 67%

So what’s the answer for the other version of the problem?


Taranu December 22, 2010 at 12:36 am

Luke,
when you say, “1 out of every 100 women have breast cancer” and “400 out of 1000 eggs contain pearls…”aren’t you using natural frequencies in both cases? From what you wrote we are better at solving problems phrased in the former way and we do best of all when they are phrased in the latter. Why the difference?


Taranu December 22, 2010 at 6:33 am

I hope I don’t become too annoying with my questions but I really want to understand Bayes theorem.

You said:
“Now, note that the mammography@ is rarely useful. Most of the time it , and only on rare occasions does it give us strong evidence. But when it does give us strong evidence, it is very strong evidence, for it allows us to conclude with certainty that the tested woman does not have breast cancer.”

Could you please provide an example where mammography@ gives very weak evidence that doesn’t allow us to slide our probability very far away from the prior probability?


Luke Muehlhauser December 22, 2010 at 7:38 am

Taranu,

Good question. “1 out of every 100 women” means that there might be more than 100 women, and for every set of 100 women, there is 1 that has cancer. But “400 out of 1000 eggs” means there are exactly 1000 eggs. That’s the difference.


Luke Muehlhauser December 22, 2010 at 7:41 am

Taranu,

Glad you want to understand! I reworded that paragraph. Does it make more sense now?


svenjamin December 22, 2010 at 9:34 am

As far as I can tell this enthusiam for Baye’s is another case of non-mathematicians feeling smug for thinking they are ‘in the know’ despite having minimal mathematical background. Like all those people who used to nod knowingly to me when they discovered I was a double anth of religion and math major and pronounce something about the connections between religion and “higher” mathematics or quantum mechanics. Good God.

Here’s a hint: Bayes theorem will account for a fraction of the points of one’s first exam in a college math department introduction to probability and statistics course. Further stats courses will likely be developed with a frequentist model, with Bayesian statistics mentioned in a textbook appendix. It’s cool, but not that big of a deal guys. People who think Bayes is particularly profound should stay away from the following: hardcore statistics, abstract algebra, mathematical analysis, and differential equations. Those all warped my thinking pretty thoroughly.

And as someone who reads quite a few stats heavy academic journal articles (I’m not claiming to understand all of the stats involved, mind), my current opinion is that any Bayesian revolution is only in the minds of the internet intellegentsia.


Taranu December 22, 2010 at 12:37 pm

Thank you Luke. Now I understand what you mean. I was confused by all the “positive result”, “negative result”, “cancer test”, “health test” talk.


Beelzebub December 22, 2010 at 2:40 pm

Ok, so I’ll answer my own question. According to my calculation, the answer is 72.7%, unless I blew it somewhere. Just from looking at the problem you know the answer has to be way over 50% since there are roughly the same number of pills in each hand, yet there are many more red pills in his right hand than left.


Beelzebub December 22, 2010 at 2:49 pm

My answer to svenjamin is that this isn’t a case of trying to be smug. Bayes theorem is routinely mentioned in theological arguments (for instance, by Craig). If you don’t even know enough to feed your own BS detector, you’re dead in the water. I recall seeing a debate between Craig and Ehrman where Craig flashed Bayes theorem on the screen and pontificated on it. Since Ehrman had no knowledge of it at all, there was absolutely nothing he could say about it. Craig could have been saying total gibberish.

It’s really more a case of defending yourself against other people who know a little and think they can pull the wool over your eyes.


DaVead December 22, 2010 at 4:54 pm

svenjamin:

I agree that the “Bayesian revolution is only in the minds of the internet intellegentsia”. It takes time for ideas to fall from the ivory tower to the internet, and when they do they can make a pretty big splash.

It might also be important to point out that Bayesian accounts of justification and epistemology are still just candidate players among many. It might seem nice, tidy, and ideal when you infer unknown probabilities from known ones, but when you apply it to tricky philosophical questions and start inferring unknown probabilities from unknown probabilities based on subjective estimations based on limited sets of data, it gets messy.


Josh December 22, 2010 at 10:38 pm

svenjamin gets it right. Though I find Bayesian statistics a BIT more interesting than s/he apparently does, it’s ridiculous the way that “internet intelligentia” (and even professional philosophers) have decided that it’s awesome.


Luke Muehlhauser December 22, 2010 at 10:47 pm

Josh,

You’ll come around. :)


Josh December 22, 2010 at 11:07 pm

Josh,You’ll come around. :)  

I dunno what to come around to =p. Like I said, I frequently use Bayesian statistics. I’m just not religious about it. In that regard, I like to imagine I take after Michael Jordan (the statistician, not the basketball player…).


Luke Muehlhauser December 23, 2010 at 9:08 am

Josh,

What to come around to? Good question. Basically, E.T. Jaynes’ Probability Theory: The Logic of Science.

But anyway, let me hereby acknowledge what is probably obvious to most: you’re almost certainly a much better statistician than I am!

svenjamin December 23, 2010 at 7:39 pm

“Smug” may or may not have been too harsh. I like Josh’s assessment: it’s just not as awesome as people are making it out to be. Add philosophers like Craig who use it as a tool of obfuscation in debates (much like his other forays into mathematics…) and it makes me cranky. It’s an incredibly simple theorem; it doesn’t really require all that much explanation:

This is the circle representing the set of outcomes we consider to be category A. This other circle is the set of known outcomes B, which contains cases of both A and not-A. The cases of A within B are the intersection of the two circles. When computing the probability that A is the case, we look at the probability of A occurring out of the events that are known to have occurred. It seems like common sense to me. It doesn’t really provide any special contribution to epistemology as far as I can tell. But I don’t really have the patience to read through long explanations of something so simple, so maybe I am missing something.
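In code, that picture amounts to nothing more than counting. Here is a minimal sketch (the two event sets below are made up purely for illustration; they are not from the tutorial):

```python
# Conditional probability as counting within a Venn diagram.
# A = outcomes in category A; B = outcomes known to have occurred.
# The sets are arbitrary illustrative examples.
A = {1, 2, 3, 4}
B = {3, 4, 5, 6, 7}

p_A_given_B = len(A & B) / len(B)   # fraction of B that also lies in A
print(p_A_given_B)                  # -> 0.4
```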

On the other hand, it’s great that at least some people are finding some enthusiasm for mathematics. Here are some things more interesting than Bayes:

The nature of axiomatic systems, especially geometries.

Gödel’s Incompleteness Theorem actually DOES have serious philosophical ramifications… but it is also bastardized by philosophy and social science.

The profundity of the motherfucking Central Limit Theorem.

Group Theory is REAL metaphysics: there are only 17 possible wallpaper patterns and 230 possible space groups in three dimensions. Studies of abiogenesis should really be looking at that. Maybe they are and I just haven’t looked hard enough.

http://en.wikipedia.org/wiki/Space_group

Or maybe I’m just cranky right now.

Luke Muehlhauser December 23, 2010 at 9:43 pm

svenjamin,

Here’s why some people are excited. Some have taken Bayes’ Theorem to specify not whether we are “justified” in believing X, nor whether X counts as “knowledge”, nor whether we are “rational” to believe in X, but exactly how strongly we should believe X given the evidence we have, and how strongly we should believe X after certain pieces of evidence are provided. That’s pretty fundamental, and some people have taken it to be the basic theorem defining the core goals of epistemology. That’s a pretty big deal.

The nature of axiomatic systems, the philosophical implications of Godel’s Incompleteness Theorem, the endless uses of the Central Limit Theorem… all these are quite interesting and profound, but they do not sum up a foundational area of philosophy in a single theorem.

Thoughts?

svenjamin December 24, 2010 at 11:38 am

Thoughts?

1) I guess I really just don’t understand the alleged relation between conditional probability and the core goals of epistemology. It seems to me that you can’t even apply a conditional-probability interpretation until you have some sort of facts to begin with – and that is where the foundational epistemology comes in. Think of the Ehrman-Craig debate. Craig tried to apply Bayes’ theorem… but basically made up his priors. You know, priors that ought to have been argued around the kind of core epistemological principles you describe in blogging through Theism and Explanation. Which you should send me a copy of, ’cause I’m broke.

2) I hate to say it, because I’m sure this write-up of yours was the result of a lot of time and thought, but your explanation is, to me, even less intuitive than Yudkowsky’s. The nice thing about mathematical symbols is that they allow one to communicate ideas in a concise symbolic form… which can then be explained if necessary. The longer an explanation continues without the appearance of a concise formula, the more confused I get! Here, take a look at this much shorter and more concise explanation: http://oscarbonilla.com/2009/05/visualizing-bayes-theorem/

3) I think Yudkowsky’s assessment of the scientific criterion of falsifiability as being Popperian is… more than a bit off.

4) It seems that Yudkowsky is claiming both that this “Bayesian reasoning” thing is very counterintuitive to humans, and that “more and more cognitive scientists [are] suddenly noticing that mental phenomena have Bayesian structure in them.” What?
But I’d be interested in some resources describing this rise of Bayesianism in different disciplines. Actually describing it, not just saying that it exists.

svenjamin December 24, 2010 at 12:13 pm

Beelzebub’s problem:

Assuming equal likelihood of picking either hand:

P(Red & Right) = 1/2 × 8/13
P(Red & Left) = 1/2 × 3/10

So we have P(Right|Red) = (8/26) / (8/26 + 3/20) ≈ 0.672

I do this intuitively (though not without drawing myself a diagram), but here it is in Bayes’ Theorem:

P(Bk|A) = P(Bk) P(A|Bk) / [Sum over i=1..m of P(Bi) P(A|Bi)]

Here the indexed Bi represent the possible hand choices. Bk is in this case “right hand”, and P(Bk) is the 0.5 for the equal likelihood of picking a given hand out of two options. P(A|Bk) is the probability of choosing a red pill given that the right hand was chosen, which is 8 / (5+8), or 8/13.
The denominator is the sum of the products of the probabilities of choosing a given hand times the probability of getting a red pill from that hand. In this case, 8 out of 13 pills in the right hand are red, and 3 out of 10 in the left hand are red. Both cases have 1/2 probability of occurring, so we end up with the numeric expression I wrote above:

(1/2 × 8/13) / [(1/2 × 8/13) + (1/2 × 3/10)] = what I wrote earlier.
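For anyone who wants to check the arithmetic, here is a minimal sketch of the same calculation (pill counts as given in the thread: 8 red out of 13 in the right hand, 3 red out of 10 in the left):

```python
# Morpheus's pill problem: how likely is it the red pill came from the right hand?
p_right = 0.5                 # prior: each hand equally likely to be picked
p_left = 0.5
p_red_given_right = 8 / 13    # likelihood of drawing red from the right hand
p_red_given_left = 3 / 10     # likelihood of drawing red from the left hand

# Bayes' theorem: P(Right | Red) = P(Right) P(Red | Right) / P(Red)
p_red = p_right * p_red_given_right + p_left * p_red_given_left
p_right_given_red = p_right * p_red_given_right / p_red

print(round(p_right_given_red, 3))  # -> 0.672
```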

Beelzebub December 24, 2010 at 2:24 pm

For me the interesting thing is to think about that problem description and just what the priors signify. When I initially wrote it down, I was under the impression that the pill would be picked at random from all 23 pills. However, realistically, from the problem description, even with eyes closed, or perhaps especially with eyes closed, you would no doubt find one hand or the other first, then choose a pill from it.

But thinking about it even more, I realize that more than likely the thing that would determine which hand you picked would be either your left- or right-handedness, Morpheus’s handedness, Morpheus’s psychological preference for moving one hand or another toward your hand, and so on. All of these factors would, of course, change the probability drastically.

Just as in the breast cancer problem, it may well be that the initial probability of one hand being offered over another might dominate the final probability no matter what the conditional probabilities are.

Luke Muehlhauser December 24, 2010 at 11:00 pm

svenjamin,

Email me about Dawes.

1) Priors, yes; that remains a debated issue.

2) Yes, different people will benefit from different approaches.

4) Some of our thinking is intuitively Bayesian, much of it is not.

As for your last question, do you mean you’d be interested to read literature describing the Bayesian revolution of the sciences, the reasons for its success, and so on?

svenjamin December 25, 2010 at 10:50 am

Beelzebub:
on handedness: yup! I thought about mentioning that too. You could experimentally determine the general tendency of people (or of a particular individual) to pick one way or the other and introduce that into the equation pretty easily. Which I should have written as:
P(Bk|A) = P(Bk) P(A|Bk) / [Sum over i=1..m of P(Bi) P(A|Bi)]

Here the P(Bi) would be the probabilities of picking each hand. So if the right hand is chosen with frequency q, the left hand would have probability 1 − q.
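A minimal sketch of that adjustment in code (the handedness weight q used below is made up for illustration, not a number from the thread):

```python
# Posterior that the red pill came from the right hand, allowing an unequal
# chance q that the right hand is the one picked in the first place.
def p_right_given_red(q: float) -> float:
    p_red_given_right = 8 / 13
    p_red_given_left = 3 / 10
    p_red = q * p_red_given_right + (1 - q) * p_red_given_left
    return q * p_red_given_right / p_red

print(round(p_right_given_red(0.5), 3))  # equal hands          -> 0.672
print(round(p_right_given_red(0.8), 3))  # strong right-hand bias -> 0.891
```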

I can post some other conditional probability problems for your Christmas pleasure if y’all like.

Luke:
I’m willing to entertain that there might be something to the Bayesian stuff that I am missing, whether philosophically or with respect to some academic revolution; I’d just like to see actual examples of such cases. So yeah, I’d be interested in reading some literature along those lines. Short literature. I am very busy as a grad student! And studies of our ‘natural’ cognitive heuristics are always interesting.

JoeK December 26, 2010 at 8:03 am

This is the circle representing the set of outcomes we consider to be category A. This other circle is the set of known outcomes B, which contains cases of both A and not-A. The cases of A within B are the intersection of the two circles. When computing the probability that A is the case, we look at the probability of A occurring out of the events that are known to have occurred.

Yes, this. I am not a statistician by any means, but any time I notice that I am confused about Bayes’ Theorem, visualizing the Venn diagram svenjamin describes un-confuses me immediately. My “intuitive explanation of Bayes’ Theorem” would consist of that diagram along with labels of the relevant partitions and proportions.

Taranu December 26, 2010 at 8:37 am

Luke,
you said:
“Now, consider this set of four values: p(positive&cancer), p(positive&~cancer), p(~positive&cancer), and p(~positive&~cancer). At first glance, it might look like there are only two degrees of freedom here, because you can calculate all the other values by knowing only two of them: p(positive) and p(cancer). For example: p(positive&~cancer) = p(positive) × p(~cancer).”

Are you sure this is right? It seems to me you should have said: p(positive&~cancer) = p(positive|~cancer) × p(~cancer)
And if this is correct, then you should make changes to what you wrote up to the end of the section about degrees of freedom.
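A quick numeric check of this point, using the 1% prevalence, 80% hit rate, and 9.6% false-positive rate that appear elsewhere in the thread (a sketch for illustration, not part of the original tutorial):

```python
# The correct product rule vs. the independence shortcut, using the thread's numbers.
p_cancer = 0.01
p_pos_given_cancer = 0.80        # hit rate
p_pos_given_no_cancer = 0.096    # false positive rate

p_no_cancer = 1 - p_cancer
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_no_cancer * p_no_cancer

joint_correct = p_pos_given_no_cancer * p_no_cancer   # p(positive & ~cancer)
joint_if_independent = p_pos * p_no_cancer            # only valid under independence

print(round(joint_correct, 4))         # -> 0.095  (about 9.5%)
print(round(joint_if_independent, 4))  # -> 0.102  (about 10.2%), so the two differ
```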

Taranu December 27, 2010 at 6:41 am

Luke,
you said about likelihood ratios:
“For example, a mammography with a hit rate of 80% for patients with breast cancer and a false positive rate of 9.6% for healthy patients has the same likelihood ratio as a test with an 8% hit rate and a false positive rate of 0.96%. Although these two tests have the same likelihood ratio, the first test is more useful in every way – it detects disease more often, and a negative result is stronger evidence of health.”

Why is a negative result stronger evidence of health when considering the first test?

Luke Muehlhauser December 27, 2010 at 8:06 am

Taranu,

The next paragraph says that “p(positive&~cancer) = p(positive) × p(~cancer)” is actually not a true claim.

Luke Muehlhauser December 27, 2010 at 8:30 am

Taranu,

A negative result on the first test is stronger evidence of health because p(cancer|~positive) is smaller for the first test than for the second test. Intuitively, the number that jumps out is the 8% hit rate for women with breast cancer in the second test. Because only a tiny fraction of the women with breast cancer will get a positive result on the second test (only 8% of them!), most women who do have breast cancer will not get a positive result. So with the second test, a negative result does relatively little to rule out breast cancer. But with the first test, the hit rate for a positive result is much higher (80%), so far fewer women with breast cancer will get a negative result. Thus, a negative result with the first test is stronger evidence of health than a negative result with the second test.

Does that make sense?
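To put rough numbers on that, here is a minimal sketch assuming the tutorial’s 1% prevalence together with the two tests’ rates quoted above:

```python
# Chance of cancer after a NEGATIVE result, for two tests with equal likelihood
# ratios (80%/9.6% vs 8%/0.96%), assuming a 1% prior prevalence.
def p_cancer_given_negative(hit_rate: float, false_pos_rate: float, prior: float = 0.01) -> float:
    p_neg_given_cancer = 1 - hit_rate
    p_neg_given_healthy = 1 - false_pos_rate
    p_neg = p_neg_given_cancer * prior + p_neg_given_healthy * (1 - prior)
    return p_neg_given_cancer * prior / p_neg

print(round(p_cancer_given_negative(0.80, 0.096), 4))   # first test  -> 0.0022
print(round(p_cancer_given_negative(0.08, 0.0096), 4))  # second test -> 0.0093
```

A negative result from the first test leaves roughly a 0.2% chance of cancer; a negative result from the second test leaves roughly 0.9%, about four times as much.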

Taranu December 27, 2010 at 9:13 am

Oh silly me! Now I see :)

Taranu December 27, 2010 at 12:49 pm

I know I must be pretty annoying with all these questions, but I’ve almost finished reading, so please bear with me a little longer. You said:
“+7 decibels of evidence means that there is now an 87% chance that she has cancer.” How did you arrive at 87%? I keep getting 84%.

Luke Muehlhauser December 27, 2010 at 2:55 pm

Taranu,

Thanks again; that’s my bad. It is indeed about 83% or 84%.

Peter December 28, 2010 at 3:23 am

svenjamin,

You mention in your post: “…Further stats courses will likely be developed with a frequentist model, with Bayesian statistics mentioned in a textbook appendix. It’s cool, but not that big of a deal, guys…”
I do agree that the frequentist school is currently the dominant one taught, and that Bayes’ theorem is not a big deal in isolation (in fact it is basically a “re-arranged” statement of the product rule P(A and B) = P(A) P(B given A) = P(B) P(A given B)). What is a big deal to me is how one gets to Bayes’ theorem. The requirements or assumptions which imply that Bayesian inference is the optimal approach can be found in the following (technical) article by Kevin S. Van Horn: http://www.stats.org.uk/cox-theorems/VanHorn2003.pdf. Most of these requirements are “existence” ones (i.e. there exists a function such that…), and they are loosely defined (but easier to understand) in Jaynes’ book (Probability Theory: The Logic of Science, 2003) as:
1) probabilities are represented by a single real number
2) qualitative correspondence with “common sense” (which I take to mean deductive logic)
3) consistent reasoning, defined in three ways:
i) if a conclusion can be reasoned in many ways, they must all give the same answer
ii) taking account of all available information. Does not add information that isn’t there, and does not throw information away
iii) if one has the same state of knowledge in two situations, then one must assign them equal probabilities

Basically, these requirements force a Bayesian approach to statistical inference. Or to put it another way, if someone is not using a Bayesian approach (implicit or explicit) to statistical inference, then they will be violating at least one of the above requirements. Now imagine a situation where you have to convince someone else that your method of inference was good while it was violating one of these criteria; I think you would have a hard time. This is the “big deal” as far as I’m concerned. Frequentist methods are thus only consistent with the above requirements when they are numerically equivalent to the Bayesian method. They often fail on requirement 3, and in my view rely too heavily on “big sample size” and “statistical independence” arguments.
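For concreteness, the “re-arrangement” mentioned above is just the following bit of standard algebra (not specific to any of the sources cited here):

```latex
% Bayes' theorem as a rearranged product rule
P(A \wedge B) = P(A)\,P(B \mid A) = P(B)\,P(A \mid B)
\quad\Longrightarrow\quad
P(A \mid B) = \frac{P(A)\,P(B \mid A)}{P(B)},
\qquad
P(B) = \sum_{i} P(A_i)\,P(B \mid A_i)
```

where the A_i in the denominator are mutually exclusive and exhaustive hypotheses, as in svenjamin’s sum formula earlier in the thread.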

Taranu December 28, 2010 at 12:52 pm

Luke,
when we want to evaluate the probability that a hypothesis is true using Bayesian inference must the prior probability also include the past explanatory success of the hypothesis? I mean if we want to evaluate a supernatural hypothesis must we take into account how successful supernatural hypotheses have been thus far?

Also, I finished reading the post. Thank you for writing it. I found it helpful, interesting, challenging at times and a lot easier to understand than Eliezer’s.

Luke Muehlhauser December 30, 2010 at 7:10 pm

Taranu,

Checking traditional explanatory virtues against Bayes’ Theorem is another topic, and a massive one. No time now. I think Richard Carrier’s upcoming book has quite a bit on this.

Zeb February 23, 2011 at 9:44 am

Typo in your initial statement of the breast cancer problem:

but 9.6% of women of women

I’m taking another crack at this article, trying to get past just knowing how to do the math to the intuitive understanding. What does this sentence mean?

So if we’ve grabbed a blue egg, it’s more likely to be empty than contain a pearl than was the case when we didn’t know it was blue (the prior probability).

How can the egg be more likely to contain a pearl? It either does or doesn’t. I can’t even see how we’re more likely to correctly predict whether we’ll find a pearl inside; either we will or we won’t. Is this just a prediction that if we independently replicate the situation of picking a blue egg, a certain proportion of those situations will yield a pearl? That I can see, but I am genuinely having difficulty interpreting the meaning of “increased likelihood” as applied to a given present situation (even though, of course, in practice I do it all the time, on faith I guess).

Zeb February 23, 2011 at 10:09 am

If an egg is just as likely to be blue given that it contains a pearl as it is likely to be blue given that it doesn’t contain a pearl, then there are just as many eggs that are blue and contain a pearl as there are eggs that are blue and empty, and so the fact that the egg we picked is blue doesn’t give us any new information at all about whether or not it contains a pearl.

The first part of this sentence is incorrect. As your diagram indicates, 40% of blue eggs contain pearls, and 60% are empty – there are not just as many of each.

Zeb February 23, 2011 at 2:01 pm

The top bar looks the same as before, and the middle line has the same kind of slant, but now the bottom bar is much smaller because we’re only looking at blue eggs in the bottom bar.

It is purely coincidental that the middle line has the same slant. You sort of acknowledge that in the next diagram, but I found it confusing.

Peter February 24, 2011 at 7:44 am

Zeb, you are right in what you say here:

How can the egg be more likely to contain a pearl? It either does or doesn’t.

But Bayes’ theorem is about describing information, rather than determining reality. Before you look into the egg, you simply don’t know whether the pearl is there or not. The probability can be thought of like “implicability” in the sense of deductive logic. So it basically goes: “the truth of the blue egg and the truth of 40% of pearls being blue implies the truth of there being a pearl to degree 0.67”.

After you look inside the egg, the “truth value” is determined, and your probability will reflect this:

Pr( Pearl | Blue & “Pearl”) = Pr( Pearl | Blue) * Pr( “Pearl” | Blue & Pearl) / Pr ( “Pearl” | Blue) = 1

I enclose “pearl” in quotes so you can see what’s going on. Blue always stays on the right of | , pearl and “pearl” switch.

Pr( Pearl | Blue & “~Pearl”) = Pr( Pearl | Blue) * Pr( “~Pearl” | Blue & Pearl) / Pr( “~Pearl” | Blue) = 0

The likelihood Pr( “~Pearl” | Blue & Pearl) is zero because the reported observation contradicts the conditions.

So what you describe about the pearl being “in” or “not” is exactly what Bayes’ theorem tells you to do, but only after you actually observe what’s inside the egg.
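A minimal sketch of that updating process (the starting numbers below are made up for illustration only, chosen so that the pre-observation posterior matches the 0.67 above):

```python
# Probability tracks information: evidence shifts it, and a perfectly reliable
# observation of the egg's contents drives it to 0 or 1.
# The starting numbers are illustrative, not taken from the tutorial.
def posterior(prior, p_obs_given_pearl, p_obs_given_empty):
    p_obs = p_obs_given_pearl * prior + p_obs_given_empty * (1 - prior)
    return p_obs_given_pearl * prior / p_obs

p_pearl_given_blue = posterior(0.40, 0.30, 0.10)   # update on "the egg is blue"
print(round(p_pearl_given_blue, 2))                # -> 0.67

# Now open the egg. Seeing a pearl is certain if one is there and impossible
# if not, so the update is decisive either way.
print(posterior(p_pearl_given_blue, 1.0, 0.0))     # observed a pearl  -> 1.0
print(posterior(p_pearl_given_blue, 0.0, 1.0))     # observed no pearl -> 0.0
```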

Zeb February 24, 2011 at 8:46 am

Thanks for your reply Peter, but there is a lot there that I don’t understand. I don’t know what difference you meant between “pearl” and pearl. In fact I’m not sure what you were getting at with those equations anyway.

What does “implicability” mean, or “implies the truth…to a degree of 67%”? I think you understand my question, but to rephrase it: “What do I know when I know that the blue egg I hold has a 67% chance of containing a pearl?” I kind of get that Bayes is not talking about the egg I have, but about the information I have. I just don’t quite get what it is telling me. Is it a misstatement to say, “I know this egg has a 67% chance of containing a pearl,” and instead should I say, “67% of the blue eggs in this barrel contain pearls, but I don’t know about this one,” or “If I predict ‘This blue egg contains a pearl’ a bunch of times, my prediction will be correct 67% of the time, but I don’t know about this time”? Those are true statements, but is the first one, “This blue egg has a 67% chance of containing a pearl,” also true?

Luke Muehlhauser February 24, 2011 at 12:18 pm

Zeb,

Thanks! Hopefully some day after a March 15th deadline I’ve got, I’ll have time to go through all these comments and update the tutorial.

Peter February 24, 2011 at 7:11 pm

Zeb,

There are many different ways you can interpret “implicability”. And you are spot on when you say that the philosophy of Bayes’ theorem is to describe what you know. For this is the only thing you can do. It is impossible to describe “reality” in an absolute sense: you are only able to describe what you know about reality. The only way to say something “absolute” about reality is to assume something about reality. But this is not a “physical property” of reality – just what you know based on your assumption, which is in turn based on your knowledge (which includes observation). E.g., I assume X: this implies that X is true if and only if X is true – can you see the circularity?

And just to clarify, I mean “implicability” in exactly the same way deductive logic uses the word “implies”. Probability just “fills” in the space between 0 and 1.

Loup Vaillant March 25, 2011 at 7:01 am

I think I know how to fix your visualization of natural frequencies. Right now, you centre the smaller bar around 50%, which is convenient, but arbitrary.

I suggest you just keep the “update slope” implied by the conditionals, and centre the smaller bar around that. That way, your diagram would keep track of both the quantities and the strength of the observed evidence. In your cancer example, that would shift the smaller bar dramatically to the left.

MAC May 6, 2011 at 3:36 am

Hello. Thank you for the illuminating liberation your article provides, and necessarily so. Nonetheless, it is still not idiot-proof – or, alternatively, there are mistakes, perhaps even deliberate ones. For example: “Because we can use two of the values to calculate the third, . . . p(~positive&cancer) = p(~positive|cancer) × p(cancer).
p(cancer) = 01%
p(~positive|cancer) = 20%
p(~positive&cancer) = 20%

This doesn’t work. E.g., were the incidence of women with BC 5% in a universe of 10,000 women, and the probability of false negatives 20% of these women, then according to this explanation the probability that a woman tested has BC and will test negative is 100%, which is impossible. Were the incidence of BC in a large population, say, 10%, and the probability of receiving a false negative test result also 20% of the 10%, then accordingly the probability of a patient having BC and testing negative would be 200%. Is it me who’s bonkers?

gigi May 9, 2011 at 2:10 pm

MAC,
I thought about your comment and quite frankly I don’t know what to say. I think this is a good time to ask for help.
Luke, a mathematician, anyone?!

MAC May 9, 2011 at 2:34 pm

To gigi and others: thank you very much for your reply. Inadvertently, you have confirmed my existence, which is a comfort. Thanks again. FALSE ALARM – it turns out that it is actually me who is bonkers. You’re excused. In the paragraph following the text I referred to, Luke wrote: “. . . ! Notice that the above equation is only true if p(positive) and p(~cancer) are statistically independent.” There we have it. Nonetheless, I have learned a lot. Best regards, MAC

el ninio May 10, 2011 at 5:46 am

Luke,
how do you compare two hypotheses, using Bayes, if they are not mutually exclusive? For instance, suppose you have to compare H1, “YHWH raised Jesus from the dead”, and H2, “Xenu raised Jesus from the dead”. Do you first compare H1 to the hypothesis that YHWH did not raise Jesus from the dead and H2 to the hypothesis that Xenu did not raise Jesus from the dead, and then compare the results? Or is there a way to compare H1 and H2 directly?

Micheál Anthony Conroy May 23, 2011 at 4:17 pm

Luke, you wrote: “. . . but on an exponential scale, +7 decibels of evidence means that there is now an 83% chance that she has cancer!” I have followed every word and enjoyed the feeling of victory – that is, until now. Ugh! How on earth do you get 83%? Best regards, MAC

Leo July 4, 2011 at 7:21 pm

Correction:

“p(positive&cancer), p(positive|given), and p(cancer).”

That would be: p(positive&cancer), p(positive|cancer), and p(cancer).
Wouldn’t it?

Peter July 5, 2011 at 2:09 am

MAC, here’s how to get from +7db of evidence to 83% probability

“evidence” e(H|X) is related to “probability” p(H|X) by:

e(H|X) = 10*log10(p(H|X)/[1-p(H|X)]) = +7db

“log10” means base-10 logarithm (I don’t know how to make the equations pretty here). This can be inverted to show that the odds are:

p(H|X)/[1-p(H|X)] = o(H|X) = 10^(0.1*e(H|X)) = 10^0.7 = 5.01

And odds are related to probabilities by:

p(H|X) = o(H|X)/[1+o(H|X)] = 5.01/[1+5.01] = 5.01/6.01 ≈ 0.834

You can also check the correctness of this result by plugging it back into the formula for the evidence:

e(H|X) = 10*log10(0.834/0.166) = 10*log10(5.01) ≈ 7
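The same conversion as a minimal sketch in code (nothing here beyond the two formulas above):

```python
import math

# Convert "decibels of evidence" to a probability, and back.
def db_to_probability(db: float) -> float:
    odds = 10 ** (db / 10)          # o(H|X) = 10^(e/10)
    return odds / (1 + odds)        # p = o / (1 + o)

def probability_to_db(p: float) -> float:
    return 10 * math.log10(p / (1 - p))

p = db_to_probability(7)
print(round(p, 3))                   # -> 0.834
print(round(probability_to_db(p)))   # -> 7, recovering the original evidence
```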

Luke Muehlhauser July 5, 2011 at 7:33 am

Yup! Updated.

Abe December 9, 2011 at 9:49 am

Thanks for doing this, Luke! I am brand new to LW and Bayes. Your article here should be paired with every link on LW that sends users to Yudkowsky’s piece. I also wanted to share two resources that helped me out immensely.

To help me with the visualization of Bayes I found this article incredibly helpful. (http://lesswrong.com/lw/2b0/bayes_theorem_illustrated_my_way)

Also, this video by Richard Carrier on youtube helped me break down what the different parts of Bayes’ theorem actually meant. (http://www.youtube.com/watch?v=HHIz-gR4xHo)
