numbers for people.

You’re probably polluting your statistics more than you think

In a recent post, Gabriel Rossman gives a simple example of why statistics are hard to do correctly.

  • If good looks and smarts are distributed normally, and
  • If good looks and smarts have nothing to do with each other, and
  • If movie producers want both smarts and looks
  • Then, by observing only employed actors, we’ll conclude that looks and smarts are negatively correlated
  • Even though we constructed this experiment with no correlation
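The setup above is easy to check numerically. Here is a minimal Python sketch, assuming looks and smarts are independent standard normal scores and using the looks+smarts>2 hiring cutoff from the graph below. The function names (`pearson`, `simulate_actors`), the sample size and the seed are my own illustrative choices, not anything from Rossman's post.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate_actors(n=20000, cutoff=2.0, seed=0):
    """Draw n aspiring actors and 'hire' those with looks + smarts > cutoff.

    Returns (correlation among everyone, correlation among the hired,
    number hired).
    """
    rng = random.Random(seed)
    # Independent draws: no correlation is built into the population.
    people = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    employed = [(looks, smarts) for looks, smarts in people
                if looks + smarts > cutoff]
    r_all = pearson(*zip(*people))
    r_employed = pearson(*zip(*employed))
    return r_all, r_employed, len(employed)


r_all, r_employed, n_employed = simulate_actors()
print(f"everyone:      r = {r_all:+.3f}")
print(f"employed only: r = {r_employed:+.3f} ({n_employed} actors)")
```

The correlation over everyone should hover near zero, while the correlation over the employed subset should come out clearly negative, even though nothing in the generating process links the two traits.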

Here’s a graph of 250 randomly generated points (with no correlation). The red circles represent “actors who are smart and good looking enough to get a job” (looks+smarts>2), and the lighter blue x’s represent “people who wanted to be actors”:

If we only look at actors with jobs, we’ll see a clearly negative correlation between smarts and good looks. In fact, some brilliant actors are less attractive than an average person, and some gorgeous actors are dumber than an average person. Even more interesting, though, is that if we try to rule out bias by looking at aspiring but unsuccessful actors as well, we’ll find that they exhibit a similar correlation. Here are the lines of best fit for both:

That both groups would exhibit a negative correlation is more obvious if you mentally split the groups on looks+smarts=0.
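The two lines of best fit can be reproduced with an ordinary least-squares fit on each group. This is a rough sketch rather than a reconstruction of the original Excel chart: I am assuming looks go on the x-axis and smarts on the y-axis, and I use far more than 250 points so the slopes are stable. `fit_line` and `fit_both_groups` are hypothetical helper names.

```python
import random


def fit_line(points):
    """Ordinary least-squares fit of y on x; returns (slope, intercept)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx


def fit_both_groups(n=20000, cutoff=2.0, seed=0):
    """Fit smarts against looks separately for hired and rejected actors."""
    rng = random.Random(seed)
    people = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    employed = [p for p in people if p[0] + p[1] > cutoff]
    rejected = [p for p in people if p[0] + p[1] <= cutoff]
    return fit_line(employed), fit_line(rejected)


(emp_slope, emp_icpt), (rej_slope, rej_icpt) = fit_both_groups()
print(f"employed: smarts = {emp_slope:+.2f} * looks {emp_icpt:+.2f}")
print(f"rejected: smarts = {rej_slope:+.2f} * looks {rej_icpt:+.2f}")
```

Both slopes should come out negative. The rejected group's slope is much shallower, because the cutoff removes only a small corner of that cloud.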

This effect is particularly nefarious in that it’s distribution-agnostic. For instance, assume the following about mathematicians:

  • Experience and brilliance are uniformly distributed
  • With experience comes somewhat more brilliance (I’ve introduced a positive 20% correlation)
  • Only the top fifth of mathematicians (as measured by experience+brilliance) ever get anywhere, and the rest drop out to do something easier
  • It’s very easy to conclude that experience kills brilliance, and that a mathematician’s best work will be done by 40: a phantom negative correlation
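The mathematician example can be sketched the same way. The post doesn't say how the 20% correlation was introduced, so as an assumption I use a Gaussian copula: draw correlated normals, then push each through the normal CDF to get uniform marginals. For that construction the Pearson correlation of the uniforms is (6/π)·arcsin(ρ/2), which I invert to hit the 0.2 target. All names below are illustrative.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def normal_cdf(z):
    """Standard normal CDF, used to map normal draws to uniform(0, 1)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def simulate_mathematicians(n=50000, target_r=0.2, top_share=0.2, seed=0):
    """Uniform experience and brilliance with correlation ~target_r.

    Keeps only the top_share of people ranked by experience + brilliance.
    Returns (correlation among everyone, correlation among survivors).
    """
    rng = random.Random(seed)
    # Assumption: Gaussian copula. rho is the normal-space correlation
    # that yields target_r between the resulting uniform variables.
    rho = 2.0 * math.sin(math.pi * target_r / 6.0)
    people = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0, 1)
        people.append((normal_cdf(z1), normal_cdf(z2)))
    # Everyone outside the top fifth "drops out to do something easier".
    ranked = sorted(people, key=lambda p: p[0] + p[1], reverse=True)
    survivors = ranked[: int(n * top_share)]
    return pearson(*zip(*people)), pearson(*zip(*survivors))


r_everyone, r_survivors = simulate_mathematicians()
print(f"all mathematicians: r = {r_everyone:+.3f}")
print(f"top fifth only:     r = {r_survivors:+.3f}")
```

Across everyone the correlation should sit near +0.2, yet among the surviving top fifth it flips to negative: the selection overwhelms the real positive relationship.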

In a general sense (the proof being left as an exercise for the reader):

  • Given two measurements x_i in X and y_i in Y on a set of points p_1…p_n in P, if the value of x_i+y_i increases the chance that p_i will be sampled, it will introduce a phantom correlation between X and -Y
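This is not a proof, but the general statement can at least be poked at with a softer kind of selection than a hard cutoff. In the sketch below each point is sampled with a probability that merely increases with x+y. The logistic sampling curve, its steepness, and the independent normal inputs are all assumptions of mine, chosen only for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def soft_selection(n=50000, steepness=2.0, seed=0):
    """Sample each point with probability logistic(steepness * (x + y)).

    Returns (correlation in the full population, correlation in the sample).
    """
    rng = random.Random(seed)
    population = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    sample = []
    for x, y in population:
        # Chance of being observed rises smoothly with x + y; no hard cutoff.
        p_sampled = 1.0 / (1.0 + math.exp(-steepness * (x + y)))
        if rng.random() < p_sampled:
            sample.append((x, y))
    return pearson(*zip(*population)), pearson(*zip(*sample))


r_population, r_sample = soft_selection()
print(f"population: r = {r_population:+.3f}")
print(f"sample:     r = {r_sample:+.3f}")
```

Even with this gentle bias the sampled points should show a negative correlation that the population does not have. Raising `steepness` pushes the selection toward a hard cutoff and should strengthen the effect.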

Kind of scary, eh?

Disclaimer: Although the author is ostensibly a mathematician, he has never been a very good one (he did the graphs in Excel, what’s that about?). All theorems should be proven from first principles before attempting to use at home. Vote no on the axiom of choice.
