Monday, December 04, 2006

Simpson's Paradox

Lemme talk about something which I thought was interesting. The best way to go about it would probably be through an example.
The subject for today's study is cricket. To be more specific, batting averages and judging who is the better player. A particular statistical anomaly of interest is as follows. Sometimes one player has the better average in every individual season, and yet the other player has the better average once the seasons are combined. This seems like a mathematical impossibility! For instance, if Sachin Tendulkar had a lower average than, say, Mike Hussey in both the 2004 and 2005 seasons, how is it possible that Sachin has the better average when you combine the two seasons?

Let me illustrate with a rather simplistic example.

Sachin

2004 Season - Average 50
2005 Season - Average 40

Hussey

2004 Season - Average 60
2005 Season - Average 45

A simple examination would indicate that Hussey is the better player, and you would expect that combining the 2004 and 2005 seasons still leaves Hussey with the better average. But this need not be the case. Consider the same averages with the number of games filled in.

Sachin

2004 Season - 100 games scoring 5000 runs at an average of 50 runs
2005 Season - 10 games scoring 400 runs at an average of 40 runs

Total combined average for both seasons ==> 5400/110 = about 49.1 runs per game

Hussey

2004 Season - 10 games scoring 600 runs at an average of 60
2005 Season - 100 games scoring 4500 runs at an average of 45

Total combined average for both seasons ==> 5100/110 = about 46.4 runs per game

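Therefore, despite the fact that Hussey's average is higher in each individual year, combining the two seasons reverses the inference. The key here is the difference in weights placed on each season when calculating the combined average. Sachin's combined figure is dominated by his 100-game 2004 season (average 50), while Hussey's is dominated by his 100-game 2005 season (average 45), so each player's total sits close to a different season's number. The lesson is that an average means little unless you also know how many observations stand behind it.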
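The arithmetic above can be checked with a few lines of Python. This is just a minimal sketch using the made-up numbers from the example; the function names `season_averages` and `combined_average` and the (games, runs) layout are my own choices for illustration, not anything standard.

```python
# Each season is a (games, runs) pair, using the made-up numbers from the example.
sachin = [(100, 5000), (10, 400)]   # 2004, 2005
hussey = [(10, 600), (100, 4500)]   # 2004, 2005


def season_averages(seasons):
    """Average runs per game for each season taken on its own."""
    return [runs / games for games, runs in seasons]


def combined_average(seasons):
    """Total runs divided by total games, i.e. a games-weighted average."""
    total_games = sum(games for games, _ in seasons)
    total_runs = sum(runs for _, runs in seasons)
    return total_runs / total_games


for name, seasons in (("Sachin", sachin), ("Hussey", hussey)):
    print(name, season_averages(seasons), round(combined_average(seasons), 2))
# Sachin [50.0, 40.0] 49.09
# Hussey [60.0, 45.0] 46.36
```

Hussey wins every season-by-season comparison, but Sachin wins the combined one, because the combined figure weights each season by the number of games played in it.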

5 comments:

Suresh Sankaralingam said...

Yeah, I have had that realisation too. The formula is actually (m1n1+m2n2)/(n1+n2) instead of (m1+m2)/2 [ the only time the two agree is when m1=m2 or n1=n2 ]...

Mad Max said...

pretty much i guess...well i just thought it was interesting becoz statistics is pretty much like part of our life these days...was reading cricinfo the other day and they have this stats column which talks about interesting numbers..that kind of sparked the thought

Survivor said...

Hmm..weighted averages are interesting indeed..

BrainWaves said...

I wonder why people don't use it? Not very difficult to calculate.

Mad Max said...

@ brainwaves: it is not about the difficulty to calculate but rather what we call cognitive effects...i think meera can possibly add more here coz she is a psychology major...the concept i would relate it to is "limited attention"...the idea is all ppl are shirkers and do not want to put in any effort...therefore when u have some numbers served on the table, u just consume them as is...never questioning if the numbers actually mean what they intend to convey...mebbe we can think of some situations where this matters...The stock markets, where one profits from the other's losses, would be a good place i think, where investor fixation on some baseline numbers creates so much opportunity for other folks to make money...the world is a mad house ayee...