Wednesday, September 30, 2009

Trailing Digit Distribution Charts

On September 25 and 26, 2009, Nate Silver (fivethirtyeight.com) suggested that we could examine the distribution of trailing digits in the leading two percentages from political candidate/issue polling to check the reliability of certain pollsters. His first assertion, that we might compare this digit distribution to something approaching a uniform distribution, struck me as off, so I decided to apply a more thorough methodology.

First, let me be clear about where these trailing digits come from. A poll from pollster X says candidate A leads candidate B 45-41. We take this poll as one data point and count one 5 and one 1. This poll is saying 45% prefer A, 41% prefer B, and implies the remaining 14% are undecided with a spread of 4% between candidates. We can transform this input to focus on the distribution of the undecided and spread figures.
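The bookkeeping above is simple enough to write down. Here is a minimal sketch in Python; the function names are my own, chosen for illustration, and it assumes a two-way race where everyone not choosing A or B counts as undecided.

```python
from collections import Counter

def trailing_digits(a_pct, b_pct):
    """Return the last digit of each of the two leading percentages."""
    return a_pct % 10, b_pct % 10

def undecided_and_spread(a_pct, b_pct):
    """Transform (A%, B%) into (undecided %, spread %)."""
    undecided = 100 - a_pct - b_pct
    spread = abs(a_pct - b_pct)
    return undecided, spread

def from_undecided_and_spread(undecided, spread):
    """Inverse transform: recover (leader %, trailer %)."""
    decided = 100 - undecided
    leader = (decided + spread) / 2
    trailer = (decided - spread) / 2
    return leader, trailer

# The 45-41 poll from the text contributes one 5 and one 1.
digit_counts = Counter(trailing_digits(45, 41))
```

The transform is invertible, so nothing is lost by modeling the undecided and spread figures instead of the two candidate percentages directly.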

The distribution of the undecided figure is likely to differ between types of polls and between pollsters themselves, as the way the interview is conducted affects how likely a respondent is to state a preference. Also, pollsters may specialize in types of polling that are more or less likely to have a larger undecided figure.

The distribution of the spread between the leading and trailing candidate is also likely to vary from pollster to pollster, as some pollsters may be more or less interested in polling close races/issues.

In both cases, however, these distributions should be very smooth. We can work out distributions which approximate the data sets we have. We can then use those distributions to predict the natural distribution of trailing digits.

Using a closely fitting Gamma Distribution with parameters alpha = 3 and beta = 2.5 for the % undecided, and a Gamma Distribution with parameters alpha = 1 and beta = 7 for the % spread, I use a Monte Carlo method to develop an expected trailing digit distribution for Strategic Vision:


This assumes each poll has 1200 interviews with a whole number of persons favoring A, B, or undecided. I randomly conduct 2772 polls and count the trailing digits. Then I repeat the process 1000 times and take the mean frequency of each digit. Visually, we see this comport somewhat with the actual distribution observed with Strategic Vision, but using a Chi-squared distribution we see that only 0.011% of pollsters with this undecided/spread distribution would have a trailing digit distribution this strange.
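For anyone who wants to try this at home, here is a rough sketch of a simulation along these lines, using only the Python standard library. It is not my original code, and several details are assumptions: beta is treated as the scale parameter of the Gamma distribution, draws where undecided plus spread exceed 100% are thrown away, each poll contributes two digits, and halves round up. All function names are illustrative.

```python
import random

def round_half_up_pct(count, n):
    """Percentage of n, rounded to a whole number with halves rounded up."""
    return (200 * count + n) // (2 * n)

def simulate_poll(rng, n=1200, und_shape=3.0, und_scale=2.5,
                  spr_shape=1.0, spr_scale=7.0):
    """Draw one poll; return the reported (A%, B%) as whole numbers."""
    while True:
        undecided_pct = rng.gammavariate(und_shape, und_scale)  # beta = scale (assumed)
        spread_pct = rng.gammavariate(spr_shape, spr_scale)
        if undecided_pct + spread_pct <= 100:  # reject impossible polls (assumed)
            break
    # Convert to whole numbers of respondents out of n interviews.
    undecided_n = round(undecided_pct * n / 100)
    decided_n = n - undecided_n
    spread_n = min(round(spread_pct * n / 100), decided_n)
    a_n = (decided_n + spread_n) // 2
    b_n = decided_n - a_n
    return round_half_up_pct(a_n, n), round_half_up_pct(b_n, n)

def expected_digit_frequencies(n_polls=2772, n_reps=1000, seed=None, **poll_kwargs):
    """Mean frequency of each trailing digit 0-9 over n_reps simulated pollsters."""
    rng = random.Random(seed)
    totals = [0.0] * 10
    for _ in range(n_reps):
        counts = [0] * 10
        for _ in range(n_polls):
            a_pct, b_pct = simulate_poll(rng, **poll_kwargs)
            counts[a_pct % 10] += 1
            counts[b_pct % 10] += 1
        for digit in range(10):
            totals[digit] += counts[digit] / (2 * n_polls)
    return [t / n_reps for t in totals]
```

Calling `expected_digit_frequencies()` with the defaults runs the full 2772 polls by 1000 repetitions, which takes a while in pure Python; smaller values of `n_polls` and `n_reps` are enough to see the general shape.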

In contrast, when we examine Quinnipiac's undecided/spread distribution and allow the undecided % to be Gamma Distributed with parameters alpha = 3.5 and beta = 5, and the spread % to be Gamma Distributed with parameters alpha = 1 and beta = 6, we find something much closer.
After applying the same Monte Carlo method we find that 18.022% of pollsters would be expected to be this dissimilar from the predicted distribution, well within the range of reasonably likely events.
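The comparison behind these percentages can be sketched as a Chi-squared goodness-of-fit calculation. The version below is a stdlib-only illustration, not the exact code used for the figures above. It assumes 9 degrees of freedom (ten digits minus one) and uses the closed-form tail probability that exists for odd degrees of freedom; the function names are my own.

```python
import math

def chi_squared_statistic(observed_counts, expected_freqs):
    """Sum of (observed - expected)^2 / expected over the ten digits."""
    total = sum(observed_counts)
    stat = 0.0
    for observed, freq in zip(observed_counts, expected_freqs):
        expected = freq * total
        stat += (observed - expected) ** 2 / expected
    return stat

def chi_squared_tail_9df(x):
    """P(X >= x) for a Chi-squared variable with 9 degrees of freedom."""
    if x <= 0:
        return 1.0
    series = 1 + x / 3 + x ** 2 / 15 + x ** 3 / 105
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * series)
```

Feeding a pollster's observed digit counts and the simulated expected frequencies into `chi_squared_statistic`, then passing the result to `chi_squared_tail_9df`, gives the share of pollsters we would expect to look at least this unusual.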

Looking back at the Strategic Vision dissimilarities, the thing that jumps out at me is what I take to be the replacement of 6's with 5's and 1's with 0's in the trailing digit. Suppose we modify our methodology to incorrectly round numbers like 45.5% down to 45% instead of up to 46%, and 30.5% down to 30% instead of up to 31% (i.e., in all cases where the percentage rounded to the tenths place ends in 5.5% or 0.5%, we incorrectly round down). We then get the following distribution:

In this scenario, we can expect 12.57% of pollsters with the same undecided/spread distributions to produce actual distributions this dissimilar. Were this the case, we would say that this pollster was using a seriously flawed rounding methodology.
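To make the hypothesized rounding error concrete, here is a small sketch of such a rule in Python. It is one possible reading of the rule, with illustrative function names: the percentage is first rounded to tenths, and if that value ends in 5.5 or 0.5 it is truncated; everything else rounds normally with halves going up.

```python
def round_half_up_pct(count, n):
    """Correct rounding: percentage of n to a whole number, halves up."""
    return (200 * count + n) // (2 * n)

def flawed_round_pct(count, n):
    """Hypothetical flawed rounding: x5.5% and x0.5% are rounded down."""
    # Percentage in tenths of a point, rounded half up (455 means 45.5%).
    tenths = (2000 * count + n) // (2 * n)
    if tenths % 100 in (5, 55):   # value ends in 0.5 or 5.5
        return tenths // 10       # incorrectly drop the half point
    return round_half_up_pct(count, n)
```

With 1200 interviews, 546 respondents is exactly 45.5%, so `flawed_round_pct(546, 1200)` gives 45 where correct rounding gives 46. Swapping a rule like this into the simulation in place of the correct rounding is what shifts 6's to 5's and 1's to 0's.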

If this is not the case, we have strong evidence of fraud.