Now that the 2020 election has passed, I thought it might be a good time to write a short article on polling and sampling theory. As most of you know, I spent 30+ years as a researcher and statistician for the State of Illinois, and in that role I consulted on a large number of sampling issues.
Often, someone would approach me and describe what they wanted me to help them accomplish. Generally, they would ask, “How large a sample do I need to get the answer to a specific question?” The first question that I would ask was, “What error are you willing to accept?” Inevitably, almost everyone would respond with “none.” I would then tell them, “Then I cannot help you, as you cannot use sampling!” Sampling theory dictates that you establish a sampling error that you can live with. It is the plus or minus in a result (e.g., 48%, plus or minus 3%). The 3% is called the margin of error.
Now, let’s try to answer the following question:
How do pollsters determine what sample size to use to determine the proportion of the population that favors Candidate A in their polls?
As this is only for illustration purposes, I will keep it very simple. Although pollsters do employ much more sophisticated and complex sampling methodologies (stratified sampling techniques), this article will provide the basics of selecting a simple random sample.
There are four elements that we must consider in determining sample size.
- The universe (population) of voters,
- the estimated proportion of voters that will vote for Candidate A,
- the standard error that we are willing to accept (tolerance level), and
- the confidence level of our selected sample.
Now let’s take a closer look at each of the four elements.
Universe – this is the group or list from which our sample will be selected, often called the sampling frame. In this case, it will be the universe of registered voters. This is frequently the place where pollsters get into trouble. A good example was the Dewey vs. Truman presidential race. In that election, pollsters used the telephone directory as the sampling frame. Of course, not everyone had a telephone, and thus a large portion of the country’s voters was not included in the process. The bottom line was that Truman voters were significantly under-sampled.
There are several potential issues to consider. One problem is that not all registered voters will actually vote. A second is that not all selected persons will choose to share their voting preference. A third is the wording of a question, or how that question is verbally delivered to the voter.
Proportion of voters that will vote for a particular candidate – In a binomial situation, a voter obviously has two choices: 1) to vote for or 2) not to vote for a particular candidate. The proportions voting for and not voting for must add to 1. Since we do not know what proportion will vote for Candidate A, we will set that proportion at .5. Why do we do this? Because .5 yields the greatest variability. (Think about it this way: there can be no more diverse situation than 50% on one side and 50% on the other.) Thus, .5 is used because it represents the maximum variability in a binomial situation.
To verify this statement, let’s take a look at all the possibilities:
- .5 × .5 = .25
- .4 × .6 = .24 (and .6 × .4 = .24)
- .3 × .7 = .21 (and .7 × .3 = .21)
- .2 × .8 = .16 (and .8 × .2 = .16)
- .1 × .9 = .09 (and .9 × .1 = .09)
Thus, the maximum possible variability is .25.
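The same check can be run programmatically. This short Python sketch (purely an illustration, not part of any polling toolkit) computes p × q for p from .1 to .9 and confirms that the product peaks at p = .5:

```python
# Compute p * (1 - p) for p = .1 through .9 and see where it peaks.
variability = {p / 10: round(p / 10 * (1 - p / 10), 2) for p in range(1, 10)}
print(variability)
print(max(variability, key=variability.get))  # 0.5 — the most conservative choice
```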
Bound on the error of estimation – This is the place where the statistician must decide how much error can be tolerated. This is called the tolerance level and is very often set at .03 or .05. This is also referred to as the plus or minus.
Confidence level – This is reflected by the number of standard deviations associated with the confidence level. This is usually set at two or three. Two standard deviations (actually 1.96) from the mean in a normal distribution will contain approximately 95% of that distribution while three standard deviations will contain 99% plus of the distribution. The 95% confidence level tends to be the one most frequently used.
The confidence level can be stated in this manner: in repeated sampling, the obtained proportion would fall within the range of the proportion (p) plus or minus the set bound 19 times out of 20 (95% of the time).
The formula that I have always used for determining sample size is:
n = (Npq) / ((N-1)D + pq)
where n is the sample size that we are solving for,
N is the universe (population) of voters,
p is the proportion of voters that will vote for Candidate A, and q = 1-p
and D is the bound on the error of estimation (the accepted plus-or-minus error) squared, divided by the square of the number of standard deviations associated with either a 95 or 99 percent confidence level, i.e., D = b²/z². We will use 2 standard errors, i.e., the 95% confidence level, so mathematically D = b²/2² = b²/4.
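The formula translates directly into a few lines of Python. This is just a sketch of the formula as defined above; the function and argument names are my own, not standard:

```python
def sample_size(N, p, b, z=2):
    """Required simple-random-sample size for estimating a proportion.

    N: universe (population) size
    p: assumed proportion voting for Candidate A (q = 1 - p)
    b: bound on the error of estimation (the plus-or-minus tolerance)
    z: standard deviations for the confidence level (2 ~ 95%, 3 ~ 99%)
    """
    q = 1 - p
    D = b ** 2 / z ** 2              # D = b^2 / z^2
    return (N * p * q) / ((N - 1) * D + p * q)

# The example worked below: 125,000,000 voters, p = .5, tolerance .05
print(round(sample_size(125_000_000, 0.5, 0.05)))  # 400
```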
It really is not as complex as it looks and probably will make more sense using real numbers.
Let’s assume that 125,000,000 people are on the voting rolls. As we do not know what proportion of the voters will vote for Candidate A, we will be very conservative and set p at .5, as this proportion represents the maximum variability in a binomial situation. The D in the formula is calculated by dividing the square of the bound on the error of estimation by the square of the standard deviations from the mean, which is usually set at two (95 percent confidence level) or three (99+ percent confidence level). Let’s set our tolerance at .05 and our confidence level at 95% (two standard deviations from the mean). Now let’s perform the calculation and then explain what it means.
n = (125,000,000 × (.5 × .5)) / ((124,999,999 × ((.05 × .05) / 4)) + (.5 × .5))
n = 31,250,000 / 78,125.25
n ≈ 400
The chart below shows the sample sizes required for the tolerance levels of .01, .02, .03, .04, and .05.
| Tolerance | Sample Size |
|-----------|-------------|
| .05 | 400 |
| .04 | 625 |
| .03 | 1,111 |
| .02 | 2,500 |
| .01 | 10,000 |
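With a universe this large relative to the sample, the (N-1)D term swamps pq and the formula simplifies, for practical purposes, to n ≈ z²pq/b². With z = 2 and p = q = .5, that is simply 1/b², which reproduces the chart. A quick Python check (the simplification is mine, not from the chart):

```python
# With z = 2 and p = q = .5 and a very large N, n ~ z^2 * p * q / b^2 = 1 / b^2.
for b in (0.05, 0.04, 0.03, 0.02, 0.01):
    n = (2 ** 2) * 0.5 * 0.5 / b ** 2
    print(f"{b:.2f}  {round(n):,}")
```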
This chart illustrates that, with a sample size of 400, if Candidate A gets 50 percent in the sample, the actual result will fall somewhere between 45 and 55 percent with 95 percent confidence.
To calculate this tolerance, we use the formula: 2 times the standard deviation, divided by the square root of the sample size, or (2)(.5)/(20) = .05. Note that the standard deviation is always defined as the square root of the variance; here √(.5 × .5) = .5.
If a candidate gets 40 percent in a sample of 400, the tolerance is (2)(.49)/(20) = .049, since √(.4 × .6) ≈ .49. This means that if Candidate A gets 40 percent, the actual result will fall somewhere between 35.1 and 44.9 percent with 95 percent confidence.
We can continue in this fashion for other sample proportions.
At 10 percent, the tolerance for a sample size of 400 would fall to 3%. One can readily see that at the same sample size, the tolerance decreases as the variance decreases.
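These tolerance calculations can be checked with a small helper function (a sketch; `margin_of_error` is my own name, not a standard library function):

```python
import math

def margin_of_error(p, n, z=2):
    """Half-width of the ~95% interval: z * sqrt(p * q) / sqrt(n)."""
    return z * math.sqrt(p * (1 - p)) / math.sqrt(n)

# Sample size of 400, at sample proportions of 50, 40, and 10 percent:
print(round(margin_of_error(0.5, 400), 3))  # 0.05
print(round(margin_of_error(0.4, 400), 3))  # 0.049
print(round(margin_of_error(0.1, 400), 3))  # 0.03
```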
Hopefully, you now have a somewhat better understanding, as I have tried to keep the explanation very simple.
What went so very wrong in 2020?
It boils down to sampling bias or pollster credibility. I would like to think it is the former, caused by a lack of cooperation from the respondents to a poll. Through subject replacement, Trump voters were greatly under-sampled. (Each time a subject fails to respond and is replaced, sampling bias is introduced into the results.) Hopefully, most statisticians are honest. However, I would like to think that a credible statistician would have known the results would be significantly biased when so many subjects refused to respond to their poll. I certainly would like any “reputable” statistician to explain exactly how they handled the replacement component of the sample, and why they proceeded knowing that the results were biased!