Friday, January 14, 2011

The non-confidence interval

The rise of the opinion poll to pre-eminent importance has made us all familiar with statements like "These results are accurate within three percentage points, 95 times out of 100." This is a statement of what is known in sampling theory as a confidence interval.

Usually the result to which this statement refers is an estimate of the percentage of the population holding a certain opinion. The statement of the confidence interval implies that if 100 more samples of the same size had been drawn from the same population, we would expect the percentages estimated from about 95 of those samples to be within three percentage points of the percentage in the entire population.
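To make the "95 times out of 100" idea concrete, here is a rough simulation sketch. It assumes the ideal case the formulas rely on: a truly random sample and an opinion that does not change. The true percentage (50%), the sample size (1,067, chosen because it gives roughly a three-point margin at 50%) and the function name are my own illustrative assumptions, not figures from any actual poll.

```python
import random

def share_within_margin(true_pct=50.0, sample_size=1067, margin=3.0,
                        n_samples=1000, seed=1):
    """Fraction of simulated polls whose estimate lands within `margin`
    percentage points of the true population percentage."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    p = true_pct / 100.0
    hits = 0
    for _ in range(n_samples):
        # One idealized poll: every respondent is drawn at random and answers.
        yes = sum(rng.random() < p for _ in range(sample_size))
        estimate = 100.0 * yes / sample_size
        if abs(estimate - true_pct) <= margin:
            hits += 1
    return hits / n_samples

# Under these idealized assumptions the share should come out near 0.95.
print(share_within_margin())
```

The point of the sketch is only that the claim holds when the assumptions hold; it says nothing about what happens when they don't.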

The mathematics used to reach this conclusion is quite elegant, and the validity of confidence intervals is inescapable as long as certain conditions are met. What rarely is mentioned, though, is that these conditions are not met very often.

First, the sample has to be a random sample from the population. In opinion polling, this assumption is never met. For one thing, most people don't co-operate with poll takers. They hang up the phone, they don't return the questionnaire in the prepaid envelope, they don't stop for the people in malls with the clipboards. At best, polling samples are random samples from that minority of the population which agrees to be sampled.

Second, there is no one standard confidence interval for a poll. The confidence interval varies with the size of the percentage being estimated. This issue is rarely mentioned by researchers of any kind. In general, the confidence interval of a percentage becomes narrower as the percentage moves away from 50% in either direction. The decrease can be important with smaller samples. For example, an estimate of 50% based on a sample of 200 has a 95% confidence interval of plus or minus 6.9 percentage points, given certain assumptions. An estimate of 80% based on a sample of the same size has a 95% confidence interval of plus or minus 5.5 percentage points.
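The two figures above can be reproduced with the usual normal-approximation formula for a proportion, which is one plausible reading of "certain assumptions" (simple random sampling, a large population, and a multiplier of 1.96 for 95%). The function name below is just an illustrative choice.

```python
import math

Z_95 = 1.96  # normal-approximation multiplier for a 95% confidence level

def margin_of_error(pct, sample_size, z=Z_95):
    """Half-width of the confidence interval, in percentage points,
    assuming a simple random sample and the normal approximation."""
    p = pct / 100.0
    return 100.0 * z * math.sqrt(p * (1.0 - p) / sample_size)

for pct in (50, 80):
    print(pct, round(margin_of_error(pct, 200), 1))
```

Because p(1 - p) is largest at 50%, the margin shrinks as the estimate moves toward 0% or 100%.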

Finally, the formulas for the confidence interval assume that you are measuring something reliable. These formulas were derived originally for problems in the natural sciences, where the items being sampled have solidity and consistency. In opinion polling, on the other hand, the items being sampled tend to be ethereal and ephemeral. Today I may feel like voting for the Vegetarian Party, but by the time I get behind the screen and pick up the little pencil I may well have decided that yesterday's scandal involving the executive committee of the Vegetarian Party, a seedy restaurant, and fish cakes disguised as tofu has pretty well demonstrated the unfitness of the Vegetarian Party to govern.

At best, a poll, or any similar survey, is a measure of what the minority of people who take part in polls think at a specific hour and minute of a specific day. As a guide to action, polls need to be supplemented by other information about the issues which they investigate.

Originally published at Actual Analysis

The Non-confidence Interval © 1995, John FitzGerald

Thursday, January 13, 2011

The TRUTH about means and medians

One of the many bees I have in my bonnet is buzzing about how people talk about the mean and the median. I just read a research report in which the mean was described as the average and the median was just called the median. It was a pretty good report, so I suspect the author was trying to help her non-statistically-trained readers. I still think this can be misleading, though.

The mean of a set of scores is simply the result of adding them up and dividing by the number of scores. The median, on the other hand, is the score which has equal numbers of other scores above and below it.
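A small illustration of the two definitions, using Python's standard statistics module. The scores are made up, and the last one is a deliberate outlier.

```python
import statistics

scores = [2, 3, 3, 4, 5, 6, 40]  # made-up scores; 40 is a deliberate outlier

# Mean: add the scores up and divide by how many there are.
mean_score = statistics.mean(scores)

# Median: the score with equal numbers of other scores above and below it.
median_score = statistics.median(scores)

print(mean_score, median_score)
```

With these numbers the single large score pulls the mean well above the median, which stays in the middle of the bulk of the scores.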

The average of a set of scores is its midpoint. That is, the median is the average. The mean is an estimate of the median. We use it because it can be manipulated more effectively and profitably than the median can, usually without affecting the validity of conclusions.

Most people know that skew may make the mean inaccurate. However, most do not know that there are criteria for deciding if use of the mean should be reconsidered. I reconsider if the skewness coefficient (which is provided by spreadsheets as well as statistical software) is greater than 1 or less than -1.
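For readers without a spreadsheet handy, here is a sketch of the check. It uses the adjusted sample skewness formula, which as far as I know is the one spreadsheet SKEW functions report; statistical packages may use a slightly different version, so it is worth checking which one yours gives you. The function names and the example scores are my own.

```python
import statistics

def sample_skewness(scores):
    """Adjusted sample skewness:
    n / ((n - 1)(n - 2)) * sum(((x - mean) / sd) ** 3)."""
    n = len(scores)
    if n < 3:
        raise ValueError("need at least three scores")
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation (divides by n - 1)
    if sd == 0:
        raise ValueError("all scores are identical")
    cubed = sum(((x - mean) / sd) ** 3 for x in scores)
    return n / ((n - 1) * (n - 2)) * cubed

def mean_needs_second_look(scores, threshold=1.0):
    """True if skewness is greater than 1 or less than -1."""
    return abs(sample_skewness(scores)) > threshold

print(mean_needs_second_look([1, 2, 3, 4, 5]))   # symmetric scores
print(mean_needs_second_look([1, 1, 1, 1, 10]))  # one score far out in the tail
```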

Above all, do not do what some people do and throw out your highest and lowest scores, or the two highest and two lowest etc., as a protection against skew. While that probably does little harm, it doesn't help either. If you're using your data for descriptive purposes, the best solution is to use the median. That way you get the benefit of all your data.

If you're using the data for inferential purposes, you should of course be using statistical tests. I suggest you compare the result of a test of differences between means with the result of a test of differences between medians (this is often a good idea even with unskewed data). If you're comparing either means or medians without a test, you are wasting your time. You need to know how likely a difference is to happen by accident before you can decide how important it is.
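As a sketch of what such a comparison might look like, here is a simple permutation test run once on means and once on medians. I have not named a particular test above, so the permutation test is my own stand-in, chosen because it needs nothing beyond the standard library; the two groups of scores are invented, and the function name is hypothetical.

```python
import random
import statistics

def permutation_p_value(a, b, stat, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in `stat` between groups.
    Estimates how often a difference at least as large as the observed one
    turns up when the group labels are shuffled at random."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(stat(a) - stat(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(stat(pooled[:len(a)]) - stat(pooled[len(a):]))
        if diff >= observed:
            count += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (count + 1) / (n_perm + 1)

# Invented scores; group_a contains one extreme value.
group_a = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]
group_b = [18, 19, 20, 21, 22, 22, 23, 24, 25, 26]

print("means:  ", permutation_p_value(group_a, group_b, statistics.mean))
print("medians:", permutation_p_value(group_a, group_b, statistics.median))
```

If the two results point in different directions, that is a sign that skew or a few extreme scores are driving one of them, and the conclusion deserves a closer look.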

I should note that there are occasions when discarding data before calculating the mean can be useful; however, these are occasions that are best handled by people trained to deal with them. If you ever find yourself having to estimate the location parameter of a Cauchy distribution, discarding data from the tails before calculating a mean can be helpful, but even more helpful is having it done by someone who's been trained to do it and does it a lot.

The Truth about Means and Medians © 2011, John FitzGerald