Tuesday, April 26, 2011

Inside the information cult (1)

In Canada we're in the final week of a federal election campaign. So far press coverage has been dominated by poll results. Unlike, it seems, most people, I am skeptical of the utility of election poll results, and here I'll explain why.

Polls of people’s opinions can be a useful exercise if they’re done properly and the results are interpreted carefully. The polls that the press publishes, however, often fail to satisfy these criteria (at least in the form in which they are presented in the press). For one thing, they usually ask at most a handful of questions and don’t attempt to assess how meaningful the responses are. The results of this type of poll are not information but pseudo-information. Electoral poll results are obviously uninformative simply because they don’t predict election results accurately enough. Informative data must be valid, and electoral poll data in general are not.

If you consult the excellent PollingReport website you will find a summary of final estimates by eighteen polls of the popular vote in the 2000 United States presidential election. Fourteen of those predictions had George W. Bush winning the popular vote, which in fact was won by Al Gore. Sure, the election was close, but isn’t that when you most need a good prediction? These poll results are not information, but rather some devout information cultists’ simulation of information.

In 2004, 15 of 22 polls predicted that Mr. Bush would win the popular vote, but five still predicted that John Kerry would win (two predicted a saw-off, as did two in 2000). This election wasn’t quite as close as the one in 2000, but the difference between the two candidates’ support was only about 2 percentage points. Even if the polls do give you a good idea of how people are going to vote, sampling error wipes out any utility they may have when a vote is close, which is a lot of the time. Furthermore, why would we expect polls to be all that valid as measures of what the population as a whole intends to do?

First, there’s their questionable sampling to consider. Poll results often come with statements, derived from sampling theory, saying that given the size of the sample polled, the results will be accurate within so many percentage points of the actual percentages 95% or 99% of the time; in Canada the press is required to provide such estimates. However, sampling theory assumes that the samples polled are representative (that is, that they are random samples of the population). That is not true of any political poll.

A random sample is one in which each member of a population has a known probability of appearing. If you draw a simple random sample of 10% of a jar containing 2,000 jelly beans, each jelly bean will, if you draw the sample properly, have a 10% chance of appearing in the sample. However, let’s say that you want to draw a sample of 10% of the members of a club with 2,000 members so that you can ask them (the members of your sample) some questions about the club. All of a sudden you don’t know the probability that each member has of appearing in the sample, for a very simple reason.

The simple reason is that people can refuse to take part in your sample. If you mail them your questionnaire, some will refuse to complete it, others will forget, and some of the procrastinators will never get round to it. Some just won't be interested. The problem is that you can’t tell the people who won’t return the questionnaire from the ones who will.

You will end up drawing a random sample from the population of club members who complete questionnaires. The same is true of samples in political polls. For one thing, most people refuse to take part in political polls. For another, people have to be home to answer the phone before they can consent to take part in the poll. It’s likely that some large subgroups of the population (the young, for example, or the employed) are less likely to be at home than others. Finally, the questions have to be asked in a language the person polled understands; people who can’t understand the language of the poll well enough have to be excluded.

For these and other reasons the sample you get in a political poll is never representative of the population as a whole but rather of that minority of the population that is both able and willing to take part in polls. If that minority thinks like the majority, then your results will apply to the majority as well. If the majority doesn’t think like the minority, then the results won’t apply. The catch, of course, is that you have no idea how closely the thinking of the minority corresponds to the thinking of the majority.

Even if you were able to get a representative sample, you would still have the problem that people sometimes don’t have too accurate an idea of what they’re going to do. Sometimes they change their minds between the time they take part in the poll and the time they actually vote. Sometimes they don’t know how they’re going to vote till they get in the booth. Sometimes they don’t vote. And even if they do know how they’re going to vote, why should we assume that they’ll tell us the truth?

Election polling is a cargo cult practice. We know that examining samples has been a productive practice in science, so we draw a few samples of our own to examine. However, just as the control towers at cargo cult airstrips in Melanesia don't have the crucial operating characteristics of real control towers, the samples drawn in election polls don't have the crucial operating characteristics of samples from which estimates of statistics (the percentage of people likely to vote for a political party, for example) can be reliably derived. And even if they did, the mutability of human intentions would probably keep them inaccurate.

Actual Analysis website

Inside the information cult (1) © 2011, John FitzGerald

Friday, April 8, 2011

Lady Luck is actually very democratic

A commercial for a poker site is advising us that Lady Luck hangs out with the better players. In fact, she demonstrably doesn't.

The only meaningful conception of luck that I'm aware of is the statistical one: I identify luck with the statistical concept of error, which, as we shall see, is well suited to the role. Any result (winning a poker game, for example) can be statistically analyzed as the consequence of an effect (poker-playing skill, say) and error. Error is the sum of all those things that affect the result but aren't related to poker-playing skill — the specific cards you get, how alert you are, and so on.

Error is randomly distributed with a mean of zero (these characteristics follow from the mathematics required to distinguish effects from error). Since the variables that produce the error are not correlated with poker-playing skill, the mean error score for good players is zero, and the mean error score for poor players is zero as well. And after all, there should be nothing about being a good player that makes you more likely to be dealt a pair of aces.
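If you want to see this for yourself, here's a quick sketch in Python. The skill numbers and the error distribution are invented for illustration; nothing here comes from actual poker data. The point is simply that the average error ("luck") is about zero for strong and weak players alike.

```python
# A rough sketch, not real poker data: each result is a skill effect plus random error.
import random

random.seed(1)

def simulate(skill, n_games=100_000):
    """Return (mean error, mean result) over n_games for a player of given skill."""
    errors = [random.gauss(0, 1) for _ in range(n_games)]   # "luck": random, mean 0
    results = [skill + e for e in errors]
    return sum(errors) / n_games, sum(results) / n_games

good_err, good_res = simulate(skill=2.0)   # skill values are made up
poor_err, poor_res = simulate(skill=0.5)

print(f"good players: mean error {good_err:+.3f}, mean result {good_res:.3f}")
print(f"poor players: mean error {poor_err:+.3f}, mean result {poor_res:.3f}")
# Mean error is about zero in both groups; only the skill effect separates the results.
```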

Saturday, April 2, 2011

Accuracy is not enough

Data are not necessarily information. They are informative only to the extent that they reduce uncertainty. If you want to know what programs are on television tonight, knowing yesterday's television schedule will not help you. Yesterday's schedule is full of data, but the data are no longer informative.

In psychometric terms, informative data are those which are valid – which predict events of interest to you. To be valid, data must be accurate; in fact, the validity of data is limited by their accuracy. Inaccurate data cannot be valid at all, and even accurate data face a ceiling: the maximum possible validity of a measure is the square root of its reliability coefficient.
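Here's a rough numerical illustration of that ceiling, using simulated scores rather than any real test: an observed measure is modelled as a true score plus noise, and its correlation with a criterion that tracks the true score perfectly comes out close to the square root of the measure's reliability.

```python
# Sketch of the attenuation ceiling: validity <= sqrt(reliability).
# All numbers are simulated for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

true_score = rng.normal(0, 1, n)      # what we are trying to measure
noise = rng.normal(0, 1, n)           # measurement error
observed = true_score + noise         # the measure we actually have
criterion = true_score                # a criterion perfectly related to the true score

reliability = np.var(true_score) / np.var(observed)      # about 0.5 here
validity = np.corrcoef(observed, criterion)[0, 1]

print(f"reliability       : {reliability:.3f}")
print(f"observed validity : {validity:.3f}")
print(f"sqrt(reliability) : {np.sqrt(reliability):.3f}")  # validity can't exceed this
```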

The minimum validity of accurate data, however, is always zero. Sometimes data are not valid simply because they are distributed in a way ill-suited to the statistics which are used to assess validity; often the distribution can be modified through a mathematical transformation and validity restored. Sometimes the data are simply irrelevant or poorly defined.

In Canada a federal election campaign is under way. As usual the press commentary about it includes frequent presentation of poll results. At the moment the poll results are the unverifiable opinions of the 30% of the population that takes part in polls about what they think they'll be doing a month from now. These people probably differ significantly from people who don't take part in polls. They've probably got more time on their hands for a start, which means they're likely older, better off, and so on. The upshot is that poll results are probably not even accurate estimates of those unverifiable opinions for the population as a whole. If you check the excellent Polling Report website you'll find that American polls have been dependably incompetent at predicting the results of American presidential elections, which are simple two-candidate races. In Canada, with three national parties and a big regional party, they are likely to be even less effective.

Anyway, if you depend on any type of database, it should be checked regularly to ensure not only accuracy but also relevance and utility.

Accuracy is not Enough © 2001, 2011 John FitzGerald


More articles from www.ActualAnalysis.com

Friday, January 14, 2011

The non-confidence interval

The rise of the opinion poll to pre-eminent importance has made us all familiar with statements like "These results are accurate within three percentage points, 95 times out of 100." This is a statement of what is known in sampling theory as a confidence interval.

Usually the result to which this statement refers is an estimate of the percentage of the population holding a certain opinion. The statement of the confidence interval implies that if 100 more samples of the same size had been drawn from the same population, the percentages estimated from 95 of those samples would have been within three percentage points of the percentage in the entire population.
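To make that interpretation concrete, here's a minimal simulation. The population percentage (40%) and the sample size are invented purely for illustration: it draws 100 random samples from a population in which 40% hold an opinion and counts how many sample percentages land within the stated margin.

```python
# Sketch: what "accurate within 3 points, 95 times out of 100" is supposed to mean.
# Population proportion and sample size are invented for illustration.
import math
import random

random.seed(42)

p, n = 0.40, 1000
margin = 1.96 * math.sqrt(p * (1 - p) / n)   # about 0.030, i.e. roughly 3 points

within = 0
for _ in range(100):
    sample = [random.random() < p for _ in range(n)]
    estimate = sum(sample) / n
    if abs(estimate - p) <= margin:
        within += 1

print(f"margin of error: {margin * 100:.1f} points")
print(f"samples within the margin: {within} of 100")   # typically about 95
```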

The mathematics used to reach this conclusion is quite elegant, and the validity of confidence intervals is inescapable as long as certain conditions are met. What is rarely mentioned, though, is that these conditions are not met very often.

First, the sample has to be a random sample from the population. In opinion polling, this assumption is never met. For one thing, most people don't co-operate with poll takers. They hang up the phone, they don't return the questionnaire in the prepaid envelope, they don't stop for the people in malls with the clipboards. At best, polling samples are random samples from that minority of the population which agrees to be sampled.

Second, there is no one standard confidence interval for a poll. The confidence interval varies with the size of the percentage being estimated. This issue is rarely mentioned by researchers of any kind. In general, the confidence interval of a percentage becomes smaller as the percentage differs from 50%. The decrease can be important with smaller samples. For example, an estimate of 50% based on a sample of 200 has a 95% confidence interval of 6.9%, given certain assumptions. An estimate of 80% based on a sample of the same size has a 95% confidence interval of 5.5%.
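Those two figures follow directly from the usual normal-approximation formula for the margin of error of a proportion (which is the assumption behind this sketch); a few lines of Python reproduce them.

```python
# Sketch: the 95% margin of error shrinks as the estimate moves away from 50%.
# Uses the ordinary normal-approximation formula for a proportion.
import math

def margin_95(p, n):
    """95% margin of error, in percentage points, for an estimated proportion p from a sample of n."""
    return 1.96 * math.sqrt(p * (1 - p) / n) * 100

print(f"50% from n=200: +/- {margin_95(0.50, 200):.1f} points")   # about 6.9
print(f"80% from n=200: +/- {margin_95(0.80, 200):.1f} points")   # about 5.5
```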

Finally, the formulas for the confidence interval assume that you are measuring something reliable. These formulas were derived originally for problems in the natural sciences, where the items being sampled have solidity and consistency. In opinion polling, on the other hand, the items being sampled tend to be ethereal and ephemeral. Today I may feel like voting for the Vegetarian Party, but by the time I get behind the screen and pick up the little pencil I may well have decided that yesterday's scandal involving the executive committee of the Vegetarian Party, a seedy restaurant, and fish cakes disguised as tofu has pretty well demonstrated the unfitness of the Vegetarian Party to govern.

At best, a poll, or any similar survey, is a measure of what the minority of people who take part in polls think at a specific hour and minute of a specific day. As a guide to action, polls need to be supplemented by other information about the issues which they investigate.

Originally published at Actual Analysis

The Non-confidence Interval © 1995, John FitzGerald

Thursday, January 13, 2011

The TRUTH about means and medians

One of the many bees I have in my bonnet is buzzing about how people talk about the mean and the median. I just read a research report in which the mean was described as the average and the median was just called the median. It was a pretty good report, so I suspect the author was trying to help her non-statistically-trained readers. I still think this can be misleading, though.

The mean of a set of scores is simply the result of adding them up and dividing by the number of scores. The median, on the other hand, is the score which has equal numbers of other scores above and below it.

The average of a set of scores is its midpoint. That is, the median is the average. The mean is an estimate of the median. We use it because it can be manipulated more effectively and profitably than the median can, usually without affecting the validity of conclusions.

Most people know that skew may make the mean inaccurate. However, most do not know that there are criteria for deciding if use of the mean should be reconsidered. I reconsider if the skewness coefficient (which is provided by spreadsheets as well as statistical software) is greater than 1 or less than -1.
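As an illustration of that rule of thumb, here's a short sketch with invented, right-skewed scores: it computes the skewness coefficient and compares the mean with the median.

```python
# Sketch: check the skewness coefficient before trusting the mean.
# The scores are simulated (right-skewed) purely for illustration.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(7)
scores = rng.lognormal(mean=3.0, sigma=0.8, size=500)   # strongly right-skewed

g1 = skew(scores)
print(f"skewness coefficient: {g1:.2f}")
print(f"mean:   {scores.mean():.1f}")
print(f"median: {np.median(scores):.1f}")

if abs(g1) > 1:
    print("Skew is large; consider reporting the median instead of the mean.")
```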

Above all, do not do what some people do and throw out your highest and lowest scores, or the two highest and two lowest etc., as a protection against skew. While that probably does little harm, it doesn't help either. If you're using your data for descriptive purposes, the best solution is to use the median. That way you get the benefit of all your data.

If you're using the data for inferential purposes, you should of course be using statistical tests. I suggest you compare the result of a test of differences between means with the result of a test of differences between medians (this is often a good idea even with unskewed data). If you're comparing either means or medians without a test, you are wasting your time. You need to know how likely a difference is to happen by accident before you can decide how important it is.
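For example, with two groups of scores you might run an ordinary t-test of means alongside Mood's median test and see whether they point the same way. A minimal sketch, with simulated skewed scores and made-up group sizes, might look like this:

```python
# Sketch: compare a test of mean differences with a test of median differences.
# Group data are simulated, skewed scores invented for illustration.
import numpy as np
from scipy.stats import ttest_ind, median_test

rng = np.random.default_rng(3)
group_a = rng.lognormal(mean=3.0, sigma=0.8, size=60)
group_b = rng.lognormal(mean=3.3, sigma=0.8, size=60)

t_stat, t_p = ttest_ind(group_a, group_b)
m_stat, m_p, grand_median, table = median_test(group_a, group_b)

print(f"t-test of means:        p = {t_p:.3f}")
print(f"Mood's test of medians: p = {m_p:.3f}")
# If the two tests disagree, the skew (or a few extreme scores) is probably driving the means.
```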

I should note that there are occasions when discarding data before calculating the mean can be useful; however, these are occasions that are best handled by people trained to deal with them. If you ever find yourself having to estimate the location parameter of a Cauchy distribution, discarding data from the tails before calculating a mean can be helpful, but even more helpful is having someone who's been trained to do it, and does it a lot, do it for you.
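Here's a hedged illustration of why trimming matters in that case, with simulated data and an arbitrary 20% trimming proportion: the ordinary mean of Cauchy samples can bounce around wildly from sample to sample, while a trimmed mean or the median stays close to the true location.

```python
# Sketch: estimating the location of a Cauchy distribution (true location = 0).
# Simulated data; the 20% trimming proportion is an arbitrary choice for illustration.
import numpy as np
from scipy.stats import trim_mean

rng = np.random.default_rng(11)

for _ in range(3):
    sample = rng.standard_cauchy(1000)   # heavy-tailed, no finite mean
    print(f"mean {np.mean(sample):8.2f}   "
          f"20% trimmed mean {trim_mean(sample, 0.2):6.2f}   "
          f"median {np.median(sample):6.2f}")
# The plain mean can be thrown far off by a few extreme values;
# the trimmed mean and median stay near zero.
```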

Main site

The Truth about Means and Medians © 2011, John FitzGerald

Friday, December 17, 2010

The future belongs to the Swiss Federal Institute of Technology

The Swiss Federal Institute of Technology in Zurich has a plan to simulate the entire world. According to this report the plan is to "gather data about the planet in unheard of detail, use it to simulate the behaviour of entire economies and then to predict and prevent crises from emerging."

Back when I was being trained in research, this type of project was always held up to us as an example of what not to do. Unfortunately, many people today believe that if you collect massive amounts of data the Truth will emerge from it. Dream on.

There are many problems with this approach. One of the chief ones is that the process is entirely inductive. Relationships are identified in data from the past and then extrapolated to the future. These relationships may hold in the future, or they may not. And if they do hold in the future, they may not hold forever.

I often think that it would help decision-makers if they spent some time handicapping horse races. Believe me, if you see that horses on the rail have been winning all week, that is no guarantee that they are going to win for you today. There is no reason they should.

The basic problem here is the lack of a theoretical approach. Ordinarily you start with a theory, or at least a hypothesis, about how something in the world works and then you collect the data necessary to establish whether your theory or hypothesis stands up to empirical test. If you confirm the theory you then try to modify it to increase its explanatory power. What the Swiss Federal Institute of Technology is trying to do is to get the data to think up the theory for them. But data don't have brains. The theories they come up with are going to lead you down a lot of paths that go nowhere. Any dataset is full of relationships, many — and sometimes all — of which are spurious, products of random variation or of systematic bias. Spurious relationships are not a foundation of successful forecasting.
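A small simulation makes the point about spurious relationships. Everything in this sketch is pure random noise, invented for illustration: screen enough unrelated variables and some of them will look significantly related by chance alone.

```python
# Sketch: purely random "variables" still yield apparently significant relationships.
# Everything here is noise; any correlation found is spurious by construction.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2011)
n_obs, n_vars = 100, 50
data = rng.normal(size=(n_obs, n_vars))     # 50 unrelated random variables

spurious = 0
pairs = 0
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        r, p = pearsonr(data[:, i], data[:, j])
        pairs += 1
        if p < 0.05:
            spurious += 1

print(f"{spurious} of {pairs} pairs look 'significant' at p < .05")
# With 1,225 pairs of unrelated variables, roughly 5% will pass by chance alone.
```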

Another important problem is that typically the relationships you find between the variables of the type that are going to be collected for this project are weak. Most of the variation in them is due to the effects of other variables that you usually don't have measures of. That means that predictions from your model will be at best only grossly approximate. Among other things, the developers of this model of everything want to predict economic bubbles and collapses. However, since these predictions are almost certain to be only grossly approximate, they will offer little guide to policy. If you remember last winter, you remember the extremes governments went to after an H1N1 pandemic was predicted, and how unnecessarily expensive (and ineffective) they were.

But the European Union is sinking a billion euros into this venture. I'd wish them good luck, but I'm confident that even with the best luck possible this project is going to fail, and fail miserably.

Link to the first in a series of related articles at the main site

The future belongs to the Swiss Federal Institute of Technology © 2010, John FitzGerald