Thursday, January 13, 2011

The TRUTH about means and medians

One of the many bees I have in my bonnet is buzzing about how people talk about the mean and the median. I just read a research report in which the mean was described as the average and the median was just called the median. It was a pretty good report, so I suspect the author was trying to help her non-statistically-trained readers. I still think this can be misleading, though.

The mean of a set of scores is simply the result of adding them up and dividing by the number of scores. The median, on the other hand, is the score which has equal numbers of other scores above and below it.
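For readers who prefer to see the arithmetic, here is a minimal sketch of the two definitions using Python's standard library. The scores are hypothetical, invented for illustration, and include one extreme value so the two statistics land far apart.

```python
from statistics import mean, median

# Hypothetical scores, invented for illustration; 95 is a deliberate extreme value.
scores = [2, 3, 3, 4, 5, 6, 95]

score_mean = mean(scores)      # sum of the scores divided by the number of scores
score_median = median(scores)  # middle score once the scores are sorted

print("mean:", round(score_mean, 2))
print("median:", score_median)
```

Three scores fall below the median and three above it, while the single extreme score pulls the mean well above every score but one.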

The average of a set of scores is its midpoint. That is, the median is the average. The mean is an estimate of the median. We use it because it can be manipulated more effectively and profitably than the median can, usually without affecting the validity of conclusions.

Most people know that skew may make the mean inaccurate. However, most do not know that there are criteria for deciding if use of the mean should be reconsidered. I reconsider if the skewness coefficient (which is provided by spreadsheets as well as statistical software) is greater than 1 or less than -1.
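As a rough sketch of that rule of thumb, the code below computes the adjusted Fisher-Pearson skewness coefficient, which is the formula spreadsheet SKEW functions commonly use, and flags a set of scores when the coefficient falls outside the range -1 to 1. The function names and the data are hypothetical, chosen only for illustration.

```python
from math import sqrt

def sample_skewness(xs):
    """Adjusted Fisher-Pearson skewness coefficient (needs at least 3 scores)."""
    n = len(xs)
    m = sum(xs) / n
    s = sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))  # sample standard deviation
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in xs)

def mean_needs_second_look(xs, cutoff=1.0):
    """True when the skewness coefficient is above cutoff or below -cutoff."""
    return abs(sample_skewness(xs)) > cutoff

# Hypothetical data sets, invented for illustration.
symmetric = [1, 2, 3, 4, 5]
skewed = [1, 2, 2, 3, 3, 4, 30]

for name, data in (("symmetric", symmetric), ("skewed", skewed)):
    print(name, round(sample_skewness(data), 2), mean_needs_second_look(data))
```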

Above all, do not do what some people do and throw out your highest and lowest scores, or the two highest and two lowest etc., as a protection against skew. While that probably does little harm, it doesn't help either. If you're using your data for descriptive purposes, the best solution is to use the median. That way you get the benefit of all your data.

If you're using the data for inferential purposes, you should of course be using statistical tests. I suggest you compare the result of a test of differences between means with the result of a test of differences between medians (this is often a good idea even with unskewed data). If you're comparing either means or medians without a test, you are wasting your time. You need to know how likely a difference is to happen by accident before you can decide how important it is.
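One way to run the two tests side by side, sketched here under stated assumptions, is a permutation test in which the same procedure is applied once to the difference between means and once to the difference between medians. This is a swapped-in technique chosen because it needs only the standard library; it is not necessarily the test the post has in mind. The two groups are hypothetical.

```python
import random
from statistics import mean, median

def permutation_p_value(a, b, stat, n_perm=5000, seed=0):
    """Two-sided permutation p-value for the difference in `stat` between a and b."""
    rng = random.Random(seed)          # fixed seed so the result is repeatable
    observed = abs(stat(a) - stat(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)            # reassign scores to the two groups at random
        diff = abs(stat(pooled[:len(a)]) - stat(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical groups, invented for illustration; group_b has one extreme score.
group_a = [12, 14, 15, 15, 16, 18, 19, 21]
group_b = [22, 24, 25, 26, 27, 29, 30, 95]

p_mean = permutation_p_value(group_a, group_b, mean)
p_median = permutation_p_value(group_a, group_b, median)
print("difference between means:   p =", round(p_mean, 4))
print("difference between medians: p =", round(p_median, 4))
```

If the two p-values point to different conclusions, that disagreement is itself a sign that skew or outliers deserve a closer look.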

I should note that there are occasions when discarding data before calculating the mean can be useful; however, these are occasions that are best handled by people trained to deal with them. If you ever find yourself having to estimate the location parameter of a Cauchy distribution, discarding data from the tails before calculating a mean can be helpful, but even more helpful is having someone do it who's been trained to do it and does it a lot.

Main site

The Truth about Means and Medians © 2011, John FitzGerald

Friday, December 17, 2010

The future belongs to the Swiss Federal Institute of Technology

The Swiss Federal Institute of Technology in Zurich has a plan to simulate the entire world. According to this report the plan is to "gather data about the planet in unheard of detail, use it to simulate the behaviour of entire economies and then to predict and prevent crises from emerging."

Back when I was being trained in research, this type of project was always held up to us as an example of what not to do. Unfortunately, many people today believe that if you collect massive amounts of data the Truth will emerge from it. Dream on.

There are many problems with this approach. One of the chief ones is that the process is entirely inductive. Relations are identified in data from the past, and then extrapolated to the future. These relationships may hold in the future, or they may not. And if they do hold in the future, they may not hold forever.

I often think that it would help decision-makers if they spent some time handicapping horse races. Believe me, if you see that horses on the rail have been winning all week, that is no guarantee that they are going to win for you today. There is no reason they should.

The basic problem here is the lack of a theoretical approach. Ordinarily you start with a theory, or at least a hypothesis, about how something in the world works and then you collect the data necessary to establish whether your theory or hypothesis stands up to empirical test. If you confirm the theory you then try to modify it to increase its explanatory power. What the Swiss Federal Institute of Technology is trying to do is to get the data to think up the theory for them. But data don't have brains. The theories they come up with are going to lead you down a lot of paths that go nowhere. Any dataset is full of relationships, many — and sometimes all — of which are spurious, products of random variation or of systematic bias. Spurious relationships are not a foundation of successful forecasting.
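The point about spurious relationships can be illustrated with a small simulation. The sketch below generates variables that are pure random noise, so no real relationship exists between any of them, and then counts how many pairs nevertheless pass a conventional .05 significance cutoff. The number of variables, the number of cases, and the seed are arbitrary assumptions.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(2010)        # fixed seed so the run is repeatable
N_VARIABLES, N_CASES = 20, 30    # arbitrary sizes, assumed for illustration
CRITICAL_R = 0.361               # two-tailed .05 critical value of r for 30 cases (28 df)

# Every variable is independent random noise.
variables = [[rng.gauss(0, 1) for _ in range(N_CASES)] for _ in range(N_VARIABLES)]

pairs = [(i, j) for i in range(N_VARIABLES) for j in range(i + 1, N_VARIABLES)]
spurious = [(i, j) for i, j in pairs
            if abs(pearson_r(variables[i], variables[j])) > CRITICAL_R]

print(len(pairs), "pairs tested;", len(spurious), "pass the .05 cutoff by chance alone")
```

With 190 pairs and a .05 cutoff, roughly nine or ten "significant" correlations are expected from noise alone, and none of them would be any use for forecasting.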

Another important problem is that typically the relationships you find between the variables of the type that are going to be collected for this project are weak. Most of the variation in them is due to the effects of other variables that you usually don't have measures of. That means that predictions from your model will be at best only grossly approximate. Among other things, the developers of this model of everything want to predict economic bubbles and collapses. However, since these predictions are almost certain to be only grossly approximate, they will offer little guide to policy. If you remember last winter, you remember the extremes governments went to after an H1N1 pandemic was predicted, and how unnecessarily expensive (and ineffective) they were.

But the European Union is sinking a billion euros into this venture. I'd wish them good luck, but I'm confident that even with the best luck possible this project is going to fail, and fail miserably.

Link to the first in a series of related articles at the main site

The future belongs to the Swiss Federal Institute of Technology © 2010, John FitzGerald

Friday, December 3, 2010

Gross national happiness

Canada is apparently considering joining the group of countries that assess their gross national happiness. Happiness is one of those concepts that has always interested me because a) so many people think it's extremely important, and b) so few people even attempt to define it. Love is a similar concept.

In fact the definition of gross national happiness is vague. The project seems chiefly to be an attempt to link population characteristics to feelings of well-being. Why you'd want to do that is a mystery to me. Sure, they've found that countries with low rates of infant mortality have happier citizens, but surely we don't justify fighting infant mortality as necessary to keep the public happy.

Similarly, an assessment of the adequacy of a country's economy has been proposed as an indicator of happiness, but if people are happy with an unsound economy are we to take that as a Good Thing? That approach doesn't seem to have worked too well in the United States, where many people were astonishingly proud of their economy until it came crashing down a few years ago.

I'm certain — or at least I'd like to be certain — that our government doesn't intend the assessment of happiness to be a guide to policy. If that's their intent, they should logically end up doing things like legalizing marijuana — that makes lots of people happy. If they don't intend it to be a guide to policy (and there's no good reason they should), there's no good reason to assess national happiness at all.

Gross national happiness © 2010, John FitzGerald

Actual Analysis website

Thursday, November 11, 2010

Lightning, Lotteries, and Probability

People often claim you have more chance of being struck by lightning than of winning the lottery. The argument appears to be that one Canadian in 5 million is struck by lightning every year, while your chances of winning the standard 6/49 lottery are about one in 14 million, and one in 5 million is a higher probability than one in 14 million. However, this reasoning is unsound.

The problem is that these two probabilities are not comparable. The estimate of the probability of being hit by lightning is an empirical one, derived from observation, and applies to an entire year's worth of thunderstorms. The estimate of the probability of winning the lottery is a mathematical one, derived from a formula which applies to a single drawing of the lottery.
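The formula in question is the number of ways of choosing 6 numbers from 49. A minimal sketch of the calculation, assuming a single ticket in a single drawing:

```python
from math import comb

# Number of distinct 6-number selections from 49; one of them wins the jackpot.
combinations = comb(49, 6)
p_jackpot = 1 / combinations

print(combinations)  # 13983816, i.e. about one in 14 million
```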

We could derive from the first estimate the probability of being struck by lightning at the time the lottery number is drawn, which would provide a fairer comparison (and one which would favour the lottery), but the more important issue is why we would want to do that. The frequency of an event relative to electrocution by lightning is not a standard of worth. For example, the probability that an individual Canadian will become prime minister in the next year is lower than the probability that he or she will be struck by lightning, but no one would conclude that that difference in probabilities tells us anything about the value of the Canadian political system.
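To see why putting the two figures on the same time basis favours the lottery, here is a rough sketch. It assumes 104 drawings a year (two a week) and assumes the risk of lightning is spread evenly across the year; neither assumption comes from the figures quoted above, and the second is plainly unrealistic, so treat the numbers as an illustration of the reasoning only.

```python
from math import comb

P_LIGHTNING_PER_YEAR = 1 / 5_000_000   # annual figure quoted in the post
DRAWS_PER_YEAR = 104                   # assumption: two drawings a week

p_jackpot_per_draw = 1 / comb(49, 6)

# Approximate share of the annual lightning risk that falls in one drawing period,
# assuming the risk is uniform over the year (an assumption, not an observed fact).
p_lightning_per_draw = P_LIGHTNING_PER_YEAR / DRAWS_PER_YEAR

# Alternatively, scale the lottery up to a year: one ticket in every drawing.
p_jackpot_per_year = 1 - (1 - p_jackpot_per_draw) ** DRAWS_PER_YEAR

print("per drawing:", p_jackpot_per_draw, "vs", p_lightning_per_draw)
print("per year:   ", p_jackpot_per_year, "vs", P_LIGHTNING_PER_YEAR)
```

On either basis, under these assumptions, the jackpot is the more likely event, which is the reverse of the popular claim.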

More articles at the Actual Analysis site

Lightning, Lotteries, and Probability © 2001, John FitzGerald

Tuesday, November 9, 2010

Mayor of all Toronto except part of it

In yesterday's post I came up with some hypotheses about the vote in the recent Toronto mayoral election. Since then I've refined them a bit and tested them.

I simplified them by reducing the independent variables to two – section of the city and household income, and by hypothesizing only about the vote for the winner, Rob Ford. Hypothesizing about all three major candidates just complicates analysis, and examination of the effects of the independent variables on their votes could be done post hoc to elucidate the effects on Mr. Ford's vote.

I had originally planned to analyze the results by subdivision, but that increased the power of the statistical test so much that almost any difference would have been statistically significant. So I analyzed the results by ward; that decision gave me a nice little sample of 44.

Income was defined as the quartile in which median household income in the ward fell. The sections of the city were the outer suburbs (those wards for whom the city limits were part of their land boundaries), the inner suburbs (other wards outside the old City of Toronto as it was before amalgamation in 1998), east Toronto (roughly the old City of Toronto east of Yonge St.), and west Toronto (roughly the old City of Toronto west of Yonge St.).

So my new null hypotheses were that Mr. Ford's vote would be affected by neither of the independent variables. I was hoping, though, that it would be affected by the section of the city but not by income. Specifically, I was hoping his vote would be highest in the outer suburbs.

Mr. Ford's vote was not correlated with the total vote in a ward (r = .25; p > .05), so I didn't correct for differences in the number of votes (if they had been correlated, I would have removed the effect of total votes with regression analysis and analyzed the residual vote).
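The reported result can be checked by hand with the usual t test for a correlation coefficient. The sketch below uses only the figures given above (r = .25, 44 wards) plus a critical value taken from standard t tables; it is a consistency check, not a reconstruction of the original analysis.

```python
from math import sqrt

r = 0.25   # correlation reported in the post
n = 44     # number of wards
df = n - 2

# t statistic for testing whether a correlation differs from zero.
t = r * sqrt(df) / sqrt(1 - r ** 2)

T_CRITICAL = 2.018  # two-tailed .05 critical value of t for 42 df, from standard tables
significant = abs(t) > T_CRITICAL

print(round(t, 2), significant)
```

The statistic comes out at about 1.67, short of the critical value, which agrees with the reported p > .05.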

My hopes were dashed. A two-way analysis found that Mr. Ford did do best in the outer suburbs, but not significantly better than in the inner suburbs. The big difference was between the pre-1998 City of Toronto and the rest of the current city. Mr. Ford won 31% of the vote in the old City of Toronto, and 59% elsewhere.

This analysis also found a weak effect of income, but further analysis suggested this was an artefact of random variation in the number of votes cast. Analysis of the residual vote I described earlier found no differences related to median household income.

Analysis of Mr. Smitherman's and Mr. Pantalone's votes confirmed they were the candidates of the pre-1998 City of Toronto. They did better there (and Mr. Smitherman did better only in east Toronto). Ward income was not related to the votes they received.

In general, then, different sections of the city voted differently but income had little if anything to do with the results. Mr. Smitherman, the chief competitor for Mr. Ford, failed to appeal outside the oldest part of the city. Perhaps another popular explanation of the results is correct – Mr. Ford just ran by far the best campaign.

Monday, November 8, 2010

Mayoral strongholds

Torontonians seem to have concluded about their recent mayoral election that the winner was the candidate of the suburbs. I thought a little more detail might help. Here we will look at the wards in which his support, and the support for the other two major candidates, was the strongest.

I did some exploratory analysis examining the percentages each candidate won of the vote in subdivisions, then confirmed it with sorts of the percentages of votes cast in each ward. I came up with four hypotheses I will be testing further:

1. Mr. Ford's support was strongest on the outskirts of the city. His support was strongest in wards 1, 2, 4, 31, and 49, all of which are pretty far from City Hall. All have the city limits as a boundary. Mr. Ford won 67% or more of the vote in these wards.

2. Joe Pantalone was the candidate of the west end of the old City of Toronto. His strongholds -- wards 14, 17, 18, and 19 -- clustered together in the west end. Mr. Pantalone took 20% or more of the vote in these wards.

3. George Smitherman was the candidate of money. Mr. Smitherman had strong support in both Forest Hill (wards 21 and 22) and Rosedale (ward 28).

4. Mr. Smitherman was also the candidate of the east end of the old City of Toronto. His support was strong in wards 30 and 32, which lie side by side along the eastern harbour and the lake. Mr. Smitherman took 50% or more of the vote in the wards in which he was strongest.

As I said, these are just hypotheses so far. I'll be souping up my data file, and then I'll be testing these hypotheses. More soon.

Actual Analysis website

Tuesday, November 2, 2010

Religion and mayoral choice in ward 26

People have been speculating about the effect of religion in the Toronto mayoral election a week ago. The idea is that members of some religions would be less likely to vote for George Smitherman, who is gay and married to another man.

We saw in the last post that voters in Jewish neighbourhoods in Ward 21 were in fact most likely to vote for Mr. Smitherman instead of the other candidates. In this post we'll look at a Muslim neighbourhood, Thorncliffe Park in Ward 26.

As it turned out, Mr. Smitherman did finish second in the polls in Thorncliffe Park. However, he finished first in the rest of the ward. He received 33% of the vote in Thorncliffe Park, and 44% in the rest of the ward. A powerful chi-square test finds this difference to be significant, while the weaker median test I described in the last post doesn't. However, the powerful test estimated an infinitesimal probability that the difference was random, and the weak test estimated that the probability was less than .09, so I'm considering this difference statistically significant.
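For readers unfamiliar with the chi-square test, here is a minimal sketch for a 2 x 2 table (Smitherman vote versus all other votes, Thorncliffe Park versus the rest of the ward). The vote counts are hypothetical: the post reports only the percentages, so the counts below were invented to reproduce 33% and 44% at two different scales, and the results are not the actual test results.

```python
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) and p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = erfc(sqrt(chi2 / 2))  # upper tail of the chi-square distribution with 1 df
    return chi2, p

# Hypothetical counts: (Smitherman, others) in Thorncliffe Park, then in the rest of the ward.
chi2_large, p_large = chi_square_2x2(330, 670, 4400, 5600)  # thousands of votes
chi2_small, p_small = chi_square_2x2(33, 67, 44, 56)        # same percentages, 100 votes each

print("large counts:", round(chi2_large, 2), p_large)
print("small counts:", round(chi2_small, 2), round(p_small, 3))
```

The same eleven-point gap is overwhelmingly significant with thousands of votes and not significant with a hundred per group, which is why the size of the counts matters as much as the percentages.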

However, of the eleven percentage points that went missing for Mr. Smitherman in Thorncliffe Park, Mr. Ford picked up only four. Most of the vote Mr. Smitherman lost went to three candidates with Muslim names, none of whom, however, made an issue of their being Muslim. One was an anti-poverty advocate, another a civil-rights advocate (and not the kind that thinks civil rights mean other people should shut up about their -- the advocate's -- religion), and one has campaigned before as an anti-unemployment candidate. They could simply have been taking a greater part in the public life of Thorncliffe Park than the other candidates.

As I concluded before, if religion affected the mayoral vote, it was probably weakly, and in interaction with other variables.

Main Actual Analysis site