Friday, December 16, 2011

Secrets of the truth cult!!

For much of their history human beings have taken part in rituals in which an authority informs other people of what is supposed to be the Truth. I call this the pulpit model of information. For centuries Europeans went to church and an authority got up in the pulpit and told them what to believe about the world (and other places).

This model was later adopted by the schools, no doubt because the schools were established by churches. Whatever the reason, schooling until recently consisted of listening to an authority tell you what to believe about the world (in universities, it still often consists of this). In school, though, you were even tested to make sure you’d learned the approved view of things.

In school you also acquire the idea that Truth is something that can be found on the printed page. Consequently we come to accept as true anything that has been published, without verifying that it is.

It’s not surprising that we come to look on the truth as something that is dispensed by authorities. Consequently, we look around for people who look like authorities, and treat what they say as information. Furthermore, we treat the methods they use to come up with things to say as methods that can be used to define information. We are often wrong.

Given the track record of authorities (remember all those biological weapons that, according to authorities, Iraq was just itching to use against the West?), depending on them to tell us the truth is a questionable approach. Another problem with this approach is that there is considerable doubt as to whether we need to know the truth, anyway.

Here’s something that’s true: Churchill, Manitoba, is named for John Churchill, third governor of the Hudson’s Bay Company. That’s a fact. Despite being a fact, though, it doesn’t help me get served when I drop in to the local branch of his company.

Every day we are bombarded with truths. The newspaper tells us things like what the temperature was yesterday in Beijing and what celebrities have (or had) their birthdays today. I remember once reading in the paper that it was the late Alfred Hitchcock’s birthday and thinking “I can’t really send him a card, can I?”

Better than mere truth is information. Information is confused with many things that are not informative, though.

Facts, as we have just seen, are not necessarily informative. Unless I’ve made a bet about what the high temperature in Beijing was going to be, that fact cannot be said to inform me of anything.

Furthermore, many items of information are not factual. The idea of intelligence, for example, cannot be said to be a fact, since there is widespread disagreement about just what intelligence is. However, the concept of intelligence is informative because in speculating about it we discover useful things. We have even discovered some of the shortcomings of the idea of intelligence.

As we have also seen, authoritative statements are not necessarily informative. Another reason they're not necessarily informative is that they disagree with each other. In fact, many authorities work according to decision models which encourage disagreement as a way of establishing crucial issues that need to be tested. Courts of English law, for example, require two or more highly trained professionals to argue for exactly opposite points.

People also often assume that a logically valid argument is informative. However, it need not be. We can reason with flawless logic and still be wrong.

Deductive reasoning starts with a general premise or principle. It then applies that premise to a specific piece of evidence and draws a conclusion about that piece of evidence. For example, we might reason like this:

  • All Canadians are British subjects. (general principle)
  • John FitzGerald is a Canadian. (evidence)
  • Therefore, John FitzGerald is a British subject. (conclusion)
Well, that conclusion is true. However, let’s suppose we reason like this:
  • All Canadians have French first names.
  • John FitzGerald’s first name is not French.
  • Therefore, John FitzGerald is not a Canadian.
That conclusion is not true, although the reasoning is entirely valid. Since my first name is not French, the conclusion that I am not Canadian follows logically from the general principle that all Canadians have French first names. The problem, of course, is that the general principle is wrong. Consequently, statements that follow logically from it are likely to be wrong. That example is a bit artificial, but people reason validly from erroneous premises all the time.
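The distinction can be checked mechanically. The following is a minimal sketch, not anything from the original article: it treats the second argument as a statement about one person, with two yes/no facts (is Canadian, has a French first name), and tests every combination of those facts. The function names `implies` and `is_valid` are made up for illustration.

```python
from itertools import product

def implies(p, q):
    """Material implication: 'if p then q'."""
    return (not p) or q

def is_valid(premises, conclusion):
    """A form is valid if no combination of facts makes every premise
    true while making the conclusion false."""
    for is_canadian, has_french_name in product([True, False], repeat=2):
        premises_hold = all(p(is_canadian, has_french_name) for p in premises)
        if premises_hold and not conclusion(is_canadian, has_french_name):
            return False
    return True

premises = [
    lambda c, f: implies(c, f),  # all Canadians have French first names
    lambda c, f: not f,          # this person's first name is not French
]
conclusion = lambda c, f: not c   # therefore this person is not Canadian

form_is_valid = is_valid(premises, conclusion)

# The facts as the article states them: Canadian, first name not French.
actual_facts = (True, False)
general_principle_holds = premises[0](*actual_facts)

print("form is valid:", form_is_valid)
print("general principle true of the author:", general_principle_holds)
```

The check passes the form and fails the premise, which is the article's point: the logic is fine, and the conclusion is still wrong.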

For example, many people reasoned out thoroughly logical arguments that on January 1, 2000 computer failures would throw the world into chaos. Their beliefs were serious, because they acted on them. They stockpiled food, for example, they bought portable electric generators, and some even created fortified shelters to protect themselves from people who hadn’t stockpiled food or bought generators.

As we saw on January 1, 2000, though, the computers didn’t fail. Some of the premises in those thoroughly logical arguments had been unsound. Logic is a tool. Logic does not guarantee that your arguments will stand up any more than a hammer guarantees that the bookcase you build with it will stand up.

Information is often confused with consensus. The supposed existence of a consensus among scientists about global warming is taken to imply that the consensus opinion is highly likely to be true. Well, a hundred years ago a consensus of scientists would have told you that other races were inferior to whites.

The issue of consensus about global warming seems to have been raised initially as a red herring. That is, people argued against taking action against global warming because there was no scientific consensus about what caused it.

However, consensus has nothing to do with it. At one time there was a scientific consensus that the sun revolved around the earth. That point seems to have escaped the people who are opposed to taking action against global warming, though. Now they complain that this consensus they considered so desirable is being forced on them.

What is informative about an idea is its ability to predict events. The chief value of consensus seems to be coming up with a plan that everyone, or at least everyone important, is willing to go along with. To me, that seems a lot like what lemmings do.

Information cannot be defined by its source. If an expert meteorologist says tomorrow will be sunny, clouds don’t decide to go somewhere else just because a respected source says they will. Information is defined by its effect. Information increases the probability that we will act in effective ways. If it never rains on days when the weather forecast calls for rain, you’re going to end up lugging around a useless umbrella. If it always rains on days your bunions hurt, though, your bunions are a mine of information.

The Truth Cult © 2007, John FitzGerald

Tuesday, December 13, 2011

Another dubious sports statistic

I believe it is against Canadian law for a televised hockey game to be completed without the announcer mentioning, somewhere amid his (sic) endless recitation of players' hometowns, that getting the first goal is all-important, since the team that gets the first goal wins such a high percentage of games.

This belief seems to have come from a study of all major league baseball games between 1966 and 1987 which found that 66% of the games were won by the team that scored first. That’s an interesting finding because in baseball the visiting team is more likely to score first (since it bats first). However, the home team was still more likely to win, so the importance of the first run was still questionable. In 1998 Tom Ruane published an article in which he showed that teams scoring the first run were less likely to win than teams who were the first to score each of the second through ninth runs. The first run, it seemed, was actually the least important run to score. You may be asking how that can be. How can a run associated with 66% of victories be unimportant?

The reason it’s unimportant is most likely that the winning team scores more runs than the losing team. Consequently, it’s more likely to score the first run. So even if scoring the first run has no effect on the chances of winning a game, the winning team is still more likely to score the first run.
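A small simulation illustrates the argument. This is a sketch under made-up assumptions, not data from any real league: two evenly matched teams, exactly five goals per game (an odd number, so there are no ties), and every goal decided by an independent coin flip, so the first goal counts for no more than any other.

```python
import random

def winner_share_by_goal(n_games=100_000, goals_per_game=5, seed=1):
    """For each goal position, the fraction of simulated games in which
    that goal was scored by the eventual winner."""
    rng = random.Random(seed)
    hits = [0] * goals_per_game
    for _ in range(n_games):
        # Each goal goes to team 0 or team 1 with equal probability.
        scorers = [rng.randrange(2) for _ in range(goals_per_game)]
        # Team 1 wins if it scored more than half of the goals.
        winner = 1 if sum(scorers) > goals_per_game / 2 else 0
        for position, scorer in enumerate(scorers):
            if scorer == winner:
                hits[position] += 1
    return [h / n_games for h in hits]

shares = winner_share_by_goal()
for position, share in enumerate(shares, start=1):
    print(f"goal {position}: scored by the eventual winner in {share:.1%} of games")
```

Under these assumptions the eventual winner scores the first goal in well over half the games, and scores the second, third, fourth and fifth goals about equally often. The first goal looks important only because it is a goal.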

To examine this possibility I chose data from another sport in which teams don’t alternate offensive and defensive sessions. I collected scores from 110 National Hockey League games played from November 30, 2006 to December 14, 2006. I included games settled by shootout, but gave no credit to the winning team for the goal awarded for the shootout. The team scoring the first goal won 70% of these games (77 of the 110). However, the winning team also scored 68% of the goals (439 of 649). So, if scoring the first goal did not improve a team’s chances of winning a game, you’d still expect the winning team to score the first goal in 68% of the games, or 75 games. The improvement here is all of two percentage points.
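The arithmetic in that paragraph can be reproduced in a few lines. The sketch below uses only the counts quoted above; the variable names are mine.

```python
games = 110
first_goal_wins = 77   # games won by the team that scored first
winner_goals = 439     # goals scored by winning teams
total_goals = 649      # all goals, shootout goals excluded

first_goal_win_rate = first_goal_wins / games
winner_goal_share = winner_goals / total_goals

# Games the first scorer would be expected to win if the first goal
# were no different from any other goal.
expected_from_rounded_share = round(0.68 * games)
expected_from_exact_share = winner_goal_share * games

print(f"first scorer won {first_goal_win_rate:.0%} of games")
print(f"winners scored {winner_goal_share:.1%} of goals")
print(f"expected wins using the rounded 68%: {expected_from_rounded_share}")
print(f"expected wins using the exact share: {expected_from_exact_share:.1f}")
```

Working from the rounded 68% gives the 75 games quoted above; working from the unrounded share gives a figure slightly lower, which does not change the conclusion.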

But is it an improvement? You can’t reasonably expect that teams scoring 68% of the goals will necessarily score the first goal in exactly 68% of the games. Other factors have some effect on the outcome, so you’d expect them to score first in a number of games around 75. Fortunately, we can estimate the probability that:
  • if scoring the first goal does not increase a team’s chances of winning and
  • if winning teams score 68% of the goals then
  • the team scoring the first goal will win 77 games.

That probability is 44%. Conventional standards of statistical significance would reject the idea that the first goal is of any importance when the percentage is that high. However, arguing that the probability of the difference being real is still greater than 50% is entirely reasonable. But if we look at the difference that way, we still have to conclude that there is only a 56% chance that scoring the first goal increased the likelihood of winning a game, and that if it did increase the probability of winning a game, it increased it by only 2 percentage points (aka one chance in 50). Either way, that first goal doesn’t seem all that important.
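One way to estimate a probability of this kind is an exact binomial tail. The sketch below is an assumption on my part about method: it treats the 110 games as independent, uses the unrounded goal share as the chance that the first scorer wins, and asks for the probability of 77 or more such wins. The article does not say which test produced its figure, so the number this sketch prints need not match it.

```python
from math import comb

def binomial_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )

games = 110
first_goal_wins = 77
winner_goal_share = 439 / 649  # chance of a first-scorer win under the "no effect" assumption

p_at_least_77 = binomial_tail(first_goal_wins, games, winner_goal_share)
print(f"P(at least {first_goal_wins} of {games}) = {p_at_least_77:.3f}")
```

Whatever the exact method, a probability of this size is far above conventional significance thresholds, which is all the argument needs.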

I propose an alternative to the Law of the All-Important First Goal/Run. I modestly call it FitzGerald's Law: the first team to score the winning goal will win. My law has as much explanatory value as the Law of the Fatal First Goal/Run, but is logically more elegant. It also reminds me of another statistical topic which baffles me: why, in a baseball game which finishes with a score of 11-10, can the player who drove in the first run for the winning team get credit for the game-winning RBI? Hm?

Another dubious sports statistic © 1995, 2006 John FitzGerald

More articles at ActualAnalysis.com

Friday, December 9, 2011

Better living through multiple linear regression analysis

I probably say somewhere on the main site that multiple regression analysis is overused, and indeed it is. Nevertheless, it does have valuable uses which I don't want to frighten people away from, so here's an article about one of them.

I regularly use regression analysis to clarify for a client the factors affecting satisfaction with training and rehabilitation programs the client offers. It started with a review of a program about which the client knew that the more enthusiastic consumers were about the program on entry, the more satisfied they were at the end. The question was whether final satisfaction or dissatisfaction with the programs was simply a self-fulfilling prophecy – did consumers say they were satisfied or dissatisfied with the programs simply to justify their initial attitudes?

The client also collected information about consumers' opinions of various characteristics of their programs. This information was not correlated with initial attitude, nor were different types of this information correlated with each other. It was therefore easy, using multiple linear regression analysis, to estimate what proportion of final satisfaction could be explained by initial attitude toward the programs, and then see if characteristics of the programs explained the remainder of the final satisfaction (the residual, as it's known in regression analysis). It turned out that characteristics of the programs were twice as important as initial attitude in determining satisfaction with the programs.
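The two-step logic can be sketched with synthetic data. Nothing below comes from the client's data: the scores are randomly generated, the variable names are hypothetical, and the coefficients are chosen so that the program characteristic accounts for about twice as much variance as initial attitude. The predictors are generated independently, which mirrors the lack of correlation described above and is what makes the simple sequential approach reasonable.

```python
import random
from statistics import fmean

def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def sum_sq(values):
    """Sum of squared deviations from the mean."""
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)

rng = random.Random(0)
n = 5000
# Synthetic, uncorrelated predictors (made up for illustration).
attitude = [rng.gauss(0, 1) for _ in range(n)]
program = [rng.gauss(0, 1) for _ in range(n)]
satisfaction = [a + 2 ** 0.5 * p + rng.gauss(0, 1)
                for a, p in zip(attitude, program)]

total = sum_sq(satisfaction)

# Step 1: how much of final satisfaction does initial attitude explain?
slope, intercept = fit_line(attitude, satisfaction)
residual = [s - (intercept + slope * a) for a, s in zip(attitude, satisfaction)]
share_attitude = 1 - sum_sq(residual) / total

# Step 2: how much of what is left does the program characteristic explain?
slope2, intercept2 = fit_line(program, residual)
leftover = [r - (intercept2 + slope2 * p) for p, r in zip(program, residual)]
share_program = (sum_sq(residual) - sum_sq(leftover)) / total

print(f"explained by initial attitude: {share_attitude:.0%}")
print(f"explained by program characteristic: {share_program:.0%}")
```

With correlated predictors the order of the steps would change the answer, which is one reason a real analysis needs more care than this sketch takes.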

So not only did multiple linear regression analysis determine that satisfaction with the programs was not a self-fulfilling prophecy, it also estimated the relative importance of initial attitude and of the actual characteristics of the programs. The analysis was made easier by the lack of correlation between the different types of information collected, but correlated information can be analyzed with more complicated designs. The possible existence of correlation, though, is the chief reason you shouldn't try this at home. Statistical and database software make it easy to do multiple linear regression analysis, but if you don't know how to deal with correlated variables or how to identify outliers (extreme observations which distort the results), you'll often get the wrong results when you use that software.

We have since gone on to use this technique to determine whether what consumers say are the important factors in determining their satisfaction are in fact the most important. We have frequently found that a simple count of the most popular explanations is contradicted by the multiple linear regression analysis. This is not surprising, since counting explanations, even if they are valid, gives us only a very rough estimate of the importance of different factors. The multiple linear regression analysis clarifies the issue.

Of course, it is also important that you use a proper hypothesis-testing design. Just turning multiple linear regression loose on a set of data is almost certain to produce a large proportion of unhelpful or misleading results.

Better Living through Multiple Linear Regression Analysis © 1999, 2011 John FitzGerald

Monday, December 5, 2011

The myth of information technology

The term information technology implies to many people that the technology to which it refers creates information, transmits it, or stores it. The technologies we group together as information technology, however, rarely perform any of these functions. They are called information technology because they use information, not because they transmit it. A cellphone, for example, converts coded electrical signals into a facsimile of a person speaking. What the person is saying, though, may be balderdash.

Information consists of data which reduce uncertainty. The technology which we refer to as information technology is blithely unaware of whether the data it deals with reduce uncertainty or not.

The data provided by "information technology" may not be informative simply because they are irrelevant. For example, if I go looking for the box score of a particular baseball game in the newspaper, the other box scores, informative as they are, simply make it more difficult for me to find the one I'm interested in. These days, though, people use their information technology to collect large amounts of data which are of no relevance to the decision they're going to make.

Then again the data may not be informative because they are not intelligible. While Turkish newspapers are informative to Turks, they are not informative to me, because I don't speak Turkish. I deal with this problem by not subscribing to Turkish newspapers. However, people often use their information technology to collect large amounts of data which they can no more interpret than I can interpret Turkish newspapers. Turning data mining software loose on the data is not guaranteed to turn it into information, either, for reasons which are discussed in other articles on the main site.

The fact is that we make information, not technology. Even those rare items of software which perform analytical functions were created by human minds. Most of what we call information technology is actually nothing more than data technology. It gives us the capacity to collect large masses of data, but it is up to us to find or define the information in it.

Few people believe everything they read in the newspaper or see on television. Few believe that every item that appears in the newspaper or on television is relevant to their concerns. It's time for the same discernment to be shown in dealing with databases.

We hear a lot these days about the problem of information overload. In fact, it is data we are overloaded with, not information. If we set out to collect data, we will drown in data. If we condescend, though, to use our analytical abilities, and set off in search of the data that we need, we will find that we can never be overloaded with information.