Tuesday, January 31, 2012

The raw database

When you construct an analytical database, you're better off analyzing your data after they're in the database rather than before. That is, the data in your database are most useful if they are what are known as raw data – individual scores rather than summary data like statistics (such as percentages) or ranges (such as age ranges).

For example, let's consider a database which consists of the names of twenty cities and their unemployment rates (which are, of course, percentages). If you want to work out the unemployment rate for all the cities or for a subset of them, you can't, because you don't know how many people are in the labour force in each city. If, however, the database consists of the names of the cities, the number of people in the labour force in each city, and the number of unemployed in each city, you can easily work out those figures as well as any you could have worked out with the other database.
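
To make the arithmetic concrete, here is a short Python sketch with invented figures for three hypothetical cities. The point is simply that the combined rate has to be computed from the raw counts; averaging the city percentages gives a different, and wrong, answer.

```python
# Invented figures for three hypothetical cities: (labour force, unemployed).
cities = {
    "Anytown":  (50_000, 2_500),    # 5.0% unemployment
    "Bigville": (400_000, 36_000),  # 9.0% unemployment
    "Smallton": (10_000, 400),      # 4.0% unemployment
}

total_labour = sum(lf for lf, _ in cities.values())
total_unemployed = sum(u for _, u in cities.values())

combined_rate = 100 * total_unemployed / total_labour
naive_average = sum(100 * u / lf for lf, u in cities.values()) / len(cities)

print(f"Rate computed from raw counts: {combined_rate:.1f}%")  # about 8.5%
print(f"Average of the percentages:    {naive_average:.1f}%")  # 6.0%
```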

That example is a simple one for illustrative purposes, but problems like the one in the example are not rare. Databases constructed with range data rather than raw data are also common. Often, for example, people's ages are entered according to an arbitrary range into which they fall – a 28-year-old might be entered as a 25-to-34-year-old. You can discover useful relationships with data like that, but you can also miss relationships that you would find if you entered the actual ages. If you entered the actual ages you would still be able to investigate your age categories, as well as alternatives to them which might be more useful.
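
For instance, here is a small sketch using pandas (my choice for illustration) with invented ages. Because the exact ages are stored, the original 25-to-34-style categories can be rebuilt whenever they are wanted, and so can any alternative scheme.

```python
import pandas as pd

ages = pd.Series([22, 28, 31, 36, 44, 51, 58, 63])  # invented ages

# The original scheme (15-24, 25-34, ...) is easy to recover from raw ages...
original = pd.cut(ages, bins=[15, 24, 34, 44, 54, 64],
                  labels=["15-24", "25-34", "35-44", "45-54", "55-64"])

# ...and so is any alternative scheme that later turns out to be more useful.
decades = pd.cut(ages, bins=[19, 29, 39, 49, 59, 69],
                 labels=["20s", "30s", "40s", "50s", "60s"])

print(pd.DataFrame({"age": ages, "original": original, "decades": decades}))
```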

A database of raw data is a much more powerful analytical tool than one of summary data. In compiling a database of summary data you are essentially drawing conclusions about the nature of the data before they have even been entered. Keeping your options open is much the better strategy.

The Raw Database © 2000, John FitzGerald

Originally published at ActualAnalysis.com

Tuesday, January 17, 2012

How the British Geological Survey overcame bad data management - after 165 years

I have gone on at length in this blog and on its parent website about how data aren't informative until they're organized in some useful way. The first step in organizing them is making them accessible - putting them in a database, giving each record a unique identifier, and structuring them so that you can use them in the way you intended.

In 1846, Joseph Hooker, a botanist, collected 314 slides of botanical samples for the British Geological Survey. Then he had to rush off on a trip to the Himalayas and didn't get around to entering the samples in the specimen register. In April, 2011, Howard Falcon-Lang, a paleontologist, was poking around in a cabinet in a dark corner of the BGS and found the drawers of Hooker's slides. He pulled one out, shone his flashlight on it, and read the label "C. Darwin, Esq." (Click here for a news report.)

It turned out that Hooker's slides were from Darwin's expedition on the Beagle, and that Dr. Falcon-Lang was apparently the first person in 165 years to recognize what they were. Dr. Falcon-Lang expects that examination of the samples will contribute to contemporary science. Imagine what contemporary science would be like, though, if these samples had been examined in the 1840s and 50s.

The data most of us use aren't likely to be as significant as Darwin's, but if we can't use them they're as useless as Darwin's were for 165 years. A big problem I have run into with databases is that some data just don't get entered. People entering data omit fields they consider unimportant or too difficult to collect. Often this ends up producing huge amounts of missing data, especially when data are being entered from the field by several people, and if huge quantities of data are missing the data are useless. If you want to use the data, ensure that a complete record must be entered. If you're not going to use the data, don't collect them. If you don't collect unnecessary data you'll probably make fewer errors in entering the necessary data.
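
As a rough illustration of requiring complete records, here is one way a completeness check might look in Python. The field names are invented; the idea is just that a record with blank required fields never makes it into the database.

```python
# Hypothetical required fields for a specimen database.
REQUIRED_FIELDS = ["specimen_id", "collector", "date_collected", "location"]

def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or blank in a record."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]

record = {"specimen_id": "BGS-0314", "collector": "J. Hooker", "date_collected": ""}
problems = missing_fields(record)
if problems:
    print("Record rejected; missing fields:", problems)  # date_collected, location
```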

And don't enter summaries of data. For example, enter people's exact ages, not an age range. If you use pre-defined age ranges you may end up with all the ages clumped in one or two categories, which severely limits the analysis you can do. If you enter the exact age, you can define age ranges whose categories have roughly equal numbers of people in them, which makes it easier to find differences between the categories (click here, here, and here for more about these issues).
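
Here is a small pandas sketch of that last point, again with invented ages. The qcut function builds categories with roughly equal numbers of people in them, which is only possible if the exact ages were recorded.

```python
import pandas as pd

ages = pd.Series([21, 23, 24, 27, 29, 33, 35, 41, 48, 52, 57, 66])  # invented

equal_counts = pd.qcut(ages, q=4)                # four quartile-based age groups
print(equal_counts.value_counts().sort_index())  # three people in each group
```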

Similarly, instead of a test score or a rating, enter the individual test and rating items. First of all, that makes it easier to clean the data - to find erroneously recorded items or scores. More importantly, it gives you the ability to assess the adequacy of the test or rating (a PDF you can download from my website describes some things you can do with ratings; click here for the PDF).
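
As a sketch of what item-level entry buys you, consider a hypothetical five-item rating scored 1 to 5: impossible values stand out at the item level, and the summary score can still be computed whenever it is needed.

```python
import pandas as pd

# Invented responses to a five-item rating, each item scored 1 to 5.
items = pd.DataFrame({
    "item1": [4, 5, 3],
    "item2": [3, 7, 4],   # 7 is out of range - easy to catch at the item level
    "item3": [5, 4, 4],
    "item4": [2, 3, 5],
    "item5": [4, 4, 3],
})

out_of_range = (items < 1) | (items > 5)
print("Records with suspect items:")
print(items[out_of_range.any(axis=1)])

items["total"] = items.sum(axis=1)  # the summary score, derived only when needed
```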

And enter the data in a format appropriate for the type of analysis you want to do. Most statistical packages, for example, want records entered as rows.
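
For example, the layout most packages expect looks something like this (the variables and values are invented): one row per case, one column per variable.

```python
import pandas as pd

data = pd.DataFrame([
    {"respondent": 1, "age": 34, "city": "Anytown",  "satisfaction": 4},
    {"respondent": 2, "age": 51, "city": "Bigville", "satisfaction": 2},
    {"respondent": 3, "age": 28, "city": "Smallton", "satisfaction": 5},
])
print(data)  # one row per respondent, ready for most statistical packages
```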

Data will only talk to you if you care for them. Be nice to your data. Only collect the ones you need, and treat the ones you need with respect.

Monday, January 9, 2012

Effect and cause as a clue to the meaning of science

The December 16, 2011, issue of WIRED has a piece by Jonah Lehrer called "Trials and Errors: Why Science is Failing Us" (click here to read it). Mr. Lehrer's argument seems to be that some phenomena are too complex for scientific method to be able to discover what causes them. In his conclusion he writes:
And yet, we must never forget that our causal beliefs are defined by their limitations. For too long, we’ve pretended that the old problem of causality can be cured by our shiny new knowledge. If only we devote more resources to research or dissect the system at a more fundamental level or search for ever more subtle correlations, we can discover how it all works. But a cause is not a fact, and it never will be; the things we can see will always be bracketed by what we cannot. And this is why, even when we know everything about everything, we’ll still be telling stories about why it happened. It’s mystery all the way down.
The comments following the piece do a good job of pointing out the flaws in the reasoning by which Mr. Lehrer reaches this conclusion. However, one issue is omitted. That issue is that science is not about causes.

Science is about effects. At its simplest, an effect is a non-random relationship between two variables. Scientific experimentation investigates effects by varying one of the variables (the independent variable) and seeing what happens to the other variable (the dependent variable). The goal is to explain the effect - that is, become more effective in predicting the dependent variable. This model can be expanded to handle large numbers of variables. For example, one of the things I do in evaluating satisfaction with a program is to investigate simultaneously the relative importance of several variables in accounting for satisfaction. What you typically find when you do this correctly is that only a few of the variables have any relationship to satisfaction. What you often find, too, is that the variables that account for satisfaction are different from the reasons participants report when asked why they like the program.
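
As an illustration of that kind of analysis, here is a sketch using ordinary least squares regression on simulated data. Regression is one common way of weighing several variables at once, not necessarily the method used in any particular evaluation, and the predictor names are invented. Only one of the three simulated predictors actually drives satisfaction, and the output makes that plain.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

# Simulated predictors: only staff helpfulness actually affects satisfaction.
staff_helpfulness = rng.normal(size=n)
waiting_time = rng.normal(size=n)
brochure_quality = rng.normal(size=n)
satisfaction = 0.8 * staff_helpfulness + rng.normal(scale=0.5, size=n)

X = sm.add_constant(np.column_stack([staff_helpfulness, waiting_time, brochure_quality]))
model = sm.OLS(satisfaction, X).fit()
print(model.summary(xname=["const", "staff_helpfulness",
                           "waiting_time", "brochure_quality"]))
# Typically only one or two coefficients stand out; the rest hover near zero.
```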

The methods I use are correlational, so they cannot attribute causation. What they tell you is that as one thing varies, so does another. Furthermore, the analyses of satisfaction I do are non-experimental, so I can't even be sure that the estimates of the correlations are all that exact. What I can do, though, is make a recommendation that changes be made to see if dealing with the variables identified by the data analysis will improve satisfaction.

The same considerations apply to a lot of health research, and that alone goes a long way toward accounting for the examples Mr. Lehrer adduces. What health researchers do is develop their own recommendations for further research that will test whether their conclusions are correct. In fact, the supposed failure Mr. Lehrer describes is a demonstration of the success of science - a hypothesis was developed from prior research to test whether a drug was effective, and the test failed to find evidence that it was effective. That failure by itself is informative - it tells us not to prescribe the drug.

One of the commenters at the link above (urgelt) goes into the issue of the adequacy of research in more detail. My post of January 5 (click here) provides another example of this type of difficulty. What is clear is that error is inherent in the process of scientific experimentation, and that the foundation of scientific method includes a recognition that error is inherent. Reports of statistical analysis of research results typically include many estimates of the error involved in the relationships estimated by the statistical techniques.
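
As a small example of the kind of error estimate that routinely accompanies a reported relationship, here is an approximate 95% confidence interval for a correlation, computed with the Fisher z-transformation; the r and n are invented.

```python
import numpy as np

r, n = 0.32, 150                   # an invented correlation and sample size

z = np.arctanh(r)                  # Fisher z-transform of the correlation
se = 1 / np.sqrt(n - 3)            # approximate standard error of z
low, high = np.tanh([z - 1.96 * se, z + 1.96 * se])
print(f"r = {r:.2f}, 95% CI roughly ({low:.2f}, {high:.2f})")
```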

As for Mr. Lehrer's remarks about the mythical nature of causes, scientific method has long allowed explanatory variables that have no real existence (intelligence, for example, cannot be directly measured but only inferred from behaviour). Variables like this are called explanatory fictions. The reason they are allowed is that the point of science is to explain an effect, not to find out what its actual cause is. If a fictional variable can explain the effect where something tangible and real can't, so much the better. Furthermore, even a small improvement in accuracy of prediction will often produce large benefits. Obviously, something which improves accuracy only a small amount is unlikely to be a cause in any meaningful sense, but it can still play an important role in practice.

Complex systems often frustrate scientific research simply because there are so many potential effects to examine, not because scientists are naive about the nature of causes, which anyway they aren't looking for. Mr. Lehrer freely acknowledges that science has been spectacularly successful with some complex systems (the health of large populations, for example), so concluding that its failures with other complex systems mean that science has failed to solve the problem of causation is not only questionable and hasty but irrelevant as well.

I am confident that the scientific research of 100 years from now will be superior to today's research. I am also confident that the reason for its superiority will not be that it has solved the problem of causation.

Website
Twitter

Research, cause, and effect © 2012, John FitzGerald

Why information overload is a myth

Everybody’s heard of information overload – a Google search I just did for information overload (in quotation marks) produced over 4 million results. In fact, though, it is data we are overloaded with, not information.

Information consists only of data that reduce uncertainty. A weather forecast is only informative if it predicts the weather accurately. If it doesn't predict the weather accurately, we could end up leaving our umbrellas at home on rainy days. Similarly, if we base corporate decisions on data that don’t predict the results we want to achieve, we could end up being embarrassed and out of pocket.

As the Schumpeter blog in the Economist said on December 31: “As communication grows ever easier, the important thing is detecting whispers of useful information in a howling hurricane of noise.” It’s that overload of noise we must fear.

How do you reduce an overload of noise?
  • By not collecting data that are irrelevant to the decisions you make.
  • By not collecting data that are nearly identical to informative data you already collect.
  • By not collecting more data than you need.
  • By not combining pieces of information in ways which produce an uninformative total score (by weighting them, for example).

But how do you avoid doing these things? Chiefly by analysing your data with sound statistical methods. For example, you can estimate the relevance of data to a decision with methods like the correlation coefficient. You can use principal components analysis to find variables that are telling you the same story. You can use sampling theory to decide how much data you need to collect. You can use psychometric analysis to combine pieces of information into a single score effectively. The battle against uninformative data has not been won, but you can win that part of it that takes place in your office.
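
Here is a minimal sketch of two of those checks on simulated data: a correlation coefficient to gauge relevance, and a principal components analysis (done here with a plain eigendecomposition) to spot variables that are telling the same story. The variable names and numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
signal = rng.normal(size=n)
outcome = signal + rng.normal(scale=0.8, size=n)    # the result we care about
duplicate = signal + rng.normal(scale=0.1, size=n)  # nearly the same as signal
noise = rng.normal(size=n)                          # irrelevant to the outcome

# Relevance: correlation of each candidate variable with the outcome.
for name, var in [("signal", signal), ("duplicate", duplicate), ("noise", noise)]:
    print(name, round(float(np.corrcoef(var, outcome)[0, 1]), 2))

# Redundancy: a PCA of the standardized predictors. Two components account for
# nearly all the variance - a hint that one of the three variables adds little.
X = np.column_stack([signal, duplicate, noise])
X = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
print("Proportion of variance by component:", np.round(eigvals / eigvals.sum(), 2))
```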

Website
Twitter

Why Information Overload is a Myth © 2012, John FitzGerald

Thursday, January 5, 2012

Cognitive decline research: Questions the CBC didn't ask

Today's CBC news report (click here) of a study of cognitive decline is pretty standard science reporting. I'm sure that other news sources provided much the same story. Anyway, it confines itself to reporting the results the researchers reported, results which were fairly stated.

However, there is other information the CBC might have provided, but didn't. First, it didn't provide a link to the study (or hadn't when I posted a comment asking for one). I, for one, was interested in learning what "a 3.6% decline in mental reasoning" was. Does a decline of that size have a serious effect on people's functioning?

So I looked for the link and found it (here). It's an open access article that can be downloaded free in a PDF. The article doesn't provide a quick answer to the question of how serious the declines observed are, but the authors do suggest that further attention might be paid to people whose declines are greater than the mean in the study. That suggests to me the mean declines are not that serious, although I readily admit I may be reading something into the authors' suggestion that isn't there.

What I also found, though, is that the researchers did not control for health. Since older people tend to be less healthy, were these cognitive declines due to changes in brain function or to the fatigue resulting from poor health? Information about medical risk factors was collected, but the article does not report that it was incorporated in the statistical analyses.

None of this is intended to question the adequacy of the research. Being able to carry out a rigorous study of over 10,000 people for 24 years is proof enough of the researchers' competence. What this is intended to question is the value of a news report that simply reports results without examining them. I'm sure that if the researchers had been asked about the relationship of the health information they collected to cognitive decline they could have explained it fully. I'm sure, too, that if the journalists had been asking questions like that they would have enjoyed their jobs more.