Tuesday, January 17, 2012

How the British Geological Survey overcame bad data management - after 165 years

I have gone on at length in this blog and on its parent website about how data aren't informative until they're organized in some useful way. The first step in organizing them is making them accessible - putting them in a database, giving them a unique identifier, and organizing them so that you can use them in the way you intended.

In 1846, Joseph Hooker, a botanist, collected 314 slides of botanical samples for the British Geological Survey. Then he had to rush off on a trip to the Himalayas and didn't get around to entering the samples in the specimen register. In April 2011, Howard Falcon-Lang, a paleontologist, was poking around in a cabinet in a dark corner of the BGS and found the drawers of Hooker's slides. He pulled one out, shone his flashlight on it, and read the label "C. Darwin, Esq." (Click here for a news report.)

It turned out that Hooker's slides were from Darwin's expedition on the Beagle, and that Dr. Falcon-Lang was apparently the first person in 165 years to recognize what they were. Dr. Falcon-Lang expects that examination of the samples will contribute to contemporary science. Imagine what contemporary science would be like, though, if these samples had been examined in the 1840s and 1850s.

The data most of us use aren't likely to be as significant as Darwin's, but if we can't use them they're as useless as Darwin's were for 165 years. A big problem I have run into with databases is that some data just don't get entered. People entering data omit fields they consider unimportant or too difficult to collect. This often produces huge amounts of missing data, especially when several people are entering data from the field, and if enough of the data are missing the rest are useless. If you want to use the data, set up data entry so that only complete records are accepted. If you aren't going to use the data, don't collect them. If you don't collect unnecessary data you'll probably make fewer errors in entering the necessary data.
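
As a rough illustration of refusing incomplete records at entry time, here is a minimal Python sketch. The field names (specimen_id, collector, collected_on, site) and the function names are hypothetical, chosen only for this example; a real database would more likely enforce the same rule with required columns or form validation.

```python
# Hypothetical required fields; replace them with the fields your own analysis needs.
REQUIRED_FIELDS = ("specimen_id", "collector", "collected_on", "site")


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or blank in a record (a dict)."""
    missing = []
    for field in required:
        value = record.get(field)
        # Treat None, empty strings, and whitespace-only strings as missing.
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing


def accept(record, required=REQUIRED_FIELDS):
    """Return the record if it is complete; otherwise raise instead of storing gaps."""
    gaps = missing_fields(record, required)
    if gaps:
        raise ValueError("incomplete record, missing: " + ", ".join(gaps))
    return record
```

Calling accept() on each record before it is saved means the person entering the data finds out about a gap while the source is still in front of them, not months later at analysis time.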

And don't enter summaries of data. For example, enter people's exact ages, not an age range. If you use pre-defined age ranges you may end up with all the ages clumped in one or two categories, which severely limits the analysis you can do. If you enter the exact age, you can define age ranges whose categories have roughly equal numbers of people in them, which makes it easier to find differences between the categories (click here, here, and here for more about these issues).
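
To make the age example concrete, here is a small Python sketch of deriving roughly equal-sized age categories from exact ages after the fact. It assumes the ages are plain numbers in a list; the function name equal_count_bins is my own, and the cut points come from the standard library's statistics.quantiles, which is only one of several reasonable ways to choose them.

```python
from bisect import bisect_right
from statistics import quantiles


def equal_count_bins(ages, n_bins=3):
    """Split exact ages into n_bins categories of roughly equal size.

    Returns (cuts, labels): the n_bins - 1 cut points and, for each age,
    the index (0 .. n_bins - 1) of the category it falls in.
    """
    cuts = quantiles(ages, n=n_bins)  # cut points estimated from the data themselves
    labels = [bisect_right(cuts, age) for age in ages]
    return cuts, labels


# Illustrative ages only, not real data.
ages = [21, 22, 23, 24, 25, 26, 27, 28, 29]
cuts, labels = equal_count_bins(ages, n_bins=3)
```

With heavily tied ages the categories will not come out exactly equal, but none of this is possible at all if only a pre-defined range was recorded.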

Similarly, instead of a test score or a rating, enter the individual test and rating items. First of all, that makes it easier to clean the data - to find erroneously recorded items or scores. More importantly, it gives you the ability to assess the adequacy of the test or rating (a PDF you can download from my website describes some things you can do with ratings; click here for the PDF).
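
Here is a hedged Python sketch of two things item-level data make possible. The first function flags item values outside the allowed rating range, which is one simple cleaning check. The second computes Cronbach's alpha, a common internal-consistency measure; I am using it only as one example of assessing a test, not as the method described in the PDF. The function names and the layout of the data (one list of item scores per respondent) are assumptions.

```python
from statistics import pvariance


def out_of_range(rows, low, high):
    """Return (respondent_index, item_index, value) for values outside [low, high]."""
    return [
        (r, i, value)
        for r, row in enumerate(rows)
        for i, value in enumerate(row)
        if not low <= value <= high
    ]


def cronbach_alpha(rows):
    """Cronbach's alpha for rows of item scores (one row per respondent).

    Assumes every row has the same number of items (at least two) and that
    total scores are not all identical.
    """
    k = len(rows[0])
    item_variances = [pvariance([row[i] for row in rows]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
```

Neither check can be run if only the total score was entered, because the total hides which item was mistyped and how the items relate to each other.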

And enter the data in a format appropriate for the type of analysis you want to do. Most statistical packages, for example, want records entered as rows.
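
If the data were collected with one list per variable, a short Python sketch like the one below can turn them into one row per record, with one column per variable, and write them out as CSV. The function names and the example variables (id, age) are hypothetical, and CSV is just one format that most statistical packages can import.

```python
import csv
import io


def columns_to_rows(columns):
    """Turn {variable: [values...]} into a list of one dict per record."""
    names = list(columns)
    if len({len(columns[name]) for name in names}) > 1:
        raise ValueError("all variables must have the same number of values")
    return [dict(zip(names, values)) for values in zip(*(columns[n] for n in names))]


def rows_to_csv(rows):
    """Write records as CSV text: a header line, then one line per record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative values only.
rows = columns_to_rows({"id": [1, 2], "age": [34, 51]})
text = rows_to_csv(rows)
```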

Data will only talk to you if you care for them. Be nice to your data. Only collect the ones you need, and treat the ones you need with respect.
