Thursday, October 4, 2012

More ways to sabotage selection

Yesterday we saw how weighting the different measures you combine to rate applicants for jobs or promotions or school placements or grants can end up undermining your ratings. The measures to which you assign the highest weight end up having almost all the influence on selection, while the other measures end up with none.

There are times, though, when people don't intend to weight their measures but end up weighting them inadvertently anyway. For example, if you measure one characteristic on a scale of 10 and another on a scale of 5, the measure with a maximum score of 10 will end up having more influence (barring extraordinary and very rare circumstances).

That problem's easy to deal with: just make sure that all your measures have scales with the same maximum score. The second way measures get weighted inadvertently is a little more difficult to deal with: differences in variability can accidentally weight the measures.

Some of your measures will almost always vary over a wider range than others. The statistic most widely used to assess variability is the standard deviation. The bigger the standard deviation, the more variable the scores. An example will demonstrate the problem differences in variability create.

Let's suppose that a professor gives two tests in a course, each of which is to count for 50% of the final mark. The first test has a mean of 65 and a standard deviation of 8, while the second has a mean of 65 and a standard deviation of 16. The problem with these statistics is that two students can do equally well but end up with different final marks. We'll look at two students' possible results.

The first student finishes one standard deviation above the mean on the first test and right at the mean on the second. That is, her marks are 73 and 65, and her final mark is (73 + 65)/2, or 69. The second student finishes at the mean on the first test and one standard deviation above the mean on the second. That is, her marks are 65 and 81, and her final mark is (65 + 81)/2, or 73. So, even though each student finished at the mean on one test and one standard deviation above the mean on the other, one ended up with a higher mark than the other.

To eliminate this bias you can calculate standard scores. You simply subtract the mean from each applicant's score and divide by the standard deviation. That gives you a standard score with a mean of zero; applicants with scores above the mean will have positive standard scores and applicants with scores below the mean will have negative ones. If that sounds complicated, it's not. Spreadsheets will do it for you; in Excel you use the AVERAGE function to get the mean and the STDEV function to get the standard deviation (there is a STANDARDIZE function, but since it requires you to enter the mean and standard deviation it's no faster than writing the formula yourself).
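
If you prefer to see the arithmetic spelled out, here is a minimal sketch in Python of the same standardization; the marks are invented for illustration, roughly matching the two tests described above.

    # Minimal sketch: convert raw test marks to standard (z) scores.
    # The marks below are invented for illustration.

    def standard_scores(scores):
        """Subtract the mean and divide by the standard deviation."""
        n = len(scores)
        mean = sum(scores) / n
        # Population standard deviation; use (n - 1) for a sample estimate.
        sd = (sum((x - mean) ** 2 for x in scores) / n) ** 0.5
        return [(x - mean) / sd for x in scores]

    test1 = [73, 65, 57, 81, 49, 65, 69, 61]   # narrower spread of marks
    test2 = [65, 81, 49, 97, 33, 65, 73, 57]   # wider spread of marks

    z1 = standard_scores(test1)
    z2 = standard_scores(test2)

    # Averaging the standard scores weights the two tests equally,
    # no matter how variable each test's marks happen to be.
    final = [(a + b) / 2 for a, b in zip(z1, z2)]
    print([round(f, 2) for f in final])

Once the marks are standardized, a student who is one standard deviation above the mean gets the same credit for it on either test.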

Even if that still seems like a lot of work to you, the choice is clear: either you do the work or you sabotage your ratings. If you sabotage your ratings you sabotage your selection, and if you sabotage your selection you sabotage your organization (and maybe others, if you're doing something like selecting outside applicants for grants).

For more information about standardization click here for the first of a series of brief articles. Alternatively, the next time you're compiling ratings you can involve staff with statistical training or a consultant.

More Ways to Sabotage Selection © 2012, John FitzGerald

Wednesday, October 3, 2012

The hidden danger in selection procedures

When you’re selecting people for jobs, students for university, or projects to fund, or making any of the many other significant choices we find ourselves faced with, you’re often advised to decide what characteristics you want the successful candidate to have, rate each characteristic numerically, weight the ratings according to the importance you think each characteristic deserves, and then add up the weighted ratings.

For example, if you’re rating three characteristics, and you think one is twice as important as each of the other two, you would take 50% of the rating of the most important characteristic and 25% of the ratings of each of the other two, then add them together.
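
As a concrete sketch of that arithmetic (the ratings are invented for illustration), the weighted total is just the sum of each rating times its weight:

    # Weighted combination of three ratings, as described above.
    # Ratings are invented; weights are 50%, 25%, 25%.
    weights = [0.50, 0.25, 0.25]

    def weighted_total(ratings, weights):
        return sum(r * w for r, w in zip(ratings, weights))

    applicant = [8, 6, 9]   # ratings on the three characteristics
    print(weighted_total(applicant, weights))   # 0.5*8 + 0.25*6 + 0.25*9 = 7.75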

The problem with that procedure, though, is that, in the final analysis, the weight of the most important characteristic will be far higher than you had intended. We can see why this happens by looking at the logic of ratings.

Let’s say you’re selecting students for a program. Your rating scale, then, is intended as a measure of ability to succeed in studying the domain the program covers. You are assessing five characteristics, and assigning weights of 50%, 30%, 10%, 5%, and 5%.

If the several measures of ability to succeed are all measuring the same concept, then they will be highly correlated – people who score high on one measure will also score high on the others. When this is true there is no reason to weight the measures – that is, if they are measures of the same thing there is no justification for making one more important than the others. The statistics of test design provides clear criteria for determining whether all of a group of measures are measuring the same thing.
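
One simple first check, a sketch rather than the full test-design toolkit the paragraph alludes to, is to look at the correlation matrix of the measures (the scores below are invented):

    import numpy as np

    # Rows are applicants, columns are the five measures (invented numbers).
    scores = np.array([
        [8, 7, 9, 6, 7],
        [5, 6, 5, 4, 5],
        [9, 8, 9, 9, 8],
        [4, 5, 3, 4, 4],
        [7, 6, 7, 8, 6],
    ])

    # Pairwise correlations between the measures (the columns).
    print(np.corrcoef(scores, rowvar=False).round(2))

If the off-diagonal correlations are all high, the measures are telling you much the same thing.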

If the measures are not correlated, then they are measuring different aspects of ability to succeed. If they are combined without weighting they will tend to cancel each other out – a high score on one measure will be cancelled out by a low score on another uncorrelated measure – and scores will tend to accumulate in the middle of the score range.

If weights are assigned to the measures to reflect priority, the applicants who score high on the one or two measures with highest priority will tend to have ratings in the high range. The rest of the scores will continue to cancel each other out and the rest of the candidates will accumulate in the middle range.
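
A quick simulation makes both effects visible. It is only a sketch, under the assumption that the five measures are uncorrelated standardized scores, but it shows the unweighted totals piling up in the middle and the weighted totals being driven almost entirely by the first measure:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    # Five uncorrelated measures, already standardized (mean 0, SD 1).
    measures = rng.standard_normal((n, 5))
    weights = np.array([0.50, 0.30, 0.10, 0.05, 0.05])

    unweighted = measures.mean(axis=1)   # scores cluster near the middle
    weighted = measures @ weights        # dominated by the first measure

    # The SD of the unweighted total shrinks (cancellation)...
    print(round(unweighted.std(), 2))    # about 0.45 instead of 1
    # ...and the weighted total correlates very strongly with measure 1 alone.
    print(round(np.corrcoef(weighted, measures[:, 0])[0, 1], 2))   # about 0.84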

Accumulation of scores in the middle range creates a problem for selection, because the cut-off point usually falls in the middle range, where choices must be made between applicants whose scores are very similar. For example, if one student received a mark of 55 on a test of mathematics, and another student a 57, you would not conclude that the second student was a better mathematician than the first. The difference is probably due to random variation, perhaps something as simple as the first student having had a headache.

This also means that the characteristics with lower priority will usually end up having no influence on selection at all, because ratings of these characteristics will cancel each other out. If you are rating uncorrelated characteristics and want each to have a specific weight in selection, you will need to use a procedure that ensures they have this weight. A simple procedure in our example would be to draw 50% of the selected applicants from those with high scores on the most important characteristic, 30% from those with high scores on the second most important one, and so on. Alternatively, the selection can be made in stages to ensure that each characteristic is evaluated according to its priority rank and separately from uncorrelated characteristics.
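
Here is a minimal sketch of the quota idea just described, assuming standardized scores and an invented pool of applicants:

    # Sketch of the quota approach: fill 50% of the places from the top scorers
    # on measure 1, 30% from measure 2, and so on. Applicant data are invented.
    import numpy as np

    rng = np.random.default_rng(1)
    scores = rng.standard_normal((200, 5))   # 200 applicants, 5 measures
    quotas = [0.50, 0.30, 0.10, 0.05, 0.05]
    places = 40

    selected = []
    for measure, quota in enumerate(quotas):
        need = round(places * quota)
        # Rank the applicants on this measure, best first.
        order = np.argsort(-scores[:, measure])
        for applicant in order:
            if need == 0:
                break
            if applicant not in selected:
                selected.append(applicant)
                need -= 1

    print(len(selected), "applicants selected")

Each characteristic then contributes to the final selection in proportion to its intended weight, instead of being swamped by the highest-weighted one.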

Of course, sometimes some characteristics will be correlated and some not. The correlated characteristics can then be combined into a single score that will be more accurate than the single characteristics by themselves. The other lesson to be drawn from this is that someone familiar with test design should review selection procedures to ensure that they have the intended results. Ignoring the relationships between the characteristics you are assessing means that you will be defeating your own purposes – the ones implied by the weight you assigned to each characteristic.

Tomorrow we'll look at some insidious forms of weighting that can sabotage selection even when you don't deliberately weight scores.

The Hidden Danger in Selection Procedures © 2012, John FitzGerald

Monday, March 5, 2012

What's missing from the Fraser Institute school ranking report

The Fraser Institute ranking of Ontario elementary schools was released on Sunday, and as usual it was covered extensively by the press. Unfortunately, the press did not, as far as I could see, ask some serious questions that need to be asked.

I am not going to fault the Fraser Institute for not including all relevant technical information in the report; it is, after all, intended as a popular guide for parents. However, I could not find on the Institute’s website any link to a technical manual that would provide important information missing from their report.

Perhaps the most serious omission is any mention of test characteristics. The overall score calculated for each school is based on the annual assessments conducted by Ontario's Education Quality and Accountability Office (EQAO). But are the tests used for these assessments valid measures of scholastic competence? Standard measures of reliability and validity are not reported (nor could I find them on the EQAO website, or in the technical manuals EQAO provides for the tests).

Of course, even a measure that is unreliable in assessing an individual student can be made reliable by aggregating the scores of a whole school. However, an invalid measure cannot be made valid by aggregation, and if a test is not a valid measure of scholastic competence its reliability does not matter. If someone gets your email address wrong, their messages are not going to get to you regardless of how many times they send them to exactly the same wrong address.

Another issue is that much of the report deals with improvements in schools' scores, but little information is provided about the trend analysis on which the reports of improvement were based. In particular, we need to know what statistical technique was used, and the unusually high significance criterion (p < .10) needs to be explained.

Other issues could be raised, but even if I had included all of them, nothing in this post should be taken as necessarily implying that the Fraser Institute did not do an adequate job. I've asked more serious questions about studies I've reviewed and received reassuring answers. However, without the additional information described here, we cannot conclude that the ranks assigned by the Institute serve as a guide to school performance.

What's missing from the Fraser Institute school ranking report © 2012, John FitzGerald

More articles at the main site

Tuesday, February 7, 2012

One reason selection tests may not work

Let's suppose you wanted to find out how well students' marks on graduation from high school predicted their marks in the first year of university. You select a sample of students and correlate their high school marks with their university marks. You will often fail to find a statistically significant correlation.

This result is counterintuitive, but the reason for it is simple. Only the best students get into university, and even if they do as well in university as they did in high school their marks will fall in a very restricted range. That is, there is simply less difference in ability between the students than there would be if the full range of ability had been sampled, so it is difficult to observe a correlation between their scores.

The distribution of marks will also probably be skewed (in the statistical sense - the mean will be much different from the median), which also militates against finding a correlation.
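
A small simulation shows the attenuation. It is only a sketch, with correlated normal scores standing in for high school and university marks, but it makes the point that selecting the top of the range hides a correlation that exists in the full population:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 5_000
    # Simulated high school and university marks that correlate about 0.6
    # in the full population (all numbers invented).
    high_school = rng.normal(70, 10, n)
    university = 0.6 * (high_school - 70) + rng.normal(0, 8, n) + 65

    full_r = np.corrcoef(high_school, university)[0, 1]

    # Now look only at the "admitted" students: the top quarter of high school marks.
    admitted = high_school > np.percentile(high_school, 75)
    restricted_r = np.corrcoef(high_school[admitted], university[admitted])[0, 1]

    print(round(full_r, 2), round(restricted_r, 2))   # roughly 0.6 vs 0.35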

Problems like these are why I distrust the idea that people can conduct data mining even if they have no training in inferential statistics.

One reason selection tests may not work © 2001, John FitzGerald

More articles at the main site

Tuesday, January 31, 2012

The raw database

When you construct an analytical database, you're better off analyzing your data after they're in the database rather than before. That is, the data in your database are most useful if they are what are known as raw data: individual scores rather than summary data like statistics (such as percentages) or ranges (such as age ranges).

For example, let's consider a database which consists of the names of twenty cities and their unemployment rates (which are, of course, percentages). If you want to work out the unemployment rate for all the cities or for a subset of them, you can't, because you don't know how many people are in the labour force in each city. If, however, the database consists of the names of the cities, the number of people in the labour force in each city, and the number of unemployed in each city, you can easily work out those figures as well as any you could have worked out with the other database.
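
A minimal illustration with invented figures: with only the rates you cannot recover the combined rate, but with the raw counts it is one line of arithmetic.

    # Invented figures for three cities: labour force size and number unemployed.
    cities = {
        "City A": {"labour_force": 500_000, "unemployed": 35_000},
        "City B": {"labour_force": 120_000, "unemployed": 10_800},
        "City C": {"labour_force": 950_000, "unemployed": 57_000},
    }

    # The combined unemployment rate across all three cities is only possible
    # because the raw counts, not just the percentages, were stored.
    total_labour = sum(c["labour_force"] for c in cities.values())
    total_unemployed = sum(c["unemployed"] for c in cities.values())
    print(round(100 * total_unemployed / total_labour, 1), "%")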

That example is a simple one for illustrative purposes, but problems like the one in the example are not rare. Databases constructed with range data rather than raw data are also common. Often, for example, people's ages are entered according to an arbitrary range into which they fall – a 28-year-old might be entered as a 25-to-34-year-old, for example. You can discover useful relationships with data like that, but you can also miss relationships that you would find if you entered the actual ages. If you entered the actual ages you would still be able to investigate your age categories, as well as alternatives to them which might be more useful.

A database of raw data is a much more powerful analytical tool than one of summary data. In compiling a database of summary data you are essentially drawing conclusions about the nature of the data before they have even been entered. Keeping your options open is much the better strategy.

The Raw Database © 2000, John FitzGerald

Originally published at ActualAnalysis.com

Tuesday, January 17, 2012

How the British Geological Survey overcame bad data management - after 165 years

I have gone on at length in this blog and on its parent website about how data aren't informative until they're organized in some useful way. The first step in organizing them is making them accessible - putting them in a database, giving them a unique identifier, and organizing them so that you can use them in the way you intended.

In 1846, Joseph Hooker, a botanist, collected 314 slides of botanical samples for the British Geological Survey. Then he had to rush off on a trip to the Himalayas and didn't get around to entering the samples in the specimen register. In April 2011, Howard Falcon-Lang, a paleontologist, was poking around in a cabinet in a dark corner of the BGS and found the drawers of Hooker's slides. He pulled one out, shone his flashlight on it, and read the label "C. Darwin, Esq." (Click here for a news report.)

It turned out that Hooker's slides were from Darwin's expedition on the Beagle, and that Dr. Falcon-Lang was apparently the first person in 165 years to recognize what they were. Dr. Falcon-Lang expects that examination of the samples will contribute to contemporary science. Imagine what contemporary science would be like, though, if these samples had been examined in the 1840s and 50s.

The data most of us use aren't likely to be as significant as Darwin's, but if we can't use them they're as useless as Darwin's were for 165 years. A big problem I have run into with databases is that some data just don't get entered. People entering data omit fields they consider unimportant or too difficult to collect. Often this ends up producing huge amounts of missing data, especially when data are being entered from the field by several people, and if huge quantities of data are missing the data are useless. If you want to use the data, ensure that a complete record must be entered. If you don't use the data, don't collect them. If you don't collect unnecessary data you'll probably make fewer errors in entering the necessary data.
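
One way to enforce the complete-record rule, sketched here with hypothetical field names, is simply to refuse a record at entry time if any required field is missing:

    # Sketch: reject a record unless every required field is present.
    # The field names and specimen identifier are hypothetical.
    REQUIRED_FIELDS = ("specimen_id", "collector", "date_collected", "location")

    def validate_record(record):
        missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
        if missing:
            raise ValueError("Incomplete record, missing: " + ", ".join(missing))
        return record

    validate_record({
        "specimen_id": "BGS-1846-0314",
        "collector": "J. D. Hooker",
        "date_collected": "1846",
        "location": "unknown",
    })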

And don't enter summaries of data. For example, enter people's exact ages, not an age range. If you use pre-defined age ranges you may end up with all the ages clumped in one or two categories, which severely limits the analysis you can do. If you enter the exact age, you can define age ranges whose categories have roughly equal numbers of people in them, which makes it easier to find differences between the categories (click here, here, and here for more about these issues).
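
If exact ages are stored, equal-sized age categories can be defined after the fact. A sketch with numpy, using invented ages:

    import numpy as np

    rng = np.random.default_rng(3)
    ages = rng.integers(18, 80, 300)   # exact ages, invented for the example

    # Cut points that split the sample into four roughly equal-sized groups,
    # something pre-defined ranges like 25-34 cannot guarantee.
    cut_points = np.percentile(ages, [25, 50, 75])
    categories = np.digitize(ages, cut_points)
    print(cut_points, np.bincount(categories))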

Similarly, instead of a test score or a rating, enter the individual test and rating items. First of all, that makes it easier to clean the data - to find erroneously recorded items or scores. More importantly, it gives you the ability to assess the adequacy of the test or rating (a PDF you can download from my website describes some things you can do with ratings; click here for the PDF).

And enter the data in a format appropriate for the type of analysis you want to do. Most statistical packages, for example, want each record entered as a row.

Data will only talk to you if you care for them. Be nice to your data. Only collect the ones you need, and treat the ones you need with respect.

Monday, January 9, 2012

Effect and cause as a clue to the meaning of science

The December 16, 2011, issue of WIRED has a piece by Jonah Lehrer called "Trials and Errors: Why Science is Failing Us" (click here to read it). Mr. Lehrer's argument seems to be that some phenomena are too complex for scientific method to be able to discover what causes them. In his conclusion he writes:
And yet, we must never forget that our causal beliefs are defined by their limitations. For too long, we’ve pretended that the old problem of causality can be cured by our shiny new knowledge. If only we devote more resources to research or dissect the system at a more fundamental level or search for ever more subtle correlations, we can discover how it all works. But a cause is not a fact, and it never will be; the things we can see will always be bracketed by what we cannot. And this is why, even when we know everything about everything, we’ll still be telling stories about why it happened. It’s mystery all the way down.
The comments following the piece do a good job of pointing out the flaws in the reasoning by which Mr. Lehrer reaches this conclusion. However, one issue is omitted. That issue is that science is not about causes.

Science is about effects. At its simplest, an effect is a non-random relationship between two variables. Scientific experimentation investigates effects by varying one of the variables (the independent variable) and seeing what happens to the other variable (the dependent variable). The goal is to explain the effect - that is, to become more effective in predicting the dependent variable. This model can be expanded to handle large numbers of variables. For example, one of the things I do in evaluating satisfaction with a program is to investigate simultaneously the relative importance of several variables in accounting for satisfaction. What you typically find when you do this correctly is that only a few of the variables have any relationship to satisfaction. What you often find, too, is that the variables that account for satisfaction are different from the reasons participants report when asked why they like the program.
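
A sketch of the kind of analysis described, with invented data and ordinary least squares standing in for whichever correlational method is actually used:

    import numpy as np

    rng = np.random.default_rng(4)
    n = 400
    # Invented predictors: ratings of staff, facilities, and wait times.
    staff = rng.standard_normal(n)
    facilities = rng.standard_normal(n)
    wait_times = rng.standard_normal(n)

    # Simulated satisfaction: only staff and wait times actually matter here.
    satisfaction = 0.7 * staff - 0.4 * wait_times + rng.standard_normal(n)

    X = np.column_stack([np.ones(n), staff, facilities, wait_times])
    coefficients, *_ = np.linalg.lstsq(X, satisfaction, rcond=None)
    print(coefficients.round(2))   # the facilities coefficient comes out near zero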

The methods I use are correlational, so they cannot attribute causation. What they tell you is that as one thing varies, so does another. Furthermore, the analyses of satisfaction I do are non-experimental, so I can't even be sure that the estimates of the correlations are all that exact. What I can do, though, is recommend that changes be made to see if dealing with the variables identified by the data analysis will improve satisfaction.

The same considerations apply to a lot of health research, and they go a long way toward accounting for the examples Mr. Lehrer adduces. What health researchers do is develop their own recommendations for further research that will test whether their conclusions are correct. In fact, the supposed failure Mr. Lehrer describes is a demonstration of the success of science - a hypothesis was developed from prior research to test whether a drug was effective, and the test failed to find evidence that it was effective. That failure by itself is informative - it tells us not to prescribe the drug.

One of the commenters at the link above (urgelt) goes into the issue of the adequacy of research in more detail. My post of January 5 (click here) provides another example of this type of difficulty. What is clear is that error is inherent in the process of scientific experimentation, and that the foundation of scientific method includes a recognition that error is inherent. Reports of statistical analysis of research results typically include many estimates of the error involved in the relationships estimated by the statistical techniques.

As for Mr. Lehrer's remarks about the mythical nature of causes, scientific method has long allowed explanatory variables that have no real existence (intelligence, for example, cannot be directly measured but only inferred from behaviour). Variables like this are called explanatory fictions. The reason they are allowed is that the point of science is to explain an effect, not to find out what its actual cause is. If a fictional variable can explain the effect where something tangible and real can't, so much the better. Furthermore, even a small improvement in accuracy of prediction will often produce large benefits. Obviously, something which improves accuracy only a small amount is unlikely to be a cause in any meaningful sense, but it can still play an important role in practice.

Complex systems often frustrate scientific research simply because there are so many potential effects to examine, not because scientists are naive about the nature of causes, which anyway they aren't looking for. Mr. Lehrer freely acknowledges that science has been spectacularly successful with some complex systems (the health of large populations, for example), so concluding that failures to be successful with others mean that science has failed to solve the problem of causation is not only questionable and hasty but irrelevant as well.

I am confident that the scientific research of 100 years from now will be superior to today's research. I am also confident that the reason for its superiority will not be that it has solved the problem of causation.

Website
Twitter

Research, cause, and effect © 2012, John FitzGerald

Why information overload is a myth

Everybody’s heard of information overload – a Google search I just did for information overload (in quotation marks) produced over 4 million results. In fact, though, it is data we are overloaded with, not information.

Information consists only of data that reduce uncertainty. A weather forecast is only informative if it predicts the weather accurately. If it doesn't predict the weather accurately, we could end up leaving our umbrellas at home on rainy days. Similarly, if we base corporate decisions on data that don’t predict the results we want to achieve, we could end up being embarrassed and out of pocket.

As the Schumpeter blog in the Economist said on December 31: “As communication grows ever easier, the important thing is detecting whispers of useful information in a howling hurricane of noise.” It’s that overload of noise we must fear.

How do you reduce an overload of noise?
  • By not collecting data that are irrelevant to the decisions you make.
  • By not collecting data that are nearly identical to informative data you already collect.
  • By not collecting more data than you need.
  • By not combining pieces of information in ways which produce an uninformative total score (by weighting them, for example).

But how do you avoid doing these things? Chiefly by analysing your data with sound statistical methods. For example, you can estimate the relevance of data to a decision with methods like the correlation coefficient. You can use principal components analysis to find variables that are telling you the same story. You can use sampling theory to decide how much data you need to collect. You can use psychometric analysis to combine pieces of information into a single score effectively. The battle against uninformative data has not been won, but you can win that part of it that takes place in your office.
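
As a small illustration of one of these checks (the data are invented), the eigenvalues of the correlation matrix, which is what principal components analysis examines, reveal when one variable is essentially repeating another:

    import numpy as np

    rng = np.random.default_rng(5)
    n = 500
    x = rng.standard_normal(n)
    # A second variable that is almost a copy of the first, plus an unrelated one.
    x_copy = x + 0.1 * rng.standard_normal(n)
    y = rng.standard_normal(n)

    data = np.column_stack([x, x_copy, y])
    corr = np.corrcoef(data, rowvar=False)

    # Eigenvalues of the correlation matrix: one near-zero eigenvalue means
    # one of the three variables is redundant and adds noise, not information.
    eigenvalues = np.linalg.eigvalsh(corr)
    print(eigenvalues.round(2))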

Website
Twitter

Why Information Overload is a Myth © 2012, John FitzGerald

Thursday, January 5, 2012

Cognitive decline research: Questions the CBC didn't ask

Today's CBC news report (click here) of a study of cognitive decline is pretty standard science reporting. I'm sure that other news sources provided much the same story. Anyway, it confines itself to reporting the results the researchers reported, results which were fairly stated.

However, there is other information the CBC might have provided, but didn't. First, it doesn't provide a link to the study (or hadn't when I posted a comment asking for one). I, for one, was interested in learning what "a 3.6% decline in mental reasoning" was. Does a decline of that size have a serious effect on people's functioning?

So I looked for the link and found it (here). It's an open access article that can be downloaded free in a PDF. The article doesn't provide a quick answer to the question of how serious the declines observed are, but the authors do suggest that further attention might be paid to people whose declines are greater than the mean in the study. That suggests to me the mean declines are not that serious, although I readily admit I may be reading something into the authors' suggestion that isn't there.

What I also found, though, is that the researchers did not control for health. Since older people tend to be less healthy, were these cognitive declines due to changes in brain function or to the fatigue resulting from poor health? Information about medical risk factors was collected, but the article does not report that it was incorporated in the statistical analyses.

None of this is intended to question the adequacy of the research. Being able to carry on a rigorous study of over 10,000 people for 24 years is proof enough of the researchers' competence. What this is intended to question is the value of a news report that simply reports results without examining them. I'm sure that if the researchers had been asked about the relationship of the health information they collected to cognitive decline they could have explained it fully. I'm sure if they'd been asking questions like that the journalists would have enjoyed their jobs more, too.