Wednesday, February 18, 2015

Plethoratology

Computers have got to the point where they can be used to analyze enormous files of data, and many people do analyze enormous files. We do tend to believe that both bigger and more are better, and in fact bigger files are often very desirable. Nevertheless, big files often create big problems, and in this article we will look at some ways to overcome those problems and get the most out of big files.

Files can be big both vertically and horizontally. A file is big vertically if it has a large number of cases (or records, in database terminology). A file is big horizontally when it has a large number of variables (fields).

Problems of vertical size. One of the most important problems with files which are big vertically is the non-sampling error (also known as a mistake). The more cases or records there are in a file, the more likely non-sampling errors become, especially if the increase in the size of the file reduces the time available for the collection of each individual case. For example, if people are under pressure to provide a long list of information, they may record inaccurate estimates or even fabricate information. Big files need to be audited for errors of this type.

Another problem with files with enormous numbers of cases is that statistical tests become so powerful that their results are all but meaningless: almost any difference becomes significant at astoundingly low levels. For example, let's suppose you have a sample of 100 people, and you want to know if women are over- or under-represented in it. To find out, you are going to perform a chi-square test with a significance criterion of .05. If we assume that women make up 51% of the general population, the percentage of women in your sample would have to be about 10 percentage points higher or lower than 51% for the chi-square test to be significant. That is, if 61 of your sample were women, you could conclude that women were over-represented, and if only 41 were women, you could conclude that they were under-represented. Those seem like reasonable standards, but if you use large samples, the standards become far less demanding.

For example, if your sample had 10,000 members, the chi-square test would tell you that women were over-represented if they made up as little as 52% of the sample. If your sample had 100,000 members, women would be over-represented if they made up as little as 51.3% of the sample -- less than one-half of one per cent more than the figure for the population. That may be a real difference, and it's statistically significant, but it may well not be practically significant.

The solution to these problems is to make use of sampling theory. First, you can use sampling theory to determine the most statistically appropriate sample size. For example, if you're conducting a survey and want a 95% confidence interval of ±5%, you need a random sample of only 385 people. Of course, other considerations may make collecting more data advisable. If you're collecting huge amounts of data, though, you can still use sampling theory to select a subsample of data to analyze.
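
Here is a minimal sketch in Python (mine, not part of the original article) that reproduces these figures from the normal approximation behind the chi-square test for a single proportion; the 51% baseline, the sample sizes, and the ±5% interval are taken from the example above.

```python
# A rough reconstruction of the figures above (not the author's code): how the
# smallest detectable departure from 51% shrinks as the sample grows, using the
# normal approximation behind the chi-square test for a single proportion.
from math import ceil, sqrt
from statistics import NormalDist

P = 0.51                                   # assumed share of women in the population
ALPHA = 0.05                               # significance criterion
z = NormalDist().inv_cdf(1 - ALPHA / 2)    # roughly 1.96

for n in (100, 10_000, 100_000):
    margin = z * sqrt(P * (1 - P) / n)     # smallest detectable difference
    print(f"n = {n:>7,}: significant once the sample differs by about {margin:.1%}")

# Sample size for a 95% confidence interval of +/-5%, worst case (p = 0.5)
e = 0.05
print("sample needed for a +/-5% interval:", ceil(z ** 2 * 0.25 / e ** 2))
```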

In psychometric research, for example, it is often necessary to administer huge numbers of tests. Scaling and reliability analysis, however, are often performed on smaller random samples drawn from the main one, so that statistical tests give more meaningful results. The smaller samples can also be analyzed much more quickly. If you want to check the validity of the results obtained with the small sample, you can draw a second small sample and do the same analyses. You'll still be finished in less time than it would have taken to analyze the entire sample.
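
If your data sit in a table, drawing and re-drawing small random samples takes only a few lines. The sketch below assumes, purely for illustration, a pandas DataFrame of test scores called `scores`; the column names, sample sizes, and data are invented.

```python
# A minimal sketch, assuming (hypothetically) that the test data sit in a pandas
# DataFrame called `scores`: draw two independent random subsamples and run the
# same analysis on each to see whether the results replicate.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
scores = pd.DataFrame(rng.normal(50, 10, size=(200_000, 5)),
                      columns=[f"item_{i}" for i in range(1, 6)])     # stand-in data

sample_a = scores.sample(n=400, random_state=1)                       # analysis sample
sample_b = scores.drop(sample_a.index).sample(n=400, random_state=2)  # check sample

# for example, compare the item intercorrelations in the two subsamples
print(sample_a.corr().round(2))
print(sample_b.corr().round(2))
```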

If drawing a smaller subsample is not possible, you can adjust your significance criterion. To do that, you have to determine how big a difference or how strong a relationship you're looking for, and the power you want the statistical test to have to detect those differences or relationships.
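
As a rough illustration of that adjustment, the sketch below (my own, not a procedure from the article) picks the smallest difference in a proportion that would matter in practice and the power wanted to detect it, then backs out the significance criterion to use in place of .05. It relies on the usual normal approximation, and all the numbers are invented.

```python
# A rough sketch (my own, not the author's procedure): choose the smallest
# difference you care about and the power you want, then back out the
# significance criterion appropriate for a very large sample. Numbers invented.
from math import sqrt
from statistics import NormalDist

nd = NormalDist()
n = 10_000        # cases available
p = 0.51          # baseline proportion, as in the example above
delta = 0.02      # smallest difference of practical interest (2 points)
power = 0.80      # desired chance of detecting a difference that big

se = sqrt(p * (1 - p) / n)
z_crit = delta / se - nd.inv_cdf(power)   # critical z giving the chosen power
alpha = 2 * (1 - nd.cdf(z_crit))          # two-sided criterion to use instead of .05

print(f"use a significance criterion of about {alpha:.4f}")   # roughly .0016
```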

Problems of horizontal size. Having large numbers of variables or fields is not a problem if you know that they all measure different things. Problems arise when different variables measure the same thing but the data analyst assumes they are independent. These problems are quite common nowadays because statistical packages have given everyone the ability to perform statistical analyses. With every good intention, people enter large numbers of variables into multiple linear regressions without inspecting either correlations or residuals. The problem with doing that is that it produces unstable solutions. If several variables are correlated with each other, and about equally correlated with the dependent variable, the order in which they are entered into the equation is determined by small and random differences in the size of their correlations with the dependent variable. If you perform the analysis on a second set of data (which you often have to do if data are collected yearly, for example), the variables will often be entered in a different order.

The solution I prefer to this problem is to scale the variables. Variables which measure the same thing can be aggregated to produce a single measure. There are other solutions as well.
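
Here is a minimal sketch of the aggregation idea, on made-up data: three predictors that largely measure the same thing are standardized and averaged into a single composite before the regression, rather than being entered separately. Everything in it, names and numbers alike, is invented for illustration.

```python
# A minimal sketch, on made-up data: three predictors that largely measure the
# same thing are standardized and averaged into one composite before regression,
# instead of being entered separately and fighting over the same variance.
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
trait = rng.normal(size=n)                         # what the three predictors share
x1, x2, x3 = (trait + rng.normal(scale=0.5, size=n) for _ in range(3))
y = 2 * trait + rng.normal(size=n)                 # dependent variable

def z(v):
    """Standardize: subtract the mean, divide by the standard deviation."""
    return (v - v.mean()) / v.std()

composite = (z(x1) + z(x2) + z(x3)) / 3            # single aggregated measure

# ordinary least squares with the composite as the lone predictor
X = np.column_stack([np.ones(n), composite])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("intercept and slope:", np.round(coef, 3))
```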

Reducing the number of variables also helps deal with another problem: dealing with interaction effects, which are often ignored in multiple linear regression analysis. An interaction effect is one which cannot be predicted from the individual (or main) effects of two or more variables. For example, hair loss increases with age, and it is far more common among men – those are what we call main effects of single variables. However, the relationship between age and hair loss is much stronger among men. That is an interaction effect of two variables (you can have higher-order interactions as well). If you don't assess interaction effects you will usually miss important information about the topic you're investigating. To assess interaction effects, you examine residuals and introduce multiplicative terms into your regression equation (a brief sketch follows below).

Big files can have big benefits. To obtain those benefits, though, you have to be circumspect.
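
Here is that sketch: a regression fitted with and without the multiplicative term, on invented data patterned on the hair-loss example. It illustrates the general technique only; it is not code from the article.

```python
# A minimal sketch, with invented data, of adding a multiplicative (interaction)
# term to a regression: hair loss depends on age, on sex, and on their product.
import numpy as np

rng = np.random.default_rng(1)
n = 2_000
age = rng.uniform(20, 70, size=n)
male = rng.integers(0, 2, size=n)                 # 1 = male, 0 = female
# true model: the age effect is much stronger for men (the interaction)
hair_loss = 0.05 * age + 2.0 * male + 0.10 * age * male + rng.normal(size=n)

# design matrix with main effects only, then with the multiplicative term added
X_main = np.column_stack([np.ones(n), age, male])
X_int = np.column_stack([np.ones(n), age, male, age * male])

for name, X in (("main effects only", X_main), ("with interaction", X_int)):
    coef, *_ = np.linalg.lstsq(X, hair_loss, rcond=None)
    print(f"{name}: coefficients {np.round(coef, 3)}")
```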

Plethoratology © 1995,1999, 2003, 2006, 2015, John FitzGerald

Saturday, December 14, 2013

The office: then and now

The revolution in office life has been so rapid and so vast that we often forget what life was like in those pre-revolutionary workplaces. Here are some comparisons which help show how much changes in office technology have revolutionized office life:

THEN: Employees wasted time gossiping around the water cooler
NOW: Employees waste time gossiping around the Xerox machine

THEN: Copying business correspondence required messy and finicky carbon paper
NOW: Copying business correspondence requires messy and expensive toner cartridges

THEN: Employees wasted time gossiping around the water cooler
NOW: Employees waste time gossiping by e-mail

THEN: Little information available for reports
NOW: A report on the price of coffee in the staff room incorporates 37 pie charts and a review of the literature

THEN: Employees wasted time gossiping around the water cooler
NOW: Employees waste time experimenting with their screen savers

THEN: The office resounded to the clacking of typewriters
NOW: The office resounds to the oaths of employees who have just jammed the copier or deleted computer files by mistake

THEN: Employees wasted time gossiping around the water cooler
NOW: Employees waste time playing solitaire in Windows

THEN: Reports could only be produced in limited quantities
NOW: Everybody makes 100 copies of everything

THEN: Employees wasted time gossiping around the water cooler
NOW: Employees waste time tweeting what they had for lunch.

As you can see, office life has changed drastically over the last quarter of a century! Change is the one constant of modern society, and we can expect the office of 2043 to be as different from the office of 2013 as the office of 2013 is from the office of 1983!

Gosh, isn't modern society exciting!

The Office: Then and Now © John FitzGerald, 1997, 2000

Thursday, October 4, 2012

More ways to sabotage selection

Yesterday we saw how weighting the different measures you combine to rate applicants for jobs or promotions or school placements or grants can end up undermining your ratings. The measures to which you assign the highest weight end up having almost all the influence on selection, while the other measures end up with none.

There are times, though, when people don't intend to weight their measures but end up weighting them inadvertently anyway. For example, if you measure one characteristic on a scale of 10 and another on a scale of 5, the measure with a maximum score of 10 will end up having more influence (barring extraordinary and very rare circumstances).

That problem's easy to deal with: just make sure that all your measures have scales with the same maximum score. The second problem is a little more difficult: differences in variability can weight the measures accidentally.

Some of your measures will almost always vary over a wider range than others. The statistic most widely used to assess variability is the standard deviation. The bigger the standard deviation, the more variable the scores. An example will demonstrate the problem differences in variability create.

Let's suppose that a professor gives two tests in a course, each of which is to count for 50% of the final mark. The first test has a mean of 65 and a standard deviation of 8, while the second has a mean of 65 and a standard deviation of 16. The problem with these statistics is that two students can do equally well but end up with different final marks. We'll look at two students' possible results.

The first student finishes one standard deviation above the mean on the first test and right at the mean on the second. That is, her marks were 73 and 65, and her final mark is half of 73 + 65, or 69. The second student finishes at the mean on the first test and one standard deviation above the mean on the second. That is, her marks are 65 and 81, and her final mark is (65 + 81)/2, or 73. So, even though each student finished at the mean on one test and one standard deviation above the mean on the other, one ended up with a higher mark than the other.

To eliminate this bias you can calculate standard scores. You simply subtract the mean from each applicant's score and divide by the standard deviation. That gives you a standard score with a mean of zero; applicants with scores above the mean will have positive standard scores and applicants with scores below the mean will have negative ones. If that sounds complicated, it's not. Spreadsheets will do it for you; in Excel you use the AVERAGE function to get the mean and the STDEV function to get the standard deviation (there is a STANDARDIZE function, but since it requires you to enter the mean and standard deviation, it's no faster than writing the formula yourself).
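
If you would rather not do it in a spreadsheet, the same calculation takes a few lines of Python. The sketch below uses invented marks with roughly the means and standard deviations from the example above; it is an illustration, not part of the original article.

```python
# A small illustration (not from the original article): standard scores for two
# tests with invented marks, roughly matching the means and SDs in the example.
import numpy as np

rng = np.random.default_rng(3)
test1 = rng.normal(65, 8, size=100)     # mean 65, SD about 8
test2 = rng.normal(65, 16, size=100)    # mean 65, SD about 16

def standardize(scores):
    """Subtract the mean from each score and divide by the standard deviation."""
    return (scores - scores.mean()) / scores.std(ddof=1)

# each test now has mean 0 and SD 1, so averaging them no longer favours
# the more variable test
final = (standardize(test1) + standardize(test2)) / 2
print(np.round(final[:5], 2))
```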

Even if that still seems like a lot of work to you, the choice is clear: either you do the work or you sabotage your ratings. If you sabotage your ratings you sabotage your selection, and if you sabotage your selection you sabotage your organization (and maybe others, if you're doing something like selecting outside applicants for grants).

For more information about standardization click here for the first of a series of brief articles. Alternatively, the next time you're compiling ratings you can involve staff with statistical training or a consultant.

More Ways to Sabotage Selection © 2012, John FitzGerald

Wednesday, October 3, 2012

The hidden danger in selection procedures

When you’re selecting people for jobs, students for university, or projects to fund – when you’re making any one of the many significant choices we often find ourselves faced with – you’re often advised to decide what characteristics you want the successful candidate to have, rate the characteristics numerically, weight them according to the importance you think each should have, then add up the weighted ratings.

For example, if you’re rating three characteristics, and you think one is twice as important as each of the other two, you would take 50% of the rating of the most important characteristic and 25% of the ratings of each of the other two, then add them together.

The problem with that procedure, though, is that in the final analysis the weight of the most important characteristic will be far higher than you had intended. We can see why this happens by looking at the logic of ratings.

Let’s say you’re selecting students for a program. Your rating scale, then, is intended as a measure of ability to succeed in studying the domain the program covers. You are assessing five characteristics, and assigning weights of 50%, 30%, 10%, 5%, and 5%.

If the several measures of ability to succeed are all measuring the same concept, then they will be highly correlated – people who score high on each measure will also score high on the others. When this is true there is no reason to weight the measures – that is, if they are measures of the same thing there is no justification for making one more important than the others. The statistics of test design provides clear criteria for determining if all of a group of measures are measuring the same thing.
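
One commonly used index from test design is Cronbach's alpha; strictly speaking it measures internal consistency rather than unidimensionality, but a high value is a reasonable first sign that a set of measures hangs together. The sketch below (my own illustration, with invented ratings) computes it from scratch.

```python
# A minimal sketch (my own, with invented ratings): Cronbach's alpha for five
# measures rated for 500 applicants. A value near 1 suggests the measures are
# largely measuring the same thing; strictly, alpha indexes internal consistency.
import numpy as np

rng = np.random.default_rng(7)
ability = rng.normal(size=500)                       # shared underlying ability
ratings = np.column_stack(
    [ability + rng.normal(scale=0.6, size=500) for _ in range(5)])

def cronbach_alpha(items):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of the total)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

print(f"Cronbach's alpha: {cronbach_alpha(ratings):.2f}")
```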

If the measures are not correlated, then they are measuring different aspects of ability to succeed. If they are combined without weighting they will tend to cancel each other out – a high score on one measure will be cancelled out by a low score on another uncorrelated measure – and scores will tend to accumulate in the middle of the score range.

If weights are assigned to the measures to reflect priority, the applicants who score high on the one or two measures with highest priority will tend to have ratings in the high range. The rest of the scores will continue to cancel each other out and the rest of the candidates will accumulate in the middle range.
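
A small simulation makes the point. In the sketch below, five uncorrelated measures scored 0 to 10 are combined with the weights from the example above; the weighted totals bunch in the middle of the range, and the top of the distribution is dominated by the most heavily weighted measure. The data are invented.

```python
# A small simulation with invented ratings: five uncorrelated measures scored
# 0-10, combined with the weights from the example above.
import numpy as np

rng = np.random.default_rng(11)
n = 10_000
ratings = rng.uniform(0, 10, size=(n, 5))
weights = np.array([0.50, 0.30, 0.10, 0.05, 0.05])
total = ratings @ weights                            # weighted rating per applicant

# the weighted totals bunch in the middle of the possible 0-10 range ...
print("middle 80% of one measure:    ", np.round(np.percentile(ratings[:, 0], [10, 90]), 1))
print("middle 80% of weighted totals:", np.round(np.percentile(total, [10, 90]), 1))

# ... and the very top of the distribution belongs to applicants who scored
# high on the most heavily weighted measure
top = total >= np.percentile(total, 95)
print("mean on measure 1, top 5% vs all:",
      round(ratings[top, 0].mean(), 1), "vs", round(ratings[:, 0].mean(), 1))
```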

Accumulation of scores in the middle range creates a problem for selection, because the cut-off point usually falls in the middle range, and choices must be made between applicants whose scores are very similar. For example, if one student received a mark of 55 on a test of mathematics, and another student a 57, you would not conclude that the second student was a better mathematician than the first. The difference is probably due to random variation, perhaps something as simple as the first student having a headache.

This also means that the characteristics with lower priority will usually end up having no influence on selection at all, because ratings of these characteristics will cancel each other out. If you are rating uncorrelated characteristics and want each to have a specific weight in selection, you will need to use a procedure that ensures they will have this weight. A simple procedure in our example would be to draw 50% of the selected applicants from those with high scores on the most important characteristic, 30% from those with high scores on the second most important one, and so on. Alternatively, the selection can be made in stages to ensure that each characteristic is evaluated according to its priority rank and separately from uncorrelated characteristics.
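
Here is a minimal sketch of the quota procedure just described, on invented scores; applicants already chosen on one measure are removed before the next draw.

```python
# A minimal sketch of the quota procedure described above, on invented scores:
# fill 50% of the places from the top of the most important measure, 30% from
# the next, and so on, removing applicants already chosen from later draws.
import numpy as np

rng = np.random.default_rng(13)
n_applicants, n_places = 1_000, 100
scores = rng.uniform(0, 10, size=(n_applicants, 5))     # five uncorrelated measures
quotas = [0.50, 0.30, 0.10, 0.05, 0.05]

selected = []
remaining = set(range(n_applicants))
for measure, quota in enumerate(quotas):
    places = round(n_places * quota)
    # rank the remaining applicants on this measure, highest first
    ranked = sorted(remaining, key=lambda i: scores[i, measure], reverse=True)
    chosen = ranked[:places]
    selected.extend(chosen)
    remaining -= set(chosen)

print(f"selected {len(selected)} applicants")
```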

Of course, sometimes some characteristics will be correlated and some not. The correlated characteristics can then be combined into a single score that will be more accurate than the single characteristics by themselves. The other lesson to be drawn from this is that someone familiar with test design should review selection procedures to ensure that they have the intended results. Ignoring the relationships between the characteristics you are assessing means that you will be defeating your own purposes – the ones implied by the weight you assigned to each characteristic.

Tomorrow we'll look at some insidious forms of weighting that can sabotage selection even when you don't deliberately weight scores.

The Hidden Danger in Selection Procedures © 2012, John FitzGerald

Monday, March 5, 2012

What's missing from the Fraser Institute school ranking report

The Fraser Institute ranking of Ontario elementary schools was released on Sunday, and as usual it was covered extensively by the press. Unfortunately, the press did not, as far as I could see, ask some serious questions that need to be asked.

I am not going to fault the Fraser Institute for not including all relevant technical information in the report; it is, after all, intended as a popular guide for parents. However, I could not find on the Institute’s website any link to a technical manual that would provide important information missing from their report.

Perhaps the most serious omission is any mention of test characteristics. The overall score calculated for each school is based on the annual assessment conducted by Ontario's Education Quality and Accountability Office (EQAO). But are the tests used for these assessments valid measures of scholastic competence? Standard measures of reliability and validity are not reported (nor could I find them on the EQAO website, or in the technical manuals EQAO provides for the tests).

Of course, even a measure that is unreliable in assessing an individual student can be made reliable by aggregating the scores of a whole school. However, an invalid measure cannot be made valid by aggregation, and if a test is not a valid measure of scholastic competence its reliability does not matter. If someone gets your email address wrong, their messages are not going to get to you regardless of how many times they send them to exactly the same wrong address.

Another issue is that much of the report deals with improvements in schools' scores, but little information is provided about the trend analysis on which the reports of improvement were based. In particular, we need to know what statistical technique was used, and we need an explanation of the unusually lenient significance criterion (p < .10).

Other issues could be raised, but nothing in this post should be taken as implying that the Fraser Institute did not do an adequate job. I've asked more serious questions about studies I've reviewed and received reassuring answers. However, without the additional information described here, we cannot conclude that the ranks assigned by the Institute serve as a guide to school performance.

What's missing from the Fraser Institute school ranking report © 2012, John FitzGerald

More articles at the main site

Tuesday, February 7, 2012

One reason selection tests may not work

Let's suppose you wanted to find out how well students' marks on graduation from high school predicted their marks in the first year of university. You select a sample of students and correlate their high school marks with their university marks. You will often fail to find a statistically significant correlation.

This result is counterintuitive, but the reason for it is simple. Only the best students get into university, and even if they do as well in university as they did in high school their marks will fall in a very restricted range. That is, there is simply less difference in ability between the students than there would be if the full range of ability had been sampled, so it is difficult to observe a correlation between their scores.

The distribution of marks will also probably be skewed (in the statistical sense - the mean will be much different from the median), which also militates against finding a correlation.
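
A quick simulation illustrates the restriction-of-range effect. In the sketch below (invented marks, not real admissions data), high school and university marks correlate strongly across all applicants but much more weakly among the top slice who are admitted.

```python
# A minimal sketch, with simulated marks, of how restricting the range weakens an
# observed correlation: high-school and university marks correlate strongly in the
# full applicant pool, but much less among the admitted top slice.
import numpy as np

rng = np.random.default_rng(17)
n = 20_000
high_school = rng.normal(75, 10, size=n)
university = 0.8 * (high_school - 75) + 75 + rng.normal(0, 8, size=n)

admitted = high_school >= np.percentile(high_school, 80)   # only the top 20% get in

print("correlation, all applicants:", round(np.corrcoef(high_school, university)[0, 1], 2))
print("correlation, admitted only: ", round(np.corrcoef(high_school[admitted],
                                                        university[admitted])[0, 1], 2))
```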

Problems like these are why I distrust the idea that people with no training in inferential statistics can conduct data mining.

One reason selection tests may not work © 2001, John FitzGerald

More articles at the main site

Tuesday, January 31, 2012

The raw database

When you construct an analytical database, you're better off analyzing your data after they're in the database rather than before. That is, the data in your database are most useful if they are what are known as raw data: individual scores rather than summary data like statistics (such as percentages) or ranges (such as age ranges).

For example, let's consider a database which consists of the names of twenty cities and their unemployment rates (which are, of course, percentages). If you want to work out the unemployment rate for all the cities or for a subset of them, you can't, because you don't know how many people are in the labour force in each city. If, however, the database consists of the names of the cities, the number of people in the labour force in each city, and the number of unemployed in each city, you can easily work out those figures as well as any you could have worked out with the other database.
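
The sketch below, with invented figures for three cities, shows the difference in a few lines of Python: averaging the city rates gives the wrong answer, while aggregating the raw counts gives the right one.

```python
# A minimal sketch of the point above, with invented figures for three cities: you
# cannot get the combined unemployment rate from the city rates alone, but you can
# from the raw counts.
labour_force = {"City A": 500_000, "City B": 120_000, "City C": 60_000}
unemployed = {"City A": 30_000, "City B": 12_000, "City C": 3_000}

# rates for the individual cities (what a summary-only database would store)
rates = {city: unemployed[city] / labour_force[city] for city in labour_force}
print("city rates:", {c: f"{r:.1%}" for c, r in rates.items()})

# wrong: averaging the rates ignores the cities' very different sizes
print("naive average of rates:", f"{sum(rates.values()) / len(rates):.1%}")

# right: aggregate the raw counts, then compute the rate
combined = sum(unemployed.values()) / sum(labour_force.values())
print("combined rate from raw data:", f"{combined:.1%}")
```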

That example is a simple one for illustrative purposes, but problems like the one in the example are not rare. Databases constructed with range data rather than raw data are also common. Often, for example, people's ages are entered according to an arbitrary range into which they fall – a 28-year-old might be entered as a 25-to-34-year-old, for example. You can discover useful relationships with data like that, but you can also miss relationships that you would find if you entered the actual ages. If you entered the actual ages you would still be able to investigate your age categories, as well as alternatives to them which might be more useful.

A database of raw data is a much more powerful analytical tool than one of summary data. In compiling a database of summary data you are essentially drawing conclusions about the nature of the data before they have even been entered. Keeping your options open is much the better strategy.

The Raw Database © 2000, John FitzGerald

Originally published at ActualAnalysis.com