Chances Are


  In his new kingdom of small, randomized plots, Fisher saw the opportunity to conduct several experiments at the same time: to investigate variation from different directions. If you study superphosphate as against no treatment, you have only one comparison; but if you study superphosphate, urea, superphosphate plus urea, and no treatment, you have two ways of looking at each treatment: compared with no treatment and compared with a combination, from which you can subtract the effect of the other treatment. This “analysis of variance” was one of Fisher’s great gifts to science: he provided the mathematics to design experiments that could answer several inquiries simultaneously. Similar techniques allowed the experimenter to isolate and adjust for the unavoidable natural variations in subjects, like age, sex, or weight in a clinical trial.
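The double comparison can be sketched in a few lines of Python; the yield figures below are invented for illustration and come from no actual Rothamsted plot:

```python
# Invented mean yields, in bushels per plot, for the four treatments.
none, phos, urea, both = 20.0, 26.0, 24.0, 31.0

# Each treatment's effect is seen from two directions: compared with
# no treatment, and compared with the combination minus the other one.
phos_effect = ((phos - none) + (both - urea)) / 2
urea_effect = ((urea - none) + (both - phos)) / 2

# The extra gain (or loss) from applying the pair together.
interaction = (both - none) - (phos - none) - (urea - none)

print(phos_effect, urea_effect, interaction)
```

Averaging the two views of each effect is what lets a single field answer two questions at once.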

  Small blocks are small samples—and Fisher could not have laid the foundations for the modern scientific method had it not been for the prior work of someone professionally tied to small samples: W. S. Gosset, who wrote under the characteristically modest pseudonym “Student.” Gosset worked for the Guinness brewery, an enterprise critically dependent on knowing the qualities of barley from each field it had contracted for. Guinness’ agent might wander through, pulling an ear here and there, but there had to be some way of knowing how this small sample translated into the quality of the whole crop.

  The mighty Karl Pearson had never bothered with small samples: he had his factory, cranking out thousands of observations and bending the great curves to fit them. He saw no real distinction between sample and population. Student, though, worked alone—and if he tested the whole of every field, all the pubs in Poolbeg Street would run dry. So he developed “Student’s t-test,” describing how far a sample could be expected to differ in mean value and spread from the whole population—using only the number of observations in the sample. This tool was all Fisher needed. Starting with Student’s t-test, adding randomization of the initial setup, and subjecting his results to analysis of variance produced what science had long been waiting for: a method for understanding the conjoined effects of multiple causes, gauging not just if something produced an effect, but how large that effect was. All modern applied sciences, from physics to psychology, use terms like “populations” and “variance” because they learned their statistics from Fisher, a geneticist.
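The statistic Student tabulated is simple enough to compute with a pencil, or with a few lines of Python; the two small barley samples below are invented for illustration:

```python
import math
import statistics

# Two invented small samples of barley yields from different fields.
field_a = [4.2, 4.8, 4.5, 5.0, 4.6]
field_b = [4.0, 4.3, 3.9, 4.4, 4.1]

def t_statistic(a, b):
    """Two-sample t with pooled variance: how far apart the sample
    means are, measured in units of their own estimated spread."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    se = math.sqrt(pooled * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se

t = t_statistic(field_a, field_b)
print(f"t = {t:.2f} on {len(field_a) + len(field_b) - 2} degrees of freedom")
```

Looked up in Student's table for eight degrees of freedom, a t this large counts as significant, which is all the sampling agent needs to know.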

  Fisher was a tough man, and he presumed toughness in the researcher. His method requires that we start with the null hypothesis, that the observed difference is due only to chance: in the policeman’s habitual phrase, “there’s nothing to see here.” We then choose a measure, a statistic, and determine how it would be distributed if the null hypothesis were true. We define what might be an interesting value for this statistic (what we can call an “effect”) and we determine the probability of seeing this value, or one more extreme, if the null hypothesis were true. This probability, called a p-number, is the measure of statistical significance. So if, say, Doc Waughoo’s Seminole Fever Juice reduces patients’ fever by five degrees when variation in temperature without treatment is three, the probability that this effect has appeared by chance would be below a small p-value, suggesting there’s more at work here than just alcohol and red food coloring.
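The Fever Juice arithmetic can be sketched directly. Assume, with the null hypothesis, that temperature changes are normally distributed around zero with a spread of three degrees (the normal shape is an assumption of this sketch), and ask how often chance alone delivers a five-degree drop or better:

```python
import math

null_sd = 3.0        # spread of temperature change with no treatment
observed_drop = 5.0  # drop seen in the treated patients

def normal_tail(x, sigma):
    """P(X >= x) for a normal(0, sigma): the 'this value or more
    extreme' probability that Fisher's procedure asks for."""
    return 0.5 * math.erfc(x / (sigma * math.sqrt(2)))

p = normal_tail(observed_drop, null_sd)
print(f"p = {p:.3f}")  # just under .05: unusual, if chance is all there is
```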

  For Fisher, reaching the measure of significance was the end of the line of inference. The number told you “either the null hypothesis is false and there is a real cause to this effect—or a measurably unusual coincidence has occurred.” That’s all: either this man is dead or my watch has stopped.

  Away from the pure atmosphere of Rothamsted, however, Fisher’s methods revealed the problems of scaling up from barley to people. From the beginning, the two great issues were sampling and ethics—two problems that became one in the Lanarkshire milk experiment of 1930. Lanark, you will remember, had been the county with the puniest chest measurements, and things were not much better now. A few well-nourished farm children shared their schoolrooms with the underfed sons and daughters of coal miners and unemployed wool spinners.

  Twenty thousand children took part in the experiment: 5,000 were to receive three-quarters of a pint of raw milk every day; 5,000 an equivalent amount of pasteurized milk; 10,000 no milk at all. The choice of treatment was random—but teachers could make substitutions if it looked as though too many well- or ill-nourished children were included in any one group. Evidently, the teachers did exactly what you or I would do, since the final results showed that the “no milk” children averaged three months superior in weight and four months in height to the “milk” children. So either milk actually stunts your growth or a measurable coincidence has occurred, or . . .

  A more successful test was the use of streptomycin against tuberculosis, conducted by Austin Bradford Hill in 1948. This was a completely randomized trial: patients of each sex were allocated “bed rest plus streptomycin” or “bed rest alone” on the basis only of chance. Neither the patients, their doctors, nor the coordinator of the experiment knew who had been put in which group. The results were impressive: only 4 of the 55 treated patients died, compared with 14 of the 52 untreated. This vindicated not just streptomycin but the method of trial.

  Why did Hill stick to the rules when the Lanarkshire teachers bent them? Some say it was because there were very limited supplies of streptomycin—by no means enough to give all patients, so why not turn necessity into scientific opportunity? Others might feel that the experience of war had made death in the name of a greater good more bearable as an idea. One could also say that the new drugs offered a challenge that medicine had to accept. A jug of milk may be a help to the wretched, but the prospect of knocking out the great infections, those dreadful harvesters with so many lives already in their sacks—well, that made the few more who died as controls unwitting heroes of our time.

  Hill is famous for his later demonstration of the relation between smoking and lung cancer. Fisher never accepted these results—and not simply because he enjoyed his pipe. He didn’t believe that the correlation shown proved causation, and he didn’t like the use of available rather than random samples. Fisher always demanded rigor—but science wanted to use his techniques, not be bound by his strictures.

  Nowadays, the randomized double-blind clinical trial is to medical experiment what the Boy Scout oath is to life: something we should all at least try to live up to. It is the basis of all publication in good medical journals; it is the prerequisite for all submissions to governmental drug approval agencies. But there have always been problems adapting Fisher’s guidelines to fit the requirements of a global industry.

  The simplest question is still the most important: has the trial shown an effect or not? The classical expression of statistical significance is Fisher’s p-number, which, as you’ve seen, describes a slightly counterintuitive idea: assuming that chance alone was responsible for the results, what is the probability of a correlation at least as strong as the one you saw? As a working scientist with many experiments to do, Fisher simply chose an arbitrary value for the p-number to represent the boundary between insignificant and significant: .05. That is, if simple chance would produce these results only 5 percent of the time, you can move on, confident that either there really was something in the treatment you’d tested—or that a coincidence had occurred that would not normally occur more than once in twenty trials. Choosing a level of 5 percent also made it slightly easier to use the table that came with Student’s t-test; in the days before computers, anything that saved time with pencil and slide rule was a boon.

  So there is our standard; when a researcher describes a result as “statistically significant,” this is what is meant, and nothing more. If we all had the rigorous self-restraint of Fisher, we could probably get along reasonably well with this, taking it to signify no more than it does.

  Unfortunately, there are problems with p-numbers. The most important is that we almost cannot help but misinterpret them. We are bound to be less interested in whether a result was produced by chance than in whether it was produced by the treatment we are testing: it is a natural and common error, therefore, to transfer over the degree of significance, turning a 5 percent probability of getting these results assuming randomness into a 5 percent probability of randomness assuming we got these results (and therefore a 95 percent probability of genuine causation). The two do not line up: the probability that I will carry an umbrella, assuming it is raining, is not the same as the probability that it is raining, assuming that I’m carrying an umbrella. And yet even professionals can be heard saying that results show, “with 95 percent probability,” the validity of a cause. It’s a fault as natural and pervasive as misusing “hopefully”—but far more serious in its effects.
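A made-up set of counts shows why the umbrella probabilities refuse to line up: the two fractions share a numerator but divide by different totals.

```python
# Invented counts: I am a cautious person who often carries an umbrella.
rainy_with_umbrella = 90   # rainy days on which I carried the umbrella
rainy_days = 100           # all rainy days
umbrella_days = 400        # all days I carried it, wet or dry

p_umbrella_given_rain = rainy_with_umbrella / rainy_days      # 0.9
p_rain_given_umbrella = rainy_with_umbrella / umbrella_days   # 0.225

print(p_umbrella_given_rain, p_rain_given_umbrella)
```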

  Another problem is that, in a world where more than a hundred thousand clinical trials are going on at any moment, this casually accepted 5 percent chance of coincidence begins to take on real importance. Would you accept a 5 percent probability of a crash when you boarded a plane? Certainly not, but researchers are accepting a similar probability of oblivion for their experiments. Here’s an example: in a group of 33 clinical trials on death from stroke, with a total of 1,066 patients, the treatment being tested reduced mortality on average from 17.3 percent in the control group to 12 percent in the treated group—a reduction of more than 25 percent. Are you impressed? Do you want to know what this treatment is?

  It’s rolling a die. In a study reported in the British Medical Journal, members of a statistics course were asked to roll dice representing individual patients in trials; different groups rolled various numbers of times, representing trials of various sizes. The rules were simple: rolling a six meant the patient died. Overall, mortality averaged out at the figure you would expect: 1/6, or 16.7 percent. But two trials out of 44 (1/22—again, the figure you’d expect for a p-value of 5 percent) showed statistically significant results for the “treatment.” Many of the smaller trials veered far enough from the expected probabilities to produce a significant result when taken together. A follow-up simulation using the real mortality figures from the control group of patients in a study of colorectal cancer (that is, patients who received no treatment), showed the same effect: out of 100 artificially generated randomized trials, four showed statistically significant positive results. One even produced a 40 percent decrease in mortality with a p-value of .003. You can imagine how a study like that would be reported in the news: “Med Stats Prove Cancer Miracle.”
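The dice experiment is easy to re-create. The sketch below is not the BMJ study's own procedure, but the same idea: simulate 100 “trials” of 100 patients each, let a rolled six stand for a death, and score each trial with an exact one-sided binomial test against the true mortality of 1/6:

```python
import math
import random

def p_lower_tail(deaths, n, p=1/6):
    """Chance of seeing this many deaths or fewer if mortality is 1/6."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(deaths + 1))

random.seed(0)  # fixed seed so the simulation is repeatable
significant = 0
for _ in range(100):
    deaths = sum(random.randint(1, 6) == 6 for _ in range(100))
    if p_lower_tail(deaths, 100) < 0.05:
        significant += 1

print(f"{significant} of 100 dice 'trials' showed a significant benefit")
```

Run it with a few different seeds and a handful of trials will reliably reach significance, with no treatment anywhere in sight.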

  Chance, therefore, plays its habitual part even in the most rigorous of trials—and while Fisher would be willing to risk a small chance of false significance, you wouldn’t want that to be the basis of your own cancer treatment. If you want to squeeze out chance, you need repetition and large samples.

  Large samples turn out to be even more important than repetition. Meta-analysis—drawing together many disparate experiments and merging their results—has become an increasingly used tool in medical statistics, but its reliability depends on the validity of the original data. The hope is that if you combine enough studies, a few sets of flawed data will be diluted by the others, but “enough” is both undefined and crucial. One meta-analysis in 1991 came up with a very strong positive result for injecting magnesium in cases of suspected heart attack: a 55 percent reduction in the chance of death with a p-value less than .001. The analysis was based on seven separate trials with a total of 1,301 patients. Then came ISIS-4, an enormous trial with 58,050 patients; it found virtually no difference in mortality between the 29,011 who were injected with magnesium and the 29,039 who were not. The results differed so dramatically because, in studies with 100 patients or fewer, one death more in the control group could artificially boost the apparent effectiveness of the treatment—while no deaths in the control group would make it almost impossible to compare effectiveness. Only a large sample allows chance to play its full part.
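A simulation makes the point about sample size. In the sketch below the “treatment” has, by construction, no effect at all; mortality is fixed at an assumed 10 percent in both arms, and we watch what apparent percentage reduction each trial reports:

```python
import random

random.seed(2)  # fixed seed for repeatability

def apparent_reduction(n_per_arm, mortality=0.10):
    """Run one null trial and report the apparent percent reduction in
    mortality that it shows (None if the control arm has no deaths)."""
    control = sum(random.random() < mortality for _ in range(n_per_arm))
    treated = sum(random.random() < mortality for _ in range(n_per_arm))
    if control == 0:
        return None
    return 100 * (control - treated) / control

small = [apparent_reduction(50) for _ in range(10)]   # tiny trials
large = apparent_reduction(25_000)                    # one huge trial

print("ten small trials:",
      [round(r) if r is not None else None for r in small])
print("one large trial:", round(large, 1))
```

The small trials swing wildly in both directions; the large one hugs zero, because only there does chance get enough room to cancel itself out.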

  There is a further problem with statistical significance if what you are investigating is intrinsically rare. Since significance is a matter of proportion, a high percentage of a tiny population can seem just as significant as a high percentage of a large one. Tuberculosis was a single disease, spread over a global population. Smoking was at one time a nearly universal habit. But now researchers are going after cagier game: disorders that show themselves only rarely and are sometimes difficult to distinguish accurately from their background.

  In Dicing with Death, Stephen Senn tells the story of the combined measles, mumps, and rubella vaccine—which points out how superficially similar situations can require widely separate styles of inference, leading to very different conclusions. That rubella during pregnancy caused birth defects was discovered when an Australian eye specialist overheard a conversation in his waiting room between two mothers whose babies had developed cataracts. A follow-up statistical study revealed a strong correlation between rubella (on its own, a rather minor disease) and a range of birth defects—so a program of immunization seemed a good idea. Similar calculations of risk and benefit suggested that giving early and complete immunity to mumps and measles would have important public health benefits. In the mid-1990s, the UK established a program to inoculate children with a two-stage combined measles, mumps, rubella (MMR) vaccine.

  In 1998, The Lancet published an article by Dr. Andrew Wakefield and others, describing the cases of twelve young children who had a combination of gastrointestinal problems and the sort of developmental difficulties associated with autism. In eight of the cases, the parents said they had noticed the appearance of these problems about the time of the child’s MMR vaccination. Dr. Wakefield and his colleagues wondered whether, in the face of this large correlation, there might be a causal link between the two. The reaction to the paper was immediate, widespread, and extreme: the authors’ hypothetical question was taken by the media as a definitive statement; public confidence in the vaccine dropped precipitately; and measles cases began to rise again.

  On the face of it, there seem to be parallels between the autism study and the original rubella discovery: parents went to a specialist because their children had a rare illness; they found a past experience in common; the natural inference was a possible causal relationship between the two.

  But rubella during pregnancy is relatively rare and the incidence of cataracts associated with it was well over the expected annual rate. The MMR vaccination, on the other hand, is very common—it would be unusual to find a child who had not received it during the years covered by the Lancet study. Moreover, autism is not only rare, but difficult to define as one distinct disorder. To see whether the introduction of MMR had produced a corresponding spike in cases of autism would require a stable baseline of autism cases—something not easy to establish.

  Later studies looking for a causal link between MMR and autism could find no temporal basis for causation, either in the year that the vaccine was introduced or in the child’s age at the onset of developmental problems. In 2004, ten of the authors of the original Lancet paper formally disassociated themselves from the inferences that had been drawn from it—possibly the first time in history that people have retracted not their own statement, but what others had made of it.

  High correlation is not enough for inference: when an effect is naturally rare and the putative cause is very common, the chance of coincidence becomes significant. If you asked people with broken legs whether they had eaten breakfast that morning, you would see a very high correlation. The problem of rarity remains and will become more troubling, the more subtle the illnesses we investigate. Fisher could simply plant another block of wheat; doctors cannot simply conjure up enough patients with a rare condition to ensure a reliable sample.

  The word “control” has misleading connotations for medical testing: when we hear of a “controlled experiment,” it’s natural to assume that, somehow, all the wild variables have been brought to heel. Of course, all that’s really meant is that the experiment includes a control group who receive a placebo as well as a treatment group who get the real thing. The control group is the fallow ground, or the stand of wheat that has to make its way unaided by guano. Control is the foundation stone of meaning in experiment; without it, we build conclusions in the air. But proper control is not an easy matter: in determining significance, a false result in the control can have the same effect as a true one in the treatment group.

  Let’s consider an obvious problem first: a control group should be essentially similar to the treatment group. It’s no good comparing throat cancer with cancer of the tongue, or the depressed with the schizophrenic. If conditions are rare or difficult to define, the control group will pose the same difficulties of sample size, chance effects, and confounded variables as the treatment group. Controls also complicate the comparison of trials from different places: is the reason your throat-cancer controls show a higher baseline mortality in China than in India simply a matter of chance, to be adjusted out of your data—or is there hidden causation: drinking boiling hot tea or using bamboo food scrapers?

  Moreover, how do you ensure that the control group really believes that it might have received the treatment? In the 1960s, controls in some surgery trials did indeed have their chests opened and immediately sewn up again—a procedure unlikely to pass the ethics committee now. How would you simulate chemotherapy or radiation? Surreptitiously paint the patient’s scalp with depilatory cream? Patients are no fools, especially now in the age of Internet medicine: they know a lot about their conditions and are naturally anxious to find out whether they are getting real treatment. If the control is the foundation, there are some soils on which it is hard to build securely.

  The word “placebo” is a promise: “I will please.” This promise is not a light matter. For a double-blind experiment to succeed, both the control group and the doctors who administer the placebo have to be pleased by it—they have to believe that it is indistinguishable from the active treatment. Considerable work has gone into developing placebos that, while inactive for the condition being tested, provide the side effects associated with the treatment.