Where the rubber meets the road

Jonas Ranstam PhD


Humans are generally unable to handle uncertainty rationally. Finding significance in random stimuli and interpreting random phenomena as mostly dangerous gave a survival advantage that can explain some of the anxiety and superstition in today's society. In the history of science, uncertainty has not always been accepted. For example, questioning the church's doctrine that the earth was the centre of the universe earned Galileo a conviction for heresy in 1633. During the following Age of Enlightenment, scientific societies started to publish research findings in scientific journals, but these were mainly what we today would describe as expert opinions and subjective case reports, without proper recognition of uncertainty.

The authority-based approach to science was gradually replaced by an evidence-based approach during the 20th century. Uncertain research findings, accompanied by objective and reproducible quantification of the uncertainty, were published. Statistics made its entry into medical science, with major effects on health care. By the end of the century it was no longer possible to get market approval for a drug just by presenting a letter from a medical professor who had tested the drug on 3 patients and found it effective. Instead, randomised trials with pre-specified endpoints and statistically significant outcomes were required.

Without a genuine education in probability theory and inference theory, statistical significance is, however, a difficult concept that is easy to misunderstand. The widespread use of statistical significance tests in the exponentially growing number of research reports has also led to a caricature of hypothesis testing, with results presented as either "significant" or "not significant", the former ("p<0.05") believed to indicate practical importance and the latter ("NS") to represent evidence of equivalence. As much as this may seem objective, practically useful, and generally accepted, both interpretations are fundamentally flawed. Statistical significance says nothing about practical importance, and statistical non-significance says nothing about equivalence. Finding such evidence requires much more than simple statistical tests.

In spite of the many user-friendly statistical computer packages available, the uncertainty of the findings presented in scientific publications today is often grossly misrepresented. As a consequence, medical research suffers from a monumental reproducibility crisis. There is obviously no way out of this mess other than to improve the evaluation and presentation of the findings' uncertainty. Statistical science will, as reflected by the growing number of statisticians reviewing grant applications and manuscripts, play an even more important role in future medical research.

However, not all stakeholders will benefit from methodological improvements, and the research environment rewards, at least in the short run, spectacular news more than efforts to find the truth. Spin and exaggeration have become the standard. Statistical reviewers and editors have a hard job.

The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge

This quote by the American historian Daniel Boorstin is becoming increasingly relevant also in medical research. Nowadays, it seems to be much more common than one or two decades ago that authors confuse cohort studies with case-control studies, cumulative incidence with incidence density, hazard ratios with odds ratios, etc. It is paradoxical that the increased accessibility of information via the Internet (and the ease with which the definitions of methodological terms can be checked) comes, in Rumsfeldian terms, with a reduction of the known unknowns and a corresponding growth of the unknown unknowns.

The commonest mistake

While statistical significance is often mistaken for an indication of practical importance or scientific relevance, an even greater mistake is to believe that statistical non-significance indicates equivalence or "no difference". It doesn't. Statistical non-significance reflects uncertainty, which can perhaps be taken as an indication of too small a sample size.

Should all manuscripts be statistically reviewed?

Not all medical scientific publications present evidence-based research. Many, if not most, hypothesis presentations, non-systematic reviews, and case reports are authority based rather than evidence based. Such publications may also have a role to play in the progress of science. It should, however, always be made clear to the reader whether the author's ambition has been to present a personal opinion or an objective and reproducible research finding. From an editorial point of view, it may be difficult to distinguish between these two types of manuscripts; the author's ambition is seldom declared, and statistical inference is often misused. Personal opinions should, of course, not be statistically reviewed, because statistical reviewing can make a good expert opinion appear bad and a poor one appear good.


A manuscript that is entirely based on assumptions presents a hypothesis. Manuscripts written to present empirical findings must be based on data. As W. Edwards Deming said, "In God we trust, all others must bring data". However, in order to analyse data, assumptions must be made. When presenting the analyses and their results, the author must clearly distinguish between observations, assumptions, and analysis outcomes. Confusing assumptions with outcomes is not a good thing.

Subjective or objective

Modern medical research claims to be objective and reproducible. The reproducibility has, however, recently been questioned. One explanation for this could be that while the findings may appear to be based on sound objective research, they actually just represent subjective opinion.

In contrast to well-performed clinical trials, many laboratory studies, including those based on statistically correct methods, do not have a well-defined, pre-specified study design linking the investigated study hypothesis with the many statistical null hypotheses tested by the investigator. Instead of letting the experiment directly provide the outcome of the study, as in a randomised trial, the investigator uses his or her expert knowledge to interpret an abundance of p-values and to formulate an expert opinion about the study hypothesis.

Apart from the subjectivity and fallibility of this experimental strategy, another drawback is that statistically oriented reviewer comments, for example regarding the consequences of misinterpreted p-values and unfulfilled methodological assumptions, tend to be perceived as questioning the investigator's biomedical expertise, which does not facilitate methodological improvement.

Differences and non-differences

Statistical significance has nothing to do with practical importance or scientific relevance; statistical significance reflects sampling uncertainty. Moreover, the number of statistically significant findings that can be expected in a study is related to the study design, not least the sample size, the number of statistical tests performed, and the strategy used for addressing multiplicity issues.

Successful investigators design their experiments in a way that enables the detection of practically important and scientifically relevant differences or effects. Parts of such a design are a sample size calculation based on a reasonable estimate of what is practically important, a procedure for data collection that prevents selection bias and confounding, and a strategy for addressing multiplicity issues. In observational research, similar problems have to be resolved in the statistical analysis instead of in the study design. However, entirely disregarding these problems and simply interpreting statistical significance as an indication of practical importance and statistical non-significance as an indication of equivalence reflects a fundamental misunderstanding. P-values are not, and have never been, a substitute for scientific reasoning.
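As a rough illustration of the first of these design steps, the standard normal-approximation formula for the sample size of a two-group comparison of means can be sketched as follows (the difference, standard deviation, and error rates below are hypothetical examples, not recommendations):

```python
import math

def n_per_group(delta, sigma, z_alpha=1.96, z_beta=0.8416):
    """Approximate subjects per group for a two-sided two-sample comparison
    of means (normal approximation, alpha = 0.05, power = 80%)."""
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * (sigma / delta) ** 2)

# Hypothetical example: the smallest practically important difference is
# delta = 5 units, with an assumed standard deviation of sigma = 10.
print(n_per_group(delta=5, sigma=10))  # 63 subjects per group
```

The point of the sketch is that the calculation starts from a judgement about practical importance (delta), not from any statistical quantity.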

Multivariate and multivariable

Using correct terminology is important for avoiding misunderstandings. For example, the terms univariate and multivariate are often misunderstood. They refer to the type of probability distribution a model is based on. A univariate statistical model is based on a univariate probability distribution, i.e. it has one outcome variable, and a multivariate analysis is based on a multivariate probability distribution, i.e. the model has multiple outcome variables. An ANOVA model, for example, is univariate because it has one outcome variable, but a MANOVA model is multivariate because it has more than one outcome variable.

A regression model can have one or more regressors. A regression analysis with one outcome variable and one regressor is known as a simple regression analysis; with multiple regressors it is a multiple regression analysis. In order to counter the common misuse of "multivariate" as a description of univariate multiple regression models, the term "multivariable" has been coined. This term just says that the statistical model includes multiple variables. By analogy, a simple regression model could have been called bivariable, but it is described as a univariable model.

In summary, even if it is possible to analyse a multivariate multivariable statistical model, most multivariable models are univariate.
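To make the terminology concrete, here is a minimal sketch of a simple (univariable, univariate) regression: one outcome variable, one regressor. Adding further regressors would make the model multivariable, but it would remain univariate as long as there is a single outcome variable. The data are made up for illustration:

```python
from statistics import mean

def simple_regression(x, y):
    """Least-squares slope and intercept for one outcome and one regressor."""
    mx, my = mean(x), mean(y)
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return slope, my - slope * mx

x = [1, 2, 3, 4, 5]     # one regressor
y = [3, 5, 7, 9, 11]    # one outcome variable: y = 1 + 2x exactly
print(simple_regression(x, y))  # (2.0, 1.0)
```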

Finite or infinite population

Many authors are confused about whether the purpose of writing a research report is to describe or to generalise. Some authors also seem to believe that p-values and confidence intervals are descriptive measures that must be used to describe the importance of what has been observed in a studied group of subjects. This is not the case; p-values and confidence intervals describe generalisation uncertainty. However, the question is more complicated than this. Generalising with the help of p-values and confidence intervals must sometimes be performed differently depending on the type of population studied.

Most randomised trials, cohort studies, and case-control studies are not performed for the participating subjects themselves but for the benefit of future patients, an infinite population. A survey, on the other hand, is usually performed to learn about a finite population defined in time and space. While an infinite population can only be studied using samples, a finite population can be studied both with samples and with censuses. Analysing a sample from a finite population may require different calculations than a sample from an infinite population.

A sample drawn from an infinite population is usually considered to be a simple random sample. Surveys usually have a more complicated sampling design, and this needs to be accounted for in the analysis. A finite population correction (FPC) may also be necessary. Ignoring the sampling design and analysing survey data as if they had been collected as a simple random sample is likely to yield too small standard errors, too narrow confidence intervals, and too low p-values.
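For a simple random sample drawn without replacement from a finite population, the standard error of the mean shrinks by the finite population correction. A minimal sketch, with a hypothetical sample standard deviation and hypothetical sample and population sizes:

```python
import math

def se_mean(s, n, N=None):
    """Standard error of a sample mean; applies the finite population
    correction sqrt((N - n) / (N - 1)) when a population size N is given."""
    se = s / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se

s, n = 10.0, 100
print(se_mean(s, n))         # 1.0 (infinite population)
print(se_mean(s, n, N=400))  # smaller, about 0.867: a quarter of the
                             # population has already been observed
```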

Another, more philosophical, difference is that the analysis of a sample from an infinite population prioritises internal validity (i.e. an unbiased description of cause-effect relationships between variables). For survey data, the aim is instead to achieve as high external validity (i.e. an unbiased description of the population's properties) as possible.

Statistical models

In observational clinical research, statistical models are primarily used for two purposes: developing algorithms for individual prediction, and estimating average effects of treatments or of exposure to hazardous agents. Confusingly for many authors, these two modelling purposes require different methodological approaches.

While the best prediction model is the model that predicts best (whether or not the parameter estimates are biased is irrelevant), which is evaluated using, for example, the area under the ROC curve, the best explanatory model is the one with the least biased parameter estimates (prediction accuracy is irrelevant). This requires considerations regarding cause-effect relationships. For example, confounders are included in the model to reduce confounding bias, but including a factor on the pathway between cause and effect would be a mistake because this would induce adjustment bias.
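The area under the ROC curve mentioned above can be computed directly as the probability that a randomly chosen subject with the outcome receives a higher predicted score than one without, ties counting one half. A minimal sketch with made-up predicted risks:

```python
def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive case scores higher; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical predicted risks for subjects with and without the outcome:
print(auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.1]))  # 8/9, about 0.89
```

Note that this measures discrimination only; nothing in the calculation depends on whether the model's parameter estimates are biased.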

It is usually wise to avoid presenting risk estimates from prediction models and predictions from explanatory models.


The new version of Stata (release 16) includes LASSO regression. This is excellent because LASSO is one of the better methods for developing prediction and classification models. However, like stepwise regression, it is unfit for producing parameter estimates with adjustment for confounding bias. The adjustment must be based on considerations regarding cause-effect relations (i.e. confounders must be included in, and mediators and colliders excluded from, the statistical model used for the estimation). This information cannot be derived from data.

I fear that we will soon see publications with LASSO regression being used for the wrong purpose.


Addition July 7, 2019

Citation from Stata:

"Lasso is intended for prediction and selects covariates that are jointly correlated with the variables that belong in the best-approximating model. Said differently, lasso estimates the variables that belong in the model. Like all estimation, this is subject to error. However you put it, the inference methods are robust to these errors if the true variables are among the potential control variables that you specify."

The condition "if the true variables are among the potential control variables that you specify" is crucial. The last sentence should be read with the emphasis on "you specify". Don't expect that the Lasso method can help you.

Sample and population

Some manuscripts are based on a detailed description of a series of patients and have a conclusion restricted to what has been observed. This is what could be expected of a case-series report. However, the same patients could also have been considered a random sample drawn from and representing a greater population of patients, perhaps including future ones. In this case, the findings cannot be directly generalised to the greater population because of sampling uncertainty. When attempting to describe the underlying parameters of the population (including the sample), the uncertainty needs to be presented. This is what p-values and confidence intervals are used for. The inclusion of these measures in a descriptive report, written without any ambition to evaluate underlying mechanisms or effects, indicates methodological confusion.


A study hypothesis can usually not be evaluated by a statistical test of a single null hypothesis. The study may be based on comparisons of more than two groups and of more than one endpoint. Several, perhaps hundreds, of statistical tests can then be found in a manuscript, and when multiple null hypotheses are tested, the false positive risk increases with the number of tested hypotheses. In confirmatory studies the significance level may need to be corrected for this multiplicity. One often used method is named after the Italian statistician Carlo Bonferroni.

One common misunderstanding is that all multiplicity problems are solved by correcting the significance level for the number of group comparisons. This leads, however, to an insufficient correction when multiple endpoints are ignored.
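A Bonferroni correction that accounts for both group comparisons and endpoints can be sketched as follows (the numbers of comparisons and endpoints below are hypothetical examples):

```python
def bonferroni_alpha(alpha, n_comparisons, n_endpoints=1):
    """Per-test significance level corrected for the total number of tests:
    group comparisons times endpoints."""
    return alpha / (n_comparisons * n_endpoints)

# Correcting only for 3 group comparisons:
print(bonferroni_alpha(0.05, 3))     # about 0.0167
# Correcting for 3 comparisons on each of 4 endpoints, i.e. 12 tests:
print(bonferroni_alpha(0.05, 3, 4))  # about 0.0042
```

Correcting only for the three comparisons while twelve tests are actually performed leaves the per-test level four times too generous.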

Significance and nonsignificance

It is a common belief that a statistically significant finding is always practically important and that statistical nonsignificance is a good indication of "no difference". This belief is a major mistake. Statistical significance is a measure of uncertainty, not of importance; practical importance has to be shown by other means than p-values. Equivalence and non-inferiority can only be statistically tested when an equivalence or non-inferiority margin, specifying what is practically important, has been defined.
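One simple way to assess equivalence, once a margin has been defined, is to check whether the entire confidence interval for the difference lies inside that margin. A minimal sketch under a normal approximation; the estimates and margin below are hypothetical:

```python
def equivalent(diff, se, margin, z=1.96):
    """True if the 95% CI for the difference lies entirely within
    (-margin, +margin). A non-significant test alone shows nothing."""
    lower, upper = diff - z * se, diff + z * se
    return -margin < lower and upper < margin

# Hypothetical example: estimated difference 0.5, SE 1.0, margin 3.0.
# The difference is not statistically significant (the CI covers 0), and
# here the CI (-1.46, 2.46) also lies inside the margin: equivalence.
print(equivalent(0.5, 1.0, 3.0))  # True
# Same estimate but SE 2.0: still non-significant, yet NOT equivalence,
# because the wider CI (-3.42, 4.42) crosses the margin.
print(equivalent(0.5, 2.0, 3.0))  # False
```

The second call is the crucial one: it is exactly the situation where "NS" is mistaken for "no difference".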

Regression effects

When evaluating the effect of a treatment, it may be tempting to perform the treatment on a group of subjects that have scored extremely on some measurement, and then measure the subjects again after the treatment. The difference in the measured values would provide a good estimate of the treatment effect. Or wouldn't it?

The answer is that if measurement errors and accidental variation affect the measurements randomly, more subjects will be included with too high values than with too low ones, and at the next measurement they will in general not be as unlucky; their measured values will tend to be less extreme. This statistical phenomenon is known as regression to the mean. The only practical way to properly account for such regression effects is to include a control group selected using the same criteria as the treated group. Treatment effects can then be separated from regression effects in the statistical analysis.
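The phenomenon is easy to demonstrate by simulation: subjects selected for an extreme first measurement appear "improved" on remeasurement even though no treatment is given. A sketch with made-up parameters:

```python
import random

random.seed(1)  # for reproducibility

# True values and two noisy measurements per subject; no treatment given.
true_values = [random.gauss(100, 10) for _ in range(10_000)]
first = [t + random.gauss(0, 10) for t in true_values]
second = [t + random.gauss(0, 10) for t in true_values]

# Select subjects with an extreme first measurement (> 120).
selected = [i for i, m in enumerate(first) if m > 120]
mean_first = sum(first[i] for i in selected) / len(selected)
mean_second = sum(second[i] for i in selected) / len(selected)

# With no treatment at all, the remeasured mean of the selected group
# falls back towards the overall mean of 100.
print(round(mean_first, 1), round(mean_second, 1))
```

An untreated control group selected with the same cut-off would show the same fall, which is what allows the regression effect to be separated from a treatment effect.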

Four quartiles?

Given that only three quartiles are defined, the middle one also known as the median (see The International Statistical Institute, The Oxford Dictionary of Statistical Terms, Oxford University Press, New York 2003), it is surprisingly common to see results presented with four quartiles. The explanation is, of course, that the term is misunderstood. The misunderstanding is actually so common that the Merriam-Webster dictionary states that the four quartiles are the same as the four quarters defined by the three quartiles. Confusing? While the exact definition may be of minor importance when writing fiction, avoiding misunderstandings is a crucial part of scientific writing. Stick to the statistical definition of statistical terms.
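Python's standard library follows the statistical definition: three quartiles cut the data into four quarters. For example, with made-up data:

```python
from statistics import quantiles, median

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# n=4 requests quartiles: exactly three cut points are returned.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
print(q1, q2, q3)          # 3.5 6.0 8.5
assert q2 == median(data)  # the middle quartile is the median
```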