- Daniel Millimet
- Southern Methodist University
The analysis of data by researchers in academia or industry serves one of two objectives: prediction or causal inference. In prediction problems, the goal of the analysis is to predict or forecast some classification target such as consumer purchase decisions, loan repayment, or patient health risks. Here, the features (or covariates) found to be strong predictors should not be attributed any special significance; the weights (or coefficient estimates) should not be given a causal interpretation. In problems of causal inference, the goal of the analysis is to identify the features that have a causal effect on the target (or outcome). Examples include the effect of a certain ad on consumer purchase decisions or of a certain medical treatment on mortality. Here, the weights do possess a causal interpretation, but typically at the cost of not predicting the target as well as one could. A 2017 article by Sendhil Mullainathan and Jann Spiess in the Journal of Economic Perspectives provides an excellent discussion of these issues.
Classical machine learning focuses on prediction problems. Today, causal inference has assumed an increasingly prominent role in data science. Regardless of the objective (prediction or causal inference), a frequently overlooked aspect of data analysis is measurement error. Measurement error arises when the variable observed in the data, say x, differs from the truth, say x'. The difference, u = x - x', constitutes the measurement error. Ignoring measurement error can have serious ramifications for the data analysis. Or, it may not.
Before discussing when a researcher should or should not worry about measurement error, there are a few key concepts of which one should be aware.
Classical versus nonclassical measurement error. Classical measurement error refers to situations where the measurement error, u, is mean zero and unrelated to the truth, x'. Nonclassical measurement error refers to any situation where u does not satisfy the classical properties. There are two common situations where nonclassical measurement error can arise. First, u may be one-sided (and, hence, not mean zero). For example, patients may over-report compliance with medication protocols or under-report engaging in behaviors such as alcohol consumption. Second, u may be correlated with x'. For example, it is well known that measurement error in self-reported individual income is mean-reverting; low-income individuals tend to overstate their income and high-income individuals do the reverse. Less well known is that if x' is a bounded variable (such as a binary variable), then u must be nonclassical since the error becomes one-sided as the truth approaches the boundaries.
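A small simulation can make the last point concrete. The sketch below is illustrative only: the sample size, the noise scale, and the 10% misreporting rate are assumptions chosen for the example, not values from any study. It draws a classical error for a continuous variable and a misreporting error for a binary variable, then compares how each error correlates with the truth.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Classical error: mean zero and drawn independently of the truth.
x_true = rng.normal(size=n)
u_classical = rng.normal(scale=0.5, size=n)

# Binary truth, misreported with an assumed probability of 0.1.
b_true = rng.integers(0, 2, size=n)
flip = rng.random(n) < 0.1
b_obs = np.where(flip, 1 - b_true, b_true)
u_binary = b_obs - b_true  # can only be 0 or -1 when truth is 1, 0 or +1 when truth is 0

corr_classical = np.corrcoef(x_true, u_classical)[0, 1]
corr_binary = np.corrcoef(b_true, u_binary)[0, 1]

print(f"corr(u, x') with classical error:  {corr_classical:.3f}")
print(f"corr(u, x') with a binary truth:   {corr_binary:.3f}")
```

Because the error in the binary case is one-sided at each boundary, its correlation with the truth comes out clearly negative, whereas the classical error is essentially uncorrelated with the truth.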
Differential versus nondifferential measurement error. Differential measurement error refers to situations where the measurement error, u, is related to the target, say y'. Nondifferential measurement error refers to any situation where u is unrelated to y'. Differential measurement error often arises in two situations. First, u may be related to the target when data on x' is collected retrospectively, after y' is realized. Second, instances (or observations) may believe x to be the truth and make decisions accordingly. For example, if x' is a binary indicator for whether an individual truly has health insurance or not, x is a self-reported binary indicator for whether the individual has health insurance, and y' is a binary indicator for seeking a preventative checkup, then an individual’s medical decisions likely depend on their beliefs about their insurance status, x, even if these beliefs are incorrect.
Perfect versus imperfect proxy variable. In his textbook, Introductory Econometrics, Jeffrey Wooldridge defines a proxy variable as "an observed variable that is related but not identical to an unobserved explanatory variable in multiple regression analysis." Sounds like measurement error? Essentially, yes, since x is "related but not identical" to x'. Proxies can then be further divided into perfect and imperfect proxies. A perfect proxy is linearly dependent on the truth. In this case, we can write x = a + bx', where a and b are scalars. An imperfect proxy is not linearly dependent on the truth. Now, we write x = a + bx' + e, where e is stochastic. In the case of a perfect proxy, the measurement error, u, is equal to a + (b-1)x'. In the case of an imperfect proxy, the measurement error is equal to a + (b-1)x' + e. Thus, researchers claiming to use "proxies" are really just masking the presence of measurement error with a change in terminology.
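The algebra behind these expressions follows directly from the definition u = x - x'. The moment calculations in the last two lines assume, beyond what is stated above, that e is mean zero and uncorrelated with x'.

```latex
\begin{align*}
\text{Perfect proxy:}\quad   u &= x - x' = (a + b\,x') - x' = a + (b-1)\,x' \\
\text{Imperfect proxy:}\quad u &= x - x' = (a + b\,x' + e) - x' = a + (b-1)\,x' + e \\
\mathrm{E}[u] &= a + (b-1)\,\mathrm{E}[x'] \\
\mathrm{Cov}(u, x') &= (b-1)\,\mathrm{Var}(x')
\end{align*}
```

Under these assumptions, the error is mean zero and uncorrelated with the truth only when a = 0 and b = 1. Any proxy with b different from 1 therefore carries nonclassical measurement error.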
Now we can return to the issue at hand: When should a researcher be concerned about measurement error? The answer depends on whether (i) the measurement error is in a feature or a target, (ii) the measurement error is classical or not and differential or not, and (iii) the objective is prediction or causal inference. In the case of proxy variables, the answer also depends on whether they are perfect or imperfect.
Not everything that can be counted counts, and not everything that counts can be counted.
– Albert Einstein
The Good News
If the objective is prediction, the measurement error is limited to one or more features, and the distribution of the measurement error is constant in the training and evaluation data, then the researcher may ignore the measurement error. This conclusion arises for two reasons. First, because the weights assigned to the features are not given any causal interpretation, the effect of the measurement error on the choice of weights is not of direct consequence. Second, if the distribution of the measurement error (which includes the relationship between proxies and the truth) is unchanged across the training and evaluation data sets, the information content of the observed features, x, remains the same. That said, the fit of any ML algorithm is likely to be poorer than it would be without measurement error because of the extra noise in the features.
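As a rough illustration of this case, the sketch below fits a linear predictor to a feature observed with classical error and evaluates it on fresh data with the same error distribution. The data-generating process (a true weight of 2.0, unit-variance noise) and the helper name simulate are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n, noise_sd):
    """Hypothetical data: the target depends on x', but only x = x' + u is observed."""
    x_true = rng.normal(size=n)
    y = 2.0 * x_true + rng.normal(size=n)
    x_obs = x_true + rng.normal(scale=noise_sd, size=n)
    return x_obs, y

# Same measurement error distribution in training and evaluation data.
x_train, y_train = simulate(50_000, noise_sd=1.0)
x_eval, y_eval = simulate(50_000, noise_sd=1.0)

# np.polyfit returns coefficients from highest degree down: slope, then intercept.
slope, intercept = np.polyfit(x_train, y_train, deg=1)
mse_train = np.mean((y_train - (intercept + slope * x_train)) ** 2)
mse_eval = np.mean((y_eval - (intercept + slope * x_eval)) ** 2)

print(f"fitted weight: {slope:.2f} (the weight on x' in the simulation is 2.0)")
print(f"training MSE: {mse_train:.2f}, evaluation MSE: {mse_eval:.2f}")
```

In this setup the fitted weight is well below 2.0, so it should not be read causally, yet the evaluation error matches the training error because the relationship between x and the target is the same in both samples.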
If the measurement error is limited to the target only and the measurement error is classical/nondifferential, then the researcher may ignore the measurement error regardless of whether the objective is prediction or causal inference. Because the measurement error simply adds extra random noise to the target, any estimation algorithm that is valid in the absence of measurement error remains valid. The only consequence is extra noise in the predictions or causal estimates.
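A minimal sketch of this result, under assumed values (a true weight of 1.5 and an error standard deviation of 2.0, both invented for the example): the same regression is run once on the true target and once on a target contaminated with classical error that is independent of the feature.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

x = rng.normal(size=n)
y_true = 1.5 * x + rng.normal(size=n)
# Classical, nondifferential error: mean zero, independent of x and of y'.
y_obs = y_true + rng.normal(scale=2.0, size=n)

slope_true, intercept_true = np.polyfit(x, y_true, deg=1)
slope_obs, intercept_obs = np.polyfit(x, y_obs, deg=1)

resid_var_true = np.var(y_true - (intercept_true + slope_true * x))
resid_var_obs = np.var(y_obs - (intercept_obs + slope_obs * x))

print(f"weight using y': {slope_true:.3f}, weight using y: {slope_obs:.3f}")
print(f"residual variance using y': {resid_var_true:.2f}, using y: {resid_var_obs:.2f}")
```

Both regressions recover roughly the same weight; the mismeasured target only inflates the residual variance.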
The Bad News
If the objective is prediction, the measurement error is limited to one or more features, but the distribution of the measurement error (which includes the relationship between proxies and the truth) changes across the training and evaluation data sets, then the researcher should be concerned. The changing distribution of the measurement error implies that the association between the observed features, x, and the target, y', differs across the two data sets. As such, the optimal prediction model in the training data set need not be optimal in the evaluation data set.
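The following sketch illustrates the problem with simulated data. All numbers (the true weight of 2.0 and the error standard deviations of 1.0 and 2.0) and the helper names simulate and mse are assumptions for the example. A linear predictor is trained with one error distribution and then evaluated on data where the error in the feature is noisier.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(n, noise_sd):
    """Hypothetical data: the target depends on x', but only x = x' + u is observed."""
    x_true = rng.normal(size=n)
    y = 2.0 * x_true + rng.normal(size=n)
    x_obs = x_true + rng.normal(scale=noise_sd, size=n)
    return x_obs, y

def mse(x, y, slope, intercept):
    return np.mean((y - (intercept + slope * x)) ** 2)

x_train, y_train = simulate(50_000, noise_sd=1.0)
x_same, y_same = simulate(50_000, noise_sd=1.0)    # same error distribution
x_shift, y_shift = simulate(50_000, noise_sd=2.0)  # noisier measurement at evaluation

slope, intercept = np.polyfit(x_train, y_train, deg=1)
mse_same = mse(x_same, y_same, slope, intercept)
mse_shift = mse(x_shift, y_shift, slope, intercept)

# Benchmark: the best linear fit for the shifted data itself.
slope_refit, intercept_refit = np.polyfit(x_shift, y_shift, deg=1)
mse_refit = mse(x_shift, y_shift, slope_refit, intercept_refit)

print(f"evaluation MSE, same error distribution:    {mse_same:.2f}")
print(f"evaluation MSE, shifted error distribution: {mse_shift:.2f}")
print(f"MSE of a model refit on the shifted data:   {mse_refit:.2f}")
```

The model trained under the old error distribution does noticeably worse on the shifted data than a model fit to that data directly, which is the sense in which the training-optimal model is no longer optimal.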
If the objective continues to be prediction, but the measurement error is in the target and is differential and/or nonclassical, then the researcher should be concerned. This applies regardless of whether or not the distribution of the measurement error is constant across the training and evaluation data sets. First, consider a (binary) classification target, y', that is unobserved. Instead, y is observed. Because y' is binary, the measurement error must be nonclassical/differential; specifically, u must be negatively correlated with y' (see the 2000 article by Dan Black, Mark Berger, and Frank Scott). As a result, the model can only be trained to predict the observed y, not y'. For example, if the objective is to predict deforestation using remotely sensed satellite data, but the data set contains erroneous information due to instances of deforestation not captured in satellite imagery, then the model will be trained to predict observed deforestation, not actual deforestation. Second, consider a continuous target that is unobserved; the observed target suffers from one-sided measurement error. Again, the model can only be trained to predict the observed y (see my 2021 article with Chris Parmeter). For example, if now the objective is to predict forest coverage, but the data mistakenly classify non-forest objects as forests, then forest coverage will be consistently overstated. Again, the model will be trained to predict observed forest coverage, not actual forest coverage.
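The binary case can be sketched with simulated data loosely modeled on the deforestation example. Everything here is hypothetical: the single binary feature, the true event probabilities of 0.2 and 0.5, and the assumption that 30% of true events go undetected. With one binary feature, the best a model can do is learn the event frequency within each feature group, so the group frequencies stand in for a trained classifier.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

x = rng.integers(0, 2, size=n)        # hypothetical binary feature
p_true = np.where(x == 1, 0.5, 0.2)   # assumed true event probabilities
y_true = (rng.random(n) < p_true).astype(int)

# Assumed detection failure: 30% of true events are recorded as non-events.
missed = rng.random(n) < 0.3
y_obs = np.where(missed, 0, y_true)
u = y_obs - y_true                    # only 0 or -1: one-sided error

rate_true = [y_true[x == k].mean() for k in (0, 1)]
rate_obs = [y_obs[x == k].mean() for k in (0, 1)]
corr_u_y = np.corrcoef(u, y_true)[0, 1]

print(f"corr(u, y'): {corr_u_y:.3f}")
print(f"true event rate by group:     {rate_true[0]:.3f}, {rate_true[1]:.3f}")
print(f"observed event rate by group: {rate_obs[0]:.3f}, {rate_obs[1]:.3f}")
```

The error is negatively correlated with the true target, and the frequencies a model would learn from y understate the true event rates in both groups.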
If the objective continues to be prediction, but the measurement error arises due to the use of a perfect or imperfect proxy for the target, then the researcher should be concerned. This applies regardless of whether or not the relationship between the proxy and the truth is constant across the training and evaluation data sets. Even if the relationship is constant across data sets, the model is trained to predict the observed proxy, not the actual target.
If the objective is causal inference, then any type of measurement error in the features or the target, or reliance on proxies, is likely to cause difficulties as the measurement error is likely to invalidate the causal identification strategy being employed by the researcher. There are (at least) three exceptions to this claim. First, if the researcher is interested solely in the causal effect of one feature, then measurement error in other features may not be of consequence, depending on the estimation method being used and the correlation structure between the mismeasured features and the feature of interest. Second, classical measurement error in the feature of interest may not be of consequence (aside from a loss in precision) if one is using Instrumental Variables estimation and the measurement error is unrelated to the instruments. Finally, classical and nondifferential measurement error in the target is likely inconsequential (aside from a loss in precision), provided the assumptions necessary for the identification of causal effects continue to be met.
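The second exception can be illustrated with a simulated example. The setup is deliberately simple and entirely assumed: a true causal effect of 2.0, a single instrument z, no confounding, and classical error in the feature that is independent of z. The simple ratio-of-covariances (Wald) form of the Instrumental Variables estimator is used here for brevity; it is one way to implement IV with a single instrument, not the only one.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

z = rng.normal(size=n)                      # hypothetical instrument
x_true = 0.8 * z + rng.normal(size=n)       # true feature of interest
y = 2.0 * x_true + rng.normal(size=n)       # assumed causal effect of 2.0
x_obs = x_true + rng.normal(size=n)         # classical error, independent of z

# Least squares on the mismeasured feature is attenuated toward zero.
beta_ols = np.cov(x_obs, y)[0, 1] / np.var(x_obs, ddof=1)

# IV with one instrument: cov(z, y) / cov(z, x).
beta_iv = np.cov(z, y)[0, 1] / np.cov(z, x_obs)[0, 1]

print(f"least squares estimate: {beta_ols:.3f}")
print(f"IV estimate:            {beta_iv:.3f}")
```

In this simulation the least squares estimate falls well short of 2.0, while the IV estimate is close to it, since the instrument is unrelated to the measurement error.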
Some More Good News
Depending on the objective of the researcher and the nature of the measurement error, solutions may be available. However, to my knowledge, these have yet to make their way into the data science universe. For example, the article by myself and Chris Parmeter mentioned above considers estimation methods when the target suffers from one-sided measurement error. Another recent paper by myself and Jennifer Alix-Garcia that is not yet published considers estimation methods when the binary classification target is mismeasured. Our methods build on those proposed in a 1998 article by Jerry Hausman, Jason Abrevaya, and Fiona Scott Morton. A 2019 article by Pierre Nguimkeu, Augustine Denteh, and Rusty Tchernis considers estimation of the causal effect of a binary feature suffering from nonclassical measurement error.
Data science cannot exist without data; it's in the name. However, data scientists to date have seemed too preoccupied with the science and have overlooked the data aspect. Unfortunately, data are often unsatisfactory. Sometimes these shortcomings are simply an annoyance, adding extra noise and leading to a loss in precision. But, with large sample sizes, this may be irrelevant. Other times, though, the consequences can be severe and do not disappear as the sample size grows. In these cases, some tools exist. More need to be developed. Data scientists would be wise to put the data first, as the name suggests, and give this issue the attention it deserves.