# Missing Data (Methods Discussion Group 11/26/2016)

Handling missing data could be an entire course in itself, but Gabrielle Simoneau distilled the key tenets into one hour on Friday. In the context of mice DNA data, she first reminded us of the missing data assumptions. We then discussed single and multiple imputation and inverse probability of censoring weighting, and finally touched on a complex case study that made all the methods seem inadequate (and still finished on time!).

The form of missing data dictates the methods that are available to address it. The best-case scenario is ‘missing completely at random’. Here the missingness is not associated with the exposure or outcome, and cases with such missing data can be dropped without biasing the estimate. Large amounts of missing data might still call for imputation, however, because dropping cases reduces your power. The next-best scenario is ‘missing at random’. These data are required to identify the effect, but can be predicted from other observed variables. The doomsday scenario is ‘missing not at random’. Here, the data are associated with the exposure and outcome, but are unavailable and cannot be predicted from the observed dataset. Resorting to population or literature-based values could be an option, but the methods below cannot be used as described.
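To make the three scenarios concrete, here is a small simulation sketch. Everything in it is an illustrative assumption rather than anything from the talk: a fully observed covariate `z`, a variable `x` with true mean 0 that will have gaps, and made-up probabilities of being observed (0.7, 0.9, 0.5). It compares the mean of the observed `x` values under each mechanism with the mean of the full data.

```python
import random
import statistics


def simulate(n=20000, seed=0):
    """Compare complete-case means of x under three missingness mechanisms."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]    # covariate, always observed
    x = [zi + rng.gauss(0, 1) for zi in z]     # variable with gaps, true mean 0

    def observed_mean(keep):
        # Mean of x over the cases that were not set to missing.
        return statistics.fmean(xi for xi, k in zip(x, keep) if k)

    # MCAR: every case has the same chance of being observed.
    mcar = [rng.random() < 0.7 for _ in range(n)]
    # MAR: the chance of being observed depends only on the observed z.
    mar = [rng.random() < (0.9 if zi < 0 else 0.5) for zi in z]
    # MNAR: the chance of being observed depends on the value of x itself.
    mnar = [rng.random() < (0.9 if xi < 0 else 0.5) for xi in x]

    return {
        "full": statistics.fmean(x),
        "mcar": observed_mean(mcar),
        "mar": observed_mean(mar),
        "mnar": observed_mean(mnar),
    }


results = simulate()
for name, value in results.items():
    print(f"{name:>5}: {value:+.3f}")
```

In this setup the MCAR mean should sit close to the full-data mean, while the MAR and MNAR means are pulled downward. The difference between the last two is that the MAR bias can be corrected using `z`, which we observe, whereas the MNAR bias depends on values we never see.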

Other than ignoring the problem, single imputation is the easiest way to handle data that are missing (completely) at random. We choose or predict a value, substitute it in, then estimate our effect of interest. For example, we can impute the mean value of variable X for all the missing X values. However, like all things in life, there is no free lunch. There are two major problems here: the mean value might not be a good guess at each case’s X value, and we don’t account for the added uncertainty created by our ‘invented’ values. Hence, multiple imputation.
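The second problem is easy to see in a toy example. The sketch below uses invented data (a variable with mean 50 and standard deviation 10, with roughly 30% of values removed completely at random) and shows that filling every gap with the observed mean leaves the mean unchanged but shrinks the spread, so the data look more certain than they are.

```python
import random
import statistics

rng = random.Random(1)

# Invented example data: 2000 values of X, mean 50, SD 10.
x_true = [rng.gauss(50, 10) for _ in range(2000)]
# Remove about 30% completely at random (None marks a missing value).
x_obs = [xi if rng.random() < 0.7 else None for xi in x_true]

# Single (mean) imputation: every gap gets the mean of the observed values.
observed = [xi for xi in x_obs if xi is not None]
mean_obs = statistics.fmean(observed)
x_imputed = [mean_obs if xi is None else xi for xi in x_obs]

sd_observed = statistics.stdev(observed)
sd_imputed = statistics.stdev(x_imputed)

print(f"mean of observed values: {mean_obs:.2f}")
print(f"mean after imputation:   {statistics.fmean(x_imputed):.2f}")
print(f"SD of observed values:   {sd_observed:.2f}")
print(f"SD after imputation:     {sd_imputed:.2f}")  # smaller than the observed SD
```

Any standard error computed from `x_imputed` will be too small, because the imputed values add sample size without adding any variation.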

Multiple imputation is a more complex, but more valid, way of handling data that are missing (completely) at random. Instead of imputing a single value for each missing value, we generate a series of datasets that each impute slightly different values of X based on the observed values in the original dataset. We then estimate the effect of interest in each of these. The final estimate is the average of the effects estimated in each imputed dataset, and its variance combines the variance within each imputed dataset with the variance between them, which accounts for the uncertainty of our imputed values. Making the method even more useful, we can impute several variables with missing values at the same time. Multiple Imputation by Chained Equations (MICE) packages in statistical software implement this.
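As a rough illustration of the mechanics, here is a minimal sketch of multiple imputation for a single variable, pooled with Rubin's rules. It is not how a MICE package works internally: it assumes one variable `x` with gaps, one fully observed predictor `z`, a linear imputation model, and a bootstrap of the complete cases to let the imputation model vary between datasets. The function names and the simulated data (true mean of `x` equal to 2) are hypothetical.

```python
import random
import statistics


def fit_line(z, x):
    """Least-squares intercept, slope, and residual SD for x ~ z."""
    zbar, xbar = statistics.fmean(z), statistics.fmean(x)
    szz = sum((zi - zbar) ** 2 for zi in z)
    slope = sum((zi - zbar) * (xi - xbar) for zi, xi in zip(z, x)) / szz
    intercept = xbar - slope * zbar
    resid = [xi - (intercept + slope * zi) for zi, xi in zip(z, x)]
    sd = (sum(r * r for r in resid) / (len(z) - 2)) ** 0.5
    return intercept, slope, sd


def multiply_impute_mean(z, x, m=20, seed=0):
    """Estimate the mean of x from m imputed datasets, pooled by Rubin's rules."""
    rng = random.Random(seed)
    complete = [(zi, xi) for zi, xi in zip(z, x) if xi is not None]
    estimates, variances = [], []
    for _ in range(m):
        # Refit the imputation model on a bootstrap sample of complete cases
        # so that its parameters differ from one imputed dataset to the next.
        boot = rng.choices(complete, k=len(complete))
        a, b, sd = fit_line([p[0] for p in boot], [p[1] for p in boot])
        # Fill each gap with a prediction plus random residual noise.
        filled = [
            xi if xi is not None else a + b * zi + rng.gauss(0, sd)
            for zi, xi in zip(z, x)
        ]
        estimates.append(statistics.fmean(filled))
        variances.append(statistics.variance(filled) / len(filled))
    pooled = statistics.fmean(estimates)       # average of the m estimates
    within = statistics.fmean(variances)       # average within-dataset variance
    between = statistics.variance(estimates)   # variance between the m estimates
    total = within + (1 + 1 / m) * between     # Rubin's total variance
    return pooled, within, between, total


# Simulated data: x is more likely to be missing when z is large (MAR).
rng = random.Random(42)
z = [rng.gauss(0, 1) for _ in range(2000)]
x_full = [2 + zi + rng.gauss(0, 1) for zi in z]
x = [
    xi if rng.random() < (0.9 if zi < 0 else 0.5) else None
    for zi, xi in zip(z, x_full)
]

pooled, within, between, total = multiply_impute_mean(z, x)
print(f"pooled mean: {pooled:.3f}, total variance: {total:.5f}")
```

The between-imputation term is what single imputation leaves out: it makes the total variance larger than the variance computed within any one filled-in dataset.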

Despite the promise of multiple imputation, it is a little sketchy to start imputing things like your actual outcome or exposure variable, since these define the effect of interest. Enter inverse probability of censoring weighting, which avoids imputation altogether and reconstructs a complete dataset using weights. The variables in the dataset are used to predict each case's probability of being observed (that is, of not being censored or missing). The inverses of these probabilities are then used as weights in the final analysis, so that each observed case also stands in for similar cases that are missing. Unlike multiple imputation, this method does not work well when several variables have missing values, because we might not have enough information to generate sensible predicted probabilities. So, it is best used when the missing values are concentrated in one variable and imputation is undesirable.
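A minimal sketch of the weighting idea follows, under simplifying assumptions that are mine rather than the speaker's: a single binary covariate `group` that fully explains who is missing, and an outcome `y` whose mean we want. With one categorical predictor, the probability of being observed can be estimated as the observed fraction within each group; in practice these probabilities usually come from a model such as logistic regression. The function name and the simulated values (true mean of 15) are hypothetical.

```python
import random
import statistics
from collections import defaultdict


def ipw_mean(groups, y):
    """Mean of y over observed cases, weighted by 1 / P(observed | group)."""
    n_total, n_seen = defaultdict(int), defaultdict(int)
    for g, yi in zip(groups, y):
        n_total[g] += 1
        n_seen[g] += yi is not None
    # Estimated probability of being observed within each group.
    p_obs = {g: n_seen[g] / n_total[g] for g in n_total}
    # Each observed case is weighted by the inverse of that probability.
    num = sum(yi / p_obs[g] for g, yi in zip(groups, y) if yi is not None)
    den = sum(1 / p_obs[g] for g, yi in zip(groups, y) if yi is not None)
    return num / den


# Simulated data: group 1 has higher outcomes and is censored far more often.
rng = random.Random(7)
groups = [int(rng.random() < 0.5) for _ in range(4000)]
y_full = [rng.gauss(20 if g else 10, 2) for g in groups]
y = [
    yi if rng.random() < (0.4 if g else 0.9) else None
    for g, yi in zip(groups, y_full)
]

complete_case = statistics.fmean(yi for yi in y if yi is not None)
weighted = ipw_mean(groups, y)
print(f"complete-case mean: {complete_case:.2f}")  # biased toward group 0
print(f"weighted mean:      {weighted:.2f}")       # close to the true mean of 15
```

Observed cases from the heavily censored group receive larger weights, which restores the balance between groups that the full dataset would have had.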

Ready for the complex case study where all the methods above are inadequate? In a poorly designed trial, patients were randomized to one of 3+ starting treatments, then re-randomized up to three times to a choice of 3+ drugs depending on the success of each treatment. The final dataset had missing follow-up values in each randomization cycle and treatment trajectory. This is complex because the missing values depend on other variables in the same randomization round, but also on individual patients’ previous values. There were also very few patients in each treatment trajectory since there were so many possible courses, limiting the amount of information available to predict anything. In the end, some combination of all the above methods was used. But the lesson: all the methods in the world cannot save you from data that are just bad to begin with.

Resources from Gabrielle:

Resource for more complex situations:

Inverse probability of censoring weighting for missing data