Scott Oatley (s2265605@ed.ac.uk)
Missing data is a threat to the accurate reporting of substantive results in data analysis. While strategies for handling missing data are widely available, many studies fail to account for missingness in their analyses. Those that do sometimes rely on less-than-gold-standard approaches that leave their substantive interpretation open to methodological and statistical critique. The gold standard methods, multiple imputation (MI) and full information maximum likelihood (FIML), are rarely compared with one another. Largely because of discipline-based preference, there has been little work assessing whether one method or the other produces less biased estimates. This paper seeks, first, to establish the real impact of implementing missing data handling methods on statistical analysis and, second, to provide a direct comparison between multiple imputation and full information maximum likelihood. A simulation of a variety of missing data handling strategies is performed using a set seed for reproducibility. The simulation is repeated 1000 times, and the 95% confidence intervals of the betas and standard errors of each model are presented to demonstrate the standard each method provides the researcher. The results confirm that under a missing at random assumption, methods such as listwise deletion are not at all appropriate for handling missing data. Naive methods such as single-value imputation are also analysed and found to be equally unattractive. Finally, the results show that multiple imputation appears to reproduce the original 'non-missing' models more accurately than full information maximum likelihood, though the difference is not large enough to rule FIML out as a viable 'gold standard'. A discussion of statistical and time-based efficiency is also provided.
MCAR, MAR, MNAR, Multiple Imputation, FIML
Providing a literate workflow for any scientific investigation should be a baseline requisite of empirical research. Publishing a Jupyter Notebook allows individuals not only to reproduce the empirical work held within; it also provides a robust justification for each step in the research process. This promotes open-science practices and encourages others to do the same.
Encouraging others to openly examine one's work invites critique. These critiques are welcome. The hope of using Notebooks is to foster a more organic and engaging research process whereby the research does not simply end once a paper is published. Critical comments can and will be incorporated into this Notebook to further research practices.
By providing a literate workflow, research, theory, justification, and 'footnote' analysis are all recorded in one place. This Notebook therefore invites a widespread audience, ranging from other academics to interested stakeholders. Whilst the language used in this Notebook is intended for an academic audience, the workflow presented should be accessible to anyone who reads the Notebook and takes a methodical, step-by-step approach to its application.
As Jupyter is a language-agnostic program, the choice of language used for analysis is left to the individual researcher. For this Notebook, Stata is employed for all statistical analysis. Stata is proprietary software, and researchers MUST have access to Stata in order to undertake the data analyses within the Jupyter Notebook. A primary goal of extending this analysis is to undertake it in non-proprietary software.
Using Stata requires enabling the 'stata kernel' in Jupyter. The instructions for this have been outlined in meticulous detail in Connelly and Gayle (2019). For the sake of a consistent introduction to Jupyter Notebooks, and to avoid needless confusion at the beginning of such Notebooks, their original instructions are pasted below. I thank them for their work in pioneering best practices in the use of Notebooks for social scientific analysis:
The approach for this Notebook uses Stata via magic cells. This facility can be downloaded and installed from this github repository.
At the command prompt you need to type:
pip install ipystata
In a code cell before using Stata you must type:
import ipystata
and then run the cell.
Each cell will now be a Stata code cell as long as you start your syntax with:
%%stata
For example to get a summary of the variables in Stata the cell should include the following code:
%%stata
summarize
Further information on using Stata via magic is available here.
Missing data pervades social scientific research. Without appropriately accounting for missing data in research agendas, this missingness can have stark consequences.
Missing data is a term that describes an instance whereby either an item (variable) or unit (person/entire observation) is missing from the sample. Missing data is common within social science research and even more ubiquitous within a longitudinal context, where wave attrition and non-response are the rule rather than the exception. The primary concern surrounding missing data is that, depending on the type of missingness, there is potential to affect inferences made by the analysis of longitudinal studies (Hawkes and Plewis, 2006: 479; Silverwood et al., 2021). The various factors that account for sample attrition have the potential to present real issues for comprehensive data analysis.
Missingness attributed to death or emigration is considered 'natural' attrition from the original sample. Those who either refuse continued survey participation or complete surveys only partially, however, may present problems of biased estimates if the missingness is not appropriately addressed. Put simply, if there is an identifiable pattern within the missing data in a given analysis, it has the potential to influence results and alter the substantive conclusions that would have been reached had this missingness not occurred – or been appropriately accounted for.
When a unit (case/observation) provides information on every possible item (variable) within a survey, this is a complete record. When information on certain items within a survey is missing, this is called an incomplete record. When only part of all possible items is missing from the unit, this is called item non-response. When all possible items are missing, this is called unit non-response. Unit non-response is a natural consequence of a longitudinal survey design. The statistical techniques discussed henceforth focus on item non-response because it is far more common in practice than the alternative.
Investigating missingness within data, through tables and graphical summaries comparing the full data to partially observed variables, can begin to identify the mechanisms underlying the patterns of missingness. The terminology that defines these mechanisms is unfortunately rather obtuse and confusing. The following paragraphs attempt to explain and define the key missingness mechanisms used within the methodological literature on missing data, whilst attempting to provide a clear and concise understanding.
The first major missingness mechanism is called Missing Completely At Random (MCAR). Suppose that only one variable Y has missing data, and that another set of variables, represented by the vector X, is always observed (Marsden and Wright, 2010). The data is MCAR if the probability that Y is missing does not depend on either X or Y itself. Evaluating the assumption that missingness on Y depends on some observed variable in X is straightforward. Allison (2012) uses the example of income depending on gender by testing whether the proportions of men and women who report their income differ – a logistic regression in which the dependent variable is the response indicator could be estimated, and significant coefficients would suggest a violation of the MCAR mechanism (ibid). Testing whether missingness on Y depends on Y itself is much more complicated. Unless we have existing linked data, such as tax records in the income example, it is almost impossible to evaluate this assumption. The upside of an MCAR mechanism is that estimated coefficients will not be biased in the presence of data Missing Completely At Random; however, their estimation may be less precise (Kang, 2013).
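Allison's check can be sketched in Stata. The following is a hypothetical illustration – the variable names income and gender are not part of the simulation below – of regressing a response indicator on an observed covariate:

%%stata
* Hypothetical sketch: test whether missingness on income depends on gender
gen report = !missing(income)   // response indicator: 1 = income observed
logit report i.gender           // a significant coefficient suggests MCAR is violated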
The second missingness mechanism is missing at random (MAR). Data on Y is considered MAR if the probability that Y is missing does not depend on Y, once we control for X. MAR allows for missingness on Y to depend on other variables so long as it does not depend on Y itself.
Finally, missing not at random (MNAR) means missingness depends on unobserved values (Silverwood et al. 2021), and that the probability that Y is missing depends on Y itself, after adjusting for X (Marsden and Wright, 2010). For example, people who have been arrested may be less likely to report their arrest status.
If data is found to be MAR or MCAR, then approaches like multiple imputation (MI), Full Information Maximum likelihood (FIML), and inverse probability weighting (IPW) are made available – the former being extensively documented with the NCDS (Hawkes and Plewis 2006). These ‘gold standard’ approaches to handling missing data have also been found to produce optimal estimates in the MNAR case but it is difficult to have confidence that any given MNAR model is correct (Marsden and Wright, 2010).
The following section explores each major approach to handling missing data. These techniques and methods are bundled into 'standards' to express how appropriate each approach is in application. There are three main categories that handling missing data techniques fall into: 'gold standard', 'less than gold standard', and 'inadequate standard'. While most approaches fall into the 'less than gold standard' or 'inadequate' categories, that does not necessarily make them devoid of utility. Each method in each category is explained below.
The first inadequate method, given that a MAR mechanism is present, is listwise deletion. Listwise deletion removes from the data all observations with a missing value in one or more of the variables included in the analysis. This is also known as Complete Records Analysis (CRA). The CRA approach is unpredictable: there is no way to know the consequences of this loss of information if data is found to be MAR (Carpenter and Kenward, 2012). When data is MAR, a CRA approach is inadequate for handling missing data.
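In Stata, listwise deletion is the default behaviour of estimation commands, so a CRA requires no special syntax. A hypothetical sketch (variable names illustrative):

%%stata
* regress silently drops any observation missing y, x1, x2, or x3,
* so the estimation sample is the set of complete records
regress y x1 x2 x3
count if e(sample)   // number of complete records actually analysed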
Depending on the variable (either metric or categorical), a simple approach to handling missing data is single mean or single modal imputation. In the example of a categorical variable, this takes the modal value of the variable and imputes it across all missing values in the data. Single imputation ignores all uncertainty and always underestimates the variance of the model. Advocates of this approach argue that, whilst imperfect, it deletes no cases and incorporates all available information into a given model. However, the method provides no measure of confidence in its results. The estimates may fall close to the true values; the exact opposite is equally likely. Single imputation has been consistently and conclusively shown to perform poorly except under exceptionally special conditions (Collins, Schafer and Kam, 2001; Little and Rubin, 2019). For these reasons, single imputation is an inadequate method of handling missing data.
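A single modal imputation can be sketched in Stata as follows (a hypothetical illustration; x is any categorical variable with missing values):

%%stata
* Hypothetical sketch of single modal imputation
egen x_mode = mode(x), minmode    // modal value of x (minmode breaks ties)
replace x = x_mode if missing(x)  // impute the single mode into every missing cell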
Dummy variable adjustment is another method of handling missing data that falls into the 'less than gold standard' category. Dummy variable adjustment may appear to belong in the same category as single imputation; this is, however, not the case. Under dummy variable adjustment, all missingness on the given variable is coded to a value within the model. In the example of a binary dummy variable, all missingness is coded to equal either zero or one. This has the same appeal as single imputation: no cases are deleted and all information is incorporated into the regression model. However, there is a substantive difference between the two techniques. For a simple model with data missing on the Y variable, dummy variable adjustment will not provide the 'true' estimates, but if the complete records analysis is compared to one model where all missingness equals zero and another where all missingness equals one, the range of the estimates can be located. Whilst Jones (1996) demonstrated that dummy variable adjustment yields biased parameter estimates even when data is MCAR, the ability to bound the estimates does lend the technique some utility. Given a MAR example where the reported estimates are reduced from their 'true' values, if the complete case analysis and both dummy variable adjustment models present a beta coefficient that is positive throughout, one can present those results much as log odds are interpreted: the results would present evidence for a positive coefficient, and though the exact size is unknown, some information can be gathered. For this reason, dummy variable adjustment provides some utility in certain missing data scenarios. It has the most utility where missingness is so great that it begins to stretch the abilities of even the gold standard techniques.
This method for handling missing data is not perfect, but it does provide utility and allows the use of data that has large amounts of missingness.
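The bounding logic described above can be sketched as two regressions (hypothetical variable names; x is a binary variable with missingness):

%%stata
* Hypothetical sketch of dummy variable adjustment for a binary x
gen x_low  = cond(missing(x), 0, x)   // all missingness coded 0
gen x_high = cond(missing(x), 1, x)   // all missingness coded 1
regress y x_low  z    // one bound of the estimate
regress y x_high z    // the other bound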
Another method that deals with missing data is the use of survey weights that account for missingness. Inverse Probability Weighting (IPW) creates weighted copies of complete records to remove the selection bias introduced by missing data. Whilst IPW is a method of dealing with missing data, alternatives such as multiple imputation are regarded as much more efficient, as IPW analyses only the complete cases and partially observed cases are discarded in the weighted analysis. Because of this, weighted estimates can have unacceptably high variance (Seaman et al., 2012; Seaman and White, 2013; Little, Carpenter and Lee, 2022). As such, IPW is placed in the 'less than gold standard' category of approaches to handling missing data.
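An IPW analysis can be sketched as follows (hypothetical names; z1 and z2 are fully observed covariates assumed to predict response):

%%stata
* Hypothetical IPW sketch: weight complete cases by the inverse of their
* estimated probability of being complete
gen complete = !missing(x)
logit complete z1 z2
predict p_complete, pr
gen ipw = 1/p_complete
regress y x z1 z2 [pweight=ipw] if complete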
There are two 'gold standard' approaches to handling missing data: Multiple Imputation (MI) and Maximum Likelihood (ML). Taking the latter first, there are currently three ML estimation algorithms for use when missing data is present under an MCAR or MAR mechanism. The first is the multiple-group method, whereby the sample is divided into subgroups that each share the same pattern of missing data. A likelihood function is computed for each subgroup, and the groupwise likelihood functions are accumulated across the entire sample and maximised. There are some practical issues in implementing this multiple-group ML approach (Enders, 2001); its major drawback, however, is that it is a group-level, rather than individual-level, ML estimation. Another ML estimation is expectation-maximisation (EM). This uses a two-step iterative procedure in which missing observations are filled in, or imputed, and the unknown parameters are estimated using maximum likelihood missing data algorithms. The EM approach can only be used to obtain ML estimates of a mean vector and covariance matrix; as a result, standard errors will be negatively biased and bootstrapping is recommended (Enders, 2001). The final ML approach discussed here is Full Information Maximum Likelihood (FIML) estimation, also called raw maximum likelihood estimation because its likelihood function is calculated at the individual level. It is also exceptionally easy to implement compared to the other estimation procedures discussed (Enders, 2001). For these reasons, ML discussions of handling missing data henceforth refer specifically to the FIML approach rather than the multiple-group or EM approaches.
Multiple Imputation is the second of the 'gold standard' handling missing data methods. Multiple imputation generates replacement values, or imputations, for the missing data values and repeats this procedure over many iterations to produce a 'semi-Bayesian' framework for the most appropriate fit of estimates. For multiple imputation models to be compared to a complete records analysis, the former needs to be 'congenial' (White, Royston and Wood, 2011) with the latter. Congeniality, or consistency, in this respect means that the variables in the complete records analysis are identical to those included in the multiple imputation. If the variables in the complete records analysis and the multiple imputation model differ, the correct variance/covariance matrix will not be estimated, and a substantive comparison between the two becomes impossible and impracticable due to a loss of statistical power (Von Hippel, 2009; Lynch and Von Hippel, 2013).
Multivariate imputation by chained equations (MICE) is a form of multiple imputation that fills in or imputes missing data within a given dataset through iterative predictive models or k imputations. This specification is required when imputing a variable that must only take on specific values, such as the categorical nature of the economic activity response variable within the current analytical model. Using MICE, each imputation k is drawn from the posterior distribution of the parameters in the given imputation model, and then the model itself is imputed (Carpenter and Kenward, 2012). To create the kth imputation, new parameters are drawn from the posterior distribution. Multiple Imputation following MICE draws from Bayesian influences on the distribution of missing data upon observed data. An essential advantage of Multiple Imputation is that it can be applied for data missing at the response variable or its covariates (Carpenter and Kenward, 2012).
Multiple imputation uses auxiliary variables: variables not included in the main model but used when setting up the data to be imputed. The auxiliary variables' main function is to improve the predictive ability of the imputation model over and above the information provided by the analytical variables in the model (Collins, Schafer and Kam, 2001). Auxiliary variables are essential when there are high levels of missingness on a given variable (Johnson and Young, 2011; Young and Johnson, 2011). There is no strict threshold for when an auxiliary variable should be included in the imputation; however, some have recommended r > 0.4 with at least one of the analytical variables in the model (Allison, 2012a), though this is disputed (Enders, 2010). Others, such as Silverwood et al. (2021), argue that an auxiliary variable being predictive of the outcome variable makes it suitable for inclusion in the imputation model. An auxiliary variable does not have to have complete information to be valuable; auxiliary variables can still be influential when they themselves have missingness (Enders, 2010).
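In Stata, a MICE model with auxiliary variables can be sketched with the mi suite (hypothetical variable names and imputation count; the simulation below specifies its own models):

%%stata
* Hypothetical MICE sketch: impute x3 from the analytical variables and
* two auxiliary variables, then fit the analysis model across imputations
mi set wide
mi register imputed x3
mi impute chained (regress) x3 = y x1 x2 aux1 aux2, add(10) rseed(12345)
mi estimate: regress y x1 x2 x3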
Unlike FIML, multiple imputation can be implemented easily and readily across software platforms. Multiple imputation does, however, have some drawbacks. It can be a lengthy procedure with the potential to induce human error, owing to the need to select auxiliary variables, set the correct data for imputation, set the correct seed, and so on. There is also a time-efficiency argument: if the dataset is large, or there are large amounts of missingness, imputing the model of interest can take a long time. Nevertheless, MI is an attractive method because it is practical and widely applicable (Carpenter and Kenward, 2012).
Whilst the original literature on missing data and MI typically referred to large datasets with marginal levels of missingness, contemporary studies and simulations have increasingly stretched and stress-tested the limits of MI (Hardt et al., 2013). A simulation by Hardt et al. (2013) demonstrated that large amounts of missingness can be present within a model without breaking down MI or FIML mechanisms (ibid). Whilst their simulation stops at n=200, where 40 per cent missingness is acceptable, their general argument is that the greater the n, the larger the missingness can be without breaking MI or FIML, so long as the models themselves are appropriately specified. Imputation-based models are consistently found to outperform a CRA in both absolute bias and Root Mean Squared Error (RMSE) as levels of missingness increase (Hyuk Lee and Huber Jr., 2021). The most extreme case, from Madley-Dowd et al. (2019), demonstrates that so long as the imputation model is properly specified and data are MAR, unbiased results can be obtained even with up to 90 per cent missingness. An imputation model can achieve a 99.97 per cent reduction in bias compared to a CRA when missingness is at 90 per cent (ibid).
Paul Allison, in a series of articles (Allison, 2012a, 2012b, 2015), argues that 1) FIML is more straightforward to implement, 2) FIML has no incompatibility between an imputation model and an analysis model, 3) FIML produces a deterministic result rather than a different result every time, and 4) FIML is asymptotically efficient.
Firstly, MI does have greater variability than FIML, but that increased choice in model selection is not necessarily a negative so long as proper procedures are followed. In fact, greater variability of choice has the potential to make MI a more attractive candidate for dealing with missingness than FIML. Secondly, MI models only run into an incompatibility problem when the MI model is inconsistent with the CRA model – something that, with appropriate testing and open science practices detailing the model construction, should not happen. Thirdly, MI models are deterministic provided the same seed is used each time the imputation is run. The only time this would not hold is when open science practices were not followed and fellow researchers could not access the MI seed. Finally, the argument that FIML is asymptotically efficient only holds to a certain extent. MI models reach asymptotic efficiency as the number of imputations tends to infinity, and near-full efficiency can be reached with a relatively small number of imputations – around 10, Allison (2015) argues. Overall, whilst FIML does offer some advantages, there is nothing so considerable theoretically as to prefer FIML over MI on the condition that they both perform at near-identical rates. So long as open science procedures are upheld, most major critiques of MI are dealt with.
There are very few comparisons between FIML and MI approaches to missing data, making it hard to assess whether one method is more efficient at dealing with missing data than the other. Before conducting any missing data methods on the NCDS data, a simulation is performed to assess the strengths of a range of handling missing data approaches, with the intent to directly compare FIML and MI methods.
Both FIML and MI require data to be either MCAR or MAR. A FIML approach can be achieved in Stata through the 'sem' command – structural equation modelling – with the 'mlmv' estimation option (mlmv, maximum likelihood with missing values, implements FIML). MI can also be achieved in Stata using the 'mi' commands, taking a semi-Bayesian approach that includes auxiliary variables. Other handling missing data methods are also available, such as single mean imputation and coding all missing data to =0 or =1. These practices are typically considered 'bad' ways of handling missing data but are included in the simulation as a comparison to the FIML and MI methods.
The full simulation takes the form of 1000 iterations of a random normal draw of 1000 observations: a normally distributed metric dependent variable and three independent dummy variables that share an identical distribution. Each independent variable has the same level of correlation with the dependent variable. This allows a point of comparison when MAR missingness is injected into one variable and not the others, to see what happens when handling missing data practices are implemented. Each model is isolated in its own program, and the simulation calls each program with an identical seed set for all models. The 95 per cent confidence intervals of the mean betas and standard errors for all variables within each model are gathered and reported.
This dependent variable and three independent variables form a basic OLS linear regression model that is called the 'Complete Records God Model' – named as such because, in a normal social scientific setting, no model would have every observation present together with prior knowledge of what the 'complete' model would have looked like had some element of missingness occurred. In addition to this 'God' model, the same regression is computed using the structural equation modelling framework in Stata to confirm the results are identical. The next model is where missingness is introduced: missingness is injected into independent variable three, accounting for 49 per cent missingness in the model. This amount of missingness is right on the cusp of what the contemporary literature on multiple imputation and FIML allows. Two dummy variable adjustment models are produced, one with all missingness coded =0 and another with all missingness coded =1. Next, a single modal imputation is used – the same framework as a single mean imputation, but because the variable is categorical the mode is used rather than the mean. Finally, an FIML model under the SEM framework is produced alongside three Multiple Imputation models: an MI with 10 imputations and no auxiliary variables, an MI with 10 imputations and auxiliary variables, and an MI with 100 imputations and auxiliary variables.
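The MAR injection described here can be sketched as follows (a hypothetical illustration of the mechanism, not the exact program used below): the probability that x3 is missing is made to depend on the observed x1, not on x3 itself.

%%stata
* Hypothetical MAR sketch: missingness on x3 depends on observed x1
gen p_miss = invlogit(0.1*(x1 - 40))    // higher x1 -> higher chance of missingness
replace x3 = . if runiform() < p_miss   // roughly 50 per cent missing on average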
import stata_setup
stata_setup.config(r"C:\Program Files\Stata18", "se")
Stata/SE 18.0 (Standard Edition), Statistics and Data Science, Copyright 1985-2023 StataCorp LLC, College Station, Texas. Stata license: Unlimited-user network, expiring 14 Sep 2025. Licensed to: Scott Oatley, University of Edinburgh. Notes: 1. Unicode is supported; see help unicode_advice. 2. Maximum number of variables is set to 5,000 but can be increased; see help set_maxvar.
Prior to any simulation/analysis, some basic setup is required to save datasets and set folder paths.
%%stata
global workingdata "G:\Stata data and do\do files\missing data\workingdata"
cd "$workingdata"
The first program, called "regress1", simulates a dataset with 1000 observations. Four variables are generated to simulate a standard Ordinary Least Squares linear regression model with one dependent variable y and three independent variables x1-x3. All variables, for ease of interpretation, are generated as metric continuous variables. All variables are drawn from a normal distribution, with means and standard deviations set to different values to simulate the variety of independent variables typically used in a regression model. The dependent variable y is generated through a relationship to the independent variables to guarantee a significant relationship; its generation also adds a normally distributed error term.
This simulated dataset forms the base of all subsequent programs.
In the "regress1" program, following the generation of variables, an OLS regression is performed and the beta and standard error of each independent variable are stored as newly generated variables.
%%stata
capture program drop regress1
program regress1, rclass
drop _all
* Set number of observations
set obs 1000
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)
* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)
* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)
* Generate metric dependent variable
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
regress y x1 x2 x3
predict yhat
gen resid = y - yhat
gen sqr_resid = resid^2
summarize sqr_resid
gen RMSE1a = sqrt(r(sum) / 997)
gen beta1 = _b[x1]
gen se1 = _se[x1]
gen lower1 = beta1 - 1.96 * se1
gen upper1 = beta1 + 1.96 * se1
gen beta2 = _b[x2]
gen se2 = _se[x2]
gen lower2 = beta2 - 1.96 * se2
gen upper2 = beta2 + 1.96 * se2
gen beta3 = _b[x3]
gen se3 = _se[x3]
gen lower3 = beta3 - 1.96 * se3
gen upper3 = beta3 + 1.96 * se3
end
The second program, "sem1", is nearly identical to the first. The difference lies in the regression model: whereas "regress1" uses OLS linear regression, "sem1" uses structural equation modelling with maximum likelihood with missing values (mlmv) estimation. These 'base' models should provide identical statistics, but both are provided to assess any potential disparities between the two.
%%stata
capture program drop sem1
program sem1, rclass
drop _all
* Set number of observations
set obs 1000
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)
* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)
* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)
* Generate metric dependent variable
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
sem (y <- x1 x2 x3), method(mlmv)
predict yhat
gen resid = y - yhat
gen sqr_resid = resid^2
summarize sqr_resid
gen RMSE1b = sqrt(r(sum) / 997)
gen beta1 = _b[x1]
gen se1 = _se[x1]
gen lower1 = beta1 - 1.96 * se1
gen upper1 = beta1 + 1.96 * se1
gen beta2 = _b[x2]
gen se2 = _se[x2]
gen lower2 = beta2 - 1.96 * se2
gen upper2 = beta2 + 1.96 * se2
gen beta3 = _b[x3]
gen se3 = _se[x3]
gen lower3 = beta3 - 1.96 * se3
gen upper3 = beta3 + 1.96 * se3
end
The third program, "missingmcar", generates a Missing Completely At Random (MCAR) mechanism and injects it into the OLS linear regression model. The mechanism is generated by drawing a new rmcar variable from a binomial distribution in which 50 per cent of values are coded 1 and 50 per cent are coded 0; x2 is then set to missing wherever rmcar equals zero. Under MCAR, even 50 per cent missingness should not alter the interpretation of the model, only widen its standard errors.
%%stata
capture program drop missingmcar
program missingmcar, rclass
drop _all
* Set number of observations
set obs 1000
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)
* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)
* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)
* Generate metric dependent variable
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
* Generate MCAR
gen rmcar = rbinomial(1, 0.5) // MCAR: 50% chance of missingness (binary random)
replace x2 = . if rmcar == 0 // Set x to missing where rmcar == 0
regress y x1 x2 x3
predict yhat
gen resid = y - yhat
gen sqr_resid = resid^2
summarize sqr_resid
gen RMSE1c = sqrt(r(sum) / 485)
gen beta1 = _b[x1]
gen se1 = _se[x1]
gen lower1 = beta1 - 1.96 * se1
gen upper1 = beta1 + 1.96 * se1
gen beta2 = _b[x2]
gen se2 = _se[x2]
gen lower2 = beta2 - 1.96 * se2
gen upper2 = beta2 + 1.96 * se2
gen beta3 = _b[x3]
gen se3 = _se[x3]
gen lower3 = beta3 - 1.96 * se3
gen upper3 = beta3 + 1.96 * se3
end
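The MCAR claim can be illustrated with a short Python sketch (hypothetical, mirroring the simulated variables above): because missingness is independent of the data, the complete cases are a simple random subsample and complete-case estimates remain roughly unbiased:

```python
import numpy as np

rng = np.random.default_rng(1234)
n = 1000
x1 = rng.normal(40, 12, n)
x2 = rng.normal(200, 50, n)
x3 = rng.normal(150, 5, n)
y = 30*x1 + 40*x2 + 50*x3 + rng.normal(5000, 1500, n)

# MCAR: each observation of x2 is missing with probability 0.5,
# independently of every variable in the data
mcar = rng.random(n) < 0.5
keep = ~mcar  # complete cases only

X = np.column_stack([np.ones(keep.sum()), x1[keep], x2[keep], x3[keep]])
beta_cc = np.linalg.lstsq(X, y[keep], rcond=None)[0]
# Slopes stay close to the true (30, 40, 50); only the standard
# errors grow because roughly half the sample has been discarded
print(beta_cc)
```

This is exactly the behaviour the "missingmcar" simulation is designed to confirm.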
The fourth program, "missingmar", follows the same pattern as "missingmcar" but injects a Missing At Random (MAR) mechanism instead. The probability of missingness is generated from the logistic curve of y − 21791; the offset centres y near its expected value (approximately 21,700) so that around 50 per cent of cases become missing. A uniform draw is compared against this probability, and x2 is set to missing where the resulting indicator rmar equals one.
%%stata
capture program drop missingmar
program missingmar, rclass
drop _all
* Set number of observations
set obs 1000
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)
* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)
* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)
* Generate metric dependent variable
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
* Generate MAR
gen prob_mar = logistic(y-21791)
gen rmar = runiform() < prob_mar
replace x2 = . if rmar == 1
regress y x1 x2 x3
predict yhat
gen resid = y - yhat
gen sqr_resid = resid^2
summarize sqr_resid
gen RMSE1d = sqrt(r(sum) / 515)
gen beta1 = _b[x1]
gen se1 = _se[x1]
gen lower1 = beta1 - 1.96 * se1
gen upper1 = beta1 + 1.96 * se1
gen beta2 = _b[x2]
gen se2 = _se[x2]
gen lower2 = beta2 - 1.96 * se2
gen upper2 = beta2 + 1.96 * se2
gen beta3 = _b[x3]
gen se3 = _se[x3]
gen lower3 = beta3 - 1.96 * se3
gen upper3 = beta3 + 1.96 * se3
end
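The MAR mechanism can be sketched in Python (a hypothetical mirror of the Stata code, using the expected value 21700 as the centring constant and clipping the logistic argument for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(1234)
n = 1000
x1 = rng.normal(40, 12, n)
x2 = rng.normal(200, 50, n)
x3 = rng.normal(150, 5, n)
y = 30*x1 + 40*x2 + 50*x3 + rng.normal(5000, 1500, n)

# MAR: the probability that x2 is missing depends only on y, which
# is fully observed. Centring y near its expected value
# (30*40 + 40*200 + 50*150 + 5000 = 21700) gives ~50% missingness.
p_miss = 1.0 / (1.0 + np.exp(-np.clip(y - 21700, -50, 50)))
mar = rng.random(n) < p_miss
x2_obs = np.where(mar, np.nan, x2)

print(mar.mean())  # roughly 0.5
```

Because missingness in x2 is driven by y, complete-case analysis now selects on the outcome, which is what makes listwise deletion biased under this mechanism.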
The program "missingmean" implements the first handling missing data method: single mean imputation across the 50 per cent MAR missingness. The mean of the observed x2 values is computed and used to replace all missing values of x2.
%%stata
capture program drop missingmean
program missingmean, rclass
drop _all
* Set number of observations
set obs 1000
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)
* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)
* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)
* Generate metric dependent variable
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
* Generate MAR
gen prob_mar = logistic(y-21791)
gen rmar = runiform() < prob_mar
replace x2 = . if rmar == 1
summarize x2, meanonly
replace x2 = r(mean) if missing(x2)
regress y x1 x2 x3
predict yhat
gen resid = y - yhat
gen sqr_resid = resid^2
summarize sqr_resid
gen RMSE1e = sqrt(r(sum) / 997)
gen beta1 = _b[x1]
gen se1 = _se[x1]
gen lower1 = beta1 - 1.96 * se1
gen upper1 = beta1 + 1.96 * se1
gen beta2 = _b[x2]
gen se2 = _se[x2]
gen lower2 = beta2 - 1.96 * se2
gen upper2 = beta2 + 1.96 * se2
gen beta3 = _b[x3]
gen se3 = _se[x3]
gen lower3 = beta3 - 1.96 * se3
gen upper3 = beta3 + 1.96 * se3
end
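Why single mean imputation is unattractive can be seen in a short Python sketch (hypothetical, mirroring the variables above): filling every missing x2 with one constant shrinks the variance of the imputed column and distorts its relationship with y:

```python
import numpy as np

rng = np.random.default_rng(1234)
n = 1000
x1 = rng.normal(40, 12, n)
x2 = rng.normal(200, 50, n)
x3 = rng.normal(150, 5, n)
y = 30*x1 + 40*x2 + 50*x3 + rng.normal(5000, 1500, n)

# MAR missingness in x2 driven by y (~50% missing)
p_miss = 1.0 / (1.0 + np.exp(-np.clip(y - 21700, -50, 50)))
x2_m = np.where(rng.random(n) < p_miss, np.nan, x2)

# Single mean imputation: every missing x2 gets the observed mean
x2_imp = np.where(np.isnan(x2_m), np.nanmean(x2_m), x2_m)

X = np.column_stack([np.ones(n), x1, x2_imp, x3])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
# The imputed column has artificially reduced variance and a spike
# at the observed mean, which typically distorts the x2 slope and
# understates its standard error
print(beta[2], x2_imp.std(), x2.std())
```

The variance reduction is mechanical: half the column is a single repeated value, so any downstream model treats the imputed cases as carrying information they do not have.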
The program "semmlmv" is the 'gold standard' full information maximum likelihood (FIML) attempt to recover unbiased estimates following injection of the MAR mechanism. FIML is performed using the sem command with the method(mlmv) option.
%%stata
capture program drop semmlmv
program semmlmv, rclass
* drop all variables to create an empty dataset
drop _all
* Set number of observations
set obs 1000
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)
* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)
* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)
* Generate metric dependent variable
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
* Generate MAR
gen prob_mar = logistic(y-21791)
gen rmar = runiform() < prob_mar
replace x2 = . if rmar == 1
* Model estimation
sem (y <- x1 x2 x3), method(mlmv)
predict yhat
gen resid = y - yhat
gen sqr_resid = resid^2
summarize sqr_resid
gen RMSE1f = sqrt(r(sum) / 997)
gen beta1 = _b[x1]
gen se1 = _se[x1]
gen lower1 = beta1 - 1.96 * se1
gen upper1 = beta1 + 1.96 * se1
gen beta2 = _b[x2]
gen se2 = _se[x2]
gen lower2 = beta2 - 1.96 * se2
gen upper2 = beta2 + 1.96 * se2
gen beta3 = _b[x3]
gen se3 = _se[x3]
gen lower3 = beta3 - 1.96 * se3
gen upper3 = beta3 + 1.96 * se3
end
The program "minoaux1" performs multiple imputation with 10 imputations and a 10-iteration burn-in, using no auxiliary variables.
%%stata
capture program drop minoaux1
program minoaux1, rclass
* drop all variables to create an empty dataset
drop _all
* Set number of observations
set obs 1000
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)
* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)
* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)
* Generate metric dependent variable
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
* Generate MAR
gen prob_mar = logistic(y-21791)
gen rmar = runiform() < prob_mar
replace x2 = . if rmar == 1
* Model estimation
mi set wide
mi register imputed y x1 x2 x3
tab _mi_miss
mi impute chained ///
    (regress) y x1 x2 x3 ///
    , rseed(12345) dots force add(10) burnin(10)
mi estimate, post dots: regress y x1 x2 x3
predict yhat
gen resid = y - yhat
gen sqr_resid = resid^2
summarize sqr_resid
gen RMSE1g = sqrt(r(sum) / 997)
gen beta1 = _b[x1]
gen se1 = _se[x1]
gen lower1 = beta1 - 1.96 * se1
gen upper1 = beta1 + 1.96 * se1
gen beta2 = _b[x2]
gen se2 = _se[x2]
gen lower2 = beta2 - 1.96 * se2
gen upper2 = beta2 + 1.96 * se2
gen beta3 = _b[x3]
gen se3 = _se[x3]
gen lower3 = beta3 - 1.96 * se3
gen upper3 = beta3 + 1.96 * se3
end
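The core of what mi impute chained and mi estimate do can be sketched in Python. This is a deliberately simplified illustration (a single fixed imputation model, 'improper' in Rubin's sense because the imputation-model parameters are not redrawn between imputations), but it shows the two essential steps: draw each missing value from its predictive distribution, then pool the estimates and variances across imputations with Rubin's rules:

```python
import numpy as np

rng = np.random.default_rng(1234)
n, M = 1000, 10  # M = 10 imputations, mirroring add(10)
x1 = rng.normal(40, 12, n)
x2 = rng.normal(200, 50, n)
x3 = rng.normal(150, 5, n)
y = 30*x1 + 40*x2 + 50*x3 + rng.normal(5000, 1500, n)
p_miss = 1.0 / (1.0 + np.exp(-np.clip(y - 21700, -50, 50)))
miss = rng.random(n) < p_miss
obs = ~miss

# Imputation model: regress x2 on y, x1, x3 among observed cases
Z = np.column_stack([np.ones(n), y, x1, x3])
coef, res_ss, *_ = np.linalg.lstsq(Z[obs], x2[obs], rcond=None)
sigma = np.sqrt(res_ss[0] / (obs.sum() - Z.shape[1]))

betas, variances = [], []
for m in range(M):
    x2_m = x2.copy()
    # draw each missing value from its predictive distribution
    x2_m[miss] = Z[miss] @ coef + rng.normal(0, sigma, miss.sum())
    X = np.column_stack([np.ones(n), x1, x2_m, x3])
    b, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
    s2 = rss[0] / (n - X.shape[1])
    cov = s2 * np.linalg.inv(X.T @ X)
    betas.append(b[2])            # slope on x2
    variances.append(cov[2, 2])

# Rubin's rules: pooled point estimate and total variance
qbar = np.mean(betas)                 # pooled estimate
ubar = np.mean(variances)             # within-imputation variance
bvar = np.var(betas, ddof=1)          # between-imputation variance
total_var = ubar + (1 + 1/M) * bvar
print(qbar, np.sqrt(total_var))
```

The between-imputation component is what distinguishes MI from single imputation: the uncertainty about the missing values themselves is carried into the pooled standard error rather than discarded.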
The program "miaux1" performs multiple imputation with 10 imputations and a 10-iteration burn-in, using eight auxiliary variables. Five are generated to be predictive of both the probability of missingness and the missing values themselves; three are predictive of only the missing values.
%%stata
capture program drop miaux1
program miaux1, rclass
* drop all variables to create an empty dataset
drop _all
* Set number of observations
set obs 1000
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)
* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)
* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)
* Generate metric dependent variable
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
* Generate MAR
gen prob_mar = logistic(y-21791)
gen rmar = runiform() < prob_mar
replace x2 = . if rmar == 1
* Generate Auxiliary Variables Predictive of Probability of Missingness and Missing Values
gen aux1 = 0.7*x2 + 0.3*y + rnormal(0, 10) // Correlated with x2 and y
gen aux2 = 0.5*x2 + 0.5*y + rnormal(0, 15) // Balanced correlation with x2 and y
gen aux3 = 0.6*x2 + 0.2*x1 + 0.2*y + rnormal(0, 12) // Adds x1 for slight variation
gen aux4 = 0.4*x2 + 0.6*y + rnormal(0, 20) // Heavier weight on y (missingness predictor)
gen aux5 = 0.3*x2 + 0.3*y + 0.4*x3 + rnormal(0, 18) // Introduces correlation with x3
* Generate Auxiliary Variables Predictive of ONLY Missing Values
gen aux6 = 0.9*x2 + rnormal(0, 10) // Strong correlation with x2, no y dependence
gen aux7 = 0.8*x2 + 0.2*x3 + rnormal(0, 12) // Includes x3, but no direct link to missingness
gen aux8 = 0.85*x2 + 0.1*x1 + rnormal(0, 15) // Includes x1, but missingness-independent
* setting MI data up
mi set wide
mi register imputed y x1 x2 x3 aux1 aux2 aux3 aux4 aux5 aux6 aux7 aux8
mi impute chained ///
    (regress) y x1 x2 x3 aux1 aux2 aux3 aux4 aux5 aux6 aux7 aux8 ///
    , rseed(12345) dots force add(10) burnin(10)
* mi model
mi estimate, post dots: regress y x1 x2 x3
predict yhat
gen resid = y - yhat
gen sqr_resid = resid^2
summarize sqr_resid
gen RMSE1h = sqrt(r(sum) / 997)
gen beta1 = _b[x1]
gen se1 = _se[x1]
gen lower1 = beta1 - 1.96 * se1
gen upper1 = beta1 + 1.96 * se1
gen beta2 = _b[x2]
gen se2 = _se[x2]
gen lower2 = beta2 - 1.96 * se2
gen upper2 = beta2 + 1.96 * se2
gen beta3 = _b[x3]
gen se3 = _se[x3]
gen lower3 = beta3 - 1.96 * se3
gen upper3 = beta3 + 1.96 * se3
end
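The distinction between the two kinds of auxiliary variable can be checked directly. The following Python sketch (hypothetical, reusing the same data-generating setup) builds one auxiliary of each type and compares its correlation with the missingness indicator; aux1, which loads on y, tracks missingness much more strongly than aux6, which relates to missingness only indirectly through x2:

```python
import numpy as np

rng = np.random.default_rng(1234)
n = 1000
x1 = rng.normal(40, 12, n)
x2 = rng.normal(200, 50, n)
x3 = rng.normal(150, 5, n)
y = 30*x1 + 40*x2 + 50*x3 + rng.normal(5000, 1500, n)
miss = rng.random(n) < 1.0 / (1.0 + np.exp(-np.clip(y - 21700, -50, 50)))

# Predictive of missing values AND the probability of missingness
aux1 = 0.7*x2 + 0.3*y + rng.normal(0, 10, n)
# Predictive of the missing values only (no direct y component)
aux6 = 0.9*x2 + rng.normal(0, 10, n)

corr_aux1_miss = np.corrcoef(aux1, miss)[0, 1]
corr_aux6_miss = np.corrcoef(aux6, miss)[0, 1]
print(corr_aux1_miss, corr_aux6_miss)
```

Both kinds of auxiliary improve the imputation model for x2, but only the first kind also carries information about the missingness mechanism itself.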
The final program, "miaux100", performs multiple imputation with 100 imputations and a 20-iteration burn-in, using the same eight auxiliary variables as "miaux1".
%%stata
capture program drop miaux100
program miaux100, rclass
* drop all variables to create an empty dataset
drop _all
* Set number of observations
set obs 1000
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)
* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)
* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)
* Generate metric dependent variable
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
* Generate MAR
gen prob_mar = logistic(y-21791)
gen rmar = runiform() < prob_mar
replace x2 = . if rmar == 1
* Generate Auxiliary Variables Predictive of Probability of Missingness and Missing Values
gen aux1 = 0.7*x2 + 0.3*y + rnormal(0, 10) // Correlated with x2 and y
gen aux2 = 0.5*x2 + 0.5*y + rnormal(0, 15) // Balanced correlation with x2 and y
gen aux3 = 0.6*x2 + 0.2*x1 + 0.2*y + rnormal(0, 12) // Adds x1 for slight variation
gen aux4 = 0.4*x2 + 0.6*y + rnormal(0, 20) // Heavier weight on y (missingness predictor)
gen aux5 = 0.3*x2 + 0.3*y + 0.4*x3 + rnormal(0, 18) // Introduces correlation with x3
* Generate Auxiliary Variables Predictive of ONLY Missing Values
gen aux6 = 0.9*x2 + rnormal(0, 10) // Strong correlation with x2, no y dependence
gen aux7 = 0.8*x2 + 0.2*x3 + rnormal(0, 12) // Includes x3, but no direct link to missingness
gen aux8 = 0.85*x2 + 0.1*x1 + rnormal(0, 15) // Includes x1, but missingness-independent
* setting MI data up
mi set wide
mi register imputed y x1 x2 x3 aux1 aux2 aux3 aux4 aux5 aux6 aux7 aux8
mi impute chained ///
    (regress) y x1 x2 x3 aux1 aux2 aux3 aux4 aux5 aux6 aux7 aux8 ///
    , rseed(12345) dots force add(100) burnin(20)
* mi model
mi estimate, post dots: regress y x1 x2 x3
predict yhat
gen resid = y - yhat
gen sqr_resid = resid^2
summarize sqr_resid
gen RMSE1i = sqrt(r(sum) / 997)
gen beta1 = _b[x1]
gen se1 = _se[x1]
gen lower1 = beta1 - 1.96 * se1
gen upper1 = beta1 + 1.96 * se1
gen beta2 = _b[x2]
gen se2 = _se[x2]
gen lower2 = beta2 - 1.96 * se2
gen upper2 = beta2 + 1.96 * se2
gen beta3 = _b[x3]
gen se3 = _se[x3]
gen lower3 = beta3 - 1.96 * se3
gen upper3 = beta3 + 1.96 * se3
end
This next section of the Notebook runs each program simulation. Each simulation is repeated 1,000 times with a set seed of 1234. For each model, the 95 per cent confidence intervals of the means of the betas and standard errors are extracted; percentage bias, standardised bias, RMSE, coverage of the true parameter values, and confidence interval length are then computed, and the dataset is saved after each simulation.
%%stata
simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upper3 RMSE1a, reps(1000) seed(1234): regress1
ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_10 _sim_11 _sim_12 _sim_13
rename _sim_1 beta1a
rename _sim_2 se1a
rename _sim_3 beta2a
rename _sim_4 se2a
rename _sim_5 beta3a
rename _sim_6 se3a
rename _sim_7 ll1a
rename _sim_8 ll2a
rename _sim_9 ll3a
rename _sim_10 ul1a
rename _sim_11 ul2a
rename _sim_12 ul3a
rename _sim_13 RMSE1a
gen absbias1 = 100 * (beta1a - 30) / 30
gen absbias2 = 100 * (beta2a - 40) / 40
gen absbias3 = 100 * (beta3a - 50) / 50
summarize absbias1 absbias2 absbias3
gen bias1 = 100 * (beta1a - 30) / se1a
gen bias2 = 100 * (beta2a - 40) / se2a
gen bias3 = 100 * (beta3a - 50) / se3a
summarize bias1 bias2 bias3
summarize RMSE1a
gen covered1 = (30 >= ll1a) & (30 <= ul1a)
gen covered2 = (40 >= ll2a) & (40 <= ul2a)
gen covered3 = (50 >= ll3a) & (50 <= ul3a)
summarize covered1 covered2 covered3
gen length1 = ul1a-ll1a
gen length2 = ul2a-ll2a
gen length3 = ul3a-ll3a
summarize length1 length2 length3
save godmodel, replace
Command: regress1 (simulate, 1,000 replications completed)

. ci mean _sim_1-_sim_13

    Variable |      Obs      Mean   Std. err.   [95% conf. interval]
-------------+------------------------------------------------------
      _sim_1 |    1,000  30.00808   .1223364    29.76802    30.24815
      _sim_2 |    1,000  3.958464   .0039263    3.950759    3.966169
      _sim_3 |    1,000  40.01876   .0299956     39.9599    40.07762
      _sim_4 |    1,000  .9516209   .0009688    .9497198    .9535221
      _sim_5 |    1,000  49.87959   .3001486    49.29059    50.46858
      _sim_6 |    1,000   9.51589   .0095674    9.497115    9.534664
      _sim_7 |    1,000  22.24949   .1221543    22.00978     22.4892
      _sim_8 |    1,000  38.15358   .0300231    38.09467     38.2125
      _sim_9 |    1,000  31.22844   .2998234    30.64009     31.8168
     _sim_10 |    1,000  37.76667   .1230006     37.5253    38.00804
     _sim_11 |    1,000  41.88393   .0300882    41.82489    41.94298
     _sim_12 |    1,000  68.53073   .3016415    67.93881    69.12265
     _sim_13 |    1,000      1500   1.075835    1497.889    1502.111

. summarize absbias1 absbias2 absbias3

    Variable |      Obs      Mean   Std. dev.        Min        Max
-------------+------------------------------------------------------
    absbias1 |    1,000  .0269405   12.89539   -39.21011   41.60975
    absbias2 |    1,000  .0468945   2.371359    -6.72061   7.922773
    absbias3 |    1,000 -.2408279   18.98306   -57.06816   58.99072

. summarize bias1 bias2 bias3

    Variable |      Obs      Mean   Std. dev.        Min        Max
-------------+------------------------------------------------------
       bias1 |    1,000  .0288394   97.83546   -295.6941   304.1698
       bias2 |    1,000  1.916143   99.79077   -288.9381   337.0835
       bias3 |    1,000 -1.418348   99.93945   -294.8282   311.0158

. summarize RMSE1a

    Variable |      Obs      Mean   Std. dev.        Min        Max
-------------+------------------------------------------------------
      RMSE1a |    1,000      1500   34.02088    1384.698    1605.48

. summarize covered1 covered2 covered3

    Variable |      Obs      Mean   Std. dev.        Min        Max
-------------+------------------------------------------------------
    covered1 |    1,000      .955   .2074079           0          1
    covered2 |    1,000      .943   .2319586           0          1
    covered3 |    1,000      .947   .2241456           0          1

. summarize length1 length2 length3

    Variable |      Obs      Mean   Std. dev.        Min        Max
-------------+------------------------------------------------------
     length1 |    1,000  15.51718   .4867134    13.97988   17.13386
     length2 |    1,000  3.730354   .1200956    3.358505   4.088402
     length3 |    1,000  37.30229   1.185991    33.99916   41.38372

file godmodel.dta saved
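The evaluation metrics computed above follow standard formulas, which can be sketched in Python on stand-in data (the vectors of betas and standard errors here are hypothetical placeholders for the 1,000 simulated draws the notebook obtains from simulate):

```python
import numpy as np

rng = np.random.default_rng(1234)
true_beta = 30.0

# Stand-in for the 1,000 simulated draws of beta1 and its SE
se = np.full(1000, 3.96)
beta = rng.normal(true_beta, 3.96, 1000)

lower, upper = beta - 1.96*se, beta + 1.96*se

pct_bias = 100 * (beta - true_beta) / true_beta   # percentage bias
std_bias = 100 * (beta - true_beta) / se          # standardised bias
covered = (true_beta >= lower) & (true_beta <= upper)
length = upper - lower

print(pct_bias.mean(), std_bias.mean(), covered.mean(), length.mean())
```

With a correctly specified model the coverage rate should sit near the nominal 95 per cent, which is exactly what the base simulation above reports (.955, .943, .947).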
%%stata
simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upper3 RMSE1b, reps(1000) seed(1234): sem1
ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_10 _sim_11 _sim_12 _sim_13
rename _sim_1 beta1b
rename _sim_2 se1b
rename _sim_3 beta2b
rename _sim_4 se2b
rename _sim_5 beta3b
rename _sim_6 se3b
rename _sim_7 ll1b
rename _sim_8 ll2b
rename _sim_9 ll3b
rename _sim_10 ul1b
rename _sim_11 ul2b
rename _sim_12 ul3b
rename _sim_13 RMSE1b
gen absbias1 = 100 * (beta1b - 30) / 30
gen absbias2 = 100 * (beta2b - 40) / 40
gen absbias3 = 100 * (beta3b - 50) / 50
summarize absbias1 absbias2 absbias3
gen bias1 = 100 * (beta1b - 30) / se1b
gen bias2 = 100 * (beta2b - 40) / se2b
gen bias3 = 100 * (beta3b - 50) / se3b
summarize bias1 bias2 bias3
summarize RMSE1b
gen covered1 = (30 >= ll1b) & (30 <= ul1b)
gen covered2 = (40 >= ll2b) & (40 <= ul2b)
gen covered3 = (50 >= ll3b) & (50 <= ul3b)
summarize covered1 covered2 covered3
gen length1 = ul1b-ll1b
gen length2 = ul2b-ll2b
gen length3 = ul3b-ll3b
summarize length1 length2 length3
save semgodmodel, replace
Command: sem1 (simulate, 1,000 replications completed)

. ci mean _sim_1-_sim_13

    Variable |      Obs      Mean   Std. err.   [95% conf. interval]
-------------+------------------------------------------------------
      _sim_1 |    1,000  30.00808   .1223364    29.76802    30.24815
      _sim_2 |    1,000  3.950539   .0039185     3.94285    3.958229
      _sim_3 |    1,000  40.01876   .0299956     39.9599    40.07762
      _sim_4 |    1,000  .9497158   .0009669    .9478184    .9516131
      _sim_5 |    1,000  49.87959   .3001486    49.29059    50.46858
      _sim_6 |    1,000  9.496839   .0095483    9.478102    9.515576
      _sim_7 |    1,000  22.26502   .1221542    22.02532    22.50473
      _sim_8 |    1,000  38.15731   .0300229     38.0984    38.21623
      _sim_9 |    1,000  31.26578   .2998229    30.67743    31.85414
     _sim_10 |    1,000  37.75114   .1229988    37.50977     37.9925
     _sim_11 |    1,000   41.8802   .0300878    41.82116    41.93924
     _sim_12 |    1,000  68.49339   .3016373    67.90147    69.08531
     _sim_13 |    1,000      1500   1.075835    1497.889    1502.111

. summarize absbias1 absbias2 absbias3

    Variable |      Obs      Mean   Std. dev.        Min        Max
-------------+------------------------------------------------------
    absbias1 |    1,000  .0269405   12.89539   -39.21011   41.60975
    absbias2 |    1,000  .0468945   2.371359    -6.72061   7.922773
    absbias3 |    1,000 -.2408279   18.98306   -57.06816   58.99072

. summarize bias1 bias2 bias3

    Variable |      Obs      Mean   Std. dev.        Min        Max
-------------+------------------------------------------------------
       bias1 |    1,000  .0288972   98.03172   -296.2873     304.78
       bias2 |    1,000  1.919987   99.99095   -289.5177   337.7597
       bias3 |    1,000 -1.421193   100.1399   -295.4196   311.6397

. summarize RMSE1b

    Variable |      Obs      Mean   Std. dev.        Min        Max
-------------+------------------------------------------------------
      RMSE1b |    1,000      1500   34.02088    1384.698    1605.48

. summarize covered1 covered2 covered3

    Variable |      Obs      Mean   Std. dev.        Min        Max
-------------+------------------------------------------------------
    covered1 |    1,000      .955   .2074079           0          1
    covered2 |    1,000      .942   .2338604           0          1
    covered3 |    1,000      .947   .2241456           0          1

. summarize length1 length2 length3

    Variable |      Obs      Mean   Std. dev.        Min        Max
-------------+------------------------------------------------------
     length1 |    1,000  15.48611    .485739    13.95189   17.09955
     length2 |    1,000  3.722886   .1198551    3.351784   4.080215
     length3 |    1,000  37.22761   1.183617    33.93109   41.30087

file semgodmodel.dta saved
%%stata
simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upper3 RMSE1c, reps(1000) seed(1234): missingmcar
ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_10 _sim_11 _sim_12 _sim_13
rename _sim_1 beta1c
rename _sim_2 se1c
rename _sim_3 beta2c
rename _sim_4 se2c
rename _sim_5 beta3c
rename _sim_6 se3c
rename _sim_7 ll1c
rename _sim_8 ll2c
rename _sim_9 ll3c
rename _sim_10 ul1c
rename _sim_11 ul2c
rename _sim_12 ul3c
rename _sim_13 RMSE1c
gen absbias1 = 100 * (beta1c - 30) / 30
gen absbias2 = 100 * (beta2c - 40) / 40
gen absbias3 = 100 * (beta3c - 50) / 50
summarize absbias1 absbias2 absbias3
gen bias1 = 100 * (beta1c - 30) / se1c
gen bias2 = 100 * (beta2c - 40) / se2c
gen bias3 = 100 * (beta3c - 50) / se3c
summarize bias1 bias2 bias3
summarize RMSE1c
gen covered1 = (30 >= ll1c) & (30 <= ul1c)
gen covered2 = (40 >= ll2c) & (40 <= ul2c)
gen covered3 = (50 >= ll3c) & (50 <= ul3c)
summarize covered1 covered2 covered3
gen length1 = ul1c-ll1c
gen length2 = ul2c-ll2c
gen length3 = ul3c-ll3c
summarize length1 length2 length3
save missingmcar, replace
. simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upper3 RMSE1c, reps(1000) seed(1234): missingmcar
Simulations (1,000) complete.

. ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_10 _sim_11 _sim_12 _sim_13

    Variable |      Obs        Mean    Std. err.     [95% conf. interval]
-------------+-----------------------------------------------------------
      _sim_1 |    1,000    29.96598    .1732293     29.62604    30.30591
      _sim_2 |    1,000    5.622786    .0085991     5.605912     5.63966
      _sim_3 |    1,000    40.02882    .0423628     39.94568    40.11195
      _sim_4 |    1,000    1.348911     .002079     1.344832    1.352991
      _sim_5 |    1,000    51.29684     .436001     50.44125    52.15242
      _sim_6 |    1,000    13.47856    .0213482     13.43667    13.52046
      _sim_7 |    1,000    18.94532    .1742216     18.60344      19.2872
      _sim_8 |    1,000    37.38495     .042557     37.30144    37.46846
      _sim_9 |    1,000    24.87885    .4352299     24.02478    25.73292
     _sim_10 |    1,000    40.98664    .1738728     40.64544    41.32784
     _sim_11 |    1,000    42.67268    .0425597     42.58916      42.7562
     _sim_12 |    1,000    77.71482     .440761      76.8499    78.57974
     _sim_13 |    1,000    1519.553    1.684498     1516.248    1522.859

(_sim_1 to _sim_13 are then renamed beta1c, se1c, beta2c, se2c, beta3c, se3c, ll1c, ll2c, ll3c, ul1c, ul2c, ul3c and RMSE1c, and the bias, coverage and length variables are generated from them.)

. summarize absbias1 absbias2 absbias3

    Variable |      Obs        Mean    Std. dev.        Min         Max
-------------+----------------------------------------------------------
    absbias1 |    1,000   -.1134023     18.25997  -56.99002    62.33422
    absbias2 |    1,000    .0720377     3.349077  -10.54511    10.21497
    absbias3 |    1,000    2.593673     27.57512  -69.91905    93.54358

. summarize bias1 bias2 bias3

    Variable |      Obs        Mean    Std. dev.        Min         Max
-------------+----------------------------------------------------------
       bias1 |    1,000   -.5500844     97.42645  -308.4916    323.5037
       bias2 |    1,000    2.142476     99.44684  -305.4591    300.9473
       bias3 |    1,000    9.302094     102.5857  -258.2774    356.9467

. summarize RMSE1c

    Variable |      Obs        Mean    Std. dev.        Min         Max
-------------+----------------------------------------------------------
      RMSE1c |    1,000    1519.553     53.26849   1355.798     1698.11

. summarize covered1 covered2 covered3

    Variable |      Obs        Mean    Std. dev.        Min         Max
-------------+----------------------------------------------------------
    covered1 |    1,000        .963     .1888562          0           1
    covered2 |    1,000        .955     .2074079          0           1
    covered3 |    1,000         .94     .2376057          0           1

. summarize length1 length2 length3

    Variable |      Obs        Mean    Std. dev.        Min         Max
-------------+----------------------------------------------------------
     length1 |    1,000    22.04132     1.065957   19.29918    26.03639
     length2 |    1,000    5.287733     .2577094   4.545906    6.251045
     length3 |    1,000    52.83597     2.646351   44.42754    61.73584

. save missingmcar, replace
file missingmcar.dta saved
%%stata
simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upper3 RMSE1d, reps(1000) seed(1234): missingmar
ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_10 _sim_11 _sim_12 _sim_13
rename _sim_1 beta1d
rename _sim_2 se1d
rename _sim_3 beta2d
rename _sim_4 se2d
rename _sim_5 beta3d
rename _sim_6 se3d
rename _sim_7 ll1d
rename _sim_8 ll2d
rename _sim_9 ll3d
rename _sim_10 ul1d
rename _sim_11 ul2d
rename _sim_12 ul3d
rename _sim_13 RMSE1d
gen absbias1 = 100 * (beta1d - 30) / 30
gen absbias2 = 100 * (beta2d - 40) / 40
gen absbias3 = 100 * (beta3d - 50) / 50
summarize absbias1 absbias2 absbias3
gen bias1 = 100 * (beta1d - 30) / se1d
gen bias2 = 100 * (beta2d - 40) / se2d
gen bias3 = 100 * (beta3d - 50) / se3d
summarize bias1 bias2 bias3
summarize RMSE1d
gen covered1 = (30 >= ll1d) & (30 <= ul1d)
gen covered2 = (40 >= ll2d) & (40 <= ul2d)
gen covered3 = (50 >= ll3d) & (50 <= ul3d)
summarize covered1 covered2 covered3
gen length1 = ul1d-ll1d
gen length2 = ul2d-ll2d
gen length3 = ul3d-ll3d
summarize length1 length2 length3
save missingmar, replace
. simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upper3 RMSE1d, reps(1000) seed(1234): missingmar
Simulations (1,000) complete.

. ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_10 _sim_11 _sim_12 _sim_13

    Variable |      Obs        Mean    Std. err.     [95% conf. interval]
-------------+-----------------------------------------------------------
      _sim_1 |    1,000    18.88248    .1384802     18.61074    19.15423
      _sim_2 |    1,000    4.423952    .0068386     4.410532    4.437371
      _sim_3 |    1,000    25.14197    .0458819     25.05193      25.232
      _sim_4 |    1,000    1.354173    .0020679     1.350115    1.358231
      _sim_5 |    1,000    31.69827    .3393094     31.03243    32.36411
      _sim_6 |    1,000    10.55861    .0169251      10.5254    10.59183
      _sim_7 |    1,000    10.21154    .1385361     9.939683    10.48339
      _sim_8 |    1,000    22.48779    .0457186     22.39807     22.5775
      _sim_9 |    1,000    11.00338    .3394656     10.33724    11.66953
     _sim_10 |    1,000    27.55343    .1397161     27.27926     27.8276
     _sim_11 |    1,000    27.79615    .0464001     27.70509     27.8872
     _sim_12 |    1,000    52.39315    .3423825     51.72128    53.06502
     _sim_13 |    1,000    1183.398    1.381987     1180.686     1186.11

(_sim_1 to _sim_13 are then renamed beta1d, se1d, beta2d, se2d, beta3d, se3d, ll1d, ll2d, ll3d, ul1d, ul2d, ul3d and RMSE1d, and the bias, coverage and length variables are generated from them.)

. summarize absbias1 absbias2 absbias3

    Variable |      Obs        Mean    Std. dev.        Min         Max
-------------+----------------------------------------------------------
    absbias1 |    1,000   -37.05839     14.59709  -88.03668    16.64221
    absbias2 |    1,000   -37.14508     3.627286  -49.51259   -26.15721
    absbias3 |    1,000   -36.60347     21.45981  -98.38974     36.0162

. summarize bias1 bias2 bias3

    Variable |      Obs        Mean    Std. dev.        Min         Max
-------------+----------------------------------------------------------
       bias1 |    1,000   -252.1138     100.3254   -582.379    107.1902
       bias2 |    1,000    -1100.19     123.7338  -1497.348   -770.2115
       bias3 |    1,000   -174.0128     102.1183  -490.4887    159.8803

. summarize RMSE1d

    Variable |      Obs        Mean    Std. dev.        Min         Max
-------------+----------------------------------------------------------
      RMSE1d |    1,000    1183.398     43.70226   1058.915    1326.678

. summarize covered1 covered2 covered3

    Variable |      Obs        Mean    Std. dev.        Min         Max
-------------+----------------------------------------------------------
    covered1 |    1,000        .289     .4535247          0           1
    covered2 |    1,000           0            0          0           0
    covered3 |    1,000        .578     .4941257          0           1

. summarize length1 length2 length3

    Variable |      Obs        Mean    Std. dev.        Min         Max
-------------+----------------------------------------------------------
     length1 |    1,000    17.34189     .8477156   14.75978    20.56871
     length2 |    1,000     5.30836     .2563435   4.594616    6.459545
     length3 |    1,000    41.38977     2.098058   35.78901     48.7177

. save missingmar, replace
file missingmar.dta saved
Beyond the mean betas and standard errors reported above, five measures are produced to evaluate the performance of each method. The MAR model is used as a baseline against which all other model statistics are compared. The first measure is the absolute percentage bias. This tells us the magnitude of bias, that is, how far the estimate sits from the true value, expressed as a percentage of that true value. The second measure is the absolute bias as a percentage of the standard error. This measures the bias relative to the uncertainty (standard error) of the estimator, asking whether the bias is large compared to how variable the estimate is. Whilst the former is a simple, practical measure of raw distance from the truth, the latter allows us to assess that distance relative to the estimator's precision.
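To make the two bias measures concrete, here is a small sketch in Python (the notebook's host language) rather than Stata; the simulated draws, and the true value of 30, are hypothetical stand-ins for one of the coefficients above:

```python
import random
import statistics

random.seed(1234)

TRUE_BETA = 30.0
# Hypothetical stand-ins for 1,000 simulated estimates and their standard errors
betas = [random.gauss(TRUE_BETA, 4.0) for _ in range(1000)]
ses = [4.0] * 1000

# Absolute percentage bias: distance from the true value,
# expressed as a percentage of that true value
abs_pct_bias = [100 * (b - TRUE_BETA) / TRUE_BETA for b in betas]

# Bias as a percentage of the standard error: the same distance,
# scaled by the estimator's own uncertainty
se_pct_bias = [100 * (b - TRUE_BETA) / s for b, s in zip(betas, ses)]

print(statistics.mean(abs_pct_bias), statistics.mean(se_pct_bias))
```

Averaged over the replicates, both quantities should hover near zero for an unbiased estimator, mirroring the `summarize absbias* bias*` steps in the Stata cells.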
The third measure is the root mean square error (RMSE). This measure combines bias and efficiency: it quantifies the difference between actual and predicted values, and a smaller RMSE indicates better model performance. It can be viewed as an unstandardised counterpart of the R-squared measure.
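As a rough illustration (again in Python; the draws are hypothetical stand-ins for the simulated estimates), RMSE is the square root of the mean squared distance from the true value, which decomposes into squared bias plus variance:

```python
import math
import random
import statistics

random.seed(1234)

TRUE_BETA = 50.0
# Hypothetical stand-ins for 1,000 simulated estimates of one coefficient
betas = [random.gauss(TRUE_BETA, 10.0) for _ in range(1000)]

# RMSE rewards estimators that are both unbiased (small bias term)
# and efficient (small variance term)
rmse = math.sqrt(statistics.fmean([(b - TRUE_BETA) ** 2 for b in betas]))
```

With unbiased draws, the RMSE here should sit close to the sampling standard deviation of 10.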
Coverage is the fourth measure used. The actual coverage of the nominal 95% confidence intervals is reported, in order to assess the actual Type I error rate of a testing procedure with a 0.05 level criterion. Coverage that drops below 90% is considered troublesome (Collins, Schafer and Kam 2001).
Finally, the average length of the confidence interval is provided. This is a simple measure of how wide the confidence intervals are, calculated by subtracting the lower bound from the upper bound. An interval length shorter than that of the base model indicates greater precision and higher statistical power.
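Coverage and interval length can be sketched the same way; the draws below are hypothetical, with a known true value and a nominal 95% interval of beta ± 1.96 × SE:

```python
import random
import statistics

random.seed(1234)

TRUE_BETA = 40.0   # hypothetical known true coefficient
N_REPS = 1000

covered = []
lengths = []
for _ in range(N_REPS):
    beta = random.gauss(TRUE_BETA, 1.0)  # stand-in for one replicate's estimate
    se = 1.0                             # its (here, known) standard error
    lower = beta - 1.96 * se
    upper = beta + 1.96 * se
    covered.append(lower <= TRUE_BETA <= upper)  # does the 95% CI cover the truth?
    lengths.append(upper - lower)                # width = upper bound - lower bound

coverage = statistics.mean(covered)    # should sit near the nominal .95
avg_length = statistics.mean(lengths)  # constant here: 2 * 1.96 * se
```

This mirrors the `covered*` and `length*` variables generated in the Stata cells: a well-calibrated method keeps coverage near .95, and shorter intervals at the same coverage indicate higher power.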
%%stata
simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upper3 RMSE1e, reps(1000) seed(1234): missingmean
ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_10 _sim_11 _sim_12 _sim_13
rename _sim_1 beta1e
rename _sim_2 se1e
rename _sim_3 beta2e
rename _sim_4 se2e
rename _sim_5 beta3e
rename _sim_6 se3e
rename _sim_7 ll1e
rename _sim_8 ll2e
rename _sim_9 ll3e
rename _sim_10 ul1e
rename _sim_11 ul2e
rename _sim_12 ul3e
rename _sim_13 RMSE1e
gen absbias1 = 100 * (beta1e - 30) / 30
gen absbias2 = 100 * (beta2e - 40) / 40
gen absbias3 = 100 * (beta3e - 50) / 50
summarize absbias1 absbias2 absbias3
gen bias1 = 100 * (beta1e - 30) / se1e
gen bias2 = 100 * (beta2e - 40) / se2e
gen bias3 = 100 * (beta3e - 50) / se3e
summarize bias1 bias2 bias3
summarize RMSE1e
gen covered1 = (30 >= ll1e) & (30 <= ul1e)
gen covered2 = (40 >= ll2e) & (40 <= ul2e)
gen covered3 = (50 >= ll3e) & (50 <= ul3e)
summarize covered1 covered2 covered3
gen length1 = ul1e-ll1e
gen length2 = ul2e-ll2e
gen length3 = ul3e-ll3e
summarize length1 length2 length3
save missingmean, replace
. simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upper3 RMSE1e, reps(1000) seed(1234): missingmean
Simulations (1,000) complete.

. ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_10 _sim_11 _sim_12 _sim_13

    Variable |      Obs        Mean    Std. err.     [95% conf. interval]
-------------+-----------------------------------------------------------
      _sim_1 |    1,000     33.9267    .1942382     33.54554    34.30786
      _sim_2 |    1,000    6.335807    .0065203     6.323012    6.348602
      _sim_3 |    1,000    25.75512    .0470237     25.66285     25.8474
      _sim_4 |    1,000    2.718789    .0034573     2.712005    2.725574
      _sim_5 |    1,000    56.64393    .4947365     55.67309    57.61478
      _sim_6 |    1,000    15.17203    .0153294     15.14195    15.20212
      _sim_7 |    1,000    21.50852    .1947725     21.12631    21.89073
      _sim_8 |    1,000     20.4263    .0469501     20.33416    20.51843
      _sim_9 |    1,000    26.90675    .4951852     25.93502    27.87847
     _sim_10 |    1,000    46.34488    .1945437     45.96312    46.72664
     _sim_11 |    1,000    31.08395    .0480623     30.98963    31.17826
     _sim_12 |    1,000    86.38112    .4961103     85.40759    87.35466
     _sim_13 |    1,000    2392.798    1.660691     2389.539    2396.057

(_sim_1 to _sim_13 are then renamed beta1e, se1e, beta2e, se2e, beta3e, se3e, ll1e, ll2e, ll3e, ul1e, ul2e, ul3e and RMSE1e, and the bias, coverage and length variables are generated from them.)

. summarize absbias1 absbias2 absbias3

    Variable |      Obs        Mean    Std. dev.        Min         Max
-------------+----------------------------------------------------------
    absbias1 |    1,000      13.089      20.4745  -55.57906    76.68382
    absbias2 |    1,000   -35.61219     3.717553  -49.32283   -23.42478
    absbias3 |    1,000    13.28787     31.28988  -102.6488    116.0377

. summarize bias1 bias2 bias3

    Variable |      Obs        Mean    Std. dev.        Min         Max
-------------+----------------------------------------------------------
       bias1 |    1,000    62.08034     97.09138  -261.3323    384.6086
       bias2 |    1,000   -524.9757     60.37031  -722.8023   -342.2611
       bias3 |    1,000    43.78726      103.171  -351.4386    392.1291

. summarize RMSE1e

    Variable |      Obs        Mean    Std. dev.        Min         Max
-------------+----------------------------------------------------------
      RMSE1e |    1,000    2392.798     52.51565   2232.317    2586.839

. summarize covered1 covered2 covered3

    Variable |      Obs        Mean    Std. dev.        Min         Max
-------------+----------------------------------------------------------
    covered1 |    1,000        .917     .2760203          0           1
    covered2 |    1,000           0            0          0           0
    covered3 |    1,000        .927     .2602667          0           1

. summarize length1 length2 length3

    Variable |      Obs        Mean    Std. dev.        Min         Max
-------------+----------------------------------------------------------
     length1 |    1,000    24.83636      .808267   22.74042    28.44074
     length2 |    1,000    10.65765     .4285732   9.188656    12.08593
     length3 |    1,000    59.47438     1.900247   53.41096    66.90497

. save missingmean, replace
file missingmean.dta saved
%%stata
simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upper3 RMSE1f, reps(1000) seed(1234): semmlmv
ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_10 _sim_11 _sim_12 _sim_13
rename _sim_1 beta1f
rename _sim_2 se1f
rename _sim_3 beta2f
rename _sim_4 se2f
rename _sim_5 beta3f
rename _sim_6 se3f
rename _sim_7 ll1f
rename _sim_8 ll2f
rename _sim_9 ll3f
rename _sim_10 ul1f
rename _sim_11 ul2f
rename _sim_12 ul3f
rename _sim_13 RMSE1f
gen absbias1 = 100 * (beta1f - 30) / 30
gen absbias2 = 100 * (beta2f - 40) / 40
gen absbias3 = 100 * (beta3f - 50) / 50
summarize absbias1 absbias2 absbias3
gen bias1 = 100 * (beta1f - 30) / se1f
gen bias2 = 100 * (beta2f - 40) / se2f
gen bias3 = 100 * (beta3f - 50) / se3f
summarize bias1 bias2 bias3
summarize RMSE1f
gen covered1 = (30 >= ll1f) & (30 <= ul1f)
gen covered2 = (40 >= ll2f) & (40 <= ul2f)
gen covered3 = (50 >= ll3f) & (50 <= ul3f)
summarize covered1 covered2 covered3
gen length1 = ul1f-ll1f
gen length2 = ul2f-ll2f
gen length3 = ul3f-ll3f
summarize length1 length2 length3
save semmlmv, replace
. simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upper3 RMSE1f, reps(1000) seed(1234): semmlmv
Simulations (1,000) complete.

. ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_10 _sim_11 _sim_12 _sim_13

    Variable |      Obs        Mean    Std. err.     [95% conf. interval]
-------------+-----------------------------------------------------------
      _sim_1 |    1,000    30.00406     .156705     29.69656    30.31157
      _sim_2 |    1,000    5.068415    .0073885     5.053917    5.082914
      _sim_3 |    1,000    40.02852    .0404882     39.94907    40.10797
      _sim_4 |    1,000    1.260888    .0017933     1.257369    1.264407
      _sim_5 |    1,000    50.24373    .3887388     49.48089    51.00657
      _sim_6 |    1,000    12.07971    .0181121     12.04417    12.11525
      _sim_7 |    1,000    20.06997    .1553599      19.7651    20.37484
      _sim_8 |    1,000    37.55718    .0382948     37.48203    37.63233
      _sim_9 |    1,000     26.5675    .3852995     25.81141    27.32359
     _sim_10 |    1,000    39.93816    .1593601     39.62544    40.25088
     _sim_11 |    1,000    42.49986    .0428579     42.41576    42.58397
     _sim_12 |    1,000    73.91996    .3953485     73.14415    74.69576
     _sim_13 |    1,000    1242.749    1.964788     1238.894    1246.605

(_sim_1 to _sim_13 are then renamed beta1f, se1f, beta2f, se2f, beta3f, se3f, ll1f, ll2f, ll3f, ul1f, ul2f, ul3f and RMSE1f, and the bias, coverage and length variables are generated from them.)

. summarize absbias1 absbias2 absbias3

    Variable |      Obs        Mean    Std. dev.        Min         Max
-------------+----------------------------------------------------------
    absbias1 |    1,000    .0135462     16.51816  -47.75852    50.64448
    absbias2 |    1,000    .0713057     3.200872  -10.34884    9.321384
    absbias3 |    1,000    .4874593       24.586  -82.42333    85.41435

. summarize bias1 bias2 bias3

    Variable |      Obs        Mean    Std. dev.        Min         Max
-------------+----------------------------------------------------------
       bias1 |    1,000   -.5380218     97.58873  -290.6914    303.4151
       bias2 |    1,000   -.7176602     102.1402   -352.801     272.922
       bias3 |    1,000    1.334883     101.8232  -312.8748    350.3375

. summarize RMSE1f

    Variable |      Obs        Mean    Std. dev.        Min         Max
-------------+----------------------------------------------------------
      RMSE1f |    1,000    1242.749     62.13206    1061.51    1500.909

. summarize covered1 covered2 covered3

    Variable |      Obs        Mean    Std. dev.        Min         Max
-------------+----------------------------------------------------------
    covered1 |    1,000         .95      .218054          0           1
    covered2 |    1,000        .954     .2095899          0           1
    covered3 |    1,000        .948     .2221381          0           1

. summarize length1 length2 length3

    Variable |      Obs        Mean    Std. dev.        Min         Max
-------------+----------------------------------------------------------
     length1 |    1,000    19.86819     .9158888   16.79119     23.3758
     length2 |    1,000    4.942682     .2222984   4.252899     6.01104
     length3 |    1,000    47.35245     2.245196   41.31921    55.62303

. save semmlmv, replace
file semmlmv.dta saved
%%stata
simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upper3 RMSE1g, reps(1000) seed(1234): minoaux1
ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_10 _sim_11 _sim_12 _sim_13
rename _sim_1 beta1g
rename _sim_2 se1g
rename _sim_3 beta2g
rename _sim_4 se2g
rename _sim_5 beta3g
rename _sim_6 se3g
rename _sim_7 ll1g
rename _sim_8 ll2g
rename _sim_9 ll3g
rename _sim_10 ul1g
rename _sim_11 ul2g
rename _sim_12 ul3g
rename _sim_13 RMSE1g
gen absbias1 = 100 * (beta1g - 30) / 30
gen absbias2 = 100 * (beta2g - 40) / 40
gen absbias3 = 100 * (beta3g - 50) / 50
summarize absbias1 absbias2 absbias3
gen bias1 = 100 * (beta1g - 30) / se1g
gen bias2 = 100 * (beta2g - 40) / se2g
gen bias3 = 100 * (beta3g - 50) / se3g
summarize bias1 bias2 bias3
summarize RMSE1g
gen covered1 = (30 >= ll1g) & (30 <= ul1g)
gen covered2 = (40 >= ll2g) & (40 <= ul2g)
gen covered3 = (50 >= ll3g) & (50 <= ul3g)
summarize covered1 covered2 covered3
gen length1 = ul1g-ll1g
gen length2 = ul2g-ll2g
gen length3 = ul3g-ll3g
summarize length1 length2 length3
save minoaux1, replace
. simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upper3 RMSE1g, reps(1000) seed(1234): minoaux1
Simulations (1,000) complete.

. ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_10 _sim_11 _sim_12 _sim_13

    Variable |      Obs        Mean    Std. err.     [95% conf. interval]
-------------+-----------------------------------------------------------
      _sim_1 |    1,000    23.26113      .03977     23.18309    23.33917
      _sim_2 |    1,000    5.114843    .0141056     5.087163    5.142523
      _sim_3 |    1,000    41.92422     .007087     41.91031    41.93813
      _sim_4 |    1,000     1.28995    .0027265     1.284599      1.2953
      _sim_5 |    1,000    44.36747    .1072674     44.15697    44.57797
      _sim_6 |    1,000    12.04358    .0336374     11.97757    12.10959
      _sim_7 |    1,000    13.23603    .0656352     13.10724    13.36483
      _sim_8 |    1,000    39.39592    .0095715     39.37714     39.4147
      _sim_9 |    1,000    20.76206    .1440514     20.47938    21.04473
     _sim_10 |    1,000    33.28622     .019597     33.24776    33.32468
     _sim_11 |    1,000    44.45252    .0081211     44.43659    44.46846
     _sim_12 |    1,000    67.97288    .1046668     67.76749    68.17828
     _sim_13 |    1,000    985.9129    .1843666     985.5511    986.2747

(_sim_1 to _sim_13 are then renamed beta1g, se1g, beta2g, se2g, beta3g, se3g, ll1g, ll2g, ll3g, ul1g, ul2g, ul3g and RMSE1g, and the bias, coverage and length variables are generated from them.)

. summarize absbias1 absbias2 absbias3

    Variable |      Obs        Mean    Std. dev.        Min         Max
-------------+----------------------------------------------------------
    absbias1 |    1,000   -22.46291     4.192123  -26.55727    5.163403
    absbias2 |    1,000    4.810551     .5602747   3.292112    5.726557
    absbias3 |    1,000   -11.26506     6.784187  -21.93382    7.914925

. summarize bias1 bias2 bias3

    Variable |      Obs        Mean    Std. dev.        Min         Max
-------------+----------------------------------------------------------
       bias1 |    1,000    -130.775     16.45119  -142.8849    31.58958
       bias2 |    1,000     149.961     21.17644   100.6044    182.8912
       bias3 |    1,000   -46.28289     26.30108  -86.16815    35.08329

. summarize RMSE1g

    Variable |      Obs        Mean    Std. dev.        Min         Max
-------------+----------------------------------------------------------
      RMSE1g |    1,000    985.9129     5.830185   981.2299    1132.013

. summarize covered1 covered2 covered3

    Variable |      Obs        Mean    Std. dev.        Min         Max
-------------+----------------------------------------------------------
    covered1 |    1,000           1            0          1           1
    covered2 |    1,000           1            0          1           1
    covered3 |    1,000           1            0          1           1

. summarize length1 length2 length3

    Variable |      Obs        Mean    Std. dev.        Min         Max
-------------+----------------------------------------------------------
     length1 |    1,000    20.05019     1.748552   17.65744    24.34246
     length2 |    1,000    5.056603     .3379832   4.545982    5.853828
     length3 |    1,000    47.21083     4.169732   40.84899    52.41678

. save minoaux1, replace
file minoaux1.dta saved
%%stata
simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upper3 RMSE1h, reps(1000) seed(1234): miaux1
ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_10 _sim_11 _sim_12 _sim_13
rename _sim_1 beta1h
rename _sim_2 se1h
rename _sim_3 beta2h
rename _sim_4 se2h
rename _sim_5 beta3h
rename _sim_6 se3h
rename _sim_7 ll1h
rename _sim_8 ll2h
rename _sim_9 ll3h
rename _sim_10 ul1h
rename _sim_11 ul2h
rename _sim_12 ul3h
rename _sim_13 RMSE1h
* Percent bias of each coefficient relative to its true value (30, 40, 50)
gen absbias1 = 100 * (beta1h - 30) / 30
gen absbias2 = 100 * (beta2h - 40) / 40
gen absbias3 = 100 * (beta3h - 50) / 50
summarize absbias1 absbias2 absbias3

* Bias standardised by the estimated standard error
gen bias1 = 100 * (beta1h - 30) / se1h
gen bias2 = 100 * (beta2h - 40) / se2h
gen bias3 = 100 * (beta3h - 50) / se3h
summarize bias1 bias2 bias3

summarize RMSE1h

* Coverage: does each 95% interval contain the true value?
gen covered1 = (30 >= ll1h) & (30 <= ul1h)
gen covered2 = (40 >= ll2h) & (40 <= ul2h)
gen covered3 = (50 >= ll3h) & (50 <= ul3h)
summarize covered1 covered2 covered3

* Width of each 95% confidence interval
gen length1 = ul1h - ll1h
gen length2 = ul2h - ll2h
gen length3 = ul3h - ll3h
summarize length1 length2 length3
save miaux1, replace
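The quantities built in these cells are standard Monte Carlo bookkeeping, independent of the software used. As a minimal sketch in Python (the function name and the toy inputs are hypothetical; the true values 30, 40, and 50 follow the data-generating model used throughout), the same diagnostics can be computed as:

```python
def mc_diagnostics(betas, ses, lls, uls, truth):
    """Summarise simulated estimates of one coefficient against its known
    true value: percent bias, SE-standardised bias, 95% CI coverage rate,
    and mean CI width."""
    n = len(betas)
    pct_bias = [100 * (b - truth) / truth for b in betas]
    std_bias = [100 * (b - truth) / s for b, s in zip(betas, ses)]
    coverage = sum(ll <= truth <= ul for ll, ul in zip(lls, uls)) / n
    widths = [ul - ll for ll, ul in zip(lls, uls)]
    return {
        "mean_pct_bias": sum(pct_bias) / n,
        "mean_std_bias": sum(std_bias) / n,
        "coverage": coverage,
        "mean_ci_width": sum(widths) / n,
    }

# Toy inputs standing in for three simulate() replicates of beta1 (truth = 30)
stats = mc_diagnostics(
    betas=[23.1, 23.4, 23.3],
    ses=[5.1, 5.2, 5.0],
    lls=[13.0, 13.2, 13.5],
    uls=[33.2, 33.6, 33.1],
    truth=30,
)
```

Each `gen`/`summarize` pair in the Stata cell corresponds to one entry of the returned dictionary, averaged over the 1,000 replicates.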
. simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upper3 RMSE1h, reps(1000) seed(1234): miaux1

      Command: miaux1
       _sim_1: beta1
       _sim_2: se1
       _sim_3: beta2
       _sim_4: se2
       _sim_5: beta3
       _sim_6: se3
       _sim_7: lower1
       _sim_8: lower2
       _sim_9: lower3
      _sim_10: upper1
      _sim_11: upper2
      _sim_12: upper3
      _sim_13: RMSE1h

Simulations (1,000): .........10.........20 ... .........990.........1,000 done

. ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_10 _sim_11 _sim_12 _sim_13

    Variable |        Obs        Mean    Std. err.    [95% conf. interval]
-------------+---------------------------------------------------------------
      _sim_1 |      1,000    32.75058    .0060984    32.73861    32.76255
      _sim_2 |      1,000    5.413813    .0008607    5.412124    5.415502
      _sim_3 |      1,000    40.51805    .0018459    40.51442    40.52167
      _sim_4 |      1,000    1.155773    .0003099    1.155165    1.156381
      _sim_5 |      1,000    58.83162    .0393252    58.75445    58.90879
-------------+---------------------------------------------------------------
      _sim_6 |      1,000    11.78506    .0019157     11.7813    11.78882
      _sim_7 |      1,000    22.13951    .0056935    22.12833    22.15068
      _sim_8 |      1,000    38.25273    .0018721    38.24906     38.2564
      _sim_9 |      1,000     35.7329    .0401093    35.65419    35.81161
     _sim_10 |      1,000    43.36165    .0069033    43.34811     43.3752
-------------+---------------------------------------------------------------
     _sim_11 |      1,000    42.78336    .0020118    42.77941    42.78731
     _sim_12 |      1,000    81.93033    .0388894    81.85401    82.00664
     _sim_13 |      1,000    1046.466    .1324932    1046.206    1046.726

. rename _sim_1 beta1h
. rename _sim_2 se1h
. rename _sim_3 beta2h
. rename _sim_4 se2h
. rename _sim_5 beta3h
. rename _sim_6 se3h
. rename _sim_7 ll1h
. rename _sim_8 ll2h
. rename _sim_9 ll3h
. rename _sim_10 ul1h
. rename _sim_11 ul2h
. rename _sim_12 ul3h
. rename _sim_13 RMSE1h

. gen absbias1 = 100 * (beta1h - 30) / 30
. gen absbias2 = 100 * (beta2h - 40) / 40
. gen absbias3 = 100 * (beta3h - 50) / 50

. summarize absbias1 absbias2 absbias3

    Variable |        Obs        Mean    Std. dev.       Min         Max
-------------+---------------------------------------------------------
    absbias1 |      1,000    9.168598    .6428228  -3.519891    18.34479
    absbias2 |      1,000    1.295115    .1459283  -1.842213    2.990179
    absbias3 |      1,000    17.66324    2.487144  -43.32578    17.79318

. gen bias1 = 100 * (beta1h - 30) / se1h
. gen bias2 = 100 * (beta2h - 40) / se2h
. gen bias3 = 100 * (beta3h - 50) / se3h

. summarize bias1 bias2 bias3

    Variable |        Obs        Mean    Std. dev.       Min         Max
-------------+---------------------------------------------------------
       bias1 |      1,000    50.80001    3.512375  -18.38705    99.46952
       bias2 |      1,000    44.82236    4.778728  -61.86059    82.64523
       bias3 |      1,000    74.94746    10.64833  -193.7754    75.50605

. summarize RMSE1h

    Variable |        Obs        Mean    Std. dev.       Min         Max
-------------+---------------------------------------------------------
      RMSE1h |      1,000    1046.466    4.189804   1046.222    1129.628

. gen covered1 = (30 >= ll1h) & (30 <= ul1h)
. gen covered2 = (40 >= ll2h) & (40 <= ul2h)
. gen covered3 = (50 >= ll3h) & (50 <= ul3h)

. summarize covered1 covered2 covered3

    Variable |        Obs        Mean    Std. dev.       Min         Max
-------------+---------------------------------------------------------
    covered1 |      1,000           1           0          1           1
    covered2 |      1,000           1           0          1           1
    covered3 |      1,000           1           0          1           1

. gen length1 = ul1h-ll1h
. gen length2 = ul2h-ll2h
. gen length3 = ul3h-ll3h

. summarize length1 length2 length3

    Variable |        Obs        Mean    Std. dev.       Min         Max
-------------+---------------------------------------------------------
     length1 |      1,000    21.22215    .1066926   18.36983    22.51254
     length2 |      1,000    4.530632    .0384101   4.528923    5.673164
     length3 |      1,000    46.19743    .2374677   43.82318    51.42883

. save miaux1, replace
file miaux1.dta saved
%%stata
simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upper3 RMSE1i, reps(1000) seed(1234): miaux100
ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_10 _sim_11 _sim_12 _sim_13
rename _sim_1 beta1i
rename _sim_2 se1i
rename _sim_3 beta2i
rename _sim_4 se2i
rename _sim_5 beta3i
rename _sim_6 se3i
rename _sim_7 ll1i
rename _sim_8 ll2i
rename _sim_9 ll3i
rename _sim_10 ul1i
rename _sim_11 ul2i
rename _sim_12 ul3i
rename _sim_13 RMSE1i
* Percent bias of each coefficient relative to its true value (30, 40, 50)
gen absbias1 = 100 * (beta1i - 30) / 30
gen absbias2 = 100 * (beta2i - 40) / 40
gen absbias3 = 100 * (beta3i - 50) / 50
summarize absbias1 absbias2 absbias3

* Bias standardised by the estimated standard error
gen bias1 = 100 * (beta1i - 30) / se1i
gen bias2 = 100 * (beta2i - 40) / se2i
gen bias3 = 100 * (beta3i - 50) / se3i
summarize bias1 bias2 bias3

summarize RMSE1i

* Coverage: does each 95% interval contain the true value?
gen covered1 = (30 >= ll1i) & (30 <= ul1i)
gen covered2 = (40 >= ll2i) & (40 <= ul2i)
gen covered3 = (50 >= ll3i) & (50 <= ul3i)
summarize covered1 covered2 covered3

* Width of each 95% confidence interval
gen length1 = ul1i - ll1i
gen length2 = ul2i - ll2i
gen length3 = ul3i - ll3i
summarize length1 length2 length3
save miaux100, replace
. simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upper3 RMSE1i, reps(1000) seed(1234): miaux100

      Command: miaux100
       _sim_1: beta1
       _sim_2: se1
       _sim_3: beta2
       _sim_4: se2
       _sim_5: beta3
       _sim_6: se3
       _sim_7: lower1
       _sim_8: lower2
       _sim_9: lower3
      _sim_10: upper1
      _sim_11: upper2
      _sim_12: upper3
      _sim_13: RMSE1i

Simulations (1,000): .........10.........20 ... .........990.........1,000 done

. ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_10 _sim_11 _sim_12 _sim_13

    Variable |        Obs        Mean    Std. err.    [95% conf. interval]
-------------+---------------------------------------------------------------
      _sim_1 |      1,000    31.44767    .1693166    31.11541    31.77992
      _sim_2 |      1,000    5.056805     .007465    5.042156    5.071454
      _sim_3 |      1,000      40.028     .035454    39.95843    40.09757
      _sim_4 |      1,000    1.339959    .0025793    1.334897     1.34502
      _sim_5 |      1,000    55.41926    .3468092     54.7387    56.09981
-------------+---------------------------------------------------------------
      _sim_6 |      1,000    12.70764     .013953    12.68026    12.73502
      _sim_7 |      1,000    21.53633    .1761621    21.19064    21.88202
      _sim_8 |      1,000    37.40168    .0313884    37.34009    37.46327
      _sim_9 |      1,000    30.51228    .3620038    29.80191    31.22266
     _sim_10 |      1,000      41.359     .163497    41.03817    41.67984
-------------+---------------------------------------------------------------
     _sim_11 |      1,000    42.65432    .0397475    42.57632    42.73232
     _sim_12 |      1,000    80.32623    .3331702    79.67243    80.98002
     _sim_13 |      1,000    1099.014    1.299717    1096.464    1101.565

. rename _sim_1 beta1i
. rename _sim_2 se1i
. rename _sim_3 beta2i
. rename _sim_4 se2i
. rename _sim_5 beta3i
. rename _sim_6 se3i
. rename _sim_7 ll1i
. rename _sim_8 ll2i
. rename _sim_9 ll3i
. rename _sim_10 ul1i
. rename _sim_11 ul2i
. rename _sim_12 ul3i
. rename _sim_13 RMSE1i

. gen absbias1 = 100 * (beta1i - 30) / 30
. gen absbias2 = 100 * (beta2i - 40) / 40
. gen absbias3 = 100 * (beta3i - 50) / 50

. summarize absbias1 absbias2 absbias3

    Variable |        Obs        Mean    Std. dev.       Min         Max
-------------+---------------------------------------------------------
    absbias1 |      1,000    4.825554    17.84753  -18.00061    29.55877
    absbias2 |      1,000    .0699968    2.802888  -5.157242    3.376732
    absbias3 |      1,000    10.83851    21.93414  -21.32155    51.01494

. gen bias1 = 100 * (beta1i - 30) / se1i
. gen bias2 = 100 * (beta2i - 40) / se2i
. gen bias3 = 100 * (beta3i - 50) / se3i

. summarize bias1 bias2 bias3

    Variable |        Obs        Mean    Std. dev.       Min         Max
-------------+---------------------------------------------------------
       bias1 |      1,000    30.85355    105.8439  -103.3199    182.0997
       bias2 |      1,000   -2.148084    86.41637  -166.1422    105.7601
       bias3 |      1,000    44.27564    88.90823   -83.6627     208.403

. summarize RMSE1i

    Variable |        Obs        Mean    Std. dev.       Min         Max
-------------+---------------------------------------------------------
      RMSE1i |      1,000    1099.014    41.10065   1022.184    1145.602

. gen covered1 = (30 >= ll1i) & (30 <= ul1i)
. gen covered2 = (40 >= ll2i) & (40 <= ul2i)
. gen covered3 = (50 >= ll3i) & (50 <= ul3i)

. summarize covered1 covered2 covered3

    Variable |        Obs        Mean    Std. dev.       Min         Max
-------------+---------------------------------------------------------
    covered1 |      1,000           1           0          1           1
    covered2 |      1,000           1           0          1           1
    covered3 |      1,000        .801    .3994478          0           1

. gen length1 = ul1i-ll1i
. gen length2 = ul2i-ll2i
. gen length3 = ul3i-ll3i

. summarize length1 length2 length3

    Variable |        Obs        Mean    Std. dev.       Min         Max
-------------+---------------------------------------------------------
     length1 |      1,000    19.82268    .9253684    18.3557    21.05093
     length2 |      1,000    5.252637    .3197271   4.867249     5.71347
     length3 |      1,000    49.81394    1.729626   46.15965    52.64177

. save miaux100, replace
file miaux100.dta saved
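The headline comparison across these runs can be recovered from the reported mean point estimates alone. As a quick Python check (the `beta1` means are copied from the `ci mean` tables in the logs above, against the true value of 30), the mean percent-bias figures reported under `absbias1` follow directly:

```python
# Mean beta1 point estimates from the three simulation logs (truth = 30)
true_beta1 = 30
mean_beta1 = {"minoaux1": 23.26113, "miaux1": 32.75058, "miaux100": 31.44767}

# Percent bias of each run's mean estimate relative to the true coefficient
pct_bias = {k: 100 * (v - true_beta1) / true_beta1 for k, v in mean_beta1.items()}

# Print runs from least to most biased (in absolute terms)
for model, bias in sorted(pct_bias.items(), key=lambda kv: abs(kv[1])):
    print(f"{model:10s} {bias:+7.2f}%")
```

Because percent bias is linear in the estimate, the percent bias of the mean equals the mean of the per-replicate `absbias1` values, so these figures match the `summarize absbias1` rows (up to rounding of the reported means).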