A Simulation Study Comparing Handling Missing Data Strategies¶

Scott Oatley (s2265605@ed.ac.uk)

Abstract¶

Missing data is a threat to the accurate reporting of substantive results within data analysis. While handling missing data strategies are widely available, many studies fail to account for missingness in their analysis. Those who do engage in handling missing data analysis sometimes engage in less than-gold-standard approaches that leave their substantive interpretation open toward methodological and statistical critique. These gold standard measures; multiple imputation (MI) and full information maximum likelihood (FIML), are rarely compared with one another. Mostly due to discipline-based preference, there has been little work on assessing if using one of these methods or the other would produce less biased estimates. This paper seeks to establish the real impact of implementing handling missing data methods upon statistical analysis and secondly, provide a direct comparison between multiple imputation and full information maximum likelihood methods of handling missing data. A simulation is performed on a variety of handling missing data strategies using a set seed for reproducibility. This simulation of methods is repeated 1000 times, and the 95% confidence intervals of the betas and standard errors of each model are presented to demonstrate the standard each handling missing data method provides the researcher. The results of this analysis confirm that under a missing at-random assumption, methods such as listwise deletion are not at all appropriate in handling missing data. Furthermore, naive methods are analysed, such as single-use imputation and found to be equally unattractive. Finally, results show that multiple imputations appear to provide the most accurate reporting of original ‘non-missing’ models compared to full information maximum likelihood techniques, though the difference is not large enough to rule it out as a viable ‘gold standard’. A discussion of statistical and time-based efficiency is also provided.

Key Words¶

MCAR, MAR, MNAR, Multiple Imputation, FIML

Justification For Notebook¶

Providing a literate workflow of any scientific investigations should be a baseline requiste of empirical research. Publishing a Jupyter Notebook allows individuals to not only reproduce the empirical work held within, it also provides a robust justification for each step in the research process. This process promotes open-science practices and encourages others to do the same.

Encouraging others to openly look at ones work invites critique. These critiques are welcome. The hope of using Notebooks is to have a much more organic and engaging research process whereby the research does not simply end once a paper is published. Critical comments can and will be incorporated into this Notebook to further research practices.

By providing a literate workflow where research, theory, justification, and 'footnote' analysis are all recorded in one place. This notebook invites a widespread auideance, ranging from other academics, to interested stakeholders. Whilst the language used in this Notebook is one intended for an academic auidence, the workflow presented should be possible for anyone that reads the Notebook and takes a methodical step-by-step approach to its application.

Using Stata¶

As Jupyter is a language agnostic program, the use of language used for analysis is left up to the individual researcher. For this Notebook, Stata is employed for all statistical analysis. Stata is a proprietary software and researchers MUST have access to Stata in order to undertake data analyses within the Jupyter notebook. A primary goal of extending this analysis is to undertake it in a different non-properiatory software.

Using Stata requires enabling the use of the 'stata kernel' in Jupyter. The instructions for which have been outlined in meticulous details in Connelly and Gayle (2019). For the sake of promoting a consistent introduction base for Jupyter Notebooks, and in an attempt to avoid needless confusing rhetoric at the beginning of such Notebooks, their original instructions are pasted below. I thank them for their work in pioneering best practices in the use of Notebooks for social scientific analysis:

Using Stata via Magic Cells¶

The approach for this Notebook uses Stata via magic cells. This facility can be downloaded and installed from this github repository.

At the command prompt you need to type:

pip install ipystata

In a code cell before using Stata you must type:

import ipystata

and then run the cell.

Each cell will now be a Stata code cell as long as you start your syntax with:

%%stata

For example to get a summary of the variables in Stata the cell should include the following code:

%%stata summarize

further information on using Stata via magic is available here.

Table of Contents¶

  • Introduction
  • Background
  • Missing Data Theory
  • Simulated Data
  • Preparation Programs
  • Descriptive Statistics
  • Missingness Models
  • Discussion of Results
  • Conclusions
  • Notes
  • Supplementary Materials
  • References

Introduction¶

Missing Data pervades social scientific research. Without appropriately accounting for missing data in reserach agendas, this missingness can have stark consequences.

Background¶

Missing data is a term that describes an instance whereby either an item (variable) or unit (person/entire observation) is missing from the sample. Missing data is common within social science research and even more ubiquitous within a longitudinal context where wave attrition and non-response are a rule rather than exception. The primary concern surrounding missing data is that dependent upon the type of missingness, there is potential to affect inferences made by the analysis of longitudinal studies (Hawkes and Plewis, 2006: 479; Silverwood et al., 2021). The various factors that account for sample attrition in the datasets presented thus far have the potential to present real issues as they relate to comprehensive data analysis.

Missingness attributed to death or emigration are considered ‘natural’ from the original sample. Those, however, that either refuse continued survey participation or complete surveys partially may present problems of biased estimates if the missingness is not appropriately discussed. Put simply, if there is an identified pattern within the missing data present within a given analysis this has the potential to influence results and alter substantive conclusions had this missingness not occurred – or been appropriately accounted for.

When a unit (case/observation) provides information on every possible item (variable) within a survey, this is a complete record. When information on certain items within a survey are missing, this is then called an incomplete record. When only part of all possible items is missing from the unit this is called item non-response. When all possible items are missing this is called unit non-response. Unit non-response is a natural consequence of a longitudinal survey design. The statistical techniques discussed henceforth will focus on item non-response because it is far more common in practice than the alternative.

Missing Data Theory¶

The investigation of missingness within data, through the creation of tables and graphical summaries of the full data compared to partially observed variables can beginning to identify the mechanisms underlying the patterns of missingness. The terminology that defines these mechanisms is unfortunately rather obtuse and confusing. The follow paragraphs attempt to best explain and define the key missingness mechanisms that are used with the methodological literature surrounding missing data whilst attempting to provide a clear and concise understanding.

The first major missingness mechanism is called Missing Completely At Random (MCAR). Suppose that only one variable Y has missing data, and that another set of variables represented by the vector X, is always observed (Marsden and Wright, 2010). The data is MCAR if the probability that Y is missing does not depend on either X or Y itself. Evaluating the assumption that missingness on Y depends on some observed variable in X is straightforward. Allison (2012) uses the example of income depending on gender by testing whether the proportions of men and women who report their income differ – a logistic regression in which the dependent variable is the response indicator could be estimated, and significant coefficients would suggest a violation of the MCAR mechanism (ibid). Testing whether missingness on Y does not depend on Y itself is much more complicated. Unless we have existing linked data such as tax records in the income example, it is almost impossible to evaluate this assumption. The upside of an MCAR mechanism is that estimated coefficients will not provide biased results in the presence of data Missing Completely At Random, however their estimation may be less precise (Kang, 2013).

The second missingness mechanism is missing at random (MAR). Data on Y is considered MAR if the probability that Y is missing does not depend on Y, once we control for X. MAR allows for missingness on Y to depend on other variables so long as it does not depend on Y itself.

Finally, missing not at random (MNAR) means missingness depends on unobserved values (Silverwood et al. 2021), and that the probability that Y is missing depends on Y itself, after adjusting for X (Marsden and Wright, 2010). For example, people who have been arrested may be less likely to report their arrest status.

If data is found to be MAR or MCAR, then approaches like multiple imputation (MI), Full Information Maximum likelihood (FIML), and inverse probability weighting (IPW) are made available – the former being extensively documented with the NCDS (Hawkes and Plewis 2006). These ‘gold standard’ approaches to handling missing data have also been found to produce optimal estimates in the MNAR case but it is difficult to have confidence that any given MNAR model is correct (Marsden and Wright, 2010).

Methods to Handle Missing Data¶

The following section explores each major attempt at handling missing data. These techniques and methods are bundled into ‘standards’ to express the level of appropriate application most of these approaches fall into. While the majority are placed in ‘less than gold standard’ approaches, that does not necessarily make them devoid of utility. There are three main categories that handling missing data techniques fall into: ‘gold standard’, ‘less than gold standard’, and ‘inadequate standard’. Each method for each category will be explained.

Inadequate Standard¶

The first inadequate standard given a MAR mechanism is present is listwise deletion. Listwise deletion removes all observations from the data with a missing value in one or more of the variables included in the analysis. This is also known as Complete Records Analysis (CRA). The CRA approach is unpredictable; there is no way to know the consequences of this loss of information if data is found to be MAR (Carpenter and Kenward, 2012). When data is found to be MAR, a CRA approach is inadequate at handling missing data.

Depending on the variable (either metric or categorical) a simple approach to handling missing data would be to use a single mean or single modal imputation. This, in the example of a categorical variable, takes the mode of the value in said variable and imputes that modal value across all missing values in the data. Single imputation ignores all uncertainty and always underestimates the variance in each model. Advocates of this approach argue that whilst not perfect this approach doesn’t delete a single case and incorporates all available information into a given model. However, this method does not have any confidence in its results. There is a possibility that the estimates from this method may fall close to the true range; of course, the exact opposite is equally likely. The use of single-use imputation has been consistently and conclusively shown to perform poorly except under exceptionally special conditions (Collins, Schafer and Kam, 2001; Little and Rubin, 2019). For these reasons, single use imputation is an inadequate method to handle missing data.

Less than Gold Standard¶

Dummy variable adjustment is another method of handling missing data that falls into the ‘less than gold standard’ category. Dummy variable adjustment may appear to be in the same category of handling missing data methods as single use imputation. This is, however, not the case. Dummy variable adjustment is where all missingness at the given variable is coded to a value within the model. In the example of a binary dummy variable, all missingness is coded to either equal zero or equal one. This does have the identical appeal to the single-use imputation of deleting no cases and incorporates all information into the regression model. However, there is a substantive difference between the two techniques. For the simple model of data missing at Y variable, a dummy variable adjustment will not provide the ‘true’ estimates but if the complete records analysis is compared to a model where all missingness equals zero and another model where all missingness equals one, then the range of the estimates can be located. Whilst Jones (1996) demonstrated that dummy variable adjustment yields biased parameter estimates even when the data is MCAR, the ability to provide a range of the estimates does provide some utility to this technique. Given a MAR example where the reported estimates are a reduced form from their ‘true’ values, iff the complete case analysis and both dummy variable adjustment models present a beta coefficient that is throughout all models positive, one can present those results similar to how we ought to interpret log odds. The results would present evidence for a positive coefficient – though the exact size is unknown, some information can be gathered. For this reason, dummy variable adjustment provides some utility in certain missing data scenarios. This technique has the most utility in scenarios where missingness is so great that it begins to stretch the abilities of even gold-standard techniques. This method for handling missing data is not perfect, but it does provide utility and allows the use of data that has large amounts of missingness.

Another method that deals with missing data is the use of survey weights. Survey weights consider missingness. Inverse Probability Weighting (IPW) creates weighted copies of complete records to remove selection bias introduced by missing data. Whilst IPW is a method of dealing with missing data, alternatives such as multiple imputation are regarded as much more efficient as IPW only determines weights from incomplete cases and partially observed cases are discarded in the weighted analysis. Due to this, weighted estimates can have unacceptably high variance (Seaman et al., 2012; Seaman and White, 2013; Little, Carpenter and Lee, 2022).As such IPWs are placed in the ‘less than gold standard’ category of handling missing data approaches.

Gold Standard¶

There are two ‘gold standard’ approaches to handling missing data, Multiple Imputation (MI) and Maximum Likelihood (ML). Referring to the latter method first, there are currently three ML estimation algorithms for use when missing data is present with either an MCAR or MAR mechanism. The first is the multiple-group method, whereby a sample is divided into subgroups which each share the same pattern of missing data. A likelihood function is computed for each of the subgroups and the groupwise likelihood functions are accumulated across the entire sample and maximised. There are some practical issues of implementing this multiple-group based ML approach (Enders, 2001). The major drawback of this approach however is that it is a group level, rather than individual level ML estimation. Another ML estimation is the expectation-maximisation (EM). This estimation uses a two-step iterative procedure where missing observations are filled in or imputed and the unknown parameters are estimated using maximum likelihood missing data algorithms. The EM approach can only be used to obtain ML estimates of a mean vector and covariance matrix and as a result standard errors will be negatively biased and bootstrapping is recommended (Enders, 2001). The final ML approach discussed here is the Full Information Maximum Likelihood (FIML) estimation. It has also been called the raw maximum likelihood estimation for its likelihood function being calculated at the individual. It is also exceptionally easy to implement compared to the other estimation procedures discussed (Enders, 2001). For these reasons, going forward ML discussions of handling missing data specifically refer to the FIML approach rather than the multiple-group or EM approach.

Multiple Imputation is the second of the ‘Gold standard’ handling missing data methods. Multiple imputation generates replacement values or imputations for the missing data values and repeats this procedure over many iterations to produce a ‘semi-Bayesian’ framework for the most appropriate fit of estimates. For multiple imputation models to be compared to a complete records analysis, the former needs to be ‘‘congenial’’ (White, Royston and Wood, 2011) with the latter. Congeniality or consistency in this respect means that the same variables in the complete record analysis are identical to those included in multiple imputation. Suppose the variables between complete records analysis and multiple imputation models differ. In that case, the correct variance/covariance matrix will not be estimated, and a substantive comparison between the two will become impossible and impracticable due to a loss of statistical power (Von Hippel, 2009; Lynch and Von Hippel, 2013).

Multivariate imputation by chained equations (MICE) is a form of multiple imputation that fills in or imputes missing data within a given dataset through iterative predictive models or k imputations. This specification is required when imputing a variable that must only take on specific values, such as the categorical nature of the economic activity response variable within the current analytical model. Using MICE, each imputation k is drawn from the posterior distribution of the parameters in the given imputation model, and then the model itself is imputed (Carpenter and Kenward, 2012). To create the kth imputation, new parameters are drawn from the posterior distribution. Multiple Imputation following MICE draws from Bayesian influences on the distribution of missing data upon observed data. An essential advantage of Multiple Imputation is that it can be applied for data missing at the response variable or its covariates (Carpenter and Kenward, 2012).

Multiple imputation uses auxiliary variables – variables not included in the main model but are used when setting the data to be imputed. The auxiliary variables' main function is to improve the predictive ability of the imputation model over and above the information recovered from just using the information provided by the analytical variables in the model (Collins, Schafer and Kam, 2001). Auxiliary variables are essential when there are high levels of missingness upon a given variable (Johnson and Young, 2011; Young and Johnson, 2011). There is no strict threshold for what an auxiliary variable needs to be included within the imputation; however, some have recommended an r > 0.4 on at least one of the analytical variables within the model (Allison, 2012a). However, this is disputed (Enders, 2010). Others, such as Silverwood et al. (2021), argue that if an auxiliary variable is predictive of the outcome variable, it makes them suitable for inclusion within the imputation model. An auxiliary variable does not have the requirement that the given variable has to have complete information to be valuable – auxiliary variables can still be influential when they have missingness (Enders, 2010).

Multiple Imputation can be implemented easily and readily across software platforms, unlike FIML. Multiple imputation does, however, have some drawbacks. It can be a lengthy procedure that has the potential to induce human error due to the need to select auxiliary variables, set the correct data for imputation, and set the correct seed etc. There is also a time efficiency argument, whereby for multiple imputation, if the dataset is large, or there are large amounts of missingness, then the time to impute the model of interest can take a large amount of time. MI is an attractive method because it is practical and widely applicable (Carpenter and Kenward, 2012).

Whilst original literature on missing data and MI typically referred to large datasets with marginal levels of missingness, contemporary studies and simulations have increasingly stretched and stress-tested the limits of MI (Hardt et al., 2013). A simulation by Hardt et al (2013) demonstrated that large amounts of missingness can be present within a model without breaking down MI or FIML mechanisms (ibid). Whilst their simulation stops at n=200 where 40 per cent missingness is acceptable, there general argument is that the greater the n the larger the missingness can be within a model without breaking MI or FIML so long as the models themselves are appropriately specified. Imputation-based models are consistently found to outperform a CRA in both absolute bias and Root Mean Squared Error (RMSE) with increasing levels of missingness (Hyuk Lee and Huber Jr., 2021). The most extreme case from Madely-Down et al (2019) demonstrates that so long as the imputation model is properly specified and data are MAR then unbiased results can be obtained even with up to 90 per cent missingness. An imputation model compared to a CRA can achieve a reduction in 99.97 per cent bias when missingness is at 90 per cent (ibid).

Comparisons of Gold Standard Methods¶

Paul Allison, in a series of articles (Allison, 2012a, 2012b, 2015), argues that FIML is 1) more straightforward to implement, 2) FIML has no incompatibility between an imputation model and an analysis model, 3) FIML produces a deterministic result rather than a different result every time, and 4) FIML is asymptomatically efficient.

Firstly, MI does have greater variability than FIML, but that increased choice in model selection is not necessarily a negative so long as proper procedures are followed. In fact, greater variability of choice has the potential to make MI a more attractive candidate for dealing with missingness over FIML. Secondly, MI models only run into an incompatibility problem when the MI model is inconsistent with the CRA model – something that, with appropriate testing and open science practices detailing the model construction, should not happen. Thirdly, MI models are deterministic, provided the same seed is used each time you run the imputation. The only time this would not be plausible would be when open science practices were not followed, and fellow researchers could not access the MI seed. Finally, the argument that FIML is asymptotically efficient only holds to a certain extent. MI models reach asymptotic efficiency by running an infinite number of imputations – though you can reach near full efficiency with a relatively small number of imputations, Allison (2015) argues, around 10. Overall, whilst FIML does offer some advantages, there is nothing so considerable theoretically as to desire FIML over MI on the condition that they both perform at near identical rates. So long as open science procedures are upheld, most major critiques of MI are dealt with.

There are very few comparisons between FIML and MI approaches to missing data, making it hard to assess whether one method is more efficient at dealing with missing data than the other. Before conducting any missing data methods on the NCDS data, a simulation is performed to assess the strengths of a range of handling missing data approaches, with the intent to directly compare FIML and MI methods.

Methods¶

Both FIML and MI practices require data to either be MCAR or MAR. A FIML approach can be achieved in Stata by using the ‘sem’ command – using structural equation modelling and using the ‘mlvm’ estimation option (mlvm means FIML). MI can also be achieved in Stata using the ‘mi’ commands using a semi-Bayesian approach that includes auxiliary variables. There are also other handling missing data methods available such as: single mean imputation and coding all data=0 OR =1. These practices are typically considered ‘bad’ ways of handling missing data but are included in the simulation as a comparison to FIML and MI methods.

The full simulation takes the form of 1000 iterations of a random normal distribution of 1000 observations around a normally distributed metric dependent variable and three independent dummy variables that share an identical distribution. Each independent variable has the same level of correlation associated with the dependent variable. This is to allow for a point of comparison when MAR missingness is injected into one variable and not the others to see what happens when handling missing data practices are implemented. Each model is isolated in its own program whereby a simulation is called using the programs function with an identical seed set to all models. The 95 per cent confidence intervals of the mean betas and standard errors for all variables within each model are gathered and reported.

This dependent variable and three independent variables form a basic OLS linear regression model that is called the ‘Complete Records God Model’. Named as such because no model in a normal social scientific framework would have all observations not missing and have prior knowledge of what the ‘complete’ model would have looked like if their model did have some element of missingness. In addition to this ‘God’ model the same regression is computed using the structural equation modelling framework in Stata to confirm the results would be identical. The next model is where missingness is introduced. Missingness is injected into independent variable three. This missingness accounts for 49 per cent missingness in the model. This amount of missingness is right on the cusp of what contemporary literature on multiple imputation and FIML allow. Dummy variable adjustment is produced whereby all missingness is coded as =0 and another is produced =1. Next a single use modal imputation is used – the same framework as a single use mean imputation but because the variable is categorical mode is used over mean. Finally, an FIML model under the SEM framework is produced alongside three different forms of Multiple Imputation models. The first is an MI with 10 imputations and no auxiliary variables, the second is an MI with 10 imputations and auxiliary variables, and finally the last model is an MI with 100 imputations and auxiliary variables.

Preparation Programs¶

In [1]:
import stata_setup

stata_setup.config("C:\Program Files\Stata18", "se")
  ___  ____  ____  ____  ____ ®
 /__    /   ____/   /   ____/      18.0
___/   /   /___/   /   /___/       SE—Standard Edition

 Statistics and Data Science       Copyright 1985-2023 StataCorp LLC
                                   StataCorp
                                   4905 Lakeway Drive
                                   College Station, Texas 77845 USA
                                   800-STATA-PC        https://www.stata.com
                                   979-696-4600        stata@stata.com

Stata license: Unlimited-user network, expiring 14 Sep 2025
Serial number: 401809305318
  Licensed to: Scott Oatley
               University of Edinburgh

Notes:
      1. Unicode is supported; see help unicode_advice.
      2. Maximum number of variables is set to 5,000 but can be increased;
          see help set_maxvar.

Prior to any simulation/analysis, some basic set up is required to save datasets and path folders.

In [2]:
%%stata

global workingdata"G:\Stata data and do\do files\missing data\workingdata"

cd "$workingdata"
. 
. global workingdata"G:\Stata data and do\do files\missing data\workingdata"

. 
. cd "$workingdata"
G:\Stata data and do\do files\missing data\workingdata

. 

The first program, called "regressbase" simulates a dataset with 1000 observations. Four variables are generated to simulate a standard Ordinary Least Squares linear regression model with one y dependent variable and three x* independent variables. All variables, for ease of interpretation are generated as metric continuous variables. All variables are drawn from a normal distribution and their means and standard deviations are set at different amounts. This is to simulate the variety of independent variables that are typically used in a regression model. The dependent variable y is generated through a relationship to the independent variables to guarantee a significant relationship. The generation of y also sees the addition of an error term that is normally distributed.

This simulated dataset forms the base of all subsequent programs.

For the "regressbase" program, following the generation of variables, an OLS regression is performed and the betas and standard errors of each independent variable is stored as a newly generated variable.

In [3]:
%%stata

capture program drop regress1

program regress1, rclass
drop _all
    
* Set number of observations
set obs 1000
    
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)

* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)

* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)

* Generate metric dependent variable
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)

regress y x1 x2 x3

predict yhat       
gen resid = y - yhat
gen sqr_resid = resid^2
summarize sqr_resid
gen RMSE1a = sqrt(r(sum) / 997)

gen beta1 = _b[x1]
gen se1 = _se[x1]
gen lower1 = beta1 - 1.96 * se1
gen upper1 = beta1 + 1.96 * se1

gen beta2 = _b[x2]
gen se2 = _se[x2]
gen lower2 = beta2 - 1.96 * se2
gen upper2 = beta2 + 1.96 * se2

gen beta3 = _b[x3]
gen se3 = _se[x3]
gen lower3 = beta3 - 1.96 * se3
gen upper3 = beta3 + 1.96 * se3


end
. 
. capture program drop regress1

. 
. program regress1, rclass
  1. drop _all
  2.     
. * Set number of observations
. set obs 1000
  3.     
. * Generate metric independent variable #1
. drawnorm x1, n(1000) means(40) sds(12)
  4. 
. * Generate metric independent variable #2
. drawnorm x2, n(1000) means(200) sds(50)
  5. 
. * Generate metric independent variable #3
. drawnorm x3, n(1000) means(150) sds(5)
  6. 
. * Generate metric dependent variable
. gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
  7. 
. regress y x1 x2 x3
  8. 
. predict yhat       
  9. gen resid = y - yhat
 10. gen sqr_resid = resid^2
 11. summarize sqr_resid
 12. gen RMSE1a = sqrt(r(sum) / 997)
 13. 
. gen beta1 = _b[x1]
 14. gen se1 = _se[x1]
 15. gen lower1 = beta1 - 1.96 * se1
 16. gen upper1 = beta1 + 1.96 * se1
 17. 
. gen beta2 = _b[x2]
 18. gen se2 = _se[x2]
 19. gen lower2 = beta2 - 1.96 * se2
 20. gen upper2 = beta2 + 1.96 * se2
 21. 
. gen beta3 = _b[x3]
 22. gen se3 = _se[x3]
 23. gen lower3 = beta3 - 1.96 * se3
 24. gen upper3 = beta3 + 1.96 * se3
 25. 
. 
. end

. 

The second program, "sem1" is nearly identical to the first program. The difference lies in the regression model. Whereas the "regressbase" program uses OLS linear regression, the "sem1" program uses structural equation modeling with maximum likelihood with missing values estimation. These 'base' models should provide identical statistics but both are provided to assess and potential disparities between the two models.

In [4]:
%%stata


capture program drop sem1

program sem1, rclass
 drop _all
    
* Set number of observations
set obs 1000
    
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)

* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)

* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)

* Generate metric dependent variable
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)

sem (y <- x1 x2 x3), method(mlmv)

predict yhat       
gen resid = y - yhat
gen sqr_resid = resid^2
summarize sqr_resid
gen RMSE1b = sqrt(r(sum) / 997)

gen beta1 = _b[x1]
gen se1 = _se[x1]
gen lower1 = beta1 - 1.96 * se1
gen upper1 = beta1 + 1.96 * se1

gen beta2 = _b[x2]
gen se2 = _se[x2]
gen lower2 = beta2 - 1.96 * se2
gen upper2 = beta2 + 1.96 * se2

gen beta3 = _b[x3]
gen se3 = _se[x3]
gen lower3 = beta3 - 1.96 * se3
gen upper3 = beta3 + 1.96 * se3

end
. 
. 
. capture program drop sem1

. 
. program sem1, rclass
  1.  drop _all
  2.     
. * Set number of observations
. set obs 1000
  3.     
. * Generate metric independent variable #1
. drawnorm x1, n(1000) means(40) sds(12)
  4. 
. * Generate metric independent variable #2
. drawnorm x2, n(1000) means(200) sds(50)
  5. 
. * Generate metric independent variable #3
. drawnorm x3, n(1000) means(150) sds(5)
  6. 
. * Generate metric dependent variable
. gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
  7. 
. sem (y <- x1 x2 x3), method(mlmv)
  8. 
. predict yhat       
  9. gen resid = y - yhat
 10. gen sqr_resid = resid^2
 11. summarize sqr_resid
 12. gen RMSE1b = sqrt(r(sum) / 997)
 13. 
. gen beta1 = _b[x1]
 14. gen se1 = _se[x1]
 15. gen lower1 = beta1 - 1.96 * se1
 16. gen upper1 = beta1 + 1.96 * se1
 17. 
. gen beta2 = _b[x2]
 18. gen se2 = _se[x2]
 19. gen lower2 = beta2 - 1.96 * se2
 20. gen upper2 = beta2 + 1.96 * se2
 21. 
. gen beta3 = _b[x3]
 22. gen se3 = _se[x3]
 23. gen lower3 = beta3 - 1.96 * se3
 24. gen upper3 = beta3 + 1.96 * se3
 25. 
. end

. 

The third program, "missingmcar" generates a Missing Completely At Random (MCAR) mechanism and injects this into the OLS linear regression model. The MCAR mechanism is generated by first generating a new mcar varibale that is based on a binomial distribution where 50 per cent of values are coded as 1 and 50 per cent are coded as 0. This is then used to replace values in x2 equal to missing if values in mcar equal zero. This simulates a MCAR mechanism of missing completely at random. Under this mechanism, missingness even as much as 50 per cent should not alter the interpretation of the model.

In [5]:
%%stata 

capture program drop missingmcar

program missingmcar, rclass
 drop _all
    
* Set number of observations
set obs 1000
    
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)

* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)

* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)

* Generate metric dependent variable
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)

* Generate MCAR 
gen rmcar = rbinomial(1, 0.5)  // MCAR: 50% chance of missingness (binary random)
replace x2 = . if rmcar == 0  // Set x to missing where rmcar == 0

regress y x1 x2 x3

predict yhat       
gen resid = y - yhat
gen sqr_resid = resid^2
summarize sqr_resid
gen RMSE1c = sqrt(r(sum) / 485)

gen beta1 = _b[x1]
gen se1 = _se[x1]
gen lower1 = beta1 - 1.96 * se1
gen upper1 = beta1 + 1.96 * se1

gen beta2 = _b[x2]
gen se2 = _se[x2]
gen lower2 = beta2 - 1.96 * se2
gen upper2 = beta2 + 1.96 * se2

gen beta3 = _b[x3]
gen se3 = _se[x3]
gen lower3 = beta3 - 1.96 * se3
gen upper3 = beta3 + 1.96 * se3


end
. 
. capture program drop missingmcar

. 
. program missingmcar, rclass
  1.  drop _all
  2.     
. * Set number of observations
. set obs 1000
  3.     
. * Generate metric independent variable #1
. drawnorm x1, n(1000) means(40) sds(12)
  4. 
. * Generate metric independent variable #2
. drawnorm x2, n(1000) means(200) sds(50)
  5. 
. * Generate metric independent variable #3
. drawnorm x3, n(1000) means(150) sds(5)
  6. 
. * Generate metric dependent variable
. gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
  7. 
. * Generate MCAR 
. gen rmcar = rbinomial(1, 0.5)  // MCAR: 50% chance of missingness (binary ran
> dom)
  8. replace x2 = . if rmcar == 0  // Set x to missing where rmcar == 0
  9. 
. regress y x1 x2 x3
 10. 
. predict yhat       
 11. gen resid = y - yhat
 12. gen sqr_resid = resid^2
 13. summarize sqr_resid
 14. gen RMSE1c = sqrt(r(sum) / 485)
 15. 
. gen beta1 = _b[x1]
 16. gen se1 = _se[x1]
 17. gen lower1 = beta1 - 1.96 * se1
 18. gen upper1 = beta1 + 1.96 * se1
 19. 
. gen beta2 = _b[x2]
 20. gen se2 = _se[x2]
 21. gen lower2 = beta2 - 1.96 * se2
 22. gen upper2 = beta2 + 1.96 * se2
 23. 
. gen beta3 = _b[x3]
 24. gen se3 = _se[x3]
 25. gen lower3 = beta3 - 1.96 * se3
 26. gen upper3 = beta3 + 1.96 * se3
 27. 
. 
. end

. 

The fourth program, "missingmar" follows a similar pattern to the "missingmcar" program but instead of injecting a mcar mechanism, it injects a Missing At Random (MAR) mechanism. The MAR mechanism is generated by first generating a new probability of MAR variable based on the logistic curve of y-2105.2. (The -2105.2 is used to center y in order to ascertain a 50 per cent missingness mechanism). The values of x2 are made equal to missing if the probability of MAR variable is equal to zero.

In [6]:
%%stata

capture program drop missingmar

program missingmar, rclass
 drop _all
    
* Set number of observations
set obs 1000
    
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)

* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)

* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)

* Generate metric dependent variable
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)


* Generate MAR
gen prob_mar = logistic(y-21791)
gen rmar = runiform() < prob_mar     
replace x2 = . if rmar == 1   
regress y x1 x2 x3

predict yhat       
gen resid = y - yhat
gen sqr_resid = resid^2
summarize sqr_resid
gen RMSE1d = sqrt(r(sum) / 515)

gen beta1 = _b[x1]
gen se1 = _se[x1]
gen lower1 = beta1 - 1.96 * se1
gen upper1 = beta1 + 1.96 * se1

gen beta2 = _b[x2]
gen se2 = _se[x2]
gen lower2 = beta2 - 1.96 * se2
gen upper2 = beta2 + 1.96 * se2

gen beta3 = _b[x3]
gen se3 = _se[x3]
gen lower3 = beta3 - 1.96 * se3
gen upper3 = beta3 + 1.96 * se3


end
. 
. capture program drop missingmar

. 
. program missingmar, rclass
  1.  drop _all
  2.     
. * Set number of observations
. set obs 1000
  3.     
. * Generate metric independent variable #1
. drawnorm x1, n(1000) means(40) sds(12)
  4. 
. * Generate metric independent variable #2
. drawnorm x2, n(1000) means(200) sds(50)
  5. 
. * Generate metric independent variable #3
. drawnorm x3, n(1000) means(150) sds(5)
  6. 
. * Generate metric dependent variable
. gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
  7. 
. 
. * Generate MAR
. gen prob_mar = logistic(y-21791)
  8. gen rmar = runiform() < prob_mar     
  9. replace x2 = . if rmar == 1   
 10. regress y x1 x2 x3
 11. 
. predict yhat       
 12. gen resid = y - yhat
 13. gen sqr_resid = resid^2
 14. summarize sqr_resid
 15. gen RMSE1d = sqrt(r(sum) / 515)
 16. 
. gen beta1 = _b[x1]
 17. gen se1 = _se[x1]
 18. gen lower1 = beta1 - 1.96 * se1
 19. gen upper1 = beta1 + 1.96 * se1
 20. 
. gen beta2 = _b[x2]
 21. gen se2 = _se[x2]
 22. gen lower2 = beta2 - 1.96 * se2
 23. gen upper2 = beta2 + 1.96 * se2
 24. 
. gen beta3 = _b[x3]
 25. gen se3 = _se[x3]
 26. gen lower3 = beta3 - 1.96 * se3
 27. gen upper3 = beta3 + 1.96 * se3
 28. 
. 
. end

. 

The program "missingmean" is the first handling missing data method implemented. This program performs single mean imputation across the 50 per cent missingness within the model. The mean of x2 is generated and used to replace all missing values in x2.

In [7]:
%%stata

capture program drop missingmean

program missingmean, rclass
 drop _all
    
* Set number of observations
set obs 1000
    
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)

* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)

* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)

* Generate metric dependent variable
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)


* Generate MAR
gen prob_mar = logistic(y-21791)
gen rmar = runiform() < prob_mar     
replace x2 = . if rmar == 1   

summarize x2, meanonly
replace x2 = r(mean) if missing(x2)

regress y x1 x2 x3

predict yhat       
gen resid = y - yhat
gen sqr_resid = resid^2
summarize sqr_resid
gen RMSE1e = sqrt(r(sum) / 997)


gen beta1 = _b[x1]
gen se1 = _se[x1]
gen lower1 = beta1 - 1.96 * se1
gen upper1 = beta1 + 1.96 * se1

gen beta2 = _b[x2]
gen se2 = _se[x2]
gen lower2 = beta2 - 1.96 * se2
gen upper2 = beta2 + 1.96 * se2

gen beta3 = _b[x3]
gen se3 = _se[x3]
gen lower3 = beta3 - 1.96 * se3
gen upper3 = beta3 + 1.96 * se3


end
. 
. capture program drop missingmean

. 
. program missingmean, rclass
  1.  drop _all
  2.     
. * Set number of observations
. set obs 1000
  3.     
. * Generate metric independent variable #1
. drawnorm x1, n(1000) means(40) sds(12)
  4. 
. * Generate metric independent variable #2
. drawnorm x2, n(1000) means(200) sds(50)
  5. 
. * Generate metric independent variable #3
. drawnorm x3, n(1000) means(150) sds(5)
  6. 
. * Generate metric dependent variable
. gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
  7. 
. 
. * Generate MAR
. gen prob_mar = logistic(y-21791)
  8. gen rmar = runiform() < prob_mar     
  9. replace x2 = . if rmar == 1   
 10. 
. summarize x2, meanonly
 11. replace x2 = r(mean) if missing(x2)
 12. 
. regress y x1 x2 x3
 13. 
. predict yhat       
 14. gen resid = y - yhat
 15. gen sqr_resid = resid^2
 16. summarize sqr_resid
 17. gen RMSE1e = sqrt(r(sum) / 997)
 18. 
. 
. gen beta1 = _b[x1]
 19. gen se1 = _se[x1]
 20. gen lower1 = beta1 - 1.96 * se1
 21. gen upper1 = beta1 + 1.96 * se1
 22. 
. gen beta2 = _b[x2]
 23. gen se2 = _se[x2]
 24. gen lower2 = beta2 - 1.96 * se2
 25. gen upper2 = beta2 + 1.96 * se2
 26. 
. gen beta3 = _b[x3]
 27. gen se3 = _se[x3]
 28. gen lower3 = beta3 - 1.96 * se3
 29. gen upper3 = beta3 + 1.96 * se3
 30. 
. 
. end

. 

The program "semmlvm" is the 'gold standard' Full Information Maximum Likelihood attempt to recover unbiased estimates following an injection of a MAR mechanism. FIML is performed using the sem command with the mlvm estimation.

In [8]:
%%stata

capture program drop semmlmv

program semmlmv, rclass
    * drop all variables to create an empty dataset
    drop _all
    
* Set number of observations
set obs 1000
    
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)

* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)

* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)

* Generate metric dependent variable
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)


* Generate MAR
gen prob_mar = logistic(y-21791)
gen rmar = runiform() < prob_mar     
replace x2 = . if rmar == 1   
    
* Model estimation
sem (y <- x1 x2 x3), method(mlmv)

predict yhat       
gen resid = y - yhat
gen sqr_resid = resid^2
summarize sqr_resid
gen RMSE1f = sqrt(r(sum) / 997)

gen beta1 = _b[x1]
gen se1 = _se[x1]
gen lower1 = beta1 - 1.96 * se1
gen upper1 = beta1 + 1.96 * se1

gen beta2 = _b[x2]
gen se2 = _se[x2]
gen lower2 = beta2 - 1.96 * se2
gen upper2 = beta2 + 1.96 * se2

gen beta3 = _b[x3]
gen se3 = _se[x3]
gen lower3 = beta3 - 1.96 * se3
gen upper3 = beta3 + 1.96 * se3
    

end
. 
. capture program drop semmlmv

. 
. program semmlmv, rclass
  1.     * drop all variables to create an empty dataset
.     drop _all
  2.     
. * Set number of observations
. set obs 1000
  3.     
. * Generate metric independent variable #1
. drawnorm x1, n(1000) means(40) sds(12)
  4. 
. * Generate metric independent variable #2
. drawnorm x2, n(1000) means(200) sds(50)
  5. 
. * Generate metric independent variable #3
. drawnorm x3, n(1000) means(150) sds(5)
  6. 
. * Generate metric dependent variable
. gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
  7. 
. 
. * Generate MAR
. gen prob_mar = logistic(y-21791)
  8. gen rmar = runiform() < prob_mar     
  9. replace x2 = . if rmar == 1   
 10.     
. * Model estimation
. sem (y <- x1 x2 x3), method(mlmv)
 11. 
. predict yhat       
 12. gen resid = y - yhat
 13. gen sqr_resid = resid^2
 14. summarize sqr_resid
 15. gen RMSE1f = sqrt(r(sum) / 997)
 16. 
. gen beta1 = _b[x1]
 17. gen se1 = _se[x1]
 18. gen lower1 = beta1 - 1.96 * se1
 19. gen upper1 = beta1 + 1.96 * se1
 20. 
. gen beta2 = _b[x2]
 21. gen se2 = _se[x2]
 22. gen lower2 = beta2 - 1.96 * se2
 23. gen upper2 = beta2 + 1.96 * se2
 24. 
. gen beta3 = _b[x3]
 25. gen se3 = _se[x3]
 26. gen lower3 = beta3 - 1.96 * se3
 27. gen upper3 = beta3 + 1.96 * se3
 28.     
. 
. end

. 

The program "minoaux1" performs multiple imputation with 10 iterations and 10 burnins with zero auxiliary variables.

In [9]:
%%stata

capture program drop minoaux1

program minoaux1, rclass
    * drop all variables to create an empty dataset
    drop _all
    
* Set number of observations
set obs 1000
    
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)

* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)

* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)

* Generate metric dependent variable
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)


* Generate MAR
gen prob_mar = logistic(y-21791)
gen rmar = runiform() < prob_mar     
replace x2 = . if rmar == 1   

* Model estimation

mi set wide

mi register imputed y x1 x2 x3

tab _mi_miss


mi impute chained ///
///
(regress) y x1 x2 x3 ///
, rseed(12345) dots force add(10) burnin(10) 


mi estimate, post dots: regress y x1 x2 x3

predict yhat       
gen resid = y - yhat
gen sqr_resid = resid^2
summarize sqr_resid
gen RMSE1g = sqrt(r(sum) / 997)    

gen beta1 = _b[x1]
gen se1 = _se[x1]
gen lower1 = beta1 - 1.96 * se1
gen upper1 = beta1 + 1.96 * se1

gen beta2 = _b[x2]
gen se2 = _se[x2]
gen lower2 = beta2 - 1.96 * se2
gen upper2 = beta2 + 1.96 * se2

gen beta3 = _b[x3]
gen se3 = _se[x3]
gen lower3 = beta3 - 1.96 * se3
gen upper3 = beta3 + 1.96 * se3
    

end
. 
. capture program drop minoaux1

. 
. program minoaux1, rclass
  1.     * drop all variables to create an empty dataset
.     drop _all
  2.     
. * Set number of observations
. set obs 1000
  3.     
. * Generate metric independent variable #1
. drawnorm x1, n(1000) means(40) sds(12)
  4. 
. * Generate metric independent variable #2
. drawnorm x2, n(1000) means(200) sds(50)
  5. 
. * Generate metric independent variable #3
. drawnorm x3, n(1000) means(150) sds(5)
  6. 
. * Generate metric dependent variable
. gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
  7. 
. 
. * Generate MAR
. gen prob_mar = logistic(y-21791)
  8. gen rmar = runiform() < prob_mar     
  9. replace x2 = . if rmar == 1   
 10. 
. * Model estimation
. 
. mi set wide
 11. 
. mi register imputed y x1 x2 x3
 12. 
. tab _mi_miss
 13. 
. 
. mi impute chained ///
> ///
> (regress) y x1 x2 x3 ///
> , rseed(12345) dots force add(10) burnin(10) 
 14. 
. 
. mi estimate, post dots: regress y x1 x2 x3
 15. 
. predict yhat       
 16. gen resid = y - yhat
 17. gen sqr_resid = resid^2
 18. summarize sqr_resid
 19. gen RMSE1g = sqrt(r(sum) / 997)    
 20. 
. gen beta1 = _b[x1]
 21. gen se1 = _se[x1]
 22. gen lower1 = beta1 - 1.96 * se1
 23. gen upper1 = beta1 + 1.96 * se1
 24. 
. gen beta2 = _b[x2]
 25. gen se2 = _se[x2]
 26. gen lower2 = beta2 - 1.96 * se2
 27. gen upper2 = beta2 + 1.96 * se2
 28. 
. gen beta3 = _b[x3]
 29. gen se3 = _se[x3]
 30. gen lower3 = beta3 - 1.96 * se3
 31. gen upper3 = beta3 + 1.96 * se3
 32.     
. 
. end

. 

The program "miaux1" performs multiple imputation with 10 iterations and 10 burnins with eight auxiliary variables. Five auxiliary variables are generated that are predictive of the probability of the missingness and the missing values, three are generated that are predictive of only the missing values themselves.

In [10]:
%%stata

capture program drop miaux1


program miaux1, rclass
    * drop all variables to create an empty dataset
    drop _all
    
* Set number of observations
set obs 1000
    
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)

* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)

* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)

* Generate metric dependent variable
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)

* Generate MAR
gen prob_mar = logistic(y-21791)
gen rmar = runiform() < prob_mar     
replace x2 = . if rmar == 1   
	
	
* Generate Auxiliary Variables Predictive of Probability of Missingness and Missing Values 	
gen aux1 = 0.7*x2 + 0.3*y + rnormal(0, 10)   // Correlated with x2 and y
gen aux2 = 0.5*x2 + 0.5*y + rnormal(0, 15)   // Balanced correlation with x2 and y
gen aux3 = 0.6*x2 + 0.2*x1 + 0.2*y + rnormal(0, 12)   // Adds x1 for slight variation
gen aux4 = 0.4*x2 + 0.6*y + rnormal(0, 20)   // Heavier weight on y (missingness predictor)
gen aux5 = 0.3*x2 + 0.3*y + 0.4*x3 + rnormal(0, 18)  // Introduces correlation with x3

* Generate Auxiliary Variables Predictive of ONLY Missing Values 

gen aux6 = 0.9*x2 + rnormal(0, 10)   // Strong correlation with x2, no y dependence
gen aux7 = 0.8*x2 + 0.2*x3 + rnormal(0, 12)   // Includes x3, but no direct link to missingness
gen aux8 = 0.85*x2 + 0.1*x1 + rnormal(0, 15)  // Includes x1, but missingness-independent
	
* setting MI data up

mi set wide

mi register imputed y x1 x2 x3 aux1 aux2 aux3 aux4 aux5 aux6 aux7 aux8 


mi impute chained ///
///
(regress) y x1 x2 x3 aux1 aux2 aux3 aux4 aux5 aux6 aux7 aux8 ///
, rseed(12345) dots force add(10) burnin(10) 


* mi model

mi estimate, post dots: regress y x1 x2 x3
    
predict yhat       
gen resid = y - yhat
gen sqr_resid = resid^2
summarize sqr_resid
gen RMSE1h = sqrt(r(sum) / 997)    

gen beta1 = _b[x1]
gen se1 = _se[x1]
gen lower1 = beta1 - 1.96 * se1
gen upper1 = beta1 + 1.96 * se1

gen beta2 = _b[x2]
gen se2 = _se[x2]
gen lower2 = beta2 - 1.96 * se2
gen upper2 = beta2 + 1.96 * se2

gen beta3 = _b[x3]
gen se3 = _se[x3]
gen lower3 = beta3 - 1.96 * se3
gen upper3 = beta3 + 1.96 * se3
    

end
. 
. capture program drop miaux1

. 
. 
. program miaux1, rclass
  1.     * drop all variables to create an empty dataset
.     drop _all
  2.     
. * Set number of observations
. set obs 1000
  3.     
. * Generate metric independent variable #1
. drawnorm x1, n(1000) means(40) sds(12)
  4. 
. * Generate metric independent variable #2
. drawnorm x2, n(1000) means(200) sds(50)
  5. 
. * Generate metric independent variable #3
. drawnorm x3, n(1000) means(150) sds(5)
  6. 
. * Generate metric dependent variable
. gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
  7. 
. * Generate MAR
. gen prob_mar = logistic(y-21791)
  8. gen rmar = runiform() < prob_mar     
  9. replace x2 = . if rmar == 1   
 10.         
.         
. * Generate Auxiliary Variables Predictive of Probability of Missingness and M
> issing Values      
. gen aux1 = 0.7*x2 + 0.3*y + rnormal(0, 10)   // Correlated with x2 and y
 11. gen aux2 = 0.5*x2 + 0.5*y + rnormal(0, 15)   // Balanced correlation with 
> x2 and y
 12. gen aux3 = 0.6*x2 + 0.2*x1 + 0.2*y + rnormal(0, 12)   // Adds x1 for sligh
> t variation
 13. gen aux4 = 0.4*x2 + 0.6*y + rnormal(0, 20)   // Heavier weight on y (missi
> ngness predictor)
 14. gen aux5 = 0.3*x2 + 0.3*y + 0.4*x3 + rnormal(0, 18)  // Introduces correla
> tion with x3
 15. 
. * Generate Auxiliary Variables Predictive of ONLY Missing Values 
. 
. gen aux6 = 0.9*x2 + rnormal(0, 10)   // Strong correlation with x2, no y depe
> ndence
 16. gen aux7 = 0.8*x2 + 0.2*x3 + rnormal(0, 12)   // Includes x3, but no direc
> t link to missingness
 17. gen aux8 = 0.85*x2 + 0.1*x1 + rnormal(0, 15)  // Includes x1, but missingn
> ess-independent
 18.         
. * setting MI data up
. 
. mi set wide
 19. 
. mi register imputed y x1 x2 x3 aux1 aux2 aux3 aux4 aux5 aux6 aux7 aux8 
 20. 
. 
. mi impute chained ///
> ///
> (regress) y x1 x2 x3 aux1 aux2 aux3 aux4 aux5 aux6 aux7 aux8 ///
> , rseed(12345) dots force add(10) burnin(10) 
 21. 
. 
. * mi model
. 
. mi estimate, post dots: regress y x1 x2 x3
 22.     
. predict yhat       
 23. gen resid = y - yhat
 24. gen sqr_resid = resid^2
 25. summarize sqr_resid
 26. gen RMSE1h = sqrt(r(sum) / 997)    
 27. 
. gen beta1 = _b[x1]
 28. gen se1 = _se[x1]
 29. gen lower1 = beta1 - 1.96 * se1
 30. gen upper1 = beta1 + 1.96 * se1
 31. 
. gen beta2 = _b[x2]
 32. gen se2 = _se[x2]
 33. gen lower2 = beta2 - 1.96 * se2
 34. gen upper2 = beta2 + 1.96 * se2
 35. 
. gen beta3 = _b[x3]
 36. gen se3 = _se[x3]
 37. gen lower3 = beta3 - 1.96 * se3
 38. gen upper3 = beta3 + 1.96 * se3
 39.     
. 
. end

. 

The final program "miaux100" performs multiple imputation with 100 iterations and 20 burnin using the prior generated eight auxiliary variables.

In [11]:
%%stata

capture program drop miaux100


program miaux100, rclass
    * drop all variables to create an empty dataset
    drop _all
    
* Set number of observations
set obs 1000
    
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)

* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)

* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)

* Generate metric dependent variable
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)

* Generate MAR
gen prob_mar = logistic(y-21791)
gen rmar = runiform() < prob_mar     
replace x2 = . if rmar == 1   
	
	
* Generate Auxiliary Variables Predictive of Probability of Missingness and Missing Values 	
gen aux1 = 0.7*x2 + 0.3*y + rnormal(0, 10)   // Correlated with x2 and y
gen aux2 = 0.5*x2 + 0.5*y + rnormal(0, 15)   // Balanced correlation with x2 and y
gen aux3 = 0.6*x2 + 0.2*x1 + 0.2*y + rnormal(0, 12)   // Adds x1 for slight variation
gen aux4 = 0.4*x2 + 0.6*y + rnormal(0, 20)   // Heavier weight on y (missingness predictor)
gen aux5 = 0.3*x2 + 0.3*y + 0.4*x3 + rnormal(0, 18)  // Introduces correlation with x3

* Generate Auxiliary Variables Predictive of ONLY Missing Values 

gen aux6 = 0.9*x2 + rnormal(0, 10)   // Strong correlation with x2, no y dependence
gen aux7 = 0.8*x2 + 0.2*x3 + rnormal(0, 12)   // Includes x3, but no direct link to missingness
gen aux8 = 0.85*x2 + 0.1*x1 + rnormal(0, 15)  // Includes x1, but missingness-independent
	
* setting MI data up

mi set wide

mi register imputed y x1 x2 x3 aux1 aux2 aux3 aux4 aux5 aux6 aux7 aux8 


mi impute chained ///
///
(regress) y x1 x2 x3 aux1 aux2 aux3 aux4 aux5 aux6 aux7 aux8 ///
, rseed(12345) dots force add(100) burnin(20) 


* mi model

mi estimate, post dots: regress y x1 x2 x3
    
predict yhat       
gen resid = y - yhat
gen sqr_resid = resid^2
summarize sqr_resid
gen RMSE1i = sqrt(r(sum) / 997)    

gen beta1 = _b[x1]
gen se1 = _se[x1]
gen lower1 = beta1 - 1.96 * se1
gen upper1 = beta1 + 1.96 * se1

gen beta2 = _b[x2]
gen se2 = _se[x2]
gen lower2 = beta2 - 1.96 * se2
gen upper2 = beta2 + 1.96 * se2

gen beta3 = _b[x3]
gen se3 = _se[x3]
gen lower3 = beta3 - 1.96 * se3
gen upper3 = beta3 + 1.96 * se3   

end
. 
. capture program drop miaux100

. 
. 
. program miaux100, rclass
  1.     * drop all variables to create an empty dataset
.     drop _all
  2.     
. * Set number of observations
. set obs 1000
  3.     
. * Generate metric independent variable #1
. drawnorm x1, n(1000) means(40) sds(12)
  4. 
. * Generate metric independent variable #2
. drawnorm x2, n(1000) means(200) sds(50)
  5. 
. * Generate metric independent variable #3
. drawnorm x3, n(1000) means(150) sds(5)
  6. 
. * Generate metric dependent variable
. gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
  7. 
. * Generate MAR
. gen prob_mar = logistic(y-21791)
  8. gen rmar = runiform() < prob_mar     
  9. replace x2 = . if rmar == 1   
 10.         
.         
. * Generate Auxiliary Variables Predictive of Probability of Missingness and M
> issing Values      
. gen aux1 = 0.7*x2 + 0.3*y + rnormal(0, 10)   // Correlated with x2 and y
 11. gen aux2 = 0.5*x2 + 0.5*y + rnormal(0, 15)   // Balanced correlation with 
> x2 and y
 12. gen aux3 = 0.6*x2 + 0.2*x1 + 0.2*y + rnormal(0, 12)   // Adds x1 for sligh
> t variation
 13. gen aux4 = 0.4*x2 + 0.6*y + rnormal(0, 20)   // Heavier weight on y (missi
> ngness predictor)
 14. gen aux5 = 0.3*x2 + 0.3*y + 0.4*x3 + rnormal(0, 18)  // Introduces correla
> tion with x3
 15. 
. * Generate Auxiliary Variables Predictive of ONLY Missing Values 
. 
. gen aux6 = 0.9*x2 + rnormal(0, 10)   // Strong correlation with x2, no y depe
> ndence
 16. gen aux7 = 0.8*x2 + 0.2*x3 + rnormal(0, 12)   // Includes x3, but no direc
> t link to missingness
 17. gen aux8 = 0.85*x2 + 0.1*x1 + rnormal(0, 15)  // Includes x1, but missingn
> ess-independent
 18.         
. * setting MI data up
. 
. mi set wide
 19. 
. mi register imputed y x1 x2 x3 aux1 aux2 aux3 aux4 aux5 aux6 aux7 aux8 
 20. 
. 
. mi impute chained ///
> ///
> (regress) y x1 x2 x3 aux1 aux2 aux3 aux4 aux5 aux6 aux7 aux8 ///
> , rseed(12345) dots force add(100) burnin(20) 
 21. 
. 
. * mi model
. 
. mi estimate, post dots: regress y x1 x2 x3
 22.     
. predict yhat       
 23. gen resid = y - yhat
 24. gen sqr_resid = resid^2
 25. summarize sqr_resid
 26. gen RMSE1i = sqrt(r(sum) / 997)    
 27. 
. gen beta1 = _b[x1]
 28. gen se1 = _se[x1]
 29. gen lower1 = beta1 - 1.96 * se1
 30. gen upper1 = beta1 + 1.96 * se1
 31. 
. gen beta2 = _b[x2]
 32. gen se2 = _se[x2]
 33. gen lower2 = beta2 - 1.96 * se2
 34. gen upper2 = beta2 + 1.96 * se2
 35. 
. gen beta3 = _b[x3]
 36. gen se3 = _se[x3]
 37. gen lower3 = beta3 - 1.96 * se3
 38. gen upper3 = beta3 + 1.96 * se3   
 39. 
. end

. 

Monte Carlo Simulations¶

This next section of the Notebook now runs each program simulation. Each simulation is run 1000 times with a set seed of "1234" for each program simulation. The 95 per cent confidence intervals of the means for each beta and standard error within the models is extracted and the dataset is then saved for each simulation.

In [12]:
%%stata 

simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upper3 RMSE1a, reps(1000) seed(1234): regress1 

ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_10 _sim_11 _sim_12 _sim_13

rename _sim_1 beta1a 
rename _sim_2 se1a 
rename _sim_3 beta2a 
rename _sim_4 se2a 
rename _sim_5 beta3a 
rename _sim_6 se3a 
rename _sim_7 ll1a 
rename _sim_8 ll2a 
rename _sim_9 ll3a 
rename _sim_10 ul1a 
rename _sim_11 ul2a 
rename _sim_12 ul3a
rename _sim_13 RMSE1a

gen absbias1 = 100 * (beta1a - 30) / 30
gen absbias2 = 100 * (beta2a - 40) / 40
gen absbias3 = 100 * (beta3a - 50) / 50

summarize absbias1 absbias2 absbias3

gen bias1 = 100 * (beta1a - 30) / se1a
gen bias2 = 100 * (beta2a - 40) / se2a
gen bias3 = 100 * (beta3a - 50) / se3a

summarize bias1 bias2 bias3
                   
summarize RMSE1a                   
                   
gen covered1 = (30 >= ll1a) & (30 <= ul1a)
gen covered2 = (40 >= ll2a) & (40 <= ul2a)
gen covered3 = (50 >= ll3a) & (50 <= ul3a)
                   
summarize covered1 covered2 covered3
                   
gen length1 = ul1a-ll1a     
gen length2 = ul2a-ll2a                   
gen length3 = ul3a-ll3a                   

summarize length1 length2 length3   
                     

save godmodel, replace
. 
. simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upp
> er3 RMSE1a, reps(1000) seed(1234): regress1 

      Command: regress1
       _sim_1: beta1
       _sim_2: se1
       _sim_3: beta2
       _sim_4: se2
       _sim_5: beta3
       _sim_6: se3
       _sim_7: lower1
       _sim_8: lower2
       _sim_9: lower3
      _sim_10: upper1
      _sim_11: upper2
      _sim_12: upper3
      _sim_13: RMSE1a

Simulations (1,000): .........10.........20.........30.........40.........50...
> ......60.........70.........80.........90.........100.........110.........120
> .........130.........140.........150.........160.........170.........180.....
> ....190.........200.........210.........220.........230.........240.........2
> 50.........260.........270.........280.........290.........300.........310...
> ......320.........330.........340.........350.........360.........370........
> .380.........390.........400.........410.........420.........430.........440.
> ........450.........460.........470.........480.........490.........500......
> ...510.........520.........530.........540.........550.........560.........57
> 0.........580.........590.........600.........610.........620.........630....
> .....640.........650.........660.........670.........680.........690.........
> 700.........710.........720.........730.........740.........750.........760..
> .......770.........780.........790.........800.........810.........820.......
> ..830.........840.........850.........860.........870.........880.........890
> .........900.........910.........920.........930.........940.........950.....
> ....960.........970.........980.........990.........1,000 done

. 
. ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_1
> 0 _sim_11 _sim_12 _sim_13

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      _sim_1 |      1,000    30.00808    .1223364        29.76802    30.24815
      _sim_2 |      1,000    3.958464    .0039263        3.950759    3.966169
      _sim_3 |      1,000    40.01876    .0299956         39.9599    40.07762
      _sim_4 |      1,000    .9516209    .0009688        .9497198    .9535221
      _sim_5 |      1,000    49.87959    .3001486        49.29059    50.46858
-------------+---------------------------------------------------------------
      _sim_6 |      1,000     9.51589    .0095674        9.497115    9.534664
      _sim_7 |      1,000    22.24949    .1221543        22.00978     22.4892
      _sim_8 |      1,000    38.15358    .0300231        38.09467     38.2125
      _sim_9 |      1,000    31.22844    .2998234        30.64009     31.8168
     _sim_10 |      1,000    37.76667    .1230006         37.5253    38.00804
-------------+---------------------------------------------------------------
     _sim_11 |      1,000    41.88393    .0300882        41.82489    41.94298
     _sim_12 |      1,000    68.53073    .3016415        67.93881    69.12265
     _sim_13 |      1,000        1500    1.075835        1497.889    1502.111

. 
. rename _sim_1 beta1a 

. rename _sim_2 se1a 

. rename _sim_3 beta2a 

. rename _sim_4 se2a 

. rename _sim_5 beta3a 

. rename _sim_6 se3a 

. rename _sim_7 ll1a 

. rename _sim_8 ll2a 

. rename _sim_9 ll3a 

. rename _sim_10 ul1a 

. rename _sim_11 ul2a 

. rename _sim_12 ul3a

. rename _sim_13 RMSE1a

. 
. gen absbias1 = 100 * (beta1a - 30) / 30

. gen absbias2 = 100 * (beta2a - 40) / 40

. gen absbias3 = 100 * (beta3a - 50) / 50

. 
. summarize absbias1 absbias2 absbias3

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    absbias1 |      1,000    .0269405    12.89539  -39.21011   41.60975
    absbias2 |      1,000    .0468945    2.371359   -6.72061   7.922773
    absbias3 |      1,000   -.2408279    18.98306  -57.06816   58.99072

. 
. gen bias1 = 100 * (beta1a - 30) / se1a

. gen bias2 = 100 * (beta2a - 40) / se2a

. gen bias3 = 100 * (beta3a - 50) / se3a

. 
. summarize bias1 bias2 bias3

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       bias1 |      1,000    .0288394    97.83546  -295.6941   304.1698
       bias2 |      1,000    1.916143    99.79077  -288.9381   337.0835
       bias3 |      1,000   -1.418348    99.93945  -294.8282   311.0158

.                    
. summarize RMSE1a                   

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      RMSE1a |      1,000        1500    34.02088   1384.698    1605.48

.                    
. gen covered1 = (30 >= ll1a) & (30 <= ul1a)

. gen covered2 = (40 >= ll2a) & (40 <= ul2a)

. gen covered3 = (50 >= ll3a) & (50 <= ul3a)

.                    
. summarize covered1 covered2 covered3

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    covered1 |      1,000        .955    .2074079          0          1
    covered2 |      1,000        .943    .2319586          0          1
    covered3 |      1,000        .947    .2241456          0          1

.                    
. gen length1 = ul1a-ll1a     

. gen length2 = ul2a-ll2a                   

. gen length3 = ul3a-ll3a                   

. 
. summarize length1 length2 length3   

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     length1 |      1,000    15.51718    .4867134   13.97988   17.13386
     length2 |      1,000    3.730354    .1200956   3.358505   4.088402
     length3 |      1,000    37.30229    1.185991   33.99916   41.38372

.                      
. 
. save godmodel, replace
file godmodel.dta saved

. 
In [13]:
%%stata 

simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upper3 RMSE1b, reps(1000) seed(1234): sem1 

ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_10 _sim_11 _sim_12 _sim_13

rename _sim_1 beta1b
rename _sim_2 se1b
rename _sim_3 beta2b 
rename _sim_4 se2b
rename _sim_5 beta3b 
rename _sim_6 se3b
rename _sim_7 ll1b 
rename _sim_8 ll2b 
rename _sim_9 ll3b 
rename _sim_10 ul1b 
rename _sim_11 ul2b 
rename _sim_12 ul3b 
rename _sim_13 RMSE1b

gen absbias1 = 100 * (beta1b - 30) / 30
gen absbias2 = 100 * (beta2b - 40) / 40
gen absbias3 = 100 * (beta3b - 50) / 50

summarize absbias1 absbias2 absbias3

gen bias1 = 100 * (beta1b - 30) / se1b
gen bias2 = 100 * (beta2b - 40) / se2b
gen bias3 = 100 * (beta3b - 50) / se3b

summarize bias1 bias2 bias3
                   
summarize RMSE1b                   
                   
gen covered1 = (30 >= ll1b) & (30 <= ul1b)
gen covered2 = (40 >= ll2b) & (40 <= ul2b)
gen covered3 = (50 >= ll3b) & (50 <= ul3b)
                   
summarize covered1 covered2 covered3
                   
gen length1 = ul1b-ll1b     
gen length2 = ul2b-ll2b                   
gen length3 = ul3b-ll3b                   

summarize length1 length2 length3   

save semgodmodel, replace
. 
. simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upp
> er3 RMSE1b, reps(1000) seed(1234): sem1 

      Command: sem1
       _sim_1: beta1
       _sim_2: se1
       _sim_3: beta2
       _sim_4: se2
       _sim_5: beta3
       _sim_6: se3
       _sim_7: lower1
       _sim_8: lower2
       _sim_9: lower3
      _sim_10: upper1
      _sim_11: upper2
      _sim_12: upper3
      _sim_13: RMSE1b

Simulations (1,000): .........10.........20.........30.........40.........50...
> ......60.........70.........80.........90.........100.........110.........120
> .........130.........140.........150.........160.........170.........180.....
> ....190.........200.........210.........220.........230.........240.........2
> 50.........260.........270.........280.........290.........300.........310...
> ......320.........330.........340.........350.........360.........370........
> .380.........390.........400.........410.........420.........430.........440.
> ........450.........460.........470.........480.........490.........500......
> ...510.........520.........530.........540.........550.........560.........57
> 0.........580.........590.........600.........610.........620.........630....
> .....640.........650.........660.........670.........680.........690.........
> 700.........710.........720.........730.........740.........750.........760..
> .......770.........780.........790.........800.........810.........820.......
> ..830.........840.........850.........860.........870.........880.........890
> .........900.........910.........920.........930.........940.........950.....
> ....960.........970.........980.........990.........1,000 done

. 
. ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_1
> 0 _sim_11 _sim_12 _sim_13

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      _sim_1 |      1,000    30.00808    .1223364        29.76802    30.24815
      _sim_2 |      1,000    3.950539    .0039185         3.94285    3.958229
      _sim_3 |      1,000    40.01876    .0299956         39.9599    40.07762
      _sim_4 |      1,000    .9497158    .0009669        .9478184    .9516131
      _sim_5 |      1,000    49.87959    .3001486        49.29059    50.46858
-------------+---------------------------------------------------------------
      _sim_6 |      1,000    9.496839    .0095483        9.478102    9.515576
      _sim_7 |      1,000    22.26502    .1221542        22.02532    22.50473
      _sim_8 |      1,000    38.15731    .0300229         38.0984    38.21623
      _sim_9 |      1,000    31.26578    .2998229        30.67743    31.85414
     _sim_10 |      1,000    37.75114    .1229988        37.50977     37.9925
-------------+---------------------------------------------------------------
     _sim_11 |      1,000     41.8802    .0300878        41.82116    41.93924
     _sim_12 |      1,000    68.49339    .3016373        67.90147    69.08531
     _sim_13 |      1,000        1500    1.075835        1497.889    1502.111

. 
. rename _sim_1 beta1b

. rename _sim_2 se1b

. rename _sim_3 beta2b 

. rename _sim_4 se2b

. rename _sim_5 beta3b 

. rename _sim_6 se3b

. rename _sim_7 ll1b 

. rename _sim_8 ll2b 

. rename _sim_9 ll3b 

. rename _sim_10 ul1b 

. rename _sim_11 ul2b 

. rename _sim_12 ul3b 

. rename _sim_13 RMSE1b

. 
. gen absbias1 = 100 * (beta1b - 30) / 30

. gen absbias2 = 100 * (beta2b - 40) / 40

. gen absbias3 = 100 * (beta3b - 50) / 50

. 
. summarize absbias1 absbias2 absbias3

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    absbias1 |      1,000    .0269405    12.89539  -39.21011   41.60975
    absbias2 |      1,000    .0468945    2.371359   -6.72061   7.922773
    absbias3 |      1,000   -.2408279    18.98306  -57.06816   58.99072

. 
. gen bias1 = 100 * (beta1b - 30) / se1b

. gen bias2 = 100 * (beta2b - 40) / se2b

. gen bias3 = 100 * (beta3b - 50) / se3b

. 
. summarize bias1 bias2 bias3

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       bias1 |      1,000    .0288972    98.03172  -296.2873     304.78
       bias2 |      1,000    1.919987    99.99095  -289.5177   337.7597
       bias3 |      1,000   -1.421193    100.1399  -295.4196   311.6397

.                    
. summarize RMSE1b                   

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      RMSE1b |      1,000        1500    34.02088   1384.698    1605.48

.                    
. gen covered1 = (30 >= ll1b) & (30 <= ul1b)

. gen covered2 = (40 >= ll2b) & (40 <= ul2b)

. gen covered3 = (50 >= ll3b) & (50 <= ul3b)

.                    
. summarize covered1 covered2 covered3

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    covered1 |      1,000        .955    .2074079          0          1
    covered2 |      1,000        .942    .2338604          0          1
    covered3 |      1,000        .947    .2241456          0          1

.                    
. gen length1 = ul1b-ll1b     

. gen length2 = ul2b-ll2b                   

. gen length3 = ul3b-ll3b                   

. 
. summarize length1 length2 length3   

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     length1 |      1,000    15.48611     .485739   13.95189   17.09955
     length2 |      1,000    3.722886    .1198551   3.351784   4.080215
     length3 |      1,000    37.22761    1.183617   33.93109   41.30087

. 
. save semgodmodel, replace
file semgodmodel.dta saved

. 
In [14]:
%%stata

simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upper3 RMSE1c, reps(1000) seed(1234): missingmcar 
    
ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_10 _sim_11 _sim_12 _sim_13

rename _sim_1 beta1c
rename _sim_2 se1c
rename _sim_3 beta2c 
rename _sim_4 se2c
rename _sim_5 beta3c 
rename _sim_6 se3c
rename _sim_7 ll1c 
rename _sim_8 ll2c 
rename _sim_9 ll3c 
rename _sim_10 ul1c 
rename _sim_11 ul2c 
rename _sim_12 ul3c 
rename _sim_13 RMSE1c

gen absbias1 = 100 * (beta1c - 30) / 30
gen absbias2 = 100 * (beta2c - 40) / 40
gen absbias3 = 100 * (beta3c - 50) / 50

summarize absbias1 absbias2 absbias3

gen bias1 = 100 * (beta1c - 30) / se1c
gen bias2 = 100 * (beta2c - 40) / se2c
gen bias3 = 100 * (beta3c - 50) / se3c

summarize bias1 bias2 bias3
                   
summarize RMSE1c                                      
                   
gen covered1 = (30 >= ll1c) & (30 <= ul1c)
gen covered2 = (40 >= ll2c) & (40 <= ul2c)
gen covered3 = (50 >= ll3c) & (50 <= ul3c)
                   
summarize covered1 covered2 covered3
                   
gen length1 = ul1c-ll1c     
gen length2 = ul2c-ll2c                   
gen length3 = ul3c-ll3c                   

summarize length1 length2 length3                         

save missingmcar, replace
. 
. simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upp
> er3 RMSE1c, reps(1000) seed(1234): missingmcar 

      Command: missingmcar
       _sim_1: beta1
       _sim_2: se1
       _sim_3: beta2
       _sim_4: se2
       _sim_5: beta3
       _sim_6: se3
       _sim_7: lower1
       _sim_8: lower2
       _sim_9: lower3
      _sim_10: upper1
      _sim_11: upper2
      _sim_12: upper3
      _sim_13: RMSE1c

Simulations (1,000): .........10.........20.........30.........40.........50...
> ......60.........70.........80.........90.........100.........110.........120
> .........130.........140.........150.........160.........170.........180.....
> ....190.........200.........210.........220.........230.........240.........2
> 50.........260.........270.........280.........290.........300.........310...
> ......320.........330.........340.........350.........360.........370........
> .380.........390.........400.........410.........420.........430.........440.
> ........450.........460.........470.........480.........490.........500......
> ...510.........520.........530.........540.........550.........560.........57
> 0.........580.........590.........600.........610.........620.........630....
> .....640.........650.........660.........670.........680.........690.........
> 700.........710.........720.........730.........740.........750.........760..
> .......770.........780.........790.........800.........810.........820.......
> ..830.........840.........850.........860.........870.........880.........890
> .........900.........910.........920.........930.........940.........950.....
> ....960.........970.........980.........990.........1,000 done

.     
. ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_1
> 0 _sim_11 _sim_12 _sim_13

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      _sim_1 |      1,000    29.96598    .1732293        29.62604    30.30591
      _sim_2 |      1,000    5.622786    .0085991        5.605912     5.63966
      _sim_3 |      1,000    40.02882    .0423628        39.94568    40.11195
      _sim_4 |      1,000    1.348911     .002079        1.344832    1.352991
      _sim_5 |      1,000    51.29684     .436001        50.44125    52.15242
-------------+---------------------------------------------------------------
      _sim_6 |      1,000    13.47856    .0213482        13.43667    13.52046
      _sim_7 |      1,000    18.94532    .1742216        18.60344     19.2872
      _sim_8 |      1,000    37.38495     .042557        37.30144    37.46846
      _sim_9 |      1,000    24.87885    .4352299        24.02478    25.73292
     _sim_10 |      1,000    40.98664    .1738728        40.64544    41.32784
-------------+---------------------------------------------------------------
     _sim_11 |      1,000    42.67268    .0425597        42.58916     42.7562
     _sim_12 |      1,000    77.71482     .440761         76.8499    78.57974
     _sim_13 |      1,000    1519.553    1.684498        1516.248    1522.859

. 
. rename _sim_1 beta1c

. rename _sim_2 se1c

. rename _sim_3 beta2c 

. rename _sim_4 se2c

. rename _sim_5 beta3c 

. rename _sim_6 se3c

. rename _sim_7 ll1c 

. rename _sim_8 ll2c 

. rename _sim_9 ll3c 

. rename _sim_10 ul1c 

. rename _sim_11 ul2c 

. rename _sim_12 ul3c 

. rename _sim_13 RMSE1c

. 
. gen absbias1 = 100 * (beta1c - 30) / 30

. gen absbias2 = 100 * (beta2c - 40) / 40

. gen absbias3 = 100 * (beta3c - 50) / 50

. 
. summarize absbias1 absbias2 absbias3

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    absbias1 |      1,000   -.1134023    18.25997  -56.99002   62.33422
    absbias2 |      1,000    .0720377    3.349077  -10.54511   10.21497
    absbias3 |      1,000    2.593673    27.57512  -69.91905   93.54358

. 
. gen bias1 = 100 * (beta1c - 30) / se1c

. gen bias2 = 100 * (beta2c - 40) / se2c

. gen bias3 = 100 * (beta3c - 50) / se3c

. 
. summarize bias1 bias2 bias3

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       bias1 |      1,000   -.5500844    97.42645  -308.4916   323.5037
       bias2 |      1,000    2.142476    99.44684  -305.4591   300.9473
       bias3 |      1,000    9.302094    102.5857  -258.2774   356.9467

.                    
. summarize RMSE1c                                      

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      RMSE1c |      1,000    1519.553    53.26849   1355.798    1698.11

.                    
. gen covered1 = (30 >= ll1c) & (30 <= ul1c)

. gen covered2 = (40 >= ll2c) & (40 <= ul2c)

. gen covered3 = (50 >= ll3c) & (50 <= ul3c)

.                    
. summarize covered1 covered2 covered3

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    covered1 |      1,000        .963    .1888562          0          1
    covered2 |      1,000        .955    .2074079          0          1
    covered3 |      1,000         .94    .2376057          0          1

.                    
. gen length1 = ul1c-ll1c     

. gen length2 = ul2c-ll2c                   

. gen length3 = ul3c-ll3c                   

. 
. summarize length1 length2 length3                         

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     length1 |      1,000    22.04132    1.065957   19.29918   26.03639
     length2 |      1,000    5.287733    .2577094   4.545906   6.251045
     length3 |      1,000    52.83597    2.646351   44.42754   61.73584

. 
. save missingmcar, replace
file missingmcar.dta saved

. 
In [15]:
%%stata

simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upper3 RMSE1d, reps(1000) seed(1234): missingmar 

ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_10 _sim_11 _sim_12 _sim_13

rename _sim_1 beta1d
rename _sim_2 se1d
rename _sim_3 beta2d 
rename _sim_4 se2d
rename _sim_5 beta3d 
rename _sim_6 se3d
rename _sim_7 ll1d 
rename _sim_8 ll2d 
rename _sim_9 ll3d 
rename _sim_10 ul1d 
rename _sim_11 ul2d 
rename _sim_12 ul3d 
rename _sim_13 RMSE1d

gen absbias1 = 100 * (beta1d - 30) / 30
gen absbias2 = 100 * (beta2d - 40) / 40
gen absbias3 = 100 * (beta3d - 50) / 50

summarize absbias1 absbias2 absbias3

gen bias1 = 100 * (beta1d - 30) / se1d
gen bias2 = 100 * (beta2d - 40) / se2d
gen bias3 = 100 * (beta3d - 50) / se3d

summarize bias1 bias2 bias3
                   
summarize RMSE1d                                      
                   
gen covered1 = (30 >= ll1d) & (30 <= ul1d)
gen covered2 = (40 >= ll2d) & (40 <= ul2d)
gen covered3 = (50 >= ll3d) & (50 <= ul3d)
                   
summarize covered1 covered2 covered3    
                   
gen length1 = ul1d-ll1d       
gen length2 = ul2d-ll2d                   
gen length3 = ul3d-ll3d                   

summarize length1 length2 length3                   

save missingmar, replace
. 
. simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upp
> er3 RMSE1d, reps(1000) seed(1234): missingmar 

      Command: missingmar
       _sim_1: beta1
       _sim_2: se1
       _sim_3: beta2
       _sim_4: se2
       _sim_5: beta3
       _sim_6: se3
       _sim_7: lower1
       _sim_8: lower2
       _sim_9: lower3
      _sim_10: upper1
      _sim_11: upper2
      _sim_12: upper3
      _sim_13: RMSE1d

Simulations (1,000): .........10.........20.........30.........40.........50...
> ......60.........70.........80.........90.........100.........110.........120
> .........130.........140.........150.........160.........170.........180.....
> ....190.........200.........210.........220.........230.........240.........2
> 50.........260.........270.........280.........290.........300.........310...
> ......320.........330.........340.........350.........360.........370........
> .380.........390.........400.........410.........420.........430.........440.
> ........450.........460.........470.........480.........490.........500......
> ...510.........520.........530.........540.........550.........560.........57
> 0.........580.........590.........600.........610.........620.........630....
> .....640.........650.........660.........670.........680.........690.........
> 700.........710.........720.........730.........740.........750.........760..
> .......770.........780.........790.........800.........810.........820.......
> ..830.........840.........850.........860.........870.........880.........890
> .........900.........910.........920.........930.........940.........950.....
> ....960.........970.........980.........990.........1,000 done

. 
. ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_1
> 0 _sim_11 _sim_12 _sim_13

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      _sim_1 |      1,000    18.88248    .1384802        18.61074    19.15423
      _sim_2 |      1,000    4.423952    .0068386        4.410532    4.437371
      _sim_3 |      1,000    25.14197    .0458819        25.05193      25.232
      _sim_4 |      1,000    1.354173    .0020679        1.350115    1.358231
      _sim_5 |      1,000    31.69827    .3393094        31.03243    32.36411
-------------+---------------------------------------------------------------
      _sim_6 |      1,000    10.55861    .0169251         10.5254    10.59183
      _sim_7 |      1,000    10.21154    .1385361        9.939683    10.48339
      _sim_8 |      1,000    22.48779    .0457186        22.39807     22.5775
      _sim_9 |      1,000    11.00338    .3394656        10.33724    11.66953
     _sim_10 |      1,000    27.55343    .1397161        27.27926     27.8276
-------------+---------------------------------------------------------------
     _sim_11 |      1,000    27.79615    .0464001        27.70509     27.8872
     _sim_12 |      1,000    52.39315    .3423825        51.72128    53.06502
     _sim_13 |      1,000    1183.398    1.381987        1180.686     1186.11

. 
. rename _sim_1 beta1d

. rename _sim_2 se1d

. rename _sim_3 beta2d 

. rename _sim_4 se2d

. rename _sim_5 beta3d 

. rename _sim_6 se3d

. rename _sim_7 ll1d 

. rename _sim_8 ll2d 

. rename _sim_9 ll3d 

. rename _sim_10 ul1d 

. rename _sim_11 ul2d 

. rename _sim_12 ul3d 

. rename _sim_13 RMSE1d

. 
. gen absbias1 = 100 * (beta1d - 30) / 30

. gen absbias2 = 100 * (beta2d - 40) / 40

. gen absbias3 = 100 * (beta3d - 50) / 50

. 
. summarize absbias1 absbias2 absbias3

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    absbias1 |      1,000   -37.05839    14.59709  -88.03668   16.64221
    absbias2 |      1,000   -37.14508    3.627286  -49.51259  -26.15721
    absbias3 |      1,000   -36.60347    21.45981  -98.38974    36.0162

. 
. gen bias1 = 100 * (beta1d - 30) / se1d

. gen bias2 = 100 * (beta2d - 40) / se2d

. gen bias3 = 100 * (beta3d - 50) / se3d

. 
. summarize bias1 bias2 bias3

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       bias1 |      1,000   -252.1138    100.3254   -582.379   107.1902
       bias2 |      1,000    -1100.19    123.7338  -1497.348  -770.2115
       bias3 |      1,000   -174.0128    102.1183  -490.4887   159.8803

.                    
. summarize RMSE1d                                      

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      RMSE1d |      1,000    1183.398    43.70226   1058.915   1326.678

.                    
. gen covered1 = (30 >= ll1d) & (30 <= ul1d)

. gen covered2 = (40 >= ll2d) & (40 <= ul2d)

. gen covered3 = (50 >= ll3d) & (50 <= ul3d)

.                    
. summarize covered1 covered2 covered3    

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    covered1 |      1,000        .289    .4535247          0          1
    covered2 |      1,000           0           0          0          0
    covered3 |      1,000        .578    .4941257          0          1

.                    
. gen length1 = ul1d-ll1d       

. gen length2 = ul2d-ll2d                   

. gen length3 = ul3d-ll3d                   

. 
. summarize length1 length2 length3                   

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     length1 |      1,000    17.34189    .8477156   14.75978   20.56871
     length2 |      1,000     5.30836    .2563435   4.594616   6.459545
     length3 |      1,000    41.38977    2.098058   35.78901    48.7177

. 
. save missingmar, replace
file missingmar.dta saved

. 

Beyond the average mean statistics of betas and standard errors that are reported, five measures are produced to evaluate the performance of measures going forward. The MAR model is used as a baseline to compare all other model statistics to. The first is the Absolute Percentage Bias. This tells us the magnitude of bias - how far off the estimate is from the true value. It measures the absolute bias as a percentage of the true value. Following from this, the second measure is the Absolute Bias Percentage of the Standard Error. This measures the bias relative to the uncertainty (standard error) of the estimator. The benefit of this is to measure the magnitude of bias compared to precision (Is the bias large compared to how variable the estimate is?). Whilst the former measure is a very simple practical measure that measures difference, the latter allows us to investigate the significance change relative to the precision.

The third measure is the Root Mean Square Error (RMSE). This measure combines bias and efficiency. It quantifies the difference between actual, and predicted values. A smaller RMSE indicates better performance of a model. It is an unstandardised equivelant to the R squared measure.

Coverage is the fourth measure used. The actual coverage of nominal 95% confidence intervals is reported. This is done to assess the accurate Type 1 error rate for a testing procedure with a 0.05 level criterion. Coverage that drops below 90% is considered troublesome (Collins, Schafer, Kam 2001).

Finally, average length of the confidence interval is provided. This is a simple measure that presents how wide the confidence intervals are by taking the upperbound away from the lowerbound intervals. If the interval length is shorter than the base model, it indicates greater accuracy and higher statistical power.

In [16]:
%%stata

simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upper3 RMSE1e, reps(1000) seed(1234): missingmean 

ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_10 _sim_11 _sim_12 _sim_13

rename _sim_1 beta1e
rename _sim_2 se1e
rename _sim_3 beta2e 
rename _sim_4 se2e
rename _sim_5 beta3e 
rename _sim_6 se3e
rename _sim_7 ll1e 
rename _sim_8 ll2e 
rename _sim_9 ll3e 
rename _sim_10 ul1e 
rename _sim_11 ul2e 
rename _sim_12 ul3e 
rename _sim_13 RMSE1e


gen absbias1 = 100 * (beta1e - 30) / 30
gen absbias2 = 100 * (beta2e - 40) / 40
gen absbias3 = 100 * (beta3e - 50) / 50

summarize absbias1 absbias2 absbias3

gen bias1 = 100 * (beta1e - 30) / se1e
gen bias2 = 100 * (beta2e - 40) / se2e
gen bias3 = 100 * (beta3e - 50) / se3e

summarize bias1 bias2 bias3
                   
summarize RMSE1e                                    
                   
gen covered1 = (30 >= ll1e) & (30 <= ul1e)
gen covered2 = (40 >= ll2e) & (40 <= ul2e)
gen covered3 = (50 >= ll3e) & (50 <= ul3e)
                   
summarize covered1 covered2 covered3    
                   
gen length1 = ul1e-ll1e       
gen length2 = ul2e-ll2e                   
gen length3 = ul3e-ll3e                   

summarize length1 length2 length3 
                   
save missingmean, replace
. 
. simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upp
> er3 RMSE1e, reps(1000) seed(1234): missingmean 

      Command: missingmean
       _sim_1: beta1
       _sim_2: se1
       _sim_3: beta2
       _sim_4: se2
       _sim_5: beta3
       _sim_6: se3
       _sim_7: lower1
       _sim_8: lower2
       _sim_9: lower3
      _sim_10: upper1
      _sim_11: upper2
      _sim_12: upper3
      _sim_13: RMSE1e

Simulations (1,000): .........10.........20.........30.........40.........50...
> ......60.........70.........80.........90.........100.........110.........120
> .........130.........140.........150.........160.........170.........180.....
> ....190.........200.........210.........220.........230.........240.........2
> 50.........260.........270.........280.........290.........300.........310...
> ......320.........330.........340.........350.........360.........370........
> .380.........390.........400.........410.........420.........430.........440.
> ........450.........460.........470.........480.........490.........500......
> ...510.........520.........530.........540.........550.........560.........57
> 0.........580.........590.........600.........610.........620.........630....
> .....640.........650.........660.........670.........680.........690.........
> 700.........710.........720.........730.........740.........750.........760..
> .......770.........780.........790.........800.........810.........820.......
> ..830.........840.........850.........860.........870.........880.........890
> .........900.........910.........920.........930.........940.........950.....
> ....960.........970.........980.........990.........1,000 done

. 
. ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_1
> 0 _sim_11 _sim_12 _sim_13

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      _sim_1 |      1,000     33.9267    .1942382        33.54554    34.30786
      _sim_2 |      1,000    6.335807    .0065203        6.323012    6.348602
      _sim_3 |      1,000    25.75512    .0470237        25.66285     25.8474
      _sim_4 |      1,000    2.718789    .0034573        2.712005    2.725574
      _sim_5 |      1,000    56.64393    .4947365        55.67309    57.61478
-------------+---------------------------------------------------------------
      _sim_6 |      1,000    15.17203    .0153294        15.14195    15.20212
      _sim_7 |      1,000    21.50852    .1947725        21.12631    21.89073
      _sim_8 |      1,000     20.4263    .0469501        20.33416    20.51843
      _sim_9 |      1,000    26.90675    .4951852        25.93502    27.87847
     _sim_10 |      1,000    46.34488    .1945437        45.96312    46.72664
-------------+---------------------------------------------------------------
     _sim_11 |      1,000    31.08395    .0480623        30.98963    31.17826
     _sim_12 |      1,000    86.38112    .4961103        85.40759    87.35466
     _sim_13 |      1,000    2392.798    1.660691        2389.539    2396.057

. 
. rename _sim_1 beta1e

. rename _sim_2 se1e

. rename _sim_3 beta2e 

. rename _sim_4 se2e

. rename _sim_5 beta3e 

. rename _sim_6 se3e

. rename _sim_7 ll1e 

. rename _sim_8 ll2e 

. rename _sim_9 ll3e 

. rename _sim_10 ul1e 

. rename _sim_11 ul2e 

. rename _sim_12 ul3e 

. rename _sim_13 RMSE1e

. 
. 
. gen absbias1 = 100 * (beta1e - 30) / 30

. gen absbias2 = 100 * (beta2e - 40) / 40

. gen absbias3 = 100 * (beta3e - 50) / 50

. 
. summarize absbias1 absbias2 absbias3

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    absbias1 |      1,000      13.089     20.4745  -55.57906   76.68382
    absbias2 |      1,000   -35.61219    3.717553  -49.32283  -23.42478
    absbias3 |      1,000    13.28787    31.28988  -102.6488   116.0377

. 
. gen bias1 = 100 * (beta1e - 30) / se1e

. gen bias2 = 100 * (beta2e - 40) / se2e

. gen bias3 = 100 * (beta3e - 50) / se3e

. 
. summarize bias1 bias2 bias3

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       bias1 |      1,000    62.08034    97.09138  -261.3323   384.6086
       bias2 |      1,000   -524.9757    60.37031  -722.8023  -342.2611
       bias3 |      1,000    43.78726     103.171  -351.4386   392.1291

.                    
. summarize RMSE1e                                    

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      RMSE1e |      1,000    2392.798    52.51565   2232.317   2586.839

.                    
. gen covered1 = (30 >= ll1e) & (30 <= ul1e)

. gen covered2 = (40 >= ll2e) & (40 <= ul2e)

. gen covered3 = (50 >= ll3e) & (50 <= ul3e)

.                    
. summarize covered1 covered2 covered3    

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    covered1 |      1,000        .917    .2760203          0          1
    covered2 |      1,000           0           0          0          0
    covered3 |      1,000        .927    .2602667          0          1

.                    
. gen length1 = ul1e-ll1e       

. gen length2 = ul2e-ll2e                   

. gen length3 = ul3e-ll3e                   

. 
. summarize length1 length2 length3 

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     length1 |      1,000    24.83636     .808267   22.74042   28.44074
     length2 |      1,000    10.65765    .4285732   9.188656   12.08593
     length3 |      1,000    59.47438    1.900247   53.41096   66.90497

.                    
. save missingmean, replace
file missingmean.dta saved

. 
In [17]:
%%stata

simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upper3 RMSE1f, reps(1000) seed(1234): semmlmv 

ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_10 _sim_11 _sim_12 _sim_13

rename _sim_1 beta1f
rename _sim_2 se1f
rename _sim_3 beta2f 
rename _sim_4 se2f
rename _sim_5 beta3f 
rename _sim_6 se3f
rename _sim_7 ll1f 
rename _sim_8 ll2f 
rename _sim_9 ll3f 
rename _sim_10 ul1f 
rename _sim_11 ul2f 
rename _sim_12 ul3f 
rename _sim_13 RMSE1f


gen absbias1 = 100 * (beta1f - 30) / 30
gen absbias2 = 100 * (beta2f - 40) / 40
gen absbias3 = 100 * (beta3f - 50) / 50

summarize absbias1 absbias2 absbias3

gen bias1 = 100 * (beta1f - 30) / se1f
gen bias2 = 100 * (beta2f - 40) / se2f
gen bias3 = 100 * (beta3f - 50) / se3f

summarize bias1 bias2 bias3
                   
summarize RMSE1f                                  
                   
gen covered1 = (30 >= ll1f) & (30 <= ul1f)
gen covered2 = (40 >= ll2f) & (40 <= ul2f)
gen covered3 = (50 >= ll3f) & (50 <= ul3f)
                   
summarize covered1 covered2 covered3    
                   
gen length1 = ul1f-ll1f       
gen length2 = ul2f-ll2f                   
gen length3 = ul3f-ll3f                   

summarize length1 length2 length3                    
                   

save semmlmv, replace
. 
. simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upp
> er3 RMSE1f, reps(1000) seed(1234): semmlmv 

      Command: semmlmv
       _sim_1: beta1
       _sim_2: se1
       _sim_3: beta2
       _sim_4: se2
       _sim_5: beta3
       _sim_6: se3
       _sim_7: lower1
       _sim_8: lower2
       _sim_9: lower3
      _sim_10: upper1
      _sim_11: upper2
      _sim_12: upper3
      _sim_13: RMSE1f

Simulations (1,000): .........10.........20.........30.........40.........50...
> ......60.........70.........80.........90.........100.........110.........120
> .........130.........140.........150.........160.........170.........180.....
> ....190.........200.........210.........220.........230.........240.........2
> 50.........260.........270.........280.........290.........300.........310...
> ......320.........330.........340.........350.........360.........370........
> .380.........390.........400.........410.........420.........430.........440.
> ........450.........460.........470.........480.........490.........500......
> ...510.........520.........530.........540.........550.........560.........57
> 0.........580.........590.........600.........610.........620.........630....
> .....640.........650.........660.........670.........680.........690.........
> 700.........710.........720.........730.........740.........750.........760..
> .......770.........780.........790.........800.........810.........820.......
> ..830.........840.........850.........860.........870.........880.........890
> .........900.........910.........920.........930.........940.........950.....
> ....960.........970.........980.........990.........1,000 done

. 
. ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_1
> 0 _sim_11 _sim_12 _sim_13

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      _sim_1 |      1,000    30.00406     .156705        29.69656    30.31157
      _sim_2 |      1,000    5.068415    .0073885        5.053917    5.082914
      _sim_3 |      1,000    40.02852    .0404882        39.94907    40.10797
      _sim_4 |      1,000    1.260888    .0017933        1.257369    1.264407
      _sim_5 |      1,000    50.24373    .3887388        49.48089    51.00657
-------------+---------------------------------------------------------------
      _sim_6 |      1,000    12.07971    .0181121        12.04417    12.11525
      _sim_7 |      1,000    20.06997    .1553599         19.7651    20.37484
      _sim_8 |      1,000    37.55718    .0382948        37.48203    37.63233
      _sim_9 |      1,000     26.5675    .3852995        25.81141    27.32359
     _sim_10 |      1,000    39.93816    .1593601        39.62544    40.25088
-------------+---------------------------------------------------------------
     _sim_11 |      1,000    42.49986    .0428579        42.41576    42.58397
     _sim_12 |      1,000    73.91996    .3953485        73.14415    74.69576
     _sim_13 |      1,000    1242.749    1.964788        1238.894    1246.605

. 
. rename _sim_1 beta1f

. rename _sim_2 se1f

. rename _sim_3 beta2f 

. rename _sim_4 se2f

. rename _sim_5 beta3f 

. rename _sim_6 se3f

. rename _sim_7 ll1f 

. rename _sim_8 ll2f 

. rename _sim_9 ll3f 

. rename _sim_10 ul1f 

. rename _sim_11 ul2f 

. rename _sim_12 ul3f 

. rename _sim_13 RMSE1f

. 
. 
. gen absbias1 = 100 * (beta1f - 30) / 30

. gen absbias2 = 100 * (beta2f - 40) / 40

. gen absbias3 = 100 * (beta3f - 50) / 50

. 
. summarize absbias1 absbias2 absbias3

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    absbias1 |      1,000    .0135462    16.51816  -47.75852   50.64448
    absbias2 |      1,000    .0713057    3.200872  -10.34884   9.321384
    absbias3 |      1,000    .4874593      24.586  -82.42333   85.41435

. 
. gen bias1 = 100 * (beta1f - 30) / se1f

. gen bias2 = 100 * (beta2f - 40) / se2f

. gen bias3 = 100 * (beta3f - 50) / se3f

. 
. summarize bias1 bias2 bias3

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       bias1 |      1,000   -.5380218    97.58873  -290.6914   303.4151
       bias2 |      1,000   -.7176602    102.1402   -352.801    272.922
       bias3 |      1,000    1.334883    101.8232  -312.8748   350.3375

.                    
. summarize RMSE1f                                  

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      RMSE1f |      1,000    1242.749    62.13206    1061.51   1500.909

.                    
. gen covered1 = (30 >= ll1f) & (30 <= ul1f)

. gen covered2 = (40 >= ll2f) & (40 <= ul2f)

. gen covered3 = (50 >= ll3f) & (50 <= ul3f)

.                    
. summarize covered1 covered2 covered3    

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    covered1 |      1,000         .95     .218054          0          1
    covered2 |      1,000        .954    .2095899          0          1
    covered3 |      1,000        .948    .2221381          0          1

.                    
. gen length1 = ul1f-ll1f       

. gen length2 = ul2f-ll2f                   

. gen length3 = ul3f-ll3f                   

. 
. summarize length1 length2 length3                    

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     length1 |      1,000    19.86819    .9158888   16.79119    23.3758
     length2 |      1,000    4.942682    .2222984   4.252899    6.01104
     length3 |      1,000    47.35245    2.245196   41.31921   55.62303

.                    
. 
. save semmlmv, replace
file semmlmv.dta saved

. 
In [18]:
%%stata

simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upper3 RMSE1g, reps(1000) seed(1234): minoaux1 

ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_10 _sim_11 _sim_12 _sim_13

rename _sim_1 beta1g
rename _sim_2 se1g
rename _sim_3 beta2g 
rename _sim_4 se2g
rename _sim_5 beta3g 
rename _sim_6 se3g
rename _sim_7 ll1g 
rename _sim_8 ll2g 
rename _sim_9 ll3g 
rename _sim_10 ul1g 
rename _sim_11 ul2g 
rename _sim_12 ul3g 
rename _sim_13 RMSE1g


gen absbias1 = 100 * (beta1g - 30) / 30
gen absbias2 = 100 * (beta2g - 40) / 40
gen absbias3 = 100 * (beta3g - 50) / 50

summarize absbias1 absbias2 absbias3

gen bias1 = 100 * (beta1g - 30) / se1g
gen bias2 = 100 * (beta2g - 40) / se2g
gen bias3 = 100 * (beta3g - 50) / se3g

summarize bias1 bias2 bias3
                   
summarize RMSE1g                                  
                   
gen covered1 = (30 >= ll1g) & (30 <= ul1g)
gen covered2 = (40 >= ll2g) & (40 <= ul2g)
gen covered3 = (50 >= ll3g) & (50 <= ul3g)
                   
summarize covered1 covered2 covered3    
                   
gen length1 = ul1g-ll1g       
gen length2 = ul2g-ll2g                   
gen length3 = ul3g-ll3g                   

summarize length1 length2 length3   
save minoaux1, replace
. 
. simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upp
> er3 RMSE1g, reps(1000) seed(1234): minoaux1 

      Command: minoaux1
       _sim_1: beta1
       _sim_2: se1
       _sim_3: beta2
       _sim_4: se2
       _sim_5: beta3
       _sim_6: se3
       _sim_7: lower1
       _sim_8: lower2
       _sim_9: lower3
      _sim_10: upper1
      _sim_11: upper2
      _sim_12: upper3
      _sim_13: RMSE1g

Simulations (1,000): .........10.........20.........30.........40.........50...
> ......60.........70.........80.........90.........100.........110.........120
> .........130.........140.........150.........160.........170.........180.....
> ....190.........200.........210.........220.........230.........240.........2
> 50.........260.........270.........280.........290.........300.........310...
> ......320.........330.........340.........350.........360.........370........
> .380.........390.........400.........410.........420.........430.........440.
> ........450.........460.........470.........480.........490.........500......
> ...510.........520.........530.........540.........550.........560.........57
> 0.........580.........590.........600.........610.........620.........630....
> .....640.........650.........660.........670.........680.........690.........
> 700.........710.........720.........730.........740.........750.........760..
> .......770.........780.........790.........800.........810.........820.......
> ..830.........840.........850.........860.........870.........880.........890
> .........900.........910.........920.........930.........940.........950.....
> ....960.........970.........980.........990.........1,000 done

. 
. ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_1
> 0 _sim_11 _sim_12 _sim_13

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      _sim_1 |      1,000    23.26113      .03977        23.18309    23.33917
      _sim_2 |      1,000    5.114843    .0141056        5.087163    5.142523
      _sim_3 |      1,000    41.92422     .007087        41.91031    41.93813
      _sim_4 |      1,000     1.28995    .0027265        1.284599      1.2953
      _sim_5 |      1,000    44.36747    .1072674        44.15697    44.57797
-------------+---------------------------------------------------------------
      _sim_6 |      1,000    12.04358    .0336374        11.97757    12.10959
      _sim_7 |      1,000    13.23603    .0656352        13.10724    13.36483
      _sim_8 |      1,000    39.39592    .0095715        39.37714     39.4147
      _sim_9 |      1,000    20.76206    .1440514        20.47938    21.04473
     _sim_10 |      1,000    33.28622     .019597        33.24776    33.32468
-------------+---------------------------------------------------------------
     _sim_11 |      1,000    44.45252    .0081211        44.43659    44.46846
     _sim_12 |      1,000    67.97288    .1046668        67.76749    68.17828
     _sim_13 |      1,000    985.9129    .1843666        985.5511    986.2747

. 
. rename _sim_1 beta1g

. rename _sim_2 se1g

. rename _sim_3 beta2g 

. rename _sim_4 se2g

. rename _sim_5 beta3g 

. rename _sim_6 se3g

. rename _sim_7 ll1g 

. rename _sim_8 ll2g 

. rename _sim_9 ll3g 

. rename _sim_10 ul1g 

. rename _sim_11 ul2g 

. rename _sim_12 ul3g 

. rename _sim_13 RMSE1g

. 
. 
. gen absbias1 = 100 * (beta1g - 30) / 30

. gen absbias2 = 100 * (beta2g - 40) / 40

. gen absbias3 = 100 * (beta3g - 50) / 50

. 
. summarize absbias1 absbias2 absbias3

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    absbias1 |      1,000   -22.46291    4.192123  -26.55727   5.163403
    absbias2 |      1,000    4.810551    .5602747   3.292112   5.726557
    absbias3 |      1,000   -11.26506    6.784187  -21.93382   7.914925

. 
. gen bias1 = 100 * (beta1g - 30) / se1g

. gen bias2 = 100 * (beta2g - 40) / se2g

. gen bias3 = 100 * (beta3g - 50) / se3g

. 
. summarize bias1 bias2 bias3

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       bias1 |      1,000    -130.775    16.45119  -142.8849   31.58958
       bias2 |      1,000     149.961    21.17644   100.6044   182.8912
       bias3 |      1,000   -46.28289    26.30108  -86.16815   35.08329

.                    
. summarize RMSE1g                                  

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      RMSE1g |      1,000    985.9129    5.830185   981.2299   1132.013

.                    
. gen covered1 = (30 >= ll1g) & (30 <= ul1g)

. gen covered2 = (40 >= ll2g) & (40 <= ul2g)

. gen covered3 = (50 >= ll3g) & (50 <= ul3g)

.                    
. summarize covered1 covered2 covered3    

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    covered1 |      1,000           1           0          1          1
    covered2 |      1,000           1           0          1          1
    covered3 |      1,000           1           0          1          1

.                    
. gen length1 = ul1g-ll1g       

. gen length2 = ul2g-ll2g                   

. gen length3 = ul3g-ll3g                   

. 
. summarize length1 length2 length3   

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     length1 |      1,000    20.05019    1.748552   17.65744   24.34246
     length2 |      1,000    5.056603    .3379832   4.545982   5.853828
     length3 |      1,000    47.21083    4.169732   40.84899   52.41678

. save minoaux1, replace
file minoaux1.dta saved

. 
In [19]:
%%stata

simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upper3 RMSE1h, reps(1000) seed(1234): miaux1 

ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_10 _sim_11 _sim_12 _sim_13

rename _sim_1 beta1h
rename _sim_2 se1h
rename _sim_3 beta2h 
rename _sim_4 se2h
rename _sim_5 beta3h 
rename _sim_6 se3h
rename _sim_7 ll1h 
rename _sim_8 ll2h 
rename _sim_9 ll3h 
rename _sim_10 ul1h 
rename _sim_11 ul2h 
rename _sim_12 ul3h 
rename _sim_13 RMSE1h


gen absbias1 = 100 * (beta1h - 30) / 30
gen absbias2 = 100 * (beta2h - 40) / 40
gen absbias3 = 100 * (beta3h - 50) / 50

summarize absbias1 absbias2 absbias3

gen bias1 = 100 * (beta1h - 30) / se1h
gen bias2 = 100 * (beta2h - 40) / se2h
gen bias3 = 100 * (beta3h - 50) / se3h

summarize bias1 bias2 bias3
                   
summarize RMSE1h                                  
                   
gen covered1 = (30 >= ll1h) & (30 <= ul1h)
gen covered2 = (40 >= ll2h) & (40 <= ul2h)
gen covered3 = (50 >= ll3h) & (50 <= ul3h)
                   
summarize covered1 covered2 covered3    
                   
gen length1 = ul1h-ll1h       
gen length2 = ul2h-ll2h                   
gen length3 = ul3h-ll3h                   

summarize length1 length2 length3   
save miaux1, replace
. 
. simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upp
> er3 RMSE1h, reps(1000) seed(1234): miaux1 

      Command: miaux1
       _sim_1: beta1
       _sim_2: se1
       _sim_3: beta2
       _sim_4: se2
       _sim_5: beta3
       _sim_6: se3
       _sim_7: lower1
       _sim_8: lower2
       _sim_9: lower3
      _sim_10: upper1
      _sim_11: upper2
      _sim_12: upper3
      _sim_13: RMSE1h

Simulations (1,000): .........10.........20.........30.........40.........50...
> ......60.........70.........80.........90.........100.........110.........120
> .........130.........140.........150.........160.........170.........180.....
> ....190.........200.........210.........220.........230.........240.........2
> 50.........260.........270.........280.........290.........300.........310...
> ......320.........330.........340.........350.........360.........370........
> .380.........390.........400.........410.........420.........430.........440.
> ........450.........460.........470.........480.........490.........500......
> ...510.........520.........530.........540.........550.........560.........57
> 0.........580.........590.........600.........610.........620.........630....
> .....640.........650.........660.........670.........680.........690.........
> 700.........710.........720.........730.........740.........750.........760..
> .......770.........780.........790.........800.........810.........820.......
> ..830.........840.........850.........860.........870.........880.........890
> .........900.........910.........920.........930.........940.........950.....
> ....960.........970.........980.........990.........1,000 done

. 
. ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_1
> 0 _sim_11 _sim_12 _sim_13

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      _sim_1 |      1,000    32.75058    .0060984        32.73861    32.76255
      _sim_2 |      1,000    5.413813    .0008607        5.412124    5.415502
      _sim_3 |      1,000    40.51805    .0018459        40.51442    40.52167
      _sim_4 |      1,000    1.155773    .0003099        1.155165    1.156381
      _sim_5 |      1,000    58.83162    .0393252        58.75445    58.90879
-------------+---------------------------------------------------------------
      _sim_6 |      1,000    11.78506    .0019157         11.7813    11.78882
      _sim_7 |      1,000    22.13951    .0056935        22.12833    22.15068
      _sim_8 |      1,000    38.25273    .0018721        38.24906     38.2564
      _sim_9 |      1,000     35.7329    .0401093        35.65419    35.81161
     _sim_10 |      1,000    43.36165    .0069033        43.34811     43.3752
-------------+---------------------------------------------------------------
     _sim_11 |      1,000    42.78336    .0020118        42.77941    42.78731
     _sim_12 |      1,000    81.93033    .0388894        81.85401    82.00664
     _sim_13 |      1,000    1046.466    .1324932        1046.206    1046.726

. 
. rename _sim_1 beta1h

. rename _sim_2 se1h

. rename _sim_3 beta2h 

. rename _sim_4 se2h

. rename _sim_5 beta3h 

. rename _sim_6 se3h

. rename _sim_7 ll1h 

. rename _sim_8 ll2h 

. rename _sim_9 ll3h 

. rename _sim_10 ul1h 

. rename _sim_11 ul2h 

. rename _sim_12 ul3h 

. rename _sim_13 RMSE1h

. 
. 
. gen absbias1 = 100 * (beta1h - 30) / 30

. gen absbias2 = 100 * (beta2h - 40) / 40

. gen absbias3 = 100 * (beta3h - 50) / 50

. 
. summarize absbias1 absbias2 absbias3

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    absbias1 |      1,000    9.168598    .6428228  -3.519891   18.34479
    absbias2 |      1,000    1.295115    .1459283  -1.842213   2.990179
    absbias3 |      1,000    17.66324    2.487144  -43.32578   17.79318

. 
. gen bias1 = 100 * (beta1h - 30) / se1h

. gen bias2 = 100 * (beta2h - 40) / se2h

. gen bias3 = 100 * (beta3h - 50) / se3h

. 
. summarize bias1 bias2 bias3

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       bias1 |      1,000    50.80001    3.512375  -18.38705   99.46952
       bias2 |      1,000    44.82236    4.778728  -61.86059   82.64523
       bias3 |      1,000    74.94746    10.64833  -193.7754   75.50605

.                    
. summarize RMSE1h                                  

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      RMSE1h |      1,000    1046.466    4.189804   1046.222   1129.628

.                    
. gen covered1 = (30 >= ll1h) & (30 <= ul1h)

. gen covered2 = (40 >= ll2h) & (40 <= ul2h)

. gen covered3 = (50 >= ll3h) & (50 <= ul3h)

.                    
. summarize covered1 covered2 covered3    

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    covered1 |      1,000           1           0          1          1
    covered2 |      1,000           1           0          1          1
    covered3 |      1,000           1           0          1          1

.                    
. gen length1 = ul1h-ll1h       

. gen length2 = ul2h-ll2h                   

. gen length3 = ul3h-ll3h                   

. 
. summarize length1 length2 length3   

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     length1 |      1,000    21.22215    .1066926   18.36983   22.51254
     length2 |      1,000    4.530632    .0384101   4.528923   5.673164
     length3 |      1,000    46.19743    .2374677   43.82318   51.42883

. save miaux1, replace
file miaux1.dta saved

. 
In [20]:
%%stata

simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upper3 RMSE1i, reps(1000) seed(1234): miaux100 

ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_10 _sim_11 _sim_12 _sim_13

rename _sim_1 beta1i
rename _sim_2 se1i
rename _sim_3 beta2i 
rename _sim_4 se2i
rename _sim_5 beta3i 
rename _sim_6 se3i
rename _sim_7 ll1i 
rename _sim_8 ll2i 
rename _sim_9 ll3i 
rename _sim_10 ul1i 
rename _sim_11 ul2i 
rename _sim_12 ul3i 
rename _sim_13 RMSE1i

gen absbias1 = 100 * (beta1i - 30) / 30
gen absbias2 = 100 * (beta2i - 40) / 40
gen absbias3 = 100 * (beta3i - 50) / 50

summarize absbias1 absbias2 absbias3

gen bias1 = 100 * (beta1i - 30) / se1i
gen bias2 = 100 * (beta2i - 40) / se2i
gen bias3 = 100 * (beta3i - 50) / se3i

summarize bias1 bias2 bias3
                   
summarize RMSE1i                                  
                   
gen covered1 = (30 >= ll1i) & (30 <= ul1i)
gen covered2 = (40 >= ll2i) & (40 <= ul2i)
gen covered3 = (50 >= ll3i) & (50 <= ul3i)
                   
summarize covered1 covered2 covered3    
                   
gen length1 = ul1i-ll1i       
gen length2 = ul2i-ll2i                   
gen length3 = ul3i-ll3i                   

summarize length1 length2 length3  
save miaux100, replace
. 
. simulate beta1 se1 beta2 se2 beta3 se3 lower1 lower2 lower3 upper1 upper2 upp
> er3 RMSE1i, reps(1000) seed(1234): miaux100 

      Command: miaux100
       _sim_1: beta1
       _sim_2: se1
       _sim_3: beta2
       _sim_4: se2
       _sim_5: beta3
       _sim_6: se3
       _sim_7: lower1
       _sim_8: lower2
       _sim_9: lower3
      _sim_10: upper1
      _sim_11: upper2
      _sim_12: upper3
      _sim_13: RMSE1i

Simulations (1,000): .........10.........20.........30.........40.........50...
> ......60.........70.........80.........90.........100.........110.........120
> .........130.........140.........150.........160.........170.........180.....
> ....190.........200.........210.........220.........230.........240.........2
> 50.........260.........270.........280.........290.........300.........310...
> ......320.........330.........340.........350.........360.........370........
> .380.........390.........400.........410.........420.........430.........440.
> ........450.........460.........470.........480.........490.........500......
> ...510.........520.........530.........540.........550.........560.........57
> 0.........580.........590.........600.........610.........620.........630....
> .....640.........650.........660.........670.........680.........690.........
> 700.........710.........720.........730.........740.........750.........760..
> .......770.........780.........790.........800.........810.........820.......
> ..830.........840.........850.........860.........870.........880.........890
> .........900.........910.........920.........930.........940.........950.....
> ....960.........970.........980.........990.........1,000 done

. 
. ci mean _sim_1 _sim_2 _sim_3 _sim_4 _sim_5 _sim_6 _sim_7 _sim_8 _sim_9 _sim_1
> 0 _sim_11 _sim_12 _sim_13

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      _sim_1 |      1,000    31.44767    .1693166        31.11541    31.77992
      _sim_2 |      1,000    5.056805     .007465        5.042156    5.071454
      _sim_3 |      1,000      40.028     .035454        39.95843    40.09757
      _sim_4 |      1,000    1.339959    .0025793        1.334897     1.34502
      _sim_5 |      1,000    55.41926    .3468092         54.7387    56.09981
-------------+---------------------------------------------------------------
      _sim_6 |      1,000    12.70764     .013953        12.68026    12.73502
      _sim_7 |      1,000    21.53633    .1761621        21.19064    21.88202
      _sim_8 |      1,000    37.40168    .0313884        37.34009    37.46327
      _sim_9 |      1,000    30.51228    .3620038        29.80191    31.22266
     _sim_10 |      1,000      41.359     .163497        41.03817    41.67984
-------------+---------------------------------------------------------------
     _sim_11 |      1,000    42.65432    .0397475        42.57632    42.73232
     _sim_12 |      1,000    80.32623    .3331702        79.67243    80.98002
     _sim_13 |      1,000    1099.014    1.299717        1096.464    1101.565

. 
. rename _sim_1 beta1i

. rename _sim_2 se1i

. rename _sim_3 beta2i 

. rename _sim_4 se2i

. rename _sim_5 beta3i 

. rename _sim_6 se3i

. rename _sim_7 ll1i 

. rename _sim_8 ll2i 

. rename _sim_9 ll3i 

. rename _sim_10 ul1i 

. rename _sim_11 ul2i 

. rename _sim_12 ul3i 

. rename _sim_13 RMSE1i

. 
. gen absbias1 = 100 * (beta1i - 30) / 30

. gen absbias2 = 100 * (beta2i - 40) / 40

. gen absbias3 = 100 * (beta3i - 50) / 50

. 
. summarize absbias1 absbias2 absbias3

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    absbias1 |      1,000    4.825554    17.84753  -18.00061   29.55877
    absbias2 |      1,000    .0699968    2.802888  -5.157242   3.376732
    absbias3 |      1,000    10.83851    21.93414  -21.32155   51.01494

. 
. gen bias1 = 100 * (beta1i - 30) / se1i

. gen bias2 = 100 * (beta2i - 40) / se2i

. gen bias3 = 100 * (beta3i - 50) / se3i

. 
. summarize bias1 bias2 bias3

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       bias1 |      1,000    30.85355    105.8439  -103.3199   182.0997
       bias2 |      1,000   -2.148084    86.41637  -166.1422   105.7601
       bias3 |      1,000    44.27564    88.90823   -83.6627    208.403

.                    
. summarize RMSE1i                                  

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      RMSE1i |      1,000    1099.014    41.10065   1022.184   1145.602

.                    
. gen covered1 = (30 >= ll1i) & (30 <= ul1i)

. gen covered2 = (40 >= ll2i) & (40 <= ul2i)

. gen covered3 = (50 >= ll3i) & (50 <= ul3i)

.                    
. summarize covered1 covered2 covered3    

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    covered1 |      1,000           1           0          1          1
    covered2 |      1,000           1           0          1          1
    covered3 |      1,000        .801    .3994478          0          1

.                    
. gen length1 = ul1i-ll1i       

. gen length2 = ul2i-ll2i                   

. gen length3 = ul3i-ll3i                   

. 
. summarize length1 length2 length3  

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     length1 |      1,000    19.82268    .9253684    18.3557   21.05093
     length2 |      1,000    5.252637    .3197271   4.867249    5.71347
     length3 |      1,000    49.81394    1.729626   46.15965   52.64177

. save miaux100, replace
file miaux100.dta saved

. 

Reference List¶

Allison, P. (2012a) ‘Handling Missing Data by Maximum Likelihood’, SAS Global Forum [Preprint].

Allison, P. (2012b) ‘Why Maximum Likelihood is Better Than Multiple Imputation’, Statistical Horizons, 9 July. Available at: https://statisticalhorizons.com/ml-better-than-mi/ (Accessed: 15 May 2023).

Allison, P. (2015) ‘Maximum Likelihood is Better than Multiple Imputation: Part II’, Statistical Horizons, 5 May. Available at: https://statisticalhorizons.com/ml-is-better-than-mi/ (Accessed: 15 May 2023).

Carpenter, J.R. and Kenward, M. (2012) Multiple imputation and its application. John Wiley & Sons.

Collins, L.M., Schafer, J.L. and Kam, C.-M. (2001) ‘A comparison of inclusive and restrictive strategies in modern missing data procedures.’, Psychological Methods, 6(4), pp. 330–351. Available at: https://doi.org/10.1037/1082-989X.6.4.330.

Enders, C.K. (2001) ‘A Primer on Maximum Likelihood Algorithms Available for Use With Missing Data’, Structural Equation Modeling: A Multidisciplinary Journal, 8(1), pp. 128–141. Available at: https://doi.org/10.1207/S15328007SEM0801_7.

Enders, C.K. (2010) Applied missing data analysis. New York: Guilford Press (Methodology in the social sciences).

Hardt, J. et al. (2013) ‘Multiple Imputation of Missing Data: A Simulation Study on a Binary Response’, Open Journal of Statistics, 03(05), pp. 370–378. Available at: https://doi.org/10.4236/ojs.2013.35043.

Hawkes, D. and Plewis, I. (2006) ‘Modelling non-response in the National Child Development Study’, Journal of the Royal Statistical Society, 169(3), pp. 479–491. Available at: https://doi.org/10.1111/j.1467-985X.2006.00401.x.

Hyuk Lee, J. and Huber Jr., J.C. (2021) ‘Evaluation of Multiple Imputation with Large Proportions of Missing Data: How Much Is Too Much?’, Iranian Journal of Public Health [Preprint]. Available at: https://doi.org/10.18502/ijph.v50i7.6626.

Johnson, D.R. and Young, R. (2011) ‘Toward Best Practices in Analyzing Datasets with Missing Data: Comparisons and Recommendations’, Journal of Marriage and Family, 73(5), pp. 926–945. Available at: https://doi.org/10.1111/j.1741-3737.2011.00861.x.

Jones, M.P. (1996) ‘Indicator and Stratification Methods for Missing Explanatory Variables in Multiple Linear Regression’, Journal of the American Statistical Association, 91(433).

Kang, H. (2013) ‘The prevention and handling of the missing data’, Korean Journal of Anesthesiology, 64(5), p. 402. Available at: https://doi.org/10.4097/kjae.2013.64.5.402.

Little, R.J., Carpenter, J.R. and Lee, K.J. (2022) ‘A Comparison of Three Popular Methods for Handling Missing Data: Complete-Case Analysis, Inverse Probability Weighting, and Multiple Imputation’, Sociological Methods & Research, p. 004912412211138. Available at: https://doi.org/10.1177/00491241221113873.

Little, R.J. and Rubin, D.B. (2019) Statistical analysis with missing data. John Wiley & Sons.

Lynch, J. and Von Hippel, P.T. (2013) ‘Efficiency Gains from Using Auxiliary Variables in Imputation’, Cornell University Library [Preprint].

Madley-Dowd, P. et al. (2019) ‘The proportion of missing data should not be used to guide decisions on multiple imputation’, Journal of Clinical Epidemiology, 110, pp. 63–73. Available at: https://doi.org/10.1016/j.jclinepi.2019.02.016.

Marsden, P.V. and Wright, J.D. (eds) (2010) Handbook of survey research. Second edition. Bingley: Emerald Group Publ.

Seaman, S.R. et al. (2012) ‘Combining Multiple Imputation and Inverse‐Probability Weighting’, Biometrics, 68(1), pp. 129–137. Available at: https://doi.org/10.1111/j.1541-0420.2011.01666.x.

Seaman, S.R. and White, I.R. (2013) ‘Review of inverse probability weighting for dealing with missing data’, Statistical Methods in Medical Research, 22(3), pp. 278–295. Available at: https://doi.org/10.1177/0962280210395740.

Silverwood, R. et al. (2021) ‘Handling missing data in the National Child Development Study: User guide (Version 2).’

Von Hippel, P.T. (2009) ‘How to Impute Interactions, Squares, and Other Transformed Variables’, Sociological Methodology, 39(1), pp. 265–291. Available at: https://doi.org/10.1111/j.1467-9531.2009.01215.x.

White, I.R., Royston, P. and Wood, A.M. (2011) ‘Multiple imputation using chained equations: Issues and guidance for practice’, Statistics in Medicine, 30(4), pp. 377–399. Available at: https://doi.org/10.1002/sim.4067.

Young, R. and Johnson, D.R. (2011) ‘Imputing the Missing Y’s: Implications for Survey Producers and Survey Users’, Proceedings of the AAPOR Conference Abstracts [Preprint].