Before any analysis, both the R kernel and the Stata kernel need to be set up properly. We start with R. First we need to install rpy2, the Python-to-R bridge package, using pip.
pip install rpy2
Requirement already satisfied: rpy2 in /opt/anaconda3/lib/python3.12/site-packages (3.5.17) Requirement already satisfied: cffi>=1.15.1 in /opt/anaconda3/lib/python3.12/site-packages (from rpy2) (1.17.1) Requirement already satisfied: jinja2 in /opt/anaconda3/lib/python3.12/site-packages (from rpy2) (3.1.4) Requirement already satisfied: tzlocal in /opt/anaconda3/lib/python3.12/site-packages (from rpy2) (5.3.1) Requirement already satisfied: pycparser in /opt/anaconda3/lib/python3.12/site-packages (from cffi>=1.15.1->rpy2) (2.21) Requirement already satisfied: MarkupSafe>=2.0 in /opt/anaconda3/lib/python3.12/site-packages (from jinja2->rpy2) (2.1.3) Note: you may need to restart the kernel to use updated packages.
After we have done this, we then need to load the rpy2 extension within the Python kernel.
%load_ext rpy2.ipython
Finally, a quick test to make sure R now works. Any time we want R code to run, we have to use the %%R cell magic.
%%R
x <- c(1,2,3,4,5)
mean(x)
[1] 3
Now we have to install the Python-to-Stata dependencies to get Stata to work in a Python kernel. We start by installing the pystata package.
pip install pystata
Requirement already satisfied: pystata in /opt/anaconda3/lib/python3.12/site-packages (0.0.1) Requirement already satisfied: pandas in /opt/anaconda3/lib/python3.12/site-packages (from pystata) (2.2.2) Requirement already satisfied: numpy>=1.26.0 in /opt/anaconda3/lib/python3.12/site-packages (from pandas->pystata) (1.26.4) Requirement already satisfied: python-dateutil>=2.8.2 in /opt/anaconda3/lib/python3.12/site-packages (from pandas->pystata) (2.9.0.post0) Requirement already satisfied: pytz>=2020.1 in /opt/anaconda3/lib/python3.12/site-packages (from pandas->pystata) (2024.1) Requirement already satisfied: tzdata>=2022.7 in /opt/anaconda3/lib/python3.12/site-packages (from pandas->pystata) (2023.3) Requirement already satisfied: six>=1.5 in /opt/anaconda3/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas->pystata) (1.16.0) Note: you may need to restart the kernel to use updated packages.
We then also install the stata_setup package.
pip install --upgrade --user stata_setup
Requirement already satisfied: stata_setup in /Users/scottoatley/.local/lib/python3.12/site-packages (0.1.3) Requirement already satisfied: numpy in /opt/anaconda3/lib/python3.12/site-packages (from stata_setup) (1.26.4) Requirement already satisfied: pandas in /opt/anaconda3/lib/python3.12/site-packages (from stata_setup) (2.2.2) Requirement already satisfied: ipython in /opt/anaconda3/lib/python3.12/site-packages (from stata_setup) (8.27.0) Requirement already satisfied: decorator in /opt/anaconda3/lib/python3.12/site-packages (from ipython->stata_setup) (5.1.1) Requirement already satisfied: jedi>=0.16 in /opt/anaconda3/lib/python3.12/site-packages (from ipython->stata_setup) (0.19.1) Requirement already satisfied: matplotlib-inline in /opt/anaconda3/lib/python3.12/site-packages (from ipython->stata_setup) (0.1.6) Requirement already satisfied: prompt-toolkit<3.1.0,>=3.0.41 in /opt/anaconda3/lib/python3.12/site-packages (from ipython->stata_setup) (3.0.43) Requirement already satisfied: pygments>=2.4.0 in /opt/anaconda3/lib/python3.12/site-packages (from ipython->stata_setup) (2.15.1) Requirement already satisfied: stack-data in /opt/anaconda3/lib/python3.12/site-packages (from ipython->stata_setup) (0.2.0) Requirement already satisfied: traitlets>=5.13.0 in /opt/anaconda3/lib/python3.12/site-packages (from ipython->stata_setup) (5.14.3) Requirement already satisfied: pexpect>4.3 in /opt/anaconda3/lib/python3.12/site-packages (from ipython->stata_setup) (4.8.0) Requirement already satisfied: python-dateutil>=2.8.2 in /opt/anaconda3/lib/python3.12/site-packages (from pandas->stata_setup) (2.9.0.post0) Requirement already satisfied: pytz>=2020.1 in /opt/anaconda3/lib/python3.12/site-packages (from pandas->stata_setup) (2024.1) Requirement already satisfied: tzdata>=2022.7 in /opt/anaconda3/lib/python3.12/site-packages (from pandas->stata_setup) (2023.3) Requirement already satisfied: parso<0.9.0,>=0.8.3 in /opt/anaconda3/lib/python3.12/site-packages (from jedi>=0.16->ipython->stata_setup) (0.8.3) Requirement already satisfied: ptyprocess>=0.5 in /opt/anaconda3/lib/python3.12/site-packages (from pexpect>4.3->ipython->stata_setup) (0.7.0) Requirement already satisfied: wcwidth in /opt/anaconda3/lib/python3.12/site-packages (from prompt-toolkit<3.1.0,>=3.0.41->ipython->stata_setup) (0.2.5) Requirement already satisfied: six>=1.5 in /opt/anaconda3/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas->stata_setup) (1.16.0) Requirement already satisfied: executing in /opt/anaconda3/lib/python3.12/site-packages (from stack-data->ipython->stata_setup) (0.8.3) Requirement already satisfied: asttokens in /opt/anaconda3/lib/python3.12/site-packages (from stack-data->ipython->stata_setup) (2.0.5) Requirement already satisfied: pure-eval in /opt/anaconda3/lib/python3.12/site-packages (from stack-data->ipython->stata_setup) (0.2.2) Note: you may need to restart the kernel to use updated packages.
We now import the stata_setup package into the Python kernel.
import stata_setup
And finally we initialise Stata itself within the kernel, pointing stata_setup at the local installation directory and edition (here SE).
stata_setup.config("/Applications/Stata", "se")
___ ____ ____ ____ ____ ® /__ / ____/ / ____/ 18.0 ___/ / /___/ / /___/ SE—Standard Edition Statistics and Data Science Copyright 1985-2023 StataCorp LLC StataCorp 4905 Lakeway Drive College Station, Texas 77845 USA 800-STATA-PC https://www.stata.com 979-696-4600 stata@stata.com Stata license: Unlimited-user network, expiring 14 Sep 2025 Serial number: 401809305318 Licensed to: Scott oatley University of Edinburgh Notes: 1. Unicode is supported; see help unicode_advice. 2. Maximum number of variables is set to 5,000 but can be increased; see help set_maxvar.
We now check that Stata works with a simple display command.
%%stata
display 1+1
. . display 1+1 2 .
And now back to R to make sure both kernels can run in the same session.
%%R
y <- c(1,2,3,NA)
is.na(y) # returns a vector (F F F T)
[1] FALSE FALSE FALSE TRUE
Each missing data handling method will now be explored using the four aforementioned software programs: R, Stata, Python, and Mplus. Some methods are not possible in certain packages; when that happens, only the relevant software is shown. We start with a simple simulation of an OLS linear regression model with three continuous independent variables, each normally distributed, with means ranging from 40 to 200 and standard deviations ranging from 5 to 50. We proceed in the order R, then Stata, then Python, and finally Mplus, and all of the methods will be demonstrated.
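For reference, a quick derivation from the stated design (not part of the original notebook): the implied population mean of the outcome is
$$\mathbb{E}[y] = 30(40) + 40(200) + 50(150) + 5000 = 21{,}700,$$
which is why a cut-off of roughly 21,791 (close to the sample mean of y in these simulations) reappears later when the MAR mechanism is defined.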
%%R
# Set seed to ensure reproducibility
set.seed(123)
# Generate metric independent variables
x1 <- rnorm(1000, mean = 40, sd = 12)
x2 <- rnorm(1000, mean = 200, sd = 50)
x3 <- rnorm(1000, mean = 150, sd = 5)
# Generate metric dependent variable
y <- 30*x1 + 40*x2 + 50*x3 + rnorm(1000, mean = 5000, sd = 1500)
# Create data frame
data <- data.frame(y, x1, x2, x3)
# Run regression
model <- lm(y ~ x1 + x2 + x3, data = data)
summary(model)
Call: lm(formula = y ~ x1 + x2 + x3, data = data) Residuals: Min 1Q Median 3Q Max -4584.3 -950.2 0.7 1012.9 4338.4 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2726.9512 1462.4421 1.865 0.0625 . x1 29.8375 3.9740 7.508 1.33e-13 *** x2 39.7564 0.9369 42.432 < 2e-16 *** x3 65.4442 9.6348 6.792 1.89e-11 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1489 on 996 degrees of freedom Multiple R-squared: 0.6662, Adjusted R-squared: 0.6652 F-statistic: 662.5 on 3 and 996 DF, p-value: < 2.2e-16
The same model in Stata:
%%stata
* Set seed
set seed 123
* Set number of observations
set obs 1000
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)
* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)
* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)
* Generate metric dependent variable with large coefficients
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
* Run Model
regress y x1 x2 x3
etable
. . * Set seed . set seed 123 . . * Set number of observations . set obs 1000 Number of observations (_N) was 0, now 1,000. . . * Generate metric independent variable #1 . drawnorm x1, n(1000) means(40) sds(12) . . * Generate metric independent variable #2 . drawnorm x2, n(1000) means(200) sds(50) . . * Generate metric independent variable #3 . drawnorm x3, n(1000) means(150) sds(5) . . * Generate metric dependent variable with large coefficients . gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500) . . * Run Model . regress y x1 x2 x3 Source | SS df MS Number of obs = 1,000 -------------+---------------------------------- F(3, 996) = 714.04 Model | 4.7321e+09 3 1.5774e+09 Prob > F = 0.0000 Residual | 2.2002e+09 996 2209043.89 R-squared = 0.6826 -------------+---------------------------------- Adj R-squared = 0.6817 Total | 6.9323e+09 999 6939202.43 Root MSE = 1486.3 ------------------------------------------------------------------------------ y | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- x1 | 29.4092 4.032441 7.29 0.000 21.49614 37.32225 x2 | 42.71715 .9441681 45.24 0.000 40.86437 44.56994 x3 | 54.51955 9.245538 5.90 0.000 36.37658 72.66252 _cons | 3778.896 1402.667 2.69 0.007 1026.375 6531.417 ------------------------------------------------------------------------------ . . etable --------------------------------- y --------------------------------- x1 29.409 (4.032) x2 42.717 (0.944) x3 54.520 (9.246) Intercept 3778.896 (1402.667) Number of observations 1000 --------------------------------- .
Now don't worry too much that the results from the models are not identical across packages. Although we set the same seed in each one, R, Stata, and NumPy use different random-number generators, so each package simulates a slightly different sample. Substantively the results are the same, and that is all that matters here.
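If exact agreement were ever needed, one option is to simulate the data once in Python and share the same DataFrame with the other engines. The following is only a minimal sketch: it assumes the -i option of the rpy2 %%R magic and the -d option of the pystata %%stata magic transfer a pandas DataFrame, and the variable name shared is made up for illustration.
import numpy as np
import pandas as pd
np.random.seed(123)
shared = pd.DataFrame({
    'x1': np.random.normal(40, 12, 1000),
    'x2': np.random.normal(200, 50, 1000),
    'x3': np.random.normal(150, 5, 1000),
})
shared['y'] = 30*shared['x1'] + 40*shared['x2'] + 50*shared['x3'] + np.random.normal(5000, 1500, 1000)
%%R -i shared
# 'shared' arrives as an R data.frame, so the regression uses exactly the same draws
summary(lm(y ~ x1 + x2 + x3, data = shared))
%%stata -d shared
* 'shared' is loaded as the current Stata dataset
regress y x1 x2 x3
With that aside, now for Python.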
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Set seed for reproducibility
np.random.seed(123)
# Generate metric independent variables
x1 = np.random.normal(loc=40, scale=12, size=1000)
x2 = np.random.normal(loc=200, scale=50, size=1000)
x3 = np.random.normal(loc=150, scale=5, size=1000)
# Generate metric dependent variable
y = 30*x1 + 40*x2 + 50*x3 + np.random.normal(loc=5000, scale=1500, size=1000)
# Create DataFrame
df = pd.DataFrame({'y': y, 'x1': x1, 'x2': x2, 'x3': x3})
# Run regression
X = df[['x1', 'x2', 'x3']] # Independent variables
X = sm.add_constant(X) # Add intercept
model = sm.OLS(df['y'], X).fit() # Fit model
# Display summary
print(model.summary())
OLS Regression Results ============================================================================== Dep. Variable: y R-squared: 0.618 Model: OLS Adj. R-squared: 0.617 Method: Least Squares F-statistic: 537.7 Date: Mon, 05 May 2025 Prob (F-statistic): 1.03e-207 Time: 11:34:34 Log-Likelihood: -8719.9 No. Observations: 1000 AIC: 1.745e+04 Df Residuals: 996 BIC: 1.747e+04 Df Model: 3 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ const 5257.2104 1467.373 3.583 0.000 2377.713 8136.707 x1 26.4715 3.914 6.764 0.000 18.792 34.151 x2 38.7536 0.981 39.513 0.000 36.829 40.678 x3 51.5446 9.574 5.384 0.000 32.757 70.332 ============================================================================== Omnibus: 2.499 Durbin-Watson: 1.897 Prob(Omnibus): 0.287 Jarque-Bera (JB): 2.519 Skew: 0.122 Prob(JB): 0.284 Kurtosis: 2.964 Cond. No. 8.01e+03 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 8.01e+03. This might indicate that there are strong multicollinearity or other numerical problems.
The simple OLS linear regression model has now been estimated. We now fit the same model within an SEM framework in all three packages, starting with R, then Stata, and then Python.
%%R
# Load necessary libraries
library(lavaan)
library(MASS) # For mvrnorm (multivariate normal distribution)
# Set seed for reproducibility
set.seed(123)
# Generate metric independent variables
x1 <- rnorm(1000, mean = 40, sd = 12)
x2 <- rnorm(1000, mean = 200, sd = 50)
x3 <- rnorm(1000, mean = 150, sd = 5)
# Generate metric dependent variable
y <- 30*x1 + 40*x2 + 50*x3 + rnorm(1000, mean = 5000, sd = 1500)
# Create a dataframe
data <- data.frame(y, x1, x2, x3)
# Define the SEM model
sem_model <- '
y ~ x1 + x2 + x3
'
# Fit the model using Maximum Likelihood with Missing Values (MLMV)
fit <- sem(sem_model, data = data, missing = "ml")
# Print model summary
summary(fit, fit.measures = TRUE, standardized = TRUE)
lavaan 0.6-18 ended normally after 1 iteration Estimator ML Optimization method NLMINB Number of model parameters 5 Number of observations 1000 Number of missing patterns 1 Model Test User Model: Test statistic 0.000 Degrees of freedom 0 Model Test Baseline Model: Test statistic 1097.097 Degrees of freedom 3 P-value 0.000 User Model versus Baseline Model: Comparative Fit Index (CFI) 1.000 Tucker-Lewis Index (TLI) 1.000 Robust Comparative Fit Index (CFI) 1.000 Robust Tucker-Lewis Index (TLI) 1.000 Loglikelihood and Information Criteria: Loglikelihood user model (H0) -8722.665 Loglikelihood unrestricted model (H1) -8722.665 Akaike (AIC) 17455.330 Bayesian (BIC) 17479.869 Sample-size adjusted Bayesian (SABIC) 17463.989 Root Mean Square Error of Approximation: RMSEA 0.000 90 Percent confidence interval - lower 0.000 90 Percent confidence interval - upper 0.000 P-value H_0: RMSEA <= 0.050 NA P-value H_0: RMSEA >= 0.080 NA Robust RMSEA 0.000 90 Percent confidence interval - lower 0.000 90 Percent confidence interval - upper 0.000 P-value H_0: Robust RMSEA <= 0.050 NA P-value H_0: Robust RMSEA >= 0.080 NA Standardized Root Mean Square Residual: SRMR 0.000 Parameter Estimates: Standard errors Standard Information Observed Observed information based on Hessian Regressions: Estimate Std.Err z-value P(>|z|) Std.lv Std.all y ~ x1 29.837 3.966 7.523 0.000 29.837 0.138 x2 39.756 0.935 42.518 0.000 39.756 0.780 x3 65.444 9.616 6.806 0.000 65.444 0.124 Intercepts: Estimate Std.Err z-value P(>|z|) Std.lv Std.all .y 2726.951 1459.514 1.868 0.062 2726.951 1.060 Variances: Estimate Std.Err z-value P(>|z|) Std.lv Std.all .y 2207681.041 98745.450 22.357 0.000 2207681.041 0.334
This is lavaan 0.6-18 lavaan is FREE software! Please report any bugs. In addition: Warning messages: 1: lavaan->lav_data_full(): some observed variances are larger than 1000000 use varTable(fit) to investigate 2: lavaan->lavData(): some observed variances in the sample covariance matrix are larger than 1000000. 3: lavaan->lavData(): some observed variances in the sample covariance matrix are larger than 1000000. 4: lavaan->lavData(): some observed variances in the sample covariance matrix are larger than 1000000. 5: lavaan->lavData(): some observed variances in the sample covariance matrix are larger than 1000000.
%%stata
clear
* Set seed
set seed 123
* Set number of observations
set obs 1000
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)
* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)
* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)
* Generate metric dependent variable with large coefficients
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
sem (y <- x1 x2 x3), method(mlmv)
etable, append
. . clear . . * Set seed . set seed 123 . . * Set number of observations . set obs 1000 Number of observations (_N) was 0, now 1,000. . . * Generate metric independent variable #1 . drawnorm x1, n(1000) means(40) sds(12) . . * Generate metric independent variable #2 . drawnorm x2, n(1000) means(200) sds(50) . . * Generate metric independent variable #3 . drawnorm x3, n(1000) means(150) sds(5) . . * Generate metric dependent variable with large coefficients . gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500) . . sem (y <- x1 x2 x3), method(mlmv) Endogenous variables Observed: y Exogenous variables Observed: x1 x2 x3 Fitting target model: Iteration 0: Log likelihood = -20967.462 Iteration 1: Log likelihood = -20967.462 Structural equation model Number of obs = 1,000 Estimation method: mlmv Log likelihood = -20967.462 ------------------------------------------------------------------------------ | OIM | Coefficient std. err. z P>|z| [95% conf. interval] -------------+---------------------------------------------------------------- Structural | y | x1 | 29.4092 4.024368 7.31 0.000 21.52158 37.29681 x2 | 42.71715 .9422779 45.33 0.000 40.87032 44.56398 x3 | 54.51955 9.227029 5.91 0.000 36.43491 72.6042 _cons | 3778.896 1399.858 2.70 0.007 1035.223 6522.568 -------------+---------------------------------------------------------------- var(e.y)| 2200208 98396.28 2015565 2401765 ------------------------------------------------------------------------------ LR test of model vs. saturated: chi2(0) = 0.00 Prob > chi2 = . . . etable, append --------------------------------------------- y 2 --------------------------------------------- x1 29.409 29.409 (4.032) (4.024) x2 42.717 42.717 (0.944) (0.942) x3 54.520 54.520 (9.246) (9.227) Intercept 3778.896 3778.896 (1402.667) (1399.858) var(e.y) 2.20e+06 (98396.280) Number of observations 1000 1000 --------------------------------------------- .
For Python, we use the semopy package for SEM, which we first install.
pip install semopy
Requirement already satisfied: semopy in /opt/anaconda3/lib/python3.12/site-packages (2.3.11) Requirement already satisfied: scipy in /opt/anaconda3/lib/python3.12/site-packages (from semopy) (1.13.1) Requirement already satisfied: numpy in /opt/anaconda3/lib/python3.12/site-packages (from semopy) (1.26.4) Requirement already satisfied: pandas in /opt/anaconda3/lib/python3.12/site-packages (from semopy) (2.2.2) Requirement already satisfied: sympy in /opt/anaconda3/lib/python3.12/site-packages (from semopy) (1.13.2) Requirement already satisfied: scikit-learn in /opt/anaconda3/lib/python3.12/site-packages (from semopy) (1.5.1) Requirement already satisfied: statsmodels in /opt/anaconda3/lib/python3.12/site-packages (from semopy) (0.14.2) Requirement already satisfied: numdifftools in /opt/anaconda3/lib/python3.12/site-packages (from semopy) (0.9.41) Requirement already satisfied: python-dateutil>=2.8.2 in /opt/anaconda3/lib/python3.12/site-packages (from pandas->semopy) (2.9.0.post0) Requirement already satisfied: pytz>=2020.1 in /opt/anaconda3/lib/python3.12/site-packages (from pandas->semopy) (2024.1) Requirement already satisfied: tzdata>=2022.7 in /opt/anaconda3/lib/python3.12/site-packages (from pandas->semopy) (2023.3) Requirement already satisfied: joblib>=1.2.0 in /opt/anaconda3/lib/python3.12/site-packages (from scikit-learn->semopy) (1.4.2) Requirement already satisfied: threadpoolctl>=3.1.0 in /opt/anaconda3/lib/python3.12/site-packages (from scikit-learn->semopy) (3.5.0) Requirement already satisfied: patsy>=0.5.6 in /opt/anaconda3/lib/python3.12/site-packages (from statsmodels->semopy) (0.5.6) Requirement already satisfied: packaging>=21.3 in /opt/anaconda3/lib/python3.12/site-packages (from statsmodels->semopy) (24.1) Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/anaconda3/lib/python3.12/site-packages (from sympy->semopy) (1.3.0) Requirement already satisfied: six in /opt/anaconda3/lib/python3.12/site-packages (from patsy>=0.5.6->statsmodels->semopy) (1.16.0) Note: you may need to restart the kernel to use updated packages.
import numpy as np
import pandas as pd
from semopy import Model
# Set random seed for reproducibility
np.random.seed(123)
# Generate metric independent variables
x1 = np.random.normal(loc=40, scale=12, size=1000)
x2 = np.random.normal(loc=200, scale=50, size=1000)
x3 = np.random.normal(loc=150, scale=5, size=1000)
# Generate metric dependent variable
y = 30*x1 + 40*x2 + 50*x3 + np.random.normal(loc=5000, scale=1500, size=1000)
# Create a DataFrame
data = pd.DataFrame({'y': y, 'x1': x1, 'x2': x2, 'x3': x3})
# Define the SEM model in semopy syntax
sem_model = """
y ~ x1 + x2 + x3
"""
# Create and fit the model
model = Model(sem_model)
model.fit(data)
# Print model results
print(model.inspect())
lval op rval Estimate Std. Err z-value p-value 0 y ~ x1 2.826314e+01 4.470083 6.322734 2.569751e-10 1 y ~ x2 3.870783e+01 1.120236 34.553275 0.000000e+00 2 y ~ x3 9.507102e+00 10.935110 0.869411 3.846225e-01 3 y ~~ y 2.875963e+06 128616.985180 22.360680 0.000000e+00
Now we move on to a simple MCAR (missing completely at random) scenario, starting with R and following the same software order as before.
%%R
# Set seed for reproducibility
set.seed(123)
# Generate metric independent variables
x1 <- rnorm(1000, mean = 40, sd = 12)
x2 <- rnorm(1000, mean = 200, sd = 50)
x3 <- rnorm(1000, mean = 150, sd = 5)
# Generate metric dependent variable
y <- 30*x1 + 40*x2 + 50*x3 + rnorm(1000, mean = 5000, sd = 1500)
# Introduce 50% Missing Completely at Random (MCAR) in x2
rmcar <- rbinom(1000, 1, 0.5) # Binary indicator: 50% missingness
x2[rmcar == 0] <- NA # Set x2 to NA for half the observations
# Create data frame
data <- data.frame(y, x1, x2, x3)
# Run OLS regression (handling missing data with listwise deletion by default)
ols_model <- lm(y ~ x1 + x2 + x3, data = data)
# Display regression results
summary(ols_model)
Call: lm(formula = y ~ x1 + x2 + x3, data = data) Residuals: Min 1Q Median 3Q Max -4464.9 -1057.0 -31.6 1042.2 4294.4 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 386.332 2177.578 0.177 0.859 x1 29.499 5.908 4.993 8.28e-07 *** x2 40.483 1.364 29.671 < 2e-16 *** x3 80.047 14.416 5.553 4.62e-08 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1548 on 489 degrees of freedom (507 observations deleted due to missingness) Multiple R-squared: 0.6697, Adjusted R-squared: 0.6677 F-statistic: 330.5 on 3 and 489 DF, p-value: < 2.2e-16
%%stata
clear
* Set seed
set seed 123
* Set number of observations
set obs 1000
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)
* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)
* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)
* Generate metric dependent variable with large coefficients
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
* Generate MCAR
gen rmcar = rbinomial(1, 0.5) // MCAR: 50% chance of missingness (binary random)
replace x2 = . if rmcar == 0 // Set x to missing where rmcar == 0
regress y x1 x2 x3
etable, append
. . clear . . * Set seed . set seed 123 . . * Set number of observations . set obs 1000 Number of observations (_N) was 0, now 1,000. . . * Generate metric independent variable #1 . drawnorm x1, n(1000) means(40) sds(12) . . * Generate metric independent variable #2 . drawnorm x2, n(1000) means(200) sds(50) . . * Generate metric independent variable #3 . drawnorm x3, n(1000) means(150) sds(5) . . * Generate metric dependent variable with large coefficients . gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500) . . * Generate MCAR . gen rmcar = rbinomial(1, 0.5) // MCAR: 50% chance of missingness (binary ran > dom) . replace x2 = . if rmcar == 0 // Set x to missing where rmcar == 0 (492 real changes made, 492 to missing) . . regress y x1 x2 x3 Source | SS df MS Number of obs = 508 -------------+---------------------------------- F(3, 504) = 325.45 Model | 2.3023e+09 3 767444127 Prob > F = 0.0000 Residual | 1.1885e+09 504 2358099.32 R-squared = 0.6595 -------------+---------------------------------- Adj R-squared = 0.6575 Total | 3.4908e+09 507 6885235.58 Root MSE = 1535.6 ------------------------------------------------------------------------------ y | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- x1 | 25.21384 5.94201 4.24 0.000 13.53968 36.888 x2 | 42.55748 1.384992 30.73 0.000 39.83641 45.27855 x3 | 48.17947 13.13299 3.67 0.000 22.37733 73.98162 _cons | 4912.711 2007.467 2.45 0.015 968.6771 8856.744 ------------------------------------------------------------------------------ . . etable, append -------------------------------------------------------- y 2 y -------------------------------------------------------- x1 29.409 29.409 25.214 (4.032) (4.024) (5.942) x2 42.717 42.717 42.557 (0.944) (0.942) (1.385) x3 54.520 54.520 48.179 (9.246) (9.227) (13.133) Intercept 3778.896 3778.896 4912.711 (1402.667) (1399.858) (2007.467) var(e.y) 2.20e+06 (98396.280) Number of observations 1000 1000 508 -------------------------------------------------------- .
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Set seed for reproducibility
np.random.seed(123)
# Set number of observations
n = 1000
# Generate independent variables
x1 = np.random.normal(loc=40, scale=12, size=n)
x2 = np.random.normal(loc=200, scale=50, size=n)
x3 = np.random.normal(loc=150, scale=5, size=n)
# Generate dependent variable
y = 30*x1 + 40*x2 + 50*x3 + np.random.normal(loc=5000, scale=1500, size=n)
# Introduce 50% Missing Completely at Random (MCAR) in x2
rmcar = np.random.binomial(1, 0.5, size=n) # Binary indicator: 50% missingness
x2[rmcar == 0] = np.nan # Set x2 to NaN for half the observations
# Create DataFrame
data = pd.DataFrame({'y': y, 'x1': x1, 'x2': x2, 'x3': x3})
# Drop rows with missing values (listwise deletion) and run OLS regression
data_complete = data.dropna() # Remove rows with NaN
X = data_complete[['x1', 'x2', 'x3']]
X = sm.add_constant(X) # Add intercept
y = data_complete['y']
# Fit OLS model
model = sm.OLS(y, X).fit()
# Display results
print(model.summary())
OLS Regression Results ============================================================================== Dep. Variable: y R-squared: 0.631 Model: OLS Adj. R-squared: 0.629 Method: Least Squares F-statistic: 295.0 Date: Mon, 05 May 2025 Prob (F-statistic): 1.19e-111 Time: 11:34:35 Log-Likelihood: -4541.7 No. Observations: 522 AIC: 9091. Df Residuals: 518 BIC: 9108. Df Model: 3 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ const 4652.1226 2050.412 2.269 0.024 623.978 8680.268 x1 20.5814 5.133 4.010 0.000 10.498 30.665 x2 39.7328 1.342 29.605 0.000 37.096 42.369 x3 55.6552 13.271 4.194 0.000 29.584 81.726 ============================================================================== Omnibus: 4.773 Durbin-Watson: 1.958 Prob(Omnibus): 0.092 Jarque-Bera (JB): 4.593 Skew: 0.223 Prob(JB): 0.101 Kurtosis: 3.108 Cond. No. 8.24e+03 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 8.24e+03. This might indicate that there are strong multicollinearity or other numerical problems.
Now we move on to a MAR (missing at random) scenario, once more starting with R. Here the probability that x2 is observed is a logistic function of y centred at 21,791 (close to the mean of y derived above), so cases with larger values of y are more likely to retain x2.
%%R
# Set seed to ensure reproducibility
set.seed(123)
# Generate metric independent variables
x1 <- rnorm(1000, mean = 40, sd = 12)
x2 <- rnorm(1000, mean = 200, sd = 50)
x3 <- rnorm(1000, mean = 150, sd = 5)
# Generate metric dependent variable
y <- 30*x1 + 40*x2 + 50*x3 + rnorm(1000, mean = 5000, sd = 1500)
# Generate MAR (Missing At Random) pattern
prob_mar <- 1 / (1 + exp(-(y - 21791))) # Logistic function
rmar <- rbinom(1000, 1, prob_mar) # Generate missing indicator based on probability
x2[rmar == 0] <- NA # Set x2 to NA for MAR missingness
# Create data frame
data <- data.frame(y, x1, x2, x3)
# Run OLS regression (listwise deletion by default)
ols_model <- lm(y ~ x1 + x2 + x3, data = data, na.action = na.omit)
# Display regression results
summary(ols_model)
Call: lm(formula = y ~ x1 + x2 + x3, data = data, na.action = na.omit) Residuals: Min 1Q Median 3Q Max -2876.0 -828.4 -78.3 670.2 3761.0 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 12209.232 1598.736 7.637 1.13e-13 *** x1 19.693 4.588 4.292 2.12e-05 *** x2 24.731 1.351 18.300 < 2e-16 *** x3 33.021 10.191 3.240 0.00127 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1161 on 504 degrees of freedom (492 observations deleted due to missingness) Multiple R-squared: 0.4173, Adjusted R-squared: 0.4138 F-statistic: 120.3 on 3 and 504 DF, p-value: < 2.2e-16
%%stata
clear
* Set seed
set seed 123
* Set number of observations
set obs 1000
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)
* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)
* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)
* Generate metric dependent variable with large coefficients
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
* Generate MAR
gen prob_mar = logistic(y-21791)
gen rmar = 0 if prob_mar==0
replace x2 = . if rmar == 0 // Set x to missing where rmar == 0
regress y x1 x2 x3
etable, append
. . clear . . * Set seed . set seed 123 . . * Set number of observations . set obs 1000 Number of observations (_N) was 0, now 1,000. . . * Generate metric independent variable #1 . drawnorm x1, n(1000) means(40) sds(12) . . * Generate metric independent variable #2 . drawnorm x2, n(1000) means(200) sds(50) . . * Generate metric independent variable #3 . drawnorm x3, n(1000) means(150) sds(5) . . * Generate metric dependent variable with large coefficients . gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500) . . * Generate MAR . gen prob_mar = logistic(y-21791) . gen rmar = 0 if prob_mar==0 (500 missing values generated) . replace x2 = . if rmar == 0 // Set x to missing where rmar == 0 (500 real changes made, 500 to missing) . regress y x1 x2 x3 Source | SS df MS Number of obs = 500 -------------+---------------------------------- F(3, 496) = 122.89 Model | 552925039 3 184308346 Prob > F = 0.0000 Residual | 743864709 496 1499727.24 R-squared = 0.4264 -------------+---------------------------------- Adj R-squared = 0.4229 Total | 1.2968e+09 499 2598777.05 Root MSE = 1224.6 ------------------------------------------------------------------------------ y | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- x1 | 20.14886 4.982177 4.04 0.000 10.36009 29.93763 x2 | 26.96303 1.407434 19.16 0.000 24.19777 29.7283 x3 | 19.7006 10.99788 1.79 0.074 -1.907582 41.30878 _cons | 13742.28 1717.077 8.00 0.000 10368.64 17115.92 ------------------------------------------------------------------------------ . . etable, append ------------------------------------------------------------------- y 2 y y ------------------------------------------------------------------- x1 29.409 29.409 25.214 20.149 (4.032) (4.024) (5.942) (4.982) x2 42.717 42.717 42.557 26.963 (0.944) (0.942) (1.385) (1.407) x3 54.520 54.520 48.179 19.701 (9.246) (9.227) (13.133) (10.998) Intercept 3778.896 3778.896 4912.711 13742.276 (1402.667) (1399.858) (2007.467) (1717.077) var(e.y) 2.20e+06 (98396.280) Number of observations 1000 1000 508 500 ------------------------------------------------------------------- .
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Set seed for reproducibility
np.random.seed(123)
# Set number of observations
n = 1000
# Generate independent variables
x1 = np.random.normal(loc=40, scale=12, size=n)
x2 = np.random.normal(loc=200, scale=50, size=n)
x3 = np.random.normal(loc=150, scale=5, size=n)
# Generate dependent variable
y = 30*x1 + 40*x2 + 50*x3 + np.random.normal(loc=5000, scale=1500, size=n)
# Generate MAR (Missing At Random) for x2
# Logistic function: scale y to avoid extreme probabilities
scaled_y = (y - np.mean(y)) / 1000
prob_mar = 1 / (1 + np.exp(-scaled_y)) # Values between 0 and 1
# Generate MAR indicator
rmar = np.random.binomial(n=1, p=prob_mar) # Correct binomial usage
# Apply missingness to x2
x2[rmar == 0] = np.nan
# Create DataFrame
data = pd.DataFrame({'y': y, 'x1': x1, 'x2': x2, 'x3': x3})
# Run OLS regression (listwise deletion)
data_clean = data.dropna() # Remove rows with missing values
X = sm.add_constant(data_clean[['x1', 'x2', 'x3']]) # Add intercept
model = sm.OLS(data_clean['y'], X).fit()
# Display regression results
print(model.summary())
OLS Regression Results ============================================================================== Dep. Variable: y R-squared: 0.426 Model: OLS Adj. R-squared: 0.423 Method: Least Squares F-statistic: 123.7 Date: Mon, 05 May 2025 Prob (F-statistic): 6.18e-60 Time: 11:34:35 Log-Likelihood: -4337.5 No. Observations: 504 AIC: 8683. Df Residuals: 500 BIC: 8700. Df Model: 3 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ const 9446.1975 1890.284 4.997 0.000 5732.318 1.32e+04 x1 18.1592 5.048 3.598 0.000 8.242 28.076 x2 28.6725 1.507 19.027 0.000 25.712 31.633 x3 44.9865 12.007 3.747 0.000 21.396 68.577 ============================================================================== Omnibus: 3.628 Durbin-Watson: 1.937 Prob(Omnibus): 0.163 Jarque-Bera (JB): 3.515 Skew: 0.160 Prob(JB): 0.173 Kurtosis: 2.746 Cond. No. 8.81e+03 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 8.81e+03. This might indicate that there are strong multicollinearity or other numerical problems.
Now we have our baseline MAR model for each software package. We can now focus on implementing potential solutions for handling the missing data, starting with single mean imputation.
%%R
# Set seed for reproducibility
set.seed(123)
# Generate metric independent variables
x1 <- rnorm(1000, mean = 40, sd = 12)
x2 <- rnorm(1000, mean = 200, sd = 50)
x3 <- rnorm(1000, mean = 150, sd = 5)
# Generate metric dependent variable
y <- 30*x1 + 40*x2 + 50*x3 + rnorm(1000, mean = 5000, sd = 1500)
# Generate MAR (Missing At Random) for x2 using logistic function
prob_mar <- 1 / (1 + exp(-(y - 21791))) # Logistic function
rmar <- rbinom(1000, 1, prob_mar) # Generate MAR indicator
x2[rmar == 0] <- NA # Set x2 to missing where rmar == 0
# Display the summary statistics for x2 (mean only)
mean_x2 <- mean(x2, na.rm = TRUE)
# Replace missing values of x2 with the mean of x2
x2[is.na(x2)] <- mean_x2
# Create a data frame with the variables
data <- data.frame(y, x1, x2, x3)
# Run OLS regression
model <- lm(y ~ x1 + x2 + x3, data = data)
# Display the regression results
summary(model)
Call: lm(formula = y ~ x1 + x2 + x3, data = data) Residuals: Min 1Q Median 3Q Max -7956.9 -1644.3 163.7 1915.5 5959.5 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1910.091 2459.762 0.777 0.438 x1 43.837 6.384 6.866 1.16e-11 *** x2 24.910 2.794 8.915 < 2e-16 *** x3 81.888 15.540 5.270 1.68e-07 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2401 on 996 degrees of freedom Multiple R-squared: 0.1319, Adjusted R-squared: 0.1293 F-statistic: 50.46 on 3 and 996 DF, p-value: < 2.2e-16
%%stata
clear
* Set seed
set seed 123
* Set number of observations
set obs 1000
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)
* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)
* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)
* Generate metric dependent variable with large coefficients
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
* Generate MAR
gen prob_mar = logistic(y-21791)
gen rmar = 0 if prob_mar==0
replace x2 = . if rmar == 0 // Set x to missing where rmar == 0
summarize x2, meanonly
replace x2 = r(mean) if missing(x2)
regress y x1 x2 x3
etable, append
. . clear . . * Set seed . set seed 123 . . * Set number of observations . set obs 1000 Number of observations (_N) was 0, now 1,000. . . * Generate metric independent variable #1 . drawnorm x1, n(1000) means(40) sds(12) . . * Generate metric independent variable #2 . drawnorm x2, n(1000) means(200) sds(50) . . * Generate metric independent variable #3 . drawnorm x3, n(1000) means(150) sds(5) . . * Generate metric dependent variable with large coefficients . gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500) . . * Generate MAR . gen prob_mar = logistic(y-21791) . gen rmar = 0 if prob_mar==0 (500 missing values generated) . replace x2 = . if rmar == 0 // Set x to missing where rmar == 0 (500 real changes made, 500 to missing) . . summarize x2, meanonly . replace x2 = r(mean) if missing(x2) (500 real changes made) . . regress y x1 x2 x3 Source | SS df MS Number of obs = 1,000 -------------+---------------------------------- F(3, 996) = 44.23 Model | 814884957 3 271628319 Prob > F = 0.0000 Residual | 6.1174e+09 996 6141946.06 R-squared = 0.1175 -------------+---------------------------------- Adj R-squared = 0.1149 Total | 6.9323e+09 999 6939202.43 Root MSE = 2478.3 ------------------------------------------------------------------------------ y | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- x1 | 36.86274 6.772257 5.44 0.000 23.57321 50.15227 x2 | 28.00538 2.82263 9.92 0.000 22.4664 33.54436 x3 | 63.7622 15.41944 4.14 0.000 33.50389 94.02052 _cons | 4173.834 2437.463 1.71 0.087 -609.3184 8956.987 ------------------------------------------------------------------------------ . . etable, append ------------------------------------------------------------------------------ y 2 y y y ------------------------------------------------------------------------------ x1 29.409 29.409 25.214 20.149 36.863 (4.032) (4.024) (5.942) (4.982) (6.772) x2 42.717 42.717 42.557 26.963 28.005 (0.944) (0.942) (1.385) (1.407) (2.823) x3 54.520 54.520 48.179 19.701 63.762 (9.246) (9.227) (13.133) (10.998) (15.419) Intercept 3778.896 3778.896 4912.711 13742.276 4173.834 (1402.667) (1399.858) (2007.467) (1717.077) (2437.463) var(e.y) 2.20e+06 (98396.280) Number of observations 1000 1000 508 500 1000 ------------------------------------------------------------------------------ .
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Set seed for reproducibility
np.random.seed(123)
# Set number of observations
n = 1000
# Generate independent variables
x1 = np.random.normal(loc=40, scale=12, size=n)
x2 = np.random.normal(loc=200, scale=50, size=n)
x3 = np.random.normal(loc=150, scale=5, size=n)
# Generate dependent variable
y = 30*x1 + 40*x2 + 50*x3 + np.random.normal(loc=5000, scale=1500, size=n)
# Generate MAR (Missing At Random) for x2
scaled_y = (y - np.mean(y)) / 1000
prob_mar = 1 / (1 + np.exp(-scaled_y)) # Logistic probability
rmar = np.random.binomial(n=1, p=prob_mar) # MAR indicator
x2[rmar == 0] = np.nan # Apply missingness
# Step: Mean imputation for x2
x2_mean = np.nanmean(x2) # Mean of observed x2
x2_filled = np.where(np.isnan(x2), x2_mean, x2) # Replace NaNs with mean
# Create DataFrame with imputed x2
data_mean_imputed = pd.DataFrame({
'y': y,
'x1': x1,
'x2': x2_filled,
'x3': x3
})
# Run OLS regression on mean-imputed data
X = sm.add_constant(data_mean_imputed[['x1', 'x2', 'x3']]) # Add intercept
model = sm.OLS(data_mean_imputed['y'], X).fit()
# Display regression results
print(model.summary())
OLS Regression Results ============================================================================== Dep. Variable: y R-squared: 0.134 Model: OLS Adj. R-squared: 0.131 Method: Least Squares F-statistic: 51.26 Date: Mon, 05 May 2025 Prob (F-statistic): 8.18e-31 Time: 11:34:35 Log-Likelihood: -9129.7 No. Observations: 1000 AIC: 1.827e+04 Df Residuals: 996 BIC: 1.829e+04 Df Model: 3 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ const 5571.3342 2306.839 2.415 0.016 1044.512 1.01e+04 x1 24.4270 5.898 4.142 0.000 12.853 36.001 x2 28.9485 2.530 11.442 0.000 23.984 33.913 x3 58.3072 14.460 4.032 0.000 29.931 86.683 ============================================================================== Omnibus: 19.477 Durbin-Watson: 1.957 Prob(Omnibus): 0.000 Jarque-Bera (JB): 19.625 Skew: -0.321 Prob(JB): 5.48e-05 Kurtosis: 2.755 Cond. No. 8.95e+03 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 8.95e+03. This might indicate that there are strong multicollinearity or other numerical problems.
Now we move on to full information maximum likelihood (FIML) approaches, with R first. Rather than filling in missing values, FIML evaluates each case's contribution to the likelihood using only the variables observed for that case, as sketched below.
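A minimal statement of the casewise log-likelihood under the multivariate normal model (standard FIML notation, not taken from the notebook output): for case $i$ with $k_i$ observed variables,
$$\ell_i(\mu, \Sigma) = -\tfrac{1}{2}\Big( k_i \log 2\pi + \log\lvert \Sigma_i \rvert + (y_i - \mu_i)^{\top} \Sigma_i^{-1} (y_i - \mu_i) \Big),$$
where $y_i$ is the observed sub-vector for case $i$ and $\mu_i$ and $\Sigma_i$ are the corresponding elements of the model-implied mean vector and covariance matrix. FIML maximises $\sum_i \ell_i$, so every case contributes whatever information it has.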
%%R
# Set seed for reproducibility
set.seed(123)
# Generate metric independent variables
x1 <- rnorm(1000, mean = 40, sd = 12)
x2 <- rnorm(1000, mean = 200, sd = 50)
x3 <- rnorm(1000, mean = 150, sd = 5)
# Generate metric dependent variable
y <- 30*x1 + 40*x2 + 50*x3 + rnorm(1000, mean = 5000, sd = 1500)
# Generate MAR (Missing At Random) for x2
prob_mar <- 1 / (1 + exp(-(y - 21791))) # Logistic function
rmar <- ifelse(prob_mar == 0, 0, 1) # 0 if prob_mar is 0, 1 otherwise
x2[rmar == 0] <- NA # Set x2 to missing where rmar == 0
# Load 'lavaan' package for SEM (Structural Equation Modeling)
library(lavaan)
# Define the model formula (SEM model)
model <- '
y ~ x1 + x2 + x3 # y as a function of x1, x2, and x3
'
# Fit the model with FIML (missing = "ml"); fixed.x = FALSE is needed so that
# predictors with missing values are modelled rather than listwise deleted
fit <- sem(model, data = data.frame(y, x1, x2, x3), missing = "ml", fixed.x = FALSE)
# Display the model summary
summary(fit)
lavaan 0.6-18 ended normally after 1 iteration Estimator ML Optimization method NLMINB Number of model parameters 4 Used Total Number of observations 606 1000 Model Test User Model: Test statistic 0.000 Degrees of freedom 0 Parameter Estimates: Standard errors Standard Information Expected Information saturated (h1) model Structured Regressions: Estimate Std.Err z-value P(>|z|) y ~ x1 19.799 4.270 4.637 0.000 x2 27.159 1.260 21.549 0.000 x3 41.840 9.653 4.334 0.000 Variances: Estimate Std.Err z-value P(>|z|) .y 1469996.086 84449.069 17.407 0.000
In addition: Warning message: lavaan->lav_data_full(): some observed variances are larger than 1000000 use varTable(fit) to investigate
%%stata
clear
* Set seed
set seed 123
* Set number of observations
set obs 1000
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)
* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)
* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)
* Generate metric dependent variable with large coefficients
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
* Generate MAR
gen prob_mar = logistic(y-21791)
gen rmar = 0 if prob_mar==0
replace x2 = . if rmar == 0 // Set x to missing where rmar == 0
* Model estimation
sem (y <- x1 x2 x3), method(mlmv)
etable, append
. . clear . . * Set seed . set seed 123 . . * Set number of observations . set obs 1000 Number of observations (_N) was 0, now 1,000. . . * Generate metric independent variable #1 . drawnorm x1, n(1000) means(40) sds(12) . . * Generate metric independent variable #2 . drawnorm x2, n(1000) means(200) sds(50) . . * Generate metric independent variable #3 . drawnorm x3, n(1000) means(150) sds(5) . . * Generate metric dependent variable with large coefficients . gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500) . . * Generate MAR . gen prob_mar = logistic(y-21791) . gen rmar = 0 if prob_mar==0 (500 missing values generated) . replace x2 = . if rmar == 0 // Set x to missing where rmar == 0 (500 real changes made, 500 to missing) . . * Model estimation . sem (y <- x1 x2 x3), method(mlmv) note: Missing values found in observed exogenous variables. Using the noxconditional behavior. Specify the forcexconditional option to override this behavior. Endogenous variables Observed: y Exogenous variables Observed: x1 x2 x3 Fitting saturated model: Iteration 0: Log likelihood = -19809.893 Iteration 1: Log likelihood = -18956.184 Iteration 2: Log likelihood = -18603.225 Iteration 3: Log likelihood = -18601.025 Iteration 4: Log likelihood = -18601 Iteration 5: Log likelihood = -18601 Fitting baseline model: Iteration 0: Log likelihood = -18886.253 Iteration 1: Log likelihood = -18792.75 Iteration 2: Log likelihood = -18757.19 Iteration 3: Log likelihood = -18754.887 Iteration 4: Log likelihood = -18754.864 Iteration 5: Log likelihood = -18754.864 Fitting target model: Iteration 0: Log likelihood = -18601 Iteration 1: Log likelihood = -18601 Structural equation model Number of obs = 1,000 Estimation method: mlmv Log likelihood = -18601 ------------------------------------------------------------------------------ | OIM | Coefficient std. err. z P>|z| [95% conf. interval] -------------+---------------------------------------------------------------- Structural | y | x1 | 38.10574 5.583454 6.82 0.000 27.16237 49.04911 x2 | 41.70024 1.300009 32.08 0.000 39.15227 44.24821 x3 | 40.31928 12.42463 3.25 0.001 15.96746 64.6711 _cons | 5775.744 1925.672 3.00 0.003 2001.496 9549.992 -------------+---------------------------------------------------------------- mean(x1)| 40.24597 .3686473 109.17 0.000 39.52343 40.96851 mean(x2)| 200.6471 2.530853 79.28 0.000 195.6867 205.6075 mean(x3)| 149.797 .1607991 931.58 0.000 149.4818 150.1121 -------------+---------------------------------------------------------------- var(e.y)| 2300879 187511.7 1961212 2699374 var(x1)| 135.9009 6.077671 124.496 148.3505 var(x2)| 2554.76 197.2925 2195.916 2972.244 var(x3)| 25.85634 1.156331 23.68646 28.225 -------------+---------------------------------------------------------------- cov(x1,x2)| -29.66936 22.2683 -1.33 0.183 -73.31444 13.97571 cov(x1,x3)| 1.100681 1.874863 0.59 0.557 -2.573983 4.775344 cov(x2,x3)| 12.02246 9.573148 1.26 0.209 -6.740568 30.78548 ------------------------------------------------------------------------------ LR test of model vs. saturated: chi2(0) = 0.00 Prob > chi2 = . . . 
etable, append ----------------------------------------------------------------------------------------- y 2 y y y 6 ----------------------------------------------------------------------------------------- x1 29.409 29.409 25.214 20.149 36.863 38.106 (4.032) (4.024) (5.942) (4.982) (6.772) (5.583) x2 42.717 42.717 42.557 26.963 28.005 41.700 (0.944) (0.942) (1.385) (1.407) (2.823) (1.300) x3 54.520 54.520 48.179 19.701 63.762 40.319 (9.246) (9.227) (13.133) (10.998) (15.419) (12.425) Intercept 3778.896 3778.896 4912.711 13742.276 4173.834 5775.744 (1402.667) (1399.858) (2007.467) (1717.077) (2437.463) (1925.672) var(e.y) 2.20e+06 2.30e+06 (98396.280) (1.88e+05) mean(x1) 40.246 (0.369) mean(x2) 200.647 (2.531) mean(x3) 149.797 (0.161) var(x1) 135.901 (6.078) var(x2) 2554.760 (197.292) var(x3) 25.856 (1.156) cov(x1,x2) -29.669 (22.268) cov(x1,x3) 1.101 (1.875) cov(x2,x3) 12.022 (9.573) Number of observations 1000 1000 508 500 1000 1000 ----------------------------------------------------------------------------------------- .
To my knowledge, there is no widely used FIML implementation in Python, which makes FIML impractical as a missing data technique there; instead you have to fall back on alternatives such as multiple imputation. This will now be demonstrated, again starting with R. For the purposes of this simulation, because all variables are continuous, a multivariate normal (MVN) multiple imputation approach could be used, although MICE (multiple imputation by chained equations) also allows variables of different types to be imputed.
%%R
# Install and load the `mice` package for multiple imputation (if not already installed)
if (!require(mice)) install.packages("mice", dependencies = TRUE)
library(mice)
# Set seed for reproducibility
set.seed(123)
# Generate metric independent variables
x1 <- rnorm(1000, mean = 40, sd = 12)
x2 <- rnorm(1000, mean = 200, sd = 50)
x3 <- rnorm(1000, mean = 150, sd = 5)
# Generate metric dependent variable
y <- 30*x1 + 40*x2 + 50*x3 + rnorm(1000, mean = 5000, sd = 1500)
# Generate MAR (Missing At Random) for x2
prob_mar <- 1 / (1 + exp(-(y - 21791))) # Logistic function
rmar <- ifelse(prob_mar == 0, 0, 1) # 0 if prob_mar is 0, 1 otherwise
x2[rmar == 0] <- NA # Set x2 to missing where rmar == 0
# Prepare the data for multiple imputation
data <- data.frame(y, x1, x2, x3)
# Perform multiple imputation using chained equations (MICE)
imp <- mice(data, m = 10, method = "pmm", seed = 12345)
# Perform regression after imputation
fit <- with(imp, lm(y ~ x1 + x2 + x3))
# Combine the results of the regression and display the estimates
pooled_results <- pool(fit)
summary(pooled_results)
iter imp variable 1 1 x2 1 2 x2 1 3 x2 1 4 x2 1 5 x2 1 6 x2 1 7 x2 1 8 x2 1 9 x2 1 10 x2 2 1 x2 2 2 x2 2 3 x2 2 4 x2 2 5 x2 2 6 x2 2 7 x2 2 8 x2 2 9 x2 2 10 x2 3 1 x2 3 2 x2 3 3 x2 3 4 x2 3 5 x2 3 6 x2 3 7 x2 3 8 x2 3 9 x2 3 10 x2 4 1 x2 4 2 x2 4 3 x2 4 4 x2 4 5 x2 4 6 x2 4 7 x2 4 8 x2 4 9 x2 4 10 x2 5 1 x2 5 2 x2 5 3 x2 5 4 x2 5 5 x2 5 6 x2 5 7 x2 5 8 x2 5 9 x2 5 10 x2 term estimate std.error statistic df p.value 1 (Intercept) 2184.92911 1901.113549 1.149289 415.6189 2.510977e-01 2 x1 30.19575 5.472728 5.517495 165.8509 1.299693e-07 3 x2 39.85730 1.632656 24.412548 80.6700 5.233432e-39 4 x3 67.14173 12.700619 5.286493 308.6763 2.363451e-07
Loading required package: mice Attaching package: ‘mice’ The following object is masked from ‘package:stats’: filter The following objects are masked from ‘package:base’: cbind, rbind
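Both pool() in mice above and mi estimate in Stata below combine the per-imputation results using Rubin's rules. In standard notation (not taken from the output above), with $m$ imputations, coefficient estimate $\hat{Q}_j$ and squared standard error $U_j$ from imputation $j$,
$$\bar{Q} = \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j, \qquad \bar{U} = \frac{1}{m}\sum_{j=1}^{m} U_j, \qquad B = \frac{1}{m-1}\sum_{j=1}^{m}\big(\hat{Q}_j - \bar{Q}\big)^2, \qquad T = \bar{U} + \Big(1 + \frac{1}{m}\Big)B,$$
so the pooled estimate is $\bar{Q}$ and its standard error is $\sqrt{T}$, combining within- and between-imputation variance.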
Now the same in Stata.
%%stata
clear
* Set seed
set seed 123
* Set number of observations
set obs 1000
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)
* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)
* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)
* Generate metric dependent variable with large coefficients
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
* Generate MAR
gen prob_mar = logistic(y-21791)
gen rmar = 0 if prob_mar==0
replace x2 = . if rmar == 0 // Set x to missing where rmar == 0
mi set wide
mi register imputed y x1 x2 x3
tab _mi_miss
mi impute chained ///
///
(regress) y x1 x2 x3 ///
, rseed(12345) dots force add(10) burnin(10)
mi estimate, post dots: regress y x1 x2 x3
etable, append
. . clear . . * Set seed . set seed 123 . . * Set number of observations . set obs 1000 Number of observations (_N) was 0, now 1,000. . . * Generate metric independent variable #1 . drawnorm x1, n(1000) means(40) sds(12) . . * Generate metric independent variable #2 . drawnorm x2, n(1000) means(200) sds(50) . . * Generate metric independent variable #3 . drawnorm x3, n(1000) means(150) sds(5) . . * Generate metric dependent variable with large coefficients . gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500) . . * Generate MAR . gen prob_mar = logistic(y-21791) . gen rmar = 0 if prob_mar==0 (500 missing values generated) . replace x2 = . if rmar == 0 // Set x to missing where rmar == 0 (500 real changes made, 500 to missing) . . mi set wide . . mi register imputed y x1 x2 x3 . . tab _mi_miss Incomplete | observation | Freq. Percent Cum. ------------+----------------------------------- 0 | 500 50.00 50.00 1 | 500 50.00 100.00 ------------+----------------------------------- Total | 1,000 100.00 . . . mi impute chained /// > /// > (regress) y x1 x2 x3 /// > , rseed(12345) dots force add(10) burnin(10) note: variables y x1 x3 contain no soft missing (.) values; imputing nothing note: missing-value pattern is monotone; no iteration performed. Conditional models (monotone): x2: regress x2 x3 x1 y Performing chained iterations: imputing m=1 through m=10 .........10 done Multivariate imputation Imputations = 10 Chained equations added = 10 Imputed: m=1 through m=10 updated = 0 Initialization: monotone Iterations = 0 burn-in = 0 y: linear regression x1: linear regression x2: linear regression x3: linear regression ------------------------------------------------------------------ | Observations per m |---------------------------------------------- Variable | Complete Incomplete Imputed | Total -------------------+-----------------------------------+---------- y | 1000 0 0 | 1000 x1 | 1000 0 0 | 1000 x2 | 500 500 500 | 1000 x3 | 1000 0 0 | 1000 ------------------------------------------------------------------ (Complete + Incomplete = Total; Imputed is the minimum across m of the number of filled-in observations.) . . . mi estimate, post dots: regress y x1 x2 x3 Imputations (10): .........10 done Multiple-imputation estimates Imputations = 10 Linear regression Number of obs = 1,000 Average RVI = 2.3440 Largest FMI = 0.5779 Complete DF = 996 DF adjustment: Small sample DF: min = 27.94 avg = 60.86 max = 124.57 Model F test: Equal FMI F( 3, 106.8) = 365.04 Within VCE type: OLS Prob > F = 0.0000 ------------------------------------------------------------------------------ y | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- x1 | 36.94599 4.821662 7.66 0.000 27.403 46.48898 x2 | 42.0321 1.471605 28.56 0.000 39.01734 45.04686 x3 | 39.50875 12.21717 3.23 0.002 15.03525 63.98224 _cons | 5850.553 2031.63 2.88 0.007 1725.712 9975.395 ------------------------------------------------------------------------------ . . 
etable, append ---------------------------------------------------------------------------------------------------- y 2 y y y 6 y ---------------------------------------------------------------------------------------------------- x1 29.409 29.409 25.214 20.149 36.863 38.106 36.946 (4.032) (4.024) (5.942) (4.982) (6.772) (5.583) (4.822) x2 42.717 42.717 42.557 26.963 28.005 41.700 42.032 (0.944) (0.942) (1.385) (1.407) (2.823) (1.300) (1.472) x3 54.520 54.520 48.179 19.701 63.762 40.319 39.509 (9.246) (9.227) (13.133) (10.998) (15.419) (12.425) (12.217) Intercept 3778.896 3778.896 4912.711 13742.276 4173.834 5775.744 5850.553 (1402.667) (1399.858) (2007.467) (1717.077) (2437.463) (1925.672) (2031.630) var(e.y) 2.20e+06 2.30e+06 (98396.280) (1.88e+05) mean(x1) 40.246 (0.369) mean(x2) 200.647 (2.531) mean(x3) 149.797 (0.161) var(x1) 135.901 (6.078) var(x2) 2554.760 (197.292) var(x3) 25.856 (1.156) cov(x1,x2) -29.669 (22.268) cov(x1,x3) 1.101 (1.875) cov(x2,x3) 12.022 (9.573) Number of observations 1000 1000 508 500 1000 1000 1000 ---------------------------------------------------------------------------------------------------- .
And finally, Python.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression
# Set seed for reproducibility
np.random.seed(123)
# Set number of observations
n = 1000
# Generate independent variables
x1 = np.random.normal(loc=40, scale=12, size=n)
x2 = np.random.normal(loc=200, scale=50, size=n)
x3 = np.random.normal(loc=150, scale=5, size=n)
# Generate dependent variable
y = 30*x1 + 40*x2 + 50*x3 + np.random.normal(loc=5000, scale=1500, size=n)
# Generate MAR (Missing At Random) for x2
scaled_y = (y - np.mean(y)) / 1000
prob_mar = 1 / (1 + np.exp(-scaled_y)) # Logistic probability
rmar = np.random.binomial(n=1, p=prob_mar) # MAR indicator
x2[rmar == 0] = np.nan # Apply missingness
# Create DataFrame
data = pd.DataFrame({'y': y, 'x1': x1, 'x2': x2, 'x3': x3})
lr = LinearRegression()
imp = IterativeImputer(estimator=lr, verbose=2, max_iter=5, tol=1e-10, imputation_order='roman')
ImputedData = imp.fit_transform(data)
# Convert imputed array back to DataFrame
imputed_df = pd.DataFrame(ImputedData, columns=data.columns)
# Define predictors and add constant
X_imputed = sm.add_constant(imputed_df[['x1', 'x2', 'x3']]) # Add intercept
y_imputed = imputed_df['y']
# Fit OLS model
model_imputed = sm.OLS(y_imputed, X_imputed).fit()
# Display regression summary
print(model_imputed.summary())
[IterativeImputer] Completing matrix with shape (1000, 4) [IterativeImputer] Ending imputation round 1/5, elapsed time 0.00 [IterativeImputer] Change: 132.83931088297044, scaled tolerance: 2.866777347081888e-06 [IterativeImputer] Ending imputation round 2/5, elapsed time 0.00 [IterativeImputer] Change: 0.0, scaled tolerance: 2.866777347081888e-06 [IterativeImputer] Early stopping criterion reached. OLS Regression Results ============================================================================== Dep. Variable: y R-squared: 0.734 Model: OLS Adj. R-squared: 0.733 Method: Least Squares F-statistic: 916.8 Date: Mon, 05 May 2025 Prob (F-statistic): 6.55e-286 Time: 11:34:37 Log-Likelihood: -8539.1 No. Observations: 1000 AIC: 1.709e+04 Df Residuals: 996 BIC: 1.711e+04 Df Model: 3 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ const 410.2896 1238.810 0.331 0.741 -2020.687 2841.267 x1 24.9007 3.265 7.626 0.000 18.493 31.308 x2 49.7527 0.962 51.727 0.000 47.865 51.640 x3 68.8840 8.001 8.609 0.000 53.183 84.585 ============================================================================== Omnibus: 148.471 Durbin-Watson: 1.913 Prob(Omnibus): 0.000 Jarque-Bera (JB): 261.767 Skew: 0.927 Prob(JB): 1.44e-57 Kurtosis: 4.686 Cond. No. 8.13e+03 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 8.13e+03. This might indicate that there are strong multicollinearity or other numerical problems.
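One caveat on the Python approach above: IterativeImputer produces a single completed dataset, so the standard errors from the final OLS fit do not reflect imputation uncertainty in the way the pooled mice and mi estimate results do. If pooled multiple imputation is wanted in Python, one option is the MICE implementation in statsmodels. The following is a minimal sketch, assuming the incomplete DataFrame data from the cell above is still in memory; it is not part of the original notebook.
import statsmodels.api as sm
from statsmodels.imputation import mice
# Wrap the incomplete data; by default MICEData imputes each variable with
# missing values using predictive mean matching, conditional on the others.
imp_data = mice.MICEData(data)
# Fit the analysis model to each imputed dataset and pool with Rubin's rules.
mi_model = mice.MICE('y ~ x1 + x2 + x3', sm.OLS, imp_data)
mi_results = mi_model.fit(n_burnin=10, n_imputations=10)
print(mi_results.summary())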