Before any analysis, both the R kernel and the Stata kernel need to be set up properly. We start with R. First we need to install rpy2, the Python-to-R bridge package, using pip.
pip install rpy2
Requirement already satisfied: rpy2 in /opt/anaconda3/lib/python3.12/site-packages (3.5.17) Requirement already satisfied: cffi>=1.15.1 in /opt/anaconda3/lib/python3.12/site-packages (from rpy2) (1.17.1) Requirement already satisfied: jinja2 in /opt/anaconda3/lib/python3.12/site-packages (from rpy2) (3.1.4) Requirement already satisfied: tzlocal in /opt/anaconda3/lib/python3.12/site-packages (from rpy2) (5.3.1) Requirement already satisfied: pycparser in /opt/anaconda3/lib/python3.12/site-packages (from cffi>=1.15.1->rpy2) (2.21) Requirement already satisfied: MarkupSafe>=2.0 in /opt/anaconda3/lib/python3.12/site-packages (from jinja2->rpy2) (2.1.3) Note: you may need to restart the kernel to use updated packages.
After we have done this, we then need to load the rpy2 extension within the Python kernel.
%load_ext rpy2.ipython
Finally, a quick test to make sure R now works. Any time we want R code to run, we have to use the %%R cell magic.
%%R
x <- c(1,2,3,4,5)
mean(x)
[1] 3
Now we have to install the Python-to-Stata dependencies to get Stata to work in a Python kernel. We start by installing the pystata package.
pip install pystata
Requirement already satisfied: pystata in /opt/anaconda3/lib/python3.12/site-packages (0.0.1) Requirement already satisfied: pandas in /opt/anaconda3/lib/python3.12/site-packages (from pystata) (2.2.2) Requirement already satisfied: numpy>=1.26.0 in /opt/anaconda3/lib/python3.12/site-packages (from pandas->pystata) (1.26.4) Requirement already satisfied: python-dateutil>=2.8.2 in /opt/anaconda3/lib/python3.12/site-packages (from pandas->pystata) (2.9.0.post0) Requirement already satisfied: pytz>=2020.1 in /opt/anaconda3/lib/python3.12/site-packages (from pandas->pystata) (2024.1) Requirement already satisfied: tzdata>=2022.7 in /opt/anaconda3/lib/python3.12/site-packages (from pandas->pystata) (2023.3) Requirement already satisfied: six>=1.5 in /opt/anaconda3/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas->pystata) (1.16.0) Note: you may need to restart the kernel to use updated packages.
We then also install the stata_setup package.
pip install --upgrade --user stata_setup
Requirement already satisfied: stata_setup in /Users/scottoatley/.local/lib/python3.12/site-packages (0.1.3) Requirement already satisfied: numpy in /opt/anaconda3/lib/python3.12/site-packages (from stata_setup) (1.26.4) Requirement already satisfied: pandas in /opt/anaconda3/lib/python3.12/site-packages (from stata_setup) (2.2.2) Requirement already satisfied: ipython in /opt/anaconda3/lib/python3.12/site-packages (from stata_setup) (8.27.0) Requirement already satisfied: decorator in /opt/anaconda3/lib/python3.12/site-packages (from ipython->stata_setup) (5.1.1) Requirement already satisfied: jedi>=0.16 in /opt/anaconda3/lib/python3.12/site-packages (from ipython->stata_setup) (0.19.1) Requirement already satisfied: matplotlib-inline in /opt/anaconda3/lib/python3.12/site-packages (from ipython->stata_setup) (0.1.6) Requirement already satisfied: prompt-toolkit<3.1.0,>=3.0.41 in /opt/anaconda3/lib/python3.12/site-packages (from ipython->stata_setup) (3.0.43) Requirement already satisfied: pygments>=2.4.0 in /opt/anaconda3/lib/python3.12/site-packages (from ipython->stata_setup) (2.15.1) Requirement already satisfied: stack-data in /opt/anaconda3/lib/python3.12/site-packages (from ipython->stata_setup) (0.2.0) Requirement already satisfied: traitlets>=5.13.0 in /opt/anaconda3/lib/python3.12/site-packages (from ipython->stata_setup) (5.14.3) Requirement already satisfied: pexpect>4.3 in /opt/anaconda3/lib/python3.12/site-packages (from ipython->stata_setup) (4.8.0) Requirement already satisfied: python-dateutil>=2.8.2 in /opt/anaconda3/lib/python3.12/site-packages (from pandas->stata_setup) (2.9.0.post0) Requirement already satisfied: pytz>=2020.1 in /opt/anaconda3/lib/python3.12/site-packages (from pandas->stata_setup) (2024.1) Requirement already satisfied: tzdata>=2022.7 in /opt/anaconda3/lib/python3.12/site-packages (from pandas->stata_setup) (2023.3) Requirement already satisfied: parso<0.9.0,>=0.8.3 in /opt/anaconda3/lib/python3.12/site-packages (from jedi>=0.16->ipython->stata_setup) (0.8.3) Requirement already satisfied: ptyprocess>=0.5 in /opt/anaconda3/lib/python3.12/site-packages (from pexpect>4.3->ipython->stata_setup) (0.7.0) Requirement already satisfied: wcwidth in /opt/anaconda3/lib/python3.12/site-packages (from prompt-toolkit<3.1.0,>=3.0.41->ipython->stata_setup) (0.2.5) Requirement already satisfied: six>=1.5 in /opt/anaconda3/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas->stata_setup) (1.16.0) Requirement already satisfied: executing in /opt/anaconda3/lib/python3.12/site-packages (from stack-data->ipython->stata_setup) (0.8.3) Requirement already satisfied: asttokens in /opt/anaconda3/lib/python3.12/site-packages (from stack-data->ipython->stata_setup) (2.0.5) Requirement already satisfied: pure-eval in /opt/anaconda3/lib/python3.12/site-packages (from stack-data->ipython->stata_setup) (0.2.2) Note: you may need to restart the kernel to use updated packages.
We now import the stata_setup package into the Python kernel.
import stata_setup
And finally we initialise Stata itself within the kernel, pointing stata_setup at the local installation directory and edition (here SE).
stata_setup.config("/Applications/Stata", "se")
___ ____ ____ ____ ____ ® /__ / ____/ / ____/ 18.0 ___/ / /___/ / /___/ SE—Standard Edition Statistics and Data Science Copyright 1985-2023 StataCorp LLC StataCorp 4905 Lakeway Drive College Station, Texas 77845 USA 800-STATA-PC https://www.stata.com 979-696-4600 stata@stata.com Stata license: Unlimited-user network, expiring 14 Sep 2025 Serial number: 401809305318 Licensed to: Scott oatley University of Edinburgh Notes: 1. Unicode is supported; see help unicode_advice. 2. Maximum number of variables is set to 5,000 but can be increased; see help set_maxvar.
We now check that Stata works with a simple display command.
%%stata
display 1+1
. . display 1+1 2 .
And now back to R to make sure both kernels can run in the same session.
%%R
y <- c(1,2,3,NA)
is.na(y) # returns a vector (F F F T)
[1] FALSE FALSE FALSE TRUE
Each missing data handling method will now be explored using the four aforementioned software programs: R, Stata, Python, and Mplus. Some methods are not possible in certain packages; when that happens, only the relevant software is shown. We start with a simple simulation of an OLS linear regression model with three continuous independent variables, each normally distributed, with means ranging from 40 to 200 and standard deviations ranging from 5 to 50. We proceed in the order R, then Stata, then Python, and finally Mplus, and all of the methods will be demonstrated.
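For reference, a quick derivation from the stated design (not part of the original notebook): the implied population mean of the outcome is
$$\mathbb{E}[y] = 30(40) + 40(200) + 50(150) + 5000 = 21{,}700,$$
which is why a cut-off of roughly 21,791 (close to the sample mean of y in these simulations) reappears later when the MAR mechanism is defined.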
%%R
# Set seed to ensure reproducibility
set.seed(123)
# Generate metric independent variables
x1 <- rnorm(1000, mean = 40, sd = 12)
x2 <- rnorm(1000, mean = 200, sd = 50)
x3 <- rnorm(1000, mean = 150, sd = 5)
# Generate metric dependent variable
y <- 30*x1 + 40*x2 + 50*x3 + rnorm(1000, mean = 5000, sd = 1500)
# Create data frame
data <- data.frame(y, x1, x2, x3)
# Run regression
model <- lm(y ~ x1 + x2 + x3, data = data)
summary(model)
Call: lm(formula = y ~ x1 + x2 + x3, data = data) Residuals: Min 1Q Median 3Q Max -4584.3 -950.2 0.7 1012.9 4338.4 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2726.9512 1462.4421 1.865 0.0625 . x1 29.8375 3.9740 7.508 1.33e-13 *** x2 39.7564 0.9369 42.432 < 2e-16 *** x3 65.4442 9.6348 6.792 1.89e-11 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1489 on 996 degrees of freedom Multiple R-squared: 0.6662, Adjusted R-squared: 0.6652 F-statistic: 662.5 on 3 and 996 DF, p-value: < 2.2e-16
The same model in Stata:
%%stata
* Set seed
set seed 123
* Set number of observations
set obs 1000
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)
* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)
* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)
* Generate metric dependent variable with large coefficients
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
* Run Model
regress y x1 x2 x3
etable
. . * Set seed . set seed 123 . . * Set number of observations . set obs 1000 Number of observations (_N) was 0, now 1,000. . . * Generate metric independent variable #1 . drawnorm x1, n(1000) means(40) sds(12) . . * Generate metric independent variable #2 . drawnorm x2, n(1000) means(200) sds(50) . . * Generate metric independent variable #3 . drawnorm x3, n(1000) means(150) sds(5) . . * Generate metric dependent variable with large coefficients . gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500) . . * Run Model . regress y x1 x2 x3 Source | SS df MS Number of obs = 1,000 -------------+---------------------------------- F(3, 996) = 714.04 Model | 4.7321e+09 3 1.5774e+09 Prob > F = 0.0000 Residual | 2.2002e+09 996 2209043.89 R-squared = 0.6826 -------------+---------------------------------- Adj R-squared = 0.6817 Total | 6.9323e+09 999 6939202.43 Root MSE = 1486.3 ------------------------------------------------------------------------------ y | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- x1 | 29.4092 4.032441 7.29 0.000 21.49614 37.32225 x2 | 42.71715 .9441681 45.24 0.000 40.86437 44.56994 x3 | 54.51955 9.245538 5.90 0.000 36.37658 72.66252 _cons | 3778.896 1402.667 2.69 0.007 1026.375 6531.417 ------------------------------------------------------------------------------ . . etable --------------------------------- y --------------------------------- x1 29.409 (4.032) x2 42.717 (0.944) x3 54.520 (9.246) Intercept 3778.896 (1402.667) Number of observations 1000 --------------------------------- .
Now don't worry too much that the results from the models are not identical across packages. Although we set the same seed in each one, R, Stata, and NumPy use different random-number generators, so each package simulates a slightly different sample. Substantively the results are the same, and that is all that matters here.
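If exact agreement were ever needed, one option is to simulate the data once in Python and share the same DataFrame with the other engines. The following is only a minimal sketch: it assumes the -i option of the rpy2 %%R magic and the -d option of the pystata %%stata magic transfer a pandas DataFrame, and the variable name shared is made up for illustration.
import numpy as np
import pandas as pd
np.random.seed(123)
shared = pd.DataFrame({
    'x1': np.random.normal(40, 12, 1000),
    'x2': np.random.normal(200, 50, 1000),
    'x3': np.random.normal(150, 5, 1000),
})
shared['y'] = 30*shared['x1'] + 40*shared['x2'] + 50*shared['x3'] + np.random.normal(5000, 1500, 1000)
%%R -i shared
# 'shared' arrives as an R data.frame, so the regression uses exactly the same draws
summary(lm(y ~ x1 + x2 + x3, data = shared))
%%stata -d shared
* 'shared' is loaded as the current Stata dataset
regress y x1 x2 x3
With that aside, now for Python.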
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Set seed for reproducibility
np.random.seed(123)
# Generate metric independent variables
x1 = np.random.normal(loc=40, scale=12, size=1000)
x2 = np.random.normal(loc=200, scale=50, size=1000)
x3 = np.random.normal(loc=150, scale=5, size=1000)
# Generate metric dependent variable
y = 30*x1 + 40*x2 + 50*x3 + np.random.normal(loc=5000, scale=1500, size=1000)
# Create DataFrame
df = pd.DataFrame({'y': y, 'x1': x1, 'x2': x2, 'x3': x3})
# Run regression
X = df[['x1', 'x2', 'x3']] # Independent variables
X = sm.add_constant(X) # Add intercept
model = sm.OLS(df['y'], X).fit() # Fit model
# Display summary
print(model.summary())
OLS Regression Results ============================================================================== Dep. Variable: y R-squared: 0.618 Model: OLS Adj. R-squared: 0.617 Method: Least Squares F-statistic: 537.7 Date: Mon, 05 May 2025 Prob (F-statistic): 1.03e-207 Time: 11:34:34 Log-Likelihood: -8719.9 No. Observations: 1000 AIC: 1.745e+04 Df Residuals: 996 BIC: 1.747e+04 Df Model: 3 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ const 5257.2104 1467.373 3.583 0.000 2377.713 8136.707 x1 26.4715 3.914 6.764 0.000 18.792 34.151 x2 38.7536 0.981 39.513 0.000 36.829 40.678 x3 51.5446 9.574 5.384 0.000 32.757 70.332 ============================================================================== Omnibus: 2.499 Durbin-Watson: 1.897 Prob(Omnibus): 0.287 Jarque-Bera (JB): 2.519 Skew: 0.122 Prob(JB): 0.284 Kurtosis: 2.964 Cond. No. 8.01e+03 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 8.01e+03. This might indicate that there are strong multicollinearity or other numerical problems.
The simple OLS linear regression model has now been estimated. We now fit the same model within an SEM framework in all three packages, starting with R, then Stata, and then Python.
%%R
# Load necessary libraries
library(lavaan)
library(MASS) # For mvrnorm (multivariate normal distribution)
# Set seed for reproducibility
set.seed(123)
# Generate metric independent variables
x1 <- rnorm(1000, mean = 40, sd = 12)
x2 <- rnorm(1000, mean = 200, sd = 50)
x3 <- rnorm(1000, mean = 150, sd = 5)
# Generate metric dependent variable
y <- 30*x1 + 40*x2 + 50*x3 + rnorm(1000, mean = 5000, sd = 1500)
# Create a dataframe
data <- data.frame(y, x1, x2, x3)
# Define the SEM model
sem_model <- '
y ~ x1 + x2 + x3
'
# Fit the model using Maximum Likelihood with Missing Values (MLMV)
fit <- sem(sem_model, data = data, missing = "ml")
# Print model summary
summary(fit, fit.measures = TRUE, standardized = TRUE)
lavaan 0.6-18 ended normally after 1 iteration Estimator ML Optimization method NLMINB Number of model parameters 5 Number of observations 1000 Number of missing patterns 1 Model Test User Model: Test statistic 0.000 Degrees of freedom 0 Model Test Baseline Model: Test statistic 1097.097 Degrees of freedom 3 P-value 0.000 User Model versus Baseline Model: Comparative Fit Index (CFI) 1.000 Tucker-Lewis Index (TLI) 1.000 Robust Comparative Fit Index (CFI) 1.000 Robust Tucker-Lewis Index (TLI) 1.000 Loglikelihood and Information Criteria: Loglikelihood user model (H0) -8722.665 Loglikelihood unrestricted model (H1) -8722.665 Akaike (AIC) 17455.330 Bayesian (BIC) 17479.869 Sample-size adjusted Bayesian (SABIC) 17463.989 Root Mean Square Error of Approximation: RMSEA 0.000 90 Percent confidence interval - lower 0.000 90 Percent confidence interval - upper 0.000 P-value H_0: RMSEA <= 0.050 NA P-value H_0: RMSEA >= 0.080 NA Robust RMSEA 0.000 90 Percent confidence interval - lower 0.000 90 Percent confidence interval - upper 0.000 P-value H_0: Robust RMSEA <= 0.050 NA P-value H_0: Robust RMSEA >= 0.080 NA Standardized Root Mean Square Residual: SRMR 0.000 Parameter Estimates: Standard errors Standard Information Observed Observed information based on Hessian Regressions: Estimate Std.Err z-value P(>|z|) Std.lv Std.all y ~ x1 29.837 3.966 7.523 0.000 29.837 0.138 x2 39.756 0.935 42.518 0.000 39.756 0.780 x3 65.444 9.616 6.806 0.000 65.444 0.124 Intercepts: Estimate Std.Err z-value P(>|z|) Std.lv Std.all .y 2726.951 1459.514 1.868 0.062 2726.951 1.060 Variances: Estimate Std.Err z-value P(>|z|) Std.lv Std.all .y 2207681.041 98745.450 22.357 0.000 2207681.041 0.334
This is lavaan 0.6-18 lavaan is FREE software! Please report any bugs. In addition: Warning messages: 1: lavaan->lav_data_full(): some observed variances are larger than 1000000 use varTable(fit) to investigate 2: lavaan->lavData(): some observed variances in the sample covariance matrix are larger than 1000000. 3: lavaan->lavData(): some observed variances in the sample covariance matrix are larger than 1000000. 4: lavaan->lavData(): some observed variances in the sample covariance matrix are larger than 1000000. 5: lavaan->lavData(): some observed variances in the sample covariance matrix are larger than 1000000.
%%stata
clear
* Set seed
set seed 123
* Set number of observations
set obs 1000
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)
* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)
* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)
* Generate metric dependent variable with large coefficients
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
sem (y <- x1 x2 x3), method(mlmv)
etable, append
. . clear . . * Set seed . set seed 123 . . * Set number of observations . set obs 1000 Number of observations (_N) was 0, now 1,000. . . * Generate metric independent variable #1 . drawnorm x1, n(1000) means(40) sds(12) . . * Generate metric independent variable #2 . drawnorm x2, n(1000) means(200) sds(50) . . * Generate metric independent variable #3 . drawnorm x3, n(1000) means(150) sds(5) . . * Generate metric dependent variable with large coefficients . gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500) . . sem (y <- x1 x2 x3), method(mlmv) Endogenous variables Observed: y Exogenous variables Observed: x1 x2 x3 Fitting target model: Iteration 0: Log likelihood = -20967.462 Iteration 1: Log likelihood = -20967.462 Structural equation model Number of obs = 1,000 Estimation method: mlmv Log likelihood = -20967.462 ------------------------------------------------------------------------------ | OIM | Coefficient std. err. z P>|z| [95% conf. interval] -------------+---------------------------------------------------------------- Structural | y | x1 | 29.4092 4.024368 7.31 0.000 21.52158 37.29681 x2 | 42.71715 .9422779 45.33 0.000 40.87032 44.56398 x3 | 54.51955 9.227029 5.91 0.000 36.43491 72.6042 _cons | 3778.896 1399.858 2.70 0.007 1035.223 6522.568 -------------+---------------------------------------------------------------- var(e.y)| 2200208 98396.28 2015565 2401765 ------------------------------------------------------------------------------ LR test of model vs. saturated: chi2(0) = 0.00 Prob > chi2 = . . . etable, append --------------------------------------------- y 2 --------------------------------------------- x1 29.409 29.409 (4.032) (4.024) x2 42.717 42.717 (0.944) (0.942) x3 54.520 54.520 (9.246) (9.227) Intercept 3778.896 3778.896 (1402.667) (1399.858) var(e.y) 2.20e+06 (98396.280) Number of observations 1000 1000 --------------------------------------------- .
For Python, we use the semopy package for SEM, which we first install.
pip install semopy
Requirement already satisfied: semopy in /opt/anaconda3/lib/python3.12/site-packages (2.3.11) Requirement already satisfied: scipy in /opt/anaconda3/lib/python3.12/site-packages (from semopy) (1.13.1) Requirement already satisfied: numpy in /opt/anaconda3/lib/python3.12/site-packages (from semopy) (1.26.4) Requirement already satisfied: pandas in /opt/anaconda3/lib/python3.12/site-packages (from semopy) (2.2.2) Requirement already satisfied: sympy in /opt/anaconda3/lib/python3.12/site-packages (from semopy) (1.13.2) Requirement already satisfied: scikit-learn in /opt/anaconda3/lib/python3.12/site-packages (from semopy) (1.5.1) Requirement already satisfied: statsmodels in /opt/anaconda3/lib/python3.12/site-packages (from semopy) (0.14.2) Requirement already satisfied: numdifftools in /opt/anaconda3/lib/python3.12/site-packages (from semopy) (0.9.41) Requirement already satisfied: python-dateutil>=2.8.2 in /opt/anaconda3/lib/python3.12/site-packages (from pandas->semopy) (2.9.0.post0) Requirement already satisfied: pytz>=2020.1 in /opt/anaconda3/lib/python3.12/site-packages (from pandas->semopy) (2024.1) Requirement already satisfied: tzdata>=2022.7 in /opt/anaconda3/lib/python3.12/site-packages (from pandas->semopy) (2023.3) Requirement already satisfied: joblib>=1.2.0 in /opt/anaconda3/lib/python3.12/site-packages (from scikit-learn->semopy) (1.4.2) Requirement already satisfied: threadpoolctl>=3.1.0 in /opt/anaconda3/lib/python3.12/site-packages (from scikit-learn->semopy) (3.5.0) Requirement already satisfied: patsy>=0.5.6 in /opt/anaconda3/lib/python3.12/site-packages (from statsmodels->semopy) (0.5.6) Requirement already satisfied: packaging>=21.3 in /opt/anaconda3/lib/python3.12/site-packages (from statsmodels->semopy) (24.1) Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/anaconda3/lib/python3.12/site-packages (from sympy->semopy) (1.3.0) Requirement already satisfied: six in /opt/anaconda3/lib/python3.12/site-packages (from patsy>=0.5.6->statsmodels->semopy) (1.16.0) Note: you may need to restart the kernel to use updated packages.
import numpy as np
import pandas as pd
from semopy import Model
# Set random seed for reproducibility
np.random.seed(123)
# Generate metric independent variables
x1 = np.random.normal(loc=40, scale=12, size=1000)
x2 = np.random.normal(loc=200, scale=50, size=1000)
x3 = np.random.normal(loc=150, scale=5, size=1000)
# Generate metric dependent variable
y = 30*x1 + 40*x2 + 50*x3 + np.random.normal(loc=5000, scale=1500, size=1000)
# Create a DataFrame
data = pd.DataFrame({'y': y, 'x1': x1, 'x2': x2, 'x3': x3})
# Define the SEM model in semopy syntax
sem_model = """
y ~ x1 + x2 + x3
"""
# Create and fit the model
model = Model(sem_model)
model.fit(data)
# Print model results
print(model.inspect())
lval op rval Estimate Std. Err z-value p-value 0 y ~ x1 2.826314e+01 4.470083 6.322734 2.569751e-10 1 y ~ x2 3.870783e+01 1.120236 34.553275 0.000000e+00 2 y ~ x3 9.507102e+00 10.935110 0.869411 3.846225e-01 3 y ~~ y 2.875963e+06 128616.985180 22.360680 0.000000e+00
Now we move on to a simple MCAR (missing completely at random) scenario, starting with R and following the same software order as before.
%%R
# Set seed for reproducibility
set.seed(123)
# Generate metric independent variables
x1 <- rnorm(1000, mean = 40, sd = 12)
x2 <- rnorm(1000, mean = 200, sd = 50)
x3 <- rnorm(1000, mean = 150, sd = 5)
# Generate metric dependent variable
y <- 30*x1 + 40*x2 + 50*x3 + rnorm(1000, mean = 5000, sd = 1500)
# Introduce 50% Missing Completely at Random (MCAR) in x2
rmcar <- rbinom(1000, 1, 0.5) # Binary indicator: 50% missingness
x2[rmcar == 0] <- NA # Set x2 to NA for half the observations
# Create data frame
data <- data.frame(y, x1, x2, x3)
# Run OLS regression (handling missing data with listwise deletion by default)
ols_model <- lm(y ~ x1 + x2 + x3, data = data)
# Display regression results
summary(ols_model)
Call: lm(formula = y ~ x1 + x2 + x3, data = data) Residuals: Min 1Q Median 3Q Max -4464.9 -1057.0 -31.6 1042.2 4294.4 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 386.332 2177.578 0.177 0.859 x1 29.499 5.908 4.993 8.28e-07 *** x2 40.483 1.364 29.671 < 2e-16 *** x3 80.047 14.416 5.553 4.62e-08 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1548 on 489 degrees of freedom (507 observations deleted due to missingness) Multiple R-squared: 0.6697, Adjusted R-squared: 0.6677 F-statistic: 330.5 on 3 and 489 DF, p-value: < 2.2e-16
%%stata
clear
* Set seed
set seed 123
* Set number of observations
set obs 1000
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)
* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)
* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)
* Generate metric dependent variable with large coefficients
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
* Generate MCAR
gen rmcar = rbinomial(1, 0.5) // MCAR: 50% chance of missingness (binary random)
replace x2 = . if rmcar == 0 // Set x to missing where rmcar == 0
regress y x1 x2 x3
etable, append
. . clear . . * Set seed . set seed 123 . . * Set number of observations . set obs 1000 Number of observations (_N) was 0, now 1,000. . . * Generate metric independent variable #1 . drawnorm x1, n(1000) means(40) sds(12) . . * Generate metric independent variable #2 . drawnorm x2, n(1000) means(200) sds(50) . . * Generate metric independent variable #3 . drawnorm x3, n(1000) means(150) sds(5) . . * Generate metric dependent variable with large coefficients . gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500) . . * Generate MCAR . gen rmcar = rbinomial(1, 0.5) // MCAR: 50% chance of missingness (binary ran > dom) . replace x2 = . if rmcar == 0 // Set x to missing where rmcar == 0 (492 real changes made, 492 to missing) . . regress y x1 x2 x3 Source | SS df MS Number of obs = 508 -------------+---------------------------------- F(3, 504) = 325.45 Model | 2.3023e+09 3 767444127 Prob > F = 0.0000 Residual | 1.1885e+09 504 2358099.32 R-squared = 0.6595 -------------+---------------------------------- Adj R-squared = 0.6575 Total | 3.4908e+09 507 6885235.58 Root MSE = 1535.6 ------------------------------------------------------------------------------ y | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- x1 | 25.21384 5.94201 4.24 0.000 13.53968 36.888 x2 | 42.55748 1.384992 30.73 0.000 39.83641 45.27855 x3 | 48.17947 13.13299 3.67 0.000 22.37733 73.98162 _cons | 4912.711 2007.467 2.45 0.015 968.6771 8856.744 ------------------------------------------------------------------------------ . . etable, append -------------------------------------------------------- y 2 y -------------------------------------------------------- x1 29.409 29.409 25.214 (4.032) (4.024) (5.942) x2 42.717 42.717 42.557 (0.944) (0.942) (1.385) x3 54.520 54.520 48.179 (9.246) (9.227) (13.133) Intercept 3778.896 3778.896 4912.711 (1402.667) (1399.858) (2007.467) var(e.y) 2.20e+06 (98396.280) Number of observations 1000 1000 508 -------------------------------------------------------- .
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Set seed for reproducibility
np.random.seed(123)
# Set number of observations
n = 1000
# Generate independent variables
x1 = np.random.normal(loc=40, scale=12, size=n)
x2 = np.random.normal(loc=200, scale=50, size=n)
x3 = np.random.normal(loc=150, scale=5, size=n)
# Generate dependent variable
y = 30*x1 + 40*x2 + 50*x3 + np.random.normal(loc=5000, scale=1500, size=n)
# Introduce 50% Missing Completely at Random (MCAR) in x2
rmcar = np.random.binomial(1, 0.5, size=n) # Binary indicator: 50% missingness
x2[rmcar == 0] = np.nan # Set x2 to NaN for half the observations
# Create DataFrame
data = pd.DataFrame({'y': y, 'x1': x1, 'x2': x2, 'x3': x3})
# Drop rows with missing values (listwise deletion) and run OLS regression
data_complete = data.dropna() # Remove rows with NaN
X = data_complete[['x1', 'x2', 'x3']]
X = sm.add_constant(X) # Add intercept
y = data_complete['y']
# Fit OLS model
model = sm.OLS(y, X).fit()
# Display results
print(model.summary())
OLS Regression Results ============================================================================== Dep. Variable: y R-squared: 0.631 Model: OLS Adj. R-squared: 0.629 Method: Least Squares F-statistic: 295.0 Date: Mon, 05 May 2025 Prob (F-statistic): 1.19e-111 Time: 11:34:35 Log-Likelihood: -4541.7 No. Observations: 522 AIC: 9091. Df Residuals: 518 BIC: 9108. Df Model: 3 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ const 4652.1226 2050.412 2.269 0.024 623.978 8680.268 x1 20.5814 5.133 4.010 0.000 10.498 30.665 x2 39.7328 1.342 29.605 0.000 37.096 42.369 x3 55.6552 13.271 4.194 0.000 29.584 81.726 ============================================================================== Omnibus: 4.773 Durbin-Watson: 1.958 Prob(Omnibus): 0.092 Jarque-Bera (JB): 4.593 Skew: 0.223 Prob(JB): 0.101 Kurtosis: 3.108 Cond. No. 8.24e+03 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 8.24e+03. This might indicate that there are strong multicollinearity or other numerical problems.
Now we move on to a MAR (missing at random) scenario, once more starting with R. Here the probability that x2 is observed is a logistic function of y centred at 21,791 (close to the mean of y derived above), so cases with larger values of y are more likely to retain x2.
%%R
# Set seed to ensure reproducibility
set.seed(123)
# Generate metric independent variables
x1 <- rnorm(1000, mean = 40, sd = 12)
x2 <- rnorm(1000, mean = 200, sd = 50)
x3 <- rnorm(1000, mean = 150, sd = 5)
# Generate metric dependent variable
y <- 30*x1 + 40*x2 + 50*x3 + rnorm(1000, mean = 5000, sd = 1500)
# Generate MAR (Missing At Random) pattern
prob_mar <- 1 / (1 + exp(-(y - 21791))) # Logistic function
rmar <- rbinom(1000, 1, prob_mar) # Generate missing indicator based on probability
x2[rmar == 0] <- NA # Set x2 to NA for MAR missingness
# Create data frame
data <- data.frame(y, x1, x2, x3)
# Run OLS regression (listwise deletion by default)
ols_model <- lm(y ~ x1 + x2 + x3, data = data, na.action = na.omit)
# Display regression results
summary(ols_model)
Call: lm(formula = y ~ x1 + x2 + x3, data = data, na.action = na.omit) Residuals: Min 1Q Median 3Q Max -2876.0 -828.4 -78.3 670.2 3761.0 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 12209.232 1598.736 7.637 1.13e-13 *** x1 19.693 4.588 4.292 2.12e-05 *** x2 24.731 1.351 18.300 < 2e-16 *** x3 33.021 10.191 3.240 0.00127 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1161 on 504 degrees of freedom (492 observations deleted due to missingness) Multiple R-squared: 0.4173, Adjusted R-squared: 0.4138 F-statistic: 120.3 on 3 and 504 DF, p-value: < 2.2e-16
%%stata
clear
* Set seed
set seed 123
* Set number of observations
set obs 1000
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)
* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)
* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)
* Generate metric dependent variable with large coefficients
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
* Generate MAR
gen prob_mar = logistic(y-21791)
gen rmar = 0 if prob_mar==0
replace x2 = . if rmar == 0 // Set x to missing where rmar == 0
regress y x1 x2 x3
etable, append
. . clear . . * Set seed . set seed 123 . . * Set number of observations . set obs 1000 Number of observations (_N) was 0, now 1,000. . . * Generate metric independent variable #1 . drawnorm x1, n(1000) means(40) sds(12) . . * Generate metric independent variable #2 . drawnorm x2, n(1000) means(200) sds(50) . . * Generate metric independent variable #3 . drawnorm x3, n(1000) means(150) sds(5) . . * Generate metric dependent variable with large coefficients . gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500) . . * Generate MAR . gen prob_mar = logistic(y-21791) . gen rmar = 0 if prob_mar==0 (500 missing values generated) . replace x2 = . if rmar == 0 // Set x to missing where rmar == 0 (500 real changes made, 500 to missing) . regress y x1 x2 x3 Source | SS df MS Number of obs = 500 -------------+---------------------------------- F(3, 496) = 122.89 Model | 552925039 3 184308346 Prob > F = 0.0000 Residual | 743864709 496 1499727.24 R-squared = 0.4264 -------------+---------------------------------- Adj R-squared = 0.4229 Total | 1.2968e+09 499 2598777.05 Root MSE = 1224.6 ------------------------------------------------------------------------------ y | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- x1 | 20.14886 4.982177 4.04 0.000 10.36009 29.93763 x2 | 26.96303 1.407434 19.16 0.000 24.19777 29.7283 x3 | 19.7006 10.99788 1.79 0.074 -1.907582 41.30878 _cons | 13742.28 1717.077 8.00 0.000 10368.64 17115.92 ------------------------------------------------------------------------------ . . etable, append ------------------------------------------------------------------- y 2 y y ------------------------------------------------------------------- x1 29.409 29.409 25.214 20.149 (4.032) (4.024) (5.942) (4.982) x2 42.717 42.717 42.557 26.963 (0.944) (0.942) (1.385) (1.407) x3 54.520 54.520 48.179 19.701 (9.246) (9.227) (13.133) (10.998) Intercept 3778.896 3778.896 4912.711 13742.276 (1402.667) (1399.858) (2007.467) (1717.077) var(e.y) 2.20e+06 (98396.280) Number of observations 1000 1000 508 500 ------------------------------------------------------------------- .
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Set seed for reproducibility
np.random.seed(123)
# Set number of observations
n = 1000
# Generate independent variables
x1 = np.random.normal(loc=40, scale=12, size=n)
x2 = np.random.normal(loc=200, scale=50, size=n)
x3 = np.random.normal(loc=150, scale=5, size=n)
# Generate dependent variable
y = 30*x1 + 40*x2 + 50*x3 + np.random.normal(loc=5000, scale=1500, size=n)
# Generate MAR (Missing At Random) for x2
# Logistic function: scale y to avoid extreme probabilities
scaled_y = (y - np.mean(y)) / 1000
prob_mar = 1 / (1 + np.exp(-scaled_y)) # Values between 0 and 1
# Generate MAR indicator
rmar = np.random.binomial(n=1, p=prob_mar) # Correct binomial usage
# Apply missingness to x2
x2[rmar == 0] = np.nan
# Create DataFrame
data = pd.DataFrame({'y': y, 'x1': x1, 'x2': x2, 'x3': x3})
# Run OLS regression (listwise deletion)
data_clean = data.dropna() # Remove rows with missing values
X = sm.add_constant(data_clean[['x1', 'x2', 'x3']]) # Add intercept
model = sm.OLS(data_clean['y'], X).fit()
# Display regression results
print(model.summary())
OLS Regression Results ============================================================================== Dep. Variable: y R-squared: 0.426 Model: OLS Adj. R-squared: 0.423 Method: Least Squares F-statistic: 123.7 Date: Mon, 05 May 2025 Prob (F-statistic): 6.18e-60 Time: 11:34:35 Log-Likelihood: -4337.5 No. Observations: 504 AIC: 8683. Df Residuals: 500 BIC: 8700. Df Model: 3 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ const 9446.1975 1890.284 4.997 0.000 5732.318 1.32e+04 x1 18.1592 5.048 3.598 0.000 8.242 28.076 x2 28.6725 1.507 19.027 0.000 25.712 31.633 x3 44.9865 12.007 3.747 0.000 21.396 68.577 ============================================================================== Omnibus: 3.628 Durbin-Watson: 1.937 Prob(Omnibus): 0.163 Jarque-Bera (JB): 3.515 Skew: 0.160 Prob(JB): 0.173 Kurtosis: 2.746 Cond. No. 8.81e+03 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 8.81e+03. This might indicate that there are strong multicollinearity or other numerical problems.
Now we have our baseline MAR model for each software package. We can now focus on implementing potential solutions for handling the missing data, starting with single mean imputation.
%%R
# Set seed for reproducibility
set.seed(123)
# Generate metric independent variables
x1 <- rnorm(1000, mean = 40, sd = 12)
x2 <- rnorm(1000, mean = 200, sd = 50)
x3 <- rnorm(1000, mean = 150, sd = 5)
# Generate metric dependent variable
y <- 30*x1 + 40*x2 + 50*x3 + rnorm(1000, mean = 5000, sd = 1500)
# Generate MAR (Missing At Random) for x2 using logistic function
prob_mar <- 1 / (1 + exp(-(y - 21791))) # Logistic function
rmar <- rbinom(1000, 1, prob_mar) # Generate MAR indicator
x2[rmar == 0] <- NA # Set x2 to missing where rmar == 0
# Display the summary statistics for x2 (mean only)
mean_x2 <- mean(x2, na.rm = TRUE)
# Replace missing values of x2 with the mean of x2
x2[is.na(x2)] <- mean_x2
# Create a data frame with the variables
data <- data.frame(y, x1, x2, x3)
# Run OLS regression
model <- lm(y ~ x1 + x2 + x3, data = data)
# Display the regression results
summary(model)
Call: lm(formula = y ~ x1 + x2 + x3, data = data) Residuals: Min 1Q Median 3Q Max -7956.9 -1644.3 163.7 1915.5 5959.5 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1910.091 2459.762 0.777 0.438 x1 43.837 6.384 6.866 1.16e-11 *** x2 24.910 2.794 8.915 < 2e-16 *** x3 81.888 15.540 5.270 1.68e-07 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2401 on 996 degrees of freedom Multiple R-squared: 0.1319, Adjusted R-squared: 0.1293 F-statistic: 50.46 on 3 and 996 DF, p-value: < 2.2e-16
%%stata
clear
* Set seed
set seed 123
* Set number of observations
set obs 1000
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)
* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)
* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)
* Generate metric dependent variable with large coefficients
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
* Generate MAR
gen prob_mar = logistic(y-21791)
gen rmar = 0 if prob_mar==0
replace x2 = . if rmar == 0 // Set x to missing where rmar == 0
summarize x2, meanonly
replace x2 = r(mean) if missing(x2)
regress y x1 x2 x3
etable, append
. . clear . . * Set seed . set seed 123 . . * Set number of observations . set obs 1000 Number of observations (_N) was 0, now 1,000. . . * Generate metric independent variable #1 . drawnorm x1, n(1000) means(40) sds(12) . . * Generate metric independent variable #2 . drawnorm x2, n(1000) means(200) sds(50) . . * Generate metric independent variable #3 . drawnorm x3, n(1000) means(150) sds(5) . . * Generate metric dependent variable with large coefficients . gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500) . . * Generate MAR . gen prob_mar = logistic(y-21791) . gen rmar = 0 if prob_mar==0 (500 missing values generated) . replace x2 = . if rmar == 0 // Set x to missing where rmar == 0 (500 real changes made, 500 to missing) . . summarize x2, meanonly . replace x2 = r(mean) if missing(x2) (500 real changes made) . . regress y x1 x2 x3 Source | SS df MS Number of obs = 1,000 -------------+---------------------------------- F(3, 996) = 44.23 Model | 814884957 3 271628319 Prob > F = 0.0000 Residual | 6.1174e+09 996 6141946.06 R-squared = 0.1175 -------------+---------------------------------- Adj R-squared = 0.1149 Total | 6.9323e+09 999 6939202.43 Root MSE = 2478.3 ------------------------------------------------------------------------------ y | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- x1 | 36.86274 6.772257 5.44 0.000 23.57321 50.15227 x2 | 28.00538 2.82263 9.92 0.000 22.4664 33.54436 x3 | 63.7622 15.41944 4.14 0.000 33.50389 94.02052 _cons | 4173.834 2437.463 1.71 0.087 -609.3184 8956.987 ------------------------------------------------------------------------------ . . etable, append ------------------------------------------------------------------------------ y 2 y y y ------------------------------------------------------------------------------ x1 29.409 29.409 25.214 20.149 36.863 (4.032) (4.024) (5.942) (4.982) (6.772) x2 42.717 42.717 42.557 26.963 28.005 (0.944) (0.942) (1.385) (1.407) (2.823) x3 54.520 54.520 48.179 19.701 63.762 (9.246) (9.227) (13.133) (10.998) (15.419) Intercept 3778.896 3778.896 4912.711 13742.276 4173.834 (1402.667) (1399.858) (2007.467) (1717.077) (2437.463) var(e.y) 2.20e+06 (98396.280) Number of observations 1000 1000 508 500 1000 ------------------------------------------------------------------------------ .
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Set seed for reproducibility
np.random.seed(123)
# Set number of observations
n = 1000
# Generate independent variables
x1 = np.random.normal(loc=40, scale=12, size=n)
x2 = np.random.normal(loc=200, scale=50, size=n)
x3 = np.random.normal(loc=150, scale=5, size=n)
# Generate dependent variable
y = 30*x1 + 40*x2 + 50*x3 + np.random.normal(loc=5000, scale=1500, size=n)
# Generate MAR (Missing At Random) for x2
scaled_y = (y - np.mean(y)) / 1000
prob_mar = 1 / (1 + np.exp(-scaled_y)) # Logistic probability
rmar = np.random.binomial(n=1, p=prob_mar) # MAR indicator
x2[rmar == 0] = np.nan # Apply missingness
# Step: Mean imputation for x2
x2_mean = np.nanmean(x2) # Mean of observed x2
x2_filled = np.where(np.isnan(x2), x2_mean, x2) # Replace NaNs with mean
# Create DataFrame with imputed x2
data_mean_imputed = pd.DataFrame({
'y': y,
'x1': x1,
'x2': x2_filled,
'x3': x3
})
# Run OLS regression on mean-imputed data
X = sm.add_constant(data_mean_imputed[['x1', 'x2', 'x3']]) # Add intercept
model = sm.OLS(data_mean_imputed['y'], X).fit()
# Display regression results
print(model.summary())
OLS Regression Results ============================================================================== Dep. Variable: y R-squared: 0.134 Model: OLS Adj. R-squared: 0.131 Method: Least Squares F-statistic: 51.26 Date: Mon, 05 May 2025 Prob (F-statistic): 8.18e-31 Time: 11:34:35 Log-Likelihood: -9129.7 No. Observations: 1000 AIC: 1.827e+04 Df Residuals: 996 BIC: 1.829e+04 Df Model: 3 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ const 5571.3342 2306.839 2.415 0.016 1044.512 1.01e+04 x1 24.4270 5.898 4.142 0.000 12.853 36.001 x2 28.9485 2.530 11.442 0.000 23.984 33.913 x3 58.3072 14.460 4.032 0.000 29.931 86.683 ============================================================================== Omnibus: 19.477 Durbin-Watson: 1.957 Prob(Omnibus): 0.000 Jarque-Bera (JB): 19.625 Skew: -0.321 Prob(JB): 5.48e-05 Kurtosis: 2.755 Cond. No. 8.95e+03 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 8.95e+03. This might indicate that there are strong multicollinearity or other numerical problems.
Now we move on to full information maximum likelihood (FIML) approaches, with R first. Rather than filling in missing values, FIML evaluates each case's contribution to the likelihood using only the variables observed for that case, as sketched below.
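A minimal statement of the casewise log-likelihood under the multivariate normal model (standard FIML notation, not taken from the notebook output): for case $i$ with $k_i$ observed variables,
$$\ell_i(\mu, \Sigma) = -\tfrac{1}{2}\Big( k_i \log 2\pi + \log\lvert \Sigma_i \rvert + (y_i - \mu_i)^{\top} \Sigma_i^{-1} (y_i - \mu_i) \Big),$$
where $y_i$ is the observed sub-vector for case $i$ and $\mu_i$ and $\Sigma_i$ are the corresponding elements of the model-implied mean vector and covariance matrix. FIML maximises $\sum_i \ell_i$, so every case contributes whatever information it has.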
%%R
# Set seed for reproducibility
set.seed(123)
# Generate metric independent variables
x1 <- rnorm(1000, mean = 40, sd = 12)
x2 <- rnorm(1000, mean = 200, sd = 50)
x3 <- rnorm(1000, mean = 150, sd = 5)
# Generate metric dependent variable
y <- 30*x1 + 40*x2 + 50*x3 + rnorm(1000, mean = 5000, sd = 1500)
# Generate MAR (Missing At Random) for x2
prob_mar <- 1 / (1 + exp(-(y - 21791))) # Logistic function
rmar <- ifelse(prob_mar == 0, 0, 1) # 0 if prob_mar is 0, 1 otherwise
x2[rmar == 0] <- NA # Set x2 to missing where rmar == 0
# Load 'lavaan' package for SEM (Structural Equation Modeling)
library(lavaan)
# Define the model formula (SEM model)
model <- '
y ~ x1 + x2 + x3 # y as a function of x1, x2, and x3
'
# Fit the model with FIML (missing = "ml"); fixed.x = FALSE is needed so that
# predictors with missing values are modelled rather than listwise deleted
fit <- sem(model, data = data.frame(y, x1, x2, x3), missing = "ml", fixed.x = FALSE)
# Display the model summary
summary(fit)
lavaan 0.6-18 ended normally after 1 iteration Estimator ML Optimization method NLMINB Number of model parameters 4 Used Total Number of observations 606 1000 Model Test User Model: Test statistic 0.000 Degrees of freedom 0 Parameter Estimates: Standard errors Standard Information Expected Information saturated (h1) model Structured Regressions: Estimate Std.Err z-value P(>|z|) y ~ x1 19.799 4.270 4.637 0.000 x2 27.159 1.260 21.549 0.000 x3 41.840 9.653 4.334 0.000 Variances: Estimate Std.Err z-value P(>|z|) .y 1469996.086 84449.069 17.407 0.000
In addition: Warning message: lavaan->lav_data_full(): some observed variances are larger than 1000000 use varTable(fit) to investigate
%%stata
clear
* Set seed
set seed 123
* Set number of observations
set obs 1000
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)
* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)
* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)
* Generate metric dependent variable with large coefficients
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
* Generate MAR
gen prob_mar = logistic(y-21791)
gen rmar = 0 if prob_mar==0
replace x2 = . if rmar == 0 // Set x to missing where rmar == 0
* Model estimation
sem (y <- x1 x2 x3), method(mlmv)
etable, append
. . clear . . * Set seed . set seed 123 . . * Set number of observations . set obs 1000 Number of observations (_N) was 0, now 1,000. . . * Generate metric independent variable #1 . drawnorm x1, n(1000) means(40) sds(12) . . * Generate metric independent variable #2 . drawnorm x2, n(1000) means(200) sds(50) . . * Generate metric independent variable #3 . drawnorm x3, n(1000) means(150) sds(5) . . * Generate metric dependent variable with large coefficients . gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500) . . * Generate MAR . gen prob_mar = logistic(y-21791) . gen rmar = 0 if prob_mar==0 (500 missing values generated) . replace x2 = . if rmar == 0 // Set x to missing where rmar == 0 (500 real changes made, 500 to missing) . . * Model estimation . sem (y <- x1 x2 x3), method(mlmv) note: Missing values found in observed exogenous variables. Using the noxconditional behavior. Specify the forcexconditional option to override this behavior. Endogenous variables Observed: y Exogenous variables Observed: x1 x2 x3 Fitting saturated model: Iteration 0: Log likelihood = -19809.893 Iteration 1: Log likelihood = -18956.184 Iteration 2: Log likelihood = -18603.225 Iteration 3: Log likelihood = -18601.025 Iteration 4: Log likelihood = -18601 Iteration 5: Log likelihood = -18601 Fitting baseline model: Iteration 0: Log likelihood = -18886.253 Iteration 1: Log likelihood = -18792.75 Iteration 2: Log likelihood = -18757.19 Iteration 3: Log likelihood = -18754.887 Iteration 4: Log likelihood = -18754.864 Iteration 5: Log likelihood = -18754.864 Fitting target model: Iteration 0: Log likelihood = -18601 Iteration 1: Log likelihood = -18601 Structural equation model Number of obs = 1,000 Estimation method: mlmv Log likelihood = -18601 ------------------------------------------------------------------------------ | OIM | Coefficient std. err. z P>|z| [95% conf. interval] -------------+---------------------------------------------------------------- Structural | y | x1 | 38.10574 5.583454 6.82 0.000 27.16237 49.04911 x2 | 41.70024 1.300009 32.08 0.000 39.15227 44.24821 x3 | 40.31928 12.42463 3.25 0.001 15.96746 64.6711 _cons | 5775.744 1925.672 3.00 0.003 2001.496 9549.992 -------------+---------------------------------------------------------------- mean(x1)| 40.24597 .3686473 109.17 0.000 39.52343 40.96851 mean(x2)| 200.6471 2.530853 79.28 0.000 195.6867 205.6075 mean(x3)| 149.797 .1607991 931.58 0.000 149.4818 150.1121 -------------+---------------------------------------------------------------- var(e.y)| 2300879 187511.7 1961212 2699374 var(x1)| 135.9009 6.077671 124.496 148.3505 var(x2)| 2554.76 197.2925 2195.916 2972.244 var(x3)| 25.85634 1.156331 23.68646 28.225 -------------+---------------------------------------------------------------- cov(x1,x2)| -29.66936 22.2683 -1.33 0.183 -73.31444 13.97571 cov(x1,x3)| 1.100681 1.874863 0.59 0.557 -2.573983 4.775344 cov(x2,x3)| 12.02246 9.573148 1.26 0.209 -6.740568 30.78548 ------------------------------------------------------------------------------ LR test of model vs. saturated: chi2(0) = 0.00 Prob > chi2 = . . . 
etable, append ----------------------------------------------------------------------------------------- y 2 y y y 6 ----------------------------------------------------------------------------------------- x1 29.409 29.409 25.214 20.149 36.863 38.106 (4.032) (4.024) (5.942) (4.982) (6.772) (5.583) x2 42.717 42.717 42.557 26.963 28.005 41.700 (0.944) (0.942) (1.385) (1.407) (2.823) (1.300) x3 54.520 54.520 48.179 19.701 63.762 40.319 (9.246) (9.227) (13.133) (10.998) (15.419) (12.425) Intercept 3778.896 3778.896 4912.711 13742.276 4173.834 5775.744 (1402.667) (1399.858) (2007.467) (1717.077) (2437.463) (1925.672) var(e.y) 2.20e+06 2.30e+06 (98396.280) (1.88e+05) mean(x1) 40.246 (0.369) mean(x2) 200.647 (2.531) mean(x3) 149.797 (0.161) var(x1) 135.901 (6.078) var(x2) 2554.760 (197.292) var(x3) 25.856 (1.156) cov(x1,x2) -29.669 (22.268) cov(x1,x3) 1.101 (1.875) cov(x2,x3) 12.022 (9.573) Number of observations 1000 1000 508 500 1000 1000 ----------------------------------------------------------------------------------------- .
To my knowledge, there is no widely used FIML implementation in Python, which makes FIML impractical as a missing data technique there; instead you have to fall back on alternatives such as multiple imputation. This will now be demonstrated, again starting with R. For the purposes of this simulation, because all variables are continuous, a multivariate normal (MVN) multiple imputation approach could be used, although MICE (multiple imputation by chained equations) also allows variables of different types to be imputed.
%%R
# Install and load the `mice` package for multiple imputation (if not already installed)
if (!require(mice)) install.packages("mice", dependencies = TRUE)
library(mice)
# Set seed for reproducibility
set.seed(123)
# Generate metric independent variables
x1 <- rnorm(1000, mean = 40, sd = 12)
x2 <- rnorm(1000, mean = 200, sd = 50)
x3 <- rnorm(1000, mean = 150, sd = 5)
# Generate metric dependent variable
y <- 30*x1 + 40*x2 + 50*x3 + rnorm(1000, mean = 5000, sd = 1500)
# Generate MAR (Missing At Random) for x2
prob_mar <- 1 / (1 + exp(-(y - 21791))) # Logistic function
rmar <- ifelse(prob_mar == 0, 0, 1) # 0 if prob_mar is 0, 1 otherwise
x2[rmar == 0] <- NA # Set x2 to missing where rmar == 0
# Prepare the data for multiple imputation
data <- data.frame(y, x1, x2, x3)
# Perform multiple imputation using chained equations (MICE)
imp <- mice(data, m = 10, method = "pmm", seed = 12345)
# Perform regression after imputation
fit <- with(imp, lm(y ~ x1 + x2 + x3))
# Combine the results of the regression and display the estimates
pooled_results <- pool(fit)
summary(pooled_results)
iter imp variable 1 1 x2 1 2 x2 1 3 x2 1 4 x2 1 5 x2 1 6 x2 1 7 x2 1 8 x2 1 9 x2 1 10 x2 2 1 x2 2 2 x2 2 3 x2 2 4 x2 2 5 x2 2 6 x2 2 7 x2 2 8 x2 2 9 x2 2 10 x2 3 1 x2 3 2 x2 3 3 x2 3 4 x2 3 5 x2 3 6 x2 3 7 x2 3 8 x2 3 9 x2 3 10 x2 4 1 x2 4 2 x2 4 3 x2 4 4 x2 4 5 x2 4 6 x2 4 7 x2 4 8 x2 4 9 x2 4 10 x2 5 1 x2 5 2 x2 5 3 x2 5 4 x2 5 5 x2 5 6 x2 5 7 x2 5 8 x2 5 9 x2 5 10 x2 term estimate std.error statistic df p.value 1 (Intercept) 2184.92911 1901.113549 1.149289 415.6189 2.510977e-01 2 x1 30.19575 5.472728 5.517495 165.8509 1.299693e-07 3 x2 39.85730 1.632656 24.412548 80.6700 5.233432e-39 4 x3 67.14173 12.700619 5.286493 308.6763 2.363451e-07
Loading required package: mice Attaching package: ‘mice’ The following object is masked from ‘package:stats’: filter The following objects are masked from ‘package:base’: cbind, rbind
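Both pool() in mice above and mi estimate in Stata below combine the per-imputation results using Rubin's rules. In standard notation (not taken from the output above), with $m$ imputations, coefficient estimate $\hat{Q}_j$ and squared standard error $U_j$ from imputation $j$,
$$\bar{Q} = \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j, \qquad \bar{U} = \frac{1}{m}\sum_{j=1}^{m} U_j, \qquad B = \frac{1}{m-1}\sum_{j=1}^{m}\big(\hat{Q}_j - \bar{Q}\big)^2, \qquad T = \bar{U} + \Big(1 + \frac{1}{m}\Big)B,$$
so the pooled estimate is $\bar{Q}$ and its standard error is $\sqrt{T}$, combining within- and between-imputation variance.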
Now the same in Stata.
%%stata
clear
* Set seed
set seed 123
* Set number of observations
set obs 1000
* Generate metric independent variable #1
drawnorm x1, n(1000) means(40) sds(12)
* Generate metric independent variable #2
drawnorm x2, n(1000) means(200) sds(50)
* Generate metric independent variable #3
drawnorm x3, n(1000) means(150) sds(5)
* Generate metric dependent variable with large coefficients
gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500)
* Generate MAR
gen prob_mar = logistic(y-21791)
gen rmar = 0 if prob_mar==0
replace x2 = . if rmar == 0 // Set x to missing where rmar == 0
mi set wide
mi register imputed y x1 x2 x3
tab _mi_miss
mi impute chained ///
///
(regress) y x1 x2 x3 ///
, rseed(12345) dots force add(10) burnin(10)
mi estimate, post dots: regress y x1 x2 x3
etable, append
. . clear . . * Set seed . set seed 123 . . * Set number of observations . set obs 1000 Number of observations (_N) was 0, now 1,000. . . * Generate metric independent variable #1 . drawnorm x1, n(1000) means(40) sds(12) . . * Generate metric independent variable #2 . drawnorm x2, n(1000) means(200) sds(50) . . * Generate metric independent variable #3 . drawnorm x3, n(1000) means(150) sds(5) . . * Generate metric dependent variable with large coefficients . gen y = 30*x1 + 40*x2 + 50*x3 + rnormal(5000, 1500) . . * Generate MAR . gen prob_mar = logistic(y-21791) . gen rmar = 0 if prob_mar==0 (500 missing values generated) . replace x2 = . if rmar == 0 // Set x to missing where rmar == 0 (500 real changes made, 500 to missing) . . mi set wide . . mi register imputed y x1 x2 x3 . . tab _mi_miss Incomplete | observation | Freq. Percent Cum. ------------+----------------------------------- 0 | 500 50.00 50.00 1 | 500 50.00 100.00 ------------+----------------------------------- Total | 1,000 100.00 . . . mi impute chained /// > /// > (regress) y x1 x2 x3 /// > , rseed(12345) dots force add(10) burnin(10) note: variables y x1 x3 contain no soft missing (.) values; imputing nothing note: missing-value pattern is monotone; no iteration performed. Conditional models (monotone): x2: regress x2 x3 x1 y Performing chained iterations: imputing m=1 through m=10 .........10 done Multivariate imputation Imputations = 10 Chained equations added = 10 Imputed: m=1 through m=10 updated = 0 Initialization: monotone Iterations = 0 burn-in = 0 y: linear regression x1: linear regression x2: linear regression x3: linear regression ------------------------------------------------------------------ | Observations per m |---------------------------------------------- Variable | Complete Incomplete Imputed | Total -------------------+-----------------------------------+---------- y | 1000 0 0 | 1000 x1 | 1000 0 0 | 1000 x2 | 500 500 500 | 1000 x3 | 1000 0 0 | 1000 ------------------------------------------------------------------ (Complete + Incomplete = Total; Imputed is the minimum across m of the number of filled-in observations.) . . . mi estimate, post dots: regress y x1 x2 x3 Imputations (10): .........10 done Multiple-imputation estimates Imputations = 10 Linear regression Number of obs = 1,000 Average RVI = 2.3440 Largest FMI = 0.5779 Complete DF = 996 DF adjustment: Small sample DF: min = 27.94 avg = 60.86 max = 124.57 Model F test: Equal FMI F( 3, 106.8) = 365.04 Within VCE type: OLS Prob > F = 0.0000 ------------------------------------------------------------------------------ y | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- x1 | 36.94599 4.821662 7.66 0.000 27.403 46.48898 x2 | 42.0321 1.471605 28.56 0.000 39.01734 45.04686 x3 | 39.50875 12.21717 3.23 0.002 15.03525 63.98224 _cons | 5850.553 2031.63 2.88 0.007 1725.712 9975.395 ------------------------------------------------------------------------------ . . 
etable, append ---------------------------------------------------------------------------------------------------- y 2 y y y 6 y ---------------------------------------------------------------------------------------------------- x1 29.409 29.409 25.214 20.149 36.863 38.106 36.946 (4.032) (4.024) (5.942) (4.982) (6.772) (5.583) (4.822) x2 42.717 42.717 42.557 26.963 28.005 41.700 42.032 (0.944) (0.942) (1.385) (1.407) (2.823) (1.300) (1.472) x3 54.520 54.520 48.179 19.701 63.762 40.319 39.509 (9.246) (9.227) (13.133) (10.998) (15.419) (12.425) (12.217) Intercept 3778.896 3778.896 4912.711 13742.276 4173.834 5775.744 5850.553 (1402.667) (1399.858) (2007.467) (1717.077) (2437.463) (1925.672) (2031.630) var(e.y) 2.20e+06 2.30e+06 (98396.280) (1.88e+05) mean(x1) 40.246 (0.369) mean(x2) 200.647 (2.531) mean(x3) 149.797 (0.161) var(x1) 135.901 (6.078) var(x2) 2554.760 (197.292) var(x3) 25.856 (1.156) cov(x1,x2) -29.669 (22.268) cov(x1,x3) 1.101 (1.875) cov(x2,x3) 12.022 (9.573) Number of observations 1000 1000 508 500 1000 1000 1000 ---------------------------------------------------------------------------------------------------- .
And finally, Python.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression
# Set seed for reproducibility
np.random.seed(123)
# Set number of observations
n = 1000
# Generate independent variables
x1 = np.random.normal(loc=40, scale=12, size=n)
x2 = np.random.normal(loc=200, scale=50, size=n)
x3 = np.random.normal(loc=150, scale=5, size=n)
# Generate dependent variable
y = 30*x1 + 40*x2 + 50*x3 + np.random.normal(loc=5000, scale=1500, size=n)
# Generate MAR (Missing At Random) for x2
scaled_y = (y - np.mean(y)) / 1000
prob_mar = 1 / (1 + np.exp(-scaled_y)) # Logistic probability
rmar = np.random.binomial(n=1, p=prob_mar) # MAR indicator
x2[rmar == 0] = np.nan # Apply missingness
# Create DataFrame
data = pd.DataFrame({'y': y, 'x1': x1, 'x2': x2, 'x3': x3})
lr = LinearRegression()
imp = IterativeImputer(estimator=lr, verbose=2, max_iter=5, tol=1e-10, imputation_order='roman')
ImputedData = imp.fit_transform(data)
# Convert imputed array back to DataFrame
imputed_df = pd.DataFrame(ImputedData, columns=data.columns)
# Define predictors and add constant
X_imputed = sm.add_constant(imputed_df[['x1', 'x2', 'x3']]) # Add intercept
y_imputed = imputed_df['y']
# Fit OLS model
model_imputed = sm.OLS(y_imputed, X_imputed).fit()
# Display regression summary
print(model_imputed.summary())
[IterativeImputer] Completing matrix with shape (1000, 4) [IterativeImputer] Ending imputation round 1/5, elapsed time 0.00 [IterativeImputer] Change: 132.83931088297044, scaled tolerance: 2.866777347081888e-06 [IterativeImputer] Ending imputation round 2/5, elapsed time 0.00 [IterativeImputer] Change: 0.0, scaled tolerance: 2.866777347081888e-06 [IterativeImputer] Early stopping criterion reached. OLS Regression Results ============================================================================== Dep. Variable: y R-squared: 0.734 Model: OLS Adj. R-squared: 0.733 Method: Least Squares F-statistic: 916.8 Date: Mon, 05 May 2025 Prob (F-statistic): 6.55e-286 Time: 11:34:37 Log-Likelihood: -8539.1 No. Observations: 1000 AIC: 1.709e+04 Df Residuals: 996 BIC: 1.711e+04 Df Model: 3 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ const 410.2896 1238.810 0.331 0.741 -2020.687 2841.267 x1 24.9007 3.265 7.626 0.000 18.493 31.308 x2 49.7527 0.962 51.727 0.000 47.865 51.640 x3 68.8840 8.001 8.609 0.000 53.183 84.585 ============================================================================== Omnibus: 148.471 Durbin-Watson: 1.913 Prob(Omnibus): 0.000 Jarque-Bera (JB): 261.767 Skew: 0.927 Prob(JB): 1.44e-57 Kurtosis: 4.686 Cond. No. 8.13e+03 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 8.13e+03. This might indicate that there are strong multicollinearity or other numerical problems.
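One caveat on the Python approach above: IterativeImputer produces a single completed dataset, so the standard errors from the final OLS fit do not reflect imputation uncertainty in the way the pooled mice and mi estimate results do. If pooled multiple imputation is wanted in Python, one option is the MICE implementation in statsmodels. The following is a minimal sketch, assuming the incomplete DataFrame data from the cell above is still in memory; it is not part of the original notebook.
import statsmodels.api as sm
from statsmodels.imputation import mice
# Wrap the incomplete data; by default MICEData imputes each variable with
# missing values using predictive mean matching, conditional on the others.
imp_data = mice.MICEData(data)
# Fit the analysis model to each imputed dataset and pool with Rubin's rules.
mi_model = mice.MICE('y ~ x1 + x2 + x3', sm.OLS, imp_data)
mi_results = mi_model.fit(n_burnin=10, n_imputations=10)
print(mi_results.summary())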