PA Ln(OR) with Subgroups

How does the population-averaged log odds ratio change with X under a probit random effects model?

Model Explanation

This visualization shows how the population-averaged log odds ratio (PA ln(OR)) changes as a function of X under a correctly specified probit random effects model.

The subject-specific model is:

Probit(P(Y = 1 | a_i, X)) = α + a_i + βX

where a_i ~ N(0, σ²).

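Under this model, the population-averaged (marginal) model is also a probit model, Probit(P(Y = 1 | X)) = α_PA + β_PA X, with coefficients attenuated by the random intercept variance:

α_PA = α / √(1 + σ²)
β_PA = β / √(1 + σ²)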
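The attenuation formula can be checked numerically. The sketch below is a minimal illustration, not the project's SAS code: it integrates the subject-specific probability over the random intercept with a trapezoid rule and compares the result with the closed-form marginal probit probability. The values of ALPHA, BETA, and SIGMA2 are hypothetical placeholders chosen only for the check.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal; its cdf is the inverse probit link


def marginal_prob_numeric(x, alpha, beta, sigma2, n_steps=4000):
    """Integrate Phi(alpha + a + beta*x) over a ~ N(0, sigma2) (trapezoid rule)."""
    sigma = math.sqrt(sigma2)
    random_effect = NormalDist(0.0, sigma)
    lo, hi = -8.0 * sigma, 8.0 * sigma
    step = (hi - lo) / n_steps
    total = 0.0
    for k in range(n_steps + 1):
        a = lo + k * step
        end_weight = 0.5 if k in (0, n_steps) else 1.0
        total += end_weight * STD_NORMAL.cdf(alpha + a + beta * x) * random_effect.pdf(a)
    return total * step


def marginal_prob_closed_form(x, alpha, beta, sigma2):
    """Marginal probability using coefficients divided by sqrt(1 + sigma2)."""
    scale = math.sqrt(1.0 + sigma2)
    return STD_NORMAL.cdf((alpha + beta * x) / scale)


# Hypothetical parameter values, not taken from the study.
ALPHA, BETA, SIGMA2 = 0.0, 0.5, 0.5

for x in (-2.0, 0.0, 1.5):
    numeric = marginal_prob_numeric(x, ALPHA, BETA, SIGMA2)
    closed = marginal_prob_closed_form(x, ALPHA, BETA, SIGMA2)
    print(f"x = {x:+.1f}  numeric = {numeric:.6f}  closed form = {closed:.6f}")
```

The two columns should agree to several decimal places for any choice of parameters, which is what justifies working with the marginal probit model directly.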

For each value x of X, PA ln(OR)(x) is computed using:

The probabilities P(x) and P(x + 1) from the marginal probit model
The transformation ln(OR)(x) = logit(P(x + 1)) - logit(P(x))
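The two steps above can be sketched in a few lines. This is an illustrative sketch under assumed parameter values (ALPHA, BETA, and SIGMA2 are hypothetical, not the study's), and the helper names are ours. The logit is computed as log Φ(z) - log Φ(-z) to stay numerically stable in the tails.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def logit_of_probit(z):
    """logit(Phi(z)), written as log Phi(z) - log Phi(-z) for tail stability."""
    return math.log(STD_NORMAL.cdf(z)) - math.log(STD_NORMAL.cdf(-z))


def pa_log_or(x, alpha, beta, sigma2):
    """Population-averaged log odds ratio comparing X = x + 1 with X = x."""
    scale = math.sqrt(1.0 + sigma2)  # attenuation from the random intercept
    z_x = (alpha + beta * x) / scale
    z_x1 = (alpha + beta * (x + 1.0)) / scale
    return logit_of_probit(z_x1) - logit_of_probit(z_x)


# Hypothetical parameter values, not taken from the study.
ALPHA, BETA, SIGMA2 = 0.0, 0.5, 0.5

for x in (-3.0, -1.5, -0.5, 1.5, 3.0):
    print(f"x = {x:+.1f}  PA ln(OR)(x) = {pa_log_or(x, ALPHA, BETA, SIGMA2):.4f}")
```

With ALPHA = 0 the curve is symmetric about x = -0.5 and takes its smallest value there. A nonzero intercept shifts that point.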

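Unlike the log odds ratio in a logistic model, this quantity varies with X. We therefore evaluate ln(OR)(x) across a dense grid of X values and interpret the curve directly. A single summary value for a population is obtained by averaging the curve over that population's distribution of X.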

The dashed curves represent the distribution of X for two subgroups:

Sex = Male ~ N(-1.5, 1)
Sex = Female ~ N(1.5, 1)

Points indicate the true population-averaged values for each subgroup, computed using their respective exposure distributions.
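One way to compute such subgroup values is to average ln(OR)(x) over a grid, weighting each grid point by the subgroup's exposure density. The sketch below assumes the second argument of N(·, 1) is a standard deviation of 1 (a variance of 1 gives the same result) and uses hypothetical values for ALPHA, BETA, and SIGMA2, so its output will not reproduce the dashboard's points.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def logit_of_probit(z):
    """logit(Phi(z)), computed stably in the tails."""
    return math.log(STD_NORMAL.cdf(z)) - math.log(STD_NORMAL.cdf(-z))


def pa_log_or(x, alpha, beta, sigma2):
    """PA log odds ratio for X = x + 1 versus X = x under the marginal probit model."""
    scale = math.sqrt(1.0 + sigma2)
    return (logit_of_probit((alpha + beta * (x + 1.0)) / scale)
            - logit_of_probit((alpha + beta * x) / scale))


def subgroup_pa_log_or(mean, sd, alpha, beta, sigma2, n_grid=2001):
    """Density-weighted average of ln(OR)(x) for X ~ N(mean, sd**2)."""
    exposure = NormalDist(mean, sd)
    lo, hi = mean - 8.0 * sd, mean + 8.0 * sd
    step = (hi - lo) / (n_grid - 1)
    weighted_sum = 0.0
    weight_total = 0.0
    for k in range(n_grid):
        x = lo + k * step
        w = exposure.pdf(x)
        weighted_sum += w * pa_log_or(x, alpha, beta, sigma2)
        weight_total += w
    return weighted_sum / weight_total  # normalize so the weights sum to 1


# Hypothetical parameter values, not taken from the study.
ALPHA, BETA, SIGMA2 = 0.0, 0.5, 0.5

male = subgroup_pa_log_or(-1.5, 1.0, ALPHA, BETA, SIGMA2)
female = subgroup_pa_log_or(1.5, 1.0, ALPHA, BETA, SIGMA2)
print(f"Male:   {male:.4f}")
print(f"Female: {female:.4f}")
```

The two subgroups share the same model parameters, so any difference between the printed values comes only from their different distributions of X.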

Predictive Accuracy Visualization

Findings

The PA ln(OR) differs between the Sex subgroups even though Sex is not included in the original probit model. The difference arises because the subgroups have different distributions of X. This can cause problems when limitations in the data leave a sample without every subgroup. The subgroup-specific values can be estimated using either a correctly specified probit model combined with Kernel Density Estimation (KDE) or a misspecified stratified GEE logistic model.

Real World Impact

This project demonstrates how population-averaged effect estimates can vary depending on model specification and subgroup structure. These findings are relevant for improving the accuracy of causal interpretation in clustered or heterogeneous data settings.

GitHub Repository | Live Dashboard

Probit KDE vs Logit GEE Sex Groups

What happens if you use the common approach, a stratified Generalized Estimating Equations (GEE) model with a logit link, instead of a correctly specified method such as Kernel Density Estimation (KDE) with a normal kernel?

Model Explanation

This comparison evaluates two approaches for estimating the population-averaged log odds ratio:

Kernel Density Estimation (KDE) based estimator (proposed method)
Stratified Generalized Estimating Equations (GEE) with a logit link (common approach)

Simulation setup:

200 clusters, 4 observations per cluster
500 simulated datasets
Random intercept: N(0, 0.5)
Exposure distributions differ by sex:
- Sex = Male ~ N(-1.5, 1)
- Sex = Female ~ N(1.5, 1)
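A single dataset under this setup could be generated along the following lines. This is a sketch in Python rather than the project's SAS code, and it rests on several assumptions that the setup above does not state: ALPHA and BETA are hypothetical, the 0.5 in N(0, 0.5) is read as a variance, and sex is assigned independently to each observation with probability 0.5. The binary outcome uses the latent-variable form of the probit model.

```python
import math
import random


def simulate_dataset(n_clusters=200, cluster_size=4, alpha=0.0, beta=0.5,
                     sigma2=0.5, seed=1):
    """Simulate one clustered probit random-intercept dataset as a list of dicts."""
    rng = random.Random(seed)
    rows = []
    for cluster in range(n_clusters):
        # Random intercept shared by all observations in the cluster
        # (assumption: 0.5 is the variance, so the SD is sqrt(0.5)).
        a_i = rng.gauss(0.0, math.sqrt(sigma2))
        for _ in range(cluster_size):
            # Assumption: sex is assigned per observation with probability 0.5.
            sex = rng.choice(("Male", "Female"))
            x = rng.gauss(-1.5 if sex == "Male" else 1.5, 1.0)
            # Latent-variable form of the probit model: Y = 1 when the linear
            # predictor plus standard normal noise is positive.
            latent = alpha + a_i + beta * x + rng.gauss(0.0, 1.0)
            rows.append({"cluster": cluster, "sex": sex, "x": x, "y": int(latent > 0.0)})
    return rows


dataset = simulate_dataset(seed=1)
print(len(dataset), "observations in", len({r["cluster"] for r in dataset}), "clusters")
```

Calling `simulate_dataset(seed=s)` for 500 different seeds would give the Monte Carlo replicates.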

KDE approach:

Estimates the density of X nonparametrically using kernel density estimation
Computes ln(OR)(x) across a grid
Averages using density-based weights
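The three steps above can be sketched as follows. This is a simplified illustration, not the project's implementation: the marginal coefficients ALPHA_PA and BETA_PA are hypothetical values plugged in directly, whereas in practice they would come from a fitted probit model, and the bandwidth uses the normal reference rule of thumb, which is our assumption. The kernel is normal, as in the setup described here.

```python
import math
import random
from statistics import NormalDist, stdev

STD_NORMAL = NormalDist()


def logit_of_probit(z):
    """logit(Phi(z)), computed stably in the tails."""
    return math.log(STD_NORMAL.cdf(z)) - math.log(STD_NORMAL.cdf(-z))


def pa_log_or(x, alpha_pa, beta_pa):
    """ln(OR)(x) from the marginal probit model with given marginal coefficients."""
    return (logit_of_probit(alpha_pa + beta_pa * (x + 1.0))
            - logit_of_probit(alpha_pa + beta_pa * x))


def kde_density(grid, sample, bandwidth):
    """Normal-kernel density estimate evaluated at each grid point."""
    n = len(sample)
    return [sum(STD_NORMAL.pdf((g - xi) / bandwidth) for xi in sample) / (n * bandwidth)
            for g in grid]


def kde_pa_log_or(sample, alpha_pa, beta_pa, n_grid=201):
    """Average ln(OR)(x) over a grid, weighted by the KDE of the observed X."""
    # Normal reference rule-of-thumb bandwidth (an assumption of this sketch).
    bandwidth = 1.06 * stdev(sample) * len(sample) ** (-0.2)
    lo = min(sample) - 3.0 * bandwidth
    hi = max(sample) + 3.0 * bandwidth
    step = (hi - lo) / (n_grid - 1)
    grid = [lo + k * step for k in range(n_grid)]
    density = kde_density(grid, sample, bandwidth)
    total = sum(density)
    return sum(d * pa_log_or(g, alpha_pa, beta_pa) for g, d in zip(grid, density)) / total


# Hypothetical marginal coefficients, not taken from the study.
ALPHA_PA, BETA_PA = 0.0, 0.5 / math.sqrt(1.5)

rng = random.Random(7)
male_x = [rng.gauss(-1.5, 1.0) for _ in range(400)]  # exposure sample for one subgroup
estimate = kde_pa_log_or(male_x, ALPHA_PA, BETA_PA)
print(f"KDE-weighted PA ln(OR): {estimate:.4f}")
```

Because the weights come from the observed X values, the same fitted model yields a different summary for each subgroup without refitting.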

GEE approach:

Uses a logit link (misspecified under probit data generation)
Requires stratification to obtain subgroup estimates
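The stratified fit can be illustrated with a deliberately simplified stand-in. Under an independence working correlation, GEE point estimates for a logit link coincide with ordinary logistic regression estimates, so the sketch below fits a logistic regression by Newton-Raphson within each sex stratum. It does not compute the robust (sandwich) standard errors a full GEE analysis would report. The data-generating values and the per-observation sex assignment are hypothetical assumptions.

```python
import math
import random


def simulate(n_clusters=200, cluster_size=4, alpha=0.0, beta=0.5, sigma2=0.5, seed=1):
    """Clustered probit random-intercept data as (sex, x, y) tuples (assumed setup)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_clusters):
        a_i = rng.gauss(0.0, math.sqrt(sigma2))  # 0.5 treated as a variance
        for _ in range(cluster_size):
            sex = rng.choice(("Male", "Female"))
            x = rng.gauss(-1.5 if sex == "Male" else 1.5, 1.0)
            y = int(alpha + a_i + beta * x + rng.gauss(0.0, 1.0) > 0.0)
            rows.append((sex, x, y))
    return rows


def fit_logistic(xs, ys, n_iter=25):
    """Intercept and slope of a logistic regression via Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p            # score for the intercept
            g1 += (y - p) * x      # score for the slope
            h00 += w               # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


data = simulate(seed=1)
slopes = {}
for sex in ("Male", "Female"):
    stratum = [(x, y) for s, x, y in data if s == sex]
    xs = [x for x, _ in stratum]
    ys = [y for _, y in stratum]
    slopes[sex] = fit_logistic(xs, ys)[1]  # slope estimates the stratum's ln(OR)
    print(f"{sex}: n = {len(stratum)}, logit slope = {slopes[sex]:.3f}")
```

Each stratum is fit on roughly half of the observations, which is the loss of effective sample size that stratification brings.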

Histograms show the distribution of the estimated PA ln(OR) across the 500 simulated datasets for each method and subgroup.

Method Comparison

Findings

The KDE-based estimator produces more precise estimates (lower variability) than GEE.
KDE results:
- Sex Male mean ≈ 0.801, SD ≈ 0.029
- Sex Female mean ≈ 0.670, SD ≈ 0.021
GEE results:
- Sex Male mean ≈ 0.791, SD ≈ 0.047
- Sex Female mean ≈ 0.668, SD ≈ 0.031
KDE achieves tighter distributions and more stable subgroup estimates.
GEE requires stratification, which reduces effective sample size and increases variability.
Confidence interval coverage is slightly lower for KDE (~93%) than for GEE (~95%), a trade-off for KDE's improved precision.
The key advantage of KDE is that it directly targets the population-averaged parameter by adapting to the observed distribution of X.

Data Description

This study uses fully simulated data generated from a clustered probit random-intercept model with 200 clusters and 4 observations per cluster. Exposure distributions differ by sex group and follow normal distributions with different means. A total of 500 Monte Carlo datasets were generated to evaluate estimator performance. The study population is entirely synthetic and does not correspond to any real-world dataset. All data were generated programmatically in SAS.
