ProSGPV is a package that performs variable selection with Second-Generation P-Values (SGPV). This document illustrates how ProSGPV performs variable selection with data from generalized linear models (GLM) and Cox proportional hazards models. Technical details about the algorithm can be found in Zuo, Stewart, and Blume (2021). We use logistic regression to illustrate how ProSGPV selects variables in a low-dimensional (n > p) real-world data example, and how to choose a model when each run gives slightly different results. We use Poisson regression to show how ProSGPV works with simulated high-dimensional Poisson data, together with a visualization tool for the selection process. We use the Cox proportional hazards model to show how a fast one-stage algorithm works with simulated data, together with a visualization tool for the one-stage ProSGPV.
To install the ProSGPV package from CRAN, you can do
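A minimal sketch of the CRAN installation, run once in an R session:

```r
# install the released version from CRAN
install.packages("ProSGPV")
```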
Alternatively, you can install a development version of ProSGPV by doing
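A sketch of the development installation, assuming the development repository is the zuoyi93/ProSGPV GitHub repository (adjust the account name if the repository lives elsewhere):

```r
# install.packages("remotes")  # if remotes is not yet installed
remotes::install_github("zuoyi93/ProSGPV")
```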
Once the package is installed, we can load it into the current environment.
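For example:

```r
# attach the package so that pro.sgpv, gen.sim.data, and spine are available
library(ProSGPV)
```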
Traditionally, maximum likelihood (ML) has been used to fit a logistic model. However, when the sample size is small or effect sizes are strong, complete or quasi-complete separation may occur, which can severely bias the parameter estimates (Albert and Anderson 1984; Rahman and Sultana 2017). Recently, Kosmidis and Firth (2021) proposed using a Jeffreys prior to address the separation issue in logistic regression, and that approach is implemented in the ProSGPV package.
In this section, we use a real-world data example to showcase how ProSGPV performs variable selection when the outcome is binary. Lower back pain can be caused by a variety of problems with any part of the complex, interconnected network of spinal muscles, nerves, bones, discs, or tendons in the lumbar spine. Dr. Henrique da Mota collected 12 biomechanical attributes from 310 patients, of whom 100 are normal and 210 are abnormal (disk hernia or spondylolisthesis). The goal is to differentiate the normal patients from the abnormal ones using those 12 variables. The biomechanical attributes include pelvic incidence, pelvic tilt, lumbar lordosis angle, sacral slope, pelvic radius, degree of spondylolisthesis, pelvic slope, direct tilt, thoracic slope, cervical tilt, sacrum angle, and scoliosis slope. The data set spine can be accessed in the ProSGPV package.
x <- spine[, -ncol(spine)]
y <- spine[, ncol(spine)]
sgpv.2s.l <- pro.sgpv(x, y, family = "binomial")
#> Warning: brglmFit: fitted probabilities numerically 0 or 1 occurred
sgpv.2s.l
#> Selected variables are sacral_slope pelvic_radius degree_spondylolisthesis
ProSGPV selects three of the 12 variables: sacral slope, pelvic radius, and degree of spondylolisthesis.
We can get a summary of the model that we selected.
summary(sgpv.2s.l)
#> Warning: brglmFit: fitted probabilities numerically 0 or 1 occurred
#>
#> Call:
#> glm(formula = Response ~ ., family = "binomial", data = data.d,
#> method = "brglmFit", type = "MPL_Jeffreys")
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.78904 -0.48146 0.04248 0.33705 2.26768
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 17.42056 3.15274 5.526 3.28e-08 ***
#> sacral_slope -0.11300 0.02214 -5.105 3.31e-07 ***
#> pelvic_radius -0.11716 0.02254 -5.197 2.03e-07 ***
#> degree_spondylolisthesis 0.15797 0.02162 7.306 2.75e-13 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 389.86 on 309 degrees of freedom
#> Residual deviance: 184.42 on 306 degrees of freedom
#> AIC: 192.42
#>
#> Type of estimator: MPL_Jeffreys (maximum penalized likelihood with Jeffreys'-prior penalty)
#> Number of Fisher Scoring iterations: 5
Coefficient estimates can be extracted by using the coef function.
coef(sgpv.2s.l)
#> Warning: brglmFit: fitted probabilities numerically 0 or 1 occurred
#> [1] 0.0000000 0.0000000 0.0000000 -0.1130012 -0.1171616 0.1579717
#> [7] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
Predictions on the probability scale can be made by using the predict function.
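A sketch of such a call, assuming predict for a ProSGPV fit forwards a type argument to the underlying GLM machinery in the way stats::predict.glm does (so that type = "response" returns fitted probabilities):

```r
# predicted probabilities of an abnormal outcome for the first few patients
# (the type argument is an assumption; check the package documentation)
head(predict(sgpv.2s.l, type = "response"))
```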
The gen.sim.data function is used to generate high-dimensional (p > n) simulation data from a Poisson distribution in this section. The gen.sim.data function has many arguments, as it can generate data with continuous, binary, count, and survival outcomes: n is the number of observations, p is the number of variables, s is the number of true signals, family can take the value of "gaussian", "binomial", "poisson", or "cox", beta.min is the smallest effect size in absolute value, beta.max is the largest effect size, rho is the autocorrelation coefficient in the design matrix, nu is the signal-to-noise ratio in the continuous-outcome case, sig is the standard deviation in the covariance matrix of the design matrix, intercept is used in the linear mean vector of the GLM, scale and shape are parameters of the Weibull survival time, and rateC is the rate of censoring.
In this section, we simulate 80 observations with 100 explanatory variables and one count outcome. The number of true signals is four. The smallest log rate ratio is 0.5 and the largest is 1.5. When p > n, only the two-stage ProSGPV is used, even if stage is set to 1.
set.seed(1)
data.log <- gen.sim.data(n=80,p=100,s=4,beta.min=0.5,beta.max=1.5,family="poisson")
x <- data.log[[1]]
y <- data.log[[2]]
(true.index <- data.log[[3]])
#> [1] 3 43 89 100
(true.beta <- data.log[[4]])
#> [1] 0.0000000 0.0000000 0.5000000 0.0000000 0.0000000 0.0000000
#> [7] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [13] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [19] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [25] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [31] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [37] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [43] 0.8333333 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [49] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [55] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [61] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [67] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [73] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [79] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [85] 0.0000000 0.0000000 0.0000000 0.0000000 -1.1666667 0.0000000
#> [91] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [97] 0.0000000 0.0000000 0.0000000 -1.5000000
sgpv.2s.p <- pro.sgpv(x,y,family="poisson")
sgpv.2s.p
#> Selected variables are V3 V43 V89 V100
We see that the two-stage ProSGPV recovered the true support.
Similarly, a summary of the final model is available by calling summary.
summary(sgpv.2s.p)
#>
#> Call:
#> glm(formula = Response ~ ., family = "poisson", data = data.d)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.15872 0.12192 -1.302 0.193
#> V3 0.59830 0.04679 12.786 <2e-16 ***
#> V43 0.89750 0.06422 13.975 <2e-16 ***
#> V89 -1.15970 0.05135 -22.583 <2e-16 ***
#> V100 -1.57528 0.08177 -19.265 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for poisson family taken to be 1)
#>
#> Null deviance: 1018.60 on 79 degrees of freedom
#> Residual deviance: 47.37 on 75 degrees of freedom
#> AIC: 243.92
#>
#> Number of Fisher Scoring iterations: 5
Coefficients can be extracted by coef. We see the estimates are quite close to the truth.
coef(sgpv.2s.p)
#> [1] 0.0000000 0.0000000 0.5982981 0.0000000 0.0000000 0.0000000
#> [7] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [13] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [19] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [25] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [31] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [37] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [43] 0.8974952 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [49] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [55] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [61] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [67] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [73] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [79] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [85] 0.0000000 0.0000000 0.0000000 0.0000000 -1.1597040 0.0000000
#> [91] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [97] 0.0000000 0.0000000 0.0000000 -1.5752756
In-sample prediction can be done as follows.
head(predict(sgpv.2s.p))
#> 1 2 3 4 5 6
#> 0.09056540 1.52559015 0.09082897 2.36900015 3.64998025 2.35678653
Out-of-sample prediction is also available by feeding data into the newdata argument.
set.seed(1)
data.log <- gen.sim.data(n=10,p=100,s=4,beta.min=0.5,beta.max=1.5,family="poisson")
x.new <- data.log[[1]]
y.new <- data.log[[2]]
data.frame(Observed=y.new,Predicted=predict(sgpv.2s.p,newdata=x.new))
#> Observed Predicted
#> 1 1 0.2511760
#> 2 0 0.7257742
#> 3 0 0.2668769
#> 4 7 6.4782805
#> 5 9 10.2552082
#> 6 0 0.2627837
#> 7 1 1.0470441
#> 8 0 0.4297897
#> 9 4 1.2498434
#> 10 7 5.5053366
The predictions agree well with the observed counts.
The selection process can be visualized below. lambda.max sets the limit of the X-axis. Only effects whose lower bound does not reach the null region are kept. Selected variables are labeled in blue on the Y-axis. The vertical dotted line is the λ selected by the generalized information criterion (Fan and Tang 2013).
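A sketch of such a call; the lambda.max value here is purely illustrative:

```r
# plot the two-stage selection process; lambda.max truncates the X-axis
plot(sgpv.2s.p, lambda.max = 0.01)
```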
You can also choose to view only one line per variable by setting lpv to 1. That line is the 95% confidence bound that is closer to the null.
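For instance:

```r
# show one confidence bound per variable: the bound closer to the null
plot(sgpv.2s.p, lpv = 1)
```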
In this section, we use simulated low-dimensional survival data to illustrate how the one-stage and two-stage ProSGPV select variables, with additional visualization of the one-stage algorithm's selection process.
The gen.sim.data function is used to simulate 100 observations.
set.seed(1)
data.cox <- gen.sim.data(n=100, p=20, s=4, family="cox",
beta.min=0.5, beta.max=1.5, sig=2)
x <- data.cox[[1]]
y <- data.cox[[2]]
(true.index <- data.cox[[3]])
#> [1] 1 3 4 10
true.beta <- data.cox[[4]]
sgpv.2s.c <- pro.sgpv(x,y,stage=2,family="cox")
sgpv.2s.c
#> Selected variables are V1 V3 V4 V10
The two-stage algorithm successfully recovered the true support. We can also run the fast one-stage algorithm by setting stage to 1.
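A sketch of the one-stage call on the same simulated survival data; the object name sgpv.1s.c is chosen here for illustration:

```r
# one-stage ProSGPV on the simulated survival data
sgpv.1s.c <- pro.sgpv(x, y, stage = 1, family = "cox")
sgpv.1s.c
```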
It succeeded as well. summary, coef, and predict are also available for the one-stage ProSGPV object.
We can visualize the one-stage selection process by calling the plot function.
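A self-contained sketch: fit the one-stage model and pass the resulting object to plot (the object name is illustrative):

```r
# fit the one-stage model and visualize its selection process
sgpv.1s.c <- pro.sgpv(x, y, stage = 1, family = "cox")
plot(sgpv.1s.c)
```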
Effect estimates (in this case, log hazard ratios) and corresponding 95% confidence intervals are plotted for each variable in the full Cox model. Two vertical green bars indicate the null region. Only effects whose confidence bounds do not overlap the null region are kept. Selected variables are colored in blue.
When the design matrix has high within-correlation, or signals are known to be dense in the data, the ProSGPV algorithm yields a null bound that is slightly higher than the noise level, which hurts support recovery by missing some small true effects. One way to address this issue is to replace the constant null bound in ProSGPV with a generalized variance inflation factor (GVIF)-adjusted null bound; see Fox and Monette (1992) for details on how to calculate the GVIF for each variable in the design matrix. Essentially, we deflate the standard error of each variable's coefficient estimate by its GVIF, which produces a smaller null bound and thus includes more variables in the final selection set. This adjustment is found to be helpful when signals are dense, too.
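A sketch of how the adjustment might be requested, assuming pro.sgpv exposes a logical gvif flag for the GVIF-adjusted null bound (the argument name is an assumption; check the package documentation):

```r
# fit with a GVIF-adjusted (deflated) null bound
# the gvif argument name is assumed, not confirmed by this document
sgpv.gvif <- pro.sgpv(x, y, family = "cox", gvif = TRUE)
```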