ProSGPV is a package that performs variable selection with Second-Generation P-Values (SGPV). This document illustrates how ProSGPV performs variable selection with data from generalized linear models (GLM) and Cox proportional hazards models. Technical details about the algorithm can be found in Zuo, Stewart, and Blume (2021). We use logistic regression to illustrate how ProSGPV selects variables in a low-dimensional (n > p) real-world data example, and how to choose a model when each run gives slightly different results. We use Poisson regression to show how ProSGPV works with simulated high-dimensional Poisson data, together with a visualization tool for the selection process. We use the Cox proportional hazards model to show how a fast one-stage algorithm works with simulated data, together with a visualization tool for the one-stage ProSGPV.
To install the ProSGPV package from CRAN, you can do
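A minimal sketch of the CRAN installation, run once in an R session:

```r
# install the released version from CRAN
install.packages("ProSGPV")
```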
Alternatively, you can install a development version of ProSGPV by doing
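A sketch of the development installation, assuming the development repository is the zuoyi93/ProSGPV GitHub repository (adjust the account name if the repository lives elsewhere):

```r
# install.packages("remotes")  # if remotes is not yet installed
remotes::install_github("zuoyi93/ProSGPV")
```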
Once the package is installed, we can load it into the current environment.
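For example:

```r
# attach the package so that pro.sgpv, gen.sim.data, and spine are available
library(ProSGPV)
```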
Traditionally, maximum likelihood (ML) has been used to fit a logistic model. However, when the sample size is small or effect sizes are strong, complete or quasi-complete separation may occur, which can severely bias the parameter estimates (Albert and Anderson 1984; Rahman and Sultana 2017). Recently, Kosmidis and Firth (2021) proposed using a Jeffreys prior to address the separation issue in logistic regression, and that approach is implemented in the ProSGPV package.
In this section, we use a real-world data example to showcase how ProSGPV performs variable selection when the outcome is binary. Lower back pain can be caused by a variety of problems with any part of the complex, interconnected network of spinal muscles, nerves, bones, discs, or tendons in the lumbar spine. Dr. Henrique da Mota collected 12 biomechanical attributes from 310 patients, of whom 100 are normal and 210 are abnormal (disk hernia or spondylolisthesis). The goal is to differentiate the normal patients from the abnormal ones using those 12 variables. The biomechanical attributes include pelvic incidence, pelvic tilt, lumbar lordosis angle, sacral slope, pelvic radius, degree of spondylolisthesis, pelvic slope, direct tilt, thoracic slope, cervical tilt, sacrum angle, and scoliosis slope. The data set spine can be accessed in the ProSGPV package.
x <- spine[, -ncol(spine)]
y <- spine[, ncol(spine)]
sgpv.2s.l <- pro.sgpv(x, y, family = "binomial")
#> Warning: brglmFit: fitted probabilities numerically 0 or 1 occurred
sgpv.2s.l
#> Selected variables are sacral_slope pelvic_radius degree_spondylolisthesis
ProSGPV selects three of the 12 variables: sacral slope, pelvic radius, and degree of spondylolisthesis.
We can get a summary of the model that we selected.
summary(sgpv.2s.l)
#> Warning: brglmFit: fitted probabilities numerically 0 or 1 occurred
#>
#> Call:
#> glm(formula = Response ~ ., family = "binomial", data = data.d,
#> method = "brglmFit", type = "MPL_Jeffreys")
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.78904 -0.48146 0.04248 0.33705 2.26768
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 17.42056 3.15274 5.526 3.28e-08 ***
#> sacral_slope -0.11300 0.02214 -5.105 3.31e-07 ***
#> pelvic_radius -0.11716 0.02254 -5.197 2.03e-07 ***
#> degree_spondylolisthesis 0.15797 0.02162 7.306 2.75e-13 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 389.86 on 309 degrees of freedom
#> Residual deviance: 184.42 on 306 degrees of freedom
#> AIC: 192.42
#>
#> Type of estimator: MPL_Jeffreys (maximum penalized likelihood with Jeffreys'-prior penalty)
#> Number of Fisher Scoring iterations: 5
Coefficient estimates can be extracted by using the coef function.
coef(sgpv.2s.l)
#> Warning: brglmFit: fitted probabilities numerically 0 or 1 occurred
#> [1] 0.0000000 0.0000000 0.0000000 -0.1130012 -0.1171616 0.1579717
#> [7] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
Predictions on the probability scale can be made by using the predict function.
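A sketch of such a call, assuming predict for a ProSGPV fit forwards a type argument to the underlying GLM machinery in the way stats::predict.glm does (so that type = "response" returns fitted probabilities):

```r
# predicted probabilities of an abnormal outcome for the first few patients
# (the type argument is an assumption; check the package documentation)
head(predict(sgpv.2s.l, type = "response"))
```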
The gen.sim.data function is used to generate high-dimensional (p > n) simulation data from a Poisson distribution in this section. The gen.sim.data function has many arguments, as it can generate data with continuous, binary, count, and survival outcomes: n is the number of observations, p is the number of variables, s is the number of true signals, family can take the value of "gaussian", "binomial", "poisson", or "cox", beta.min is the smallest effect size in absolute value, beta.max is the largest effect size, rho is the autocorrelation coefficient in the design matrix, nu is the signal-to-noise ratio in the continuous-outcome case, sig is the standard deviation in the covariance matrix of the design matrix, intercept is used in the linear mean vector of the GLM, scale and shape are parameters of the Weibull survival time, and rateC is the rate of censoring.
In this section, we simulate 80 observations with 100 explanatory variables and one count outcome. The number of true signals is four. The smallest log rate ratio is 0.5 and the largest is 1.5. When p > n, only the two-stage ProSGPV is used, even if stage is set to 1.
set.seed(1)
data.log <- gen.sim.data(n=80,p=100,s=4,beta.min=0.5,beta.max=1.5,family="poisson")
x <- data.log[[1]]
y <- data.log[[2]]
(true.index <- data.log[[3]])
#> [1] 3 43 89 100
(true.beta <- data.log[[4]])
#> [1] 0.0000000 0.0000000 0.5000000 0.0000000 0.0000000 0.0000000
#> [7] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [13] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [19] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [25] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [31] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [37] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [43] 0.8333333 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [49] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [55] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [61] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [67] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [73] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [79] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [85] 0.0000000 0.0000000 0.0000000 0.0000000 -1.1666667 0.0000000
#> [91] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [97] 0.0000000 0.0000000 0.0000000 -1.5000000
sgpv.2s.p <- pro.sgpv(x,y,family="poisson")
sgpv.2s.p
#> Selected variables are V3 V43 V89 V100
We see that the two-stage ProSGPV recovered the true support.
Similarly, a summary of the final model is available by calling summary.
summary(sgpv.2s.p)
#>
#> Call:
#> glm(formula = Response ~ ., family = "poisson", data = data.d)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.15872 0.12192 -1.302 0.193
#> V3 0.59830 0.04679 12.786 <2e-16 ***
#> V43 0.89750 0.06422 13.975 <2e-16 ***
#> V89 -1.15970 0.05135 -22.583 <2e-16 ***
#> V100 -1.57528 0.08177 -19.265 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for poisson family taken to be 1)
#>
#> Null deviance: 1018.60 on 79 degrees of freedom
#> Residual deviance: 47.37 on 75 degrees of freedom
#> AIC: 243.92
#>
#> Number of Fisher Scoring iterations: 5
Coefficients can be extracted by coef. We see the estimates are quite close to the truth.
coef(sgpv.2s.p)
#> [1] 0.0000000 0.0000000 0.5982981 0.0000000 0.0000000 0.0000000
#> [7] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [13] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [19] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [25] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [31] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [37] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [43] 0.8974952 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [49] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [55] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [61] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [67] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [73] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [79] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [85] 0.0000000 0.0000000 0.0000000 0.0000000 -1.1597040 0.0000000
#> [91] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [97] 0.0000000 0.0000000 0.0000000 -1.5752756
In-sample prediction can be done as follows.
head(predict(sgpv.2s.p))
#> 1 2 3 4 5 6
#> 0.09056540 1.52559015 0.09082897 2.36900015 3.64998025 2.35678653
Out-of-sample prediction is also available by feeding data into the newdata argument.
set.seed(1)
data.log <- gen.sim.data(n=10,p=100,s=4,beta.min=0.5,beta.max=1.5,family="poisson")
x.new <- data.log[[1]]
y.new <- data.log[[2]]
data.frame(Observed=y.new,Predicted=predict(sgpv.2s.p,newdata=x.new))
#> Observed Predicted
#> 1 1 0.2511760
#> 2 0 0.7257742
#> 3 0 0.2668769
#> 4 7 6.4782805
#> 5 9 10.2552082
#> 6 0 0.2627837
#> 7 1 1.0470441
#> 8 0 0.4297897
#> 9 4 1.2498434
#> 10 7 5.5053366
The predictions agree well with the observed counts.
The selection process can be visualized below. lambda.max sets the limit of the X-axis. Only effects whose lower bound does not reach the null region are kept. Selected variables are labeled in blue on the Y-axis. The vertical dotted line is the λ selected by the generalized information criterion (Fan and Tang 2013).
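A sketch of such a call; the lambda.max value here is purely illustrative:

```r
# plot the two-stage selection process; lambda.max truncates the X-axis
plot(sgpv.2s.p, lambda.max = 0.01)
```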
You can also choose to view only one line per variable by setting lpv to 1. That line is the 95% confidence bound that is closer to the null.
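For instance:

```r
# show one confidence bound per variable: the bound closer to the null
plot(sgpv.2s.p, lpv = 1)
```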
In this section, we use simulated low-dimensional survival data to illustrate how the one-stage and two-stage ProSGPV select variables, with additional visualization of the one-stage algorithm's selection process.
The gen.sim.data function is used to simulate 100 observations.
set.seed(1)
data.cox <- gen.sim.data(n=100, p=20, s=4, family="cox",
beta.min=0.5, beta.max=1.5, sig=2)
x <- data.cox[[1]]
y <- data.cox[[2]]
(true.index <- data.cox[[3]])
#> [1] 1 3 4 10
true.beta <- data.cox[[4]]
sgpv.2s.c <- pro.sgpv(x,y,stage=2,family="cox")
sgpv.2s.c
#> Selected variables are V1 V3 V4 V10
The two-stage algorithm successfully recovered the true support. We can also run the fast one-stage algorithm by setting stage to 1.
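A sketch of the one-stage call on the same simulated survival data; the object name sgpv.1s.c is chosen here for illustration:

```r
# one-stage ProSGPV on the simulated survival data
sgpv.1s.c <- pro.sgpv(x, y, stage = 1, family = "cox")
sgpv.1s.c
```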
It succeeded as well. summary, coef, and predict are also available for the one-stage ProSGPV object.
We can visualize the one-stage selection process by calling the plot function.
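A self-contained sketch: fit the one-stage model and pass the resulting object to plot (the object name is illustrative):

```r
# fit the one-stage model and visualize its selection process
sgpv.1s.c <- pro.sgpv(x, y, stage = 1, family = "cox")
plot(sgpv.1s.c)
```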
Effect estimates (in this case, log hazard ratios) and corresponding 95% confidence intervals are plotted for each variable in the full Cox model. Two vertical green bars indicate the null region. Only effects whose confidence bounds do not overlap the null region are kept. Selected variables are colored in blue.
When the design matrix has high within-correlation, or signals are known to be dense in the data, the ProSGPV algorithm yields a null bound that is slightly higher than the noise level, which hurts support recovery by missing some small true effects. One way to address this issue is to replace the constant null bound in ProSGPV with a generalized variance inflation factor (GVIF)-adjusted null bound; see Fox and Monette (1992) for details on how to calculate the GVIF for each variable in the design matrix. Essentially, we deflate the standard error of each variable's coefficient estimate by its GVIF, which produces a smaller null bound and thus includes more variables in the final selection set. This adjustment is found to be helpful when signals are dense, too.
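A sketch of how the adjustment might be requested, assuming pro.sgpv exposes a logical gvif flag for the GVIF-adjusted null bound (the argument name is an assumption; check the package documentation):

```r
# fit with a GVIF-adjusted (deflated) null bound
# the gvif argument name is assumed, not confirmed by this document
sgpv.gvif <- pro.sgpv(x, y, family = "cox", gvif = TRUE)
```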