Title: | Penalized Regression with Second-Generation P-Values |
---|---|
Description: | Implementation of penalized regression with second-generation p-values for variable selection. The algorithm can handle linear regression, GLM, and Cox regression. S3 methods print(), summary(), coef(), predict(), and plot() are available for the algorithm. Technical details can be found at Zuo et al. (2021) <doi:10.1080/00031305.2021.1946150>. |
Authors: | Yi Zuo [aut, cre] |
Maintainer: | Yi Zuo <[email protected]> |
License: | GPL-3 |
Version: | 1.0.1 |
Built: | 2025-01-25 03:33:43 UTC |
Source: | https://github.com/zuoyi93/prosgpv |
coef.sgpv
: Extract coefficients from the model fitS3 method coef
for an S3 object of class sgpv
## S3 method for class 'sgpv' coef(object, ...)
## S3 method for class 'sgpv' coef(object, ...)
object |
An |
... |
Other |
Coefficients in the OLS model
# prepare the data x <- t.housing[, -ncol(t.housing)] y <- t.housing$V9 # run one-stage algorithm out.sgpv <- pro.sgpv(x = x, y = y) # get coefficients coef(out.sgpv)
# prepare the data x <- t.housing[, -ncol(t.housing)] y <- t.housing$V9 # run one-stage algorithm out.sgpv <- pro.sgpv(x = x, y = y) # get coefficients coef(out.sgpv)
gen.sim.data
: Generate simulation dataThis function can be used to generate autoregressive simulation data
gen.sim.data( n = 100, p = 50, s = 10, family = c("gaussian", "binomial", "poisson", "cox"), beta.min = 1, beta.max = 5, rho = 0, nu = 2, sig = 1, intercept = 0, scale = 2, shape = 1, rateC = 0.2 )
gen.sim.data( n = 100, p = 50, s = 10, family = c("gaussian", "binomial", "poisson", "cox"), beta.min = 1, beta.max = 5, rho = 0, nu = 2, sig = 1, intercept = 0, scale = 2, shape = 1, rateC = 0.2 )
n |
Number of observations. Default is 100. |
p |
Number of explanatory variables. Default is 50. |
s |
Number of true signals. It can only be an even number. Default is 10. |
family |
A description of the error distribution and link function to be
used in the model. It can take the value of |
beta.min |
The smallest effect size in absolute value. Default is 1. |
beta.max |
The largest effect size in absolute value. Default is 5. |
rho |
Autocorrelation level. A numerical value between -1 and 1. Default is 0. |
nu |
Signal to noise ratio in linear regression. Default is 2. |
sig |
Standard deviation in the design matrix. Default is 1. |
intercept |
Intercept of the linear predictor in the GLM. Default is 0. |
scale |
Scale parameter in the Weibull distribution. Default is 2. |
shape |
Shape parameter in the Weibull distribution. Default is 1. |
rateC |
Rate of censoring in the survival data. Default is 0.2. |
A list of following components:
The generated explanatory variable matrix
A vector of outcome. If family
is \code{cox}
, a two-column
object is returned where the first column is the time and the second column is
status (0 is censoring and 1 is event)
The indices of true signals
The true coefficient vector of length p
# generate data for linear regression data.linear <- gen.sim.data(n = 20, p = 10, s = 4) # extract x x <- data.linear[[1]] # extract y y <- data.linear[[2]] # extract the indices of true signals index <- data.linear[[3]] # extract the true coefficient vector true.beta <- data.linear[[4]] # generate data for logistic regression data.logistic <- gen.sim.data(n = 20, p = 10, s = 4, family = "binomial") # extract x x <- data.logistic[[1]] # extract y y <- data.logistic[[2]] # extract the indices of true signals index <- data.logistic[[3]] # extract the true coefficient vector true.beta <- data.logistic[[4]]
# generate data for linear regression data.linear <- gen.sim.data(n = 20, p = 10, s = 4) # extract x x <- data.linear[[1]] # extract y y <- data.linear[[2]] # extract the indices of true signals index <- data.linear[[3]] # extract the true coefficient vector true.beta <- data.linear[[4]] # generate data for logistic regression data.logistic <- gen.sim.data(n = 20, p = 10, s = 4, family = "binomial") # extract x x <- data.logistic[[1]] # extract y y <- data.logistic[[2]] # extract the indices of true signals index <- data.logistic[[3]] # extract the true coefficient vector true.beta <- data.logistic[[4]]
get.candidate
: Get candidate setGet the indices of the candidate set in the first stage
get.candidate(xs, ys, family)
get.candidate(xs, ys, family)
xs |
Standardized independent variables |
ys |
Standardized dependent variable |
family |
A description of the error distribution and link function to be
used in the model. It can take the value of |
A list of following components:
A vector of indices of selected variables in the candidate set
The lambda
selected by generalized information criterion
get.coef
: Get coefficients at each lambda
Get the coefficients and confidence intervals from regression at each lambda
as well as the null bound in SGPVs
get.coef(xs, ys, lambda, lasso, family)
get.coef(xs, ys, lambda, lasso, family)
xs |
Standardized design matrix |
ys |
Standardized outcome |
lambda |
|
lasso |
An |
family |
A description of the error distribution and link function to be
used in the model. It can take the value of |
A vector that contains the point estimates, confidence intervals and the null bound
get.var
: Get indicesGet the indices of the variables selected by the algorithm
get.var(candidate.index, xs, ys, family, gvif)
get.var(candidate.index, xs, ys, family, gvif)
candidate.index |
Indices of the candidate set |
xs |
Standardized independent variables |
ys |
Standardized dependent variable |
family |
A description of the error distribution and link function to be
used in the model. It can take the value of |
gvif |
A logical operator indicating whether a generalized variance inflation factor-adjusted null bound is used. Default is FALSE. |
A list of following components:
A vector of indices of selected variables
Null bound in the SGPV screening
Point estimates in the candidate set
Lower bounds of effect estimates in the candidate set
Upper bounds of effect estimates in the candidate set
gvif
: Get GVIF for each variableGet generalized variance inflation factor (GVIF) for each variable. See Fox (1992) doi:10.1080/01621459.1992.10475190 for more details on how to calculate GVIF.
gvif(mod, family)
gvif(mod, family)
mod |
A model object with at least two explanatory variables |
family |
A description of the error distribution and link function to be
used in the model. It can take the value of |
A vector of GVIF for each variable in the model
plot.sgpv
: Plot variable selection resultsS3 method plot
for an object of class sgpv
. When the two-stage
algorithm is used, this function plots the fully relaxed lasso solution path on
the standardized scale and the final variable selection results. When the
one-stage algorithm is used, a histogram of all coefficients with selected effects
is shown.
## S3 method for class 'sgpv' plot(x, lpv = 3, lambda.max = NULL, short.label = T, ...)
## S3 method for class 'sgpv' plot(x, lpv = 3, lambda.max = NULL, short.label = T, ...)
x |
An |
lpv |
Lines per variable. It can take the value of 1 meaning that only the bound that is closest to the null will be plotted, or the value of 3 meaning that point estimates as well as 95% confidence interval will be plotted. Default is 3. |
lambda.max |
The maximum lambda on the plot. Default is |
short.label |
An indicator if a short label is used for each variable for
better visualization. Default is |
... |
Other |
# prepare the data x <- t.housing[, -ncol(t.housing)] y <- t.housing$V9 # one-stage algorithm out.sgpv.1 <- pro.sgpv(x = x, y = y, stage = 1) # plot the selection result plot(out.sgpv.1) # two-stage algorithm out.sgpv.2 <- pro.sgpv(x = x, y = y) # plot the fully relaxed lasso solution path and final solution plot(out.sgpv.2) # zoom in a little bit plot(out.sgpv.2, lambda.max = 0.01) # only plot one confidence bound plot(out.sgpv.2, lpv = 1, lambda.max = 0.01)
# prepare the data x <- t.housing[, -ncol(t.housing)] y <- t.housing$V9 # one-stage algorithm out.sgpv.1 <- pro.sgpv(x = x, y = y, stage = 1) # plot the selection result plot(out.sgpv.1) # two-stage algorithm out.sgpv.2 <- pro.sgpv(x = x, y = y) # plot the fully relaxed lasso solution path and final solution plot(out.sgpv.2) # zoom in a little bit plot(out.sgpv.2, lambda.max = 0.01) # only plot one confidence bound plot(out.sgpv.2, lpv = 1, lambda.max = 0.01)
predict.sgpv
: Prediction using the fitted modelS3 method predict
for an object of class sgpv
## S3 method for class 'sgpv' predict(object, newdata, type, ...)
## S3 method for class 'sgpv' predict(object, newdata, type, ...)
object |
An |
newdata |
Prediction data set |
type |
The type of prediction required. Can take the value of |
... |
Other |
Predicted values
# prepare the data x <- t.housing[, -ncol(t.housing)] y <- t.housing$V9 # run one-stage algorithm out.sgpv <- pro.sgpv(x = x, y = y) predict(out.sgpv)
# prepare the data x <- t.housing[, -ncol(t.housing)] y <- t.housing$V9 # run one-stage algorithm out.sgpv <- pro.sgpv(x = x, y = y) predict(out.sgpv)
print.sgpv
: Print variable selection resultsS3 method print
for an S3 object of class sgpv
## S3 method for class 'sgpv' print(x, ...)
## S3 method for class 'sgpv' print(x, ...)
x |
An |
... |
Other |
Variable selection results
# prepare the data x <- t.housing[, -ncol(t.housing)] y <- t.housing$V9 # run one-stage algorithm out.sgpv.1 <- pro.sgpv(x = x, y = y, stage = 1) out.sgpv.1
# prepare the data x <- t.housing[, -ncol(t.housing)] y <- t.housing$V9 # run one-stage algorithm out.sgpv.1 <- pro.sgpv(x = x, y = y, stage = 1) out.sgpv.1
pro.sgpv
functionThis function outputs the variable selection results from either one-stage algorithm or two-stage algorithm.
pro.sgpv( x, y, stage = c(1, 2), family = c("gaussian", "binomial", "poisson", "cox"), gvif = F )
pro.sgpv( x, y, stage = c(1, 2), family = c("gaussian", "binomial", "poisson", "cox"), gvif = F )
x |
Independent variables, can be a |
y |
Dependent variable, can be a |
stage |
Algorithm indicator. 1 denotes the one-stage algorithm and
2 denotes the two-stage algorithm. Default is 2. When |
family |
A description of the error distribution and link function to be
used in the model. It can take the value of |
gvif |
A logical operator indicating whether a generalized variance inflation factor-adjusted null bound is used. Default is FALSE. See Fox (1992) doi:10.1080/01621459.1992.10475190 for more details on how to calculate GVIF |
A list of following components:
A vector of indices of selected variables
A vector of labels of selected variables
lambda
selected by generalized information criterion in the two-stage algorithm. NULL
for the one-stage algorithm
Input data x
Input data y
family
from the input
stage
from the input
Null bound in the SGPV screening
Point estimates in the candidate set
Lower bounds of CI in the candidate set
Upper bounds of CI in the candidate set
print.sgpv()
prints the variable selection results
coef.sgpv()
extracts coefficient estimates
summary.sgpv()
summarizes the OLS outputs
predict.sgpv()
predicts the outcome
plot.sgpv()
plots variable selection results
# prepare the data x <- t.housing[, -ncol(t.housing)] y <- t.housing$V9 # run ProSGPV in linear regression out.sgpv <- pro.sgpv(x = x, y = y) # More examples at https://github.com/zuoyi93/ProSGPV/tree/master/vignettes
# prepare the data x <- t.housing[, -ncol(t.housing)] y <- t.housing$V9 # run ProSGPV in linear regression out.sgpv <- pro.sgpv(x = x, y = y) # More examples at https://github.com/zuoyi93/ProSGPV/tree/master/vignettes
Lower back pain can be caused by a variety of problems with any parts of the complex, interconnected network of spinal muscles, nerves, bones, discs or tendons in the lumbar spine. This dataset contains 12 biomechanical attributes from 310 patients, of whom 100 are normal and 210 are abnormal (Disk Hernia or Spondylolisthesis). The goal is to differentiate the normal patients from the abnormal using those 12 variables.
spine
spine
pelvic incidence
pelvic tilt
lumbar lordosis angle
sacral slope
pelvic radius
degree of spondylolisthesis
pelvic slope
direct tilt
thoracic slope
cervical tilt
sacrum angle
scoliosis slope
1 is abnormal (Disk Hernia or Spondylolisthesis) and 0 is normal
http://archive.ics.uci.edu/ml/datasets/vertebral+column
summary.sgpv
: Summary of the final modelS3 method summary
for an S3 object of class sgpv
## S3 method for class 'sgpv' summary(object, ...)
## S3 method for class 'sgpv' summary(object, ...)
object |
An |
... |
Other arguments |
Summary of a model
# prepare the data x <- t.housing[, -ncol(t.housing)] y <- t.housing$V9 # run one-stage algorithm out.sgpv <- pro.sgpv(x = x, y = y) # get regression summary summary(out.sgpv)
# prepare the data x <- t.housing[, -ncol(t.housing)] y <- t.housing$V9 # run one-stage algorithm out.sgpv <- pro.sgpv(x = x, y = y) # get regression summary summary(out.sgpv)
A dataset containing Tehran housing data. The data set has 372 observations. There are 26 explanatory variables at baseline, including 7 project physical and financial features (V2-V8) and 19 economic variables and indices (V11-V29). The outcome (V9) is the sales price of a real estate single-family residential apartment.
t.housing
t.housing
Actual sales price
Total floor area of the building
Lot area
Total Preliminary estimated construction cost based on the prices at the beginning of the project
Preliminary estimated construction cost based on the prices at the beginning of the project
Equivalent preliminary estimated construction cost based on the prices at the beginning of the project in a selected base year
Duration of construction
Price of the unit at the beginning of the project per square meter
The number of building permits issued
Building services index for preselected base year
Wholesale price index of building materials for the base year
Total floor areas of building permits issued by the city/municipality
Cumulative liquidity
Private sector investment in new buildings
Land price index for the base year
The number of loans extended by banks in a time resolution
The amount of loans extended by banks in a time resolution
The interest rate for loan in a time resolution
The average construction cost by private sector at the completion of construction
The average cost of buildings by private sector at the beginning of construction
Official exchange rate with respect to dollars
Nonofficial (street market) exchange rate with respect to dollars
Consumer price index (CPI) in the base year
CPI of housing, water, fuel & power in the base year
Stock market index
Population of the city
Gold price per ounce
http://archive.ics.uci.edu/ml/datasets/Residential+Building+Data+Set