Document Type : Original Article
Department of Biostatistics, Faculty of Medical Sciences, Tarbiat Modares University, Tehran, Iran
Division of Clinical Geriatrics, Department of Neurobiology, Care Sciences and Society, Karolinska Institute, Stockholm, Sweden
HPGC Research Group, Medical Biotechnology Department, Biotechnology Research Center, Pasteur Institute of Iran, Tehran, Iran
Department of Epidemiology and Biostatistics, School of Public Health, Tehran University of Medical Sciences, Tehran, Iran
Objective: This study aimed to introduce novel techniques for identifying the genes associated with developing
chronic obstructive pulmonary disease (COPD) and to prioritize COPD candidate genes using regression methods.
Materials and Methods: This is a secondary analysis of data from an experimental study. We used penalized
logistic regression with three types of penalty: least absolute shrinkage and selection operator
(LASSO), minimax concave penalty (MCP), and smoothly clipped absolute deviation (SCAD). The models were
trained using genome-wide expression profiling to define gene networks relevant to the COPD stages. A 10-fold
cross-validation scheme was used to evaluate the performance of the methods. In addition, we validated our
results using an external validation approach. We reported the sensitivity, specificity, and area under the curve (AUC) of
each model.
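The evaluation scheme described above can be illustrated with a minimal sketch. It is not the authors' implementation: it fits only the LASSO variant, by proximal gradient descent with soft-thresholding, on synthetic data generated in the script. The function names, the penalty strength `lam`, the step size, and the 0.5 classification cutoff are all assumptions made for illustration. MCP and SCAD would replace the soft-thresholding step with their own thresholding rules, which are not shown here.

```python
import math
import random

def soft_threshold(z, t):
    """Proximal operator of the L1 (LASSO) penalty."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def predict_proba(X, b0, beta):
    return [sigmoid(b0 + sum(b * x for b, x in zip(beta, row))) for row in X]

def fit_lasso_logistic(X, y, lam, step=0.5, n_iter=150):
    """L1-penalized logistic regression via proximal gradient descent."""
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        resid = [pi - yi for pi, yi in zip(predict_proba(X, b0, beta), y)]
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(p)]
        b0 -= step * sum(resid) / n  # the intercept is not penalized
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return b0, beta

def auc(y, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, yi in zip(scores, y) if yi == 1]
    neg = [s for s, yi in zip(scores, y) if yi == 0]
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))

def sens_spec(y, scores, cutoff=0.5):
    tp = sum(1 for s, yi in zip(scores, y) if yi == 1 and s >= cutoff)
    tn = sum(1 for s, yi in zip(scores, y) if yi == 0 and s < cutoff)
    n_pos = sum(y)
    return tp / n_pos, tn / (len(y) - n_pos)

def cross_validate(X, y, lam, k=10, seed=0):
    """k-fold CV: every sample is scored by a model that never saw it."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    scores = [0.0] * len(y)
    for f in range(k):
        test = set(idx[f::k])
        train = [i for i in idx if i not in test]
        b0, beta = fit_lasso_logistic([X[i] for i in train],
                                      [y[i] for i in train], lam)
        for i in test:
            scores[i] = predict_proba([X[i]], b0, beta)[0]
    sens, spec = sens_spec(y, scores)
    return sens, spec, auc(y, scores)

# Synthetic stand-in for expression data: 100 samples, 8 features,
# of which only the first two influence the outcome.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(100)]
y = [1 if rng.random() < sigmoid(2 * row[0] - 2 * row[1]) else 0 for row in X]

sens, spec, auc_cv = cross_validate(X, y, lam=0.05)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} AUC={auc_cv:.3f}")
```

Pooling the out-of-fold predictions before computing the metrics is one common choice; averaging per-fold metrics is another, and the abstract does not say which was used.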
Results: There were 21, 22, and 18 significantly associated genes for the LASSO, SCAD, and MCP models, respectively.
The most statistically conservative method (the one detecting the fewest significant features) was MCP, which detected
18 genes, all of which were also detected by the other two approaches. The most appropriate approach was SCAD-penalized
logistic regression (AUC = 96.26, sensitivity = 94.2, specificity = 86.96). All three models shared a common panel of
18 genes showing significant positive or negative associations with COPD, of which RNF130, STX6, PLCB1,
CACNA1G, LARP4B, LOC100507634, SLC38A2, and STIM2 had an odds ratio (OR) greater than 1. Overall, there
were only slight differences among the penalized methods.
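For readers less familiar with how an odds ratio is read off a logistic model, the sketch below shows the standard relationship OR = exp(coefficient). The gene labels and coefficient values are hypothetical placeholders, not results from this study.

```python
import math

def odds_ratio(beta):
    """Odds ratio per one-unit increase in a predictor of a logistic model."""
    return math.exp(beta)

def direction(beta):
    # OR > 1: higher expression goes with higher odds of the outcome;
    # OR < 1: higher expression goes with lower odds.
    if beta > 0:
        return "positive association (OR > 1)"
    if beta < 0:
        return "negative association (OR < 1)"
    return "not selected (OR = 1)"

# Hypothetical coefficients for illustration only.
coefs = {"GENE_A": 0.40, "GENE_B": -0.25, "GENE_C": 0.0}
for gene, b in coefs.items():
    print(f"{gene}: OR = {odds_ratio(b):.2f}, {direction(b)}")
```

A coefficient shrunk exactly to zero by the penalty corresponds to an OR of 1, which is how these methods drop a gene from the panel.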
Conclusion: Regularization addresses the severe dimensionality problem that arises when this kind of regression is
applied to genome-wide expression data. It also allows the effects of these genes on the outcome, and the underlying
mechanisms, to be explored more quickly. The regression-based approaches presented here could be applied to overcome this issue.
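For reference, the three penalties can be written in their standard textbook forms. The notation here is an assumption, not taken from the article: $\ell$ is the logistic log-likelihood, $n$ the number of samples, $p$ the number of genes, $\lambda > 0$ the penalty strength, and $\gamma$ the concavity parameter of MCP and SCAD.

```latex
% Penalized logistic regression objective
\hat{\beta} = \arg\min_{\beta_0,\,\beta}
  \left\{ -\frac{1}{n}\,\ell(\beta_0, \beta)
          + \sum_{j=1}^{p} p_{\lambda}\bigl(|\beta_j|\bigr) \right\}

% LASSO
p_{\lambda}(t) = \lambda t

% MCP, with \gamma > 1
p_{\lambda,\gamma}(t) =
\begin{cases}
  \lambda t - \dfrac{t^{2}}{2\gamma}, & t \le \gamma\lambda, \\[6pt]
  \dfrac{\gamma\lambda^{2}}{2},       & t > \gamma\lambda
\end{cases}

% SCAD, with \gamma > 2
p_{\lambda,\gamma}(t) =
\begin{cases}
  \lambda t, & t \le \lambda, \\[4pt]
  \dfrac{2\gamma\lambda t - t^{2} - \lambda^{2}}{2(\gamma - 1)},
             & \lambda < t \le \gamma\lambda, \\[8pt]
  \dfrac{\lambda^{2}(\gamma + 1)}{2}, & t > \gamma\lambda
\end{cases}
```

All three penalties set small coefficients exactly to zero, which is what makes estimation feasible when genes far outnumber samples. MCP and SCAD flatten out for large $t$, so they shrink strong effects less than LASSO does.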