The Performance Evaluation of The Random Forest Algorithm for A Gene Selection in Identifying Genes Associated with Resectable Pancreatic Cancer in Microarray Dataset: A Retrospective Study

Document Type: Original Article


1 Department of Biostatistics, School of Public Health, Hamadan University of Medical Sciences, Hamadan, Iran

2 Modeling of Noncommunicable Diseases Research Center, Hamadan University of Medical Sciences, Hamadan, Iran

3 Research Center for Health Sciences, Hamadan University of Medical Sciences, Hamadan, Iran

4 Department of Molecular Medicine and Genetics, School of Medicine, Hamadan University of Medical Sciences, Hamadan, Iran


Objective: Microarray datasets measure hundreds to thousands of genes in a small number of samples, and
problems during the experiment sometimes cause the expression values of some genes to be recorded as
missing. Determining which of so many genes cause a disease or cancer is a difficult task. This
study aimed to identify genes associated with pancreatic cancer (PC). First, the K-nearest neighbor (KNN) imputation
method was used to handle missing values (MVs) in gene expression. Then, the random forest algorithm was
used to identify the genes associated with PC.
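The KNN imputation step can be illustrated with a minimal sketch. It assumes a genes-by-samples matrix with missing entries stored as `None`, a root-mean-square distance computed over the samples observed in both genes, and an unweighted mean of the k nearest genes. The function name `knn_impute`, the toy matrix, and `k=2` are hypothetical; the study's actual implementation and settings are not given here.

```python
import math

def knn_impute(matrix, k=2):
    """Return a copy of a genes-by-samples matrix with None entries filled in."""
    filled = [row[:] for row in matrix]
    for g, row in enumerate(matrix):
        for s, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for h, other in enumerate(matrix):
                # A neighbor gene must itself be observed in sample s.
                if h == g or other[s] is None:
                    continue
                # Compare the two genes only on samples observed in both.
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[s]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                # Assumption: unweighted mean of the k nearest genes.
                filled[g][s] = sum(nearest) / len(nearest)
    return filled

# Toy data (hypothetical): 4 genes, 3 samples, one missing value.
expression = [
    [1.0, 2.0, None],  # gene A, missing in sample 3
    [1.1, 2.1, 3.0],   # gene B, close to A
    [0.9, 1.9, 3.2],   # gene C, close to A
    [9.0, 8.0, 7.0],   # gene D, far from A
]
imputed = knn_impute(expression, k=2)
print(imputed[0])
```

In this toy case the missing value of gene A is filled with the mean of genes B and C in the same sample, because those two genes are closest to A on the samples where A was observed.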
Materials and Methods: In this retrospective study, 24 samples from the GSE14245 dataset were examined. Twelve
samples were from patients with PC, and 12 were from healthy controls. After preprocessing and applying the
fold-change technique, 29482 genes were used. We used the KNN imputation method to impute values for any
gene that had MVs. Then, the genes most strongly associated with PC were selected using the random forest algorithm. We
classified the samples using support vector machine (SVM) and naïve Bayes (NB) classifiers and reported the F-score
and Jaccard index.
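As an illustration of the fold-change technique, the sketch below assumes that fold change is the ratio of the mean expression in cases to the mean expression in controls on the raw (positive, non-log) scale, treated symmetrically so that up- and down-regulated genes are both captured. This definition, the function names, the gene names, and the toy values are assumptions for illustration only.

```python
def fold_change(case_values, control_values):
    """Symmetric fold change between two groups of positive intensities."""
    case_mean = sum(case_values) / len(case_values)
    control_mean = sum(control_values) / len(control_values)
    ratio = case_mean / control_mean
    # Report down-regulation as a value above 1 as well (assumption).
    return ratio if ratio >= 1 else 1 / ratio

def select_genes(expression, n_cases, threshold=3.0):
    """Keep genes whose fold change exceeds the threshold.

    Each value list holds the case samples first, then the controls.
    """
    selected = []
    for gene, values in expression.items():
        if fold_change(values[:n_cases], values[n_cases:]) > threshold:
            selected.append(gene)
    return selected

# Toy data (hypothetical): 2 cases followed by 2 controls per gene.
toy = {
    "gene_up": [8.0, 10.0, 2.0, 2.0],    # means 9 vs 2
    "gene_flat": [3.0, 3.0, 2.0, 4.0],   # means 3 vs 3
    "gene_down": [1.0, 1.0, 4.0, 6.0],   # means 1 vs 5
}
selected = select_genes(toy, n_cases=2)
print(selected)
```

Only the genes that pass this filter would then be handed to the random forest for importance ranking, which keeps the number of candidate features manageable for a 24-sample dataset.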
Results: Of the 29482 genes, 1185 with fold changes greater than 3 were selected. From these, the random forest
algorithm identified the 21 genes with the highest importance values. Among these 21 genes, S100P and GPX3 had the
highest and lowest importance values, respectively. The F-score and Jaccard values of the SVM and NB classifiers
were 95.5, 93, 92, and 92 percent, respectively.
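For readers unfamiliar with the two reported metrics, the following sketch computes the F-score and the Jaccard index for the positive class from true and predicted labels. It assumes the balanced F1 form of the F-score and binary labels with 1 marking a PC sample; the labels below are invented and do not reproduce the study's predictions or its averaging scheme.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, fp, fn

def f_score(y_true, y_pred):
    """F1 score: harmonic mean of precision and recall, 2TP / (2TP + FP + FN)."""
    tp, fp, fn = confusion_counts(y_true, y_pred)
    return 2 * tp / (2 * tp + fp + fn)

def jaccard_index(y_true, y_pred):
    """Jaccard index for the positive class, TP / (TP + FP + FN)."""
    tp, fp, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fp + fn)

# Hypothetical labels: 12 patients (1) and 12 controls (0),
# with one missed patient and one false alarm.
y_true = [1] * 12 + [0] * 12
y_pred = [1] * 11 + [0] + [0] * 11 + [1]

f1 = f_score(y_true, y_pred)
jac = jaccard_index(y_true, y_pred)
print(round(f1, 3), round(jac, 3))
```

Because both metrics are built from the same three counts, the Jaccard index never exceeds the F-score for the same predictions.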
Conclusion: By combining the fold-change technique, an imputation method, and the random forest algorithm, this
study identified strongly associated genes that many previous studies did not report. We therefore
suggest that researchers use the random forest algorithm to detect genes related to the disease of interest.

