TY - JOUR
T1 - Machine Learning Classifiers for Endometriosis Using Transcriptomics and Methylomics Data
AU - Akter, Sadia
AU - Xu, Dong
AU - Nagel, Susan C.
AU - Bromfield, John J.
AU - Pelch, Katherine
AU - Wilshire, Gilbert B.
AU - Joshi, Trupti
N1 - Funding Information:
The biological samples were collected in collaboration with the Women’s and Children’s Hospital, University of Missouri, Boone Hospital, Columbia, MO, and University of California, San Francisco. Some tissue samples were provided by the NIH SCCPIR Human Endometrial Tissue Bank and DNA Bank at UCSF, funded under NIH HD055764-06. We are thankful to Stacy Syrcle, Sarah Crowder, Danny J. Schust, and Breton Barrier for their help with sample collection. An earlier work that evaluated the performance of classification models using transcriptomics data (Akter et al., 2018) was published at the 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).
Publisher Copyright:
Copyright © 2019 Akter, Xu, Nagel, Bromfield, Pelch, Wilshire and Joshi.
PY - 2019/9/4
Y1 - 2019/9/4
AB - Endometriosis is a common, complex, and poorly understood gynecological disorder that affects about 176 million women worldwide, significantly impairing their quality of life and imposing an economic burden. Neither a definitive clinical symptom nor a minimally invasive diagnostic method is available, leading to an average diagnostic latency of 4 to 11 years. Discovery of relevant biological patterns from microarray expression or next generation sequencing (NGS) data has been advanced over the last several decades by applying various machine learning tools. We performed machine learning analysis using 38 RNA-seq and 80 enrichment-based DNA methylation (MBD-seq) datasets. We evaluated how well various supervised machine learning methods, such as decision tree, partial least squares discriminant analysis (PLSDA), support vector machine, and random forest, classify endometriosis samples from control samples when trained on both transcriptomics and methylomics data. The assessment considered two ways of improving classification performance: a) the effect of three different normalization techniques and b) the effect of differential analysis using the generalized linear model (GLM). Several candidate biomarker genes were identified by multiple machine learning experiments, including NOTCH3, SNAPC2, B4GALNT1, SMAP2, DDB2, GTF3C5, and PTOV1 from the transcriptomics data analysis and TRPM6, RASSF2, TNIP2, RP3-522J7.6, FGD3, and MFSD14B from the methylomics data analysis. We concluded that an appropriate machine learning diagnostic pipeline for endometriosis should use TMM normalization for transcriptomics data, quantile or voom normalization for methylomics data, and GLM for feature space reduction and classification performance maximization.
KW - DNA methylation
KW - RNA-seq
KW - classification
KW - endometriosis
KW - machine learning
KW - methylomics
KW - transcriptomics
KW - translational bioinformatics
UR - http://www.scopus.com/inward/record.url?scp=85072829656&partnerID=8YFLogxK
U2 - 10.3389/fgene.2019.00766
DO - 10.3389/fgene.2019.00766
M3 - Article
AN - SCOPUS:85072829656
SN - 1664-8021
VL - 10
JO - Frontiers in Genetics
JF - Frontiers in Genetics
M1 - 766
ER -