Medical Science Monitor


24 May 2022: Database Analysis  

Identification of Key Genes and Key Pathways in Breast Cancer Based on Machine Learning

Shurui Bao1ABCE, Guijin He1F*

DOI: 10.12659/MSM.935515

Med Sci Monit 2022; 28:e935515



BACKGROUND: Breast cancer is one of the most common malignant tumors among women worldwide. This study aimed to screen key genes and pathways for breast cancer diagnosis and treatment.

MATERIAL AND METHODS: We obtained public data from the NCBI GEO database. The data were divided into a control group (normal breast tissue) and a treatment group (breast cancer tissue). We screened 32 differentially expressed genes (DEGs) between normal breast and cancerous tissues and used GO analysis and GSEA to identify the key pathways. We then combined LASSO and SVM-RFE analyses to screen key genes, and used CIBERSORT to obtain the proportion of 22 types of immune cells. The relationships between key genes and immune-infiltrating cells were further explored.

RESULTS: We screened 32 DEGs from the 2 groups, including 27 downregulated genes and 5 upregulated genes. GO analysis indicated that the DEGs were mainly correlated with collagen-containing extracellular matrix (ECM), Wnt signaling pathway, and glycosaminoglycan binding. GSEA indicated that the treatment group was correlated with chromosome segregation and cell cycle while the control group was correlated with cornification, intermediate filament, and nuclear transcription. Through machine learning, SYNM, TGFBR3, and COL10A1 were screened as key genes. Numbers of CD8 T cells, gamma delta T cells, and M1 macrophages were significantly higher, while monocytes and follicular helper-T cells were significantly lower in the treatment group. The downregulated genes, SYNM and TGFBR3, were positively correlated with CD8 T cells and monocytes, but were negatively correlated with gamma delta T cells and M1 macrophages. The upregulated gene, COL10A1, was positively correlated with gamma delta T cells and M1 macrophages, and was negatively correlated with CD8 T cells, monocytes, and follicular helper-T cells.

CONCLUSIONS: SYNM, TGFBR3, and COL10A1 are diagnostic genes of breast cancer. They affect breast cancer cells by modulating immune-infiltrating cells.

Keywords: Breast Diseases, Medical Oncology, Psycho-Oncology


Background

Breast cancer is one of the most common malignant tumors in women and is the leading cause of cancer death among women worldwide. Although breast cancer surgery combined with comprehensive treatment can greatly improve the overall survival (OS) and disease-free survival (DFS) of breast cancer patients, some patients still have a poor prognosis [1]. Thus, the search for new biomarkers of cancer treatment is very important for such patients [2].

Bioinformatics analysis of tumors has become an important method for exploring mechanisms of tumor development [3,4]. Bioinformatics methods can reveal genes and pathways relevant to the diagnosis and treatment of tumors. Recently, a study combined data from patients with primary and metastatic prostate cancer with machine learning methods to build a model for predicting disease status; it identified new targets for cancer treatment and validated them in in vitro experiments [5]. The authors suggested that this approach could be extended to the progression of other tumors. In this study, we analyzed a public database using machine learning methods to find signature diagnostic genes for breast cancer.

Breast tumors consist of cancerous and non-cancerous cells, which have rarely been studied; moreover, enriched pathways were found to be related to the tumor microenvironment. Therefore, we identified the composition of immune cells in breast tumors, and investigated the correlation between the key genes and infiltrating immune cells in breast cancer patients [6–9].

Thus, we analyzed mechanisms of key genes that regulate infiltrating immune cells and have an effect on breast cancer progression, in combination with previous studies. We obtained new targets through cancer biology, together with machine learning methods, which provide new pathways for breast cancer treatment.

Material and Methods


This study used data from the public domain. We searched the NCBI Gene Expression Omnibus (GEO) public database with the keywords “primary breast cancer” and limited the samples to “Homo sapiens.” We obtained 6 datasets that included both normal and cancerous breast tissues. The GSE54002 series matrix file had 433 samples (16 normal breast tissues and 417 breast cancer tissues); because of this large sample size, we could obtain comprehensive gene expression profiles. The GSE14548 series matrix file comprised 66 samples (28 normal breast tissues and 38 breast cancer tissues) and concentrated on breast cancer progression. The GSE5764 series matrix file had 30 samples in total (20 normal breast tissues and 10 breast cancer tissues) and supplemented lobular cancerous tissues and lobular cells. These 3 datasets were used as training sets to make the data more comprehensive. The GSE29044, GSE29431, and GSE15852 series matrix files consisted of 109 (36 normal breast tissues and 73 breast cancer tissues), 66 (12 normal breast tissues and 54 breast cancer tissues), and 86 (43 normal breast tissues and 43 breast cancer tissues) samples, respectively. These datasets are of moderate size and are not limited to a specific age or race.


Differentially expressed genes (DEGs) in primary breast cancer tissues and normal tissues from GSE5764, GSE14548, and GSE54002 were selected using the R software package “limma.” We used a volcano plot to visualize the DEGs. We set |logFC| ≥2 and P<0.05 as the thresholds for DEGs, and P values were adjusted for multiple testing correction using the false discovery rate (FDR).
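The thresholding step can be illustrated with a minimal sketch. It assumes that the FDR adjustment is the Benjamini-Hochberg procedure and that the P<0.05 cutoff is applied to the adjusted values; the text does not state either point explicitly. The gene names and statistics are hypothetical, and the function names `benjamini_hochberg` and `select_degs` are introduced only for illustration.

```python
from typing import Dict, List, Tuple


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_degs(stats: Dict[str, Tuple[float, float]],
                lfc_cutoff: float = 2.0,
                alpha: float = 0.05) -> Dict[str, str]:
    """stats maps gene -> (logFC, raw P); returns gene -> 'up' or 'down'."""
    genes = list(stats)
    adjusted = benjamini_hochberg([stats[g][1] for g in genes])
    return {
        g: ("up" if stats[g][0] > 0 else "down")
        for g, q in zip(genes, adjusted)
        if abs(stats[g][0]) >= lfc_cutoff and q < alpha
    }


# Hypothetical values for illustration only; not taken from the study.
toy_stats = {
    "GENE_A": (2.6, 0.0001),
    "GENE_B": (-3.1, 0.0004),
    "GENE_C": (1.2, 0.0002),  # significant, but below the fold-change cutoff
    "GENE_D": (-2.4, 0.2),    # large fold change, but not significant
}
degs = select_degs(toy_stats)
print(degs)
```

Only genes that pass both the fold-change and the adjusted P value filters are retained, which is why a volcano plot shows DEGs in its upper-left and upper-right corners.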

We screened breast cancer signature genes (key genes) that might be used as breast cancer diagnostic markers from the DEGs using LASSO logistic regression (R package “glmnet”) and the SVM-RFE algorithm (R package “e1071”). Diagnostic genes were obtained from the overlap of the 2 methods, which was plotted as a Venn diagram using R software (package “Venn”). We used these signature diagnostic genes as key genes for breast cancer, screened the genes that were also differentially expressed in the test datasets, and used boxplots to visualize the outcomes. ROC curves were then used to assess the accuracy of the results (R package “pROC”).
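As a rough illustration of what the area under the ROC curve measures for a single marker gene, the sketch below computes it as the probability that a randomly chosen tumor sample scores higher than a randomly chosen normal sample, with ties counted as one half. This is the rank-based definition of AUC, not a reproduction of the pROC workflow; the expression values and the function name `auc_from_scores` are hypothetical.

```python
from typing import Sequence


def auc_from_scores(case_scores: Sequence[float],
                    control_scores: Sequence[float]) -> float:
    """AUC = P(case > control) + 0.5 * P(case == control)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical expression values of one gene; not taken from the study.
tumor = [8.1, 7.4, 9.0, 6.9, 7.8]
normal = [6.2, 7.0, 5.9, 7.5]

# Gene treated as an upregulated marker: higher expression suggests tumor.
auc_up = auc_from_scores(tumor, normal)

# For a downregulated marker, lower expression suggests tumor,
# so the roles of the two groups are swapped before scoring.
auc_if_down = auc_from_scores(normal, tumor)

print(auc_up, auc_if_down)
```

The two values sum to 1 when there are no ties, so a downregulated gene that separates the groups well yields a value far below 0.5 unless the direction of the comparison is reversed.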


The samples were divided into 2 groups: a normal breast tissue group (control group) and a breast cancer tissue group (treatment group). DEGs were used for gene ontology (GO) enrichment analysis. GO enrichment analysis is widely used to identify the functions and pathways of genes, including biological process (BP), cellular component (CC), and molecular function (MF) categories. Gene set enrichment analysis (GSEA) was used to explore the biological functions of the 2 groups.
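GO over-representation analysis is commonly based on a hypergeometric test: given a background of annotated genes, it asks how likely it is to see at least the observed number of DEGs in a GO term by chance. The sketch below shows that calculation under this assumption. The background size, term size, and overlap are hypothetical; only the DEG count of 32 comes from this study.

```python
from math import comb


def enrichment_p_value(universe: int, term_size: int,
                       list_size: int, overlap: int) -> float:
    """P(X >= overlap) when list_size genes are drawn from universe genes,
    of which term_size are annotated to the GO term."""
    upper = min(term_size, list_size)
    total = comb(universe, list_size)
    tail = sum(
        comb(term_size, k) * comb(universe - term_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20000 background genes, a term with 400 genes,
# and 6 of the 32 DEGs annotated to that term.
p_term = enrichment_p_value(universe=20000, term_size=400,
                            list_size=32, overlap=6)
print(p_term)
```

With these numbers, fewer than 1 of the 32 genes would be expected in the term by chance, so an overlap of 6 gives a very small P value. In practice the P values of all tested terms are then adjusted for multiple testing.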

The R software package “clusterProfiler” was used to analyze GO, the “ggplot2” package was used to make the barplots, and GSEA was run for the “c5.go.v7.4.symbols.gmt” gene sets. P<0.05 was considered to be significant enrichment with an FDR>2.
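To show what GSEA computes for one gene set, the following sketch implements the weighted running-sum enrichment score: genes are ranked by a difference metric between the 2 groups, the score rises at each gene in the set (in proportion to its metric) and falls at each gene outside it, and the enrichment score is the largest deviation from zero. The permutation step that yields P values and FDR is omitted. Gene names and ranking scores are hypothetical.

```python
from typing import List, Set


def enrichment_score(ranked_genes: List[str], scores: List[float],
                     gene_set: Set[str], weight: float = 1.0) -> float:
    """Running-sum enrichment score over a list ranked from high to low."""
    in_set = [g in gene_set for g in ranked_genes]
    hit_weights = [abs(s) ** weight if hit else 0.0
                   for s, hit in zip(scores, in_set)]
    total_hit = sum(hit_weights)
    n_miss = len(ranked_genes) - sum(in_set)
    if total_hit == 0.0 or n_miss == 0:
        raise ValueError("gene set must overlap the list without covering it")
    running, best = 0.0, 0.0
    for w, hit in zip(hit_weights, in_set):
        running += w / total_hit if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the maximum deviation from zero
    return best


# Hypothetical ranked list: positive scores are higher in the treatment group.
genes = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
metric = [3.0, 2.5, 2.0, 1.0, -0.5, -1.0, -2.0, -3.0]

es_top = enrichment_score(genes, metric, {"g1", "g2", "g4"})
es_bottom = enrichment_score(genes, metric, {"g7", "g8"})
print(es_top, es_bottom)
```

A positive score indicates that the gene set is concentrated at the top of the ranking (here, the treatment group), and a negative score indicates concentration at the bottom (the control group).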


The tumor microenvironment (TME) plays an important role in tumor antagonism and promotion, and immune cells are a critical part of the TME. Using the cell-type identification by estimating relative subsets of RNA transcripts (CIBERSORT) algorithm, we estimated the proportions of 22 immune cell types in the control and treatment groups and identified the cell types whose proportions differed between the groups. We then explored the relationship between the key genes and the infiltrating immune cells.
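The idea behind expression-based deconvolution can be sketched as follows: a bulk expression profile is modeled as a mixture of cell-type signature profiles, and the mixing coefficients are estimated and normalized to proportions. CIBERSORT itself fits this model with nu-support vector regression; the sketch below swaps in a much simpler non-negative least squares fit by projected gradient descent, so it illustrates the model rather than the algorithm. The 4-gene, 3-cell-type signature matrix is hypothetical.

```python
from typing import List


def estimate_fractions(signature: List[List[float]],
                       mixture: List[float],
                       steps: int = 2000) -> List[float]:
    """Fit mixture ~ signature @ coef with coef >= 0, then normalize to sum 1.

    signature has one row per gene and one column per cell type.
    """
    n_genes, n_types = len(signature), len(signature[0])
    # trace(S^T S) bounds the largest eigenvalue, so 1/trace is a safe step.
    step = 1.0 / sum(v * v for row in signature for v in row)
    coef = [1.0 / n_types] * n_types
    for _ in range(steps):
        residual = [sum(s * c for s, c in zip(row, coef)) - m
                    for row, m in zip(signature, mixture)]
        grad = [sum(signature[i][j] * residual[i] for i in range(n_genes))
                for j in range(n_types)]
        # Gradient step, then projection onto non-negative coefficients.
        coef = [max(0.0, c - step * g) for c, g in zip(coef, grad)]
    total = sum(coef)
    if total == 0.0:
        raise ValueError("all coefficients collapsed to zero")
    return [c / total for c in coef]


# Hypothetical signature matrix (rows: genes, columns: cell types).
toy_signature = [
    [10.0, 1.0, 1.0],
    [1.0, 10.0, 1.0],
    [1.0, 1.0, 10.0],
    [5.0, 5.0, 5.0],
]
true_fractions = [0.5, 0.3, 0.2]
# Simulate a noise-free bulk sample from the known fractions.
toy_mixture = [sum(s * f for s, f in zip(row, true_fractions))
               for row in toy_signature]

estimated = estimate_fractions(toy_signature, toy_mixture)
print([round(x, 3) for x in estimated])
```

On noise-free toy data the fit recovers the simulated fractions. Real data are noisy and use a much larger signature matrix, which is one reason a more robust regression method is used in practice.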


The t test was used for comparison between both groups, and online gene expression profiling interactive analysis (GEPIA) was used to plot the survival analysis and Kaplan-Meier curves. We used R 4.1.1 to analyze all data and visualize outcomes. Statistical significance was defined as P<0.05.
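For readers unfamiliar with Kaplan-Meier curves, the sketch below computes the product-limit estimate directly: at each time with at least one death, the survival probability is multiplied by the fraction of at-risk patients who survive that time, and censored patients leave the risk set without lowering the curve. GEPIA performs this analysis online; the follow-up times, event flags, and grouping below are hypothetical.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float],
                 events: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (time, survival) at each distinct death time.

    events[i] is 1 if the patient died at times[i], 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve: List[Tuple[float, float]] = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Hypothetical follow-up (months) for two expression groups of one gene.
low_expression = kaplan_meier([5, 8, 8, 12, 15, 20], [1, 1, 0, 1, 0, 1])
high_expression = kaplan_meier([10, 18, 25, 30, 36, 40], [0, 1, 0, 1, 0, 0])

for name, curve in (("low", low_expression), ("high", high_expression)):
    print(name, [(t, round(s, 3)) for t, s in curve])
```

Comparing the two curves statistically requires an additional test such as the log-rank test, which is not shown here.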



Results

There were 32 DEGs between cancerous and normal breast tissues in the GSE5764, GSE54002, and GSE14548 datasets, including 27 downregulated genes and 5 upregulated genes. A volcano plot (Figure 1A) shows the DEGs. Eight and 23 DEGs were screened using LASSO regression and SVM-RFE, respectively.

Finally, we identified 8 signature diagnostic genes (key genes) from the overlapping genes of the 2 methods between the 2 groups (Figure 1B). The 7 downregulated genes were SYNM, KRT15, ELF5, CCL28, PI15, TGFBR3, and KRT14, and the only upregulated gene was COL10A1. We further analyzed the differential expression of the 8 genes in breast cancer and normal tissues in the test datasets (GSE29044, GSE29431, and GSE15852), and the results showed that only TGFBR3, SYNM, and COL10A1 were still differentially expressed (Figure 2A–2C). The ROC curves showed that all areas under the curve (AUC) were >0.7, indicating good accuracy.


GO enrichment analysis showed that DEGs mainly existed on collagen-containing extracellular matrices, intermediate filaments, and cell–cell junctions.

Biological processes were mainly related to epithelial tube morphogenesis and negative regulation of the Wnt signaling pathway. Molecular functions were mainly related to glycosaminoglycan binding, sulfur compound binding, and Wnt-protein binding (Figure 3A). GSEA indicated that the pathway enrichment of the treatment group was mainly related to chromosome segregation and the cell cycle process (Figure 3B), while the control group was associated with cornification, nuclear transcription, and intermediate filaments.


The violin plot shows the distributions of immune cells in the control and treatment groups (Figure 4A), and the relationship between immune cells (Figure 4B). The results indicated that the proportions of CD8 T cells, gamma delta T cells, and M1 macrophages were significantly higher in breast cancer tissues, while the proportions of monocytes and follicular helper-T cells were significantly lower in breast cancer tissues. In addition, by further exploring the expression of these 3 key genes in immune cells, the results indicated that the downregulated genes, SYNM and TGFBR3, in the breast cancer tissues were positively correlated with CD8 T cells and monocytes, and negatively correlated with M1 macrophages; TGFBR3 was negatively correlated with M1 macrophages. Conversely, the upregulated gene in the breast cancer tissues, COL10A1, was positively correlated with gamma delta T cells and M1 macrophages, and negatively correlated with CD8 T cells, monocytes, and follicular helper-T cells. The relationship between the key genes and immune cells is shown in Figure 4C. The results suggested that the genes we selected were highly related to the level of immune cell infiltration, which is a critical component of the microenvironment.
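A correlation between a key gene and an immune cell type can be computed per sample from the gene's expression and the estimated cell proportion. The text does not state which coefficient was used; the sketch below assumes Spearman rank correlation, which is a common choice for proportions. All values and the function names are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-sample values: gene expression and a cell-type proportion.
gene_expression = [1.2, 2.3, 3.1, 4.8, 5.0]
cell_fraction = [0.30, 0.22, 0.25, 0.10, 0.05]

rho = spearman(gene_expression, cell_fraction)
print(round(rho, 3))
```

A negative coefficient, as in this toy example, corresponds to the pattern described for an upregulated gene and a cell type that is less abundant in tumor tissue.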


In our study, we explored the relationships between the expression levels of the 3 genes and the survival rate of breast cancer using Kaplan-Meier survival analysis through online GEPIA. Breast cancer patients were divided into 2 groups (high- and low-risk groups), and the results were visualized (Figure 5). The results indicated that low expression levels of SYNM were significantly related to poor prognosis in patients with breast cancer, indicating that SYNM might act as a tumor-suppressor gene. High expression levels of COL10A1 and low expression levels of TGFBR3 improved the prognosis of breast cancer, although the results were not significant.


Discussion

In this study, we identified 32 DEGs in cancerous and normal breast tissues. By comparing the expression levels of DEGs between the control and treatment groups, GO analysis showed that the DEGs were related to the extracellular matrix (ECM), thus indicating that the tumor microenvironment (TME) plays a vital role in breast cancer. The ECM provides physical support for cells and affects cell adhesion and infiltration, and the immune status of the TME is an important factor that affects tumor progression. With different infiltrating immune cells or molecules, the TME has a differentiated effect on tumor progression, and through TME-targeted immunotherapy, tumor progression can be suppressed [9–12].

We found that the proportions of CD8 T cells, M1 macrophages, and gamma delta T cells were significantly higher in breast cancer tissues, while monocyte and follicular helper-T cell proportions were significantly lower. M1 macrophages and gamma delta T cells were positively correlated; however, they were negatively correlated with CD8 T cells, monocytes, and follicular helper-T cells. In addition, the proportion of M0 macrophages in breast cancer tissues was higher than that in normal tissues, although this difference was not significant. A previous study has suggested that tumor-associated macrophages (TAMs) are one of the largest components of inflammatory cells in the TME [13]. Macrophages can differentiate into different types of TAMs, M1 and M2 macrophages, depending on the TME cytokines. M1 macrophages kill tumor cells, whereas M2 macrophages promote tumor proliferation, angiogenesis, and metastasis. Studies have shown that the early stage of tumors is mainly M1-TAM, and the middle and late stages are mainly M2-TAM. Promoting the polarization of M2-TAM to M1-TAM can inhibit tumor development [14]. Generally, TAM is positively correlated with tumor development and metastasis, which indicates that TAM could be further explored in tumor immune therapy [15]. The results also suggest that the negative correlation of M1 macrophages with CD8 T cells has an anti-tumor effect. A recent study showed that patients with kidney cancer with CD8 T cell infiltration less than 2.2% are more likely to have a poor prognosis after surgery. This also indicates that CD8 T cells may have an effect on breast cancer prognosis [16].

GSEA indicated that the treatment group was correlated with chromosome segregation and the cell cycle, and the enriched pathways were correlated with the downregulated DEGs in breast cancer tissues. Chromosome replication and segregation are essential steps of the cell cycle. Chromosomal instability (CIN) leads to uncontrolled division of cells into tumors. However, recent work has shown that extreme missegregation can lead to cell death: paclitaxel kills tumor cells by amplifying their chromosome segregation errors, which overturns previous views [17]. The results indicated that these downregulated DEGs enriched in chromosome segregation and cell cycle pathways could act as target genes for aiding the effect of paclitaxel.

Furthermore, using machine learning methods, we found that the 3 genes could be used to diagnose breast cancer. Among them, COL10A1 was upregulated and SYNM and TGFBR3 were both downregulated in breast cancer tissues. Furthermore, COL10A1 was positively correlated with M0 and M1 macrophages and gamma delta T cells, but was negatively correlated with CD8 T cells, monocytes, and follicular helper-T cells. A previous study illustrated that as a member of the collagen family, COL10A1 expression was higher in human breast cancer tissues than in normal human breast tissues; thus, COL10A1 overexpression could advance the proliferation, migration, and invasion of breast cancer cells, leading to poor prognosis. This is consistent with our findings, which indicate that COL10A1 is an oncogene in breast cancer and should be further explored as a target for cancer therapy [18]. Thus, we speculated that knocking out COL10A1 could be a novel method for treating breast cancer; however, there are still many problems to be solved in this field [19].

A previous study proved that type III TGF-β receptor (TGFBR3) inhibits tumor cell migration and invasion, and suppresses the development of antigen-specific immune responses through the TGF-β signaling pathway in the early stage of breast cancer in the TME [20]. SYNM acts as a type-IV intermediate filament that regulates cell adhesion and motility. GO analysis indicated that SYNM was mainly correlated with the morphology and functionality of myoepithelial cells, which play an important role in maintaining breast cell structure by regulating luminal cell growth and differentiation. We found that high expression levels of SYNM could improve the prognosis of breast cancer patients, and SYNM has been confirmed as a breast tumor-suppressor gene by real-time PCR analysis, through the regulation of cell adhesion and cell motility. Our study showed that TGFBR3 and SYNM were negatively correlated with TAM-M1 and TAM-M0, but positively correlated with TAM-M2. Therefore, we suspected that they could regulate the polarization of TAM. TGFβ2 has been shown to suppress T cell-mediated immunity by promoting Treg responses, and our results indicated that low TGFBR3 expression could regulate the specific immunity in the breast TME, making it a promising biomarker for breast cancer drug treatment [21,22].

Overall, an important future direction for breast cancer treatment is exact targeted therapy. Our study identified DEGs related to breast cancer through bioinformatics methods and found 3 diagnostic genes using machine learning methods, and further explored their mechanisms and pathways. The results will hopefully serve as useful information for breast cancer diagnosis and therapy.


Conclusions

SYNM, TGFBR3, and COL10A1 can act as diagnostic breast cancer genes. Breast cancer is associated with ECM, the Wnt signaling pathway, and glycosaminoglycan binding. Chromosome segregation and cell cycle processes can also affect breast cancer. SYNM, TGFBR3, and COL10A1 may regulate TAM polarization and affect the development of breast cancer.


References

1. Soerjomataram I, Louwman MW, Ribot JG, An overview of prognostic factors for long-term survivors of breast cancer: Breast Cancer Res Treat, 2008; 107(3); 309-30

2. Ross JS, Linette GP, Stec J, Breast cancer biomarkers and molecular medicine: Expert Rev Mol Diagn, 2003; 3(5); 573-85

3. Fu-Jun L, Shao-Hua J, Xiao-Fang S, Differential proteomic analysis of pathway biomarkers in human breast cancer by integrated bioinformatics: Oncol Lett, 2012; 4(5); 1097-103

4. Jia R, Li Z, Liang W, Identification of key genes unique to the luminal a and basal-like breast cancer subtypes via bioinformatic analysis: World J Surg Oncol, 2020; 18(1); 268

5. Elmarakeby HA, Hwang J, Liu , Biologically informed deep neural network for prostate cancer classification and discovery: Nature, 2021; 598; 348-52

6. Wei J, Huang XJ, Huang Y, Key immune-related gene ITGB2 as a prognostic signature for acute myeloid leukemia: Ann Transl Med, 2021; 9(17); 1386

7. Li Y, Dong W, Zhang P: Front Cell Dev Biol, 2021; 9; 750897

8. Guo L, Jing Y, Construction and identification of a novel 5-gene signature for predicting the prognosis in breast cancer: Front Med (Lausanne), 2021; 8; 669931

9. Ali HR, Chlon L, Pharoah PD, Patterns of immune infiltration in breast cancer and their clinical implications: A gene-expression-based retrospective study: PLoS Med, 2016; 13(12); e1002194

10. Chen DS, Mellman I, Elements of cancer immunity and the cancer-immune set point: Nature, 2017; 541(7637); 321-30

11. Hanahan D, Coussens LM, Accessories to the crime: functions of cells recruited to the tumor microenvironment: Cancer Cell, 2012; 21(3); 309-22

12. Turley SJ, Cremasco V, Astarita JL, Immunological hallmarks of stromal cells in the tumour microenvironment: Nat Rev Immunol, 2015; 15(11); 669-82

13. Liu Y, Li L, Li Y, Research progress on tumor-associated macrophages and inflammation in cervical cancer: Biomed Res Int, 2020; 2020; 6842963

14. Zanganeh S, Hutter G, Spitler R, Iron oxide nanoparticles inhibit tumour growth by inducing pro-inflammatory macrophage polarization in tumour tissues: Nat Nanotechnol, 2016; 11(11); 986-94

15. Binnewies M, Abushawish M, Dash S, Targeting trem2 on tumor associated macrophages enhances efficacious immunotherapy: Cell Rep, 2021; 37(3); 109844

16. Jansen CS, Prokhnevska N, Master VA, An intra-tumoral niche maintains and differentiates stem-like CD8 t cells: Nature, 2019; 576(7787); 1-6

17. Scribano CM, Wan J, Esbona K, Chromosomal instability sensitizes patient breast tumors to multipolar divisions induced by paclitaxel: Sci Transl Med, 2021; 13(610); eabd4811

18. Zhang M, Chen H, Wang M, Bioinformatics analysis of prognostic significance of COL10A1 in breast cancer: Biosci Rep 28, 2020; 40(2); BSR20193286

19. Petty AJ, Yang Y, Tumor-associated macrophages: Implications in cancer immunotherapy: Immunotherapy, 2017; 9; 289-302

20. Dong M, How T, Kirkbride KC, The type III TGF-β receptor suppresses breast cancer progression: J Clin Invest, 2007; 117(1); 206-17

21. Xiao Z, Hu L, Yang L, TGFβ2 is a prognostic-related biomarker and correlated with immune infiltrates in gastric cancer: J Cell Mol Med, 2020; 24; 7151-62

22. Pekny M, Wilhelmsson U, Intermediate filaments in astrocytes in health and disease: Intermediate filaments, 2006, Boston, MA, Springer https://doi.org/10.1007/0-387-33781-4_2

Medical Science Monitor eISSN: 1643-3750