Comparison of Classification Models for Breast Cancer Disease Using Multivariate Analysis and Data Mining Approaches

Authors

  • Nurul Ashyikin Ramli, College of Computing, Informatics and Mathematics, Universiti Teknologi MARA (UiTM) Shah Alam, 40450, Shah Alam, Selangor, Malaysia.
  • Zalina Zahid, College of Computing, Informatics and Mathematics, Universiti Teknologi MARA (UiTM) Shah Alam, 40450, Shah Alam, Selangor, Malaysia.
  • Siti Aida Sheikh Hussin, College of Computing, Informatics and Mathematics, Universiti Teknologi MARA (UiTM) Shah Alam, 40450, Shah Alam, Selangor, Malaysia.
  • Noor Asiah Ramli, College of Computing, Informatics and Mathematics, Universiti Teknologi MARA (UiTM) Shah Alam, 40450, Shah Alam, Selangor, Malaysia.

DOI:

https://doi.org/10.58915/amci.v12i4.348

Abstract

Among cancer types, breast cancer is one of the main causes of death in women. Early detection can significantly increase survival and quality of life. A variety of machine learning prediction algorithms, combined with feature selection approaches, have been shown to be useful in detecting breast cancer. However, problems with classification accuracy remain, and outliers are known to have a potential effect on it. To further improve classification accuracy, the K-Means approach was used to detect outliers. The main goal of this study was to examine classification performance for breast cancer when feature selection methods were used in combination with K-Means. For the experiments, the Coimbra breast cancer dataset, consisting of 116 instances and 10 attributes, was used. Multivariate techniques, namely Principal Component Analysis (PCA), Kernel Principal Component Analysis (KPCA), and Discriminant Analysis (DA), were applied to reduce the data dimensions. Four data mining approaches, Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), and Logistic Regression (LR), were then compared for classification. Performance was evaluated using accuracy, precision, specificity, and sensitivity. The results revealed that five combinations (PCA-DT, PCA-RF, KPCA-DT, KPCA-RF, and DA-RF) performed better across all four criteria after being combined with the K-Means technique. Among these five, KPCA with DT outperformed the others, with the highest precision (76.47 percent) and specificity (71.43 percent). This study suggests that incorporating a feature selection method together with an outlier detection method is more efficient and beneficial for breast cancer classification.
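The workflow described in the abstract can be sketched roughly as follows. This is a minimal illustration with scikit-learn, not the authors' implementation. Several choices are assumptions not stated in the abstract: the data are synthetic stand-ins of the same size as the Coimbra dataset (116 rows, 9 predictors plus a class label), the outlier rule (dropping points beyond the 95th percentile of distance to their assigned K-Means centroid), the RBF kernel with 3 components, the 70/30 split, and the helper names `remove_kmeans_outliers` and `four_criteria`. Outliers are removed from the training portion only, which is a design choice made here to keep the test set untouched.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.decomposition import KernelPCA
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def remove_kmeans_outliers(X, y, n_clusters=2, percentile=95.0, seed=0):
    """Drop points unusually far from their assigned K-Means centroid.

    The distance-percentile rule is an assumption for illustration;
    the abstract does not specify how outliers were flagged.
    """
    X_scaled = StandardScaler().fit_transform(X)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_scaled)
    # Euclidean distance from each point to the centroid of its own cluster.
    dist = np.linalg.norm(X_scaled - km.cluster_centers_[km.labels_], axis=1)
    keep = dist <= np.percentile(dist, percentile)
    return X[keep], y[keep]


def four_criteria(y_true, y_pred):
    """Accuracy, precision, sensitivity, and specificity from a 2x2 confusion matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

    def ratio(num, den):
        return float(num) / float(den) if den else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
    }


# Synthetic placeholder data (assumption): same shape as the Coimbra dataset.
X, y = make_classification(
    n_samples=116, n_features=9, n_informative=5, random_state=0
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Outlier removal is applied to the training portion only.
X_clean, y_clean = remove_kmeans_outliers(X_train, y_train)

# KPCA-DT: scale, reduce dimensions with kernel PCA, classify with a decision tree.
model = make_pipeline(
    StandardScaler(),
    KernelPCA(n_components=3, kernel="rbf"),
    DecisionTreeClassifier(random_state=0),
)
model.fit(X_clean, y_clean)

scores = four_criteria(y_test, model.predict(X_test))
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Swapping `KernelPCA` for `PCA` or `LinearDiscriminantAnalysis`, and the tree for `RandomForestClassifier`, `SVC`, or `LogisticRegression`, would give the other combinations named in the abstract. Scores on the synthetic data say nothing about the reported results.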

Keywords:

Breast Cancer, Principal Component Analysis, Kernel Principal Component Analysis, Random Forest, Support Vector Machine

Published

2023-11-10

How to Cite

Nurul Ashyikin Ramli, Zalina Zahid, Siti Aida Sheikh Hussin, & Noor Asiah Ramli. (2023). Comparison of Classification Models for Breast Cancer Disease Using Multivariate Analysis and Data Mining Approaches. Applied Mathematics and Computational Intelligence (AMCI), 12(4), 1–12. https://doi.org/10.58915/amci.v12i4.348