Comparison of Classification Models for Breast Cancer Disease Using Multivariate Analysis and Data Mining Approaches
DOI: https://doi.org/10.58915/amci.v12i4.348
Abstract
Compared with other cancer types, breast cancer is one of the main causes of death in women. Early detection can significantly increase survival and quality of life. A variety of machine learning prediction algorithms, combined with feature selection approaches, have been shown to be useful in detecting breast cancer. However, problems with classification accuracy remain, and outliers are known to have a potential effect on that accuracy. To further improve classification accuracy, the K-Means approach was used to detect outliers. The main goal of this study was to examine the classification performance for breast cancer when feature selection methods were used in combination with K-Means. For the experiments, the Coimbra breast cancer dataset, consisting of 116 instances and 10 attributes, was used. Multivariate techniques, namely Principal Component Analysis (PCA), Kernel Principal Component Analysis (KPCA), and Discriminant Analysis (DA), were applied to reduce the data dimensions. Four data mining approaches, Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), and Logistic Regression (LR), were then compared for classification. Performance was evaluated using accuracy, precision, specificity, and sensitivity. The results revealed that five combined approaches (PCA-DT, PCA-RF, KPCA-DT, KPCA-RF, DA-RF) performed better across all four criteria after being combined with the K-Means technique. Among these five, KPCA with DT outperformed the others, with the highest precision (76.47 percent) and specificity (71.43 percent). This study suggests that incorporating a feature selection method together with an outlier detection method is more efficient and beneficial for breast cancer classification.
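
The four evaluation criteria named in the abstract can be computed from the counts of a binary confusion matrix. The following is a minimal sketch using the standard textbook definitions, not the authors' code. The names `ConfusionCounts`, `count_outcomes`, and `evaluate`, the choice of `1` as the positive (patient) label, and the toy label lists are all assumptions made for illustration; the toy labels are not taken from the Coimbra dataset or from the study's results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts for a binary classifier (hypothetical helper type)."""
    tp: int  # positives predicted as positive
    fp: int  # negatives predicted as positive
    tn: int  # negatives predicted as negative
    fn: int  # positives predicted as negative


def count_outcomes(y_true, y_pred, positive=1):
    """Tally the confusion-matrix cells; `positive` is an assumed label."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return ConfusionCounts(tp=tp, fp=fp, tn=tn, fn=fn)


def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(counts):
    """Compute the four criteria from the confusion-matrix counts."""
    total = counts.tp + counts.fp + counts.tn + counts.fn
    return {
        "accuracy": _ratio(counts.tp + counts.tn, total),
        "precision": _ratio(counts.tp, counts.tp + counts.fp),
        "sensitivity": _ratio(counts.tp, counts.tp + counts.fn),
        "specificity": _ratio(counts.tn, counts.tn + counts.fp),
    }


# Toy labels for illustration only (not data from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = count_outcomes(y_true, y_pred)
metrics = evaluate(counts)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Reporting sensitivity and specificity alongside accuracy matters in a screening setting, because a classifier can reach a high accuracy while still missing many patients (low sensitivity) or flagging many healthy controls (low specificity).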