Credit card fraud detection is the task of finding abnormal transactions among credit card transactions. With the recent rapid growth of e-commerce, the volume of credit card transactions has increased exponentially, and abnormal transaction patterns have become more complex and sophisticated. As customer losses from abnormal transactions grow, companies have implemented and operated fraud detection systems to minimize the damage. A fraud detection system learns the patterns of normal and abnormal transactions from large volumes of credit card transaction data through machine learning, and then uses the learned model to predict whether an actual transaction is abnormal. In this dissertation, we propose a method for building a fraud detection system with excellent performance.

In terms of datasets, credit card transaction datasets are imbalanced, with normal transactions far outnumbering abnormal ones. General machine learning methods are known to be suboptimal for such imbalanced classification. A popular solution is to balance the training data by oversampling the underrepresented classes (or undersampling the overrepresented classes) before applying machine learning algorithms. However, despite its popularity, the effectiveness of sampling has not been rigorously and comprehensively evaluated. To address this issue, we evaluated combinations of seven sampling methods and eight machine learning classifiers (56 combinations in total) on 31 datasets with varying degrees of imbalance. We used the area under the precision-recall curve (AUPRC) and the area under the receiver operating characteristic curve (AUROC) as performance measures. AUPRC is known to be more informative than AUROC for imbalanced classification. We observed that sampling significantly changed classifier performance (paired t-test, P < 0.05) in only a few cases (12.2% for AUPRC and 10.0% for AUROC).
Surprisingly, sampling tended to degrade rather than improve classification performance. Furthermore, the negative effects of sampling were more pronounced in AUPRC than in AUROC. Among the sampling methods, undersampling performed worse than the others. Sampling was also more effective at improving linear classifiers. Most importantly, sampling was not needed to obtain the optimal classifier for most of the 31 datasets. In addition, we found two interesting examples in which sampling significantly reduced AUPRC while significantly improving AUROC (paired t-test, P < 0.05). In conclusion, the applicability of sampling is limited because it can be ineffective or even harmful. Moreover, the choice of performance measure is critical to decision making. Our results provide valuable insights into the effects and characteristics of sampling for imbalanced classification.

Credit card fraud detection is a typical classification problem to which various machine learning methods have been applied. In previous studies, deep neural networks and gradient boosting-based methods have generally shown excellent classification performance. However, it is difficult to determine which machine learning method should be applied to fraud detection in practice, because the machine learning methods, performance evaluation measures, experimental datasets, and methods for estimating test performance differ from study to study. To select a machine learning method, we applied nine machine learning methods to two publicly available real-world credit card transaction datasets and analyzed the results to see which method is best for credit card fraud detection. Our experimental results show that the gradient boosting methods, namely extreme gradient boosting (XGBoost) and light gradient boosting machines (LGBMs), have the highest classification accuracy on both datasets.
We also achieved better results than the previous state-of-the-art results on the credit card fraud detection dataset. In terms of prediction time, LGBMs were more than 40 times faster than XGBoost. Based on these results, we propose that gradient boosting-based methods, especially LGBMs, are suitable for credit card fraud detection.

This dissertation also raises a number of issues to consider when building a fraud detection system for real-world use. Our analysis shows that optimizing AUPRC is directly aligned with minimizing the costs associated with credit card fraud detection. Therefore, AUPRC is the more desirable measure for evaluating the performance of a fraud detection system. We also point out that data analysis and preprocessing must be grounded in an understanding of the domain, because credit card transaction data are large and contain many diverse features. Furthermore, the patterns of abnormal credit card use change over time, so periodic retraining is essential to prevent degradation of system performance. Finally, we explain that a strategy is needed for building the detection system according to the type of fraud, namely fraud that is detected as abnormal almost immediately or fraud that is detected only later, after a delay. In addition to the two major issues of dealing with imbalanced datasets and selecting machine learning methods, several other issues need to be studied comprehensively in order to build an excellent fraud detection system, and they should be addressed through continued research.
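
To illustrate why the choice between AUPRC and AUROC matters on imbalanced data, the following is a minimal sketch in plain Python. It is not the evaluation code used in this dissertation. The toy dataset (5 frauds among 1,000 transactions, placed at ranks 10, 20, 30, 40, and 50) and the function names `auroc` and `average_precision` are assumptions made for this example, and average precision is used here as a common step-wise estimate of AUPRC.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of the precision measured at each positive's rank.

    Assumes at least one positive and no tied scores (true for the toy data below).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits


# Hypothetical toy data: 1,000 transactions sorted by descending fraud score,
# with the 5 fraudulent ones sitting at ranks 10, 20, 30, 40, and 50.
n_transactions = 1000
fraud_ranks = {10, 20, 30, 40, 50}
scores = [1.0 - i / n_transactions for i in range(n_transactions)]
labels = [1 if (i + 1) in fraud_ranks else 0 for i in range(n_transactions)]

roc = auroc(labels, scores)
ap = average_precision(labels, scores)
print(f"AUROC = {roc:.3f}, average precision = {ap:.3f}")
```

On this toy ranking, AUROC is about 0.973 because every fraud outranks the vast majority of normal transactions, while average precision is only 0.100 because nine of every ten alerts raised down to each fraud are false alarms. This is one way to see why a measure based on precision can track investigation cost more closely when frauds are rare.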