Machine-learning studies of Android malware detection have drawn on varied data sources for model training, including network traffic and memory dumps. In this paper, we propose a model that determines whether malware is present by training three Transformer models (BERT, RoBERTa, and BART) on Smali code obtained from APK files. We decompiled 1,318 malware-infected files and 1,236 benign files from the CIC-AndMal-2020 dataset. The decompiled files were very large and contained code unnecessary for training, so a preprocessing step was required to remove it. In training and evaluation, RoBERTa achieved the highest evaluation accuracy. However, BERT showed better training performance, and in predictions on 451 benign and 597 malware files, BERT slightly outperformed RoBERTa. BART generally performed worse than both BERT and RoBERTa. The discrepancies among the training, evaluation, and prediction results for BERT and RoBERTa appear to stem from limited dataset diversity and the lack of sophisticated preprocessing. Nevertheless, the experiment confirms that both BERT and RoBERTa can achieve strong performance in malware detection. In future work, we expect improved preprocessing to raise the proposed model's performance further.
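As an illustration of the kind of preprocessing step described above, the following is a minimal sketch, not the procedure used in this work. It assumes that "unnecessary" code means blank lines, full-line comments, and debug-only Smali directives such as `.line`, `.source`, and `.local`; that selection, the function name `clean_smali`, and the sample input are all hypothetical.

```python
# Hypothetical set of debug-only directives to drop; the paper does not
# specify which parts of the decompiled Smali were removed.
DEBUG_TOKENS = {".line", ".source", ".local", ".prologue"}
DEBUG_PREFIXES = (".end local", ".restart local")


def clean_smali(text: str) -> str:
    """Drop blank lines, full-line comments, and debug directives."""
    kept = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip empty lines and full-line comments (Smali comments start with '#').
        if not line or line.startswith("#"):
            continue
        # Compare the first token exactly so that '.locals' (register count)
        # is not mistaken for the debug directive '.local'.
        if line.split()[0] in DEBUG_TOKENS or line.startswith(DEBUG_PREFIXES):
            continue
        kept.append(line)
    return "\n".join(kept)


# Made-up sample input, for illustration only.
SAMPLE = """\
.class public Lcom/example/Main;
.source "Main.java"

# direct methods
.method public static main()V
    .locals 1
    .line 5
    const-string v0, "hi"
    .local v0, "msg":Ljava/lang/String;
    return-void
.end method
"""

if __name__ == "__main__":
    print(clean_smali(SAMPLE))
```

Matching on the first token rather than a raw prefix is a deliberate choice here: it keeps behaviour-relevant directives such as `.locals` and `.end method` while removing only metadata that a debugger would use. The cleaned text would still need to be tokenized and split to fit a Transformer's input length limit before training.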