Visualization-Based Analysis Frameworks for Executable Malware Using Behavioral–Performance and Structural Features

Multilingual Abstract

Traditional cyberattacks were directed primarily at individual users, but in recent years sophisticated targeted attacks against specific organizations and critical infrastructure have become increasingly common. A representative case is the ransomware attack that suspended the operations of the U.S. pipeline company Colonial Pipeline. It demonstrates that malware threats can paralyze industrial and social infrastructure, far beyond affecting individual users. As attacks evolve in this way, it is increasingly important not only to block them but also to accurately identify the malware used, determine who the attacker is, and characterize the attack.

Executable file analysis is a core technology in cybersecurity and software engineering for malware identification and classification, code similarity analysis, and vulnerability analysis. Malware identification and classification in particular rely on features extracted through traditional static and dynamic analysis. Because classification depends on these features, attackers employ advanced concealment and evasion techniques to avoid detection and analysis. This makes feature extraction based on conventional analysis increasingly difficult and ultimately degrades the classification performance of machine learning and deep learning models.

To overcome these limitations, this dissertation proposes two complementary visualization-based analysis frameworks for executable malware. Both frameworks transform features extracted from executable files into images so that convolutional neural networks (CNNs) can effectively learn discriminative patterns. Although CNNs have already demonstrated excellent performance in general image classification tasks, conventional methods that simply convert packed or obfuscated executables into grayscale images do not reliably classify malware under realistic conditions. Hyperparameter tuning and the design of new deep learning architectures are also important research directions for improving performance, but this work focuses on a more fundamental challenge: designing robust visualization-based feature representations in which the inherent characteristics of malware remain visible even under various analysis-evasion techniques. Accordingly, the CNN architecture and training configuration are fixed to a standard setup, and only the input representations and visualization schemes are varied in order to systematically analyze their impact on classification accuracy and generalization.

First, in the dynamic-analysis domain, we present PerfSight, a behavioral performance visualization framework. PerfSight collects the usage of system resources such as CPU, memory, and I/O as time-series data and converts them into images, thereby extracting features that are robust against analysis-evasion and concealment techniques. Experiments on real-world ransomware show that, even with a simple CNN model, PerfSight achieves a classification accuracy of at least 98.94%, which is sufficient for ransomware classification.

Second, in the static-analysis domain, we introduce BinSight, a kernel density estimation (KDE)-based visualization framework. BinSight addresses the limitation that grayscale image–based visualization cannot adequately express code structure and data distribution. It converts various structural features extracted from executable files into two-dimensional density images via KDE, thereby preserving structural characteristics while providing inputs that are well suited to CNNs. Experiments on Windows PE executables under a rigorous, leakage-controlled protocol show that BinSight achieves a macro F1-score of 97.59% on a challenging code structure–based dataset, compared to 24.90% for the grayscale baseline, an improvement of 72.69 percentage points. On the byte-based dataset, BinSight also yields a consistent macro F1-score improvement of 2.57%.

PerfSight and BinSight each overcome the limitations of existing visualization techniques in their respective domains. This dissertation experimentally demonstrates that the performance and stability of visualization-based malware classification strongly depend on how effectively the intrinsic characteristics of executable files are captured in the input representation, and it presents a research direction toward more robust feature extraction and visualization methods for increasingly sophisticated malware. Furthermore, the two frameworks can be combined in a complementary manner to form the basis of an effective integrated analysis pipeline for large-scale automated classification of executable malware and practical deployment in real-world environments.
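
The PerfSight idea summarized in the abstract, turning resource-usage time series into images, can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function names, the nearest-index resampling, the min-max scaling, and the choice of mapping CPU, memory, and I/O to the red, green, and blue channels are all assumptions made here for illustration.

```python
def normalize(series):
    """Min-max scale a numeric series to integers in 0..255."""
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0  # avoid division by zero for constant series
    return [round(255 * (v - lo) / span) for v in series]


def resample(series, length):
    """Pick `length` samples by nearest-index lookup (stretches or shrinks)."""
    if not series:
        raise ValueError("series must be non-empty")
    n = len(series)
    return [series[(i * n) // length] for i in range(length)]


def to_rgb_image(cpu, mem, io, side=8):
    """Pack three resource-usage series into a side x side RGB image.

    Pixel k (row-major) holds the k-th resampled sample of each metric:
    R = CPU, G = memory, B = I/O (an assumed, hypothetical channel layout).
    """
    channels = [normalize(resample(s, side * side)) for s in (cpu, mem, io)]
    pixels = list(zip(*channels))
    return [pixels[r * side:(r + 1) * side] for r in range(side)]


if __name__ == "__main__":
    # Made-up measurements for illustration only.
    cpu = [5, 7, 90, 95, 92, 10]
    mem = [100, 120, 400, 800, 900, 950]
    io = [0, 0, 50, 300, 280, 5]
    image = to_rgb_image(cpu, mem, io, side=4)
    for row in image:
        print(row)
```

Fixing the image size regardless of how long a sample runs gives the CNN a constant input shape, which is one plausible reason for the fixed-size grayscale and color variants listed in the table of contents.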

Korean Abstract

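Traditional cyberattacks were aimed mainly at individuals, but sophisticated targeted attacks against specific targets have recently been increasing. A representative case is the ransomware attack on the U.S. pipeline company Colonial Pipeline, which halted its operations. This shows that malware threats can go beyond affecting individuals and paralyze entire industrial and social infrastructure. As attack types change in this way, it is becoming increasingly important not merely to block attacks but to accurately identify the malware used in them and thereby determine who the attacker is and what the attack's characteristics are. Executable file analysis is a core technology in cybersecurity and software engineering for malware identification and classification, code similarity analysis, and vulnerability analysis. Malware identification and classification in particular require feature extraction through traditional static and dynamic analysis. For this reason, attackers use advanced concealment techniques to evade detection and analysis. As a result, feature extraction based on traditional analysis becomes difficult, and the classification performance of machine learning and deep learning models ultimately degrades.

To overcome these limitations, this dissertation proposes two complementary visualization-based frameworks for executable malware. Both frameworks visualize features extracted from executable files as images so that a convolutional neural network (CNN) can effectively learn discriminative patterns. The strong performance of CNNs on general image classification problems has already been widely demonstrated, but existing methods that simply convert packed or obfuscated executables into, for example, grayscale images fail to classify real-world malware reliably. Hyperparameter tuning and the design of new deep learning architectures are also important research directions that can improve performance, but this study focuses on a more fundamental challenge: designing robust visualization-based feature representations in which the intrinsic characteristics of malware remain visible even when various anti-analysis techniques are applied. Accordingly, the CNN model and training settings are fixed to a standard configuration, and only the input representation and visualization scheme are varied to systematically analyze the differences in classification performance and generalization ability.

First, in the dynamic-analysis domain, we present PerfSight, a behavioral-performance visualization framework. PerfSight collects the usage of system resources such as CPU, memory, and I/O as time-series data and converts them into images, thereby extracting features that are robust against analysis-evasion and concealment techniques. In experiments on real ransomware, it achieved a classification accuracy of at least 98.94% even with a simple CNN model, showing that it provides sufficient performance for ransomware classification.

Second, in the static-analysis domain, we introduce BinSight, a visualization framework based on kernel density estimation (KDE). BinSight aims to resolve the limitation that grayscale image-based visualization cannot sufficiently express code structure and data distribution. The framework converts various structural features extracted from executable files into two-dimensional density images via KDE, preserving structural characteristics while providing input optimized for CNNs. In experiments on Windows PE executables in a strictly leakage-controlled setting, BinSight achieved a macro F1-score of 97.59% on a difficult code structure-based dataset, an improvement of 72.69 percentage points over the grayscale baseline's 24.90%, and it also achieved a consistent macro F1-score improvement of 2.57% on the byte-based dataset. PerfSight and BinSight each clearly overcome the limitations of existing visualization techniques in their respective domains.

This dissertation experimentally establishes that the performance and stability of visualization-based malware classification depend on how effectively the representation reflects the intrinsic characteristics of executable files. It also presents a research direction toward more robust feature extraction and visualization for countering advanced malware threats. Combined in a complementary manner, these frameworks provide the basis for an effective integrated analysis pipeline that supports automated classification of large-scale executable malware and is applicable to practical environments.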
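
The KDE-based visualization that BinSight is described as using can be illustrated with a small, dependency-free sketch. It is an assumption-laden illustration rather than the dissertation's method: the function name `kde_density_image`, the choice of plotting each sequence element as a point (relative position, normalized value), the product Gaussian kernel, and the fixed `bandwidth` are all hypothetical choices made here.

```python
import math


def kde_density_image(values, size=32, bandwidth=0.05):
    """Render a feature sequence as a size x size density image (0..255 ints).

    Each element becomes a 2D point (relative position in the sequence,
    min-max normalized value); a product Gaussian kernel is summed on a grid.
    """
    if not values:
        raise ValueError("values must be non-empty")
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant sequences
    n = len(values)
    xs = [i / (n - 1) if n > 1 else 0.5 for i in range(n)]
    ys = [(v - lo) / span for v in values]

    # Grid cell centers in [0, 1], shared by both axes.
    centers = [(k + 0.5) / size for k in range(size)]

    def kernel_row(p):
        # Unnormalized 1D Gaussian kernel of point p at every cell center.
        return [math.exp(-0.5 * ((c - p) / bandwidth) ** 2) for c in centers]

    density = [[0.0] * size for _ in range(size)]
    for x, y in zip(xs, ys):
        wx, wy = kernel_row(x), kernel_row(y)
        for r in range(size):
            w = wy[size - 1 - r]  # flip so larger values appear higher up
            if w < 1e-12:
                continue  # negligible contribution to this row
            row = density[r]
            for c in range(size):
                row[c] += w * wx[c]

    # Scale so the densest cell maps to 255.
    peak = max(max(row) for row in density) or 1.0
    return [[round(255 * d / peak) for d in row] for row in density]


if __name__ == "__main__":
    # Hypothetical per-basic-block feature sequence, for illustration only.
    sequence = [3, 5, 4, 20, 22, 21, 4, 3, 5, 19]
    image = kde_density_image(sequence, size=16)
    for row in image:
        print(" ".join(f"{v:3d}" for v in row))
```

Unlike a raw byte-to-pixel grayscale mapping, a density image of this kind varies smoothly with the distribution of the feature values, so small local changes to individual elements move the image only slightly.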

Table of Contents

1 Introduction
1.1 Motivation
1.2 Problem Statements
1.2.1 Limitations of Dynamic Analysis: Intelligent Analysis Evasion Techniques
1.2.2 Limitations of Static Analysis: Code Obfuscation and Loss of Structural Information
1.3 Proposed Approach
1.3.1 A Behavioral Performance Visualization Framework for Countering Intelligent Evasion Techniques from Executable Malware
1.3.2 A Kernel Density Estimation-based Visualization Framework for Enhancing the Structural Information Representation in Executable Malware
1.4 Dissertation Outline
2 PerfSight: Ransomware Classification Framework Using the Behavioral Performance Visualization of Execution Objects
2.1 Introduction
2.2 Related Work
2.2.1 Visualization of a Malware File as an Image
2.2.2 Visualization of the Behavior of Malware as an Image
2.2.3 Malware Detection Using the Usage Patterns of System Resources
2.3 Behavioral Performance Visualization Method Using the Usage Patterns of System Resources
2.3.1 Overview of Behavioral Performance Visualization
2.3.2 Selection of the Ransomware Samples
2.3.3 Selection of Extracted Data and Extraction Method
2.3.4 Data Normalization
2.3.5 Behavioral Performance Visualization
2.3.5.1 Visualization Using Time-Series Graphs
2.3.5.2 Visualization Using Grayscale Images of Fixed Size
2.3.5.3 Visualization Using Fixed Size Color Images
2.3.5.4 Visualization with Small Extracted Data
2.4 Design and Implementation of the Classification Framework
2.4.1 Overview of the Framework
2.4.2 Data Extraction System
2.4.3 Data Visualization System
2.4.4 Classification Performance Measurement System
2.5 Performance Evaluation
2.5.1 Behavioral Performance Visualization Results
2.5.2 Classification Performance Results Using Deep Learning
2.6 Summary
3 BinSight: Enhancing Executable Binary Classification Accuracy through KDE-based Visualization
3.1 Introduction
3.2 Related Work
3.2.1 Byte-Based Local Feature Visualization: Grayscale, Entropy, Spectral
3.2.2 Structural Feature Extraction Based on Static Analysis
3.2.3 Graph-Based Structural Learning with GNNs
3.2.4 Input Optimization for CNN-Based Models
3.2.5 Differentiation of This Work
3.3 KDEBinViz: Proposed Method
3.3.1 Overview of KDEBinViz
3.3.2 Structural Feature Extraction from Executables
3.3.2.1 Byte-based Features
3.3.2.2 Basic Block-based Features
3.3.2.3 Sequence Construction and Ordering
3.3.2.4 Sequence Outlier Removal
3.3.2.5 Sequence Refinement and Normalization
3.3.3 KDE-Based Visualization
3.3.3.1 Overview of Kernel Density Estimation
3.3.3.2 Visualization Process and Image Construction
3.3.3.3 Visualization Examples and Interpretability
3.4 Experiments and Results
3.4.1 Experimental Environment and Default Settings
3.4.2 Experimental Datasets
3.4.2.1 Dataset Preparation
3.4.2.2 Composition of Experimental Datasets
3.4.2.3 Challenges in Cluster Construction
3.4.2.4 Sequence Extraction and Preprocessing
3.4.3 Experimental Design
3.4.3.1 Major Experiments
3.4.4 Experimental Details
3.4.4.1 Input Data Generation and Scaling
3.4.4.2 Model and Training Setup
3.4.4.3 Data Splitting and Leakage Prevention
3.4.4.4 Imbalance Handling and Overfitting Prevention
3.4.4.5 Evaluation Metrics
3.4.5 Experimental Results and Analysis
3.4.5.1 Summary of Results
3.4.5.2 Analysis of Experiment Results
3.5 Discussion
3.6 Summary
4 Conclusion
References
