
Korean Abstract

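Data-driven decision-making has recently become a core technology leading the data industry, and the machine learning techniques behind it require high-quality training data. Real-world data, however, contains missing values for a variety of reasons, which degrades the performance of models trained on it. Consequently, techniques that automatically impute the missing values in training data are being actively studied as a way to build high-performance models from real-world data. Existing machine-learning-based imputation techniques either apply only to numerical variables or build a separate prediction model for each variable, which makes them very cumbersome to use. This paper therefore proposes the Denoising Self-Attention Network (DSAN), a data imputation model applicable to data in which numerical and categorical variables are mixed. DSAN combines self-attention with a denoising technique to learn robust feature representation vectors, and uses multi-task learning to produce imputation models for multiple variables with missing values in parallel. To verify the effectiveness of the proposed model, imputation experiments are run on several mixed-type training datasets after values have been removed at random. The effectiveness of the proposed technique is demonstrated by comparing the error between the original and imputed values and the performance of binary classification models trained on the imputed data.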

Multilingual Abstract

Recently, data-driven decision-making has become a key technology leading the data industry, and the machine learning techniques that support it require high-quality training datasets. However, real-world data contains missing values for various reasons, which degrades the performance of prediction models learned from such training data. Therefore, in order to build high-performance models from real-world datasets, techniques for automatically imputing missing values in the initial training data have been actively studied. Many conventional machine learning-based imputation techniques involve time-consuming and cumbersome work because they apply only to numeric columns or build an individual predictive model for each column. This paper therefore proposes a new data imputation technique called ‘Denoising Self-Attention Network (DSAN)’, which can be applied to mixed-type datasets containing both numerical and categorical columns. DSAN learns robust feature representation vectors by combining self-attention and denoising techniques, and automatically imputes multiple variables with missing values in parallel through multi-task learning. To verify the validity of the proposed technique, data imputation experiments were performed after arbitrarily generating missing values in several mixed-type training datasets. We then show the validity of the proposed technique by comparing the errors between the original and imputed values, together with the performance of binary classification models trained on the imputed data.
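
The evaluation protocol described in the abstract (remove values at random, impute them, then measure the error against the original values) can be sketched as follows. This is only a minimal illustration under stated assumptions: the toy dataset, the function names (`mask_at_random`, `fit_baseline`, `impute`, `score`), the 20% missing rate, and the choice of RMSE and accuracy as error measures are all hypothetical, and the imputer is a simple mean/mode baseline standing in for DSAN, whose architecture is not specified in this record.

```python
import math
import random
from collections import Counter


def mask_at_random(rows, missing_rate, rng):
    """Copy rows, replacing each cell by None independently with probability missing_rate."""
    return [[None if rng.random() < missing_rate else v for v in row] for row in rows]


def fit_baseline(masked, numeric_cols):
    """Per-column fill value: mean for numeric columns, mode for categorical ones.

    NOTE: this is a stand-in baseline, not the DSAN model from the paper.
    """
    fills = []
    for j in range(len(masked[0])):
        observed = [row[j] for row in masked if row[j] is not None]
        if j in numeric_cols:
            # Fall back to 0.0 if the whole column happens to be masked.
            fills.append(sum(observed) / len(observed) if observed else 0.0)
        else:
            fills.append(Counter(observed).most_common(1)[0][0] if observed else None)
    return fills


def impute(masked, fills):
    """Fill every missing cell with its column's fill value; keep observed cells."""
    return [[fills[j] if v is None else v for j, v in enumerate(row)] for row in masked]


def score(original, masked, imputed, numeric_cols):
    """RMSE over masked numeric cells and accuracy over masked categorical cells."""
    sq_err, n_num, hits, n_cat = 0.0, 0, 0, 0
    for orig_row, mask_row, imp_row in zip(original, masked, imputed):
        for j, cell in enumerate(mask_row):
            if cell is not None:
                continue  # only cells that were removed are evaluated
            if j in numeric_cols:
                sq_err += (orig_row[j] - imp_row[j]) ** 2
                n_num += 1
            else:
                hits += int(orig_row[j] == imp_row[j])
                n_cat += 1
    rmse = math.sqrt(sq_err / n_num) if n_num else float("nan")
    acc = hits / n_cat if n_cat else float("nan")
    return rmse, acc


# Hypothetical mixed-type toy data: two numeric columns and one categorical column.
rng = random.Random(0)
original = [
    [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0), rng.choice(["a", "b", "c"])]
    for _ in range(200)
]
numeric_cols = {0, 1}

masked = mask_at_random(original, missing_rate=0.2, rng=rng)
imputed = impute(masked, fit_baseline(masked, numeric_cols))
rmse, acc = score(original, masked, imputed, numeric_cols)
print(f"numeric RMSE: {rmse:.3f}, categorical accuracy: {acc:.3f}")
```

Any other imputer can be swapped in for `fit_baseline` and `impute` and scored with the same `score` function. The abstract's second criterion, the performance of a binary classifier trained on the imputed data, would need a labeled dataset and is not covered by this sketch.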

Table of Contents

Summary
Abstract
Ⅰ. Introduction
Ⅱ. Related Work
Ⅲ. Denoising Self-Attention Network
Ⅳ. Experiments and Results
Ⅴ. Conclusion
References

References

1 X. Huang, "TabTransformer: Tabular Data Modeling Using Contextual Embeddings"

2 S. O. Arik, "TabNet: Attentive Interpretable Tabular Learning" 35 (35): 6679-6687, 2021

3 W. Lin, "Missing value imputation: a review and analysis of the literature (2006-2017)" 53 (53): 1487-1509, 2020

4 D. J. Stekhoven, "MissForest-non-parametric missing value imputation for mixed-type data" 28 (28): 112-118, 2012

5 L. Gondara, "MIDA: Multiple imputation using denoising autoencoders" Springer 260-272, 2018

6 D. B. Rubin, "Inference and missing data" 63 (63): 581-592, 1976

7 A. Nazabal, "Handling Incomplete Heterogeneous Data using VAEs" 107 : 2020

8 J. Yoon, "GAIN: Missing Data Imputation using Generative Adversarial Nets" 5689-5698, 2018

9 P. Vincent, "Extracting and Composing Robust Features with Denoising Autoencoders" 1096-1103, 2008

10 N. Abiri, "Establishing Strong Imputation Performance of a Denoising Autoencoder in a Wide Range of Missing Data Problems" 365: 137-146, 2019